Vetting/docs/architecture.md
josh 42da48864f
Remove operator auth — trust the LAN
Can't log in from a fresh LXC deploy, and the service is LAN-only by
design. Rip out the whole bcrypt-password / signed-cookie session
layer: internal/auth, login templates, gen-admin-password binary +
Makefile targets, auth config block, login/logout routes and the
RequireSession middleware wrap. Agent bearer-token auth on
/api/v1/runs/{id}/* is untouched.

Operators who want a password can front the service with a reverse
proxy — noted in README and docs/operations.md.
2026-04-17 22:31:49 -04:00


Architecture

A single Go binary runs the orchestrator. A second Go binary runs inside a custom Debian live image (built with mkosi) and becomes the per-run test agent. The two talk over HTTP + SSE.

Operator browser (HTMX + SSE, no login; LAN-trusted)
   │ HTTPS
   ▼
┌───────────────────────────────────────────────────────────────┐
│  Orchestrator LXC — single Go binary `vetting`                │
│                                                               │
│   UI (Templ) ─┬─ Agent API ─┬─ SSE hub                        │
│               │             │                                 │
│         Orchestrator core (state machine, dispatcher sem=3,   │
│         stage executors, WoL sender, token issuer)            │
│               │                                               │
│         ┌─────┴─────┬──────────┐                              │
│         ▼           ▼          ▼                              │
│     SQLite   flat-file logs   dnsmasq subprocess              │
│                                (DHCP+TFTP+HTTP, MAC allowlist)│
│                                                               │
│         Janitor goroutine (retention-based cleanup)           │
│         Notifier registry (ntfy/discord/smtp)                 │
└─────────────────────────────────────────┬─────────────────────┘
                                          │ LAN
                                          ▼
                               Host under test (×23)
                               PXE → iPXE → Linux live image
                                 └─ vetting-agent (HTTP+SSE back)

Packages

| Package | Purpose |
| --- | --- |
| cmd/vetting | Orchestrator entrypoint. Wires config, stores, runner, dispatcher, iperf supervisor, PXE supervisor, janitor, HTTP router. |
| cmd/vetting-agent | In-image agent entrypoint. Reads kernel cmdline params, starts the agent loop. |
| internal/config | YAML loader + types. |
| internal/db | SQLite open + embedded migrations. Pure Go via modernc.org/sqlite. |
| internal/model | Plain structs: Host, Run, Stage, Measurement, SpecDiff, Artifact. |
| internal/store | Repository layer; SQL is hand-written. |
| internal/orchestrator | State machine, dispatcher, per-run runner, WoL sender, HMAC run tokens, iperf supervisor. |
| internal/api | HTTP handlers: agent_handlers.go (the agent-facing API) and ui_handlers.go (HTMX fragments + SSE). |
| internal/httpserver | chi router assembly; lives here to avoid api ↔ orchestrator cyclic imports. |
| internal/web | Embedded static assets + compiled Templ templates. |
| internal/pxe | dnsmasq subprocess supervisor + per-MAC iPXE script generator. |
| internal/events | In-process SSE hub (fan-out to live browser clients). |
| internal/logs | Per-run flat-file writer + SSE fan-out of live log tail. |
| internal/spec | Expected-vs-actual diff engine with severity classification. |
| internal/notify | Pluggable notifier registry (ntfy, Discord webhook, SMTP). |
| internal/report | HTML + JSON report generation (html/template, self-contained). |
| internal/hold | Per-run SSH key issuance for FailedHolding. |
| internal/janitor | Retention-based cleanup of old artifact files + log files. |
| agent/ | In-image agent: claim loop, stage dispatch, heartbeat, log forwarder, thermal sidecar. |
| agent/probes | lshw, dmidecode, smartctl, lspci, hwmon, nvidia-smi wrappers. |
| agent/tests | Per-stage test implementations (SMART, CPUStress, Storage, Network, GPU, PSU). |
| live-image/ | mkosi config + postinst for the Debian live image. |
| deploy/ | systemd unit + example config + install.sh. |
| test/e2e/ | Build-tagged (-tags=e2e) QEMU + PXE full-stack test. |

State machine

Per-run state is the single source of truth; the UI is a pure projection of DB + event stream.

Registered → Queued → WaitingWoL → Booting → InventoryCheck
  → SpecValidate → SMART → CPUStress → Storage → Network
  → GPU → PSU → Reporting → Completed

any stage → Failed → FailedHolding → Released

Key points:

  • Transitions are table-driven (internal/orchestrator/statemachine.go). Each (state, event) → (next, action) is encoded once.
  • Orchestrator-owned stages resolve inside /result: SpecValidate and Reporting flip state forward as part of the preceding stage's result handler, so the agent never sees them as "its turn".
  • Stage rows persist before SSE fan-out — the UI can re-derive state by reading SQLite, and an SSE reconnect mid-run just fetches fresh tile fragments.
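
As a rough illustration of the table-driven idea, the sketch below encodes a handful of transitions in a map keyed by (state, event). The state names come from the pipeline above; the event names, the action labels, and the `Next` helper are hypothetical and are not taken from internal/orchestrator/statemachine.go.

```go
package main

import "fmt"

// State names follow the pipeline above. Event names and action labels are
// assumptions made for this sketch only.
type State string
type Event string

const (
	Queued         State = "Queued"
	WaitingWoL     State = "WaitingWoL"
	Booting        State = "Booting"
	InventoryCheck State = "InventoryCheck"
	SpecValidate   State = "SpecValidate"
	SMART          State = "SMART"
	Failed         State = "Failed"
)

type key struct {
	from State
	ev   Event
}

type transition struct {
	next   State
	action string // hypothetical label for the side effect to run
}

// Each (state, event) pair appears exactly once, so the table is the only
// place a transition can be defined.
var table = map[key]transition{
	{Queued, "slot_acquired"}:        {WaitingWoL, "send_wol"},
	{WaitingWoL, "agent_hello"}:      {Booting, "none"},
	{Booting, "agent_claimed"}:       {InventoryCheck, "hand_out_stages"},
	{InventoryCheck, "stage_passed"}: {SpecValidate, "diff_against_spec"},
	{InventoryCheck, "stage_failed"}: {Failed, "notify"},
	{SpecValidate, "spec_ok"}:        {SMART, "none"},
	{SpecValidate, "spec_mismatch"}:  {Failed, "notify"},
}

// Next looks up the transition; pairs missing from the table are rejected.
func Next(from State, ev Event) (State, string, error) {
	t, ok := table[key{from, ev}]
	if !ok {
		return from, "", fmt.Errorf("no transition from %s on %s", from, ev)
	}
	return t.next, t.action, nil
}

func main() {
	s := Queued
	events := []Event{"slot_acquired", "agent_hello", "agent_claimed", "stage_passed", "spec_ok"}
	for _, ev := range events {
		next, action, err := Next(s, ev)
		if err != nil {
			fmt.Println("rejected:", err)
			return
		}
		fmt.Printf("%s --%s--> %s (action: %s)\n", s, ev, next, action)
		s = next
	}
}
```

Keeping rejection in the lookup means an out-of-order event cannot move a run into a state the table does not define.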

Agent ↔ orchestrator protocol

GET  /ipxe/{MAC}                     → per-MAC iPXE script
POST /api/v1/runs/{id}/hello         → "I booted; here's my address"
POST /api/v1/runs/{id}/claim         → validate token, receive stage list
POST /api/v1/runs/{id}/heartbeat     → liveness ping; response carries cmd
POST /api/v1/runs/{id}/log           → batch of log lines
POST /api/v1/runs/{id}/sensor        → batch of measurements (thermals, throughput)
POST /api/v1/runs/{id}/result        → stage result; response says next_state
POST /api/v1/runs/{id}/hold          → on FailedHolding, receive authorized_key

Auth on every /api/v1/* call: the agent presents its run token as a bearer token. The orchestrator stores only a bcrypt hash of that token in runs.agent_token_hash and compares in constant time. The plaintext appears only in the kernel cmdline, which is delivered through the per-MAC iPXE script, and the MAC must already be in the dnsmasq allowlist, so nobody outside the trusted bridge can obtain or forge a token.
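
The check could be wired as a small middleware along the following lines. This is a sketch under stated assumptions: it swaps bcrypt for a SHA-256 digest so that it runs with the standard library alone, it uses an in-memory map in place of the runs table, and the names `requireAgentToken`, `runIDFromPath`, `demoHandler`, and `statusFor` are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// digest is a stdlib stand-in for the bcrypt hash kept in runs.agent_token_hash.
func digest(token string) [32]byte { return sha256.Sum256([]byte(token)) }

// runIDFromPath extracts {id} from /api/v1/runs/{id}/<action>.
func runIDFromPath(p string) (string, bool) {
	const prefix = "/api/v1/runs/"
	if !strings.HasPrefix(p, prefix) {
		return "", false
	}
	parts := strings.SplitN(strings.TrimPrefix(p, prefix), "/", 2)
	if len(parts) != 2 || parts[0] == "" {
		return "", false
	}
	return parts[0], true
}

// requireAgentToken rejects requests whose bearer token does not match the
// digest stored for the run. hashes is an in-memory stand-in for the runs table.
func requireAgentToken(hashes map[string][32]byte, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id, ok := runIDFromPath(r.URL.Path)
		auth := r.Header.Get("Authorization")
		if !ok || !strings.HasPrefix(auth, "Bearer ") {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		got := digest(strings.TrimPrefix(auth, "Bearer "))
		want, known := hashes[id]
		// ConstantTimeCompare avoids leaking how many leading bytes matched.
		if !known || subtle.ConstantTimeCompare(got[:], want[:]) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// demoHandler protects a trivial handler for run 42 with the token "s3cret".
func demoHandler() http.Handler {
	hashes := map[string][32]byte{"42": digest("s3cret")}
	inner := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "stage list")
	})
	return requireAgentToken(hashes, inner)
}

// statusFor sends one POST with the given bearer token and returns the status.
func statusFor(h http.Handler, path, token string) int {
	req := httptest.NewRequest(http.MethodPost, path, nil)
	req.Header.Set("Authorization", "Bearer "+token)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := demoHandler()
	fmt.Println("valid token:", statusFor(h, "/api/v1/runs/42/claim", "s3cret"))
	fmt.Println("wrong token:", statusFor(h, "/api/v1/runs/42/claim", "guess"))
	fmt.Println("wrong run:  ", statusFor(h, "/api/v1/runs/7/claim", "s3cret"))
}
```

With a real bcrypt hash the comparison step would go through the bcrypt library's own verify call instead of `subtle.ConstantTimeCompare`; the surrounding middleware shape stays the same.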

Heartbeat control channel

The heartbeat response carries a cmd field the agent acts on:

| cmd | When fired | Agent action |
| --- | --- | --- |
| continue | Normal case | No-op; keep running current stage |
| shutdown | Run reached Completed | systemctl poweroff |
| abort | Run in FailedHolding or Released | Stop heartbeat loop; let the operator drive |
| retry_stage | Operator pressed "Override wipe" | Re-enter the named stage with override_flags armed |
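
On the agent side, handling this field might look roughly like the sketch below. The JSON shape is an assumption: the document only states that the response carries a cmd field, so the `stage` key and the `decide` helper are hypothetical, and the returned strings merely describe the action instead of performing it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// heartbeatResponse models the heartbeat reply. Only "cmd" is named in the
// document; "stage" is a hypothetical field carrying the stage to retry.
type heartbeatResponse struct {
	Cmd   string `json:"cmd"`
	Stage string `json:"stage,omitempty"`
}

// decide maps a heartbeat reply to a description of what the agent would do.
// A real agent would execute the action instead of returning text.
func decide(body []byte) (string, error) {
	var resp heartbeatResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", fmt.Errorf("decode heartbeat reply: %w", err)
	}
	switch resp.Cmd {
	case "continue":
		return "keep running current stage", nil
	case "shutdown":
		return "systemctl poweroff", nil
	case "abort":
		return "stop heartbeat loop", nil
	case "retry_stage":
		if resp.Stage == "" {
			return "", fmt.Errorf("retry_stage without a stage name")
		}
		return "re-enter " + resp.Stage + " with override flags", nil
	default:
		// Unknown commands are surfaced, not ignored, so protocol drift is visible.
		return "", fmt.Errorf("unknown cmd %q", resp.Cmd)
	}
}

func main() {
	replies := []string{
		`{"cmd":"continue"}`,
		`{"cmd":"retry_stage","stage":"Storage"}`,
		`{"cmd":"reboot"}`,
	}
	for _, r := range replies {
		action, err := decide([]byte(r))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("action:", action)
	}
}
```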

Safety: destructive disk tests

Four layered gates:

  1. MAC allowlist — dnsmasq only answers DHCP for registered MACs.
  2. Signed run token — orchestrator issues a per-run HMAC token in the iPXE kernel cmdline; the agent submits it on /claim and the orchestrator verifies before handing back the stage list.
  3. Wipe probe — before badblocks, the agent scans for filesystem signatures / LVM metadata / partition tables. Anything found → FailedHolding on Storage. The operator explicitly clicks Override wipe-probe to proceed.
  4. Device allowlist — the agent only targets block devices matching the inventory's expected_disks. USB sticks and surprise disks are skipped.
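
Gate 2 can be pictured with a short HMAC sketch. The payload layout (run ID and MAC joined by `|`), the hex encoding, and the function names are assumptions for illustration; the document does not specify what the orchestrator actually signs.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// mac computes HMAC-SHA256 over a hypothetical "runID|macAddr" payload.
func mac(secret []byte, runID, macAddr string) []byte {
	m := hmac.New(sha256.New, secret)
	m.Write([]byte(runID + "|" + macAddr))
	return m.Sum(nil)
}

// issueToken returns the hex token that would be placed on the kernel cmdline.
func issueToken(secret []byte, runID, macAddr string) string {
	return hex.EncodeToString(mac(secret, runID, macAddr))
}

// verifyToken recomputes the HMAC and compares it with hmac.Equal, which is
// constant-time with respect to the contents being compared.
func verifyToken(secret []byte, runID, macAddr, token string) bool {
	got, err := hex.DecodeString(token)
	if err != nil {
		return false
	}
	return hmac.Equal(got, mac(secret, runID, macAddr))
}

func main() {
	secret := []byte("orchestrator-secret") // placeholder; a real secret comes from config
	tok := issueToken(secret, "42", "aa:bb:cc:dd:ee:ff")
	fmt.Println("same run and MAC:", verifyToken(secret, "42", "aa:bb:cc:dd:ee:ff", tok))
	fmt.Println("different run:   ", verifyToken(secret, "43", "aa:bb:cc:dd:ee:ff", tok))
	fmt.Println("different MAC:   ", verifyToken(secret, "42", "11:22:33:44:55:66", tok))
}
```

Binding the MAC into the signed payload is one way to tie gate 2 to gate 1: a token lifted from one host's cmdline would not verify for another host's run.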

Notifications

Fire-and-forget. The orchestrator fires four event kinds:

| Kind | Severity | When |
| --- | --- | --- |
| StageFailed | critical | Any stage returns passed=false |
| SpecMismatch | critical | SpecValidate finds critical diffs |
| HoldingOpened | critical | Agent POSTs /hold (operator can SSH in) |
| RunCompleted | info | Pipeline reaches Completed |

The config maps event kinds and severities to one or more notifiers (ntfy, Discord webhook, SMTP). Each notifier gets one attempt per event with a 10s timeout; delivery failures are logged, nothing is persisted.

Why a separate notify package?

Keeps the /result and /hold handlers non-blocking. Each dispatch starts a goroutine per target; a slow ntfy server doesn't back up an SMTP notifier or delay the HTTP response to the agent.
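
The goroutine-per-target pattern can be sketched as follows. The `Notifier` interface, `Dispatch` signature, and `fakeNotifier` type are hypothetical and not the actual internal/notify API; the demo uses a 50 ms timeout in place of the 10s mentioned above so that it finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Event and Notifier are hypothetical shapes for this sketch.
type Event struct {
	Kind, Severity string
}

type Notifier interface {
	Name() string
	Send(ctx context.Context, ev Event) error
}

// Dispatch starts one goroutine per target and returns immediately. Each
// target gets a single attempt bounded by timeout; failures are only logged.
// The returned WaitGroup exists so tests can wait; a handler would ignore it.
func Dispatch(targets []Notifier, ev Event, timeout time.Duration, logf func(string, ...any)) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	for _, n := range targets {
		wg.Add(1)
		go func(n Notifier) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), timeout)
			defer cancel()
			if err := n.Send(ctx, ev); err != nil {
				logf("notify %s failed: %v", n.Name(), err)
			}
		}(n)
	}
	return wg
}

// fakeNotifier simulates a backend that takes delay to respond.
type fakeNotifier struct {
	name  string
	delay time.Duration
}

func (f fakeNotifier) Name() string { return f.name }

func (f fakeNotifier) Send(ctx context.Context, ev Event) error {
	select {
	case <-time.After(f.delay):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo sends one event to a slow and a fast target and collects logged failures.
func demo() []string {
	var mu sync.Mutex
	var failures []string
	logf := func(format string, args ...any) {
		mu.Lock()
		defer mu.Unlock()
		failures = append(failures, fmt.Sprintf(format, args...))
	}
	targets := []Notifier{
		fakeNotifier{name: "slow-ntfy", delay: time.Second},
		fakeNotifier{name: "fast-smtp", delay: 0},
	}
	ev := Event{Kind: "StageFailed", Severity: "critical"}
	Dispatch(targets, ev, 50*time.Millisecond, logf).Wait()
	return failures
}

func main() {
	for _, line := range demo() {
		fmt.Println(line)
	}
}
```

Because every target has its own goroutine and its own context, the slow target times out on its own schedule while the fast one completes unaffected.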

Data retention

The janitor goroutine (internal/janitor) runs a sweep every janitor.interval_minutes (default 60) and deletes:

  • artifact files older than artifacts.retention_days, plus their artifacts table rows
  • log files older than logs.retention_days

Rows in runs, hosts, stages, measurements, and spec_diffs are never deleted by the janitor, so host histories and aggregate metrics survive cleanups.
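
A single file sweep could look roughly like this. It is a sketch only: the `sweep` and `demoSweep` names are hypothetical, file age is judged by modification time (an assumption), and the deletion of the matching artifacts table rows is left out.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// sweep removes regular files under root whose modification time is older
// than the retention window, and reports how many were removed.
func sweep(root string, retention time.Duration, now time.Time) (int, error) {
	cutoff := now.Add(-retention)
	removed := 0
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.Type().IsRegular() {
			return nil // skip directories, symlinks, and other non-regular entries
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if info.ModTime().Before(cutoff) {
			if err := os.Remove(path); err != nil {
				return err
			}
			removed++
		}
		return nil
	})
	return removed, err
}

// demoSweep builds a temp dir with one stale and one fresh log file, runs a
// 30-day sweep, and reports the removal count and whether the fresh file remains.
func demoSweep() (int, bool, error) {
	dir, err := os.MkdirTemp("", "janitor-demo")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(dir)

	oldPath := filepath.Join(dir, "run-1.log")
	newPath := filepath.Join(dir, "run-2.log")
	for _, p := range []string{oldPath, newPath} {
		if err := os.WriteFile(p, []byte("log line\n"), 0o644); err != nil {
			return 0, false, err
		}
	}
	now := time.Now()
	stale := now.Add(-40 * 24 * time.Hour)
	if err := os.Chtimes(oldPath, stale, stale); err != nil {
		return 0, false, err
	}
	removed, err := sweep(dir, 30*24*time.Hour, now)
	if err != nil {
		return 0, false, err
	}
	_, statErr := os.Stat(newPath)
	return removed, statErr == nil, nil
}

func main() {
	removed, kept, err := demoSweep()
	if err != nil {
		fmt.Println("sweep failed:", err)
		return
	}
	fmt.Printf("removed=%d fresh file kept=%v\n", removed, kept)
}
```

A long-running janitor would call such a sweep from a `time.Ticker` loop set to the configured interval.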

Reproducible builds

The orchestrator and agent are pure Go; make orchestrator-linux cross-compiles to linux-amd64 from Windows or macOS.

The live image requires Linux-side tooling (mkosi, debootstrap, squashfs-tools), so make live-image fails loudly on Windows and redirects to wsl make live-image. Pinning to snapshot.debian.org in live-image/mkosi.conf keeps the image contents stable over time for a given git SHA.