Architecture
A single Go binary runs the orchestrator. A second Go binary runs inside a custom Debian live image (built with mkosi) and becomes the per-run test agent. The two talk over HTTP + SSE.
    Operator browser (HTMX + SSE, admin login)
                        │ HTTPS
                        ▼
┌───────────────────────────────────────────────────────────────┐
│  Orchestrator LXC - single Go binary `vetting`                │
│                                                               │
│  UI (Templ) ─┬─ Agent API ─┬─ SSE hub                         │
│              │             │                                  │
│  Orchestrator core (state machine, dispatcher sem=3,          │
│    stage executors, WoL sender, token issuer)                 │
│              │                                                │
│        ┌─────┴─────┬──────────┐                               │
│        ▼           ▼          ▼                               │
│     SQLite  flat-file logs  dnsmasq subprocess                │
│                             (DHCP+TFTP+HTTP, MAC allowlist)   │
│                                                               │
│  Janitor goroutine (retention-based cleanup)                  │
│  Notifier registry (ntfy/discord/smtp)                        │
└─────────────────────────────────────────┬─────────────────────┘
                                          │ LAN
                                          ▼
                                Host under test (×2–3)
                                PXE → iPXE → Linux live image
                                  └─ vetting-agent (HTTP+SSE back)
Packages
| Package | Purpose |
|---|---|
| `cmd/vetting` | Orchestrator entrypoint. Wires config, stores, runner, dispatcher, iperf supervisor, PXE supervisor, janitor, HTTP router. |
| `cmd/vetting-agent` | In-image agent entrypoint. Reads kernel cmdline params, starts the agent loop. |
| `internal/config` | YAML loader + types. `ProfileRegistry` holds the quick/deep/soak profile definitions, threshold defaults, and per-stage probe knobs. |
| `internal/db` | SQLite open + embedded migrations. Pure Go via `modernc.org/sqlite`. |
| `internal/model` | Plain structs: `Host`, `Run`, `Stage`, `Measurement`, `SpecDiff`, `Artifact`. |
| `internal/store` | Repository layer; SQL is hand-written (no ORM). Stores for hosts, runs, stages, sub-steps, artifacts, spec diffs, measurements, thresholds, firmware. |
| `internal/orchestrator` | State machine, dispatcher, per-run runner, WoL sender, HMAC run tokens, iperf supervisor. |
| `internal/api` | HTTP handlers: `agent_handlers.go` (the agent-facing API) and `ui_handlers.go` (HTMX fragments + SSE). |
| `internal/httpserver` | chi router assembly; lives here to avoid `api` ↔ `orchestrator` cyclic imports. |
| `internal/web` | Embedded static assets + compiled Templ templates. |
| `internal/pxe` | dnsmasq subprocess supervisor + per-MAC iPXE script generator. |
| `internal/events` | In-process SSE hub (fan-out to live browser clients). |
| `internal/logs` | Per-run flat-file writer + SSE fan-out of live log tail. |
| `internal/spec` | Expected-vs-actual diff engine with severity classification. |
| `internal/notify` | Pluggable notifier registry (ntfy, Discord webhook, SMTP). |
| `internal/report` | HTML + JSON report generation (`html/template`, self-contained). |
| `internal/hold` | Per-run SSH key issuance for `FailedHolding`. |
| `internal/janitor` | Retention-based cleanup of old artifact files + log files. |
| `agent/` | In-image agent: claim loop, stage dispatch, heartbeat, log forwarder, thermal sidecar. |
| `agent/probes` | lshw, dmidecode, smartctl, lspci, hwmon, nvidia-smi wrappers. |
| `agent/tests` | Per-stage test implementations (SMART, CPUStress, Storage, Network, GPU, PSU). |
| `live-image/` | mkosi config + postinst for the Debian live image. |
| `deploy/` | systemd unit + example config + `install.sh`. |
| `test/e2e/` | Build-tagged (`-tags=e2e`) QEMU + PXE full-stack test. |
State machine
Per-run state is the single source of truth; the UI is a pure projection of DB + event stream.
Registered → Queued → WaitingWoL / WaitingReboot → Booting
→ InventoryCheck → Firmware → SpecValidate → SMART
→ CPUStress → Storage → Network → Burn → GPU → PSU
→ Reporting → Completed
any stage → Failed → FailedHolding → Released
any active state → Cancelled
Key points:
- Transitions are table-driven (`internal/orchestrator/statemachine.go`). Each `(state, event) → (next, action)` is encoded once.
- Orchestrator-owned stages resolve inside `/result`: `SpecValidate` and `Reporting` flip state forward as part of the preceding stage's result handler, so the agent never sees them as "its turn".
- Stage rows persist before SSE fan-out, so the UI can re-derive state by reading SQLite, and an SSE reconnect mid-run just fetches fresh tile fragments.
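A table-driven transition lookup of the kind described in the first point can be sketched as follows. This is a minimal illustration, not the contents of `statemachine.go`: the event names (`wol_sent`, `agent_hello`, and so on) and the action labels are hypothetical, and only the state names are taken from the pipeline above.

```go
package main

import "fmt"

// State names come from the pipeline above; Event names are hypothetical.
type State string
type Event string

type key struct {
	s State
	e Event
}

type transition struct {
	next   State
	action string // hypothetical label; real code would more likely hold a func
}

// table encodes each (state, event) -> (next, action) pair exactly once.
var table = map[key]transition{
	{"Queued", "wol_sent"}:         {"WaitingWoL", "send_wol"},
	{"WaitingWoL", "agent_hello"}:  {"Booting", "record_address"},
	{"Booting", "agent_claimed"}:   {"InventoryCheck", "issue_stage_list"},
	{"InventoryCheck", "stage_ok"}: {"Firmware", "persist_stage"},
}

// Next looks up a transition; unknown pairs are rejected rather than guessed.
func Next(s State, e Event) (State, string, error) {
	t, ok := table[key{s, e}]
	if !ok {
		return s, "", fmt.Errorf("no transition from %s on %s", s, e)
	}
	return t.next, t.action, nil
}

func main() {
	next, action, err := Next("Queued", "wol_sent")
	fmt.Println(next, action, err)

	_, _, err = Next("Completed", "wol_sent")
	fmt.Println(err)
}
```

Keeping every legal pair in one map means an illegal transition surfaces as a lookup miss instead of a forgotten `if` branch.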
Agent ↔ orchestrator protocol
GET /ipxe/{MAC} → per-MAC iPXE script
POST /api/v1/runs/{id}/hello → "I booted; here's my address"
POST /api/v1/runs/{id}/claim → validate token, receive stage list
POST /api/v1/runs/{id}/heartbeat → liveness ping; response carries cmd
POST /api/v1/runs/{id}/log → batch of log lines
POST /api/v1/runs/{id}/sensor → batch of measurements (thermals, throughput)
POST /api/v1/runs/{id}/result → stage result; response says next_state
POST /api/v1/runs/{id}/hold → on FailedHolding, receive authorized_key
See api-reference.md for full request/response schemas and SSE event types.
Auth on every /api/v1/runs/* call: the orchestrator stores only a bcrypt
hash of the bearer token in runs.agent_token_hash and checks the
presented token against it in constant time. The plaintext exists only
in the kernel cmdline, so nobody off the trusted bridge can obtain it:
the iPXE script is issued per-MAC and the MAC must already be in the
dnsmasq allowlist.
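As a rough illustration of per-run token issuance and constant-time verification, the sketch below uses only the Go standard library. It is an assumption-laden simplification: the message layout (just the run ID), the hex encoding, and the function names are invented here, and the bcrypt storage step described above is left out because it needs a non-stdlib package.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// IssueToken derives a per-run token as HMAC-SHA256(secret, runID).
// Signing only the run ID is an assumption made for this sketch.
func IssueToken(secret []byte, runID string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(runID))
	return hex.EncodeToString(mac.Sum(nil))
}

// VerifyToken recomputes the expected token and compares it to the
// presented one in constant time via hmac.Equal.
func VerifyToken(secret []byte, runID, presented string) bool {
	expected := IssueToken(secret, runID)
	return hmac.Equal([]byte(expected), []byte(presented))
}

func main() {
	secret := []byte("example-secret")
	tok := IssueToken(secret, "run-42")
	fmt.Println(VerifyToken(secret, "run-42", tok)) // same run: accepted
	fmt.Println(VerifyToken(secret, "run-43", tok)) // different run: rejected
}
```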
Heartbeat control channel
The heartbeat response carries a cmd field the agent acts on:
| cmd | When fired | Agent action |
|---|---|---|
| `continue` | Normal case | No-op; keep running current stage |
| `shutdown` | Run reached `Completed` | `systemctl poweroff` |
| `abort` | Run in `FailedHolding` or `Released` | Stop heartbeat loop; let the operator drive |
| `retry_stage` | Operator pressed "Override wipe" | Re-enter the named stage with `override_flags` armed |
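The table can be read as a small decision function on the orchestrator side. The sketch below is one possible shape, not the actual handler: `CmdFor` and its `retryStage` parameter are hypothetical, and letting a pending retry take priority over `abort` is an assumption (made because an override is pressed while the run is holding).

```go
package main

import "fmt"

// CmdFor picks the heartbeat command for a run, following the table above.
// retryStage is a hypothetical input: the stage an operator asked to re-run,
// or "" when no override is pending.
func CmdFor(state, retryStage string) string {
	switch {
	case retryStage != "":
		return "retry_stage" // assumed to win over abort
	case state == "Completed":
		return "shutdown"
	case state == "FailedHolding" || state == "Released":
		return "abort"
	default:
		return "continue"
	}
}

func main() {
	for _, s := range []string{"CPUStress", "Completed", "FailedHolding"} {
		fmt.Printf("%s -> %s\n", s, CmdFor(s, ""))
	}
	fmt.Println("override ->", CmdFor("FailedHolding", "Storage"))
}
```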
Safety: destructive disk tests
Four layered gates:
- MAC allowlist: dnsmasq only answers DHCP for registered MACs.
- Signed run token: the orchestrator issues a per-run HMAC token in the iPXE kernel cmdline; the agent submits it on `/claim` and the orchestrator verifies it before handing back the stage list.
- Wipe probe: before `badblocks`, the agent scans for filesystem signatures / LVM metadata / partition tables. Anything found → `FailedHolding` on `Storage`. The operator must explicitly click Override wipe-probe to proceed.
- Device allowlist: the agent only targets block devices matching the inventory's `expected_disks`. USB sticks and surprise disks are skipped.
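The last gate, the device allowlist, might look roughly like the filter below. The source does not say which attribute is matched against `expected_disks`; matching on a serial number, along with the `Disk` type and `FilterTargets` name, is an assumption for illustration only.

```go
package main

import "fmt"

// Disk is a simplified view of a block device. Matching on Serial is an
// assumption; the document only says devices must match expected_disks.
type Disk struct {
	Path   string
	Serial string
}

// FilterTargets splits discovered disks into wipe targets and skipped devices.
func FilterTargets(found []Disk, expectedSerials []string) (targets, skipped []Disk) {
	allowed := make(map[string]bool, len(expectedSerials))
	for _, s := range expectedSerials {
		allowed[s] = true
	}
	for _, d := range found {
		if allowed[d.Serial] {
			targets = append(targets, d)
		} else {
			skipped = append(skipped, d) // e.g. a USB stick or an unlisted disk
		}
	}
	return targets, skipped
}

func main() {
	found := []Disk{{"/dev/sda", "S1"}, {"/dev/sdb", "USB9"}}
	targets, skipped := FilterTargets(found, []string{"S1"})
	fmt.Println("targets:", targets)
	fmt.Println("skipped:", skipped)
}
```

Defaulting to "skip" for anything not explicitly listed keeps a destructive test from touching a disk the inventory never mentioned.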
Notifications
Fire-and-forget. The orchestrator fires four event kinds:
| Kind | Severity | When |
|---|---|---|
| `StageFailed` | critical | Any stage returns `passed=false` |
| `SpecMismatch` | critical | `SpecValidate` finds critical diffs |
| `HoldingOpened` | critical | Agent POSTs `/hold` (operator can SSH in) |
| `RunCompleted` | info | Pipeline reaches `Completed` |
The config maps event kinds and severities to one or more notifiers (ntfy, Discord webhook, SMTP). Each notifier gets one attempt per event with a 10s timeout; delivery failures are logged, nothing is persisted.
Why a separate notify package?
Keeps the /result and /hold handlers non-blocking. Each dispatch
starts a goroutine per target; a slow ntfy server doesn't back up an
SMTP notifier or delay the HTTP response to the agent.
Data retention
The janitor goroutine (`internal/janitor`) runs a sweep every
`janitor.interval_minutes` (default 60) and deletes:
- artifact files older than `artifacts.retention_days`, plus their `artifacts` table rows
- log files older than `logs.retention_days`

Rows in `runs`, `hosts`, `stages`, `measurements`, and `spec_diffs` are
never deleted by the janitor, so host histories and aggregate
metrics survive cleanups.
Threshold engine
Every /sensor batch is evaluated against rules seeded per-run at
creation time from the ProfileRegistry + per-host overrides. Rules
are immutable for the life of a run — a late config edit can't
retroactively pass or fail an in-flight run.
Operators: lt, lte, gt, gte, within_pct. Key matching is
glob-ish: * matches all keys, cpu/* matches any key starting with
cpu/, exact strings for specific keys. Stage matching works the same
way (* for global, exact name for stage-specific).
Severity drives the action:
- critical: fail the run immediately. The current stage is marked failed, the run enters `FailedHolding`, and a `StageFailed` notification fires.
- warning: record the breach for the report. The stage continues.

Every evaluation (pass or fail) is persisted as a
`threshold_evaluations` row so the report can render per-sample
verdict badges. See configuration.md § thresholds
for the config-level reference.
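To make the matching and operator semantics concrete, here is a minimal evaluation sketch. Several details are assumptions rather than documented behaviour: the `Rule` and `Verdict` types, the reading of each operator as the condition a healthy sample must satisfy, the `Pct` field, and the interpretation of `within_pct` as "within Pct percent of Value". The metric keys are made up.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Rule is a hypothetical shape for a seeded threshold rule.
type Rule struct {
	Stage    string  // "*" or an exact stage name
	Key      string  // "*", a "prefix/*" pattern, or an exact key
	Op       string  // lt, lte, gt, gte, within_pct
	Value    float64 // bound, or reference value for within_pct
	Pct      float64 // tolerance in percent, used only by within_pct (assumed)
	Severity string  // "critical" or "warning"
}

type Verdict struct {
	Rule   Rule
	Passed bool
}

// matches implements the glob-ish matching: "*", "prefix/*", or exact.
func matches(pattern, s string) bool {
	if pattern == "*" {
		return true
	}
	if strings.HasSuffix(pattern, "/*") {
		return strings.HasPrefix(s, strings.TrimSuffix(pattern, "*"))
	}
	return pattern == s
}

// passes treats the operator as the condition a healthy sample satisfies.
func passes(r Rule, sample float64) bool {
	switch r.Op {
	case "lt":
		return sample < r.Value
	case "lte":
		return sample <= r.Value
	case "gt":
		return sample > r.Value
	case "gte":
		return sample >= r.Value
	case "within_pct":
		return math.Abs(sample-r.Value) <= math.Abs(r.Value)*r.Pct/100
	}
	return false // unknown operators fail closed
}

// Evaluate returns one verdict per rule that applies to (stage, key).
func Evaluate(rules []Rule, stage, key string, sample float64) []Verdict {
	var out []Verdict
	for _, r := range rules {
		if matches(r.Stage, stage) && matches(r.Key, key) {
			out = append(out, Verdict{Rule: r, Passed: passes(r, sample)})
		}
	}
	return out
}

func main() {
	rules := []Rule{
		{Stage: "*", Key: "cpu/*", Op: "lt", Value: 90, Severity: "critical"},
		{Stage: "Network", Key: "throughput_mbps", Op: "within_pct", Value: 940, Pct: 10, Severity: "warning"},
	}
	for _, v := range Evaluate(rules, "CPUStress", "cpu/temp_c", 95) {
		fmt.Printf("%s %s passed=%v\n", v.Rule.Severity, v.Rule.Key, v.Passed)
	}
}
```

Returning a verdict for passing samples as well as failing ones mirrors the point above that every evaluation is recorded, not only breaches.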
Host-mode agent
The vetting-agent host binary runs as a systemd service on
installed hosts. It heartbeats to POST /api/v1/hosts/{mac}/heartbeat
every 30 s so the dashboard shows online/offline status.
The quick-register one-liner (GET /register/quick.sh) downloads the
agent binary from /assets/vetting-agent-linux-amd64, installs it as
a systemd service, and auto-POSTs to POST /api/v1/hosts to register
the host — no manual MAC entry needed.
When the operator clicks Start Vetting, the orchestrator's
dispatcher sets cmd=reboot_for_vetting on the next heartbeat
response. The host-mode agent reboots the host, which PXE-boots into
the live image and enters the normal vetting flow.
Host API
These endpoints are LAN-trusted (no bearer token) and share the same threat model as the browser UI:
POST /api/v1/hosts → JSON host registration (quick-register)
POST /api/v1/hosts/{mac}/heartbeat → host-mode liveness + command channel
Reproducible builds
The orchestrator and agent are pure Go; make orchestrator-linux
cross-compiles to linux-amd64 from Windows or macOS.
The live image requires Linux-side tooling (mkosi, debootstrap,
squashfs-tools) so make live-image fails loudly on Windows and
redirects to wsl make live-image. Pinning to snapshot.debian.org in
live-image/mkosi.conf keeps image bits stable across time for a
given git SHA.