Vetting/docs/architecture.md
josh 8367ec2a9f
docs: comprehensive documentation expansion
Add 4 new doc files (configuration reference, development guide, API
reference with full request/response schemas, database schema), expand
the README with a feature list and how-it-works walkthrough, fix
missing Firmware and Burn stages in architecture.md and test-suite.md,
add threshold engine and host-mode agent sections, and add godoc
comments to 11 packages and 6 model types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-23 18:37:26 -04:00

Architecture

A single Go binary runs the orchestrator. A second Go binary runs inside a custom Debian live image (built with mkosi) and becomes the per-run test agent. The two talk over HTTP + SSE.

Operator browser (HTMX + SSE, admin login)
   │ HTTPS
   ▼
┌───────────────────────────────────────────────────────────────┐
│  Orchestrator LXC — single Go binary `vetting`                │
│                                                               │
│   UI (Templ) ─┬─ Agent API ─┬─ SSE hub                        │
│               │             │                                 │
│         Orchestrator core (state machine, dispatcher sem=3,   │
│         stage executors, WoL sender, token issuer)            │
│               │                                               │
│         ┌─────┴─────┬──────────┐                              │
│         ▼           ▼          ▼                              │
│     SQLite   flat-file logs   dnsmasq subprocess              │
│                                (DHCP+TFTP+HTTP, MAC allowlist)│
│                                                               │
│         Janitor goroutine (retention-based cleanup)           │
│         Notifier registry (ntfy/discord/smtp)                 │
└─────────────────────────────────────────┬─────────────────────┘
                                          │ LAN
                                          ▼
                               Host under test (×23)
                               PXE → iPXE → Linux live image
                                 └─ vetting-agent (HTTP+SSE back)

Packages

| Package | Purpose |
| --- | --- |
| cmd/vetting | Orchestrator entrypoint. Wires config, stores, runner, dispatcher, iperf supervisor, PXE supervisor, janitor, HTTP router. |
| cmd/vetting-agent | In-image agent entrypoint. Reads kernel cmdline params, starts the agent loop. |
| internal/config | YAML loader + types. ProfileRegistry holds the quick/deep/soak profile definitions, threshold defaults, and per-stage probe knobs. |
| internal/db | SQLite open + embedded migrations. Pure Go via modernc.org/sqlite. |
| internal/model | Plain structs: Host, Run, Stage, Measurement, SpecDiff, Artifact. |
| internal/store | Repository layer; SQL is hand-written (no ORM). Stores for hosts, runs, stages, sub-steps, artifacts, spec diffs, measurements, thresholds, firmware. |
| internal/orchestrator | State machine, dispatcher, per-run runner, WoL sender, HMAC run tokens, iperf supervisor. |
| internal/api | HTTP handlers: agent_handlers.go (the agent-facing API) and ui_handlers.go (HTMX fragments + SSE). |
| internal/httpserver | chi router assembly; lives here to avoid api ↔ orchestrator cyclic imports. |
| internal/web | Embedded static assets + compiled Templ templates. |
| internal/pxe | dnsmasq subprocess supervisor + per-MAC iPXE script generator. |
| internal/events | In-process SSE hub (fan-out to live browser clients). |
| internal/logs | Per-run flat-file writer + SSE fan-out of live log tail. |
| internal/spec | Expected-vs-actual diff engine with severity classification. |
| internal/notify | Pluggable notifier registry (ntfy, Discord webhook, SMTP). |
| internal/report | HTML + JSON report generation (html/template, self-contained). |
| internal/hold | Per-run SSH key issuance for FailedHolding. |
| internal/janitor | Retention-based cleanup of old artifact files + log files. |
| agent/ | In-image agent: claim loop, stage dispatch, heartbeat, log forwarder, thermal sidecar. |
| agent/probes | lshw, dmidecode, smartctl, lspci, hwmon, nvidia-smi wrappers. |
| agent/tests | Per-stage test implementations (SMART, CPUStress, Storage, Network, GPU, PSU). |
| live-image/ | mkosi config + postinst for the Debian live image. |
| deploy/ | systemd unit + example config + install.sh. |
| test/e2e/ | Build-tagged (-tags=e2e) QEMU + PXE full-stack test. |

State machine

Per-run state is the single source of truth; the UI is a pure projection of DB + event stream.

Registered → Queued → WaitingWoL / WaitingReboot → Booting
  → InventoryCheck → Firmware → SpecValidate → SMART
  → CPUStress → Storage → Network → Burn → GPU → PSU
  → Reporting → Completed

any stage → Failed → FailedHolding → Released
any active state → Cancelled

Key points:

  • Transitions are table-driven (internal/orchestrator/statemachine.go). Each (state, event) → (next, action) is encoded once.
  • Orchestrator-owned stages resolve inside /result: SpecValidate and Reporting flip state forward as part of the preceding stage's result handler, so the agent never sees them as "its turn".
  • Stage rows persist before SSE fan-out — the UI can re-derive state by reading SQLite, and an SSE reconnect mid-run just fetches fresh tile fragments.
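
As a rough illustration of the table-driven approach described in the first bullet, the sketch below encodes a few (state, event) → (next, action) entries in a single map. It is not taken from internal/orchestrator/statemachine.go: the state names come from the diagram above, but the event names, action names, and the `Next` helper are hypothetical.

```go
package main

import "fmt"

// State names mirror the diagram above. Event and action names are
// hypothetical and exist only for this sketch.
type State string
type Event string

const (
	Queued         State = "Queued"
	WaitingWoL     State = "WaitingWoL"
	Booting        State = "Booting"
	InventoryCheck State = "InventoryCheck"
	Failed         State = "Failed"
)

const (
	EvDispatched  Event = "dispatched"
	EvHello       Event = "hello"
	EvClaimed     Event = "claimed"
	EvStageFailed Event = "stage_failed"
)

// key identifies one row of the transition table.
type key struct {
	from State
	ev   Event
}

// target pairs the next state with an optional side-effect name.
type target struct {
	next   State
	action string
}

// transitions encodes each (state, event) → (next, action) exactly once.
var transitions = map[key]target{
	{Queued, EvDispatched}:          {WaitingWoL, "send_wol"},
	{WaitingWoL, EvHello}:           {Booting, ""},
	{Booting, EvClaimed}:            {InventoryCheck, "start_stage"},
	{InventoryCheck, EvStageFailed}: {Failed, "notify"},
}

// Next looks up a transition. Pairs missing from the table are rejected
// rather than guessed, so an unexpected event cannot move a run.
func Next(from State, ev Event) (State, string, error) {
	t, ok := transitions[key{from, ev}]
	if !ok {
		return from, "", fmt.Errorf("no transition from %s on %s", from, ev)
	}
	return t.next, t.action, nil
}

func main() {
	next, action, err := Next(Queued, EvDispatched)
	fmt.Println(next, action, err)
}
```

Keeping the table as data means a new stage is one extra row rather than another branch in a switch, and an unlisted pair fails loudly.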

Agent ↔ orchestrator protocol

GET  /ipxe/{MAC}                     → per-MAC iPXE script
POST /api/v1/runs/{id}/hello         → "I booted; here's my address"
POST /api/v1/runs/{id}/claim         → validate token, receive stage list
POST /api/v1/runs/{id}/heartbeat     → liveness ping; response carries cmd
POST /api/v1/runs/{id}/log           → batch of log lines
POST /api/v1/runs/{id}/sensor        → batch of measurements (thermals, throughput)
POST /api/v1/runs/{id}/result        → stage result; response says next_state
POST /api/v1/runs/{id}/hold          → on FailedHolding, receive authorized_key

See api-reference.md for full request/response schemas and SSE event types.

Auth on every /api/v1/runs/* call: the agent presents a bearer token; the orchestrator stores only its bcrypt hash in runs.agent_token_hash and compares in constant time. The plaintext travels only in the kernel cmdline, so nobody outside the trusted bridge can obtain it: the iPXE script is issued per-MAC and the MAC must already be in the dnsmasq allowlist.
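
The check described above could look roughly like the sketch below. Two things are assumptions made to keep it stdlib-only and self-contained: the token is derived as HMAC-SHA256 over the run ID (the real token format is not documented here), and SHA-256 stands in for bcrypt as the stored hash. The function names are hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// IssueToken derives a per-run token as HMAC-SHA256(secret, runID).
// Assumption: the orchestrator's actual derivation may differ.
func IssueToken(secret []byte, runID string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(runID))
	return hex.EncodeToString(mac.Sum(nil))
}

// HashToken produces the value that would be stored in
// runs.agent_token_hash. SHA-256 is a stand-in for bcrypt here.
func HashToken(token string) []byte {
	sum := sha256.Sum256([]byte(token))
	return sum[:]
}

// VerifyBearer extracts the token from an Authorization header value and
// compares its hash with the stored hash in constant time.
func VerifyBearer(header string, storedHash []byte) bool {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return false
	}
	token := strings.TrimPrefix(header, prefix)
	if token == "" {
		return false
	}
	return subtle.ConstantTimeCompare(HashToken(token), storedHash) == 1
}

func main() {
	token := IssueToken([]byte("demo-secret"), "run-42")
	stored := HashToken(token)
	fmt.Println(VerifyBearer("Bearer "+token, stored))
	fmt.Println(VerifyBearer("Bearer wrong", stored))
}
```

With a real bcrypt hash, the comparison step would instead go through the bcrypt library's own compare function; the surrounding flow stays the same.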

Heartbeat control channel

The heartbeat response carries a cmd field the agent acts on:

| cmd | When fired | Agent action |
| --- | --- | --- |
| continue | Normal case | No-op; keep running current stage |
| shutdown | Run reached Completed | systemctl poweroff |
| abort | Run in FailedHolding or Released | Stop heartbeat loop; let the operator drive |
| retry_stage | Operator pressed "Override wipe" | Re-enter the named stage with override_flags armed |
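
On the agent side, acting on cmd amounts to a small dispatch. The sketch below assumes the heartbeat response is JSON with a `cmd` field and a hypothetical `stage` field naming the stage to retry; the shape of override_flags is not documented here, so it is left out. `Decide` and the `Action` constants are illustrative names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HeartbeatResponse models only what this sketch needs. The "stage"
// field name is an assumption.
type HeartbeatResponse struct {
	Cmd   string `json:"cmd"`
	Stage string `json:"stage,omitempty"`
}

// Action is what the agent loop should do after a heartbeat.
type Action int

const (
	KeepRunning   Action = iota // continue: no-op
	PowerOff                    // shutdown: systemctl poweroff
	StopHeartbeat               // abort: let the operator drive
	RetryStage                  // retry_stage: re-enter the named stage
)

// Decide maps a raw heartbeat response body to an action. Unknown
// commands are reported as errors and treated as "keep running".
func Decide(body []byte) (Action, string, error) {
	var r HeartbeatResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return KeepRunning, "", err
	}
	switch r.Cmd {
	case "continue":
		return KeepRunning, "", nil
	case "shutdown":
		return PowerOff, "", nil
	case "abort":
		return StopHeartbeat, "", nil
	case "retry_stage":
		return RetryStage, r.Stage, nil
	default:
		return KeepRunning, "", fmt.Errorf("unknown cmd %q", r.Cmd)
	}
}

func main() {
	action, stage, err := Decide([]byte(`{"cmd":"retry_stage","stage":"Storage"}`))
	fmt.Println(action, stage, err)
}
```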

Safety: destructive disk tests

Four layered gates:

  1. MAC allowlist — dnsmasq only answers DHCP for registered MACs.
  2. Signed run token — orchestrator issues a per-run HMAC token in the iPXE kernel cmdline; the agent submits it on /claim and the orchestrator verifies before handing back the stage list.
  3. Wipe probe — before badblocks, the agent scans for filesystem signatures / LVM metadata / partition tables. Anything found → FailedHolding on Storage. The operator explicitly clicks Override wipe-probe to proceed.
  4. Device allowlist — the agent only targets block devices matching the inventory's expected_disks. USB sticks and surprise disks are skipped.
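
Gates 3 and 4 can be pictured with the sketch below. It is a simplification under stated assumptions: disks are matched to expected_disks by serial number, only two signatures are checked (the MBR boot signature at bytes 510-511 and a GPT header at byte offset 512, which assumes 512-byte sectors), and filesystem or LVM detection is omitted. The agent's real probe is not shown in this document and may use different tooling. All type and function names are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// Disk is a detected block device plus the first bytes read from it.
type Disk struct {
	Name   string
	Serial string
	Head   []byte
}

// Signatures reports partition-table markers found in head.
func Signatures(head []byte) []string {
	var found []string
	if len(head) >= 512 && head[510] == 0x55 && head[511] == 0xAA {
		found = append(found, "mbr-boot-signature")
	}
	if len(head) >= 520 && bytes.Equal(head[512:520], []byte("EFI PART")) {
		found = append(found, "gpt-header")
	}
	return found
}

// Plan splits detected disks into three buckets.
type Plan struct {
	Wipe    []string // safe to hand to the destructive test
	Hold    []string // signatures found: wait for the operator
	Skipped []string // not in the inventory: never touched
}

// PlanWipe applies the device allowlist first, then the wipe probe.
// override models the operator's explicit approval to proceed.
func PlanWipe(disks []Disk, expected map[string]bool, override bool) Plan {
	var p Plan
	for _, d := range disks {
		switch {
		case !expected[d.Serial]:
			p.Skipped = append(p.Skipped, d.Name)
		case !override && len(Signatures(d.Head)) > 0:
			p.Hold = append(p.Hold, d.Name)
		default:
			p.Wipe = append(p.Wipe, d.Name)
		}
	}
	return p
}

func main() {
	blank := make([]byte, 1024)
	mbr := make([]byte, 1024)
	mbr[510], mbr[511] = 0x55, 0xAA

	disks := []Disk{
		{Name: "sda", Serial: "S1", Head: blank},
		{Name: "sdb", Serial: "S2", Head: mbr},
		{Name: "sdc", Serial: "USB", Head: blank},
	}
	expected := map[string]bool{"S1": true, "S2": true}
	fmt.Printf("%+v\n", PlanWipe(disks, expected, false))
}
```

Checking the allowlist before the probe means a surprise disk is skipped even when the operator has approved an override.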

Notifications

Fire-and-forget. The orchestrator fires four event kinds:

| Kind | Severity | When |
| --- | --- | --- |
| StageFailed | critical | Any stage returns passed=false |
| SpecMismatch | critical | SpecValidate finds critical diffs |
| HoldingOpened | critical | Agent POSTs /hold (operator can SSH in) |
| RunCompleted | info | Pipeline reaches Completed |

The config maps event kinds and severities to one or more notifiers (ntfy, Discord webhook, SMTP). Each notifier gets one attempt per event with a 10 s timeout; delivery failures are logged and nothing is persisted.

Why a separate notify package?

Keeps the /result and /hold handlers non-blocking. Each dispatch starts a goroutine per target; a slow ntfy server doesn't back up an SMTP notifier or delay the HTTP response to the agent.

Data retention

The janitor goroutine (internal/janitor) runs a sweep every janitor.interval_minutes (default 60) and deletes:

  • artifact files older than artifacts.retention_days, plus their artifacts table rows
  • log files older than logs.retention_days

Rows in runs, hosts, stages, measurements, and spec_diffs are never deleted by the janitor, so host histories and aggregate metrics survive cleanups.

Threshold engine

Every /sensor batch is evaluated against rules seeded per-run at creation time from the ProfileRegistry + per-host overrides. Rules are immutable for the life of a run — a late config edit can't retroactively pass or fail an in-flight run.

Operators: lt, lte, gt, gte, within_pct. Key matching is glob-ish: * matches all keys, cpu/* matches any key starting with cpu/, and any other pattern matches only that exact key. Stage matching works the same way (* for global, exact name for stage-specific).

Severity drives the action:

  • critical — fail the run immediately. The current stage is marked failed, the run enters FailedHolding, and a StageFailed notification fires.
  • warning — record the breach for the report. The stage continues.

Every evaluation (pass or fail) is persisted as a threshold_evaluations row so the report can render per-sample verdict badges. See configuration.md § thresholds for the config-level reference.
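
To make the matching and operator rules concrete, here is a small evaluation sketch. The `Rule` fields and function names are hypothetical, and the meaning of within_pct is an assumption: the sample passes when it lies within ±Pct percent of Value. Persistence of verdicts and the severity-driven actions are left out.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Rule is a hypothetical in-memory form of one threshold rule.
type Rule struct {
	Stage    string  // "*" or an exact stage name
	Key      string  // "*", "prefix/*", or an exact key
	Op       string  // lt, lte, gt, gte, within_pct
	Value    float64 // comparison target
	Pct      float64 // tolerance, used only by within_pct
	Severity string  // critical or warning
}

// matches implements the glob-ish matching described above.
func matches(pattern, s string) bool {
	switch {
	case pattern == "*":
		return true
	case strings.HasSuffix(pattern, "/*"):
		return strings.HasPrefix(s, strings.TrimSuffix(pattern, "*"))
	default:
		return pattern == s
	}
}

// Passes reports whether sample satisfies the rule's condition.
func (r Rule) Passes(sample float64) (bool, error) {
	switch r.Op {
	case "lt":
		return sample < r.Value, nil
	case "lte":
		return sample <= r.Value, nil
	case "gt":
		return sample > r.Value, nil
	case "gte":
		return sample >= r.Value, nil
	case "within_pct":
		// Assumed semantics: |sample - Value| <= Pct% of |Value|.
		return math.Abs(sample-r.Value) <= math.Abs(r.Value)*r.Pct/100, nil
	default:
		return false, fmt.Errorf("unknown operator %q", r.Op)
	}
}

// Verdict records one evaluation, pass or fail.
type Verdict struct {
	Rule   Rule
	Passed bool
}

// Evaluate applies every rule whose stage and key patterns match.
func Evaluate(rules []Rule, stage, key string, sample float64) ([]Verdict, error) {
	var out []Verdict
	for _, r := range rules {
		if !matches(r.Stage, stage) || !matches(r.Key, key) {
			continue
		}
		ok, err := r.Passes(sample)
		if err != nil {
			return nil, err
		}
		out = append(out, Verdict{Rule: r, Passed: ok})
	}
	return out, nil
}

func main() {
	rules := []Rule{
		{Stage: "*", Key: "cpu/*", Op: "lt", Value: 90, Severity: "critical"},
		{Stage: "Network", Key: "throughput", Op: "within_pct", Value: 1000, Pct: 10, Severity: "warning"},
	}
	verdicts, err := Evaluate(rules, "CPUStress", "cpu/temp", 95)
	fmt.Println(len(verdicts), err)
	for _, v := range verdicts {
		fmt.Println(v.Rule.Severity, v.Passed)
	}
}
```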

Host-mode agent

The vetting-agent host binary runs as a systemd service on installed hosts. It sends POST /api/v1/hosts/{mac}/heartbeat every 30 s so the dashboard can show online/offline status.

The quick-register one-liner (GET /register/quick.sh) downloads the agent binary from /assets/vetting-agent-linux-amd64, installs it as a systemd service, and automatically calls POST /api/v1/hosts to register the host, so no manual MAC entry is needed.

When the operator clicks Start Vetting, the orchestrator's dispatcher sets cmd=reboot_for_vetting on the next heartbeat response. The host-mode agent reboots the host, which PXE-boots into the live image and enters the normal vetting flow.

Host API

These endpoints are LAN-trusted (no bearer token) and share the same threat model as the browser UI:

POST /api/v1/hosts                  → JSON host registration (quick-register)
POST /api/v1/hosts/{mac}/heartbeat  → host-mode liveness + command channel

Reproducible builds

The orchestrator and agent are pure Go; make orchestrator-linux cross-compiles to linux-amd64 from Windows or macOS.

The live image requires Linux-side tooling (mkosi, debootstrap, squashfs-tools), so make live-image fails loudly on Windows and redirects to wsl make live-image. Pinning to snapshot.debian.org in live-image/mkosi.conf keeps the image contents stable over time for a given git SHA.