Initial commit: full Phases 1-6 implementation

Post-repair hardware validation pipeline for Proxmox cluster hosts.
Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq
PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.
2026-04-17 21:32:10 -04:00
commit 9bb4b09a04
98 changed files with 11960 additions and 0 deletions
@@ -0,0 +1,178 @@
# Architecture
A single Go binary runs the orchestrator. A second Go binary runs
inside a custom Debian live image (built with mkosi) and becomes the
per-run test agent. The two talk over HTTP + SSE.
```
Operator browser (HTMX + SSE, admin login)
│ HTTPS
┌───────────────────────────────────────────────────────────────┐
│ Orchestrator LXC — single Go binary `vetting` │
│ │
│ UI (Templ) ─┬─ Agent API ─┬─ SSE hub │
│ │ │ │
│ Orchestrator core (state machine, dispatcher sem=3, │
│ stage executors, WoL sender, token issuer) │
│ │ │
│ ┌─────┴─────┬──────────┐ │
│ ▼ ▼ ▼ │
│ SQLite flat-file logs dnsmasq subprocess │
│ (DHCP+TFTP+HTTP, MAC allowlist)│
│ │
│ Janitor goroutine (retention-based cleanup) │
│ Notifier registry (ntfy/discord/smtp) │
└─────────────────────────────────────────┬─────────────────────┘
│ LAN
Host under test (×23)
PXE → iPXE → Linux live image
└─ vetting-agent (HTTP+SSE back)
```
## Packages
| Package | Purpose |
|---|---|
| `cmd/vetting` | Orchestrator entrypoint. Wires config, stores, runner, dispatcher, iperf supervisor, PXE supervisor, janitor, HTTP router. |
| `cmd/vetting-agent` | In-image agent entrypoint. Reads kernel cmdline params, starts the agent loop. |
| `internal/config` | YAML loader + types. |
| `internal/db` | SQLite open + embedded migrations. Pure Go via modernc.org/sqlite. |
| `internal/model` | Plain structs: `Host`, `Run`, `Stage`, `Measurement`, `SpecDiff`, `Artifact`. |
| `internal/store` | Repository layer; SQL is hand-written. |
| `internal/orchestrator` | State machine, dispatcher, per-run runner, WoL sender, HMAC run tokens, iperf supervisor. |
| `internal/api` | HTTP handlers: `agent_handlers.go` (the agent-facing API) and `ui_handlers.go` (HTMX fragments + SSE). |
| `internal/httpserver` | chi router assembly — lives here to avoid `api ↔ orchestrator` cyclic imports. |
| `internal/web` | Embedded static assets + compiled Templ templates. |
| `internal/auth` | Single-admin bcrypt + signed-cookie sessions. |
| `internal/pxe` | dnsmasq subprocess supervisor + per-MAC iPXE script generator. |
| `internal/events` | In-process SSE hub (fan-out to live browser clients). |
| `internal/logs` | Per-run flat-file writer + SSE fan-out of live log tail. |
| `internal/spec` | Expected-vs-actual diff engine with severity classification. |
| `internal/notify` | Pluggable notifier registry (ntfy, Discord webhook, SMTP). |
| `internal/report` | HTML + JSON report generation (html/template, self-contained). |
| `internal/hold` | Per-run SSH key issuance for `FailedHolding`. |
| `internal/janitor` | Retention-based cleanup of old artifact files + log files. |
| `agent/` | In-image agent: claim loop, stage dispatch, heartbeat, log forwarder, thermal sidecar. |
| `agent/probes` | lshw, dmidecode, smartctl, lspci, hwmon, nvidia-smi wrappers. |
| `agent/tests` | Per-stage test implementations (SMART, CPUStress, Storage, Network, GPU, PSU). |
| `live-image/` | mkosi config + postinst for the Debian live image. |
| `deploy/` | systemd unit + example config + install.sh. |
| `test/e2e/` | Build-tagged (`-tags=e2e`) QEMU + PXE full-stack test. |
## State machine
Per-run state is the single source of truth; the UI is a pure
projection of DB + event stream.
```
Registered → Queued → WaitingWoL → Booting → InventoryCheck
→ SpecValidate → SMART → CPUStress → Storage → Network
→ GPU → PSU → Reporting → Completed
any stage → Failed → FailedHolding → Released
```
Key points:
- **Transitions are table-driven** (`internal/orchestrator/statemachine.go`).
Each `(state, event) → (next, action)` is encoded once.
- **Orchestrator-owned stages resolve inside `/result`:** `SpecValidate`
and `Reporting` flip state forward as part of the preceding stage's
result handler, so the agent never sees them as "its turn".
- **Stage rows persist before SSE fan-out** — the UI can re-derive
state by reading SQLite, and an SSE reconnect mid-run just fetches
fresh tile fragments.
## Agent ↔ orchestrator protocol
```
GET /ipxe/{MAC} → per-MAC iPXE script
POST /api/v1/runs/{id}/hello → "I booted; here's my address"
POST /api/v1/runs/{id}/claim → validate token, receive stage list
POST /api/v1/runs/{id}/heartbeat → liveness ping; response carries cmd
POST /api/v1/runs/{id}/log → batch of log lines
POST /api/v1/runs/{id}/sensor → batch of measurements (thermals, throughput)
POST /api/v1/runs/{id}/result → stage result; response says next_state
POST /api/v1/runs/{id}/hold → on FailedHolding, receive authorized_key
```
Auth on every `/api/v1/*` call: the agent presents its run token as a
bearer token. The orchestrator stores only a bcrypt hash of it in
`runs.agent_token_hash` and compares in constant time. The plaintext
travels in the kernel cmdline, so nobody outside the trusted bridge can
obtain it: the iPXE script is issued per-MAC and the MAC must already
be in the dnsmasq allowlist.
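The store-a-hash, compare-in-constant-time pattern can be sketched with the standard library alone. Note the swap: this example uses SHA-256 in place of bcrypt so it has no external dependency. With bcrypt the comparison would go through `bcrypt.CompareHashAndPassword` from `golang.org/x/crypto/bcrypt`. The function names `issueToken` and `verifyToken` are hypothetical.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// issueToken returns a random plaintext token (destined for the kernel
// cmdline) and the hash that would be persisted instead of the plaintext.
// SHA-256 stands in for bcrypt here; that is an assumption of this sketch.
func issueToken() (plaintext string, storedHash []byte, err error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", nil, err
	}
	plaintext = hex.EncodeToString(raw)
	sum := sha256.Sum256([]byte(plaintext))
	return plaintext, sum[:], nil
}

// verifyToken hashes the presented bearer token and compares it to the
// stored hash without leaking timing information about where they differ.
func verifyToken(presented string, storedHash []byte) bool {
	sum := sha256.Sum256([]byte(presented))
	return subtle.ConstantTimeCompare(sum[:], storedHash) == 1
}

func main() {
	plaintext, stored, err := issueToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("valid token accepted:", verifyToken(plaintext, stored))
	fmt.Println("wrong token accepted:", verifyToken("not-the-token", stored))
}
```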
### Heartbeat control channel
The heartbeat response carries a `cmd` field the agent acts on:
| cmd | When fired | Agent action |
|---|---|---|
| `continue` | Normal case | No-op; keep running current stage |
| `shutdown` | Run reached `Completed` | `systemctl poweroff` |
| `abort` | Run in `FailedHolding` or `Released` | Stop heartbeat loop; let the operator drive |
| `retry_stage` | Operator pressed "Override wipe-probe" | Re-enter the named stage with `override_flags` armed |
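On the agent side, handling the table above comes down to decoding the response and switching on `cmd`. The sketch below assumes a JSON body. Only `cmd` and `override_flags` appear in this document, so the `stage` field name and the shape of `override_flags` (a string list) are guesses, and the returned strings merely describe what the agent would do.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// heartbeatResponse models the heartbeat reply. Field names other than
// "cmd" and "override_flags" are hypothetical.
type heartbeatResponse struct {
	Cmd           string   `json:"cmd"`
	Stage         string   `json:"stage,omitempty"`          // assumed field
	OverrideFlags []string `json:"override_flags,omitempty"` // assumed shape
}

// nextAction decodes a heartbeat reply and describes the agent's reaction.
func nextAction(body []byte) (string, error) {
	var resp heartbeatResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", fmt.Errorf("decode heartbeat response: %w", err)
	}
	switch resp.Cmd {
	case "continue":
		return "keep running current stage", nil
	case "shutdown":
		return "systemctl poweroff", nil
	case "abort":
		return "stop heartbeat loop", nil
	case "retry_stage":
		if resp.Stage == "" {
			return "", fmt.Errorf("retry_stage without a stage name")
		}
		return fmt.Sprintf("re-enter %s with overrides %v", resp.Stage, resp.OverrideFlags), nil
	default:
		// Rejecting unknown commands is a design choice of this sketch.
		return "", fmt.Errorf("unknown cmd %q", resp.Cmd)
	}
}

func main() {
	for _, body := range []string{
		`{"cmd":"continue"}`,
		`{"cmd":"retry_stage","stage":"Storage","override_flags":["wipe_probe"]}`,
		`{"cmd":"reboot"}`,
	} {
		action, err := nextAction([]byte(body))
		fmt.Println(action, err)
	}
}
```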
## Safety: destructive disk tests
Four layered gates:
1. **MAC allowlist** — dnsmasq only answers DHCP for registered MACs.
2. **Signed run token** — orchestrator issues a per-run HMAC token in
the iPXE kernel cmdline; the agent submits it on `/claim` and the
orchestrator verifies before handing back the stage list.
3. **Wipe probe** — before `badblocks`, the agent scans for filesystem
signatures / LVM metadata / partition tables. Anything found →
`FailedHolding` on `Storage`. The operator explicitly clicks
**Override wipe-probe** to proceed.
4. **Device allowlist** — the agent only targets block devices matching
the inventory's `expected_disks`. USB sticks and surprise disks are
skipped.
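Gate 4 is essentially a filter over discovered block devices. A minimal sketch, assuming `expected_disks` can be reduced to a list of serial numbers (the document does not say how devices are matched, so the `blockDevice` type and matching by serial are assumptions):

```go
package main

import "fmt"

// blockDevice is a hypothetical view of one discovered disk.
type blockDevice struct {
	Path   string
	Serial string
}

// selectTargets splits discovered devices into those listed in the
// inventory and those that must be left alone. Devices without a serial
// are never targeted.
func selectTargets(found []blockDevice, expectedSerials []string) (targets, skipped []blockDevice) {
	allowed := make(map[string]bool, len(expectedSerials))
	for _, s := range expectedSerials {
		allowed[s] = true
	}
	for _, d := range found {
		if d.Serial != "" && allowed[d.Serial] {
			targets = append(targets, d)
		} else {
			skipped = append(skipped, d)
		}
	}
	return targets, skipped
}

func main() {
	found := []blockDevice{
		{Path: "/dev/sda", Serial: "WD-123"},
		{Path: "/dev/sdb", Serial: "USB-STICK"},
	}
	targets, skipped := selectTargets(found, []string{"WD-123"})
	fmt.Println("targets:", targets)
	fmt.Println("skipped:", skipped)
}
```

Defaulting to "skip" for anything not positively matched keeps the failure mode safe: a mislabeled inventory means an untested disk, not a wiped one.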
## Notifications
Fire-and-forget. The orchestrator fires four event kinds:
| Kind | Severity | When |
|---|---|---|
| `StageFailed` | critical | Any stage returns `passed=false` |
| `SpecMismatch` | critical | `SpecValidate` finds critical diffs |
| `HoldingOpened` | critical | Agent POSTs `/hold` (operator can SSH in) |
| `RunCompleted` | info | Pipeline reaches `Completed` |
The config maps event kinds and severities to one or more notifiers
(ntfy, Discord webhook, SMTP). Each notifier gets one attempt per
event with a 10s timeout; delivery failures are logged, nothing is
persisted.
## Why a separate notify package?
Keeps the `/result` and `/hold` handlers non-blocking. Each dispatch
starts a goroutine per target; a slow ntfy server doesn't back up an
SMTP notifier or delay the HTTP response to the agent.
## Data retention
The janitor goroutine (`internal/janitor`) runs a sweep every
`janitor.interval_minutes` (default 60) and deletes:
- artifact files older than `artifacts.retention_days`, plus their
`artifacts` table rows
- log files older than `logs.retention_days`
`runs`, `hosts`, `stages`, `measurements`, `spec_diffs` rows are
**never** deleted by the janitor — host histories and aggregate
metrics survive cleanups.
## Reproducible builds
The orchestrator and agent are pure Go; `make orchestrator-linux`
cross-compiles to `linux-amd64` from Windows or macOS.
The live image requires Linux-side tooling (mkosi, debootstrap,
squashfs-tools) so `make live-image` fails loudly on Windows and
redirects to `wsl make live-image`. Pinning to snapshot.debian.org in
`live-image/mkosi.conf` keeps image bits stable across time for a
given git SHA.