Initial commit: full Phases 1-6 implementation

Post-repair hardware validation pipeline for Proxmox cluster hosts. Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.
2026-04-17 21:32:10 -04:00
commit 9bb4b09a04
98 changed files with 11960 additions and 0 deletions
@@ -0,0 +1,178 @@
+# Architecture
+
+A single Go binary runs the orchestrator. A second Go binary runs
+inside a custom Debian live image (built with mkosi) and becomes the
+per-run test agent. The two talk over HTTP + SSE.
+
+```
+Operator browser (HTMX + SSE, admin login)
+   │ HTTPS
+   ▼
+┌───────────────────────────────────────────────────────────────┐
+│  Orchestrator LXC — single Go binary `vetting`                │
+│                                                               │
+│   UI (Templ) ─┬─ Agent API ─┬─ SSE hub                        │
+│               │             │                                 │
+│         Orchestrator core (state machine, dispatcher sem=3,   │
+│         stage executors, WoL sender, token issuer)            │
+│               │                                               │
+│         ┌─────┴─────┬──────────┐                              │
+│         ▼           ▼          ▼                              │
+│     SQLite   flat-file logs   dnsmasq subprocess              │
+│                                (DHCP+TFTP+HTTP, MAC allowlist)│
+│                                                               │
+│         Janitor goroutine (retention-based cleanup)           │
+│         Notifier registry (ntfy/discord/smtp)                 │
+└─────────────────────────────────────────┬─────────────────────┘
+                                          │ LAN
+                                          ▼
+                               Host under test (×2–3)
+                               PXE → iPXE → Linux live image
+                                 └─ vetting-agent (HTTP+SSE back)
+```
+
+## Packages
+
+| Package | Purpose |
+|---|---|
+| `cmd/vetting` | Orchestrator entrypoint. Wires config, stores, runner, dispatcher, iperf supervisor, PXE supervisor, janitor, HTTP router. |
+| `cmd/vetting-agent` | In-image agent entrypoint. Reads kernel cmdline params, starts the agent loop. |
+| `internal/config` | YAML loader + types. |
+| `internal/db` | SQLite open + embedded migrations. Pure Go via modernc.org/sqlite. |
+| `internal/model` | Plain structs: `Host`, `Run`, `Stage`, `Measurement`, `SpecDiff`, `Artifact`. |
+| `internal/store` | Repository layer; SQL is hand-written. |
+| `internal/orchestrator` | State machine, dispatcher, per-run runner, WoL sender, HMAC run tokens, iperf supervisor. |
+| `internal/api` | HTTP handlers: `agent_handlers.go` (the agent-facing API) and `ui_handlers.go` (HTMX fragments + SSE). |
+| `internal/httpserver` | chi router assembly — lives here to avoid `api ↔ orchestrator` cyclic imports. |
+| `internal/web` | Embedded static assets + compiled Templ templates. |
+| `internal/auth` | Single-admin bcrypt + signed-cookie sessions. |
+| `internal/pxe` | dnsmasq subprocess supervisor + per-MAC iPXE script generator. |
+| `internal/events` | In-process SSE hub (fan-out to live browser clients). |
+| `internal/logs` | Per-run flat-file writer + SSE fan-out of live log tail. |
+| `internal/spec` | Expected-vs-actual diff engine with severity classification. |
+| `internal/notify` | Pluggable notifier registry (ntfy, Discord webhook, SMTP). |
+| `internal/report` | HTML + JSON report generation (html/template, self-contained). |
+| `internal/hold` | Per-run SSH key issuance for `FailedHolding`. |
+| `internal/janitor` | Retention-based cleanup of old artifact files + log files. |
+| `agent/` | In-image agent: claim loop, stage dispatch, heartbeat, log forwarder, thermal sidecar. |
+| `agent/probes` | lshw, dmidecode, smartctl, lspci, hwmon, nvidia-smi wrappers. |
+| `agent/tests` | Per-stage test implementations (SMART, CPUStress, Storage, Network, GPU, PSU). |
+| `live-image/` | mkosi config + postinst for the Debian live image. |
+| `deploy/` | systemd unit + example config + install.sh. |
+| `test/e2e/` | Build-tagged (`-tags=e2e`) QEMU + PXE full-stack test. |
+
+## State machine
+
+Per-run state is the single source of truth; the UI is a pure
+projection of DB + event stream.
+
+```
+Registered → Queued → WaitingWoL → Booting → InventoryCheck
+  → SpecValidate → SMART → CPUStress → Storage → Network
+  → GPU → PSU → Reporting → Completed
+
+any stage → Failed → FailedHolding → Released
+```
+
+Key points:
+
+- **Transitions are table-driven** (`internal/orchestrator/statemachine.go`).
+  Each `(state, event) → (next, action)` is encoded once.
+- **Orchestrator-owned stages resolve inside `/result`:** `SpecValidate`
+  and `Reporting` flip state forward as part of the preceding stage's
+  result handler, so the agent never sees them as "its turn".
+- **Stage rows persist before SSE fan-out** — the UI can re-derive
+  state by reading SQLite, and an SSE reconnect mid-run just fetches
+  fresh tile fragments.
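
A table-driven transition lookup of the kind described above can be sketched as follows. This is a minimal illustration, not the contents of `statemachine.go`: the `edge` type, the `Next` function, and the event names (`stage_passed`, `stage_failed`, `hold_opened`, `released`) are all assumptions, and only a subset of states is listed.

```go
package main

import "fmt"

// State and Event are plain strings in this sketch; the real types are not shown in the docs.
type State string
type Event string

// edge is the lookup key: one (state, event) pair.
type edge struct {
	from State
	on   Event
}

// transitions is a hypothetical subset of the forward table.
// Each (state, event) pair appears exactly once.
var transitions = map[edge]State{
	{"SMART", "stage_passed"}:     "CPUStress",
	{"CPUStress", "stage_passed"}: "Storage",
	{"Storage", "stage_passed"}:   "Network",
	{"Storage", "stage_failed"}:   "Failed",
	{"Failed", "hold_opened"}:     "FailedHolding",
	{"FailedHolding", "released"}: "Released",
}

// Next returns the state reached from `from` on event `on`,
// or an error when the table has no such edge.
func Next(from State, on Event) (State, error) {
	to, ok := transitions[edge{from, on}]
	if !ok {
		return from, fmt.Errorf("no transition from %q on %q", from, on)
	}
	return to, nil
}

func main() {
	s := State("SMART")
	for _, ev := range []Event{"stage_passed", "stage_passed", "stage_failed", "hold_opened"} {
		next, err := Next(s, ev)
		if err != nil {
			fmt.Println("rejected:", err)
			return
		}
		fmt.Printf("%s --%s--> %s\n", s, ev, next)
		s = next
	}
}
```

Keeping every edge in one map means an illegal transition is rejected by a failed lookup rather than by scattered `if` checks.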
+
+## Agent ↔ orchestrator protocol
+
+```
+GET  /ipxe/{MAC}                     → per-MAC iPXE script
+POST /api/v1/runs/{id}/hello         → "I booted; here's my address"
+POST /api/v1/runs/{id}/claim         → validate token, receive stage list
+POST /api/v1/runs/{id}/heartbeat     → liveness ping; response carries cmd
+POST /api/v1/runs/{id}/log           → batch of log lines
+POST /api/v1/runs/{id}/sensor        → batch of measurements (thermals, throughput)
+POST /api/v1/runs/{id}/result        → stage result; response says next_state
+POST /api/v1/runs/{id}/hold          → on FailedHolding, receive authorized_key
+```
+
+Auth on every `/api/v1/*` call: the agent presents its run token as a
+bearer token; the orchestrator stores it as a bcrypt hash in
+`runs.agent_token_hash` and compares in constant time. The plaintext
+is delivered in the kernel cmdline of the per-MAC iPXE script, so
+obtaining it requires being on the trusted bridge with a MAC that is
+already in the dnsmasq allowlist.
+
+### Heartbeat control channel
+
+The heartbeat response carries a `cmd` field the agent acts on:
+
+| cmd | When fired | Agent action |
+|---|---|---|
+| `continue` | Normal case | No-op; keep running current stage |
+| `shutdown` | Run reached `Completed` | `systemctl poweroff` |
+| `abort` | Run in `FailedHolding` or `Released` | Stop heartbeat loop; let the operator drive |
+| `retry_stage` | Operator pressed "Override wipe-probe" | Re-enter the named stage with `override_flags` armed |
+
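The agent side of the table above could be handled roughly like this. It is a sketch under assumptions: only the `cmd` values and the `override_flags` name come from the docs, while the `stage` JSON field, the type of `override_flags`, the `Action` names, and the `Decide` function are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HeartbeatResponse mirrors the fields the agent needs.
// "stage" and the []string type of "override_flags" are assumptions.
type HeartbeatResponse struct {
	Cmd           string   `json:"cmd"`
	Stage         string   `json:"stage,omitempty"`
	OverrideFlags []string `json:"override_flags,omitempty"`
}

// Action names what the agent loop should do next (hypothetical names).
type Action string

const (
	KeepRunning   Action = "keep-running"
	PowerOff      Action = "poweroff"
	StopHeartbeat Action = "stop-heartbeat"
	RerunStage    Action = "rerun-stage"
)

// Decide parses a heartbeat response body and maps its cmd to an Action.
// Unknown commands are errors so a protocol mismatch is visible in logs.
func Decide(body []byte) (Action, HeartbeatResponse, error) {
	var r HeartbeatResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", r, fmt.Errorf("decode heartbeat response: %w", err)
	}
	switch r.Cmd {
	case "continue":
		return KeepRunning, r, nil
	case "shutdown":
		return PowerOff, r, nil
	case "abort":
		return StopHeartbeat, r, nil
	case "retry_stage":
		if r.Stage == "" {
			return "", r, fmt.Errorf("retry_stage without a stage name")
		}
		return RerunStage, r, nil
	default:
		return "", r, fmt.Errorf("unknown cmd %q", r.Cmd)
	}
}

func main() {
	body := []byte(`{"cmd":"retry_stage","stage":"Storage","override_flags":["wipe_probe"]}`)
	action, resp, err := Decide(body)
	fmt.Println(action, resp.Stage, resp.OverrideFlags, err)
}
```
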
+## Safety: destructive disk tests
+
+Four layered gates:
+
+1. **MAC allowlist** — dnsmasq only answers DHCP for registered MACs.
+2. **Signed run token** — orchestrator issues a per-run HMAC token in
+   the iPXE kernel cmdline; the agent submits it on `/claim` and the
+   orchestrator verifies before handing back the stage list.
+3. **Wipe probe** — before `badblocks`, the agent scans for filesystem
+   signatures / LVM metadata / partition tables. Anything found →
+   `FailedHolding` on `Storage`. The operator explicitly clicks
+   **Override wipe-probe** to proceed.
+4. **Device allowlist** — the agent only targets block devices matching
+   the inventory's `expected_disks`. USB sticks and surprise disks are
+   skipped.
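
Gate 2 can be illustrated with a short HMAC sketch. The payload layout (`runID|mac`), the hex encoding, and the function names are assumptions made for illustration. The docs also describe a stored hash in `runs.agent_token_hash`; this sketch recomputes the MAC on verification instead, which is a different but related technique.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// IssueToken derives a per-run token from a server-side secret.
// The "runID|mac" payload format is a hypothetical choice.
func IssueToken(secret []byte, runID int64, mac string) string {
	m := hmac.New(sha256.New, secret)
	fmt.Fprintf(m, "%d|%s", runID, mac)
	return hex.EncodeToString(m.Sum(nil))
}

// VerifyToken recomputes the expected token and compares it to the
// presented one in constant time via hmac.Equal.
func VerifyToken(secret []byte, runID int64, mac, presented string) bool {
	want := IssueToken(secret, runID, mac)
	return hmac.Equal([]byte(want), []byte(presented))
}

func main() {
	secret := []byte("example-secret-do-not-use")
	tok := IssueToken(secret, 42, "52:54:00:12:34:56")

	fmt.Println("same run:     ", VerifyToken(secret, 42, "52:54:00:12:34:56", tok))
	fmt.Println("different run:", VerifyToken(secret, 43, "52:54:00:12:34:56", tok))
}
```

Binding the run ID and MAC into the MAC input means a token lifted from one run's kernel cmdline does not validate for another run or another host.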
+
+## Notifications
+
+Fire-and-forget. The orchestrator fires four event kinds:
+
+| Kind | Severity | When |
+|---|---|---|
+| `StageFailed` | critical | Any stage returns `passed=false` |
+| `SpecMismatch` | critical | `SpecValidate` finds critical diffs |
+| `HoldingOpened` | critical | Agent POSTs `/hold` (operator can SSH in) |
+| `RunCompleted` | info | Pipeline reaches `Completed` |
+
+The config maps event kinds and severities to one or more notifiers
+(ntfy, Discord webhook, SMTP). Each notifier gets one attempt per
+event with a 10s timeout; delivery failures are logged, nothing is
+persisted.
+
+## Why a separate notify package?
+
+Keeps the `/result` and `/hold` handlers non-blocking. Each dispatch
+starts a goroutine per target; a slow ntfy server doesn't back up an
+SMTP notifier or delay the HTTP response to the agent.
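
The goroutine-per-target dispatch described above might look like the following. It is a sketch, not the `internal/notify` API: the `Notifier` struct, `Dispatch`, and `demo` are hypothetical names, and the returned `WaitGroup` exists only so the example can wait for completion (a real handler would ignore it).

```go
package main

import (
	"context"
	"fmt"
	"log"
	"sort"
	"sync"
	"time"
)

// Event is a trimmed-down notification payload (fields assumed).
type Event struct{ Kind, Severity string }

// Notifier is a hypothetical target: a name plus a send function.
type Notifier struct {
	Name string
	Send func(ctx context.Context, ev Event) error
}

// Dispatch starts one goroutine per target and returns immediately.
// Each target gets a single attempt bounded by timeout; failures are
// logged and otherwise dropped.
func Dispatch(targets []Notifier, ev Event, timeout time.Duration) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	for _, n := range targets {
		wg.Add(1)
		go func(n Notifier) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), timeout)
			defer cancel()
			if err := n.Send(ctx, ev); err != nil {
				log.Printf("notify: %s: %v", n.Name, err)
			}
		}(n)
	}
	return wg
}

// demo sends one event to a stuck target and two healthy ones and
// returns the names of the targets that delivered, sorted.
func demo() []string {
	var mu sync.Mutex
	var delivered []string
	ok := func(name string) func(context.Context, Event) error {
		return func(context.Context, Event) error {
			mu.Lock()
			defer mu.Unlock()
			delivered = append(delivered, name)
			return nil
		}
	}
	stuck := func(ctx context.Context, _ Event) error {
		<-ctx.Done() // simulates a server that never answers
		return ctx.Err()
	}
	targets := []Notifier{{"ntfy", stuck}, {"smtp", ok("smtp")}, {"discord", ok("discord")}}
	Dispatch(targets, Event{"StageFailed", "critical"}, 50*time.Millisecond).Wait()
	sort.Strings(delivered)
	return delivered
}

func main() {
	fmt.Println(demo())
}
```

The stuck target times out on its own goroutine while the other two deliver, which is the isolation property the section argues for.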
+
+## Data retention
+
+The janitor goroutine (`internal/janitor`) runs a sweep every
+`janitor.interval_minutes` (default 60) and deletes:
+
+- artifact files older than `artifacts.retention_days`, plus their
+  `artifacts` table rows
+- log files older than `logs.retention_days`
+
+`runs`, `hosts`, `stages`, `measurements`, `spec_diffs` rows are
+**never** deleted by the janitor — host histories and aggregate
+metrics survive cleanups.
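
A single sweep over a flat directory of log files could be sketched as below. The `Sweep` signature and the `demo` helper are hypothetical, the sketch handles only on-disk files in one directory, and it does not touch the `artifacts` table rows that the real janitor also removes.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// Sweep removes regular files directly under dir whose modification
// time is older than retention, measured from now. It returns the
// names it removed.
func Sweep(dir string, retention time.Duration, now time.Time) ([]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	cutoff := now.Add(-retention)
	var removed []string
	for _, e := range entries {
		info, err := e.Info()
		if err != nil || !info.Mode().IsRegular() {
			continue // skip directories and entries that vanished mid-sweep
		}
		if info.ModTime().Before(cutoff) {
			if err := os.Remove(filepath.Join(dir, e.Name())); err != nil {
				return removed, err
			}
			removed = append(removed, e.Name())
		}
	}
	return removed, nil
}

// demo builds a temp directory with one 40-day-old file and one fresh
// file, then sweeps with a 30-day retention.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "janitor-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	now := time.Now()
	ages := map[string]time.Duration{"old.log": 40 * 24 * time.Hour, "new.log": time.Hour}
	for name, age := range ages {
		p := filepath.Join(dir, name)
		if err := os.WriteFile(p, []byte("x"), 0o644); err != nil {
			return nil, err
		}
		mt := now.Add(-age)
		if err := os.Chtimes(p, mt, mt); err != nil {
			return nil, err
		}
	}
	return Sweep(dir, 30*24*time.Hour, now)
}

func main() {
	removed, err := demo()
	fmt.Println(removed, err)
}
```

Passing `now` in as a parameter keeps the cutoff calculation testable without waiting for real time to pass.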
+
+## Reproducible builds
+
+The orchestrator and agent are pure Go; `make orchestrator-linux`
+cross-compiles to `linux-amd64` from Windows or macOS.
+
+The live image requires Linux-side tooling (mkosi, debootstrap,
+squashfs-tools) so `make live-image` fails loudly on Windows and
+redirects to `wsl make live-image`. Pinning to snapshot.debian.org in
+`live-image/mkosi.conf` keeps image bits stable across time for a
+given git SHA.
@@ -0,0 +1,171 @@
+# Operations
+
+Operator-facing runbook for the vetting orchestrator. If you're looking
+for the "what does the system do" overview, see
+[architecture.md](architecture.md). For what each test stage actually
+measures, see [test-suite.md](test-suite.md).
+
+## Install (Proxmox LXC)
+
+Target: a Debian/Ubuntu LXC on a Proxmox host in the cluster you're
+vetting nodes for. The LXC must be on the same L2 segment as the
+repaired nodes so DHCP and WoL work.
+
+1. On your workstation, cross-build the binary:
+
+   ```
+   make orchestrator-linux
+   ```
+
+   This produces `bin/vetting-linux-amd64`.
+
+2. Copy the repo tree (or just `bin/`, `deploy/`) into the LXC, then
+   from inside the LXC:
+
+   ```
+   sudo ./deploy/install.sh
+   ```
+
+   The installer:
+   - `apt install`s `dnsmasq`, `iperf3`, `ca-certificates`
+   - creates the `vetting` system user (home = `/var/lib/vetting`)
+   - installs the binary into `/usr/local/bin/vetting`
+   - drops `vetting.example.yaml` into `/etc/vetting/vetting.yaml`
+     (only if there's no existing config — existing configs are
+     preserved)
+   - drops `/etc/systemd/system/vetting.service`
+   - disables the distro-default dnsmasq (the orchestrator supervises
+     its own)
+
+   The installer does **not** enable the service, because the default
+   config has a placeholder bcrypt password that the binary refuses to
+   start with.
+
+3. Generate an admin password hash and a session secret, then edit
+   `/etc/vetting/vetting.yaml`:
+
+   ```
+   ./bin/gen-admin-password 'your-password-here'       # prints a bcrypt hash
+   openssl rand -hex 32                                 # prints a 64-char hex string
+   ```
+
+   Required fields:
+   - `auth.admin_password_bcrypt` — the bcrypt hash
+   - `auth.session_secret_hex` — the 64-character hex string (32 bytes)
+   - `server.public_url` — the URL your browser hits the LXC on
+     (e.g. `https://vetting.lan:8443`). This is used as the
+     click-through link in notifications, so it must be the *external*
+     URL, not the bind address.
+
+4. (Optional) Configure notifiers in the same file — see the
+   commented-out example block for ntfy / Discord / SMTP.
+
+5. Enable and start:
+
+   ```
+   sudo systemctl enable --now vetting
+   sudo journalctl -fu vetting
+   ```
+
+## First vetting run
+
+Run against a QEMU VM first, before you point it at real hardware:
+
+1. On the Proxmox host (or wherever your LXC lives):
+
+   ```
+   sudo ip link add br-vetting type bridge
+   sudo ip addr add 10.77.0.1/24 dev br-vetting
+   sudo ip link set br-vetting up
+   ```
+
+2. In the UI at `https://<lxc>:8443`, log in and register a host:
+   - Name: `qemu-test`
+   - MAC: `52:54:00:12:34:56`
+   - WoL broadcast IP: `10.77.0.255`
+   - Expected spec: paste a minimal YAML like
+     ```yaml
+     memory: { total_gib: 4 }
+     cpu: { logical_cores: 4 }
+     ```
+
+3. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingWoL`.
+
+4. Launch the QEMU VM on the bridge so it PXE-boots from dnsmasq
+   (the raw disk image `/tmp/test-disk.img` must already exist):
+
+   ```
+   sudo qemu-system-x86_64 \
+     -enable-kvm -cpu host -smp 4 -m 4096 \
+     -netdev bridge,id=n0,br=br-vetting \
+     -device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \
+     -drive file=/tmp/test-disk.img,format=raw,if=virtio \
+     -boot n -serial mon:stdio -display none
+   ```
+
+5. Watch the tile advance through stages. On success, the tile shows
+   **View report** and the VM powers itself off.
+
+For real repaired hardware: same flow, but register the node's actual
+MAC + expected spec, and make sure the node's BIOS is set to PXE-boot
+from the NIC that's on the `br-vetting` network.
+
+## A failed run — SSH to the held host
+
+When a stage fails, the pipeline halts at `FailedHolding` and the
+agent installs an orchestrator-issued SSH key into the live-image's
+`/root/.ssh/authorized_keys`. The UI tile surfaces the IP and the
+exact `ssh` command.
+
+The hold key is **per-run**. Once you're done:
+
+1. Power the host off (`poweroff` from the SSH session).
+2. In the UI, click **Override wipe-probe** only when the failure was
+   at the `Storage` stage *and* you're sure the disks are expendable.
+   Otherwise click **Start Vetting** on a fresh run from the host
+   dashboard after fixing the underlying issue.
+
+## Log + artifact layout
+
+```
+/var/lib/vetting/
+  vetting.db                 # SQLite: hosts, runs, stages, artifacts, spec_diffs, measurements
+  artifacts/
+    run-<N>/
+      report.html            # operator-facing summary
+      report.json            # machine-readable summary
+      inventory.json         # raw probe output
+      fio-<disk>.log         # storage stage output
+      iperf-<nic>.json       # network stage output
+      hold-<N>.pub           # per-run SSH pubkey (only if held)
+/var/log/vetting/
+  run-<N>.log                # append-only per-run log tail
+```
+
+Retention is governed by the `artifacts.retention_days` and
+`logs.retention_days` settings. DB rows (run history) are preserved
+indefinitely; only on-disk files get pruned.
+
+## Troubleshooting
+
+| Symptom | First check |
+|---|---|
+| Service refuses to start with `auth.admin_password_bcrypt is the placeholder` | You didn't replace the bcrypt hash in the config. Run `gen-admin-password`. |
+| PXE client gets no DHCP offer | `journalctl -u vetting` for dnsmasq errors; confirm the service runs with `CAP_NET_ADMIN` inside the LXC (the shipped systemd unit grants it); confirm the host MAC is actually registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). |
+| Agent `/hello` never fires | Check the live image is actually loading the agent binary — SSH into the live env (use the hold key path), `systemctl status vetting-agent`. |
+| Tile stuck on `Booting` | Most likely the live image booted but the agent can't reach the orchestrator. Verify `vetting.orchestrator=` in the kernel cmdline resolves from the host's network. |
+| UI shows stale stage | Force a reload; the SSE reconnect is automatic but the browser keeps the last state on ephemeral network blips. |
+| Notification didn't fire | `journalctl -u vetting \| grep notify:` — delivery is fire-and-forget and the failure reason is logged but not persisted. |
+
+## Upgrading
+
+1. `make orchestrator-linux` on your workstation.
+2. `scp bin/vetting-linux-amd64 lxc:/tmp/vetting.new`
+3. On the LXC:
+   ```
+   sudo systemctl stop vetting
+   sudo install -m 0755 /tmp/vetting.new /usr/local/bin/vetting
+   sudo systemctl start vetting
+   ```
+
+Database migrations run at startup and are append-only, so no manual
+schema work is needed unless a release's notes call it out.
@@ -0,0 +1,166 @@
+# Test suite
+
+What each stage measures, what "pass" means, and where the results
+land. Stages run strictly in order. Any stage returning `passed=false`
+halts the pipeline at `FailedHolding` — the operator decides whether
+to fix, override, or abandon.
+
+## Stage order
+
+```
+Inventory → SpecValidate → SMART → CPUStress → Storage
+         → Network → GPU → PSU → Reporting
+```
+
+Stages marked *orchestrator-owned* resolve inside `/result` and never
+show up as "the agent's turn".
+
+---
+
+## Inventory
+
+**Owner:** agent.
+**What it does:** `dmidecode`, `lscpu`, `lshw`, `lspci`, `smartctl -i`
+over each block device, `nvidia-smi -q` if present. The raw output is
+merged into a single JSON blob.
+**Pass:** the probes run to completion; missing optional tools (e.g.
+`nvidia-smi` on a GPU-less host) are tolerated.
+**Artifacts:** `inventory.json` under `artifacts/run-<N>/`.
+
+## SpecValidate *(orchestrator-owned)*
+
+**Owner:** orchestrator (resolves inline inside the `/result` for the
+preceding Inventory stage).
+**What it does:** diffs the submitted inventory against the host's
+`expected_spec_yaml`. The diff engine classifies each field as
+`critical`, `warning`, or `info`.
+**Pass:** zero `critical` diffs.
+**Fail mode:** fires a `SpecMismatch` notification; transitions run
+to `Failed → FailedHolding`.
+**Artifacts:** `spec_diffs` table rows (one per divergence).
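
The pass rule (zero `critical` diffs) can be sketched as follows. The field-to-severity mapping, the flat `map[string]int` spec shape, and the `nic.count` field are assumptions for illustration; the real diff engine in `internal/spec` and its classification rules are not shown in these docs.

```go
package main

import (
	"fmt"
	"sort"
)

type Severity string

const (
	Critical Severity = "critical"
	Warning  Severity = "warning"
	Info     Severity = "info"
)

// Diff is one expected-vs-actual divergence.
type Diff struct {
	Field            string
	Expected, Actual int
	Severity         Severity
}

// severityOf is a hypothetical classification table.
var severityOf = map[string]Severity{
	"memory.total_gib":  Critical,
	"cpu.logical_cores": Critical,
	"nic.count":         Warning,
}

// Compare returns one Diff per expected field that is missing from, or
// different in, actual. Unclassified fields default to Info.
func Compare(expected, actual map[string]int) []Diff {
	var diffs []Diff
	for field, want := range expected {
		got, ok := actual[field]
		if ok && got == want {
			continue
		}
		sev, known := severityOf[field]
		if !known {
			sev = Info
		}
		diffs = append(diffs, Diff{field, want, got, sev})
	}
	// Map iteration order is random; sort for stable output.
	sort.Slice(diffs, func(i, j int) bool { return diffs[i].Field < diffs[j].Field })
	return diffs
}

// Passes reports whether the diff set contains no critical entries.
func Passes(diffs []Diff) bool {
	for _, d := range diffs {
		if d.Severity == Critical {
			return false
		}
	}
	return true
}

func main() {
	expected := map[string]int{"memory.total_gib": 4, "cpu.logical_cores": 4, "nic.count": 2}
	actual := map[string]int{"memory.total_gib": 2, "cpu.logical_cores": 4, "nic.count": 1}
	diffs := Compare(expected, actual)
	for _, d := range diffs {
		fmt.Printf("%-8s %s: expected %d, got %d\n", d.Severity, d.Field, d.Expected, d.Actual)
	}
	fmt.Println("pass:", Passes(diffs))
}
```
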
+
+## SMART
+
+**Owner:** agent.
+**What it does:** `smartctl -a /dev/<disk>` for each disk in the
+inventory's `expected_disks`. Parses reallocated-sector counts, pending
+sectors, end-to-end error counters, overall-health attribute.
+**Pass:** SMART overall-health is PASSED on every expected disk and
+reallocated-sector count is below threshold.
+**Artifacts:** `smart-<disk>.txt` raw output.
+
+## CPUStress
+
+**Owner:** agent.
+**What it does:** runs `stress-ng --cpu N --vm M --vm-bytes 90% -t
+120s` with `N = logical_cores` and `M ≈ logical_cores/2`. The `--vm`
+flag is the **stand-in for Memtest86+**: it exercises the memory
+subsystem under load and will fail if the RAM has latent faults that
+surface under thermal + allocator pressure.
+**Pass:** `stress-ng` exits 0 and thermal samples taken by the sidecar
+stay below the configured per-host `max_temp_c`.
+**Caveat:** weaker than a dedicated memtest pass. Memtest86+ is not
+used because its results can't be signalled back to the orchestrator
+without IPMI serial.
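
The worker counts and the pass rule for this stage can be written down as a small sketch. The function names are hypothetical, and clamping `M` to at least 1 on single-core hosts is an assumption; the flags themselves are the ones quoted above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// StressArgs builds the stress-ng argument list for a host with the
// given number of logical cores: N = cores, M = cores/2 (at least 1).
func StressArgs(logicalCores int) []string {
	if logicalCores < 1 {
		logicalCores = 1
	}
	vm := logicalCores / 2
	if vm < 1 {
		vm = 1 // assumption: always run at least one memory worker
	}
	return []string{
		"--cpu", strconv.Itoa(logicalCores),
		"--vm", strconv.Itoa(vm),
		"--vm-bytes", "90%",
		"-t", "120s",
	}
}

// Passed applies the stage's pass rule: stress-ng exited 0 and every
// thermal sample stayed below maxTempC.
func Passed(exitCode int, samplesC []float64, maxTempC float64) bool {
	if exitCode != 0 {
		return false
	}
	for _, c := range samplesC {
		if c >= maxTempC {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("stress-ng " + strings.Join(StressArgs(8), " "))
	fmt.Println("pass:", Passed(0, []float64{61.5, 78.0}, 85))
}
```
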
+
+## Storage
+
+**Owner:** agent (destructive).
+**What it does:**
+
+1. **Wipe probe** — scans for filesystem signatures, LVM metadata,
+   partition tables on the expected disks. Any hit → halt with
+   `UnexpectedData`; operator must click **Override wipe-probe**.
+2. `badblocks -svw` (destructive read/write) on each expected disk.
+3. `fio --rw=randrw --bs=4k --iodepth=32 --runtime=60 --size=1G` on
+   each disk; captures IOPS and p99 latency.
+
+**Pass:** badblocks reports zero bad blocks; fio IOPS above a
+per-class floor (configurable).
+**Artifacts:** `fio-<disk>.json` per disk.
+**Safety gate:** the wipe-probe + device allowlist are the third and
+fourth of the four layered gates against wiping the wrong disk. See
+[architecture.md § Safety](architecture.md#safety-destructive-disk-tests).
+
+## Network
+
+**Owner:** agent.
+**What it does:** `iperf3 -c <orchestrator> -p <iperf_port> -t 10 -J`
+to measure throughput to the orchestrator. The orchestrator-side
+`iperf3 -s` is supervised by `internal/orchestrator/iperf.go` and
+binds to the configured `network.iperf_port`.
+**Pass:** throughput ≥ per-class floor (1 Gbps for 1GbE NICs, 9 Gbps
+for 10GbE).
+**Artifacts:** `iperf-<nic>.json`.
+
+## GPU
+
+**Owner:** agent.
+**What it does:** runs `nvidia-smi -q` and a short compute workload
+(`gpu-burn` if present, else `nvidia-smi dmon` during a `stress-ng
+--gpu` burst). Skipped cleanly when no GPU is present.
+**Pass:** no ECC errors reported; temperature below threshold; compute
+workload exits 0.
+
+## PSU
+
+**Owner:** agent.
+**What it does:** reads `/sys/class/hwmon/*/power*_average` and `in*_input`
+during a synthetic load burst (CPU + disk + NIC simultaneously) to
+look for voltage sag or wattage anomalies. Records the full envelope
+as `measurements` rows with `kind=psu`.
+**Pass:** no voltage dip below threshold across the load burst.
+**Caveat:** only reports on what the BMC exposes via hwmon — servers
+without exposed PSU telemetry pass trivially. Documented limitation.
+
+## Reporting *(orchestrator-owned)*
+
+**Owner:** orchestrator (resolves inline inside the `/result` for PSU).
+**What it does:**
+
+1. Gathers run, host, stages, spec_diffs, and measurement aggregates.
+2. Renders `report.html` via `internal/report` (html/template with
+   inlined CSS; self-contained offline-viewable).
+3. Writes `report.json` with the same data in machine-readable form.
+4. Records both as `report_html` / `report_json` artifact rows.
+5. Transitions run → `Completed`.
+6. Fires `RunCompleted` notification.
+7. The next agent heartbeat returns `cmd=shutdown`.
+
+## Thermal sidecar
+
+**Owner:** agent (always-on from `Booting` until the agent exits).
+**What it does:** every 5 seconds, walks `/sys/class/hwmon/*` and
+POSTs temperature samples as a batch to `/sensor`. Populates the
+`measurements` table with `kind=thermal`.
+**No pass/fail** on its own — stages that care about thermals read the
+sidecar's data via `measurements`. A dead sensor just drops out of
+the next batch.
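
One sampling pass over hwmon could be sketched as below. The `Sample` type, `ReadTemps`, and `demo` are hypothetical names; the sketch relies on the kernel convention that `temp*_input` files hold millidegrees Celsius, and it takes the hwmon root as a parameter so it can run against a fake tree. Batching and the POST to `/sensor` are left out.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// Sample is one temperature reading, e.g. Sensor "hwmon0/temp1".
type Sample struct {
	Sensor  string
	Celsius float64
}

// ReadTemps reads every temp*_input file under root/hwmon*.
// Unreadable or unparseable sensors are skipped, so a dead sensor
// simply drops out of the batch.
func ReadTemps(root string) []Sample {
	paths, _ := filepath.Glob(filepath.Join(root, "hwmon*", "temp*_input"))
	var out []Sample
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		milli, err := strconv.Atoi(strings.TrimSpace(string(raw)))
		if err != nil {
			continue
		}
		name := filepath.Base(filepath.Dir(p)) + "/" + strings.TrimSuffix(filepath.Base(p), "_input")
		out = append(out, Sample{Sensor: name, Celsius: float64(milli) / 1000})
	}
	return out
}

// demo builds a fake hwmon tree with one live and one dead sensor.
func demo() ([]Sample, error) {
	root, err := os.MkdirTemp("", "hwmon-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)

	chip := filepath.Join(root, "hwmon0")
	if err := os.MkdirAll(chip, 0o755); err != nil {
		return nil, err
	}
	files := map[string]string{"temp1_input": "45500\n", "temp2_input": "\n"}
	for name, body := range files {
		if err := os.WriteFile(filepath.Join(chip, name), []byte(body), 0o644); err != nil {
			return nil, err
		}
	}
	return ReadTemps(root), nil
}

func main() {
	samples, err := demo()
	fmt.Println(samples, err)
	// On a real host the root would be "/sys/class/hwmon".
}
```
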
+
+---
+
+## Where pass/fail lives
+
+- `runs.state` — authoritative terminal state (`Completed`,
+  `FailedHolding`, `Released`).
+- `runs.result` — `pass` or `fail` string once the run completes.
+- `runs.failed_stage` — name of the stage that halted the pipeline, if
+  any. Cleared when the operator overrides and re-enters.
+- `stages` — one row per attempted stage with `passed`, `started_at`,
+  `completed_at`, `summary_json`, `message`.
+- `measurements` — time-series samples from the thermal sidecar and
+  from stages that capture numeric outputs.
+- `artifacts` — on-disk files (report, fio logs, iperf logs, etc).
+- `spec_diffs` — one row per expected-vs-actual divergence.
+
+## Adding a new stage
+
+1. Add the name to `store.DefaultStageOrder`.
+2. Add a `model.State<Name>` const and wire it into
+   `internal/orchestrator/statemachine.go` (both the forward
+   transition table and the stage-for-state lookup).
+3. Add a case to `agent/runner.go`'s `runStage` dispatch.
+4. Drop the implementation into `agent/tests/`.
+5. If the stage is orchestrator-owned, add a `resolve<Name>` helper to
+   `internal/api/agent_handlers.go` and invoke it from the `/result`
+   handler after the preceding stage's `NextState` resolves.