Initial commit: full Phases 1-6 implementation

Post-repair hardware validation pipeline for Proxmox cluster hosts. Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.
2026-04-17 21:32:10 -04:00
commit 9bb4b09a04
98 changed files with 11960 additions and 0 deletions
@@ -0,0 +1,166 @@
+# Test suite
+
+What each stage measures, what "pass" means, and where the results
+land. Stages run strictly in order. Any stage returning `passed=false`
+halts the pipeline at `FailedHolding` — the operator decides whether
+to fix, override, or abandon.
+
+## Stage order
+
+```
+Inventory → SpecValidate → SMART → CPUStress → Storage
+         → Network → GPU → PSU → Reporting
+```
+
+Stages marked *orchestrator-owned* in the headings below are resolved by
+the orchestrator inside its `/result` handler, so they never show up as
+"the agent's turn".
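The "strictly in order, halt on first failure" rule can be sketched as below. This is a minimal illustration, not the orchestrator's actual state machine: `StageFunc` and `runPipeline` are hypothetical names, and the intermediate `Failed` state is collapsed into `FailedHolding`.

```go
package main

import "fmt"

// stageOrder mirrors the order listed in this document.
var stageOrder = []string{
	"Inventory", "SpecValidate", "SMART", "CPUStress", "Storage",
	"Network", "GPU", "PSU", "Reporting",
}

// StageFunc is a hypothetical signature: it reports whether a stage passed.
type StageFunc func(stage string) (passed bool)

// runPipeline runs the stages in order and stops at the first failure.
// It returns the resulting run state and the failed stage name, if any.
func runPipeline(run StageFunc) (state, failedStage string) {
	for _, s := range stageOrder {
		if !run(s) {
			return "FailedHolding", s
		}
	}
	return "Completed", ""
}

func main() {
	// Simulate a run in which only the Storage stage fails.
	state, failed := runPipeline(func(stage string) bool { return stage != "Storage" })
	fmt.Println(state, failed)
}
```

Stages after the failing one are never invoked, which is what leaves the decision to fix, override, or abandon with the operator.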
+
+---
+
+## Inventory
+
+**Owner:** agent.
+**What it does:** `dmidecode`, `lscpu`, `lshw`, `lspci`, `smartctl -i`
+over each block device, `nvidia-smi -q` if present. The raw output is
+merged into a single JSON blob.
+**Pass:** the probes run to completion; missing optional tools (e.g.
+`nvidia-smi` on a GPU-less host) are tolerated.
+**Artifacts:** `inventory.json` under `artifacts/run-<N>/`.
+
+## SpecValidate *(orchestrator-owned)*
+
+**Owner:** orchestrator (resolves inline inside the `/result` for the
+preceding Inventory stage).
+**What it does:** diffs the submitted inventory against the host's
+`expected_spec_yaml`. The diff engine classifies each field as
+`critical`, `warning`, or `info`.
+**Pass:** zero `critical` diffs.
+**Fail mode:** fires a `SpecMismatch` notification; transitions run
+to `Failed → FailedHolding`.
+**Artifacts:** `spec_diffs` table rows (one per divergence).
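A minimal sketch of the classify-then-gate logic, assuming a flat field map. The `SpecDiff` shape, the `severityFor` table, and the field names are all hypothetical; the real classification rules are not described here.

```go
package main

import (
	"fmt"
	"sort"
)

// Severity values named in the text.
const (
	Critical = "critical"
	Warning  = "warning"
	Info     = "info"
)

// SpecDiff is an assumed shape for one spec_diffs row.
type SpecDiff struct {
	Field, Expected, Actual, Severity string
}

// severityFor is an illustrative policy table (assumption).
// Fields not listed here default to Info.
var severityFor = map[string]string{
	"cpu_model":    Critical,
	"memory_gib":   Critical,
	"bios_version": Warning,
}

// diffSpec returns one SpecDiff per expected field whose actual value differs.
func diffSpec(expected, actual map[string]string) []SpecDiff {
	fields := make([]string, 0, len(expected))
	for f := range expected {
		fields = append(fields, f)
	}
	sort.Strings(fields) // deterministic row order

	var diffs []SpecDiff
	for _, f := range fields {
		if actual[f] == expected[f] {
			continue
		}
		sev, ok := severityFor[f]
		if !ok {
			sev = Info
		}
		diffs = append(diffs, SpecDiff{Field: f, Expected: expected[f], Actual: actual[f], Severity: sev})
	}
	return diffs
}

// specPasses implements the pass rule: zero critical diffs.
func specPasses(diffs []SpecDiff) bool {
	for _, d := range diffs {
		if d.Severity == Critical {
			return false
		}
	}
	return true
}

func main() {
	expected := map[string]string{"cpu_model": "EPYC 7302", "bios_version": "2.1"}
	actual := map[string]string{"cpu_model": "EPYC 7302", "bios_version": "2.3"}
	diffs := diffSpec(expected, actual)
	fmt.Println(diffs, specPasses(diffs))
}
```

A `warning` or `info` diff is still recorded but does not block the run; only `critical` rows do.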
+
+## SMART
+
+**Owner:** agent.
+**What it does:** `smartctl -a /dev/<disk>` for each disk in the
+inventory's `expected_disks`. Parses reallocated-sector counts, pending
+sectors, end-to-end error counters, and the overall-health attribute.
+**Pass:** SMART overall-health is PASSED on every expected disk and
+reallocated-sector count is below threshold.
+**Artifacts:** `smart-<disk>.txt` raw output.
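As a rough illustration of the pass rule, the sketch below pulls the overall-health verdict and the raw `Reallocated_Sector_Ct` value out of ATA-style `smartctl -a` text. `parseSMART` and `smartPassed` are hypothetical helpers, the threshold value is an assumption, and NVMe or SAS devices print a different layout that this sketch does not handle.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSMART scans ATA-style smartctl output for the overall-health line
// and the raw value (10th column) of the Reallocated_Sector_Ct attribute.
func parseSMART(out string) (healthy bool, reallocated int) {
	for _, line := range strings.Split(out, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.Contains(trimmed, "overall-health") && strings.HasSuffix(trimmed, "PASSED") {
			healthy = true
		}
		f := strings.Fields(trimmed)
		if len(f) >= 10 && f[1] == "Reallocated_Sector_Ct" {
			if n, err := strconv.Atoi(f[9]); err == nil {
				reallocated = n
			}
		}
	}
	return healthy, reallocated
}

// smartPassed applies the rule: health PASSED and count below threshold.
func smartPassed(healthy bool, reallocated, threshold int) bool {
	return healthy && reallocated < threshold
}

func main() {
	out := "SMART overall-health self-assessment test result: PASSED\n" +
		"  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0\n"
	healthy, n := parseSMART(out)
	fmt.Println(healthy, n, smartPassed(healthy, n, 10))
}
```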
+
+## CPUStress
+
+**Owner:** agent.
+**What it does:** runs `stress-ng --cpu N --vm M --vm-bytes 90% -t
+120s` with `N = logical_cores` and `M ≈ logical_cores/2`. The `--vm`
+flag is the **stand-in for Memtest86+**: it exercises the memory
+subsystem under load and will fail if the RAM has latent faults that
+surface under thermal + allocator pressure.
+**Pass:** `stress-ng` exits 0 and thermal samples taken by the sidecar
+stay below the configured per-host `max_temp_c`.
+**Caveat:** weaker than a dedicated memtest pass; see
+[architecture.md](architecture.md) for the reasoning (Memtest86+
+can't report its result back to the orchestrator without IPMI serial).
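The argument construction and pass rule above could look roughly like this. `stressArgs` and `cpuStressPassed` are hypothetical names, and clamping `M` to at least 1 on single-core hosts is an assumption, not something the text specifies.

```go
package main

import (
	"fmt"
	"strconv"
)

// stressArgs builds the stress-ng argument list described above:
// N = logical cores, M = logical cores / 2.
func stressArgs(logicalCores int) []string {
	vm := logicalCores / 2
	if vm < 1 {
		vm = 1 // assumption: always run at least one vm worker
	}
	return []string{
		"--cpu", strconv.Itoa(logicalCores),
		"--vm", strconv.Itoa(vm),
		"--vm-bytes", "90%",
		"-t", "120s",
	}
}

// cpuStressPassed applies the pass rule: exit code 0 and every thermal
// sample strictly below the per-host max_temp_c.
func cpuStressPassed(exitCode int, samplesC []float64, maxTempC float64) bool {
	if exitCode != 0 {
		return false
	}
	for _, t := range samplesC {
		if t >= maxTempC {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(stressArgs(16))
	fmt.Println(cpuStressPassed(0, []float64{61.5, 72.0, 78.25}, 85))
}
```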
+
+## Storage
+
+**Owner:** agent (destructive).
+**What it does:**
+
+1. **Wipe probe** — scans for filesystem signatures, LVM metadata,
+   partition tables on the expected disks. Any hit → halt with
+   `UnexpectedData`; operator must click **Override wipe-probe**.
+2. `badblocks -svw` (destructive read/write) on each expected disk.
+3. `fio --rw=randrw --bs=4k --iodepth=32 --runtime=60 --size=1G` on
+   each disk; captures IOPS and p99 latency.
+
+**Pass:** badblocks reports zero bad blocks; fio IOPS above a
+per-class floor (configurable).
+**Artifacts:** `fio-<disk>.json` per disk.
+**Safety gate:** the wipe-probe + device allowlist are the second and
+third lines of defense against wiping the wrong disk. See
+[architecture.md § Safety](architecture.md#safety-destructive-disk-tests).
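The two gates that precede any destructive write can be sketched as a single check. `storageGate` and its error values are hypothetical, and the order of the allowlist and wipe-probe checks is an assumption; see architecture.md for the actual design.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values for the two gates.
var (
	ErrNotAllowlisted = errors.New("device not on allowlist")
	ErrUnexpectedData = errors.New("UnexpectedData: signatures found on disk")
)

// storageGate decides whether destructive tests may run on dev.
// signatures holds whatever the wipe probe found (filesystem signatures,
// LVM metadata, partition tables); override reflects the operator
// clicking "Override wipe-probe".
func storageGate(dev string, allowlist map[string]bool, signatures []string, override bool) error {
	if !allowlist[dev] {
		return ErrNotAllowlisted // never overridable in this sketch
	}
	if len(signatures) > 0 && !override {
		return ErrUnexpectedData
	}
	return nil
}

func main() {
	allow := map[string]bool{"/dev/sdb": true}
	fmt.Println(storageGate("/dev/sdb", allow, []string{"ext4"}, false))
	fmt.Println(storageGate("/dev/sdb", allow, []string{"ext4"}, true))
	fmt.Println(storageGate("/dev/sda", allow, nil, true))
}
```

Only a `nil` result would let `badblocks -svw` and `fio` proceed on that device.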
+
+## Network
+
+**Owner:** agent.
+**What it does:** `iperf3 -c <orchestrator> -p <iperf_port> -t 10 -J`
+to measure throughput to the orchestrator. The orchestrator-side
+`iperf3 -s` is supervised by `internal/orchestrator/iperf.go` and
+binds to the configured `network.iperf_port`.
+**Pass:** throughput ≥ per-class floor (1 Gbps for 1GbE NICs, 9 Gbps
+for 10GbE).
+**Artifacts:** `iperf-<nic>.json`.
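A sketch of how the `-J` output might be evaluated against the floor. It reads `end.sum_received.bits_per_second` from iperf3's JSON report for a TCP test; `throughputGbps` and `networkPassed` are hypothetical names, and whether the real agent uses the received or the sent figure is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// iperfResult picks out only the field this sketch needs from iperf3 -J.
type iperfResult struct {
	End struct {
		SumReceived struct {
			BitsPerSecond float64 `json:"bits_per_second"`
		} `json:"sum_received"`
	} `json:"end"`
}

// throughputGbps extracts receiver-side throughput in Gbps.
func throughputGbps(raw []byte) (float64, error) {
	var r iperfResult
	if err := json.Unmarshal(raw, &r); err != nil {
		return 0, err
	}
	return r.End.SumReceived.BitsPerSecond / 1e9, nil
}

// networkPassed applies the pass rule: throughput >= per-class floor.
func networkPassed(gbps, floorGbps float64) bool {
	return gbps >= floorGbps
}

func main() {
	raw := []byte(`{"end":{"sum_received":{"bits_per_second":9.41e9}}}`)
	gbps, err := throughputGbps(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%.2f Gbps, pass=%v\n", gbps, networkPassed(gbps, 9))
}
```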
+
+## GPU
+
+**Owner:** agent.
+**What it does:** runs `nvidia-smi -q` and a short compute workload
+(`gpu-burn` if present, else `nvidia-smi dmon` during a `stress-ng
+--gpu` burst). Skipped cleanly when no GPU is present.
+**Pass:** no ECC errors reported; temperature below threshold; compute
+workload exits 0.
+
+## PSU
+
+**Owner:** agent.
+**What it does:** reads `/sys/class/hwmon/*/power*_average` and `in*_input`
+during a synthetic load burst (CPU + disk + NIC simultaneously) to
+look for voltage sag or wattage anomalies. Records the full envelope
+as `measurements` rows with `kind=psu`.
+**Pass:** no voltage dip below threshold across the load burst.
+**Caveat:** only reports on what the BMC exposes via hwmon — servers
+without exposed PSU telemetry pass trivially. Documented limitation.
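The sag check, including the trivial-pass caveat, can be illustrated as follows. hwmon reports `in*_input` in millivolts; `psuPassed` and the per-rail floor map are hypothetical, and how thresholds are actually configured is not specified here.

```go
package main

import "fmt"

// psuPassed reports whether every sampled rail stayed at or above its
// floor during the load burst. rails maps an hwmon input name to its
// samples in millivolts; rails without a configured floor are skipped.
func psuPassed(rails map[string][]int, floorMilliVolts map[string]int) bool {
	for name, samples := range rails {
		floor, ok := floorMilliVolts[name]
		if !ok {
			continue // no threshold configured for this rail
		}
		for _, mv := range samples {
			if mv < floor {
				return false
			}
		}
	}
	return true
}

func main() {
	floors := map[string]int{"in1_input": 11600}
	sagging := map[string][]int{"in1_input": {12050, 11980, 11400}}
	fmt.Println(psuPassed(sagging, floors))
	// No telemetry at all: nothing can dip, so the stage passes trivially.
	fmt.Println(psuPassed(map[string][]int{}, floors))
}
```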
+
+## Reporting *(orchestrator-owned)*
+
+**Owner:** orchestrator (resolves inline inside the `/result` for PSU).
+**What it does:**
+
+1. Gathers run, host, stages, spec_diffs, and measurement aggregates.
+2. Renders `report.html` via `internal/report` (html/template with
+   inlined CSS; self-contained offline-viewable).
+3. Writes `report.json` with the same data in machine-readable form.
+4. Records both as `report_html` / `report_json` artifact rows.
+5. Transitions run → `Completed`.
+6. Fires `RunCompleted` notification.
+7. The next agent heartbeat returns `cmd=shutdown`.
+
+## Thermal sidecar
+
+**Owner:** agent (always-on from `Booting` until the agent exits).
+**What it does:** every 5 seconds, walks `/sys/class/hwmon/*` and
+POSTs temperature samples as a batch to `/sensor`. Populates the
+`measurements` table with `kind=thermal`.
+**No pass/fail** on its own — stages that care about thermals read the
+sidecar's data via `measurements`. A dead sensor just drops out of
+the next batch.
+
+---
+
+## Where pass/fail lives
+
+- `runs.state` — authoritative terminal state (`Completed`,
+  `FailedHolding`, `Released`).
+- `runs.result` — `pass` or `fail` string once the run completes.
+- `runs.failed_stage` — name of the stage that halted the pipeline, if
+  any. Cleared when the operator overrides and re-enters.
+- `stages` — one row per attempted stage with `passed`, `started_at`,
+  `completed_at`, `summary_json`, `message`.
+- `measurements` — time-series samples from the thermal sidecar and
+  from stages that capture numeric outputs.
+- `artifacts` — on-disk files (report, fio logs, iperf logs, etc.).
+- `spec_diffs` — one row per expected-vs-actual divergence.
+
+## Adding a new stage
+
+1. Add the name to `store.DefaultStageOrder`.
+2. Add a `model.State<Name>` const and wire it into
+   `internal/orchestrator/statemachine.go` (both the forward
+   transition table and the stage-for-state lookup).
+3. Add a case to `agent/runner.go`'s `runStage` dispatch.
+4. Drop the implementation into `agent/tests/`.
+5. If the stage is orchestrator-owned, add a `resolve<Name>` helper to
+   `internal/api/agent_handlers.go` and invoke it from the `/result`
+   handler after the preceding stage's `NextState` resolves.