Initial commit: full Phases 1-6 implementation

Post-repair hardware validation pipeline for Proxmox cluster hosts.
Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq
PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.
2026-04-17 21:32:10 -04:00
commit 9bb4b09a04
98 changed files with 11960 additions and 0 deletions
# Test suite
What each stage measures, what "pass" means, and where the results
land. Stages run strictly in order. Any stage returning `passed=false`
halts the pipeline at `FailedHolding` — the operator decides whether
to fix, override, or abandon.
## Stage order
```
Inventory → SpecValidate → SMART → CPUStress → Storage
→ Network → GPU → PSU → Reporting
```
Stages marked *orchestrator-owned* in the sections below resolve inside
`/result` and never show up as "the agent's turn".
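
The halt-on-first-failure rule can be sketched as a loop over the stage order. This is a minimal illustration, not the orchestrator's actual code: `StageResult`, `runPipeline`, and the callback shape are hypothetical names, and only the stage names and the `FailedHolding` / `Completed` states come from this document.

```go
package main

import "fmt"

// StageResult is a hypothetical shape for what a stage reports back.
type StageResult struct {
	Stage  string
	Passed bool
}

// stageOrder mirrors the order documented above.
var stageOrder = []string{
	"Inventory", "SpecValidate", "SMART", "CPUStress", "Storage",
	"Network", "GPU", "PSU", "Reporting",
}

// runPipeline runs stages strictly in order and stops at the first
// failure. It returns the resulting state and the failed stage, if any.
func runPipeline(run func(stage string) StageResult) (state, failedStage string) {
	for _, name := range stageOrder {
		if res := run(name); !res.Passed {
			return "FailedHolding", name
		}
	}
	return "Completed", ""
}

func main() {
	// Simulate a host that fails the Storage stage.
	state, failed := runPipeline(func(stage string) StageResult {
		return StageResult{Stage: stage, Passed: stage != "Storage"}
	})
	fmt.Println(state, failed) // prints "FailedHolding Storage"
}
```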

---
## Inventory
**Owner:** agent.
**What it does:** runs `dmidecode`, `lscpu`, `lshw`, `lspci`, `smartctl -i`
on each block device, and `nvidia-smi -q` if present. The raw output is
merged into a single JSON blob.
**Pass:** the probes run to completion; missing optional tools (e.g.
`nvidia-smi` on a GPU-less host) are tolerated.
**Artifacts:** `inventory.json` under `artifacts/run-<N>/`.
## SpecValidate *(orchestrator-owned)*
**Owner:** orchestrator (resolves inline inside the `/result` for the
preceding Inventory stage).
**What it does:** diffs the submitted inventory against the host's
`expected_spec_yaml`. The diff engine classifies each field as
`critical`, `warning`, or `info`.
**Pass:** zero `critical` diffs.
**Fail mode:** fires a `SpecMismatch` notification and transitions the run
to `Failed → FailedHolding`.
**Artifacts:** `spec_diffs` table rows (one per divergence).
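
A minimal sketch of the classify-then-gate logic, assuming the inventory and the expected spec have both been flattened to `field → value` maps. The field names, the `severityFor` table, and the function names are hypothetical; only the three severity levels and the "zero `critical` diffs" rule come from the text above.

```go
package main

import (
	"fmt"
	"sort"
)

type severity string

const (
	critical severity = "critical"
	warning  severity = "warning"
	info     severity = "info"
)

type specDiff struct {
	Field, Expected, Actual string
	Severity                severity
}

// severityFor is a hypothetical classification table; unlisted fields
// default to info.
var severityFor = map[string]severity{
	"cpu.model":       critical,
	"memory.total_gb": critical,
	"bios.version":    warning,
}

// diffSpec reports every expected field whose actual value is missing
// or different, sorted by field name for stable output.
func diffSpec(expected, actual map[string]string) []specDiff {
	var diffs []specDiff
	for field, want := range expected {
		got, present := actual[field]
		if present && got == want {
			continue
		}
		sev, known := severityFor[field]
		if !known {
			sev = info
		}
		diffs = append(diffs, specDiff{Field: field, Expected: want, Actual: got, Severity: sev})
	}
	sort.Slice(diffs, func(i, j int) bool { return diffs[i].Field < diffs[j].Field })
	return diffs
}

// specPasses applies the documented rule: zero critical diffs.
func specPasses(diffs []specDiff) bool {
	for _, d := range diffs {
		if d.Severity == critical {
			return false
		}
	}
	return true
}

func main() {
	expected := map[string]string{"cpu.model": "model-a", "memory.total_gb": "256", "bios.version": "2.1"}
	actual := map[string]string{"cpu.model": "model-a", "memory.total_gb": "128", "bios.version": "2.0"}
	diffs := diffSpec(expected, actual)
	for _, d := range diffs {
		fmt.Printf("%s: want %q, got %q (%s)\n", d.Field, d.Expected, d.Actual, d.Severity)
	}
	fmt.Println("pass:", specPasses(diffs))
}
```

In this example the BIOS mismatch alone would still pass; the memory mismatch is what fails the stage.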
## SMART
**Owner:** agent.
**What it does:** runs `smartctl -a /dev/<disk>` for each disk in the
inventory's `expected_disks`, then parses reallocated-sector counts, pending
sectors, end-to-end error counters, and the overall-health attribute.
**Pass:** SMART overall-health is PASSED on every expected disk and
reallocated-sector count is below threshold.
**Artifacts:** `smart-<disk>.txt` raw output.
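
The parse-and-gate step might look like the sketch below. It assumes ATA-style text output from `smartctl -a` (NVMe and SAS devices print a different layout), and `parseSMART`, `smartPasses`, and the `maxReallocated` threshold are hypothetical names. The sample text is a hand-written excerpt in that layout, not captured output.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// smartSummary holds the two values the pass rule needs.
type smartSummary struct {
	Healthy     bool // overall-health line ended in PASSED
	Reallocated int  // raw value of Reallocated_Sector_Ct, 0 if absent
}

// parseSMART scans ATA-style `smartctl -a` text output.
func parseSMART(out string) (smartSummary, error) {
	var s smartSummary
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.Contains(line, "overall-health") {
			s.Healthy = strings.HasSuffix(line, "PASSED")
			continue
		}
		// Attribute rows have ten columns:
		// ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
		f := strings.Fields(line)
		if len(f) >= 10 && f[1] == "Reallocated_Sector_Ct" {
			n, err := strconv.Atoi(f[9])
			if err != nil {
				return s, fmt.Errorf("reallocated raw value %q: %w", f[9], err)
			}
			s.Reallocated = n
		}
	}
	return s, sc.Err()
}

// smartPasses applies the documented rule with a hypothetical threshold.
func smartPasses(s smartSummary, maxReallocated int) bool {
	return s.Healthy && s.Reallocated < maxReallocated
}

// sample is a hand-written excerpt, not real device output.
const sample = `SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       3
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
`

func main() {
	s, err := parseSMART(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("healthy=%v reallocated=%d pass=%v\n", s.Healthy, s.Reallocated, smartPasses(s, 10))
}
```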
## CPUStress
**Owner:** agent.
**What it does:** runs `stress-ng --cpu N --vm M --vm-bytes 90% -t
120s` with `N = logical_cores` and `M ≈ logical_cores/2`. The `--vm`
flag is the **stand-in for Memtest86+**: it exercises the memory
subsystem under load and will fail if the RAM has latent faults that
surface under thermal + allocator pressure.
**Pass:** `stress-ng` exits 0 and thermal samples taken by the sidecar
stay below the configured per-host `max_temp_c`.
**Caveat:** weaker than a dedicated memtest pass; see
[architecture.md](architecture.md) for the reasoning (Memtest86+
can't be signalled back without IPMI serial).
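
As a rough sketch, the argument list and the pass rule could be assembled as follows. `stressArgs` and `cpuStressPasses` are hypothetical helpers; the flags, the `N` / `M` sizing, and the `max_temp_c` comparison are taken from the description above. The sketch only builds the command line and does not execute `stress-ng`.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
)

// stressArgs builds the stress-ng arguments with N = logical cores
// and M = N/2 (at least one memory worker).
func stressArgs(logicalCores int) []string {
	vmWorkers := logicalCores / 2
	if vmWorkers < 1 {
		vmWorkers = 1
	}
	return []string{
		"--cpu", strconv.Itoa(logicalCores),
		"--vm", strconv.Itoa(vmWorkers),
		"--vm-bytes", "90%",
		"-t", "120s",
	}
}

// cpuStressPasses requires exit code 0 and every thermal sample to
// stay strictly below the per-host maximum.
func cpuStressPasses(exitCode int, samplesC []float64, maxTempC float64) bool {
	if exitCode != 0 {
		return false
	}
	for _, t := range samplesC {
		if t >= maxTempC {
			return false
		}
	}
	return true
}

func main() {
	args := stressArgs(runtime.NumCPU())
	fmt.Println("stress-ng", strings.Join(args, " "))
	fmt.Println(cpuStressPasses(0, []float64{61.5, 78.0}, 85))
}
```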
## Storage
**Owner:** agent (destructive).
**What it does:**
1. **Wipe probe** — scans for filesystem signatures, LVM metadata,
partition tables on the expected disks. Any hit → halt with
`UnexpectedData`; operator must click **Override wipe-probe**.
2. `badblocks -svw` (destructive read/write) on each expected disk.
3. `fio --rw=randrw --bs=4k --iodepth=32 --runtime=60 --size=1G` on
each disk; captures IOPS and p99 latency.
**Pass:** badblocks reports zero bad blocks; fio IOPS above a
per-class floor (configurable).
**Artifacts:** `fio-<disk>.json` per disk.
**Safety gate:** the wipe-probe + device allowlist are the second and
third lines of defense against wiping the wrong disk. See
[architecture.md § Safety](architecture.md#safety-destructive-disk-tests).
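
To illustrate step 1, the sketch below checks the first few KiB of a device for a handful of well-known on-disk signatures. It is not the agent's actual probe, which may well shell out to a tool instead; `probeSignatures` is a hypothetical name, and the offsets assume 512-byte logical sectors.

```go
package main

import (
	"bytes"
	"fmt"
)

// probeSignatures inspects the first bytes of a disk and returns the
// names of any signatures it recognizes. An empty result means the
// head of the disk looks blank to this (deliberately small) probe.
func probeSignatures(head []byte) []string {
	var hits []string
	// MBR or protective MBR: boot signature 0x55 0xAA at bytes 510-511.
	if len(head) >= 512 && head[510] == 0x55 && head[511] == 0xAA {
		hits = append(hits, "mbr")
	}
	// GPT header: "EFI PART" at the start of LBA 1.
	if len(head) >= 520 && bytes.Equal(head[512:520], []byte("EFI PART")) {
		hits = append(hits, "gpt")
	}
	// LVM2 physical volume label: "LABELONE" at the start of one of
	// the first four sectors.
	for sector := 0; sector < 4; sector++ {
		off := sector * 512
		if len(head) >= off+8 && bytes.Equal(head[off:off+8], []byte("LABELONE")) {
			hits = append(hits, "lvm2-pv")
			break
		}
	}
	// ext2/3/4 superblock magic 0xEF53 (little-endian) at offset 1080.
	if len(head) >= 1082 && head[1080] == 0x53 && head[1081] == 0xEF {
		hits = append(hits, "ext")
	}
	return hits
}

func main() {
	blank := make([]byte, 4096)
	fmt.Println(probeSignatures(blank)) // prints "[]"

	used := make([]byte, 4096)
	used[510], used[511] = 0x55, 0xAA
	fmt.Println(probeSignatures(used)) // prints "[mbr]"
}
```

Any non-empty result would correspond to the `UnexpectedData` halt described above.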
## Network
**Owner:** agent.
**What it does:** `iperf3 -c <orchestrator> -p <iperf_port> -t 10 -J`
to measure throughput to the orchestrator. The orchestrator-side
`iperf3 -s` is supervised by `internal/orchestrator/iperf.go` and
binds to the configured `network.iperf_port`.
**Pass:** throughput ≥ per-class floor (1 Gbps for 1GbE NICs, 9 Gbps
for 10GbE).
**Artifacts:** `iperf-<nic>.json`.
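
A minimal sketch of the floor check against `iperf3 -J` output, reading `end.sum_received.bits_per_second` from the report. `networkPasses` is a hypothetical helper and the sample JSON is hand-trimmed to the fields used here. The floor is passed in rather than hard-coded: TCP goodput on a 1GbE link typically tops out near 0.94 Gbps, so a configured 1GbE floor would presumably sit somewhat below line rate.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// iperfReport declares only the fields this check needs.
type iperfReport struct {
	End struct {
		SumReceived struct {
			BitsPerSecond float64 `json:"bits_per_second"`
		} `json:"sum_received"`
	} `json:"end"`
}

// networkPasses reports whether the received throughput meets the
// floor, both expressed in bits per second.
func networkPasses(raw []byte, floorBps float64) (bool, error) {
	var r iperfReport
	if err := json.Unmarshal(raw, &r); err != nil {
		return false, fmt.Errorf("parse iperf3 JSON: %w", err)
	}
	return r.End.SumReceived.BitsPerSecond >= floorBps, nil
}

// sampleJSON is a hand-written, trimmed stand-in for real output.
const sampleJSON = `{"end":{"sum_sent":{"bits_per_second":9.42e9},"sum_received":{"bits_per_second":9.41e9}}}`

func main() {
	ok, err := networkPasses([]byte(sampleJSON), 9e9) // 10GbE floor from the text
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("pass:", ok) // prints "pass: true"
}
```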
## GPU
**Owner:** agent.
**What it does:** runs `nvidia-smi -q` and a short compute workload
(`gpu-burn` if present, else `nvidia-smi dmon` during a `stress-ng
--gpu` burst). Skipped cleanly when no GPU is present.
**Pass:** no ECC errors reported; temperature below threshold; compute
workload exits 0.
## PSU
**Owner:** agent.
**What it does:** reads `/sys/class/hwmon/*/power*_average` and `in*_input`
during a synthetic load burst (CPU + disk + NIC simultaneously) to
look for voltage sag or wattage anomalies. Records the full envelope
as `measurements` rows with `kind=psu`.
**Pass:** no voltage dip below threshold across the load burst.
**Caveat:** only reports on what the BMC exposes via hwmon — servers
without exposed PSU telemetry pass trivially. Documented limitation.
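
The sketch below shows one way to read the voltage inputs and apply the dip check. hwmon reports `in*_input` values in millivolts. `readVoltagesMV` and `psuPasses` are hypothetical names, and the single absolute floor is a simplification, since a real check would need a floor per rail. Note how an empty sample set passes, which is the trivial-pass behaviour described in the caveat.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readVoltagesMV reads every in*_input file one level below root
// (normally /sys/class/hwmon) and returns millivolt readings keyed by
// file path. Unreadable or malformed files are skipped.
func readVoltagesMV(root string) map[string]int {
	out := map[string]int{}
	paths, _ := filepath.Glob(filepath.Join(root, "*", "in*_input"))
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		mv, err := strconv.Atoi(strings.TrimSpace(string(raw)))
		if err != nil {
			continue
		}
		out[p] = mv
	}
	return out
}

// psuPasses fails if any sample dips below the floor. With no samples
// at all it returns true.
func psuPasses(samplesMV []int, floorMV int) bool {
	for _, mv := range samplesMV {
		if mv < floorMV {
			return false
		}
	}
	return true
}

func main() {
	// Build a fake hwmon tree so the sketch runs anywhere.
	root, err := os.MkdirTemp("", "hwmon")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)
	dir := filepath.Join(root, "hwmon0")
	if err := os.MkdirAll(dir, 0o755); err != nil {
		panic(err)
	}
	if err := os.WriteFile(filepath.Join(dir, "in0_input"), []byte("11890\n"), 0o644); err != nil {
		panic(err)
	}
	for path, mv := range readVoltagesMV(root) {
		fmt.Println(filepath.Base(path), mv, "mV")
	}
	// 11400 mV is 12 V minus 5%, used here purely as an example floor.
	fmt.Println(psuPasses([]int{12010, 11890, 11640}, 11400))
	fmt.Println(psuPasses(nil, 11400))
}
```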
## Reporting *(orchestrator-owned)*
**Owner:** orchestrator (resolves inline inside the `/result` for PSU).
**What it does:**
1. Gathers run, host, stages, spec_diffs, and measurement aggregates.
2. Renders `report.html` via `internal/report` (html/template with
inlined CSS; self-contained offline-viewable).
3. Writes `report.json` with the same data in machine-readable form.
4. Records both as `report_html` / `report_json` artifact rows.
5. Transitions run → `Completed`.
6. Fires `RunCompleted` notification.
7. The next agent heartbeat returns `cmd=shutdown`.
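
Steps 2 and 3 can be sketched with the standard library alone: one data struct rendered twice, once through `html/template` with the CSS inlined and once through `encoding/json`. The struct fields, template markup, and function names here are illustrative assumptions, not the contents of `internal/report`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"html/template"
)

type stageRow struct {
	Name   string `json:"name"`
	Passed bool   `json:"passed"`
}

type reportData struct {
	Host   string     `json:"host"`
	Result string     `json:"result"`
	Stages []stageRow `json:"stages"`
}

// reportTmpl keeps its CSS inline so the output is a single
// self-contained file that can be viewed offline.
var reportTmpl = template.Must(template.New("report").Parse(`<!doctype html>
<html><head><meta charset="utf-8"><title>Run report: {{.Host}}</title>
<style>body{font-family:sans-serif}.pass{color:#070}.fail{color:#b00}</style>
</head><body>
<h1>{{.Host}}: {{.Result}}</h1>
<ul>{{range .Stages}}
<li class="{{if .Passed}}pass{{else}}fail{{end}}">{{.Name}}</li>{{end}}
</ul></body></html>
`))

// renderHTML produces the human-readable report.
func renderHTML(d reportData) (string, error) {
	var buf bytes.Buffer
	if err := reportTmpl.Execute(&buf, d); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// renderJSON produces the machine-readable report from the same data.
func renderJSON(d reportData) ([]byte, error) {
	return json.Marshal(d)
}

func main() {
	d := reportData{
		Host:   "pve-01",
		Result: "fail",
		Stages: []stageRow{{Name: "Inventory", Passed: true}, {Name: "SMART", Passed: false}},
	}
	html, err := renderHTML(d)
	if err != nil {
		panic(err)
	}
	js, err := renderJSON(d)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(html) > 0, string(js))
}
```

Rendering both outputs from one struct keeps `report.html` and `report.json` from drifting apart.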
## Thermal sidecar
**Owner:** agent (always-on from `Booting` until the agent exits).
**What it does:** every 5 seconds, walks `/sys/class/hwmon/*` and
POSTs temperature samples as a batch to `/sensor`. Populates the
`measurements` table with `kind=thermal`.
**No pass/fail** on its own — stages that care about thermals read the
sidecar's data via `measurements`. A dead sensor just drops out of
the next batch.
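
One collection pass of the sidecar might look like the sketch below. hwmon exposes `temp*_input` in millidegrees Celsius. The `thermalSample` payload shape, the helper names, and the base-URL handling are assumptions for illustration; only the hwmon walk, the batch POST to `/sensor`, and the drop-on-failure behaviour come from the description above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// thermalSample is a hypothetical wire format for one reading.
type thermalSample struct {
	Sensor string  `json:"sensor"`
	TempC  float64 `json:"temp_c"`
}

// parseMilliC converts a raw hwmon value (millidegrees C) to degrees C.
func parseMilliC(raw string) (float64, bool) {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return 0, false
	}
	return float64(n) / 1000, true
}

// collect reads every temp*_input one level below root. A sensor that
// cannot be read or parsed is skipped, so it simply drops out of the
// batch.
func collect(root string) []thermalSample {
	var batch []thermalSample
	paths, _ := filepath.Glob(filepath.Join(root, "*", "temp*_input"))
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		c, ok := parseMilliC(string(raw))
		if !ok {
			continue
		}
		rel, _ := filepath.Rel(root, p)
		batch = append(batch, thermalSample{Sensor: rel, TempC: c})
	}
	return batch
}

// postBatch sends one batch to the orchestrator's /sensor endpoint.
func postBatch(baseURL string, batch []thermalSample) error {
	body, err := json.Marshal(batch)
	if err != nil {
		return err
	}
	resp, err := http.Post(baseURL+"/sensor", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("sensor POST: %s", resp.Status)
	}
	return nil
}

func main() {
	// A real sidecar would call collect and postBatch from a
	// time.Ticker loop every 5 seconds; this runs one pass only.
	batch := collect("/sys/class/hwmon")
	fmt.Printf("collected %d samples\n", len(batch))
}
```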

---
## Where pass/fail lives
- `runs.state` — authoritative terminal state (`Completed`,
`FailedHolding`, `Released`).
- `runs.result` holds the `pass` or `fail` string once the run completes.
- `runs.failed_stage` — name of the stage that halted the pipeline, if
any. Cleared when the operator overrides and re-enters.
- `stages` — one row per attempted stage with `passed`, `started_at`,
`completed_at`, `summary_json`, `message`.
- `measurements` — time-series samples from the thermal sidecar and
from stages that capture numeric outputs.
- `artifacts` — on-disk files (report, fio logs, iperf logs, etc).
- `spec_diffs` — one row per expected-vs-actual divergence.
## Adding a new stage
1. Add the name to `store.DefaultStageOrder`.
2. Add a `model.State<Name>` const and wire it into
`internal/orchestrator/statemachine.go` (both the forward
transition table and the stage-for-state lookup).
3. Add a case to `agent/runner.go`'s `runStage` dispatch.
4. Drop the implementation into `agent/tests/`.
5. If the stage is orchestrator-owned, add a `resolve<Name>` helper to
`internal/api/agent_handlers.go` and invoke it from the `/result`
handler after the preceding stage's `NextState` resolves.
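
Steps 1 and 2 touch three places that have to agree: the stage order, the forward transition table, and the stage-for-state lookup. The sketch below wires a hypothetical `Firmware` stage between PSU and Reporting and adds a small consistency check. All identifiers are stand-ins, since the real types live in `store`, `model`, and `internal/orchestrator/statemachine.go` and may be shaped differently.

```go
package main

import "fmt"

type State string

const (
	StatePSU       State = "PSU"
	StateFirmware  State = "Firmware" // hypothetical new stage
	StateReporting State = "Reporting"
)

// defaultStageOrder is a trimmed stand-in for store.DefaultStageOrder.
var defaultStageOrder = []string{"PSU", "Firmware", "Reporting"}

// forward is the forward transition table.
var forward = map[State]State{
	StatePSU:      StateFirmware,
	StateFirmware: StateReporting,
}

// stageForState is the stage-for-state lookup.
var stageForState = map[State]string{
	StatePSU:       "PSU",
	StateFirmware:  "Firmware",
	StateReporting: "Reporting",
}

// checkWiring verifies that every stage in the order has a state and
// that each state transitions to the next stage in the order.
func checkWiring(order []string) error {
	stateFor := map[string]State{}
	for st, name := range stageForState {
		stateFor[name] = st
	}
	for i, name := range order {
		st, ok := stateFor[name]
		if !ok {
			return fmt.Errorf("stage %q has no state in the lookup", name)
		}
		if i+1 < len(order) {
			if next, ok := forward[st]; !ok || stageForState[next] != order[i+1] {
				return fmt.Errorf("stage %q does not transition to %q", name, order[i+1])
			}
		}
	}
	return nil
}

func main() {
	fmt.Println(checkWiring(defaultStageOrder))              // prints "<nil>"
	fmt.Println(checkWiring([]string{"PSU", "Reporting"}) != nil) // prints "true"
}
```

A check along these lines, run as a unit test, would catch a stage added to the order but left out of one of the two tables.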