Post-repair hardware validation pipeline for Proxmox cluster hosts. Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.
Test suite
What each stage measures, what "pass" means, and where the results
land. Stages run strictly in order. Any stage returning passed=false
halts the pipeline at FailedHolding — the operator decides whether
to fix, override, or abandon.
Stage order
Inventory → SpecValidate → SMART → CPUStress → Storage
→ Network → GPU → PSU → Reporting
Stages marked orchestrator-owned are resolved by the orchestrator inside
the /result handler for the preceding stage; they never show up as
"the agent's turn".
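The halt-on-first-failure rule can be sketched as a simple loop. This is a minimal illustration, not the orchestrator's actual state machine: `StageResult`, `runPipeline`, and the callback shape are hypothetical names, and the real transitions live in the state machine code.

```go
package main

import "fmt"

// StageResult is a hypothetical shape for what a stage reports back.
type StageResult struct {
	Stage  string
	Passed bool
}

// stageOrder mirrors the order listed in this document.
var stageOrder = []string{
	"Inventory", "SpecValidate", "SMART", "CPUStress", "Storage",
	"Network", "GPU", "PSU", "Reporting",
}

// runPipeline runs stages strictly in order and stops at the first one
// that returns Passed=false. It returns the resulting run state and the
// name of the stage that halted the pipeline ("" if none did).
func runPipeline(run func(stage string) StageResult) (state, failedStage string) {
	for _, name := range stageOrder {
		if res := run(name); !res.Passed {
			return "FailedHolding", name
		}
	}
	return "Completed", ""
}

func main() {
	// Simulate a host that fails the Storage stage.
	state, failed := runPipeline(func(stage string) StageResult {
		return StageResult{Stage: stage, Passed: stage != "Storage"}
	})
	fmt.Println(state, failed)
}
```

Stages after the failing one are never invoked, which is why a later stage has no row in `stages` for a halted run.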
Inventory
Owner: agent.
What it does: runs dmidecode, lscpu, lshw, lspci, smartctl -i
on each block device, and nvidia-smi -q if present. The raw output is
merged into a single JSON blob.
Pass: the probes run to completion; missing optional tools (e.g.
nvidia-smi on a GPU-less host) are tolerated.
Artifacts: inventory.json under artifacts/run-<N>/.
SpecValidate (orchestrator-owned)
Owner: orchestrator (resolves inline inside the /result for the
preceding Inventory stage).
What it does: diffs the submitted inventory against the host's
expected_spec_yaml. The diff engine classifies each field as
critical, warning, or info.
Pass: zero critical diffs.
Fail mode: fires a SpecMismatch notification; transitions run
to Failed → FailedHolding.
Artifacts: spec_diffs table rows (one per divergence).
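As a rough sketch of the classification and pass rule described above: the field names and the `severityByField` table are invented for illustration, and the real diff engine may classify fields quite differently. Only the three severity classes and the "zero critical diffs" rule come from this document.

```go
package main

import (
	"fmt"
	"sort"
)

// Severity is the class assigned to a divergence.
type Severity string

const (
	Critical Severity = "critical"
	Warning  Severity = "warning"
	Info     Severity = "info"
)

// SpecDiff is a hypothetical in-memory form of one spec_diffs row.
type SpecDiff struct {
	Field, Expected, Actual string
	Severity                Severity
}

// severityByField is an invented classification table; fields not listed
// here default to Info.
var severityByField = map[string]Severity{
	"cpu_model":    Critical,
	"memory_gib":   Critical,
	"bios_version": Warning,
}

// diffSpec returns one SpecDiff per expected field whose actual value differs.
func diffSpec(expected, actual map[string]string) []SpecDiff {
	fields := make([]string, 0, len(expected))
	for f := range expected {
		fields = append(fields, f)
	}
	sort.Strings(fields) // deterministic row order

	var diffs []SpecDiff
	for _, f := range fields {
		if actual[f] == expected[f] {
			continue
		}
		sev, ok := severityByField[f]
		if !ok {
			sev = Info
		}
		diffs = append(diffs, SpecDiff{f, expected[f], actual[f], sev})
	}
	return diffs
}

// specPasses applies the stage's rule: zero critical diffs.
func specPasses(diffs []SpecDiff) bool {
	for _, d := range diffs {
		if d.Severity == Critical {
			return false
		}
	}
	return true
}

func main() {
	expected := map[string]string{"cpu_model": "EPYC 7302", "memory_gib": "256", "bios_version": "2.1"}
	actual := map[string]string{"cpu_model": "EPYC 7302", "memory_gib": "128", "bios_version": "2.3"}
	diffs := diffSpec(expected, actual)
	for _, d := range diffs {
		fmt.Printf("%-8s %s: expected %q, got %q\n", d.Severity, d.Field, d.Expected, d.Actual)
	}
	fmt.Println("pass:", specPasses(diffs))
}
```

Warning and info diffs are still recorded, but they do not block the run.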
SMART
Owner: agent.
What it does: smartctl -a /dev/<disk> for each disk in the
inventory's expected_disks. Parses reallocated-sector counts, pending
sectors, end-to-end error counters, overall-health attribute.
Pass: SMART overall-health is PASSED on every expected disk and
reallocated-sector count is below threshold.
Artifacts: smart-<disk>.txt raw output.
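A minimal sketch of the parse-and-decide step, assuming the ATA-style text output of smartctl -a (the overall-health line and the attribute table where the raw value is the tenth column). NVMe and SAS devices print different layouts, so a real parser needs more cases. `parseSmart`, `smartPasses`, and the threshold value are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// smartSummary holds the two values the pass rule looks at.
type smartSummary struct {
	HealthPassed bool
	Reallocated  int
}

// sample is a trimmed-down excerpt in the ATA output format.
const sample = `SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       3`

// parseSmart extracts overall health and the raw reallocated-sector count.
func parseSmart(output string) smartSummary {
	var s smartSummary
	for _, line := range strings.Split(output, "\n") {
		if strings.Contains(line, "overall-health self-assessment test result:") {
			s.HealthPassed = strings.HasSuffix(strings.TrimSpace(line), "PASSED")
			continue
		}
		// Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
		f := strings.Fields(line)
		if len(f) >= 10 && f[1] == "Reallocated_Sector_Ct" {
			if n, err := strconv.Atoi(f[9]); err == nil {
				s.Reallocated = n
			}
		}
	}
	return s
}

// smartPasses requires PASSED health and a count strictly below the threshold.
func smartPasses(s smartSummary, maxReallocated int) bool {
	return s.HealthPassed && s.Reallocated < maxReallocated
}

func main() {
	s := parseSmart(sample)
	fmt.Printf("%+v pass=%v\n", s, smartPasses(s, 10))
}
```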
CPUStress
Owner: agent.
What it does: runs stress-ng --cpu N --vm M --vm-bytes 90% -t 120s with N = logical_cores and M ≈ logical_cores/2. The --vm
flag is the stand-in for Memtest86+: it exercises the memory
subsystem under load and will fail if the RAM has latent faults that
surface under thermal + allocator pressure.
Pass: stress-ng exits 0 and thermal samples taken by the sidecar
stay below the configured per-host max_temp_c.
Caveat: weaker than a dedicated memtest pass; see
architecture.md for the reasoning (Memtest86+ has no way to report
its result back to the orchestrator without an IPMI serial console).
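The worker counts and the pass rule can be sketched as below. The helper names are hypothetical, and the clamp to at least one --vm worker on single-core hosts is an assumption the text above does not state.

```go
package main

import (
	"fmt"
	"strconv"
)

// stressArgs builds the stress-ng argument list for a host with the given
// number of logical cores: N = logicalCores, M = logicalCores/2.
func stressArgs(logicalCores int) []string {
	vm := logicalCores / 2
	if vm < 1 {
		vm = 1 // assumption: keep at least one memory worker
	}
	return []string{
		"--cpu", strconv.Itoa(logicalCores),
		"--vm", strconv.Itoa(vm),
		"--vm-bytes", "90%",
		"-t", "120s",
	}
}

// cpuStressPasses requires exit code 0 and every thermal sample strictly
// below the host's max_temp_c.
func cpuStressPasses(exitCode int, tempsC []float64, maxTempC float64) bool {
	if exitCode != 0 {
		return false
	}
	for _, t := range tempsC {
		if t >= maxTempC {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("stress-ng", stressArgs(32))
	fmt.Println(cpuStressPasses(0, []float64{61.5, 74.0, 79.9}, 85))
}
```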
Storage
Owner: agent (destructive).
What it does:
- Wipe probe: scans for filesystem signatures, LVM metadata, and
partition tables on the expected disks. Any hit → halt with
UnexpectedData; operator must click Override wipe-probe.
- badblocks -svw (destructive read/write) on each expected disk.
- fio --rw=randrw --bs=4k --iodepth=32 --runtime=60 --size=1G on each disk; captures IOPS and p99 latency.
Pass: badblocks reports zero bad blocks; fio IOPS above a
per-class floor (configurable).
Artifacts: fio-<disk>.json per disk.
Safety gate: the wipe-probe + device allowlist are the second and
third lines of defense against wiping the wrong disk. See
architecture.md § Safety.
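A sketch of how the fio numbers might be pulled out of a fio-<disk>.json artifact and checked against the floor. It assumes fio was run with JSON output and relies on the `jobs[].read/write.iops` and `clat_ns.percentile["99.000000"]` fields of fio 3.x reports. The function names, the sample values, and the choice to sum read and write IOPS are illustrative assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fioDir is the subset of one direction (read or write) that we need.
type fioDir struct {
	IOPS   float64 `json:"iops"`
	ClatNs struct {
		Percentile map[string]float64 `json:"percentile"`
	} `json:"clat_ns"`
}

// fioReport is the subset of fio's JSON report used here.
type fioReport struct {
	Jobs []struct {
		Read  fioDir `json:"read"`
		Write fioDir `json:"write"`
	} `json:"jobs"`
}

// sampleFio is an invented, heavily trimmed report.
const sampleFio = `{"jobs":[{"read":{"iops":12000.5,"clat_ns":{"percentile":{"99.000000":2500000}}},"write":{"iops":11999.5,"clat_ns":{"percentile":{"99.000000":4000000}}}}]}`

// summarizeFio returns combined read+write IOPS and the worse of the two
// p99 completion latencies, in milliseconds.
func summarizeFio(raw []byte) (iops, p99ms float64, err error) {
	var r fioReport
	if err = json.Unmarshal(raw, &r); err != nil {
		return 0, 0, err
	}
	for _, j := range r.Jobs {
		iops += j.Read.IOPS + j.Write.IOPS
		for _, d := range []fioDir{j.Read, j.Write} {
			if v := d.ClatNs.Percentile["99.000000"] / 1e6; v > p99ms {
				p99ms = v
			}
		}
	}
	return iops, p99ms, nil
}

// storagePasses: zero bad blocks and IOPS at or above the per-class floor.
func storagePasses(badBlocks int, iops, floorIOPS float64) bool {
	return badBlocks == 0 && iops >= floorIOPS
}

func main() {
	iops, p99, err := summarizeFio([]byte(sampleFio))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("iops=%.0f p99=%.1fms pass=%v\n", iops, p99, storagePasses(0, iops, 20000))
}
```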
Network
Owner: agent.
What it does: iperf3 -c <orchestrator> -p <iperf_port> -t 10 -J
to measure throughput to the orchestrator. The orchestrator-side
iperf3 -s is supervised by internal/orchestrator/iperf.go and
binds to the configured network.iperf_port.
Pass: throughput ≥ per-class floor (1 Gbps for 1GbE NICs, 9 Gbps
for 10GbE).
Artifacts: iperf-<nic>.json.
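A sketch of the throughput check against the iperf3 -J output, using the receiver-side `end.sum_received.bits_per_second` field. The `floorGbps` table mirrors the floors quoted above; the NIC class keys, `networkPasses`, and the decision to use the receiver-side figure are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// iperfReport is the subset of iperf3's JSON output used here.
type iperfReport struct {
	End struct {
		SumReceived struct {
			BitsPerSecond float64 `json:"bits_per_second"`
		} `json:"sum_received"`
	} `json:"end"`
}

// floorGbps holds the per-class throughput floors; the keys are hypothetical.
var floorGbps = map[string]float64{"1GbE": 1, "10GbE": 9}

// networkPasses parses an iperf3 -J report and compares the measured
// throughput with the floor for the given NIC class.
func networkPasses(raw []byte, nicClass string) (bool, float64, error) {
	floor, ok := floorGbps[nicClass]
	if !ok {
		return false, 0, fmt.Errorf("no throughput floor for NIC class %q", nicClass)
	}
	var r iperfReport
	if err := json.Unmarshal(raw, &r); err != nil {
		return false, 0, fmt.Errorf("parse iperf3 JSON: %w", err)
	}
	gbps := r.End.SumReceived.BitsPerSecond / 1e9
	return gbps >= floor, gbps, nil
}

func main() {
	raw := []byte(`{"end":{"sum_received":{"bits_per_second":9410000000}}}`)
	pass, gbps, err := networkPasses(raw, "10GbE")
	fmt.Println(pass, gbps, err)
}
```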
GPU
Owner: agent.
What it does: runs nvidia-smi -q and a short compute workload
(gpu-burn if present, else nvidia-smi dmon during a stress-ng --gpu burst). Skipped cleanly when no GPU is present.
Pass: no ECC errors reported; temperature below threshold; compute
workload exits 0.
PSU
Owner: agent.
What it does: reads /sys/class/hwmon/*/power*_average and in*_input
during a synthetic load burst (CPU + disk + NIC simultaneously) to
look for voltage sag or wattage anomalies. Records the full envelope
as measurements rows with kind=psu.
Pass: no voltage dip below threshold across the load burst.
Caveat: only reports on what the BMC exposes via hwmon — servers
without exposed PSU telemetry pass trivially. Documented limitation.
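The sag check, including the trivial-pass caveat, can be sketched as follows. hwmon reports in*_input voltages in millivolts; the rail names, the per-rail floor table, and `psuPasses` are hypothetical.

```go
package main

import "fmt"

// railSample is one in*_input reading taken during the load burst.
type railSample struct {
	Rail       string
	Millivolts int
}

// psuPasses reports whether every sampled rail stayed at or above its
// floor, plus the rails that dipped (each listed once, in first-seen
// order). Rails with no configured floor are ignored, and an empty sample
// set passes, which is the "pass trivially" limitation.
func psuPasses(samples []railSample, floorMV map[string]int) (bool, []string) {
	seen := map[string]bool{}
	var sagged []string
	for _, s := range samples {
		floor, ok := floorMV[s.Rail]
		if !ok || s.Millivolts >= floor || seen[s.Rail] {
			continue
		}
		seen[s.Rail] = true
		sagged = append(sagged, s.Rail)
	}
	return len(sagged) == 0, sagged
}

func main() {
	floors := map[string]int{"in0": 11400} // hypothetical floor for a 12 V rail
	burst := []railSample{{"in0", 12100}, {"in0", 11950}, {"in0", 11300}}

	ok, sagged := psuPasses(burst, floors)
	fmt.Println(ok, sagged)

	// No telemetry at all: nothing to fail on.
	ok, _ = psuPasses(nil, floors)
	fmt.Println(ok)
}
```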
Reporting (orchestrator-owned)
Owner: orchestrator (resolves inline inside the /result for PSU).
What it does:
- Gathers run, host, stages, spec_diffs, and measurement aggregates.
- Renders report.html via internal/report (html/template with inlined CSS; self-contained offline-viewable).
- Writes report.json with the same data in machine-readable form.
- Records both as report_html / report_json artifact rows.
- Transitions run → Completed.
- Fires RunCompleted notification.
- The next agent heartbeat returns cmd=shutdown.
Thermal sidecar
Owner: agent (always-on from Booting until the agent exits).
What it does: every 5 seconds, walks /sys/class/hwmon/* and
POSTs temperature samples as a batch to /sensor. Populates the
measurements table with kind=thermal.
No pass/fail on its own — stages that care about thermals read the
sidecar's data via measurements. A dead sensor just drops out of
the next batch.
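A sketch of how one batch might be assembled. hwmon temp*_input files hold millidegrees Celsius as text; here the file contents are passed in as a map so the example stays self-contained. The `thermalSample` payload shape and `buildBatch` are hypothetical, and the actual /sensor request format is not specified in this document.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// thermalSample is a hypothetical element of the batch POSTed to /sensor.
type thermalSample struct {
	Sensor string  `json:"sensor"`
	TempC  float64 `json:"temp_c"`
}

// buildBatch converts raw temp*_input contents (millidegrees Celsius) into
// samples. A sensor whose value cannot be parsed is skipped, so a dead
// sensor simply drops out of the batch.
func buildBatch(raw map[string]string) []thermalSample {
	names := make([]string, 0, len(raw))
	for n := range raw {
		names = append(names, n)
	}
	sort.Strings(names) // stable ordering within a batch

	batch := make([]thermalSample, 0, len(names))
	for _, n := range names {
		milli, err := strconv.Atoi(strings.TrimSpace(raw[n]))
		if err != nil {
			continue
		}
		batch = append(batch, thermalSample{Sensor: n, TempC: float64(milli) / 1000})
	}
	return batch
}

func main() {
	batch := buildBatch(map[string]string{
		"hwmon0/temp1_input": "45000\n",
		"hwmon1/temp1_input": "", // dead sensor: empty read
	})
	body, _ := json.Marshal(batch)
	fmt.Println(string(body))
}
```

In a real sidecar this would run on a 5-second ticker, with the map filled from reads under /sys/class/hwmon.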
Where pass/fail lives
- runs.state: authoritative terminal state (Completed, FailedHolding, Released).
- runs.result: pass or fail string once the run completes.
- runs.failed_stage: name of the stage that halted the pipeline, if any. Cleared when the operator overrides and re-enters.
- stages: one row per attempted stage with passed, started_at, completed_at, summary_json, message.
- measurements: time-series samples from the thermal sidecar and from stages that capture numeric outputs.
- artifacts: on-disk files (report, fio logs, iperf logs, etc).
- spec_diffs: one row per expected-vs-actual divergence.
Adding a new stage
- Add the name to store.DefaultStageOrder.
- Add a model.State<Name> const and wire it into internal/orchestrator/statemachine.go (both the forward transition table and the stage-for-state lookup).
- Add a case to agent/runner.go's runStage dispatch.
- Drop the implementation into agent/tests/.
- If the stage is orchestrator-owned, add a resolve<Name> helper to internal/api/agent_handlers.go and invoke it from the /result handler after the preceding stage's NextState resolves.