Test suite
What each stage measures, what "pass" means, and where the results
land. Stages run strictly in order. Any stage returning passed=false
halts the pipeline at FailedHolding — the operator decides whether
to fix, override, or abandon.
Stage order
Inventory → Firmware → SpecValidate → SMART → CPUStress → Storage
→ Network → Burn → GPU → PSU → Reporting
Stages marked orchestrator-owned resolve inside /result and never
show up as "the agent's turn".
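The in-order, halt-on-first-failure rule above can be sketched as follows. This is a minimal illustration, not the orchestrator's state machine: the Stage type, the runPipeline function, and the use of plain strings for states are hypothetical names introduced here.

```go
package main

import "fmt"

// Stage is a hypothetical stand-in for one pipeline stage.
type Stage struct {
	Name string
	Run  func() bool // reports whether the stage passed
}

// runPipeline runs stages strictly in order and stops at the first one
// that returns passed=false. It returns the resulting state and the name
// of the stage that halted the pipeline ("" if every stage passed).
func runPipeline(stages []Stage) (string, string) {
	for _, s := range stages {
		if !s.Run() {
			return "FailedHolding", s.Name
		}
	}
	return "Completed", ""
}

func main() {
	pass := func() bool { return true }
	fail := func() bool { return false }
	stages := []Stage{
		{"Inventory", pass},
		{"Firmware", pass},
		{"SMART", fail},
		{"CPUStress", pass}, // never reached: SMART halts the pipeline
	}
	state, failed := runPipeline(stages)
	fmt.Println(state, failed) // prints: FailedHolding SMART
}
```

Later stages are never started once one fails, which is what leaves the run parked for the operator to fix, override, or abandon.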
Inventory
Owner: agent.
What it does: dmidecode, lscpu, lshw, lspci, smartctl -i
over each block device, nvidia-smi -q if present. The raw output is
merged into a single JSON blob.
Pass: the probes run to completion; missing optional tools (e.g.
nvidia-smi on a GPU-less host) are tolerated.
Artifacts: inventory.json under artifacts/run-<N>/.
Firmware
Owner: agent.
What it does: probes firmware versions across all discoverable
components: BIOS (dmidecode -t bios), BMC (ipmitool mc info), NIC
firmware (ethtool -i per interface), NVMe firmware (nvme id-ctrl),
HBA firmware (lspci -vv), and CPU microcode (/proc/cpuinfo).
Missing tools are tolerated — a GPU-less server won't have
nvidia-smi, a consumer board won't have ipmitool.
Pass: always passes. Firmware is advisory-only; SpecValidate is the
gate that fails on version mismatches.
Artifacts: firmware_snapshots table rows (one per component,
keyed by (run_id, component, identifier)).
SpecValidate (orchestrator-owned)
Owner: orchestrator (resolves inline inside the /result for the
preceding stage, which is Firmware in the stage order above).
What it does: diffs the submitted inventory against the host's
expected_spec_yaml. The diff engine classifies each field as
critical, warning, or info.
Pass: zero critical diffs.
Fail mode: fires a SpecMismatch notification; transitions run
to Failed → FailedHolding.
Artifacts: spec_diffs table rows (one per divergence).
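The pass rule reduces to "no diff is classified critical". A minimal sketch of that check, assuming a hypothetical SpecDiff type whose fields loosely mirror a spec_diffs row (the real diff engine and its types are not shown in this document):

```go
package main

import "fmt"

// Severity mirrors the three classes named above.
type Severity string

const (
	Critical Severity = "critical"
	Warning  Severity = "warning"
	Info     Severity = "info"
)

// SpecDiff is a hypothetical in-memory form of one expected-vs-actual divergence.
type SpecDiff struct {
	Field    string
	Expected string
	Actual   string
	Severity Severity
}

// specPasses reports whether the diff set contains zero critical entries.
// Warning and info diffs are recorded but do not fail the stage.
func specPasses(diffs []SpecDiff) bool {
	for _, d := range diffs {
		if d.Severity == Critical {
			return false
		}
	}
	return true
}

func main() {
	diffs := []SpecDiff{
		{Field: "bios.version", Expected: "2.1", Actual: "2.0", Severity: Warning},
		{Field: "memory.total_gb", Expected: "256", Actual: "128", Severity: Critical},
	}
	fmt.Println(specPasses(diffs)) // prints: false
}
```

The field names and example values are illustrative only.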
SMART
Owner: agent.
What it does: smartctl -a /dev/<disk> for each disk in the
inventory's expected_disks. Parses reallocated-sector counts, pending
sectors, end-to-end error counters, overall-health attribute.
Pass: SMART overall-health is PASSED on every expected disk and
reallocated-sector count is below threshold.
Artifacts: smart-<disk>.txt raw output.
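As a rough sketch of the pass criterion, assuming the smartctl output has already been parsed into a hypothetical DiskHealth struct (the parsing itself and the real threshold source are not covered here):

```go
package main

import "fmt"

// DiskHealth is a hypothetical, already-parsed view of one smartctl -a run.
type DiskHealth struct {
	Device             string
	OverallHealth      string // "PASSED" or "FAILED", as smartctl prints it
	ReallocatedSectors int
}

// smartPasses returns false and the offending device as soon as one disk
// reports a non-PASSED health status or a reallocated-sector count that is
// not strictly below the threshold.
func smartPasses(disks []DiskHealth, reallocThreshold int) (bool, string) {
	for _, d := range disks {
		if d.OverallHealth != "PASSED" || d.ReallocatedSectors >= reallocThreshold {
			return false, d.Device
		}
	}
	return true, ""
}

func main() {
	disks := []DiskHealth{
		{Device: "/dev/sda", OverallHealth: "PASSED", ReallocatedSectors: 0},
		{Device: "/dev/sdb", OverallHealth: "PASSED", ReallocatedSectors: 12},
	}
	// A threshold of 10 is an arbitrary example value, not a project default.
	ok, bad := smartPasses(disks, 10)
	fmt.Println(ok, bad) // prints: false /dev/sdb
}
```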
CPUStress
Owner: agent.
What it does: runs stress-ng --cpu N --vm M --vm-bytes 90% -t 120s with N = logical_cores and M ≈ logical_cores/2. The --vm
flag is the stand-in for Memtest86+: it exercises the memory
subsystem under load and will fail if the RAM has latent faults that
surface under thermal + allocator pressure.
Pass: stress-ng exits 0 and thermal samples taken by the sidecar
stay below the configured per-host max_temp_c.
Caveat: weaker than a dedicated memtest pass; see
architecture.md for the reasoning (Memtest86+
can't be signalled back without IPMI serial).
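The worker counts in the command line above follow directly from the logical core count. A small sketch of how the argument list could be derived; stressArgs is a hypothetical helper, and clamping the --vm worker count to at least 1 on single-core hosts is an assumption made here, not something the text above specifies:

```go
package main

import (
	"fmt"
	"strconv"
)

// stressArgs builds the stress-ng argument list described above:
// N = logicalCores CPU workers and M = logicalCores/2 VM workers.
func stressArgs(logicalCores, seconds int) []string {
	vm := logicalCores / 2
	if vm < 1 {
		vm = 1 // assumption: keep at least one memory worker on 1-core hosts
	}
	return []string{
		"--cpu", strconv.Itoa(logicalCores),
		"--vm", strconv.Itoa(vm),
		"--vm-bytes", "90%",
		"-t", strconv.Itoa(seconds) + "s",
	}
}

func main() {
	fmt.Println(stressArgs(8, 120)) // prints: [--cpu 8 --vm 4 --vm-bytes 90% -t 120s]
}
```

The resulting slice could be passed to exec.Command("stress-ng", args...); the stage's pass check additionally depends on the thermal samples, which this sketch does not model.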
Storage
Owner: agent (destructive).
What it does:
- Wipe probe: scans for filesystem signatures, LVM metadata, and
  partition tables on the expected disks. Any hit → halt with
  UnexpectedData; operator must click Override wipe-probe.
- badblocks -svw (destructive read/write) on each expected disk.
- fio --rw=randrw --bs=4k --iodepth=32 --runtime=60 --size=1G on each disk; captures IOPS and p99 latency.
Pass: badblocks reports zero bad blocks; fio IOPS above a
per-class floor (configurable).
Artifacts: fio-<disk>.json per disk.
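The IOPS half of the pass check can be sketched against a fio JSON report. This assumes fio was run with --output-format=json (the command line above does not show that flag) and that read and write IOPS are summed before comparing with the floor; both are assumptions, as are the fioReport type and the totalIOPS helper.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// fioSide holds the per-direction figures this sketch cares about.
type fioSide struct {
	IOPS float64 `json:"iops"`
}

// fioReport covers only the subset of fio's JSON output used here.
type fioReport struct {
	Jobs []struct {
		Read  fioSide `json:"read"`
		Write fioSide `json:"write"`
	} `json:"jobs"`
}

// totalIOPS sums read and write IOPS across every job in the report.
func totalIOPS(raw []byte) (float64, error) {
	var r fioReport
	if err := json.Unmarshal(raw, &r); err != nil {
		return 0, err
	}
	if len(r.Jobs) == 0 {
		return 0, errors.New("fio report contains no jobs")
	}
	var sum float64
	for _, j := range r.Jobs {
		sum += j.Read.IOPS + j.Write.IOPS
	}
	return sum, nil
}

func main() {
	raw := []byte(`{"jobs":[{"read":{"iops":1500.5},"write":{"iops":1499.5}}]}`)
	iops, err := totalIOPS(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	floor := 2000.0 // arbitrary example floor, not a project default
	fmt.Println(iops, iops >= floor)
}
```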
Safety gate: the wipe-probe + device allowlist are the second and
third lines of defense against wiping the wrong disk. See
architecture.md § Safety.
Network
Owner: agent.
What it does: iperf3 -c <orchestrator> -p <iperf_port> -t 10 -J
to measure throughput to the orchestrator. The orchestrator-side
iperf3 -s is supervised by internal/orchestrator/iperf.go and
binds to the configured network.iperf_port.
Pass: throughput ≥ per-class floor (1 Gbps for 1GbE NICs, 9 Gbps
for 10GbE).
Artifacts: iperf-<nic>.json.
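A sketch of how the -J output could be checked against the floor. It reads end.sum_received.bits_per_second from the iperf3 JSON report; choosing the receiver-side figure rather than sum_sent is an assumption, and throughputPasses is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// iperfResult covers only the part of the iperf3 -J report used here.
type iperfResult struct {
	End struct {
		SumReceived struct {
			BitsPerSecond float64 `json:"bits_per_second"`
		} `json:"sum_received"`
	} `json:"end"`
}

// throughputPasses parses an iperf3 JSON report and compares the measured
// throughput, in Gbps, against the per-class floor.
func throughputPasses(raw []byte, floorGbps float64) (float64, bool, error) {
	var r iperfResult
	if err := json.Unmarshal(raw, &r); err != nil {
		return 0, false, err
	}
	gbps := r.End.SumReceived.BitsPerSecond / 1e9
	return gbps, gbps >= floorGbps, nil
}

func main() {
	raw := []byte(`{"end":{"sum_received":{"bits_per_second":9.4e9}}}`)
	gbps, ok, err := throughputPasses(raw, 9) // 9 Gbps floor for 10GbE, per the text above
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%.1f Gbps, pass=%v\n", gbps, ok)
}
```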
Burn
Owner: agent.
What it does: runs CPU stress, memory stress, disk I/O, and network throughput simultaneously for the profile's burn duration. The goal is to stress every subsystem at once and surface failures that only appear under combined load (thermal throttling, PSU voltage sag, memory errors under thermal pressure).
Sub-workloads run as parallel goroutines:
- CPU: stress-ng --cpu <workers> for the burn duration.
- Memory: stress-ng --vm --vm-bytes <mem_pct>% for the burn duration.
- Disk: fio against a spare partition (when fio_on_spare is enabled).
- Network: iperf3 -c <orchestrator> -P <parallel> for the burn duration.
Pass: all four sub-workloads exit 0 and no critical threshold breach fires during the window.
Configurable knobs (per profile):
| Knob | Description |
|---|---|
| duration | Total burn-in window. |
| cpu_workers | all = runtime.NumCPU(), or a fixed count. |
| mem_pct | Percentage of MemAvailable to stress. |
| fio_on_spare | Run fio inside Burn (requires a spare partition). |
| iperf_parallel | Parallel stream count for iperf3 -P. |
See configuration.md § burn for per-profile default values.
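The "parallel goroutines, all must exit 0" shape of Burn can be sketched with a sync.WaitGroup. The workload type, runBurn, and errExit are hypothetical names for illustration; in the real stage each run function would presumably wrap an external command (stress-ng, fio, iperf3), and the threshold-breach half of the pass check is not modelled here.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// errExit is a placeholder for "the sub-workload exited non-zero".
var errExit = errors.New("non-zero exit")

// workload is a hypothetical wrapper around one Burn sub-workload.
type workload struct {
	name string
	run  func() error
}

// runBurn starts every workload in its own goroutine, waits for all of
// them to finish, and returns the sorted names of those that failed.
// An empty result means every sub-workload exited cleanly.
func runBurn(ws []workload) []string {
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed []string
	)
	for _, w := range ws {
		wg.Add(1)
		go func(w workload) {
			defer wg.Done()
			if err := w.run(); err != nil {
				mu.Lock()
				failed = append(failed, w.name)
				mu.Unlock()
			}
		}(w)
	}
	wg.Wait()
	sort.Strings(failed)
	return failed
}

func main() {
	ok := func() error { return nil }
	failed := runBurn([]workload{
		{"cpu", ok},
		{"memory", ok},
		{"disk", func() error { return errExit }},
		{"network", ok},
	})
	fmt.Println(len(failed) == 0, failed) // prints: false [disk]
}
```

Waiting for every workload instead of cancelling on the first error is a design choice of this sketch; it keeps the full set of failures available for the stage summary.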
GPU
Owner: agent.
What it does: runs nvidia-smi -q and a short compute workload
(gpu-burn if present, else nvidia-smi dmon during a stress-ng --gpu burst). Skipped cleanly when no GPU is present.
Pass: no ECC errors reported; temperature below threshold; compute
workload exits 0.
PSU
Owner: agent.
What it does: reads /sys/class/hwmon/*/power*_average and in*_input
during a synthetic load burst (CPU + disk + NIC simultaneously) to
look for voltage sag or wattage anomalies. Records the full envelope
as measurements rows with kind=psu.
Pass: no voltage dip below threshold across the load burst.
Caveat: only reports on what the BMC exposes via hwmon — servers
without exposed PSU telemetry pass trivially. Documented limitation.
Reporting (orchestrator-owned)
Owner: orchestrator (resolves inline inside the /result for PSU).
What it does:
- Gathers run, host, stages, spec_diffs, and measurement aggregates.
- Renders report.html via internal/report (html/template with inlined CSS; self-contained and viewable offline).
- Writes report.json with the same data in machine-readable form.
- Records both as report_html / report_json artifact rows.
- Transitions run → Completed.
- Fires RunCompleted notification.
- The next agent heartbeat returns cmd=shutdown.
Thermal sidecar
Owner: agent (always-on from Booting until the agent exits).
What it does: every 5 seconds, walks /sys/class/hwmon/* and
POSTs temperature samples as a batch to /sensor. Populates the
measurements table with kind=thermal.
No pass/fail on its own — stages that care about thermals read the
sidecar's data via measurements. A dead sensor just drops out of
the next batch.
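One collection pass of the sidecar might look like the sketch below. It relies on the kernel hwmon convention that temp*_input files report millidegrees Celsius; the Sample type, parseMilliC, and collect are hypothetical names, and the 5-second ticker and the POST to /sensor are left out.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// Sample is a hypothetical in-memory form of one thermal measurement.
type Sample struct {
	Sensor  string
	Celsius float64
}

// parseMilliC converts the contents of a temp*_input file
// (millidegrees Celsius, newline-terminated) to degrees Celsius.
func parseMilliC(raw string) (float64, error) {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return 0, err
	}
	return float64(n) / 1000, nil
}

// collect walks root/hwmon*/temp*_input once and returns every readable
// sample. Sensors that fail to read or parse are skipped, so a dead
// sensor simply drops out of the batch.
func collect(root string) []Sample {
	paths, _ := filepath.Glob(filepath.Join(root, "hwmon*", "temp*_input"))
	var out []Sample
	for _, p := range paths {
		b, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		c, err := parseMilliC(string(b))
		if err != nil {
			continue
		}
		out = append(out, Sample{Sensor: p, Celsius: c})
	}
	return out
}

func main() {
	for _, s := range collect("/sys/class/hwmon") {
		fmt.Printf("%s %.1f\n", s.Sensor, s.Celsius)
	}
}
```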
Where pass/fail lives
- runs.state: authoritative terminal state (Completed, FailedHolding, Released).
- runs.result: pass or fail string once the run completes.
- runs.failed_stage: name of the stage that halted the pipeline, if any. Cleared when the operator overrides and re-enters.
- stages: one row per attempted stage with passed, started_at, completed_at, summary_json, message.
- measurements: time-series samples from the thermal sidecar and from stages that capture numeric outputs.
- artifacts: on-disk files (report, fio logs, iperf logs, etc.).
- spec_diffs: one row per expected-vs-actual divergence.
Profile duration summary
Three profiles scale every stage's duration. Probes and gates are identical across profiles — only the work size changes. See configuration.md § profiles for the full knob reference.
| Stage | quick (~10 min) | deep (~8-12 h) | soak (~36-40 h) |
|---|---|---|---|
| Inventory | seconds | seconds | seconds |
| Firmware | seconds | seconds | seconds |
| SpecValidate | instant (server) | instant (server) | instant (server) |
| SMART | seconds per disk | seconds per disk | seconds per disk |
| CPUStress | 2 m cpu + 2 m mem | 60 m cpu + 60 m mem | 12 h cpu + 12 h mem |
| Storage | 3 m fio (sample) | badblocks + 2 h fio | badblocks + 6 h fio |
| Network | 60 s iperf | 30 m iperf | 2 h iperf |
| Burn | 2 m all-at-once | 2 h all-at-once | 18 h all-at-once |
| GPU | seconds | seconds | seconds |
| PSU | 1 m load burst | 10 m load burst | 15 m load burst |
| Reporting | instant (server) | instant (server) | instant (server) |
Adding a new stage
- Add the name to store.DefaultStageOrder.
- Add a model.State<Name> const and wire it into internal/orchestrator/statemachine.go (both the forward transition table and the stage-for-state lookup).
- Add a case to agent/runner.go's runStage dispatch.
- Drop the implementation into agent/tests/.
- If the stage is orchestrator-owned, add a resolve<Name> helper to internal/api/agent_handlers.go and invoke it from the /result handler after the preceding stage's NextState resolves.