Vetting/docs/test-suite.md
josh 8367ec2a9f
docs: comprehensive documentation expansion
Add 4 new doc files (configuration reference, development guide, API
reference with full request/response schemas, database schema), expand
the README with a feature list and how-it-works walkthrough, fix
missing Firmware and Burn stages in architecture.md and test-suite.md,
add threshold engine and host-mode agent sections, and add godoc
comments to 11 packages and 6 model types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-23 18:37:26 -04:00

Test suite

What each stage measures, what "pass" means, and where the results land. Stages run strictly in order. Any stage returning passed=false halts the pipeline at FailedHolding — the operator decides whether to fix, override, or abandon.

Stage order

Inventory → Firmware → SpecValidate → SMART → CPUStress → Storage
          → Network → Burn → GPU → PSU → Reporting

Stages marked orchestrator-owned resolve inside /result and never show up as "the agent's turn".
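The halt-on-first-failure rule can be sketched in a few lines of Go. This is a minimal illustration, not the orchestrator's code: the Stage type and RunPipeline function are hypothetical names, and the real state machine has more states than the two shown here.

```go
package main

import "fmt"

// Stage is a hypothetical stand-in for one pipeline step.
type Stage struct {
	Name string
	Run  func() bool // reports passed
}

// RunPipeline runs stages strictly in order and stops at the first
// stage that reports passed=false. It returns the resulting run state
// and the name of the failed stage (empty on success).
func RunPipeline(stages []Stage) (state, failedStage string) {
	for _, s := range stages {
		if !s.Run() {
			return "FailedHolding", s.Name
		}
	}
	return "Completed", ""
}

func main() {
	pass := func() bool { return true }
	fail := func() bool { return false }
	stages := []Stage{
		{"Inventory", pass}, {"Firmware", pass}, {"SpecValidate", pass},
		{"SMART", fail}, {"CPUStress", pass},
	}
	state, failed := RunPipeline(stages)
	fmt.Println(state, failed)
}
```

In this sketch CPUStress never runs, because SMART has already halted the pipeline.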


Inventory

Owner: agent. What it does: dmidecode, lscpu, lshw, lspci, smartctl -i over each block device, nvidia-smi -q if present. The raw output is merged into a single JSON blob. Pass: the probes run to completion; missing optional tools (e.g. nvidia-smi on a GPU-less host) are tolerated. Artifacts: inventory.json under artifacts/run-<N>/.

Firmware

Owner: agent. What it does: probes firmware versions across all discoverable components: BIOS (dmidecode -t bios), BMC (ipmitool mc info), NIC firmware (ethtool -i per interface), NVMe firmware (nvme id-ctrl), HBA firmware (lspci -vv), and CPU microcode (/proc/cpuinfo). Missing tools are tolerated — a GPU-less server won't have nvidia-smi, a consumer board won't have ipmitool. Pass: always passes. Firmware is advisory-only; SpecValidate is the gate that fails on version mismatches. Artifacts: firmware_snapshots table rows (one per component, keyed by (run_id, component, identifier)).
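A rough sketch of the "missing tools are tolerated" behaviour, assuming the agent checks for each tool before calling it. The Probe type, RunProbes, and the injected have/run callbacks are hypothetical; only the tool names and arguments come from the list above.

```go
package main

import (
	"fmt"
	"os/exec"
)

// Probe names one firmware query.
type Probe struct {
	Component string
	Tool      string
	Args      []string
}

// RunProbes runs every probe whose tool is available and records the
// components it skipped. A missing tool or a failing command is never
// an error here, because the stage is advisory-only.
func RunProbes(probes []Probe, have func(tool string) bool,
	run func(tool string, args []string) (string, error)) (map[string]string, []string) {
	results := make(map[string]string)
	var skipped []string
	for _, p := range probes {
		if !have(p.Tool) {
			skipped = append(skipped, p.Component)
			continue
		}
		out, err := run(p.Tool, p.Args)
		if err != nil {
			skipped = append(skipped, p.Component)
			continue
		}
		results[p.Component] = out
	}
	return results, skipped
}

func main() {
	probes := []Probe{
		{"bios", "dmidecode", []string{"-t", "bios"}},
		{"bmc", "ipmitool", []string{"mc", "info"}},
	}
	have := func(tool string) bool {
		_, err := exec.LookPath(tool)
		return err == nil
	}
	run := func(tool string, args []string) (string, error) {
		out, err := exec.Command(tool, args...).CombinedOutput()
		return string(out), err
	}
	results, skipped := RunProbes(probes, have, run)
	fmt.Printf("collected=%d skipped=%v\n", len(results), skipped)
}
```

Injecting have and run keeps the skip logic testable on machines that have none of the tools installed.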

SpecValidate (orchestrator-owned)

Owner: orchestrator (resolves inline inside the /result for the preceding stage, which is Firmware in the stage order above). What it does: diffs the submitted inventory against the host's expected_spec_yaml. The diff engine classifies each field as critical, warning, or info. Pass: zero critical diffs. Fail mode: fires a SpecMismatch notification; transitions run to Failed → FailedHolding. Artifacts: spec_diffs table rows (one per divergence).
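The classify-then-gate idea might look like the sketch below. The field names and the severityByField policy table are invented for illustration; the real diff engine and the layout of expected_spec_yaml are not shown in this document.

```go
package main

import (
	"fmt"
	"sort"
)

// Diff is one expected-vs-actual divergence.
type Diff struct {
	Field    string
	Expected string
	Actual   string
	Severity string // "critical", "warning", or "info"
}

// severityByField is a made-up policy table; unlisted fields are "info".
var severityByField = map[string]string{
	"cpu.model":       "critical",
	"memory.total_gb": "critical",
	"disk.count":      "warning",
}

// DiffSpec compares flattened expected and actual specs and returns
// one Diff per divergent field, in field-name order.
func DiffSpec(expected, actual map[string]string) []Diff {
	fields := make([]string, 0, len(expected))
	for f := range expected {
		fields = append(fields, f)
	}
	sort.Strings(fields)

	var diffs []Diff
	for _, f := range fields {
		if actual[f] == expected[f] {
			continue
		}
		sev, ok := severityByField[f]
		if !ok {
			sev = "info"
		}
		diffs = append(diffs, Diff{f, expected[f], actual[f], sev})
	}
	return diffs
}

// Passed applies the gate: zero critical diffs.
func Passed(diffs []Diff) bool {
	for _, d := range diffs {
		if d.Severity == "critical" {
			return false
		}
	}
	return true
}

func main() {
	expected := map[string]string{"cpu.model": "X", "memory.total_gb": "256", "asset_tag": "A1"}
	actual := map[string]string{"cpu.model": "X", "memory.total_gb": "192", "asset_tag": "A2"}
	diffs := DiffSpec(expected, actual)
	fmt.Println(diffs, Passed(diffs))
}
```

Warnings and info-level diffs are still recorded as rows; only critical ones fail the stage.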

SMART

Owner: agent. What it does: smartctl -a /dev/<disk> for each disk in the inventory's expected_disks. Parses reallocated-sector counts, pending sectors, end-to-end error counters, and the overall-health attribute. Pass: SMART overall-health is PASSED on every expected disk and the reallocated-sector count is below threshold. Artifacts: smart-<disk>.txt raw output.
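As an illustration of the pass rule, the sketch below scans ATA-style smartctl -a output for the overall-health line and the Reallocated_Sector_Ct attribute. SmartPassed is a hypothetical helper; NVMe devices report health in a different layout, and the agent's real parser may handle more attributes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// SmartPassed reports whether the output shows overall-health PASSED
// and a reallocated-sector raw value below maxReallocated.
func SmartPassed(output string, maxReallocated int) bool {
	healthy := false
	reallocated := 0
	for _, line := range strings.Split(output, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.Contains(trimmed, "overall-health") && strings.HasSuffix(trimmed, "PASSED") {
			healthy = true
		}
		// ATA attribute rows have ten columns; the raw value is the last.
		f := strings.Fields(line)
		if len(f) >= 10 && f[1] == "Reallocated_Sector_Ct" {
			if n, err := strconv.Atoi(f[len(f)-1]); err == nil {
				reallocated = n
			}
		}
	}
	return healthy && reallocated < maxReallocated
}

func main() {
	sample := `SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0`
	fmt.Println(SmartPassed(sample, 10))
}
```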

CPUStress

Owner: agent. What it does: runs stress-ng --cpu N --vm M --vm-bytes 90% -t 120s with N = logical_cores and M ≈ logical_cores/2. The 120s shown is the quick-profile duration; the deep and soak profiles run longer (see the profile duration summary). The --vm flag is the stand-in for Memtest86+: it exercises the memory subsystem under load and will fail if the RAM has latent faults that surface under thermal + allocator pressure. Pass: stress-ng exits 0 and thermal samples taken by the sidecar stay below the configured per-host max_temp_c. Caveat: weaker than a dedicated memtest pass; see architecture.md for the reasoning (Memtest86+ can't be signalled back without IPMI serial).
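The worker-count arithmetic is simple enough to show directly. StressArgs is a hypothetical helper; the flags are the ones quoted above, and rounding M down with a floor of one worker is an assumption.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
)

// StressArgs builds the stress-ng argument list: one CPU worker per
// logical core and roughly half as many VM workers (at least one).
func StressArgs(logicalCores, seconds int) []string {
	vm := logicalCores / 2
	if vm < 1 {
		vm = 1
	}
	return []string{
		"--cpu", strconv.Itoa(logicalCores),
		"--vm", strconv.Itoa(vm),
		"--vm-bytes", "90%",
		"-t", strconv.Itoa(seconds) + "s",
	}
}

func main() {
	fmt.Println("stress-ng", StressArgs(runtime.NumCPU(), 120))
}
```

On a 16-thread host this yields --cpu 16 --vm 8.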

Storage

Owner: agent (destructive). What it does:

  1. Wipe probe — scans for filesystem signatures, LVM metadata, partition tables on the expected disks. Any hit → halt with UnexpectedData; operator must click Override wipe-probe.
  2. badblocks -svw (destructive read/write) on each expected disk.
  3. fio --rw=randrw --bs=4k --iodepth=32 --runtime=60 --size=1G on each disk; captures IOPS and p99 latency.

Pass: badblocks reports zero bad blocks; fio IOPS above a per-class floor (configurable). Artifacts: fio-<disk>.json per disk. Safety gate: the wipe-probe + device allowlist are the second and third lines of defense against wiping the wrong disk. See architecture.md § Safety.
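A sketch of the pass decision, assuming the fio-<disk>.json artifact follows fio's JSON output format, where each entry in jobs carries read.iops and write.iops. TotalIOPS and StoragePassed are hypothetical names, and treating the floor as inclusive is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fioReport captures only the fields this sketch needs.
type fioReport struct {
	Jobs []struct {
		Read struct {
			IOPS float64 `json:"iops"`
		} `json:"read"`
		Write struct {
			IOPS float64 `json:"iops"`
		} `json:"write"`
	} `json:"jobs"`
}

// TotalIOPS sums read and write IOPS across all jobs in the report.
func TotalIOPS(raw []byte) (float64, error) {
	var r fioReport
	if err := json.Unmarshal(raw, &r); err != nil {
		return 0, err
	}
	total := 0.0
	for _, j := range r.Jobs {
		total += j.Read.IOPS + j.Write.IOPS
	}
	return total, nil
}

// StoragePassed applies both gates: zero bad blocks and IOPS at or
// above the per-class floor.
func StoragePassed(badBlocks int, iops, floor float64) bool {
	return badBlocks == 0 && iops >= floor
}

func main() {
	raw := []byte(`{"jobs":[{"read":{"iops":1500.5},"write":{"iops":1499.5}}]}`)
	iops, err := TotalIOPS(raw)
	fmt.Println(iops, err, StoragePassed(0, iops, 2000))
}
```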

Network

Owner: agent. What it does: iperf3 -c <orchestrator> -p <iperf_port> -t 10 -J to measure throughput to the orchestrator. The orchestrator-side iperf3 -s is supervised by internal/orchestrator/iperf.go and binds to the configured network.iperf_port. Pass: throughput ≥ per-class floor (1 Gbps for 1GbE NICs, 9 Gbps for 10GbE). Artifacts: iperf-<nic>.json.
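The throughput check could be sketched as follows, assuming a TCP run where iperf3 -J reports end.sum_received.bits_per_second. The link-class labels and the function names are hypothetical; the two floors are the ones stated above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// iperfReport captures only the receiver-side summary.
type iperfReport struct {
	End struct {
		SumReceived struct {
			BitsPerSecond float64 `json:"bits_per_second"`
		} `json:"sum_received"`
	} `json:"end"`
}

// floorBps maps a link class to its minimum acceptable throughput.
var floorBps = map[string]float64{
	"1GbE":  1e9,
	"10GbE": 9e9,
}

// ThroughputBps extracts the received throughput from iperf3 JSON.
func ThroughputBps(raw []byte) (float64, error) {
	var r iperfReport
	if err := json.Unmarshal(raw, &r); err != nil {
		return 0, err
	}
	return r.End.SumReceived.BitsPerSecond, nil
}

// NetworkPassed reports whether bps meets the floor for linkClass.
// Unknown classes fail rather than pass silently.
func NetworkPassed(bps float64, linkClass string) bool {
	floor, ok := floorBps[linkClass]
	return ok && bps >= floor
}

func main() {
	raw := []byte(`{"end":{"sum_received":{"bits_per_second":9400000000}}}`)
	bps, err := ThroughputBps(raw)
	fmt.Println(bps, err, NetworkPassed(bps, "10GbE"))
}
```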

Burn

Owner: agent. What it does: runs CPU stress, memory stress, disk I/O, and network throughput simultaneously for the profile's burn duration. The goal is to stress every subsystem at once and surface failures that only appear under combined load (thermal throttling, PSU voltage sag, memory errors under thermal pressure).

Sub-workloads run as parallel goroutines:

  • CPU: stress-ng --cpu <workers> for the burn duration.
  • Memory: stress-ng --vm --vm-bytes <mem_pct>% for the burn duration.
  • Disk: fio against a spare partition (when fio_on_spare is enabled).
  • Network: iperf3 -c <orchestrator> -P <parallel> for the burn duration.

Pass: all four sub-workloads exit 0 and no critical threshold breach fires during the window. Configurable knobs (per profile):

| Knob | Description |
| --- | --- |
| duration | Total burn-in window. |
| cpu_workers | all = runtime.NumCPU(), or a fixed count. |
| mem_pct | Percentage of MemAvailable to stress. |
| fio_on_spare | Run fio inside Burn (requires a spare partition). |
| iperf_parallel | Parallel stream count for iperf3 -P. |

See configuration.md § burn for per-profile default values.
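The parallel-goroutine structure can be illustrated with a sync.WaitGroup. This is a sketch under assumptions: Workload, RunBurn, and BurnPassed are hypothetical names, and each Run callback stands in for launching the real command and returning its exit code.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Workload is one sub-workload; Run returns the process exit code.
type Workload struct {
	Name string
	Run  func() int
}

// RunBurn starts every workload in its own goroutine, waits for all
// of them, and returns the sorted names of those that exited non-zero.
func RunBurn(ws []Workload) []string {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		failed []string
	)
	for _, w := range ws {
		wg.Add(1)
		go func(w Workload) {
			defer wg.Done()
			if code := w.Run(); code != 0 {
				mu.Lock()
				failed = append(failed, w.Name)
				mu.Unlock()
			}
		}(w)
	}
	wg.Wait()
	sort.Strings(failed)
	return failed
}

// BurnPassed applies the pass rule: every sub-workload exited 0 and
// no critical threshold breach fired during the window.
func BurnPassed(failed []string, criticalBreaches int) bool {
	return len(failed) == 0 && criticalBreaches == 0
}

func main() {
	ok := func() int { return 0 }
	ws := []Workload{{"cpu", ok}, {"memory", ok}, {"disk", ok}, {"network", ok}}
	failed := RunBurn(ws)
	fmt.Println(failed, BurnPassed(failed, 0))
}
```

Waiting for all four workloads, rather than cancelling on the first failure, is a design choice of this sketch; it lets the result name every subsystem that failed under combined load.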

GPU

Owner: agent. What it does: runs nvidia-smi -q and a short compute workload (gpu-burn if present, else nvidia-smi dmon during a stress-ng --gpu burst). Skipped cleanly when no GPU is present. Pass: no ECC errors reported; temperature below threshold; compute workload exits 0.

PSU

Owner: agent. What it does: reads /sys/class/hwmon/*/power*_average and in*_input during a synthetic load burst (CPU + disk + NIC simultaneously) to look for voltage sag or wattage anomalies. Records the full envelope as measurements rows with kind=psu. Pass: no voltage dip below threshold across the load burst. Caveat: only reports on what the BMC exposes via hwmon, so servers without exposed PSU telemetry pass trivially. This is a documented limitation.
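A minimal sketch of the dip check, assuming the threshold is expressed as a maximum sag percentage below a rail's nominal voltage (the real threshold format is not specified here). hwmon reports in*_input in millivolts, so the samples are integers in mV. VoltageDip is a hypothetical helper.

```go
package main

import "fmt"

// VoltageDip returns the lowest sample and whether it fell more than
// maxSagPct percent below nominalMV. With no samples it reports no
// dip, which mirrors the "pass trivially" caveat above.
func VoltageDip(samplesMV []int, nominalMV int, maxSagPct float64) (minMV int, dipped bool) {
	if len(samplesMV) == 0 {
		return 0, false
	}
	minMV = samplesMV[0]
	for _, v := range samplesMV[1:] {
		if v < minMV {
			minMV = v
		}
	}
	floor := float64(nominalMV) * (1 - maxSagPct/100)
	return minMV, float64(minMV) < floor
}

func main() {
	// 12 V rail sampled during the load burst, 5% allowed sag.
	minMV, dipped := VoltageDip([]int{12050, 11980, 11350}, 12000, 5)
	fmt.Println(minMV, dipped)
}
```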

Reporting (orchestrator-owned)

Owner: orchestrator (resolves inline inside the /result for PSU). What it does:

  1. Gathers run, host, stages, spec_diffs, and measurement aggregates.
  2. Renders report.html via internal/report (html/template with inlined CSS; self-contained offline-viewable).
  3. Writes report.json with the same data in machine-readable form.
  4. Records both as report_html / report_json artifact rows.
  5. Transitions run → Completed.
  6. Fires RunCompleted notification.
  7. The next agent heartbeat returns cmd=shutdown.
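Step 2 can be sketched with html/template and an inlined style block, which keeps the output self-contained. The ReportData fields and the markup are illustrative only; the real template in internal/report is not reproduced here.

```go
package main

import (
	"fmt"
	"html/template"
	"os"
	"strings"
)

// StageRow and ReportData are hypothetical shapes for the report input.
type StageRow struct {
	Name   string
	Passed bool
}

type ReportData struct {
	RunID  int
	Host   string
	Result string
	Stages []StageRow
}

// The CSS is inlined so the file renders without network access.
var reportTmpl = template.Must(template.New("report").Parse(`<!doctype html>
<html><head><meta charset="utf-8">
<style>body{font-family:sans-serif}</style></head>
<body><h1>Run {{.RunID}} on {{.Host}}: {{.Result}}</h1>
<ul>{{range .Stages}}<li>{{.Name}}: {{if .Passed}}pass{{else}}fail{{end}}</li>{{end}}</ul>
</body></html>
`))

// RenderReport executes the template into a string.
func RenderReport(d ReportData) (string, error) {
	var b strings.Builder
	if err := reportTmpl.Execute(&b, d); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	out, err := RenderReport(ReportData{
		RunID: 7, Host: "node-01", Result: "pass",
		Stages: []StageRow{{"Inventory", true}, {"SMART", true}},
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

html/template escapes interpolated values by context, so a hostname containing markup cannot break the page.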

Thermal sidecar

Owner: agent (always-on from Booting until the agent exits). What it does: every 5 seconds, walks /sys/class/hwmon/* and POSTs temperature samples as a batch to /sensor. Populates the measurements table with kind=thermal. No pass/fail on its own — stages that care about thermals read the sidecar's data via measurements. A dead sensor just drops out of the next batch.
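One sampling pass might look like the sketch below. hwmon exposes temperatures as temp*_input files in millidegrees Celsius; the Sample JSON shape is an assumption about the /sensor payload, and the 5-second ticker and the HTTP POST are left out so the example runs once and exits.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// Sample is a hypothetical wire format for one temperature reading.
type Sample struct {
	Sensor string  `json:"sensor"`
	TempC  float64 `json:"temp_c"`
}

// parseMilliC converts a raw hwmon value (millidegrees C) to degrees C.
func parseMilliC(raw string) (float64, bool) {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return 0, false
	}
	return float64(n) / 1000, true
}

// ReadTemps walks root/*/temp*_input and returns one Sample per
// readable sensor. Unreadable sensors are simply left out of the batch.
func ReadTemps(root string) []Sample {
	// Glob only errors on a malformed pattern, which this one is not.
	paths, _ := filepath.Glob(filepath.Join(root, "*", "temp*_input"))
	var batch []Sample
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			continue // dead sensor: drops out of this batch
		}
		if c, ok := parseMilliC(string(raw)); ok {
			batch = append(batch, Sample{Sensor: p, TempC: c})
		}
	}
	return batch
}

func main() {
	batch := ReadTemps("/sys/class/hwmon")
	body, _ := json.Marshal(batch)
	fmt.Println(string(body))
}
```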


Where pass/fail lives

  • runs.state: authoritative terminal state (Completed, FailedHolding, Released).
  • runs.result: pass or fail string once the run completes.
  • runs.failed_stage: name of the stage that halted the pipeline, if any. Cleared when the operator overrides and re-enters.
  • stages: one row per attempted stage with passed, started_at, completed_at, summary_json, message.
  • measurements: time-series samples from the thermal sidecar and from stages that capture numeric outputs.
  • artifacts: on-disk files (report, fio logs, iperf logs, etc).
  • spec_diffs: one row per expected-vs-actual divergence.

Profile duration summary

Three profiles (quick, deep, soak) scale the duration of the load stages. Probes and gates are identical across profiles; only the work size changes. See configuration.md § profiles for the full knob reference.

| Stage | quick (~10 min) | deep (~8-12 h) | soak (~36-40 h) |
| --- | --- | --- | --- |
| Inventory | seconds | seconds | seconds |
| Firmware | seconds | seconds | seconds |
| SpecValidate | instant (server) | instant (server) | instant (server) |
| SMART | seconds per disk | seconds per disk | seconds per disk |
| CPUStress | 2 m cpu + 2 m mem | 60 m cpu + 60 m mem | 12 h cpu + 12 h mem |
| Storage | 3 m fio (sample) | badblocks + 2 h fio | badblocks + 6 h fio |
| Network | 60 s iperf | 30 m iperf | 2 h iperf |
| Burn | 2 m all-at-once | 2 h all-at-once | 18 h all-at-once |
| GPU | seconds | seconds | seconds |
| PSU | 1 m load burst | 10 m load burst | 15 m load burst |
| Reporting | instant (server) | instant (server) | instant (server) |

Adding a new stage

  1. Add the name to store.DefaultStageOrder.
  2. Add a model.State<Name> const and wire it into internal/orchestrator/statemachine.go (both the forward transition table and the stage-for-state lookup).
  3. Add a case to agent/runner.go's runStage dispatch.
  4. Drop the implementation into agent/tests/.
  5. If the stage is orchestrator-owned, add a resolve<Name> helper to internal/api/agent_handlers.go and invoke it from the /result handler after the preceding stage's NextState resolves.
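To make steps 1 and 3 concrete, here is a self-contained sketch that adds an invented agent-owned stage called Fan. Everything in it is hypothetical: the real store.DefaultStageOrder and runStage have their own types and signatures, and the stub runners stand in for code under agent/tests/.

```go
package main

import "fmt"

// DefaultStageOrder mirrors the documented order, with the invented
// "Fan" stage slotted in between GPU and PSU (step 1).
var DefaultStageOrder = []string{
	"Inventory", "Firmware", "SpecValidate", "SMART", "CPUStress",
	"Storage", "Network", "Burn", "GPU", "Fan", "PSU", "Reporting",
}

// NextStage returns the stage that follows current, if any.
func NextStage(current string) (string, bool) {
	for i, s := range DefaultStageOrder {
		if s == current && i+1 < len(DefaultStageOrder) {
			return DefaultStageOrder[i+1], true
		}
	}
	return "", false
}

// Stub runners standing in for real stage implementations.
func runGPU() (bool, error) { return true, nil }
func runFan() (bool, error) { return true, nil }

// runStage dispatches agent-owned stages by name (step 3).
// Orchestrator-owned stages never reach the agent, so they have no case.
func runStage(name string) (bool, error) {
	switch name {
	case "GPU":
		return runGPU()
	case "Fan":
		return runFan()
	default:
		return false, fmt.Errorf("no runner for stage %q", name)
	}
}

func main() {
	next, _ := NextStage("GPU")
	passed, err := runStage(next)
	fmt.Println(next, passed, err)
}
```

Returning an error for an unknown name makes a stage that was added to the order but not wired into the dispatch fail loudly instead of being skipped.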