# Test suite

What each stage measures, what "pass" means, and where the results land.

Stages run strictly in order. Any stage returning `passed=false` halts the pipeline at `FailedHolding`; the operator decides whether to fix, override, or abandon.

## Stage order

```
Inventory → Firmware → SpecValidate → SMART → CPUStress → Storage → Network → Burn → GPU → PSU → Reporting
```

Stages marked *orchestrator-owned* resolve inside `/result` and never show up as "the agent's turn".

---

## Inventory

**Owner:** agent.

**What it does:** `dmidecode`, `lscpu`, `lshw`, `lspci`, `smartctl -i` over each block device, `nvidia-smi -q` if present. The raw output is merged into a single JSON blob.

**Pass:** the probes run to completion; missing optional tools (e.g. `nvidia-smi` on a GPU-less host) are tolerated.

**Artifacts:** `inventory.json` under `artifacts/run-/`.

## Firmware

**Owner:** agent.

**What it does:** probes firmware versions across all discoverable components: BIOS (`dmidecode -t bios`), BMC (`ipmitool mc info`), NIC firmware (`ethtool -i` per interface), NVMe firmware (`nvme id-ctrl`), HBA firmware (`lspci -vv`), and CPU microcode (`/proc/cpuinfo`). Missing tools are tolerated: a GPU-less server won't have `nvidia-smi`, and a consumer board won't have `ipmitool`.

**Pass:** always passes. Firmware is advisory-only; SpecValidate is the gate that fails on version mismatches.

**Artifacts:** `firmware_snapshots` table rows (one per component, keyed by `(run_id, component, identifier)`).

## SpecValidate *(orchestrator-owned)*

**Owner:** orchestrator (resolves inline inside the `/result` for the preceding Inventory stage).

**What it does:** diffs the submitted inventory against the host's `expected_spec_yaml`. The diff engine classifies each field as `critical`, `warning`, or `info`.

**Pass:** zero `critical` diffs.

**Fail mode:** fires a `SpecMismatch` notification; transitions run to `Failed → FailedHolding`.

**Artifacts:** `spec_diffs` table rows (one per divergence).

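
The SpecValidate gate can be sketched roughly as follows. This is a minimal illustration and not the orchestrator's actual diff engine: the `Diff` type, the `severityFor` table, the field names such as `cpu.model`, and the rule that unlisted fields default to `info` are all assumptions made for the example. Only the three severity classes and the "zero `critical` diffs" pass rule come from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// Severity mirrors the three classes named in the SpecValidate section.
type Severity string

const (
	Critical Severity = "critical"
	Warning  Severity = "warning"
	Info     Severity = "info"
)

// Diff is one expected-vs-actual divergence (hypothetical shape).
type Diff struct {
	Field    string
	Expected string
	Actual   string
	Severity Severity
}

// severityFor is a hypothetical classification table.
// Fields that are not listed default to info in this sketch.
var severityFor = map[string]Severity{
	"cpu.model":    Critical,
	"memory.total": Critical,
	"bios.version": Warning,
}

// diffSpec compares expected and actual values field by field and returns
// one Diff per divergence, sorted by field name for stable output.
func diffSpec(expected, actual map[string]string) []Diff {
	var diffs []Diff
	for field, want := range expected {
		got, ok := actual[field]
		if ok && got == want {
			continue
		}
		sev, known := severityFor[field]
		if !known {
			sev = Info
		}
		diffs = append(diffs, Diff{Field: field, Expected: want, Actual: got, Severity: sev})
	}
	sort.Slice(diffs, func(i, j int) bool { return diffs[i].Field < diffs[j].Field })
	return diffs
}

// passed applies the gate: the stage passes only with zero critical diffs.
func passed(diffs []Diff) bool {
	for _, d := range diffs {
		if d.Severity == Critical {
			return false
		}
	}
	return true
}

func main() {
	expected := map[string]string{"cpu.model": "EPYC 7313", "bios.version": "2.10", "chassis.asset_tag": "A-17"}
	actual := map[string]string{"cpu.model": "EPYC 7313", "bios.version": "2.08"}

	diffs := diffSpec(expected, actual)
	for _, d := range diffs {
		fmt.Printf("%s [%s]: expected %q, got %q\n", d.Field, d.Severity, d.Expected, d.Actual)
	}
	fmt.Println("passed:", passed(diffs))
}
```

In this example the BIOS mismatch is only a warning and the missing asset tag is only informational, so the gate still passes; changing `cpu.model` in `actual` would produce a critical diff and fail it.
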
## SMART

**Owner:** agent.

**What it does:** `smartctl -a /dev/` for each disk in the inventory's `expected_disks`. Parses reallocated-sector counts, pending sectors, end-to-end error counters, overall-health attribute.

**Pass:** SMART overall-health is PASSED on every expected disk and reallocated-sector count is below threshold.

**Artifacts:** `smart-.txt` raw output.

## CPUStress

**Owner:** agent.

**What it does:** runs `stress-ng --cpu N --vm M --vm-bytes 90% -t 120s` with `N = logical_cores` and `M ≈ logical_cores/2`. The `--vm` flag is the **stand-in for Memtest86+**: it exercises the memory subsystem under load and will fail if the RAM has latent faults that surface under thermal + allocator pressure.

**Pass:** `stress-ng` exits 0 and thermal samples taken by the sidecar stay below the configured per-host `max_temp_c`.

**Caveat:** weaker than a dedicated memtest pass; see [architecture.md](architecture.md) for the reasoning (Memtest86+ can't be signalled back without IPMI serial).

## Storage

**Owner:** agent (destructive).

**What it does:**

1. **Wipe probe**: scans for filesystem signatures, LVM metadata, partition tables on the expected disks. Any hit → halt with `UnexpectedData`; operator must click **Override wipe-probe**.
2. `badblocks -svw` (destructive read/write) on each expected disk.
3. `fio --rw=randrw --bs=4k --iodepth=32 --runtime=60 --size=1G` on each disk; captures IOPS and p99 latency.

**Pass:** badblocks reports zero bad blocks; fio IOPS above a per-class floor (configurable).

**Artifacts:** `fio-.json` per disk.

**Safety gate:** the wipe-probe + device allowlist are the second and third lines of defense against wiping the wrong disk. See [architecture.md § Safety](architecture.md#safety-destructive-disk-tests).

## Network

**Owner:** agent.

**What it does:** `iperf3 -c -p -t 10 -J` to measure throughput to the orchestrator.
The orchestrator-side `iperf3 -s` is supervised by `internal/orchestrator/iperf.go` and binds to the configured `network.iperf_port`.

**Pass:** throughput ≥ per-class floor (1 Gbps for 1GbE NICs, 9 Gbps for 10GbE).

**Artifacts:** `iperf-.json`.

## Burn

**Owner:** agent.

**What it does:** runs CPU stress, memory stress, disk I/O, and network throughput **simultaneously** for the profile's burn duration. The goal is to stress every subsystem at once and surface failures that only appear under combined load (thermal throttling, PSU voltage sag, memory errors under thermal pressure).

Sub-workloads run as parallel goroutines:

- **CPU**: `stress-ng --cpu ` for the burn duration.
- **Memory**: `stress-ng --vm --vm-bytes %` for the burn duration.
- **Disk**: `fio` against a spare partition (when `fio_on_spare` is enabled).
- **Network**: `iperf3 -c -P ` for the burn duration.

**Pass:** all four sub-workloads exit 0 and no critical threshold breach fires during the window.

**Configurable knobs** (per profile):

| Knob | Description |
|------|-------------|
| `duration` | Total burn-in window. |
| `cpu_workers` | `all` = `runtime.NumCPU()`, or a fixed count. |
| `mem_pct` | Percentage of MemAvailable to stress. |
| `fio_on_spare` | Run fio inside Burn (requires a spare partition). |
| `iperf_parallel` | Parallel stream count for `iperf3 -P`. |

See [configuration.md § burn](configuration.md#burn) for per-profile default values.

## GPU

**Owner:** agent.

**What it does:** runs `nvidia-smi -q` and a short compute workload (`gpu-burn` if present, else `nvidia-smi dmon` during a `stress-ng --gpu` burst). Skipped cleanly when no GPU is present.

**Pass:** no ECC errors reported; temperature below threshold; compute workload exits 0.

## PSU

**Owner:** agent.

**What it does:** reads `/sys/class/hwmon/*/power_average` and `in*_input` during a synthetic load burst (CPU + disk + NIC simultaneously) to look for voltage sag or wattage anomalies.
Records the full envelope as `measurements` rows with `kind=psu`.

**Pass:** no voltage dip below threshold across the load burst.

**Caveat:** only reports on what the BMC exposes via hwmon, so servers without exposed PSU telemetry pass trivially. This is a documented limitation.

## Reporting *(orchestrator-owned)*

**Owner:** orchestrator (resolves inline inside the `/result` for PSU).

**What it does:**

1. Gathers run, host, stages, spec_diffs, and measurement aggregates.
2. Renders `report.html` via `internal/report` (html/template with inlined CSS; self-contained offline-viewable).
3. Writes `report.json` with the same data in machine-readable form.
4. Records both as `report_html` / `report_json` artifact rows.
5. Transitions run → `Completed`.
6. Fires `RunCompleted` notification.
7. The next agent heartbeat returns `cmd=shutdown`.

## Thermal sidecar

**Owner:** agent (always-on from `Booting` until the agent exits).

**What it does:** every 5 seconds, walks `/sys/class/hwmon/*` and POSTs temperature samples as a batch to `/sensor`. Populates the `measurements` table with `kind=thermal`.

**No pass/fail** on its own: stages that care about thermals read the sidecar's data via `measurements`. A dead sensor just drops out of the next batch.

---

## Where pass/fail lives

- `runs.state`: authoritative terminal state (`Completed`, `FailedHolding`, `Released`).
- `runs.result`: `pass` or `fail` string once the run completes.
- `runs.failed_stage`: name of the stage that halted the pipeline, if any. Cleared when the operator overrides and re-enters.
- `stages`: one row per attempted stage with `passed`, `started_at`, `completed_at`, `summary_json`, `message`.
- `measurements`: time-series samples from the thermal sidecar and from stages that capture numeric outputs.
- `artifacts`: on-disk files (report, fio logs, iperf logs, etc).
- `spec_diffs`: one row per expected-vs-actual divergence.

## Profile duration summary

Three profiles scale every stage's duration.
Probes and gates are identical across profiles; only the work size changes. See [configuration.md § profiles](configuration.md#profiles) for the full knob reference.

| Stage | quick (~10 min) | deep (~8-12 h) | soak (~36-40 h) |
|-------|----------------|----------------|-----------------|
| Inventory | seconds | seconds | seconds |
| Firmware | seconds | seconds | seconds |
| SpecValidate | instant (server) | instant (server) | instant (server) |
| SMART | seconds per disk | seconds per disk | seconds per disk |
| CPUStress | 2 m cpu + 2 m mem | 60 m cpu + 60 m mem | 12 h cpu + 12 h mem |
| Storage | 3 m fio (sample) | badblocks + 2 h fio | badblocks + 6 h fio |
| Network | 60 s iperf | 30 m iperf | 2 h iperf |
| Burn | 2 m all-at-once | 2 h all-at-once | 18 h all-at-once |
| GPU | seconds | seconds | seconds |
| PSU | 1 m load burst | 10 m load burst | 15 m load burst |
| Reporting | instant (server) | instant (server) | instant (server) |

---

## Adding a new stage

1. Add the name to `store.DefaultStageOrder`.
2. Add a `model.State` const and wire it into `internal/orchestrator/statemachine.go` (both the forward transition table and the stage-for-state lookup).
3. Add a case to `agent/runner.go`'s `runStage` dispatch.
4. Drop the implementation into `agent/tests/`.
5. If the stage is orchestrator-owned, add a `resolve` helper to `internal/api/agent_handlers.go` and invoke it from the `/result` handler after the preceding stage's `NextState` resolves.
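
To make the steps above more concrete, here is a minimal, self-contained sketch of how an ordered stage list and a name-based dispatch can interact. It does not reproduce the real `store.DefaultStageOrder`, `model.State`, or `runStage` code; the `StageResult` and `StageFunc` types, the map-based dispatch, the `runPipeline` helper, and the stage name `Example` are all hypothetical stand-ins. The only behavior taken from this document is that stages run strictly in order and the first `passed=false` halts the run at `FailedHolding`.

```go
package main

import "fmt"

// StageResult is a hypothetical stand-in for what a stage reports back.
type StageResult struct {
	Passed  bool
	Message string
}

// StageFunc is a hypothetical signature for one stage implementation.
type StageFunc func() StageResult

// stageOrder plays the role of the ordered stage list.
// "Example" is the newly added stage in this sketch.
var stageOrder = []string{"Inventory", "Example", "Reporting"}

// runStage dispatches by name. A stage that was added to the order but never
// wired into the dispatch fails loudly instead of being skipped.
func runStage(name string, impls map[string]StageFunc) StageResult {
	impl, ok := impls[name]
	if !ok {
		return StageResult{Passed: false, Message: "no implementation registered for " + name}
	}
	return impl()
}

// runPipeline runs stages strictly in order and stops at the first failure.
// It returns the final state and the name of the failed stage, if any.
func runPipeline(order []string, impls map[string]StageFunc) (state, failedStage string) {
	for _, name := range order {
		if res := runStage(name, impls); !res.Passed {
			return "FailedHolding", name
		}
	}
	return "Completed", ""
}

func main() {
	impls := map[string]StageFunc{
		"Inventory": func() StageResult { return StageResult{Passed: true} },
		"Example":   func() StageResult { return StageResult{Passed: false, Message: "threshold exceeded"} },
		"Reporting": func() StageResult { return StageResult{Passed: true} },
	}
	state, failed := runPipeline(stageOrder, impls)
	fmt.Println(state, failed) // prints "FailedHolding Example"
}
```

The sketch also shows why steps 1 and 3 belong together: a name that appears in the order without a matching dispatch entry halts the pipeline at that stage.
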