# Test suite

What each stage measures, what "pass" means, and where the results land.

Stages run strictly in order. Any stage returning `passed=false` halts the pipeline at `FailedHolding`; the operator decides whether to fix, override, or abandon.

## Stage order

```
Inventory → SpecValidate → SMART → CPUStress → Storage → Network → GPU → PSU → Reporting
```

Stages marked *orchestrator-owned* resolve inside `/result` and never show up as "the agent's turn".

---

## Inventory

**Owner:** agent.

**What it does:** `dmidecode`, `lscpu`, `lshw`, `lspci`, `smartctl -i` over each block device, `nvidia-smi -q` if present. The raw output is merged into a single JSON blob.

**Pass:** the probes run to completion; missing optional tools (e.g. `nvidia-smi` on a GPU-less host) are tolerated.

**Artifacts:** `inventory.json` under `artifacts/run-/`.

## SpecValidate *(orchestrator-owned)*

**Owner:** orchestrator (resolves inline inside the `/result` for the preceding Inventory stage).

**What it does:** diffs the submitted inventory against the host's `expected_spec_yaml`. The diff engine classifies each field as `critical`, `warning`, or `info`.

**Pass:** zero `critical` diffs.

**Fail mode:** fires a `SpecMismatch` notification; transitions run to `Failed → FailedHolding`.

**Artifacts:** `spec_diffs` table rows (one per divergence).

## SMART

**Owner:** agent.

**What it does:** `smartctl -a /dev/` for each disk in the inventory's `expected_disks`. Parses reallocated-sector counts, pending sectors, end-to-end error counters, overall-health attribute.

**Pass:** SMART overall-health is PASSED on every expected disk and reallocated-sector count is below threshold.

**Artifacts:** `smart-.txt` raw output.

## CPUStress

**Owner:** agent.

**What it does:** runs `stress-ng --cpu N --vm M --vm-bytes 90% -t 120s` with `N = logical_cores` and `M ≈ logical_cores/2`.
The `--vm` flag is the **stand-in for Memtest86+**: it exercises the memory subsystem under load and will fail if the RAM has latent faults that surface under thermal + allocator pressure.

**Pass:** `stress-ng` exits 0 and thermal samples taken by the sidecar stay below the configured per-host `max_temp_c`.

**Caveat:** weaker than a dedicated memtest pass; see [architecture.md](architecture.md) for the reasoning (Memtest86+ can't be signalled back without IPMI serial).

## Storage

**Owner:** agent (destructive).

**What it does:**

1. **Wipe probe**: scans for filesystem signatures, LVM metadata, partition tables on the expected disks. Any hit → halt with `UnexpectedData`; operator must click **Override wipe-probe**.
2. `badblocks -svw` (destructive read/write) on each expected disk.
3. `fio --rw=randrw --bs=4k --iodepth=32 --runtime=60 --size=1G` on each disk; captures IOPS and p99 latency.

**Pass:** badblocks reports zero bad blocks; fio IOPS above a per-class floor (configurable).

**Artifacts:** `fio-.json` per disk.

**Safety gate:** the wipe-probe + device allowlist are the second and third lines of defense against wiping the wrong disk. See [architecture.md § Safety](architecture.md#safety-destructive-disk-tests).

## Network

**Owner:** agent.

**What it does:** `iperf3 -c -p -t 10 -J` to measure throughput to the orchestrator. The orchestrator-side `iperf3 -s` is supervised by `internal/orchestrator/iperf.go` and binds to the configured `network.iperf_port`.

**Pass:** throughput ≥ per-class floor (1 Gbps for 1GbE NICs, 9 Gbps for 10GbE).

**Artifacts:** `iperf-.json`.

## GPU

**Owner:** agent.

**What it does:** runs `nvidia-smi -q` and a short compute workload (`gpu-burn` if present, else `nvidia-smi dmon` during a `stress-ng --gpu` burst). Skipped cleanly when no GPU is present.

**Pass:** no ECC errors reported; temperature below threshold; compute workload exits 0.

## PSU

**Owner:** agent.
**What it does:** reads `/sys/class/hwmon/*/power_average` and `in*_input` during a synthetic load burst (CPU + disk + NIC simultaneously) to look for voltage sag or wattage anomalies. Records the full envelope as `measurements` rows with `kind=psu`.

**Pass:** no voltage dip below threshold across the load burst.

**Caveat:** only reports on what the BMC exposes via hwmon; servers without exposed PSU telemetry pass trivially. Documented limitation.

## Reporting *(orchestrator-owned)*

**Owner:** orchestrator (resolves inline inside the `/result` for PSU).

**What it does:**

1. Gathers run, host, stages, spec_diffs, and measurement aggregates.
2. Renders `report.html` via `internal/report` (html/template with inlined CSS; self-contained offline-viewable).
3. Writes `report.json` with the same data in machine-readable form.
4. Records both as `report_html` / `report_json` artifact rows.
5. Transitions run → `Completed`.
6. Fires `RunCompleted` notification.
7. The next agent heartbeat returns `cmd=shutdown`.

## Thermal sidecar

**Owner:** agent (always-on from `Booting` until the agent exits).

**What it does:** every 5 seconds, walks `/sys/class/hwmon/*` and POSTs temperature samples as a batch to `/sensor`. Populates the `measurements` table with `kind=thermal`.

**No pass/fail** on its own: stages that care about thermals read the sidecar's data via `measurements`. A dead sensor just drops out of the next batch.

---

## Where pass/fail lives

- `runs.state`: authoritative terminal state (`Completed`, `FailedHolding`, `Released`).
- `runs.result`: `pass` or `fail` string once the run completes.
- `runs.failed_stage`: name of the stage that halted the pipeline, if any. Cleared when the operator overrides and re-enters.
- `stages`: one row per attempted stage with `passed`, `started_at`, `completed_at`, `summary_json`, `message`.
- `measurements`: time-series samples from the thermal sidecar and from stages that capture numeric outputs.
- `artifacts`: on-disk files (report, fio logs, iperf logs, etc).
- `spec_diffs`: one row per expected-vs-actual divergence.

## Adding a new stage

1. Add the name to `store.DefaultStageOrder`.
2. Add a `model.State` const and wire it into `internal/orchestrator/statemachine.go` (both the forward transition table and the stage-for-state lookup).
3. Add a case to `agent/runner.go`'s `runStage` dispatch.
4. Drop the implementation into `agent/tests/`.
5. If the stage is orchestrator-owned, add a `resolve` helper to `internal/api/agent_handlers.go` and invoke it from the `/result` handler after the preceding stage's `NextState` resolves.
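
To make the steps above concrete, here is a minimal, self-contained Go sketch of how an ordered stage list and a dispatch function might fit together, including the halt-on-first-failure rule from the top of this document. It is not the project's actual code: `StageFunc`, `defaultStageOrder`, `runPipeline`, and the `Fan` stage are hypothetical names invented for illustration, and the dispatch uses a map lookup rather than the `case`-based dispatch that step 3 describes, purely to keep the example short.

```go
package main

import "fmt"

// StageFunc is a hypothetical signature for one stage implementation:
// it reports whether the stage passed, plus a short message.
type StageFunc func() (passed bool, message string)

// defaultStageOrder is a stand-in for store.DefaultStageOrder.
// "Fan" is a made-up new stage, added before Reporting (step 1).
var defaultStageOrder = []string{
	"Inventory", "SpecValidate", "SMART", "CPUStress", "Storage",
	"Network", "GPU", "PSU", "Fan", "Reporting",
}

// runStage mimics the dispatch from step 3. A stage with no registered
// implementation is treated as a failure rather than silently skipped.
func runStage(name string, impls map[string]StageFunc) (bool, string) {
	impl, ok := impls[name]
	if !ok {
		return false, "no implementation registered for " + name
	}
	return impl()
}

// runPipeline runs stages strictly in order and halts at the first
// stage that does not pass. It returns the resulting run state and the
// name of the failed stage (empty when every stage passed).
func runPipeline(order []string, impls map[string]StageFunc) (state, failedStage string) {
	for _, name := range order {
		if passed, _ := runStage(name, impls); !passed {
			return "FailedHolding", name
		}
	}
	return "Completed", ""
}

func main() {
	pass := func() (bool, string) { return true, "ok" }

	impls := map[string]StageFunc{}
	for _, name := range defaultStageOrder {
		impls[name] = pass
	}
	// Simulate the new stage failing on this host.
	impls["Fan"] = func() (bool, string) { return false, "fan RPM below floor" }

	state, failed := runPipeline(defaultStageOrder, impls)
	fmt.Println(state, failed) // prints "FailedHolding Fan"
}
```

Treating a missing implementation as a failure is a design choice of this sketch: it means forgetting step 3 or 4 surfaces as a held run rather than a stage that quietly never executes.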