# Test suite

What each stage measures, what "pass" means, and where the results
land. Stages run strictly in order. Any stage returning `passed=false`
halts the pipeline at `FailedHolding` — the operator decides whether
to fix, override, or abandon.

## Stage order

```
Inventory → Firmware → SpecValidate → SMART → CPUStress → Storage
          → Network → Burn → GPU → PSU → Reporting
```

Stages marked *orchestrator-owned* are resolved by the orchestrator
inside the `/result` handler of the preceding stage, so they never show
up as "the agent's turn".
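
The halt-on-first-failure rule can be sketched as follows. This is a minimal illustration, not the orchestrator's code: `StageResult`, `stageOrder`, and `runPipeline` are hypothetical names, and the real state machine has more states than the two returned here.

```go
package main

// StageResult is a hypothetical shape for what one stage reports.
type StageResult struct {
	Stage  string
	Passed bool
}

// stageOrder mirrors the documented stage order.
var stageOrder = []string{
	"Inventory", "Firmware", "SpecValidate", "SMART", "CPUStress",
	"Storage", "Network", "Burn", "GPU", "PSU", "Reporting",
}

// runPipeline executes stages strictly in order and stops at the first
// one that reports Passed=false. It returns the resulting run state and
// the name of the failed stage ("" when every stage passed).
func runPipeline(run func(stage string) StageResult) (state, failedStage string) {
	for _, name := range stageOrder {
		if res := run(name); !res.Passed {
			return "FailedHolding", name
		}
	}
	return "Completed", ""
}
```

Stages after the failing one are never invoked, which is what leaves the operator free to fix, override, or abandon.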

---

## Inventory

**Owner:** agent.
**What it does:** `dmidecode`, `lscpu`, `lshw`, `lspci`, `smartctl -i`
over each block device, `nvidia-smi -q` if present. The raw output is
merged into a single JSON blob.
**Pass:** the probes run to completion; missing optional tools (e.g.
`nvidia-smi` on a GPU-less host) are tolerated.
**Artifacts:** `inventory.json` under `artifacts/run-<N>/`.
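
As a rough sketch of the merge step, the helper below runs a list of probes and folds their raw output into one JSON object, skipping optional tools that are not installed. The `Probe` type, the function names, and the one-key-per-probe JSON layout are assumptions for illustration; the sketch also treats a non-zero exit from an installed tool as a failure, which the text above does not spell out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// Probe is a hypothetical description of one inventory command.
type Probe struct {
	Name     string
	Argv     []string
	Optional bool // optional tools may be missing without failing the stage
}

// collectInventory runs every probe and merges the raw output into a
// single JSON object keyed by probe name. have reports whether a tool
// is installed; run executes argv and returns its stdout.
func collectInventory(probes []Probe, have func(tool string) bool,
	run func(argv []string) ([]byte, error)) ([]byte, error) {

	merged := make(map[string]string, len(probes))
	for _, p := range probes {
		if len(p.Argv) == 0 {
			return nil, fmt.Errorf("probe %q has no command", p.Name)
		}
		if !have(p.Argv[0]) {
			if p.Optional {
				continue // e.g. nvidia-smi on a GPU-less host
			}
			return nil, fmt.Errorf("required tool %q not found", p.Argv[0])
		}
		out, err := run(p.Argv)
		if err != nil {
			return nil, fmt.Errorf("%s: %w", p.Name, err)
		}
		merged[p.Name] = string(out)
	}
	return json.Marshal(merged)
}

// onPath and execRun are the real-world implementations of have and run.
func onPath(tool string) bool {
	_, err := exec.LookPath(tool)
	return err == nil
}

func execRun(argv []string) ([]byte, error) {
	return exec.Command(argv[0], argv[1:]...).Output()
}
```

Passing `onPath` and `execRun` gives the live behavior; injecting fakes keeps the merge logic testable without the tools installed.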

## Firmware

**Owner:** agent.
**What it does:** probes firmware versions across all discoverable
components: BIOS (`dmidecode -t bios`), BMC (`ipmitool mc info`), NIC
firmware (`ethtool -i` per interface), NVMe firmware (`nvme id-ctrl`),
HBA firmware (`lspci -vv`), and CPU microcode (`/proc/cpuinfo`).
Missing tools are tolerated — a GPU-less server won't have
`nvidia-smi`, a consumer board won't have `ipmitool`.
**Pass:** always passes. Firmware is advisory-only; SpecValidate is the
gate that fails on version mismatches.
**Artifacts:** `firmware_snapshots` table rows (one per component,
keyed by `(run_id, component, identifier)`).

## SpecValidate *(orchestrator-owned)*

**Owner:** orchestrator (resolves inline inside the `/result` for the
preceding Firmware stage).
**What it does:** diffs the submitted inventory against the host's
`expected_spec_yaml`. The diff engine classifies each field as
`critical`, `warning`, or `info`.
**Pass:** zero `critical` diffs.
**Fail mode:** fires a `SpecMismatch` notification; transitions run
to `Failed → FailedHolding`.
**Artifacts:** `spec_diffs` table rows (one per divergence).
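
A minimal sketch of the classification rule, assuming the inventory and the expected spec have both been flattened to `field → value` maps. The `rules` map, the type names, and the default of `info` for unlisted fields are illustrative assumptions, not the diff engine's actual design.

```go
package main

import "sort"

type Severity string

const (
	Critical Severity = "critical"
	Warning  Severity = "warning"
	Info     Severity = "info"
)

// SpecDiff is one expected-vs-actual divergence.
type SpecDiff struct {
	Field, Expected, Actual string
	Severity                Severity
}

// diffSpec compares expected against actual field by field and tags
// each divergence with the severity from rules. Fields without a rule
// default to Info (an assumption of this sketch).
func diffSpec(expected, actual map[string]string, rules map[string]Severity) []SpecDiff {
	var diffs []SpecDiff
	for field, want := range expected {
		got := actual[field]
		if got == want {
			continue
		}
		sev, ok := rules[field]
		if !ok {
			sev = Info
		}
		diffs = append(diffs, SpecDiff{Field: field, Expected: want, Actual: got, Severity: sev})
	}
	// Map iteration order is random, so sort for a stable result.
	sort.Slice(diffs, func(i, j int) bool { return diffs[i].Field < diffs[j].Field })
	return diffs
}

// specPasses implements the gate: zero critical diffs.
func specPasses(diffs []SpecDiff) bool {
	for _, d := range diffs {
		if d.Severity == Critical {
			return false
		}
	}
	return true
}
```

Warnings and info-level diffs are still returned so they can be recorded, but only a `critical` entry fails the stage.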

## SMART

**Owner:** agent.
**What it does:** `smartctl -a /dev/<disk>` for each disk in the
inventory's `expected_disks`. Parses reallocated-sector counts, pending
sectors, end-to-end error counters, overall-health attribute.
**Pass:** SMART overall-health is PASSED on every expected disk and
reallocated-sector count is below threshold.
**Artifacts:** `smart-<disk>.txt` raw output.
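
The parse-and-gate step might look like the sketch below. It assumes the classic ATA text layout of `smartctl -a`, where the raw value is the tenth whitespace-separated column of an attribute row; NVMe devices print a different report and would need separate handling. `SmartSummary`, `parseSmart`, and `smartPasses` are hypothetical names.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// SmartSummary holds the fields the pass rule looks at.
type SmartSummary struct {
	HealthPassed bool
	Reallocated  int64
	Pending      int64
}

// parseSmart extracts the overall-health verdict and two raw attribute
// values from ATA-style `smartctl -a` text output.
func parseSmart(output string) SmartSummary {
	var s SmartSummary
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "overall-health self-assessment test result:") {
			s.HealthPassed = strings.HasSuffix(strings.TrimSpace(line), "PASSED")
			continue
		}
		f := strings.Fields(line)
		if len(f) < 10 {
			continue // not an attribute row
		}
		raw, err := strconv.ParseInt(f[9], 10, 64)
		if err != nil {
			continue
		}
		switch f[1] {
		case "Reallocated_Sector_Ct":
			s.Reallocated = raw
		case "Current_Pending_Sector":
			s.Pending = raw
		}
	}
	return s
}

// smartPasses applies the documented rule: health PASSED and the
// reallocated-sector count strictly below the threshold.
func smartPasses(s SmartSummary, maxReallocated int64) bool {
	return s.HealthPassed && s.Reallocated < maxReallocated
}
```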

## CPUStress

**Owner:** agent.
**What it does:** runs
`stress-ng --cpu N --vm M --vm-bytes 90% -t 120s` with
`N = logical_cores` and `M ≈ logical_cores/2`. The `-t` value shown is
the quick-profile duration; deeper profiles run longer (see the profile
duration summary below). The `--vm` flag is the **stand-in for
Memtest86+**: it exercises the memory subsystem under load and fails if
the RAM has latent faults that surface under thermal and allocator
pressure.
**Pass:** `stress-ng` exits 0 and thermal samples taken by the sidecar
stay below the configured per-host `max_temp_c`.
**Caveat:** weaker than a dedicated memtest pass; see
[architecture.md](architecture.md) for the reasoning (Memtest86+
can't be signalled back without IPMI serial).
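
To make the worker arithmetic and the pass rule concrete, here is a small sketch. The function names are hypothetical, and the floor of one `--vm` worker on single-core hosts is an assumption of the sketch rather than documented behavior.

```go
package main

import "strconv"

// stressArgs builds the stress-ng argument list: one CPU worker per
// logical core and roughly half as many vm workers (at least one).
func stressArgs(logicalCores int, duration string) []string {
	vm := logicalCores / 2
	if vm < 1 {
		vm = 1
	}
	return []string{
		"--cpu", strconv.Itoa(logicalCores),
		"--vm", strconv.Itoa(vm),
		"--vm-bytes", "90%",
		"-t", duration,
	}
}

// cpuStressPasses applies the pass rule: stress-ng exited 0 and every
// thermal sample stayed below the host's max_temp_c.
func cpuStressPasses(exitCode int, tempsC []float64, maxTempC float64) bool {
	if exitCode != 0 {
		return false
	}
	for _, t := range tempsC {
		if t >= maxTempC {
			return false
		}
	}
	return true
}
```

For an 8-thread host, `stressArgs(8, "120s")` yields `--cpu 8 --vm 4 --vm-bytes 90% -t 120s`.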

## Storage

**Owner:** agent (destructive).
**What it does:**

1. **Wipe probe** — scans for filesystem signatures, LVM metadata,
   partition tables on the expected disks. Any hit → halt with
   `UnexpectedData`; operator must click **Override wipe-probe**.
2. `badblocks -svw` (destructive read/write) on each expected disk.
3. `fio --rw=randrw --bs=4k --iodepth=32 --runtime=60 --size=1G` on
   each disk; captures IOPS and p99 latency.

**Pass:** badblocks reports zero bad blocks; fio IOPS above a
per-class floor (configurable).
**Artifacts:** `fio-<disk>.json` per disk.
**Safety gate:** the wipe-probe + device allowlist are the second and
third lines of defense against wiping the wrong disk. See
[architecture.md § Safety](architecture.md#safety-destructive-disk-tests).
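
The pass check on the fio output could be sketched as below. It assumes fio was run with `--output-format=json` and relies on the `jobs[].read` / `jobs[].write` objects with their `iops` and `clat_ns.percentile` fields as emitted by fio 3.x; summing read and write IOPS and treating the floor as inclusive are choices made for this sketch, not confirmed behavior.

```go
package main

import "encoding/json"

// fioDir is the subset of one direction (read or write) that we need.
type fioDir struct {
	IOPS   float64 `json:"iops"`
	ClatNS struct {
		Percentile map[string]float64 `json:"percentile"`
	} `json:"clat_ns"`
}

type fioReport struct {
	Jobs []struct {
		Read  fioDir `json:"read"`
		Write fioDir `json:"write"`
	} `json:"jobs"`
}

// fioSummary returns total IOPS across all jobs and the worst p99
// completion latency in milliseconds.
func fioSummary(raw []byte) (iops, p99ms float64, err error) {
	var r fioReport
	if err = json.Unmarshal(raw, &r); err != nil {
		return 0, 0, err
	}
	for _, j := range r.Jobs {
		iops += j.Read.IOPS + j.Write.IOPS
		for _, d := range []fioDir{j.Read, j.Write} {
			if ms := d.ClatNS.Percentile["99.000000"] / 1e6; ms > p99ms {
				p99ms = ms
			}
		}
	}
	return iops, p99ms, nil
}

// storagePasses: zero bad blocks and IOPS at or above the class floor.
func storagePasses(badBlocks int, iops, floorIOPS float64) bool {
	return badBlocks == 0 && iops >= floorIOPS
}
```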

## Network

**Owner:** agent.
**What it does:** `iperf3 -c <orchestrator> -p <iperf_port> -t 10 -J`
to measure throughput to the orchestrator. The orchestrator-side
`iperf3 -s` is supervised by `internal/orchestrator/iperf.go` and
binds to the configured `network.iperf_port`.
**Pass:** throughput ≥ per-class floor (1 Gbps for 1GbE NICs, 9 Gbps
for 10GbE).
**Artifacts:** `iperf-<nic>.json`.
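
A sketch of the throughput check, reading the receiver-side summary (`end.sum_received.bits_per_second`) from the JSON that `iperf3 -J` prints for a TCP test. The function names and the choice of the receiver-side figure over the sender-side one are assumptions of this sketch.

```go
package main

import "encoding/json"

// iperfReport picks out the receiver-side summary of an iperf3 -J run.
type iperfReport struct {
	End struct {
		SumReceived struct {
			BitsPerSecond float64 `json:"bits_per_second"`
		} `json:"sum_received"`
	} `json:"end"`
}

// throughputGbps returns the measured receive throughput in Gbps.
func throughputGbps(raw []byte) (float64, error) {
	var r iperfReport
	if err := json.Unmarshal(raw, &r); err != nil {
		return 0, err
	}
	return r.End.SumReceived.BitsPerSecond / 1e9, nil
}

// networkPasses applies the documented rule: throughput ≥ class floor.
func networkPasses(gbps, floorGbps float64) bool {
	return gbps >= floorGbps
}
```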

## Burn

**Owner:** agent.
**What it does:** runs CPU stress, memory stress, disk I/O, and
network throughput **simultaneously** for the profile's burn duration.
The goal is to stress every subsystem at once and surface failures that
only appear under combined load (thermal throttling, PSU voltage sag,
memory errors under thermal pressure).

Sub-workloads run as parallel goroutines:

- **CPU** — `stress-ng --cpu <workers>` for the burn duration.
- **Memory** — `stress-ng --vm --vm-bytes <mem_pct>%` for the burn
  duration.
- **Disk** — `fio` against a spare partition (when `fio_on_spare` is
  enabled).
- **Network** — `iperf3 -c <orchestrator> -P <parallel>` for the burn
  duration.

**Pass:** all four sub-workloads exit 0 and no critical threshold
breach fires during the window.
**Configurable knobs** (per profile):

| Knob | Description |
|------|-------------|
| `duration` | Total burn-in window. |
| `cpu_workers` | `all` = `runtime.NumCPU()`, or a fixed count. |
| `mem_pct` | Percentage of MemAvailable to stress. |
| `fio_on_spare` | Run fio inside Burn (requires a spare partition). |
| `iperf_parallel` | Parallel stream count for `iperf3 -P`. |

See [configuration.md § burn](configuration.md#burn) for per-profile
default values.
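
The fan-out described for the sub-workloads can be sketched with a `sync.WaitGroup`. `Workload`, `runBurn`, and `burnPasses` are hypothetical names, and each workload is modeled as a function returning its process exit code; the real agent presumably wraps `exec.Cmd` and also honors the burn duration.

```go
package main

import "sync"

// Workload is one burn sub-workload; Run blocks until the underlying
// process finishes and returns its exit code.
type Workload struct {
	Name string
	Run  func() int
}

// runBurn starts every workload in its own goroutine, waits for all of
// them, and returns the non-zero exit codes keyed by workload name.
func runBurn(workloads []Workload) map[string]int {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		failures = make(map[string]int)
	)
	for _, w := range workloads {
		wg.Add(1)
		go func(w Workload) {
			defer wg.Done()
			if code := w.Run(); code != 0 {
				mu.Lock()
				failures[w.Name] = code
				mu.Unlock()
			}
		}(w)
	}
	wg.Wait()
	return failures
}

// burnPasses: every sub-workload exited 0 and no critical threshold
// breach fired during the window.
func burnPasses(failures map[string]int, thresholdBreached bool) bool {
	return len(failures) == 0 && !thresholdBreached
}
```

Waiting for every goroutine, instead of cancelling on the first failure, keeps the combined load applied for the whole window; whether the real agent does the same is not stated here.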

## GPU

**Owner:** agent.
**What it does:** runs `nvidia-smi -q` and a short compute workload
(`gpu-burn` if present, else `nvidia-smi dmon` during a `stress-ng
--gpu` burst). Skipped cleanly when no GPU is present.
**Pass:** no ECC errors reported; temperature below threshold; compute
workload exits 0.

## PSU

**Owner:** agent.
**What it does:** reads `/sys/class/hwmon/*/power*_average` and
`in*_input` during a synthetic load burst (CPU + disk + NIC
simultaneously) to look for voltage sag or wattage anomalies. Records
the full envelope as `measurements` rows with `kind=psu`.
**Pass:** no voltage dip below threshold across the load burst.
**Caveat:** only reports on what the BMC exposes via hwmon, so servers
without exposed PSU telemetry pass trivially. This is a documented
limitation.
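
The sag check reduces to finding the lowest reading in the burst. The sketch below assumes samples are raw hwmon `in*_input` values, which the kernel reports in millivolts; the function names and the per-rail `floorVolts` parameter are hypothetical.

```go
package main

// minVolts returns the lowest reading in volts. ok is false when no
// samples were collected (no PSU telemetry exposed).
func minVolts(samplesMilliV []int64) (volts float64, ok bool) {
	if len(samplesMilliV) == 0 {
		return 0, false
	}
	lowest := samplesMilliV[0]
	for _, s := range samplesMilliV[1:] {
		if s < lowest {
			lowest = s
		}
	}
	return float64(lowest) / 1000, true
}

// psuPasses fails only when a reading dips below the floor. With no
// telemetry at all the stage passes trivially, as the caveat notes.
func psuPasses(samplesMilliV []int64, floorVolts float64) bool {
	low, ok := minVolts(samplesMilliV)
	if !ok {
		return true
	}
	return low >= floorVolts
}
```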

## Reporting *(orchestrator-owned)*

**Owner:** orchestrator (resolves inline inside the `/result` for PSU).
**What it does:**

1. Gathers run, host, stages, spec_diffs, and measurement aggregates.
2. Renders `report.html` via `internal/report` (html/template with
   inlined CSS; self-contained offline-viewable).
3. Writes `report.json` with the same data in machine-readable form.
4. Records both as `report_html` / `report_json` artifact rows.
5. Transitions run → `Completed`.
6. Fires `RunCompleted` notification.
7. The next agent heartbeat returns `cmd=shutdown`.
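
Step 2 can be illustrated with a tiny `html/template` example that inlines its CSS so the output is a single self-contained file. The template text, `ReportData`, and `renderReport` are made up for this sketch and are far simpler than whatever `internal/report` actually renders.

```go
package main

import (
	"bytes"
	"html/template"
)

// StageRow and ReportData are hypothetical view models.
type StageRow struct {
	Name   string
	Passed bool
}

type ReportData struct {
	Host   string
	Result string
	Stages []StageRow
}

// reportTmpl carries its CSS inline, so the rendered page needs no
// external assets and can be opened offline.
var reportTmpl = template.Must(template.New("report").Parse(
	`<!doctype html><html><head><style>.fail{color:#b00}</style></head>` +
		`<body><h1>{{.Host}}: {{.Result}}</h1><ul>` +
		`{{range .Stages}}<li class="{{if .Passed}}pass{{else}}fail{{end}}">{{.Name}}</li>{{end}}` +
		`</ul></body></html>`))

// renderReport executes the template into a string. html/template
// escapes the interpolated values for their HTML context.
func renderReport(d ReportData) (string, error) {
	var buf bytes.Buffer
	if err := reportTmpl.Execute(&buf, d); err != nil {
		return "", err
	}
	return buf.String(), nil
}
```

Using `html/template` instead of `text/template` means host names or stage messages containing markup are escaped automatically.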

## Thermal sidecar

**Owner:** agent (always-on from `Booting` until the agent exits).
**What it does:** every 5 seconds, walks `/sys/class/hwmon/*` and
POSTs temperature samples as a batch to `/sensor`. Populates the
`measurements` table with `kind=thermal`.
**No pass/fail** on its own — stages that care about thermals read the
sidecar's data via `measurements`. A dead sensor just drops out of
the next batch.
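
One sampling pass might be sketched like this. It assumes the sidecar reads hwmon `temp*_input` files, which report millidegrees Celsius; `ThermalSample`, its JSON field names, and `collectBatch` are hypothetical, and the 5-second ticker and the POST to `/sensor` are left out.

```go
package main

import (
	"strconv"
	"strings"
)

// ThermalSample is a hypothetical wire shape for one reading.
type ThermalSample struct {
	Sensor string  `json:"sensor"`
	TempC  float64 `json:"temp_c"`
}

// collectBatch reads every sensor path once. A sensor that cannot be
// read or parsed is skipped, so it simply drops out of this batch.
func collectBatch(paths []string, read func(path string) (string, error)) []ThermalSample {
	batch := make([]ThermalSample, 0, len(paths))
	for _, p := range paths {
		raw, err := read(p)
		if err != nil {
			continue
		}
		milli, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
		if err != nil {
			continue
		}
		batch = append(batch, ThermalSample{Sensor: p, TempC: float64(milli) / 1000})
	}
	return batch
}
```

In a live agent, `read` would wrap `os.ReadFile` and `paths` would come from a glob over `/sys/class/hwmon/*/temp*_input`.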

---

## Where pass/fail lives

- `runs.state` — authoritative terminal state (`Completed`,
  `FailedHolding`, `Released`).
- `runs.result` — `pass` or `fail` string once the run completes.
- `runs.failed_stage` — name of the stage that halted the pipeline, if
  any. Cleared when the operator overrides and re-enters.
- `stages` — one row per attempted stage with `passed`, `started_at`,
  `completed_at`, `summary_json`, `message`.
- `measurements` — time-series samples from the thermal sidecar and
  from stages that capture numeric outputs.
- `artifacts` — on-disk files (report, fio logs, iperf logs, etc).
- `spec_diffs` — one row per expected-vs-actual divergence.

## Profile duration summary

Three profiles scale the duration of the load stages. Probes and gates
are identical across profiles; only the work size changes, so the probe
stages take seconds under every profile. See
[configuration.md § profiles](configuration.md#profiles) for the full
knob reference.

| Stage | quick (~10 min) | deep (~8-12 h) | soak (~36-40 h) |
|-------|----------------|----------------|-----------------|
| Inventory | seconds | seconds | seconds |
| Firmware | seconds | seconds | seconds |
| SpecValidate | instant (server) | instant (server) | instant (server) |
| SMART | seconds per disk | seconds per disk | seconds per disk |
| CPUStress | 2 m cpu + 2 m mem | 60 m cpu + 60 m mem | 12 h cpu + 12 h mem |
| Storage | 3 m fio (sample) | badblocks + 2 h fio | badblocks + 6 h fio |
| Network | 60 s iperf | 30 m iperf | 2 h iperf |
| Burn | 2 m all-at-once | 2 h all-at-once | 18 h all-at-once |
| GPU | seconds | seconds | seconds |
| PSU | 1 m load burst | 10 m load burst | 15 m load burst |
| Reporting | instant (server) | instant (server) | instant (server) |

---

## Adding a new stage

1. Add the name to `store.DefaultStageOrder`.
2. Add a `model.State<Name>` const and wire it into
   `internal/orchestrator/statemachine.go` (both the forward
   transition table and the stage-for-state lookup).
3. Add a case to `agent/runner.go`'s `runStage` dispatch.
4. Drop the implementation into `agent/tests/`.
5. If the stage is orchestrator-owned, add a `resolve<Name>` helper to
   `internal/api/agent_handlers.go` and invoke it from the `/result`
   handler after the preceding stage's `NextState` resolves.