# Test suite

This page covers what each stage measures, what "pass" means, and where
the results land. Stages run strictly in order. Any stage returning
`passed=false` halts the pipeline at `FailedHolding`; the operator then
decides whether to fix, override, or abandon.

## Stage order

```
Inventory → Firmware → SpecValidate → SMART → CPUStress → Storage
→ Network → Burn → GPU → PSU → Reporting
```

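The ordering and the halt-on-first-failure rule can be sketched in a few lines of Go. This is only an illustration: `stageOrder` and `runPipeline` are hypothetical names (the project keeps its own list in `store.DefaultStageOrder`), and the callback stands in for whatever actually executes a stage.

```go
package main

import "fmt"

// stageOrder mirrors the sequence shown above (hypothetical variable name).
var stageOrder = []string{
	"Inventory", "Firmware", "SpecValidate", "SMART", "CPUStress", "Storage",
	"Network", "Burn", "GPU", "PSU", "Reporting",
}

// runPipeline walks the stages in order and stops at the first failure.
// It returns the resulting run state and the failing stage, if any.
func runPipeline(run func(stage string) bool) (state, failedStage string) {
	for _, s := range stageOrder {
		if !run(s) {
			return "FailedHolding", s
		}
	}
	return "Completed", ""
}

func main() {
	// Pretend every stage passes except SMART.
	state, failed := runPipeline(func(stage string) bool { return stage != "SMART" })
	fmt.Println(state, failed)
}
```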
Stages marked *orchestrator-owned* resolve inside `/result` and never
show up as "the agent's turn".

---

## Inventory

**Owner:** agent.
**What it does:** runs `dmidecode`, `lscpu`, `lshw`, `lspci`, `smartctl -i`
on each block device, and `nvidia-smi -q` if present. The raw output is
merged into a single JSON blob.
**Pass:** the probes run to completion; missing optional tools (e.g.
`nvidia-smi` on a GPU-less host) are tolerated.
**Artifacts:** `inventory.json` under `artifacts/run-<N>/`.

## Firmware

**Owner:** agent.
**What it does:** probes firmware versions across all discoverable
components: BIOS (`dmidecode -t bios`), BMC (`ipmitool mc info`), NIC
firmware (`ethtool -i` per interface), NVMe firmware (`nvme id-ctrl`),
HBA firmware (`lspci -vv`), and CPU microcode (`/proc/cpuinfo`).
Missing tools are tolerated: a GPU-less server won't have
`nvidia-smi`, and a consumer board won't have `ipmitool`.
**Pass:** always passes. Firmware is advisory-only; SpecValidate is the
gate that fails on version mismatches.
**Artifacts:** `firmware_snapshots` table rows (one per component,
keyed by `(run_id, component, identifier)`).

## SpecValidate *(orchestrator-owned)*

**Owner:** orchestrator (resolves inline inside the `/result` for the
preceding stage, which is Firmware in the order above).
**What it does:** diffs the submitted inventory against the host's
`expected_spec_yaml`. The diff engine classifies each field as
`critical`, `warning`, or `info`.
**Pass:** zero `critical` diffs.
**Fail mode:** fires a `SpecMismatch` notification; transitions the run
to `Failed → FailedHolding`.
**Artifacts:** `spec_diffs` table rows (one per divergence).

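The classify-then-gate logic might look roughly like the following. Everything here is a sketch under assumptions: the field names, the `severityFor` table, and the flat `map[string]string` shape are invented for illustration, and the real diff engine may work quite differently.

```go
package main

import "fmt"

type severity string

const (
	critical severity = "critical"
	warning  severity = "warning"
	info     severity = "info"
)

// specDiff is one expected-vs-actual divergence (hypothetical shape).
type specDiff struct {
	Field    string
	Expected string
	Actual   string
	Severity severity
}

// severityFor is a made-up classification table; unknown fields are info.
var severityFor = map[string]severity{
	"cpu.model":       critical,
	"memory.total_gb": critical,
	"disk.count":      warning,
}

// diffSpec compares flattened spec fields and classifies each divergence.
func diffSpec(expected, actual map[string]string) []specDiff {
	var diffs []specDiff
	for field, want := range expected {
		if got := actual[field]; got != want {
			sev, ok := severityFor[field]
			if !ok {
				sev = info
			}
			diffs = append(diffs, specDiff{field, want, got, sev})
		}
	}
	return diffs
}

// passes applies the gate: the stage passes with zero critical diffs.
func passes(diffs []specDiff) bool {
	for _, d := range diffs {
		if d.Severity == critical {
			return false
		}
	}
	return true
}

func main() {
	expected := map[string]string{"cpu.model": "EPYC 7543", "disk.count": "4"}
	actual := map[string]string{"cpu.model": "EPYC 7543", "disk.count": "3"}
	diffs := diffSpec(expected, actual)
	fmt.Println(len(diffs), passes(diffs))
}
```

Warnings and info diffs would still be recorded as rows; only `critical` entries affect the gate.
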
## SMART

**Owner:** agent.
**What it does:** `smartctl -a /dev/<disk>` for each disk in the
inventory's `expected_disks`. Parses reallocated-sector counts, pending
sectors, end-to-end error counters, and the overall-health attribute.
**Pass:** SMART overall-health is PASSED on every expected disk and
the reallocated-sector count is below threshold.
**Artifacts:** `smart-<disk>.txt` raw output.

## CPUStress

**Owner:** agent.
**What it does:** runs `stress-ng --cpu N --vm M --vm-bytes 90% -t 120s`
with `N = logical_cores` and `M ≈ logical_cores/2`. The `--vm`
flag is the **stand-in for Memtest86+**: it exercises the memory
subsystem under load and will fail if the RAM has latent faults that
surface under thermal + allocator pressure.
**Pass:** `stress-ng` exits 0 and thermal samples taken by the sidecar
stay below the configured per-host `max_temp_c`.
**Caveat:** weaker than a dedicated memtest pass; see
[architecture.md](architecture.md) for the reasoning (Memtest86+
cannot report its result back without IPMI serial).

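As a rough sketch, the argument list and the pass rule could be derived like this. The helper names are hypothetical, and clamping the VM worker count to at least 1 on single-core hosts is an assumption the text does not state.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
)

// stressArgs builds the stress-ng arguments described above:
// N CPU workers, roughly N/2 VM workers, 90% of memory, fixed timeout.
func stressArgs(logicalCores int, timeout string) []string {
	vm := logicalCores / 2
	if vm < 1 {
		vm = 1 // assumption: never ask for zero VM workers
	}
	return []string{
		"--cpu", strconv.Itoa(logicalCores),
		"--vm", strconv.Itoa(vm),
		"--vm-bytes", "90%",
		"-t", timeout,
	}
}

// passed applies the stage's rule: exit code 0 and every thermal
// sample strictly below the per-host maximum.
func passed(exitCode int, samplesC []float64, maxTempC float64) bool {
	if exitCode != 0 {
		return false
	}
	for _, t := range samplesC {
		if t >= maxTempC {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("stress-ng", stressArgs(runtime.NumCPU(), "120s"))
	fmt.Println(passed(0, []float64{61.5, 72.0}, 85))
}
```
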
## Storage

**Owner:** agent (destructive).
**What it does:**

1. **Wipe probe**: scans for filesystem signatures, LVM metadata, and
   partition tables on the expected disks. Any hit → halt with
   `UnexpectedData`; operator must click **Override wipe-probe**.
2. `badblocks -svw` (destructive read/write) on each expected disk.
3. `fio --rw=randrw --bs=4k --iodepth=32 --runtime=60 --size=1G` on
   each disk; captures IOPS and p99 latency.

**Pass:** badblocks reports zero bad blocks; fio IOPS above a
per-class floor (configurable).
**Artifacts:** `fio-<disk>.json` per disk.
**Safety gate:** the wipe-probe + device allowlist are the second and
third lines of defense against wiping the wrong disk. See
[architecture.md § Safety](architecture.md#safety-destructive-disk-tests).

## Network

**Owner:** agent.
**What it does:** `iperf3 -c <orchestrator> -p <iperf_port> -t 10 -J`
to measure throughput to the orchestrator. The orchestrator-side
`iperf3 -s` is supervised by `internal/orchestrator/iperf.go` and
binds to the configured `network.iperf_port`.
**Pass:** throughput ≥ per-class floor (1 Gbps for 1GbE NICs, 9 Gbps
for 10GbE).
**Artifacts:** `iperf-<nic>.json`.

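For illustration, the `-J` output can be reduced to a single throughput number and compared against the floor as below. The struct decodes only the `end.sum_received.bits_per_second` field from iperf3's JSON report; whether the agent reads the received or the sent summary is an assumption here, as are the helper names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// iperfReport decodes just the field needed from `iperf3 -J` output.
type iperfReport struct {
	End struct {
		SumReceived struct {
			BitsPerSecond float64 `json:"bits_per_second"`
		} `json:"sum_received"`
	} `json:"end"`
}

// throughputGbps extracts the received throughput in Gbps.
func throughputGbps(raw []byte) (float64, error) {
	var r iperfReport
	if err := json.Unmarshal(raw, &r); err != nil {
		return 0, err
	}
	return r.End.SumReceived.BitsPerSecond / 1e9, nil
}

// meetsFloor applies the pass rule: throughput at or above the floor.
func meetsFloor(gbps, floorGbps float64) bool {
	return gbps >= floorGbps
}

func main() {
	// Trimmed-down sample report; a real one carries many more fields.
	raw := []byte(`{"end":{"sum_received":{"bits_per_second":9.4e9}}}`)
	gbps, err := throughputGbps(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%.2f Gbps, pass=%v\n", gbps, meetsFloor(gbps, 9))
}
```
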
## Burn

**Owner:** agent.
**What it does:** runs CPU stress, memory stress, disk I/O, and
network throughput **simultaneously** for the profile's burn duration.
The goal is to stress every subsystem at once and surface failures that
only appear under combined load (thermal throttling, PSU voltage sag,
memory errors under thermal pressure).

Sub-workloads run as parallel goroutines:

- **CPU**: `stress-ng --cpu <workers>` for the burn duration.
- **Memory**: `stress-ng --vm --vm-bytes <mem_pct>%` for the burn
  duration.
- **Disk**: `fio` against a spare partition (when `fio_on_spare` is
  enabled).
- **Network**: `iperf3 -c <orchestrator> -P <parallel>` for the burn
  duration.

**Pass:** all four sub-workloads exit 0 and no critical threshold
breach fires during the window.
**Configurable knobs** (per profile):

| Knob | Description |
|------|-------------|
| `duration` | Total burn-in window. |
| `cpu_workers` | `all` = `runtime.NumCPU()`, or a fixed count. |
| `mem_pct` | Percentage of MemAvailable to stress. |
| `fio_on_spare` | Run fio inside Burn (requires a spare partition). |
| `iperf_parallel` | Parallel stream count for `iperf3 -P`. |

See [configuration.md § burn](configuration.md#burn) for per-profile
default values.

## GPU

**Owner:** agent.
**What it does:** runs `nvidia-smi -q` and a short compute workload
(`gpu-burn` if present, else `nvidia-smi dmon` during a
`stress-ng --gpu` burst). Skipped cleanly when no GPU is present.
**Pass:** no ECC errors reported; temperature below threshold; compute
workload exits 0.

## PSU

**Owner:** agent.
**What it does:** reads `/sys/class/hwmon/*/power_average` and `in*_input`
during a synthetic load burst (CPU + disk + NIC simultaneously) to
look for voltage sag or wattage anomalies. Records the full envelope
as `measurements` rows with `kind=psu`.
**Pass:** no voltage dip below threshold across the load burst.
**Caveat:** only reports on what the BMC exposes via hwmon, so servers
without exposed PSU telemetry pass trivially. This is a documented
limitation.

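A minimal sketch of the sag check, assuming the threshold is an absolute minimum voltage. The kernel's hwmon interface reports `in*_input` in millivolts, so the sketch stays in that unit; the function names and the 11400 mV figure in `main` are illustrative, not the project's configuration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readMillivolts reads one hwmon in*_input file (value is in millivolts).
func readMillivolts(path string) (int, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(b)))
}

// sagged reports whether any sample dipped below the minimum voltage.
// With no samples it returns false, which matches the caveat above:
// hosts without PSU telemetry pass trivially.
func sagged(samplesMV []int, minMV int) bool {
	for _, s := range samplesMV {
		if s < minMV {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative 12 V rail samples taken during a load burst.
	samples := []int{12050, 11980, 11410}
	fmt.Println("sag detected:", sagged(samples, 11400))
}
```
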
## Reporting *(orchestrator-owned)*

**Owner:** orchestrator (resolves inline inside the `/result` for PSU).
**What it does:**

1. Gathers run, host, stages, spec_diffs, and measurement aggregates.
2. Renders `report.html` via `internal/report` (html/template with
   inlined CSS; self-contained offline-viewable).
3. Writes `report.json` with the same data in machine-readable form.
4. Records both as `report_html` / `report_json` artifact rows.
5. Transitions run → `Completed`.
6. Fires `RunCompleted` notification.
7. The next agent heartbeat returns `cmd=shutdown`.

## Thermal sidecar

**Owner:** agent (always-on from `Booting` until the agent exits).
**What it does:** every 5 seconds, walks `/sys/class/hwmon/*` and
POSTs temperature samples as a batch to `/sensor`. Populates the
`measurements` table with `kind=thermal`.
**No pass/fail** on its own: stages that care about thermals read the
sidecar's data via `measurements`. A dead sensor just drops out of
the next batch.

---

## Where pass/fail lives

- `runs.state`: authoritative terminal state (`Completed`,
  `FailedHolding`, `Released`).
- `runs.result`: `pass` or `fail` string once the run completes.
- `runs.failed_stage`: name of the stage that halted the pipeline, if
  any. Cleared when the operator overrides and re-enters.
- `stages`: one row per attempted stage with `passed`, `started_at`,
  `completed_at`, `summary_json`, `message`.
- `measurements`: time-series samples from the thermal sidecar and
  from stages that capture numeric outputs.
- `artifacts`: on-disk files (report, fio logs, iperf logs, etc).
- `spec_diffs`: one row per expected-vs-actual divergence.

## Profile duration summary

Three profiles scale every stage's duration. Probes and gates are
identical across profiles; only the work size changes. See
[configuration.md § profiles](configuration.md#profiles) for the full
knob reference.

| Stage | quick (~10 min) | deep (~8-12 h) | soak (~36-40 h) |
|-------|----------------|----------------|-----------------|
| Inventory | seconds | seconds | seconds |
| Firmware | seconds | seconds | seconds |
| SpecValidate | instant (server) | instant (server) | instant (server) |
| SMART | seconds per disk | seconds per disk | seconds per disk |
| CPUStress | 2 m cpu + 2 m mem | 60 m cpu + 60 m mem | 12 h cpu + 12 h mem |
| Storage | 3 m fio (sample) | badblocks + 2 h fio | badblocks + 6 h fio |
| Network | 60 s iperf | 30 m iperf | 2 h iperf |
| Burn | 2 m all-at-once | 2 h all-at-once | 18 h all-at-once |
| GPU | seconds | seconds | seconds |
| PSU | 1 m load burst | 10 m load burst | 15 m load burst |
| Reporting | instant (server) | instant (server) | instant (server) |

---

## Adding a new stage

1. Add the name to `store.DefaultStageOrder`.
2. Add a `model.State<Name>` const and wire it into
   `internal/orchestrator/statemachine.go` (both the forward
   transition table and the stage-for-state lookup).
3. Add a case to `agent/runner.go`'s `runStage` dispatch.
4. Drop the implementation into `agent/tests/`.
5. If the stage is orchestrator-owned, add a `resolve<Name>` helper to
   `internal/api/agent_handlers.go` and invoke it from the `/result`
   handler after the preceding stage's `NextState` resolves.