docs: comprehensive documentation expansion

Add 4 new doc files (configuration reference, development guide, API
reference with full request/response schemas, database schema), expand
the README with a feature list and how-it-works walkthrough, fix
missing Firmware and Burn stages in architecture.md and test-suite.md,
add threshold engine and host-mode agent sections, and add godoc
comments to 11 packages and 6 model types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-23 18:37:26 -04:00
parent 17ec55cb85
commit 8367ec2a9f
18 changed files with 1548 additions and 10 deletions
+73 -2
@@ -8,8 +8,8 @@ to fix, override, or abandon.
## Stage order
```
Inventory → SpecValidate → SMART → CPUStress → Storage
→ Network → GPU → PSU → Reporting
Inventory → Firmware → SpecValidate → SMART → CPUStress → Storage
→ Network → Burn → GPU → PSU → Reporting
```
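
As a rough illustration of how the new order could be walked in code, here is a minimal sketch. The `stageOrder` slice and `nextStage` helper are hypothetical names for this example only; the project keeps its real list in `store.DefaultStageOrder`, whose type and contents are not shown in this diff.

```go
package main

import "fmt"

// stageOrder mirrors the updated stage order from the doc.
// Hypothetical stand-in for store.DefaultStageOrder.
var stageOrder = []string{
	"Inventory", "Firmware", "SpecValidate", "SMART", "CPUStress",
	"Storage", "Network", "Burn", "GPU", "PSU", "Reporting",
}

// nextStage returns the stage that follows current.
// The bool is false when current is the last stage or is unknown.
func nextStage(current string) (string, bool) {
	for i, s := range stageOrder {
		if s == current && i+1 < len(stageOrder) {
			return stageOrder[i+1], true
		}
	}
	return "", false
}

func main() {
	// Walk the whole pipeline from the first stage.
	s := stageOrder[0]
	for {
		fmt.Println(s)
		next, ok := nextStage(s)
		if !ok {
			break
		}
		s = next
	}
}
```
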
Stages marked *orchestrator-owned* resolve inside `/result` and never
@@ -27,6 +27,20 @@ merged into a single JSON blob.
`nvidia-smi` on a GPU-less host) are tolerated.
**Artifacts:** `inventory.json` under `artifacts/run-<N>/`.
## Firmware
**Owner:** agent.
**What it does:** probes firmware versions across all discoverable
components: BIOS (`dmidecode -t bios`), BMC (`ipmitool mc info`), NIC
firmware (`ethtool -i` per interface), NVMe firmware (`nvme id-ctrl`),
HBA firmware (`lspci -vv`), and CPU microcode (`/proc/cpuinfo`).
Missing tools are tolerated: a GPU-less server won't have
`nvidia-smi`, and a consumer board won't have `ipmitool`.
**Pass:** always passes. Firmware is advisory-only; SpecValidate is the
gate that fails on version mismatches.
**Artifacts:** `firmware_snapshots` table rows (one per component,
keyed by `(run_id, component, identifier)`).
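
The "missing tools are tolerated" behavior could look roughly like the sketch below. It assumes each probe is a (component, tool, args) triple and that a missing binary is reported as a skip rather than a failure. The `probe` type, `runProbe`, `isMissing`, and `errToolMissing` are hypothetical names, not the agent's actual code; only the command lines come from the section above.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// probe pairs a firmware component with the command that reports its version.
type probe struct {
	Component string
	Tool      string
	Args      []string
}

// errToolMissing marks a probe that was skipped because its tool is absent.
var errToolMissing = errors.New("tool not installed")

// runProbe resolves the tool with lookPath and runs it, returning raw output.
// lookPath is injectable so the skip path can be exercised without the tools.
func runProbe(p probe, lookPath func(string) (string, error)) ([]byte, error) {
	path, err := lookPath(p.Tool)
	if err != nil {
		return nil, fmt.Errorf("%s: %w", p.Tool, errToolMissing)
	}
	return exec.Command(path, p.Args...).CombinedOutput()
}

// isMissing reports whether err means the tool was not installed.
func isMissing(err error) bool { return errors.Is(err, errToolMissing) }

func main() {
	probes := []probe{
		{Component: "bios", Tool: "dmidecode", Args: []string{"-t", "bios"}},
		{Component: "bmc", Tool: "ipmitool", Args: []string{"mc", "info"}},
	}
	for _, p := range probes {
		out, err := runProbe(p, exec.LookPath)
		switch {
		case isMissing(err):
			fmt.Printf("%s: skipped (%v)\n", p.Component, err)
		case err != nil:
			fmt.Printf("%s: probe failed: %v\n", p.Component, err)
		default:
			fmt.Printf("%s: %d bytes of output\n", p.Component, len(out))
		}
	}
}
```

Because the stage is advisory-only, a caller built this way would record whatever it collected and still report a pass.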
## SpecValidate *(orchestrator-owned)*
**Owner:** orchestrator (resolves inline inside the `/result` for the
@@ -93,6 +107,40 @@ binds to the configured `network.iperf_port`.
for 10GbE).
**Artifacts:** `iperf-<nic>.json`.
## Burn
**Owner:** agent.
**What it does:** runs CPU stress, memory stress, disk I/O, and
network throughput **simultaneously** for the profile's burn duration.
The goal is to stress every subsystem at once and surface failures that
only appear under combined load (thermal throttling, PSU voltage sag,
memory errors under thermal pressure).
Sub-workloads run as parallel goroutines:
- **CPU** — `stress-ng --cpu <workers>` for the burn duration.
- **Memory** — `stress-ng --vm --vm-bytes <mem_pct>%` for the burn
duration.
- **Disk** — `fio` against a spare partition (when `fio_on_spare` is
enabled).
- **Network** — `iperf3 -c <orchestrator> -P <parallel>` for the burn
duration.
**Pass:** all four sub-workloads exit 0 and no critical threshold
breach fires during the window.
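
The fan-out and the exit-status half of the pass rule can be sketched as follows. This is an illustration under stated assumptions, not the agent's implementation: `workload`, `runBurn`, `passing`, and `failing` are hypothetical names, the stand-in workloads return immediately instead of wrapping `stress-ng`, `fio`, or `iperf3`, and the threshold-breach check is left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// workload is one Burn sub-workload. In a real agent Run would wrap an
// external command; here it is a plain function (assumption).
type workload struct {
	Name string
	Run  func() error
}

// runBurn starts every workload in its own goroutine, waits for all of
// them, and returns nil only if every one succeeded.
func runBurn(ws []workload) error {
	errs := make([]error, len(ws)) // one slot per goroutine, so no locking
	var wg sync.WaitGroup
	for i, w := range ws {
		wg.Add(1)
		go func(i int, w workload) {
			defer wg.Done()
			if err := w.Run(); err != nil {
				errs[i] = fmt.Errorf("%s: %w", w.Name, err)
			}
		}(i, w)
	}
	wg.Wait()
	return errors.Join(errs...) // nil when every entry is nil
}

// passing and failing build stand-in workloads for demonstration.
func passing(name string) workload {
	return workload{Name: name, Run: func() error { return nil }}
}

func failing(name string) workload {
	return workload{Name: name, Run: func() error { return errors.New("exit status 1") }}
}

func main() {
	all := []workload{passing("cpu"), passing("memory"), passing("disk"), passing("network")}
	fmt.Println("burn passed:", runBurn(all) == nil)

	mixed := []workload{passing("cpu"), failing("disk")}
	fmt.Println("burn error:", runBurn(mixed))
}
```

`errors.Join` (Go 1.20+) keeps every failure in the result, so one run reports all sub-workloads that exited non-zero rather than only the first.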
**Configurable knobs** (per profile):
| Knob | Description |
|------|-------------|
| `duration` | Total burn-in window. |
| `cpu_workers` | `all` = `runtime.NumCPU()`, or a fixed count. |
| `mem_pct` | Percentage of `MemAvailable` (from `/proc/meminfo`) to stress. |
| `fio_on_spare` | Run fio inside Burn (requires a spare partition). |
| `iperf_parallel` | Parallel stream count for `iperf3 -P`. |
See [configuration.md § burn](configuration.md#burn) for per-profile
default values.
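
One way to model these knobs in Go is sketched below, mainly to show how `cpu_workers` might resolve `all` to `runtime.NumCPU()`. The `BurnConfig` struct, its field names, the `resolveCPUWorkers` helper, and the sample values are assumptions made for this example; the actual config types and defaults live in the project and in configuration.md.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"time"
)

// BurnConfig is a hypothetical in-memory form of the per-profile knobs.
type BurnConfig struct {
	Duration      time.Duration // duration
	CPUWorkers    string        // cpu_workers: "all" or a decimal count
	MemPct        int           // mem_pct
	FioOnSpare    bool          // fio_on_spare
	IperfParallel int           // iperf_parallel
}

// resolveCPUWorkers turns the cpu_workers knob into a worker count.
func resolveCPUWorkers(s string) (int, error) {
	if s == "all" {
		return runtime.NumCPU(), nil
	}
	n, err := strconv.Atoi(s)
	if err != nil || n < 1 {
		return 0, fmt.Errorf("cpu_workers: want \"all\" or a positive integer, got %q", s)
	}
	return n, nil
}

func main() {
	// Sample values for illustration only, not the project's defaults.
	cfg := BurnConfig{
		Duration:      2 * time.Minute,
		CPUWorkers:    "all",
		MemPct:        80,
		FioOnSpare:    false,
		IperfParallel: 4,
	}
	n, err := resolveCPUWorkers(cfg.CPUWorkers)
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("burn for %s with %d CPU workers\n", cfg.Duration, n)
}
```
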
## GPU
**Owner:** agent.
@@ -153,6 +201,29 @@ the next batch.
- `artifacts` — on-disk files (report, fio logs, iperf logs, etc).
- `spec_diffs` — one row per expected-vs-actual divergence.
## Profile duration summary
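Three profiles scale the duration of the stages that apply load
(CPUStress, Storage, Network, Burn, PSU); the remaining stages take the
same time in every profile. Probes and gates are identical across
profiles; only the work size changes. See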
[configuration.md § profiles](configuration.md#profiles) for the full
knob reference.
| Stage | quick (~10 min) | deep (~8-12 h) | soak (~36-40 h) |
|-------|----------------|----------------|-----------------|
| Inventory | seconds | seconds | seconds |
| Firmware | seconds | seconds | seconds |
| SpecValidate | instant (server) | instant (server) | instant (server) |
| SMART | seconds per disk | seconds per disk | seconds per disk |
| CPUStress | 2 m cpu + 2 m mem | 60 m cpu + 60 m mem | 12 h cpu + 12 h mem |
| Storage | 3 m fio (sample) | badblocks + 2 h fio | badblocks + 6 h fio |
| Network | 60 s iperf | 30 m iperf | 2 h iperf |
| Burn | 2 m all-at-once | 2 h all-at-once | 18 h all-at-once |
| GPU | seconds | seconds | seconds |
| PSU | 1 m load burst | 10 m load burst | 15 m load burst |
| Reporting | instant (server) | instant (server) | instant (server) |
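
For readers who prefer data to tables, the timed rows above can be written as a Go lookup. This is only a transcription of the table into a hypothetical `profile` struct and `profiles` map; it is not how the project stores its profiles, and `badblocks` is recorded as a flag because the table gives no duration for it.

```go
package main

import (
	"fmt"
	"time"
)

// profile holds the per-stage durations listed in the summary table.
type profile struct {
	CPU       time.Duration // CPUStress, cpu phase
	Mem       time.Duration // CPUStress, mem phase
	Badblocks bool          // Storage runs badblocks before fio
	Fio       time.Duration // Storage, fio phase
	Iperf     time.Duration // Network
	Burn      time.Duration // Burn, all-at-once window
	PSU       time.Duration // PSU load burst
}

// profiles is keyed by profile name as used in the table header.
var profiles = map[string]profile{
	"quick": {CPU: 2 * time.Minute, Mem: 2 * time.Minute, Fio: 3 * time.Minute,
		Iperf: 60 * time.Second, Burn: 2 * time.Minute, PSU: time.Minute},
	"deep": {CPU: 60 * time.Minute, Mem: 60 * time.Minute, Badblocks: true, Fio: 2 * time.Hour,
		Iperf: 30 * time.Minute, Burn: 2 * time.Hour, PSU: 10 * time.Minute},
	"soak": {CPU: 12 * time.Hour, Mem: 12 * time.Hour, Badblocks: true, Fio: 6 * time.Hour,
		Iperf: 2 * time.Hour, Burn: 18 * time.Hour, PSU: 15 * time.Minute},
}

func main() {
	for _, name := range []string{"quick", "deep", "soak"} {
		p := profiles[name]
		fmt.Printf("%-5s burn=%s iperf=%s badblocks=%t\n", name, p.Burn, p.Iperf, p.Badblocks)
	}
}
```
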
---
## Adding a new stage
1. Add the name to `store.DefaultStageOrder`.