docs: comprehensive documentation expansion
Add 4 new doc files (configuration reference, development guide, API reference with full request/response schemas, database schema), expand the README with a feature list and how-it-works walkthrough, fix missing Firmware and Burn stages in architecture.md and test-suite.md, add threshold engine and host-mode agent sections, and add godoc comments to 11 packages and 6 model types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
+73
-2
@@ -8,8 +8,8 @@ to fix, override, or abandon.
## Stage order

```
-Inventory → SpecValidate → SMART → CPUStress → Storage
-→ Network → GPU → PSU → Reporting
+Inventory → Firmware → SpecValidate → SMART → CPUStress → Storage
+→ Network → Burn → GPU → PSU → Reporting
```
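
The updated order can be pictured as an ordered slice. The following is a minimal sketch only: `stageOrder` and `nextStage` are hypothetical names chosen for illustration, and the real list lives in `store.DefaultStageOrder`, whose declaration is not shown in this diff.

```go
package main

// stageOrder mirrors the updated stage order listed above.
// Hypothetical: the project keeps its own list in store.DefaultStageOrder.
var stageOrder = []string{
	"Inventory", "Firmware", "SpecValidate", "SMART", "CPUStress",
	"Storage", "Network", "Burn", "GPU", "PSU", "Reporting",
}

// nextStage returns the stage that follows current. The boolean is false
// when current is the last stage or does not appear in the list.
func nextStage(current string) (string, bool) {
	for i, s := range stageOrder {
		if s == current && i+1 < len(stageOrder) {
			return stageOrder[i+1], true
		}
	}
	return "", false
}
```

With this list, `nextStage("Inventory")` yields `Firmware` and `nextStage("Network")` yields `Burn`, which are the two positions this change fills in.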

Stages marked *orchestrator-owned* resolve inside `/result` and never
@@ -27,6 +27,20 @@ merged into a single JSON blob.
`nvidia-smi` on a GPU-less host) are tolerated.
**Artifacts:** `inventory.json` under `artifacts/run-<N>/`.

## Firmware

**Owner:** agent.
**What it does:** probes firmware versions across all discoverable
components: BIOS (`dmidecode -t bios`), BMC (`ipmitool mc info`), NIC
firmware (`ethtool -i` per interface), NVMe firmware (`nvme id-ctrl`),
HBA firmware (`lspci -vv`), and CPU microcode (`/proc/cpuinfo`).
Missing tools are tolerated: a GPU-less server won't have
`nvidia-smi`, and a consumer board won't have `ipmitool`.
**Pass:** always passes. Firmware is advisory-only; SpecValidate is the
gate that fails on version mismatches.
**Artifacts:** `firmware_snapshots` table rows (one per component,
keyed by `(run_id, component, identifier)`).
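
The "missing tools are tolerated" rule can be sketched as a filter that runs before any probe. This is an illustration under stated assumptions, not the agent's actual code: the `probe` struct, `firmwareProbes`, `onPath`, and `partition` are hypothetical names, and only the command lines are taken from the list above. The microcode check is omitted because it reads `/proc/cpuinfo` and needs no external tool.

```go
package main

import "os/exec"

// probe describes one firmware query (hypothetical type for illustration).
type probe struct {
	Component string
	Tool      string
	Args      []string
}

// firmwareProbes lists the tool-based probes named in the section above.
var firmwareProbes = []probe{
	{Component: "bios", Tool: "dmidecode", Args: []string{"-t", "bios"}},
	{Component: "bmc", Tool: "ipmitool", Args: []string{"mc", "info"}},
	{Component: "nic", Tool: "ethtool", Args: []string{"-i"}},     // interface name appended per NIC
	{Component: "nvme", Tool: "nvme", Args: []string{"id-ctrl"}}, // device path appended per controller
	{Component: "hba", Tool: "lspci", Args: []string{"-vv"}},
}

// onPath reports whether tool can be found in $PATH.
func onPath(tool string) bool {
	_, err := exec.LookPath(tool)
	return err == nil
}

// partition splits probes into those whose tool is installed and those
// that should be skipped. The installed check is injected so that tests
// can substitute a fake for onPath.
func partition(probes []probe, installed func(string) bool) (run, skipped []probe) {
	for _, p := range probes {
		if installed(p.Tool) {
			run = append(run, p)
		} else {
			skipped = append(skipped, p)
		}
	}
	return run, skipped
}
```

On a real host the call would be `partition(firmwareProbes, onPath)`. Skipped probes produce no snapshot row and, in line with the pass rule above, would not fail the stage.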

## SpecValidate *(orchestrator-owned)*

**Owner:** orchestrator (resolves inline inside the `/result` for the
@@ -93,6 +107,40 @@ binds to the configured `network.iperf_port`.
for 10GbE).
**Artifacts:** `iperf-<nic>.json`.

## Burn

**Owner:** agent.
**What it does:** runs CPU stress, memory stress, disk I/O, and
network throughput **simultaneously** for the profile's burn duration.
The goal is to stress every subsystem at once and surface failures that
only appear under combined load (thermal throttling, PSU voltage sag,
memory errors under thermal pressure).

Sub-workloads run as parallel goroutines:

- **CPU**: `stress-ng --cpu <workers>` for the burn duration.
- **Memory**: `stress-ng --vm --vm-bytes <mem_pct>%` for the burn
  duration.
- **Disk**: `fio` against a spare partition (when `fio_on_spare` is
  enabled).
- **Network**: `iperf3 -c <orchestrator> -P <parallel>` for the burn
  duration.

**Pass:** all four sub-workloads exit 0 and no critical threshold
breach fires during the window.
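
The parallel-goroutine pattern and the "all exit 0" half of the pass rule can be sketched as below. This is a minimal illustration, not the agent's implementation: `workload`, `runBurn`, and `errNoRunner` are hypothetical names, each `Run` function stands in for launching one external tool, and threshold breaches are not modelled.

```go
package main

import (
	"errors"
	"sync"
)

// errNoRunner is recorded for a workload that has no Run function.
var errNoRunner = errors.New("workload has no Run function")

// workload is a hypothetical stand-in for one Burn sub-workload.
// Run returns nil when the underlying tool exits 0.
type workload struct {
	Name string
	Run  func() error
}

// runBurn starts every workload in its own goroutine, waits for all of
// them to finish, and returns the failures keyed by workload name.
// An empty map means every sub-workload succeeded.
func runBurn(workloads []workload) map[string]error {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		failures = make(map[string]error)
	)
	for _, w := range workloads {
		wg.Add(1)
		go func(w workload) {
			defer wg.Done()
			err := errNoRunner
			if w.Run != nil {
				err = w.Run()
			}
			if err != nil {
				mu.Lock() // the map is shared across goroutines
				failures[w.Name] = err
				mu.Unlock()
			}
		}(w)
	}
	wg.Wait()
	return failures
}
```

Collecting every failure, instead of stopping at the first one, keeps the report complete when several subsystems fail under the same combined load.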

**Configurable knobs** (per profile):

| Knob | Description |
|------|-------------|
| `duration` | Total burn-in window. |
| `cpu_workers` | `all` = `runtime.NumCPU()`, or a fixed count. |
| `mem_pct` | Percentage of MemAvailable to stress. |
| `fio_on_spare` | Run fio inside Burn (requires a spare partition). |
| `iperf_parallel` | Parallel stream count for `iperf3 -P`. |

See [configuration.md § burn](configuration.md#burn) for per-profile
default values.
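
As a rough illustration of how these knobs might map onto Go, the sketch below models them as a struct and resolves `cpu_workers`. The struct name, field names, and field types are assumptions made for this example. Only the knob names and the `all` = `runtime.NumCPU()` rule come from the table.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"time"
)

// burnKnobs mirrors the knob table above. Field types are assumptions;
// the project's real config struct is not shown in this diff.
type burnKnobs struct {
	Duration      time.Duration // duration: total burn-in window
	CPUWorkers    string        // cpu_workers: "all" or a fixed count such as "8"
	MemPct        int           // mem_pct: percentage of MemAvailable to stress
	FioOnSpare    bool          // fio_on_spare: run fio inside Burn
	IperfParallel int           // iperf_parallel: stream count for iperf3 -P
}

// resolveCPUWorkers turns the cpu_workers knob into a worker count.
// "all" maps to runtime.NumCPU(); anything else must be a positive integer.
func resolveCPUWorkers(knob string) (int, error) {
	if knob == "all" {
		return runtime.NumCPU(), nil
	}
	n, err := strconv.Atoi(knob)
	if err != nil || n < 1 {
		return 0, fmt.Errorf("cpu_workers must be \"all\" or a positive integer, got %q", knob)
	}
	return n, nil
}
```

Rejecting zero and negative counts up front is a design choice in this sketch, so that a bad value fails before any stress tool is launched.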

## GPU

**Owner:** agent.
@@ -153,6 +201,29 @@ the next batch.
- `artifacts`: on-disk files (report, fio logs, iperf logs, etc.).
- `spec_diffs`: one row per expected-vs-actual divergence.

## Profile duration summary

Three profiles (`quick`, `deep`, `soak`) scale the duration of the
timed stages. Probes and gates are identical across profiles; only the
work size changes. See
[configuration.md § profiles](configuration.md#profiles) for the full
knob reference.

| Stage | quick (~10 min) | deep (~8-12 h) | soak (~36-40 h) |
|-------|----------------|----------------|-----------------|
| Inventory | seconds | seconds | seconds |
| Firmware | seconds | seconds | seconds |
| SpecValidate | instant (server) | instant (server) | instant (server) |
| SMART | seconds per disk | seconds per disk | seconds per disk |
| CPUStress | 2 m cpu + 2 m mem | 60 m cpu + 60 m mem | 12 h cpu + 12 h mem |
| Storage | 3 m fio (sample) | badblocks + 2 h fio | badblocks + 6 h fio |
| Network | 60 s iperf | 30 m iperf | 2 h iperf |
| Burn | 2 m all-at-once | 2 h all-at-once | 18 h all-at-once |
| GPU | seconds | seconds | seconds |
| PSU | 1 m load burst | 10 m load burst | 15 m load burst |
| Reporting | instant (server) | instant (server) | instant (server) |
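
As a sanity check on the column headers, the fixed-length phases in the table can be added up. The sketch below is arithmetic on the table only, under one loud assumption: the cpu and mem phases of CPUStress are counted once, as if they overlap, which the table does not state. The badblocks pass (no duration given) and the seconds-scale stages are left out. `timedPhases` and `timedTotal` are hypothetical names.

```go
package main

import "time"

// timedPhases lists, per profile, the fixed-length phases from the table:
// CPUStress, Storage fio, Network, Burn, PSU (in that order).
// ASSUMPTION: CPUStress is counted once because its cpu and mem phases
// are taken to overlap; the source table does not confirm this.
var timedPhases = map[string][]time.Duration{
	"quick": {2 * time.Minute, 3 * time.Minute, 60 * time.Second, 2 * time.Minute, 1 * time.Minute},
	"deep":  {60 * time.Minute, 2 * time.Hour, 30 * time.Minute, 2 * time.Hour, 10 * time.Minute},
	"soak":  {12 * time.Hour, 6 * time.Hour, 2 * time.Hour, 18 * time.Hour, 15 * time.Minute},
}

// timedTotal adds up the listed phases for one profile.
// An unknown profile yields zero.
func timedTotal(profile string) time.Duration {
	var total time.Duration
	for _, d := range timedPhases[profile] {
		total += d
	}
	return total
}
```

Under that assumption the totals come to 9 m for `quick`, 5 h 40 m for `deep`, and 38 h 15 m for `soak`, before badblocks and the seconds-scale stages are added. These are sums of table entries, not measured run times.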

---

## Adding a new stage

1. Add the name to `store.DefaultStageOrder`.