docs: comprehensive documentation expansion

Add 4 new doc files (configuration reference, development guide, API
reference with full request/response schemas, database schema), expand
the README with a feature list and how-it-works walkthrough, fix
missing Firmware and Burn stages in architecture.md and test-suite.md,
add threshold engine and host-mode agent sections, and add godoc
comments to 11 packages and 6 model types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-23 18:37:26 -04:00
parent 17ec55cb85
commit 8367ec2a9f
18 changed files with 1548 additions and 10 deletions
+61 -6
@@ -37,10 +37,10 @@ Operator browser (HTMX + SSE, admin login)
|---|---|
| `cmd/vetting` | Orchestrator entrypoint. Wires config, stores, runner, dispatcher, iperf supervisor, PXE supervisor, janitor, HTTP router. |
| `cmd/vetting-agent` | In-image agent entrypoint. Reads kernel cmdline params, starts the agent loop. |
| `internal/config` | YAML loader + types. |
| `internal/config` | YAML loader + types. `ProfileRegistry` holds the quick/deep/soak profile definitions, threshold defaults, and per-stage probe knobs. |
| `internal/db` | SQLite open + embedded migrations. Pure Go via modernc.org/sqlite. |
| `internal/model` | Plain structs: `Host`, `Run`, `Stage`, `Measurement`, `SpecDiff`, `Artifact`. |
| `internal/store` | Repository layer; SQL is hand-written. |
| `internal/store` | Repository layer; SQL is hand-written (no ORM). Stores for hosts, runs, stages, sub-steps, artifacts, spec diffs, measurements, thresholds, firmware. |
| `internal/orchestrator` | State machine, dispatcher, per-run runner, WoL sender, HMAC run tokens, iperf supervisor. |
| `internal/api` | HTTP handlers: `agent_handlers.go` (the agent-facing API) and `ui_handlers.go` (HTMX fragments + SSE). |
| `internal/httpserver` | chi router assembly — lives here to avoid `api ↔ orchestrator` cyclic imports. |
@@ -66,11 +66,13 @@ Per-run state is the single source of truth; the UI is a pure
projection of DB + event stream.
```
Registered → Queued → WaitingWoL → Booting → InventoryCheck
SpecValidate → SMART → CPUStress → Storage → Network
GPU → PSU → Reporting → Completed
Registered → Queued → WaitingWoL / WaitingReboot → Booting
InventoryCheck → Firmware → SpecValidate → SMART
CPUStress → Storage → Network → Burn → GPU → PSU
→ Reporting → Completed
any stage → Failed → FailedHolding → Released
any active state → Cancelled
```
Key points:
@@ -97,7 +99,10 @@ POST /api/v1/runs/{id}/result → stage result; response says next_state
POST /api/v1/runs/{id}/hold → on FailedHolding, receive authorized_key
```
Auth on every `/api/v1/*` call: the bearer token is stored as a bcrypt
See [api-reference.md](api-reference.md) for full request/response
schemas and SSE event types.
Auth on every `/api/v1/runs/*` call: the bearer token is stored as a bcrypt
hash in `runs.agent_token_hash` and compared in constant time. The
plaintext is in the kernel cmdline — unforgeable by anyone not on the
trusted bridge, because the iPXE script is issued per-MAC and the MAC
@@ -165,6 +170,56 @@ The janitor goroutine (`internal/janitor`) runs a sweep every
**never** deleted by the janitor — host histories and aggregate
metrics survive cleanups.
## Threshold engine
Every `/sensor` batch is evaluated against rules seeded per-run at
creation time from the `ProfileRegistry` + per-host overrides. Rules
are immutable for the life of a run — a late config edit can't
retroactively pass or fail an in-flight run.
Operators: `lt`, `lte`, `gt`, `gte`, `within_pct`. Key matching is
glob-like: `*` matches all keys, `cpu/*` matches any key starting with
`cpu/`, and any other string matches only that exact key. Stage matching
works the same way (`*` for global, exact name for stage-specific).
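The matching and operator rules above can be sketched in Go as follows. This is an illustration only, not the engine's actual code: the `Rule` type, its field names, `matchPattern`, and `Passes` are hypothetical. The sketch also makes two assumptions. A rule is taken to state the condition a healthy sample must satisfy, and `within_pct` is taken to mean "within `Pct` percent of `Value`". Neither is confirmed here.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Rule is a hypothetical, simplified threshold rule used for illustration.
type Rule struct {
	Stage string  // "*" for global, or an exact stage name
	Key   string  // "*", a prefix pattern such as "cpu/*", or an exact key
	Op    string  // lt, lte, gt, gte, within_pct
	Value float64 // limit, or the target for within_pct (assumed)
	Pct   float64 // tolerance in percent, used only by within_pct (assumed)
}

// matchPattern implements the glob-like matching described above:
// "*" matches everything, "prefix/*" matches names starting with "prefix/",
// and anything else must match exactly.
func matchPattern(pattern, name string) bool {
	switch {
	case pattern == "*":
		return true
	case strings.HasSuffix(pattern, "/*"):
		return strings.HasPrefix(name, strings.TrimSuffix(pattern, "*"))
	default:
		return pattern == name
	}
}

// Applies reports whether the rule covers a sample from the given stage and key.
func (r Rule) Applies(stage, key string) bool {
	return matchPattern(r.Stage, stage) && matchPattern(r.Key, key)
}

// Passes reports whether the sample satisfies the rule's operator.
func (r Rule) Passes(sample float64) (bool, error) {
	switch r.Op {
	case "lt":
		return sample < r.Value, nil
	case "lte":
		return sample <= r.Value, nil
	case "gt":
		return sample > r.Value, nil
	case "gte":
		return sample >= r.Value, nil
	case "within_pct":
		// Assumed semantics: |sample - target| <= |target| * pct / 100.
		return math.Abs(sample-r.Value) <= math.Abs(r.Value)*r.Pct/100, nil
	default:
		return false, fmt.Errorf("unknown operator %q", r.Op)
	}
}

func main() {
	r := Rule{Stage: "*", Key: "cpu/*", Op: "lte", Value: 90}
	ok, err := r.Passes(95)
	fmt.Println(r.Applies("CPUStress", "cpu/temp_c"), ok, err)
}
```

In this sketch a sample of 95 under `cpu/temp_c` is covered by the rule and breaches it. What happens next depends on the rule's severity.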
Severity drives the action:
- **critical** — fail the run immediately. The current stage is marked
failed, the run enters `FailedHolding`, and a `StageFailed`
notification fires.
- **warning** — record the breach for the report. The stage continues.
Every evaluation (pass or fail) is persisted as a
`threshold_evaluations` row so the report can render per-sample
verdict badges. See [configuration.md § thresholds](configuration.md#vettingthresholds)
for the config-level reference.
## Host-mode agent
The `vetting-agent host` binary runs as a systemd service on
installed hosts. It heartbeats to `POST /api/v1/hosts/{mac}/heartbeat`
every 30 s so the dashboard shows online/offline status.
The quick-register one-liner (`GET /register/quick.sh`) downloads the
agent binary from `/assets/vetting-agent-linux-amd64`, installs it as
a systemd service, and registers the host by calling
`POST /api/v1/hosts`, so no manual MAC entry is needed.
When the operator clicks **Start Vetting**, the orchestrator's
dispatcher sets `cmd=reboot_for_vetting` on the next heartbeat
response. The host-mode agent reboots the host, which PXE-boots into
the live image and enters the normal vetting flow.
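As a rough illustration of the command channel, the agent side might decode the heartbeat response along these lines. The JSON shape (a top-level `cmd` string field) and the names `heartbeatResponse` and `shouldReboot` are assumptions made for this sketch. Only the command value `reboot_for_vetting` comes from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// heartbeatResponse is a hypothetical shape for the heartbeat reply;
// the real schema may carry more fields or use different names.
type heartbeatResponse struct {
	Cmd string `json:"cmd"`
}

// shouldReboot reports whether the orchestrator asked the host to
// reboot into the vetting image on this heartbeat.
func shouldReboot(body []byte) (bool, error) {
	var resp heartbeatResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, fmt.Errorf("decode heartbeat response: %w", err)
	}
	return resp.Cmd == "reboot_for_vetting", nil
}

func main() {
	reboot, err := shouldReboot([]byte(`{"cmd":"reboot_for_vetting"}`))
	fmt.Println(reboot, err)
}
```

A response without a command (for example `{}`) leaves the host running, and the agent keeps sending heartbeats on its normal schedule.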
## Host API
These endpoints are LAN-trusted (no bearer token) and share the same
threat model as the browser UI:
```
POST /api/v1/hosts → JSON host registration (quick-register)
POST /api/v1/hosts/{mac}/heartbeat → host-mode liveness + command channel
```
## Reproducible builds
The orchestrator and agent are pure Go; `make orchestrator-linux`