docs: comprehensive documentation expansion

Add 4 new doc files (configuration reference, development guide, API reference with full request/response schemas, database schema), expand the README with a feature list and how-it-works walkthrough, fix missing Firmware and Burn stages in architecture.md and test-suite.md, add threshold engine and host-mode agent sections, and add godoc comments to 11 packages and 6 model types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
+61
-6
@@ -37,10 +37,10 @@ Operator browser (HTMX + SSE, admin login)
 |---|---|
 | `cmd/vetting` | Orchestrator entrypoint. Wires config, stores, runner, dispatcher, iperf supervisor, PXE supervisor, janitor, HTTP router. |
 | `cmd/vetting-agent` | In-image agent entrypoint. Reads kernel cmdline params, starts the agent loop. |
-| `internal/config` | YAML loader + types. |
+| `internal/config` | YAML loader + types. `ProfileRegistry` holds the quick/deep/soak profile definitions, threshold defaults, and per-stage probe knobs. |
 | `internal/db` | SQLite open + embedded migrations. Pure Go via modernc.org/sqlite. |
 | `internal/model` | Plain structs: `Host`, `Run`, `Stage`, `Measurement`, `SpecDiff`, `Artifact`. |
-| `internal/store` | Repository layer; SQL is hand-written. |
+| `internal/store` | Repository layer; SQL is hand-written (no ORM). Stores for hosts, runs, stages, sub-steps, artifacts, spec diffs, measurements, thresholds, firmware. |
 | `internal/orchestrator` | State machine, dispatcher, per-run runner, WoL sender, HMAC run tokens, iperf supervisor. |
 | `internal/api` | HTTP handlers: `agent_handlers.go` (the agent-facing API) and `ui_handlers.go` (HTMX fragments + SSE). |
 | `internal/httpserver` | chi router assembly — lives here to avoid `api ↔ orchestrator` cyclic imports. |
@@ -66,11 +66,13 @@ Per-run state is the single source of truth; the UI is a pure
 projection of DB + event stream.

 ```
-Registered → Queued → WaitingWoL → Booting → InventoryCheck
-→ SpecValidate → SMART → CPUStress → Storage → Network
-→ GPU → PSU → Reporting → Completed
+Registered → Queued → WaitingWoL / WaitingReboot → Booting
+→ InventoryCheck → Firmware → SpecValidate → SMART
+→ CPUStress → Storage → Network → Burn → GPU → PSU
+→ Reporting → Completed
+
 any stage → Failed → FailedHolding → Released
 any active state → Cancelled
 ```

 Key points:
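
Review note on this hunk: the revised diagram can be read as a small transition table. The Go sketch below is illustrative only. The `next` helper, its `viaReboot` flag, and the use of plain strings for states are assumptions made for the example, not the orchestrator's actual types. It also assumes `WaitingReboot` is the alternative to `WaitingWoL` for hosts that are rebooted rather than woken, which the diff does not state outright.

```go
package main

import "fmt"

// afterBoot lists the happy path from Booting onward, in the order shown in
// the updated diagram (Firmware and Burn included).
var afterBoot = []string{
	"Booting", "InventoryCheck", "Firmware", "SpecValidate", "SMART",
	"CPUStress", "Storage", "Network", "Burn", "GPU", "PSU",
	"Reporting", "Completed",
}

// next returns the state that follows the given one, and false when the state
// is terminal or unknown. viaReboot is a hypothetical flag that selects
// WaitingReboot instead of WaitingWoL.
func next(state string, viaReboot bool) (string, bool) {
	switch state {
	case "Registered":
		return "Queued", true
	case "Queued":
		if viaReboot {
			return "WaitingReboot", true
		}
		return "WaitingWoL", true
	case "WaitingWoL", "WaitingReboot":
		return "Booting", true
	case "Failed":
		return "FailedHolding", true
	case "FailedHolding":
		return "Released", true
	}
	// Walk the post-boot stages; Completed has no successor.
	for i, s := range afterBoot[:len(afterBoot)-1] {
		if s == state {
			return afterBoot[i+1], true
		}
	}
	return "", false
}

func main() {
	// Print the happy path for a host that is woken via WoL.
	for s, ok := "Registered", true; ok; s, ok = next(s, false) {
		fmt.Println(s)
	}
}
```

The `Cancelled` edge is left out on purpose: the diagram allows it from any active state, so it is better modelled as a separate check than as an entry in the table.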
@@ -97,7 +99,10 @@ POST /api/v1/runs/{id}/result → stage result; response says next_state
 POST /api/v1/runs/{id}/hold → on FailedHolding, receive authorized_key
 ```

-Auth on every `/api/v1/*` call: the bearer token is stored as a bcrypt
+See [api-reference.md](api-reference.md) for full request/response
+schemas and SSE event types.
+
+Auth on every `/api/v1/runs/*` call: the bearer token is stored as a bcrypt
 hash in `runs.agent_token_hash` and compared in constant time. The
 plaintext is in the kernel cmdline — unforgeable by anyone not on the
 trusted bridge, because the iPXE script is issued per-MAC and the MAC
@@ -165,6 +170,56 @@ The janitor goroutine (`internal/janitor`) runs a sweep every
 **never** deleted by the janitor — host histories and aggregate
 metrics survive cleanups.

+## Threshold engine
+
+Every `/sensor` batch is evaluated against rules seeded per-run at
+creation time from the `ProfileRegistry` + per-host overrides. Rules
+are immutable for the life of a run — a late config edit can't
+retroactively pass or fail an in-flight run.
+
+Operators: `lt`, `lte`, `gt`, `gte`, `within_pct`. Key matching is
+glob-ish: `*` matches all keys, `cpu/*` matches any key starting with
+`cpu/`, exact strings for specific keys. Stage matching works the same
+way (`*` for global, exact name for stage-specific).
+
+Severity drives the action:
+
+- **critical** — fail the run immediately. The current stage is marked
+failed, the run enters `FailedHolding`, and a `StageFailed`
+notification fires.
+- **warning** — record the breach for the report. The stage continues.
+
+Every evaluation (pass or fail) is persisted as a
+`threshold_evaluations` row so the report can render per-sample
+verdict badges. See [configuration.md § thresholds](configuration.md#vettingthresholds)
+for the config-level reference.
+
+## Host-mode agent
+
+The `vetting-agent host` binary runs as a systemd service on
+installed hosts. It heartbeats to `POST /api/v1/hosts/{mac}/heartbeat`
+every 30 s so the dashboard shows online/offline status.
+
+The quick-register one-liner (`GET /register/quick.sh`) downloads the
+agent binary from `/assets/vetting-agent-linux-amd64`, installs it as
+a systemd service, and auto-POSTs to `POST /api/v1/hosts` to register
+the host — no manual MAC entry needed.
+
+When the operator clicks **Start Vetting**, the orchestrator's
+dispatcher sets `cmd=reboot_for_vetting` on the next heartbeat
+response. The host-mode agent reboots the host, which PXE-boots into
+the live image and enters the normal vetting flow.
+
+## Host API
+
+These endpoints are LAN-trusted (no bearer token) and share the same
+threat model as the browser UI:
+
+```
+POST /api/v1/hosts → JSON host registration (quick-register)
+POST /api/v1/hosts/{mac}/heartbeat → host-mode liveness + command channel
+```
+
 ## Reproducible builds

 The orchestrator and agent are pure Go; `make orchestrator-linux`
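
Review note on the new "Threshold engine" section: the matching and severity rules it describes can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions, not the engine's real code. The `Rule` struct, its field names, the `verdict` helper, and its return strings are all hypothetical. The sketch also assumes that an operator expresses the passing condition (a sample that does not satisfy it is a breach) and that `within_pct` means "within `Pct` percent of `Value`"; the section does not define either.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Rule is a hypothetical shape for one threshold rule.
type Rule struct {
	Stage    string  // "*" for global, or an exact stage name
	Key      string  // "*", a prefix glob such as "cpu/*", or an exact key
	Op       string  // lt, lte, gt, gte, within_pct
	Value    float64 // limit, or the reference value for within_pct
	Pct      float64 // tolerance in percent; only used by within_pct (assumed)
	Severity string  // "critical" or "warning"
}

// matchKey implements the glob-ish key matching described in the section.
func matchKey(pattern, key string) bool {
	switch {
	case pattern == "*":
		return true
	case strings.HasSuffix(pattern, "/*"):
		// "cpu/*" matches any key starting with "cpu/".
		return strings.HasPrefix(key, strings.TrimSuffix(pattern, "*"))
	default:
		return pattern == key
	}
}

// passes reports whether the sample satisfies the rule's operator.
// Unknown operators are treated as a breach in this sketch.
func passes(r Rule, v float64) bool {
	switch r.Op {
	case "lt":
		return v < r.Value
	case "lte":
		return v <= r.Value
	case "gt":
		return v > r.Value
	case "gte":
		return v >= r.Value
	case "within_pct":
		return math.Abs(v-r.Value) <= math.Abs(r.Value)*r.Pct/100
	default:
		return false
	}
}

// verdict maps one sample to "skip" (rule does not apply), "pass",
// "fail_run" (critical breach) or "warn" (warning breach).
func verdict(r Rule, stage, key string, v float64) string {
	if (r.Stage != "*" && r.Stage != stage) || !matchKey(r.Key, key) {
		return "skip"
	}
	if passes(r, v) {
		return "pass"
	}
	if r.Severity == "critical" {
		return "fail_run"
	}
	return "warn"
}

func main() {
	// Illustrative rule and sample; the key name and limit are made up.
	r := Rule{Stage: "*", Key: "cpu/*", Op: "lt", Value: 90, Severity: "critical"}
	fmt.Println(verdict(r, "CPUStress", "cpu/temp_c", 95))
}
```

In the section's terms, a `fail_run` result corresponds to marking the stage failed and moving the run to `FailedHolding`, while `warn` only records the breach. Persisting each result as a `threshold_evaluations` row is outside the scope of this sketch.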