docs: comprehensive documentation expansion
Add 4 new doc files (configuration reference, development guide, API reference with full request/response schemas, database schema), expand the README with a feature list and how-it-works walkthrough, fix missing Firmware and Burn stages in architecture.md and test-suite.md, add threshold engine and host-mode agent sections, and add godoc comments to 11 packages and 6 model types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -0,0 +1,490 @@
# API reference

Complete HTTP API for the vetting orchestrator. Routes are assembled
in `internal/httpserver/router.go`; handler logic lives in
`internal/api/agent_handlers.go` (agent-facing) and
`internal/api/ui_handlers.go` (browser + host-mode).

---

## Agent API

These endpoints are called by the in-image vetting agent during a
run. Every `/api/v1/runs/{id}/*` request must carry an
`Authorization: Bearer <token>` header. The token is issued per run in
the iPXE kernel cmdline and verified against a SHA-256 hash stored in
`runs.agent_token_hash` (see [Authentication](#authentication)).
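
As a rough illustration of the header requirement above, the sketch below builds an authenticated agent request. The helper name `newAgentRequest`, the example base URL, and the `Content-Type` header are assumptions made for this example; they are not taken from the orchestrator or agent source.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newAgentRequest (hypothetical helper) builds a POST to
// /api/v1/runs/{id}/{endpoint} carrying the per-run bearer token.
func newAgentRequest(baseURL string, runID int, endpoint, token string, body any) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/api/v1/runs/%d/%s", baseURL, runID, endpoint)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	// Assumption: the API expects a JSON content type.
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// The URL and token are placeholders, not real values.
	req, err := newAgentRequest("https://orchestrator.example:8443", 42, "hello", "example-token", struct{}{})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println(req.Header.Get("Authorization"))
}
```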

### `GET /ipxe/{mac}`

iPXE chainload script. Called by iPXE itself after dnsmasq hands it
the chainload URL. No auth required: the MAC path parameter is the
key.

**Responses:**

| Scenario | Script |
|----------|--------|
| Known MAC with an active run | Boot script: kernel + initrd + cmdline (run_id, mac, token, orchestrator_url, tls_fpr). Triggers `PXEObserved` transition. |
| Known MAC, no active run | Poweroff script. |
| Unknown MAC | Halt/error script. |

---

### `POST /api/v1/runs/{id}/hello`

First call the agent makes once userspace is up. Idempotent. Writes a
log line; the authoritative transition comes from `/claim`.

**Request body:**

```json
{}
```

**Response (200):**

```json
{ "ok": true, "run_id": 42 }
```

---

### `POST /api/v1/runs/{id}/claim`

Binding call: the agent proves it holds the plaintext token for this
run. In return the orchestrator seeds stage rows, transitions to
`InventoryCheck`, and returns the stage list + per-profile config.
Subsequent claims are idempotent (safe after transient network
failures).

**Request body:**

```json
{
  "agent_ip": "192.168.1.42"
}
```

`agent_ip` is optional; when it is omitted the orchestrator falls back
to the connection's `RemoteAddr`.

**Response (200):**

```json
{
  "ok": true,
  "run_id": 42,
  "stages": ["Inventory", "Firmware", "SpecValidate", "SMART", "CPUStress",
             "Storage", "Network", "Burn", "GPU", "PSU", "Reporting"],
  "expected_disks": [
    { "serial": "WD-ABC123", "size_gb": 500 }
  ],
  "iperf_port": 5201,
  "non_destructive": false,
  "current_state": "InventoryCheck",
  "stage_config": {
    "profile": "quick",
    "stage_timeouts": { "CPUStress": "5m0s", "Storage": "5m0s" },
    "cpustress": { "cpu_pass": "2m", "mem_pass": "2m", "edac_poll": "10s" },
    "storage": { "mode": "fio_sample", "fio_size": "1GiB", "fio_time": "3m",
                 "fio_bs": "4k", "fio_rw": "randrw", "verify": "md5" },
    "network": { "duration": "60s" },
    "burn": { "duration": "2m", "cpu_workers": "all", "mem_pct": 50,
              "fio_on_spare": true, "iperf_parallel": 2 }
  }
}
```

**`stage_config` shape:**

| Field | Type | Description |
|-------|------|-------------|
| `profile` | string | `quick`, `deep`, or `soak`. |
| `stage_timeouts` | map[string]string | Per-stage timeout durations (Go duration strings). |
| `cpustress.cpu_pass` | string | stress-ng CPU pass duration. |
| `cpustress.mem_pass` | string | stress-ng memory pass duration. |
| `cpustress.edac_poll` | string | EDAC error counter polling interval. |
| `storage.mode` | string | `fio_sample` (skip badblocks) or `full_disk`. |
| `storage.fio_size` | string | fio test file size (`fio_sample` mode only). |
| `storage.fio_time` | string | fio runtime. |
| `storage.fio_bs` | string | fio block size. |
| `storage.fio_rw` | string | fio I/O pattern. |
| `storage.verify` | string | fio integrity mode (`md5` or empty). |
| `network.duration` | string | iperf3 test duration. |
| `burn.duration` | string | Total burn-in window. |
| `burn.cpu_workers` | string | `all` or a numeric string. |
| `burn.mem_pct` | int | Percentage of MemAvailable to stress. |
| `burn.fio_on_spare` | bool | Run fio inside Burn. |
| `burn.iperf_parallel` | int | iperf3 parallel stream count. |
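
A client could decode the claim response along the following lines. This is a minimal sketch covering only a subset of the documented fields; the type and function names (`ClaimResponse`, `StageConfig`, `StageTimeout`, `decodeClaim`) are hypothetical and are not the agent's actual types. Because `stage_timeouts` holds Go duration strings, `time.ParseDuration` can read them directly.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// StageConfig mirrors a subset of the documented stage_config object.
type StageConfig struct {
	Profile       string            `json:"profile"`
	StageTimeouts map[string]string `json:"stage_timeouts"`
	Burn          struct {
		Duration   string `json:"duration"`
		CPUWorkers string `json:"cpu_workers"`
		MemPct     int    `json:"mem_pct"`
	} `json:"burn"`
}

// ClaimResponse mirrors a subset of the documented /claim response.
type ClaimResponse struct {
	OK          bool        `json:"ok"`
	RunID       int         `json:"run_id"`
	Stages      []string    `json:"stages"`
	StageConfig StageConfig `json:"stage_config"`
}

// StageTimeout parses the duration string for one stage. The second
// result is false when the stage has no entry in stage_timeouts.
func (c StageConfig) StageTimeout(stage string) (time.Duration, bool, error) {
	raw, ok := c.StageTimeouts[stage]
	if !ok {
		return 0, false, nil
	}
	d, err := time.ParseDuration(raw)
	return d, true, err
}

// decodeClaim unmarshals a /claim response body.
func decodeClaim(data []byte) (ClaimResponse, error) {
	var resp ClaimResponse
	err := json.Unmarshal(data, &resp)
	return resp, err
}

func main() {
	body := []byte(`{"ok":true,"run_id":42,"stages":["Inventory","Firmware"],
		"stage_config":{"profile":"quick","stage_timeouts":{"CPUStress":"5m0s"},
		"burn":{"duration":"2m","cpu_workers":"all","mem_pct":50}}}`)
	resp, err := decodeClaim(body)
	if err != nil {
		panic(err)
	}
	d, ok, err := resp.StageConfig.StageTimeout("CPUStress")
	fmt.Println(resp.StageConfig.Profile, d, ok, err)
}
```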

---

### `POST /api/v1/runs/{id}/heartbeat`

Periodic liveness ping. The response body acts as a control channel.

**Request body:**

```json
{}
```

**Response (200):**

```json
{
  "state": "CPUStress",
  "cmd": "continue"
}
```

**`cmd` values:**

| cmd | When | Agent action |
|-----|------|--------------|
| `continue` | Normal case (including `FailedHolding`) | No-op; keep running the current stage or wait for an override. |
| `reboot` | Run reached `Completed` | `systemctl reboot` (falls through iPXE to local disk). |
| `abort` | Run is in `Released` | Stop the heartbeat loop. |
| `retry_stage` | Operator pressed "Override wipe & retry" | Re-enter the named stage with override flags. Response includes `stage` and `override_flags`. |
| `cancel_stage` | Operator clicked Cancel mid-stage | Kill the running stage subprocess, then power off. |
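
The `cmd` table can be read as a dispatch on the heartbeat response. The sketch below only describes the action rather than performing it. The names `HeartbeatResponse`, `parseHeartbeat`, and `nextAction` are hypothetical, the shape of `override_flags` is not documented here (so it is kept as raw JSON), and the handling of unknown commands is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HeartbeatResponse mirrors the documented heartbeat reply. Stage and
// OverrideFlags are only expected alongside cmd=retry_stage.
type HeartbeatResponse struct {
	State         string          `json:"state"`
	Cmd           string          `json:"cmd"`
	Stage         string          `json:"stage,omitempty"`
	OverrideFlags json.RawMessage `json:"override_flags,omitempty"` // shape unknown; kept raw
}

// parseHeartbeat unmarshals a heartbeat response body.
func parseHeartbeat(body []byte) (HeartbeatResponse, error) {
	var hb HeartbeatResponse
	err := json.Unmarshal(body, &hb)
	return hb, err
}

// nextAction maps a cmd value to a short description of the agent action.
func nextAction(hb HeartbeatResponse) string {
	switch hb.Cmd {
	case "continue":
		return "keep running"
	case "reboot":
		return "reboot"
	case "abort":
		return "stop heartbeat loop"
	case "retry_stage":
		return "retry " + hb.Stage
	case "cancel_stage":
		return "kill stage and power off"
	default:
		// Assumption: an unrecognized cmd is reported, not acted on.
		return "unknown cmd: " + hb.Cmd
	}
}

func main() {
	hb, err := parseHeartbeat([]byte(`{"state":"CPUStress","cmd":"continue"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(nextAction(hb))
}
```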

---

### `POST /api/v1/runs/{id}/log`

Batch of log lines from the agent. Written to a per-run flat file and
fanned out to SSE subscribers.

**Request body:**

```json
{
  "lines": [
    {
      "ts": "2026-04-21T15:32:18.123Z",
      "level": "info",
      "stage": "SMART",
      "text": "smartctl -a /dev/sda: PASSED"
    }
  ]
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `ts` | string | no | RFC 3339 timestamp. Server clock used if empty. |
| `level` | string | no | `info`, `warn`, `error`, `debug`. |
| `stage` | string | no | Stage tag for per-stage SSE fan-out. |
| `text` | string | yes | Log message. |

**Response (200):**

```json
{ "ok": true, "written": 1 }
```

---

### `POST /api/v1/runs/{id}/sensor`

Batch of numeric samples (thermals, fan RPM, PSU rails, iperf
throughput, fio IOPS). Each sample is evaluated against the run's
seeded thresholds; critical breaches fail the run immediately.

**Request body:**

```json
{
  "samples": [
    {
      "ts": "2026-04-21T15:32:18Z",
      "kind": "temp",
      "key": "cpu/0",
      "value": 72.5,
      "unit": "C"
    }
  ]
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `ts` | string | no | RFC 3339 timestamp. Defaults to server-now. |
| `kind` | string | yes | `temp`, `fan`, `psu_volt`, `iperf`, `fio`, `fio_p99_us`, `smart_attr`, `nic_retrans`, `edac_ue`, `edac_ce`, `mce`. |
| `key` | string | yes | Identifies the source (e.g. `cpu/0`, `+12V`, `throughput_mbps`). |
| `value` | float | yes | Numeric sample value. |
| `unit` | string | no | Display unit (e.g. `C`, `V`, `Mbps`). |

**Response (200):**

```json
{
  "ok": true,
  "written": 1,
  "breach": false,
  "breach_kind": ""
}
```

When a critical breach is detected, `breach` is `true` and
`breach_kind` contains a human-readable label like
`"temp cpu/0=92.5 breached lt 92"`. The run transitions to
`FailedHolding`.
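
To make the required and optional sample fields concrete, here is a small sketch that encodes a batch and inspects the response for a breach. The names `Sample`, `SensorResponse`, `encodeBatch`, and `breached` are hypothetical. It also assumes that leaving `ts` and `unit` out of the JSON is treated the same as sending them empty.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Sample mirrors one entry of the documented "samples" array.
type Sample struct {
	TS    string  `json:"ts,omitempty"` // optional; assumed to default to server time when omitted
	Kind  string  `json:"kind"`
	Key   string  `json:"key"`
	Value float64 `json:"value"`
	Unit  string  `json:"unit,omitempty"` // optional display unit
}

// SensorResponse mirrors the documented response body.
type SensorResponse struct {
	OK         bool   `json:"ok"`
	Written    int    `json:"written"`
	Breach     bool   `json:"breach"`
	BreachKind string `json:"breach_kind"`
}

// encodeBatch wraps samples in the {"samples": [...]} envelope.
func encodeBatch(samples []Sample) ([]byte, error) {
	return json.Marshal(map[string][]Sample{"samples": samples})
}

// breached decodes a response body and reports whether a critical
// breach was flagged, together with its label.
func breached(body []byte) (bool, string, error) {
	var r SensorResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return false, "", err
	}
	return r.Breach, r.BreachKind, nil
}

func main() {
	payload, err := encodeBatch([]Sample{{Kind: "temp", Key: "cpu/0", Value: 72.5, Unit: "C"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))

	hit, label, err := breached([]byte(`{"ok":true,"written":1,"breach":true,"breach_kind":"temp cpu/0=92.5 breached lt 92"}`))
	fmt.Println(hit, label, err)
}
```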

---

### `POST /api/v1/runs/{id}/result`

Stage outcome. Drives the state machine forward (pass) or into
`FailedHolding` (fail).

**Request body:**

```json
{
  "stage": "SMART",
  "passed": true,
  "summary": { "disks_checked": 2, "reallocated": 0 },
  "message": "",
  "inventory": null,
  "firmware": [],
  "sub_steps": []
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `stage` | string | yes | Stage name (must match `DefaultStageOrder`). |
| `passed` | bool | yes | `true` = advance; `false` = fail. |
| `summary` | object | no | Arbitrary JSON persisted in `stages.summary_json`. |
| `message` | string | no | Human-readable detail (shown in notifications on failure). |
| `inventory` | object | no | Only set for `stage=Inventory`. Full `spec.Inventory` JSON. |
| `firmware` | array | no | Only set for `stage=Firmware`. Array of firmware snapshots. |
| `sub_steps` | array | no | Per-disk/per-NIC/per-GPU granular results. |

**`firmware[]` shape:**

| Field | Type | Description |
|-------|------|-------------|
| `component` | string | `bios`, `bmc`, `nic`, `hba`, `microcode`, `nvme_fw`. |
| `identifier` | string | Slot, serial, or device path that distinguishes this component. |
| `version` | string | Firmware version string. |
| `vendor` | string | Vendor name (optional). |
| `raw` | map | Additional key-value metadata (optional). |

**`sub_steps[]` shape:**

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Human-readable label (e.g. `sda SMART`, `eth0 iperf`). |
| `passed` | bool | Sub-step result. |
| `skipped` | bool | `true` if the sub-step was skipped (e.g. no GPU). |
| `started_at` | string | RFC 3339 timestamp. |
| `completed_at` | string | RFC 3339 timestamp. |
| `summary` | object | Arbitrary JSON persisted in `sub_steps.summary_json`. |

**Response (200, pass):**

```json
{ "ok": true, "next_state": "CPUStress" }
```

**Response (200, fail):**

```json
{ "ok": true, "next_state": "FailedHolding" }
```

**Response (409, stage mismatch):**

Returned when the agent reports a stage that doesn't match the
orchestrator's expected state. The run is parked in `FailedHolding`.

```json
{ "ok": false, "error": "stage mismatch: got SMART, expected CPUStress" }
```
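
The three documented response shapes could be handled on the client roughly as follows. This is a sketch under assumptions: `ResultResponse` and `interpretResult` are hypothetical names, and treating every other status code as an error is a choice made for this example, not documented behavior.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

// ResultResponse covers both the 200 and the 409 body shapes.
type ResultResponse struct {
	OK        bool   `json:"ok"`
	NextState string `json:"next_state"`
	Error     string `json:"error"`
}

// interpretResult turns a status code and body into the run's next state.
// On 409 the run is parked in FailedHolding and the server's error text
// is returned alongside that state.
func interpretResult(status int, body []byte) (string, error) {
	var r ResultResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", fmt.Errorf("decode result response: %w", err)
	}
	switch status {
	case http.StatusOK:
		return r.NextState, nil
	case http.StatusConflict:
		return "FailedHolding", errors.New(r.Error)
	default:
		// Assumption: any other status is unexpected for this endpoint.
		return "", fmt.Errorf("unexpected status %d", status)
	}
}

func main() {
	state, err := interpretResult(200, []byte(`{"ok":true,"next_state":"CPUStress"}`))
	fmt.Println(state, err)

	state, err = interpretResult(409, []byte(`{"ok":false,"error":"stage mismatch: got SMART, expected CPUStress"}`))
	fmt.Println(state, err)
}
```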

---

### `POST /api/v1/runs/{id}/hold`

Request the per-run SSH key so the operator can SSH into a held host.

**Request body:**

```json
{
  "agent_ip": "192.168.1.42"
}
```

**Response (200):**

```json
{
  "authorized_key": "ssh-ed25519 AAAAC3... vetting-run-42",
  "run_id": 42
}
```

The private key is written to
`artifacts/run-<N>/hold.key` on the orchestrator. The agent installs
the `authorized_key` into `/root/.ssh/authorized_keys` in the live
image.

---

## Host API

LAN-trusted endpoints called by the host-mode agent. No bearer token.
Same threat model as the browser UI.

### `POST /api/v1/hosts`

JSON host registration. Called by the quick-register one-liner.

**Request body:**

```json
{
  "name": "node-01",
  "mac": "aa:bb:cc:dd:ee:ff",
  "wol_broadcast_ip": "192.168.1.255",
  "wol_port": 9,
  "expected_spec_yaml": "memory:\n total_gib: 64\ncpu:\n logical_cores: 16\n",
  "notes": ""
}
```

**Response (201):**

```json
{ "ok": true, "id": 5 }
```

### `POST /api/v1/hosts/{mac}/heartbeat`

Host-mode agent liveness ping. Stamps `hosts.last_seen_at` and
triggers a dashboard tile refresh via SSE.

**Request body:** empty.

**Response (200):**

```json
{ "ok": true }
```

When a run is queued for this host:

```json
{ "ok": true, "cmd": "reboot_for_vetting", "run_id": 42 }
```

The agent reboots the host on receiving `cmd=reboot_for_vetting`.
The `run_id` is informational (for agent logging).

---

## Browser UI routes

No auth. Bind to loopback or LAN only, or front with a reverse proxy.

| Method | Path | Description |
|--------|------|-------------|
| GET | `/` | Dashboard: host tile grid. |
| GET | `/hosts/new` | New host registration form. |
| POST | `/hosts` | Create host (form submission). |
| GET | `/hosts/{id}` | Host detail page (summary, actions, run history). |
| POST | `/hosts/{id}/delete` | Delete host. |
| POST | `/hosts/{id}/start` | Start a vetting run (queue it). |
| POST | `/hosts/{id}/cancel` | Cancel the active run. |
| POST | `/hosts/{id}/override-wipe` | Override the wipe-probe guard and retry Storage. |
| GET | `/runs/{runID}` | Run detail page (stages, spec diffs, pipeline). |
| GET | `/reports/{runID}` | HTML report artifact. |
| GET | `/register/quick.sh` | Quick-register bash one-liner script. |
| GET | `/events` | SSE event stream (browser subscriptions). |

**Static assets:**

| Path | Description |
|------|-------------|
| `/static/*` | Embedded CSS + JS (`internal/web/static/`). |
| `/live/*` | Live image files (vmlinuz + initrd.img) served from `pxe.live_dir`. |
| `/assets/*` | Agent binary served from `agent.asset_dir`. |

---

## SSE events

The browser connects to `GET /events` and receives server-sent events.
Each event has a `name` (the SSE `event:` field) and a `data` payload
containing a pre-rendered HTML fragment with `hx-swap-oob` attributes
that HTMX uses to swap the target DOM element.

### Connection events

| Event name | Payload | Description |
|------------|---------|-------------|
| `hello` | `ok` | Sent immediately on connection. |
| `heartbeat` | `<span data-heartbeat="<unix-ts>"></span>` | 15-second keep-alive. |

### Dashboard events

| Event name | Payload | Description |
|------------|---------|-------------|
| `tile-{hostID}` | Host tile HTML fragment | Refreshed on state transitions, heartbeats, holds. |

### Host detail page events

| Event name | Payload | Description |
|------------|---------|-------------|
| `detail-summary-{hostID}` | Summary section HTML | Host metadata + latest run status. |
| `detail-actions-{hostID}` | Actions row HTML | Start/Cancel/Override buttons. |
| `detail-inflight-{hostID}` | In-flight banner HTML | Active run progress indicator. |
| `runrow-{runID}` | Run history row HTML | Updated when a run completes or fails. |

### Run detail page events

| Event name | Payload | Description |
|------------|---------|-------------|
| `run-header-{runID}` | Run metadata HTML | State, profile, timing. |
| `detail-hold-{runID}` | Hold banner HTML | SSH command + hold IP. |
| `detail-specdiffs-{runID}` | Spec diffs list HTML | Expected-vs-actual divergences. |
| `pipeline-{runID}` | Pipeline dot visualization HTML | Stage progress dots. |
| `substep-{runID}-{stage}-{ordinal}` | Sub-step row HTML | Per-disk, per-NIC, per-GPU detail. |

### Log events

| Event name | Payload | Description |
|------------|---------|-------------|
| `log-{runID}` | Log line HTML | All log lines for a run. |
| `log-{runID}-{stage}` | Log line HTML | Stage-filtered log lines. |
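
For debugging outside the browser, the `/events` stream can be read with a small parser that follows the standard server-sent events framing (`event:` and `data:` lines, with a blank line ending each event). The sketch below is not the orchestrator's code. The names `Event`, `readEvents`, and `parseEventStream` are hypothetical, and `bufio.Scanner`'s default line limit (64 KiB) is assumed to be large enough for the HTML fragments.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// Event is one parsed server-sent event.
type Event struct {
	Name string // value of the "event:" field
	Data string // "data:" lines joined with "\n"
}

// readEvents parses a text/event-stream body. Comment lines and the
// id: and retry: fields are ignored in this sketch.
func readEvents(r io.Reader) ([]Event, error) {
	var events []Event
	var name string
	var data []string
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		switch {
		case line == "":
			// A blank line dispatches the pending event, if it has data.
			if len(data) > 0 {
				events = append(events, Event{Name: name, Data: strings.Join(data, "\n")})
			}
			name, data = "", nil
		case strings.HasPrefix(line, "event:"):
			name = strings.TrimPrefix(strings.TrimPrefix(line, "event:"), " ")
		case strings.HasPrefix(line, "data:"):
			data = append(data, strings.TrimPrefix(strings.TrimPrefix(line, "data:"), " "))
		}
	}
	return events, sc.Err()
}

// parseEventStream is a convenience wrapper for string input.
func parseEventStream(s string) ([]Event, error) {
	return readEvents(strings.NewReader(s))
}

func main() {
	evs, err := parseEventStream("event: hello\ndata: ok\n\n")
	if err != nil {
		panic(err)
	}
	for _, ev := range evs {
		fmt.Printf("%s: %s\n", ev.Name, ev.Data)
	}
}
```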

---

## Authentication

### Agent bearer token lifecycle

1. **Issuance**: when a registered host's iPXE script is fetched
   (`GET /ipxe/{mac}`), the orchestrator generates a random token,
   hashes it with SHA-256, and stores the hash in
   `runs.agent_token_hash`. The plaintext token is embedded in the
   iPXE kernel cmdline as `token=<plaintext>`.

2. **Rotation**: each iPXE fetch rotates the token. Only the most
   recent PXE boot can claim the run.

3. **Verification**: every `/api/v1/runs/{id}/*` endpoint extracts
   the `Bearer` header, SHA-256 hashes it, and compares against the
   stored hash using `crypto/subtle.ConstantTimeCompare`.

4. **Scope**: the token authenticates a single run. It cannot be
   used to access other runs or host-level endpoints.
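
The lifecycle above can be sketched with the standard library alone. This is an illustration under assumptions, not the orchestrator's implementation: the function names are hypothetical, and the 32-byte token length and the hex encoding of both the token and the stored hash are choices made for this example.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex-encoded SHA-256 digest of a token.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// issueToken generates a random plaintext token and the hash that would
// be stored in place of the plaintext (assumed: 32 random bytes, hex).
func issueToken() (plaintext, storedHash string, err error) {
	buf := make([]byte, 32)
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	plaintext = hex.EncodeToString(buf)
	return plaintext, hashToken(plaintext), nil
}

// verifyToken hashes the presented token and compares it with the
// stored hash in constant time.
func verifyToken(presented, storedHash string) bool {
	return subtle.ConstantTimeCompare([]byte(hashToken(presented)), []byte(storedHash)) == 1
}

func main() {
	token, stored, err := issueToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("valid token accepted:", verifyToken(token, stored))

	// Rotation: a second issuance replaces the stored hash, so the
	// earlier token no longer verifies.
	_, rotated, err := issueToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("old token after rotation:", verifyToken(token, rotated))
}
```

Comparing digests rather than plaintext means the stored value never reveals the token, and the constant-time comparison avoids leaking how many leading bytes matched.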

### LAN-trust model

Host-mode endpoints (`POST /api/v1/hosts`, `POST /api/v1/hosts/{mac}/heartbeat`)
and the browser UI have no authentication. They share a LAN-trust
assumption: anything that can reach the orchestrator's bind address is
trusted. To add a password, front the orchestrator with a reverse
proxy (Caddy, nginx, Traefik) that adds basic-auth or OIDC. See
[operations.md § Exposing outside the LAN](operations.md#exposing-outside-the-lan).