# API reference
Complete HTTP API for the vetting orchestrator. Routes are assembled
in `internal/httpserver/router.go`; handler logic lives in
`internal/api/agent_handlers.go` (agent-facing) and
`internal/api/ui_handlers.go` (browser + host-mode).

---
## Agent API
These endpoints are called by the in-image vetting agent during a
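run. Every `/api/v1/runs/{id}/*` request must carry an
`Authorization: Bearer <token>` header; `GET /ipxe/{mac}` is the one
unauthenticated exception. The token is issued per run in the iPXE
kernel cmdline and verified against a SHA-256 hash stored in
`runs.agent_token_hash` (see [Authentication](#authentication)).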
### `GET /ipxe/{mac}`
iPXE chainload script. Called by iPXE itself after dnsmasq hands it
the chainload URL. No auth required — the MAC path parameter is the
key.
**Responses:**
| Scenario | Script |
|----------|--------|
| Known MAC with an active run | Boot script: kernel + initrd + cmdline (run_id, mac, token, orchestrator_url, tls_fpr). Triggers `PXEObserved` transition. |
| Known MAC, no active run | Poweroff script. |
| Unknown MAC | Halt/error script. |
---
### `POST /api/v1/runs/{id}/hello`
First call the agent makes once userspace is up. Idempotent. Writes a
log line; the authoritative transition comes from `/claim`.
**Request body:**
```json
{}
```
**Response (200):**
```json
{ "ok": true, "run_id": 42 }
```
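
As a rough illustration of the bearer-token requirement, the sketch below builds the `/hello` request an agent might send. `NewHelloRequest` and the base URL are hypothetical names chosen for this example; only the path, the empty JSON body, and the `Authorization` header come from this reference.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// NewHelloRequest builds (but does not send) the POST /hello request.
// The helper name and signature are illustrative, not the agent's real code.
func NewHelloRequest(baseURL string, runID int64, token string) (*http.Request, error) {
	url := fmt.Sprintf("%s/api/v1/runs/%d/hello", strings.TrimRight(baseURL, "/"), runID)
	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader("{}"))
	if err != nil {
		return nil, err
	}
	// Every /api/v1/runs/{id}/* call carries the per-run plaintext token.
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Placeholder address; the real one arrives as orchestrator_url on the cmdline.
	req, err := NewHelloRequest("http://orchestrator.example:8080", 42, "example-token")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println(req.Header.Get("Authorization"))
}
```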
---
### `POST /api/v1/runs/{id}/claim`
Binding call: the agent proves it holds the plaintext token for this
run. In return the orchestrator seeds stage rows, transitions to
`InventoryCheck`, and returns the stage list + per-profile config.
Subsequent claims are idempotent (safe after transient network
failures).
**Request body** (`agent_ip` is optional and falls back to `RemoteAddr`):
```json
{
"agent_ip": "192.168.1.42"
}
```
**Response (200):**
```json
{
"ok": true,
"run_id": 42,
"stages": ["Inventory", "Firmware", "SpecValidate", "SMART", "CPUStress",
"Storage", "Network", "Burn", "GPU", "PSU", "Reporting"],
"expected_disks": [
{ "serial": "WD-ABC123", "size_gb": 500 }
],
"iperf_port": 5201,
"non_destructive": false,
"current_state": "InventoryCheck",
"stage_config": {
"profile": "quick",
"stage_timeouts": { "CPUStress": "5m0s", "Storage": "5m0s" },
"cpustress": { "cpu_pass": "2m", "mem_pass": "2m", "edac_poll": "10s" },
"storage": { "mode": "fio_sample", "fio_size": "1GiB", "fio_time": "3m",
"fio_bs": "4k", "fio_rw": "randrw", "verify": "md5" },
"network": { "duration": "60s" },
"burn": { "duration": "2m", "cpu_workers": "all", "mem_pct": 50,
"fio_on_spare": true, "iperf_parallel": 2 }
}
}
```
**`stage_config` shape:**
| Field | Type | Description |
|-------|------|-------------|
| `profile` | string | `quick`, `deep`, or `soak`. |
| `stage_timeouts` | map[string]string | Per-stage timeout durations (Go duration strings). |
| `cpustress.cpu_pass` | string | stress-ng CPU pass duration. |
| `cpustress.mem_pass` | string | stress-ng memory pass duration. |
| `cpustress.edac_poll` | string | EDAC error counter polling interval. |
| `storage.mode` | string | `fio_sample` (skip badblocks) or `full_disk`. |
| `storage.fio_size` | string | fio test file size (fio_sample mode only). |
| `storage.fio_time` | string | fio runtime. |
| `storage.fio_bs` | string | fio block size. |
| `storage.fio_rw` | string | fio I/O pattern. |
| `storage.verify` | string | fio integrity mode (`md5` or empty). |
| `network.duration` | string | iperf3 test duration. |
| `burn.duration` | string | Total burn-in window. |
| `burn.cpu_workers` | string | `all` or a numeric string. |
| `burn.mem_pct` | int | Percentage of MemAvailable to stress. |
| `burn.fio_on_spare` | bool | Run fio inside Burn. |
| `burn.iperf_parallel` | int | iperf3 parallel stream count. |
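
Because the durations in `stage_config` are Go duration strings, a consumer would typically decode them as strings and convert them with `time.ParseDuration`. The sketch below covers only a subset of the fields; the `StageConfig` type and helper names are assumptions made for illustration and may not match the agent's own types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// StageConfig mirrors a subset of the documented stage_config object.
// Type and field names are illustrative.
type StageConfig struct {
	Profile       string            `json:"profile"`
	StageTimeouts map[string]string `json:"stage_timeouts"`
	Burn          struct {
		Duration      string `json:"duration"`
		CPUWorkers    string `json:"cpu_workers"`
		MemPct        int    `json:"mem_pct"`
		FioOnSpare    bool   `json:"fio_on_spare"`
		IperfParallel int    `json:"iperf_parallel"`
	} `json:"burn"`
}

// DecodeStageConfig parses the JSON value of the stage_config key.
func DecodeStageConfig(data []byte) (StageConfig, error) {
	var cfg StageConfig
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

// ParseStageTimeouts converts the duration strings into time.Duration values.
func ParseStageTimeouts(cfg StageConfig) (map[string]time.Duration, error) {
	out := make(map[string]time.Duration, len(cfg.StageTimeouts))
	for stage, raw := range cfg.StageTimeouts {
		d, err := time.ParseDuration(raw)
		if err != nil {
			return nil, fmt.Errorf("stage %q: %w", stage, err)
		}
		out[stage] = d
	}
	return out, nil
}

func main() {
	raw := []byte(`{"profile":"quick","stage_timeouts":{"CPUStress":"5m0s"},
		"burn":{"duration":"2m","cpu_workers":"all","mem_pct":50,
		"fio_on_spare":true,"iperf_parallel":2}}`)
	cfg, err := DecodeStageConfig(raw)
	if err != nil {
		panic(err)
	}
	timeouts, err := ParseStageTimeouts(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.Profile, timeouts["CPUStress"], cfg.Burn.MemPct)
}
```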
---
### `POST /api/v1/runs/{id}/heartbeat`
Periodic liveness ping. The response body acts as a control channel.
**Request body:**
```json
{}
```
**Response (200):**
```json
{
"state": "CPUStress",
"cmd": "continue"
}
```
**`cmd` values:**
| cmd | When | Agent action |
|-----|------|--------------|
| `continue` | Normal case (including FailedHolding) | No-op; keep running current stage or wait for override. |
| `reboot` | Run reached `Completed` | `systemctl reboot` (falls through iPXE to local disk). |
| `abort` | Run in `Released` | Stop heartbeat loop. |
| `retry_stage` | Operator pressed "Override wipe & retry" | Re-enter the named stage with override flags. Response includes `stage` and `override_flags`. |
| `cancel_stage` | Operator clicked Cancel mid-stage | Kill running stage subprocess, then power off. |
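
One way an agent could act on this control channel is a small dispatch over `cmd`. The following is a sketch under assumptions: the `Action` constants and `Decide` are hypothetical, the shape of `override_flags` is not specified here (so it is kept as raw JSON), and rejecting unknown commands is a choice made for the example rather than documented behavior.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// HeartbeatResponse mirrors the documented heartbeat response body.
type HeartbeatResponse struct {
	State         string          `json:"state"`
	Cmd           string          `json:"cmd"`
	Stage         string          `json:"stage,omitempty"`
	OverrideFlags json.RawMessage `json:"override_flags,omitempty"` // shape undocumented
}

// Action is a hypothetical agent-side interpretation of cmd.
type Action string

const (
	ActionKeepGoing   Action = "keep-going"
	ActionReboot      Action = "reboot"
	ActionStopLoop    Action = "stop-heartbeat-loop"
	ActionRetryStage  Action = "retry-stage"
	ActionCancelStage Action = "cancel-stage-and-poweroff"
)

// Decide maps a heartbeat response body to the action listed in the cmd table.
func Decide(body []byte) (Action, HeartbeatResponse, error) {
	var hb HeartbeatResponse
	if err := json.Unmarshal(body, &hb); err != nil {
		return "", hb, fmt.Errorf("decode heartbeat: %w", err)
	}
	switch hb.Cmd {
	case "continue":
		return ActionKeepGoing, hb, nil
	case "reboot":
		return ActionReboot, hb, nil
	case "abort":
		return ActionStopLoop, hb, nil
	case "retry_stage":
		if hb.Stage == "" {
			return "", hb, errors.New("retry_stage without a stage name")
		}
		return ActionRetryStage, hb, nil
	case "cancel_stage":
		return ActionCancelStage, hb, nil
	default:
		// Assumption: an unrecognized cmd is surfaced as an error.
		return "", hb, fmt.Errorf("unknown cmd %q", hb.Cmd)
	}
}

func main() {
	action, hb, err := Decide([]byte(`{"state":"CPUStress","cmd":"continue"}`))
	fmt.Println(action, hb.State, err)
}
```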
---
### `POST /api/v1/runs/{id}/log`
Batch of log lines from the agent. Written to per-run flat file and
fanned out to SSE subscribers.
**Request body:**
```json
{
"lines": [
{
"ts": "2026-04-21T15:32:18.123Z",
"level": "info",
"stage": "SMART",
"text": "smartctl -a /dev/sda: PASSED"
}
]
}
```
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `ts` | string | no | RFC 3339 timestamp. Server clock used if empty. |
| `level` | string | no | `info`, `warn`, `error`, `debug`. |
| `stage` | string | no | Stage tag for per-stage SSE fan-out. |
| `text` | string | yes | Log message. |
**Response (200):**
```json
{ "ok": true, "written": 1 }
```
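
Since only `text` is required, a client can leave the other fields out and let the server fill in the timestamp. A minimal sketch of building such a batch follows; `LogLine`, `LogBatch`, and `EncodeBatch` are hypothetical names, and rejecting empty `text` on the client side is an assumption of this example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// LogLine mirrors one entry of the documented "lines" array.
// Optional fields use omitempty so they are left off the wire when unset.
type LogLine struct {
	TS    string `json:"ts,omitempty"`
	Level string `json:"level,omitempty"`
	Stage string `json:"stage,omitempty"`
	Text  string `json:"text"`
}

// LogBatch is the request body for POST /log.
type LogBatch struct {
	Lines []LogLine `json:"lines"`
}

// EncodeBatch validates the required field and marshals the batch.
func EncodeBatch(lines []LogLine) ([]byte, error) {
	for i, l := range lines {
		if l.Text == "" {
			return nil, errors.New(fmt.Sprintf("line %d: text is required", i))
		}
	}
	return json.Marshal(LogBatch{Lines: lines})
}

func main() {
	ts := time.Date(2026, 4, 21, 15, 32, 18, 123000000, time.UTC)
	body, err := EncodeBatch([]LogLine{{
		TS:    ts.Format(time.RFC3339Nano),
		Level: "info",
		Stage: "SMART",
		Text:  "smartctl -a /dev/sda: PASSED",
	}})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```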
---
### `POST /api/v1/runs/{id}/sensor`
Batch of numeric samples (thermals, fan RPM, PSU rails, iperf
throughput, fio IOPS). Each sample is evaluated against the run's
seeded thresholds — critical breaches fail the run immediately.
**Request body:**
```json
{
"samples": [
{
"ts": "2026-04-21T15:32:18Z",
"kind": "temp",
"key": "cpu/0",
"value": 72.5,
"unit": "C"
}
]
}
```
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `ts` | string | no | RFC 3339 timestamp. Defaults to server-now. |
| `kind` | string | yes | `temp`, `fan`, `psu_volt`, `iperf`, `fio`, `fio_p99_us`, `smart_attr`, `nic_retrans`, `edac_ue`, `edac_ce`, `mce`. |
| `key` | string | yes | Identifies the source (e.g. `cpu/0`, `+12V`, `throughput_mbps`). |
| `value` | float | yes | Numeric sample value. |
| `unit` | string | no | Display unit (e.g. `C`, `V`, `Mbps`). |
**Response (200):**
```json
{
"ok": true,
"written": 1,
"breach": false,
"breach_kind": ""
}
```
When a critical breach is detected, `breach` is `true` and
`breach_kind` contains a human-readable label like
`"temp cpu/0=92.5 breached lt 92"`. The run transitions to
`FailedHolding`.
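
A client may want to validate samples before sending them and treat a reported breach as a signal to stop the current stage. The sketch below is one possible shape: `Sample.Validate`, `BreachError`, and `CheckResponse` are hypothetical helpers, and stopping on breach is an assumption about agent behavior; only the field names and `kind` values are taken from the tables above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// knownKinds lists the kind values documented for sensor samples.
var knownKinds = map[string]bool{
	"temp": true, "fan": true, "psu_volt": true, "iperf": true, "fio": true,
	"fio_p99_us": true, "smart_attr": true, "nic_retrans": true,
	"edac_ue": true, "edac_ce": true, "mce": true,
}

// Sample mirrors one entry of the documented "samples" array.
type Sample struct {
	TS    string  `json:"ts,omitempty"`
	Kind  string  `json:"kind"`
	Key   string  `json:"key"`
	Value float64 `json:"value"`
	Unit  string  `json:"unit,omitempty"`
}

// Validate checks the required fields before the sample is sent.
func (s Sample) Validate() error {
	if !knownKinds[s.Kind] {
		return fmt.Errorf("unknown kind %q", s.Kind)
	}
	if s.Key == "" {
		return errors.New("key is required")
	}
	return nil
}

// SensorResponse mirrors the documented response body.
type SensorResponse struct {
	OK         bool   `json:"ok"`
	Written    int    `json:"written"`
	Breach     bool   `json:"breach"`
	BreachKind string `json:"breach_kind"`
}

// BreachError carries the breach_kind label of a critical breach.
type BreachError struct{ Kind string }

func (e *BreachError) Error() string { return "critical breach: " + e.Kind }

// CheckResponse returns the written count and a *BreachError on breach.
func CheckResponse(body []byte) (int, error) {
	var r SensorResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return 0, fmt.Errorf("decode sensor response: %w", err)
	}
	if r.Breach {
		return r.Written, &BreachError{Kind: r.BreachKind}
	}
	return r.Written, nil
}

func main() {
	s := Sample{Kind: "temp", Key: "cpu/0", Value: 72.5, Unit: "C"}
	fmt.Println(s.Validate())
	n, err := CheckResponse([]byte(`{"ok":true,"written":1,"breach":false,"breach_kind":""}`))
	fmt.Println(n, err)
}
```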
---
### `POST /api/v1/runs/{id}/result`
Stage outcome. Drives the state machine forward (pass) or into
`FailedHolding` (fail).
**Request body:**
```json
{
"stage": "SMART",
"passed": true,
"summary": { "disks_checked": 2, "reallocated": 0 },
"message": "",
"inventory": null,
"firmware": [],
"sub_steps": []
}
```
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `stage` | string | yes | Stage name (must match `DefaultStageOrder`). |
| `passed` | bool | yes | `true` = advance; `false` = fail. |
| `summary` | object | no | Arbitrary JSON persisted in `stages.summary_json`. |
| `message` | string | no | Human-readable detail (shown in notifications on failure). |
| `inventory` | object | no | Only set for `stage=Inventory`. Full `spec.Inventory` JSON. |
| `firmware` | array | no | Only set for `stage=Firmware`. Array of firmware snapshots. |
| `sub_steps` | array | no | Per-disk/per-NIC/per-GPU granular results. |
**`firmware[]` shape:**
| Field | Type | Description |
|-------|------|-------------|
| `component` | string | `bios`, `bmc`, `nic`, `hba`, `microcode`, `nvme_fw`. |
| `identifier` | string | Slot, serial, or device path that distinguishes this component. |
| `version` | string | Firmware version string. |
| `vendor` | string | Vendor name (optional). |
| `raw` | map | Additional key-value metadata (optional). |
**`sub_steps[]` shape:**
| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Human-readable label (e.g. `sda SMART`, `eth0 iperf`). |
| `passed` | bool | Sub-step result. |
| `skipped` | bool | `true` if the sub-step was skipped (e.g. no GPU). |
| `started_at` | string | RFC 3339 timestamp. |
| `completed_at` | string | RFC 3339 timestamp. |
| `summary` | object | Arbitrary JSON persisted in `sub_steps.summary_json`. |
**Response (200, pass):**
```json
{ "ok": true, "next_state": "CPUStress" }
```
**Response (200, fail):**
```json
{ "ok": true, "next_state": "FailedHolding" }
```
**Response (409, stage mismatch):**
Returned when the agent reports a stage that doesn't match the
orchestrator's expected state. The run is parked in `FailedHolding`.
```json
{ "ok": false, "error": "stage mismatch: got SMART, expected CPUStress" }
```
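
The three documented outcomes can be told apart by status code plus body. A minimal sketch, assuming a hypothetical `InterpretResult` helper on the agent side; treating any other status as an error is a choice made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// ResultResponse covers both the 200 and the 409 bodies shown above.
type ResultResponse struct {
	OK        bool   `json:"ok"`
	NextState string `json:"next_state"`
	Error     string `json:"error"`
}

// InterpretResult returns the state the run is now in, and an error when
// the orchestrator rejected the report.
func InterpretResult(status int, body []byte) (string, error) {
	switch status {
	case http.StatusOK, http.StatusConflict:
		var r ResultResponse
		if err := json.Unmarshal(body, &r); err != nil {
			return "", fmt.Errorf("decode result response: %w", err)
		}
		if status == http.StatusConflict {
			// Per this reference, a stage mismatch parks the run in FailedHolding.
			return "FailedHolding", fmt.Errorf("result rejected: %s", r.Error)
		}
		return r.NextState, nil
	default:
		return "", fmt.Errorf("unexpected status %d", status)
	}
}

func main() {
	next, err := InterpretResult(200, []byte(`{"ok":true,"next_state":"CPUStress"}`))
	fmt.Println(next, err)
	next, err = InterpretResult(409, []byte(`{"ok":false,"error":"stage mismatch: got SMART, expected CPUStress"}`))
	fmt.Println(next, err)
}
```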
---
### `POST /api/v1/runs/{id}/hold`
Request the per-run SSH key so the operator can SSH into a held host.
**Request body:**
```json
{
"agent_ip": "192.168.1.42"
}
```
**Response (200):**
```json
{
"authorized_key": "ssh-ed25519 AAAAC3... vetting-run-42",
"run_id": 42
}
```
The private key is written to
`artifacts/run-<N>/hold.key` on the orchestrator. The agent installs
the `authorized_key` into `/root/.ssh/authorized_keys` in the live
image.

---
## Host API
LAN-trusted endpoints called by the host-mode agent. No bearer token.
Same threat model as the browser UI.
### `POST /api/v1/hosts`
JSON host registration. Called by the quick-register one-liner.
**Request body:**
```json
{
"name": "node-01",
"mac": "aa:bb:cc:dd:ee:ff",
"wol_broadcast_ip": "192.168.1.255",
"wol_port": 9,
"expected_spec_yaml": "memory:\n total_gib: 64\ncpu:\n logical_cores: 16\n",
"notes": ""
}
```
**Response (201):**
```json
{ "ok": true, "id": 5 }
```
### `POST /api/v1/hosts/{mac}/heartbeat`
Host-mode agent liveness ping. Stamps `hosts.last_seen_at` and
triggers a dashboard tile refresh via SSE.
**Request body:** empty.
**Response (200):**
```json
{ "ok": true }
```
When a run is queued for this host:
```json
{ "ok": true, "cmd": "reboot_for_vetting", "run_id": 42 }
```
The agent reboots the host on receiving `cmd=reboot_for_vetting`.
The `run_id` is informational (for agent logging).
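
On the host-mode agent side, the decision reduces to checking for `cmd=reboot_for_vetting`. The sketch below assumes a hypothetical `ShouldRebootForVetting` helper; how the reboot itself is performed is out of scope here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HostHeartbeat mirrors the two documented response shapes.
// Cmd and RunID are absent when no run is queued.
type HostHeartbeat struct {
	OK    bool   `json:"ok"`
	Cmd   string `json:"cmd,omitempty"`
	RunID int64  `json:"run_id,omitempty"`
}

// ShouldRebootForVetting reports whether the host should reboot, plus the
// informational run ID for logging.
func ShouldRebootForVetting(body []byte) (bool, int64, error) {
	var hb HostHeartbeat
	if err := json.Unmarshal(body, &hb); err != nil {
		return false, 0, fmt.Errorf("decode host heartbeat: %w", err)
	}
	return hb.OK && hb.Cmd == "reboot_for_vetting", hb.RunID, nil
}

func main() {
	reboot, runID, err := ShouldRebootForVetting([]byte(`{"ok":true,"cmd":"reboot_for_vetting","run_id":42}`))
	fmt.Println(reboot, runID, err)
}
```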
---
## Browser UI routes
No auth. Bind to loopback or LAN only, or front with a reverse proxy.
| Method | Path | Description |
|--------|------|-------------|
| GET | `/` | Dashboard — host tile grid. |
| GET | `/hosts/new` | New host registration form. |
| POST | `/hosts` | Create host (form submission). |
| GET | `/hosts/{id}` | Host detail page (summary, actions, run history). |
| POST | `/hosts/{id}/delete` | Delete host. |
| POST | `/hosts/{id}/start` | Start a vetting run (queue it). |
| POST | `/hosts/{id}/cancel` | Cancel the active run. |
| POST | `/hosts/{id}/override-wipe` | Override the wipe-probe guard and retry Storage. |
| GET | `/runs/{runID}` | Run detail page (stages, spec diffs, pipeline). |
| GET | `/reports/{runID}` | HTML report artifact. |
| GET | `/register/quick.sh` | Quick-register bash one-liner script. |
| GET | `/events` | SSE event stream (browser subscriptions). |
**Static assets:**
| Path | Description |
|------|-------------|
| `/static/*` | Embedded CSS + JS (`internal/web/static/`). |
| `/live/*` | Live image files (vmlinuz + initrd.img) served from `pxe.live_dir`. |
| `/assets/*` | Agent binary served from `agent.asset_dir`. |
---
## SSE events
The browser connects to `GET /events` and receives server-sent events.
Each event has a `name` (the SSE `event:` field) and a `data` payload
containing a pre-rendered HTML fragment with `hx-swap-oob` attributes
that HTMX uses to swap the target DOM element.
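
The browser consumes this stream through HTMX, but a non-browser client (for example a test harness) could read it with a small parser. The sketch below follows the standard `event:` / `data:` framing of server-sent events; `ReadEvents` and `ParseString` are hypothetical helpers, and the 1 MiB line limit is an arbitrary choice for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// Event is one dispatched server-sent event.
type Event struct {
	Name string
	Data string
}

// ReadEvents parses an SSE stream and calls handle for each complete event.
func ReadEvents(r io.Reader, handle func(Event)) error {
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow long HTML fragments
	name := ""
	var data []string
	for sc.Scan() {
		line := sc.Text()
		if line == "" { // a blank line terminates the event
			if len(data) > 0 {
				if name == "" {
					name = "message" // SSE default event name
				}
				handle(Event{Name: name, Data: strings.Join(data, "\n")})
			}
			name, data = "", nil
			continue
		}
		if strings.HasPrefix(line, ":") {
			continue // comment line
		}
		field, value, _ := strings.Cut(line, ":")
		value = strings.TrimPrefix(value, " ")
		switch field {
		case "event":
			name = value
		case "data":
			data = append(data, value)
		}
	}
	return sc.Err()
}

// ParseString is a convenience wrapper for in-memory streams.
func ParseString(stream string) []Event {
	var out []Event
	_ = ReadEvents(strings.NewReader(stream), func(e Event) { out = append(out, e) })
	return out
}

func main() {
	for _, e := range ParseString("event: hello\ndata: ok\n\n") {
		fmt.Printf("%s: %s\n", e.Name, e.Data)
	}
}
```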
### Connection events
| Event name | Payload | Description |
|------------|---------|-------------|
| `hello` | `ok` | Sent immediately on connection. |
| `heartbeat` | `<span data-heartbeat="<unix-ts>"></span>` | 15-second keep-alive. |
### Dashboard events
| Event name | Payload | Description |
|------------|---------|-------------|
| `tile-{hostID}` | Host tile HTML fragment | Refreshed on state transitions, heartbeats, holds. |
### Host detail page events
| Event name | Payload | Description |
|------------|---------|-------------|
| `detail-summary-{hostID}` | Summary section HTML | Host metadata + latest run status. |
| `detail-actions-{hostID}` | Actions row HTML | Start/Cancel/Override buttons. |
| `detail-inflight-{hostID}` | In-flight banner HTML | Active run progress indicator. |
| `runrow-{runID}` | Run history row HTML | Updated when a run completes or fails. |
### Run detail page events
| Event name | Payload | Description |
|------------|---------|-------------|
| `run-header-{runID}` | Run metadata HTML | State, profile, timing. |
| `detail-hold-{runID}` | Hold banner HTML | SSH command + hold IP. |
| `detail-specdiffs-{runID}` | Spec diffs list HTML | Expected-vs-actual divergences. |
| `pipeline-{runID}` | Pipeline dot visualization HTML | Stage progress dots. |
| `substep-{runID}-{stage}-{ordinal}` | Sub-step row HTML | Per-disk, per-NIC, per-GPU detail. |
### Log events
| Event name | Payload | Description |
|------------|---------|-------------|
| `log-{runID}` | Log line HTML | All log lines for a run. |
| `log-{runID}-{stage}` | Log line HTML | Stage-filtered log lines. |
---
## Authentication
### Agent bearer token lifecycle
1. **Issuance**: when a registered host's iPXE script is fetched
(`GET /ipxe/{mac}`), the orchestrator generates a random token,
hashes it with SHA-256, and stores the hash in
`runs.agent_token_hash`. The plaintext token is embedded in the
iPXE kernel cmdline as `token=<plaintext>`.
2. **Rotation**: each iPXE fetch rotates the token, so only the most
recent PXE boot can claim the run.
3. **Verification**: every `/api/v1/runs/{id}/*` endpoint extracts
the bearer token from the `Authorization` header, hashes it with
SHA-256, and compares the result against the stored hash using
`crypto/subtle.ConstantTimeCompare`.
4. **Scope**: the token authenticates a single run. It cannot be
used to access other runs or host-level endpoints.
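
The issuance and verification steps can be sketched as follows. This is an illustration under assumptions, not the orchestrator's code: the 32-byte token length, the hex encoding of both token and hash, and the names `NewToken` and `Verify` are all invented for the example. Only the use of SHA-256 and `crypto/subtle.ConstantTimeCompare` comes from the lifecycle described above.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// hashToken returns the hex-encoded SHA-256 digest of a plaintext token.
func hashToken(plaintext string) string {
	sum := sha256.Sum256([]byte(plaintext))
	return hex.EncodeToString(sum[:])
}

// NewToken generates a random token and the hash that would be stored.
// Token length and encoding are assumptions of this sketch.
func NewToken() (plaintext, hash string, err error) {
	buf := make([]byte, 32)
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	plaintext = hex.EncodeToString(buf)
	return plaintext, hashToken(plaintext), nil
}

// Verify checks an Authorization header value against the stored hash.
func Verify(authHeader, storedHash string) bool {
	token, ok := strings.CutPrefix(authHeader, "Bearer ")
	if !ok || token == "" {
		return false
	}
	got := hashToken(token)
	// Constant-time comparison avoids leaking how many bytes matched.
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	plaintext, hash, err := NewToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(Verify("Bearer "+plaintext, hash)) // the token just issued
	fmt.Println(Verify("Bearer stale-token", hash)) // e.g. a token from before rotation
}
```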
### LAN-trust model
Host-mode endpoints (`POST /api/v1/hosts`, `POST /api/v1/hosts/{mac}/heartbeat`)
and the browser UI have no authentication. They share a LAN-trust
assumption: anything that can reach the orchestrator's bind address is
trusted. To add a password, front the orchestrator with a reverse
proxy (Caddy, nginx, Traefik) that adds basic-auth or OIDC. See
[operations.md § Exposing outside the LAN](operations.md#exposing-outside-the-lan).