API reference
Complete HTTP API for the vetting orchestrator. Routes are assembled
in internal/httpserver/router.go; handler logic lives in
internal/api/agent_handlers.go (agent-facing) and
internal/api/ui_handlers.go (browser + host-mode).
Agent API
These endpoints are called by the in-image vetting agent during a
run. Every /api/v1/runs/{id}/* request must carry an
Authorization: Bearer <token> header. The token is issued per-run in
the iPXE kernel cmdline and verified against a SHA-256 hash stored in
runs.agent_token_hash.
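As a rough illustration of the header requirement above, the following sketch builds an authenticated agent request. The helper name NewAgentRequest, the example base URL, and the Content-Type header are assumptions made for this example; only the path layout and the Bearer header come from this reference.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// NewAgentRequest is a hypothetical helper (not part of the orchestrator)
// that builds a POST to /api/v1/runs/{id}/{endpoint} carrying the per-run
// bearer token.
func NewAgentRequest(baseURL string, runID int64, endpoint, token string, payload interface{}) (*http.Request, error) {
	body, err := json.Marshal(payload)
	if err != nil {
		return nil, fmt.Errorf("encode payload: %w", err)
	}
	url := fmt.Sprintf("%s/api/v1/runs/%d/%s", strings.TrimRight(baseURL, "/"), runID, endpoint)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	// Assumption: the server expects JSON bodies to be labelled as such.
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// The URL and token below are placeholders.
	req, err := NewAgentRequest("https://orchestrator.example:8443", 42, "hello", "s3cret", struct{}{})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```

The returned request can be passed to any http.Client; the caller stays responsible for TLS pinning against the tls_fpr value from the kernel cmdline.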
GET /ipxe/{mac}
iPXE chainload script. Called by iPXE itself after dnsmasq hands it the chainload URL. No auth required — the MAC path parameter is the key.
Responses:
| Scenario | Script |
|---|---|
| Known MAC with an active run | Boot script: kernel + initrd + cmdline (run_id, mac, token, orchestrator_url, tls_fpr). Triggers PXEObserved transition. |
| Known MAC, no active run | Poweroff script. |
| Unknown MAC | Halt/error script. |
POST /api/v1/runs/{id}/hello
First call the agent makes once userspace is up. Idempotent. Writes a
log line; the authoritative transition comes from /claim.
Request body:
{}
Response (200):
{ "ok": true, "run_id": 42 }
POST /api/v1/runs/{id}/claim
Binding call: the agent proves it holds the plaintext token for this
run. In return the orchestrator seeds stage rows, transitions to
InventoryCheck, and returns the stage list + per-profile config.
Subsequent claims are idempotent (safe after transient network
failures).
Request body:
{
"agent_ip": "192.168.1.42" // optional; falls back to RemoteAddr
}
Response (200):
{
"ok": true,
"run_id": 42,
"stages": ["Inventory", "Firmware", "SpecValidate", "SMART", "CPUStress",
"Storage", "Network", "Burn", "GPU", "PSU", "Reporting"],
"expected_disks": [
{ "serial": "WD-ABC123", "size_gb": 500 }
],
"iperf_port": 5201,
"non_destructive": false,
"current_state": "InventoryCheck",
"stage_config": {
"profile": "quick",
"stage_timeouts": { "CPUStress": "5m0s", "Storage": "5m0s" },
"cpustress": { "cpu_pass": "2m", "mem_pass": "2m", "edac_poll": "10s" },
"storage": { "mode": "fio_sample", "fio_size": "1GiB", "fio_time": "3m",
"fio_bs": "4k", "fio_rw": "randrw", "verify": "md5" },
"network": { "duration": "60s" },
"burn": { "duration": "2m", "cpu_workers": "all", "mem_pct": 50,
"fio_on_spare": true, "iperf_parallel": 2 }
}
}
stage_config shape:
| Field | Type | Description |
|---|---|---|
| profile | string | quick, deep, or soak. |
| stage_timeouts | map[string]string | Per-stage timeout durations (Go duration strings). |
| cpustress.cpu_pass | string | stress-ng CPU pass duration. |
| cpustress.mem_pass | string | stress-ng memory pass duration. |
| cpustress.edac_poll | string | EDAC error counter polling interval. |
| storage.mode | string | fio_sample (skip badblocks) or full_disk. |
| storage.fio_size | string | fio test file size (fio_sample mode only). |
| storage.fio_time | string | fio runtime. |
| storage.fio_bs | string | fio block size. |
| storage.fio_rw | string | fio I/O pattern. |
| storage.verify | string | fio integrity mode (md5 or empty). |
| network.duration | string | iperf3 test duration. |
| burn.duration | string | Total burn-in window. |
| burn.cpu_workers | string | all or a numeric string. |
| burn.mem_pct | int | Percentage of MemAvailable to stress. |
| burn.fio_on_spare | bool | Run fio inside Burn. |
| burn.iperf_parallel | int | iperf3 parallel stream count. |
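Because stage_timeouts carries Go duration strings, an agent would typically convert them before arming timers. A minimal sketch, assuming a hand-written StageConfig type that mirrors only a subset of the fields above (the type and function names are illustrative, not the orchestrator's own):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// StageConfig mirrors part of the stage_config object from the claim
// response. It is a hypothetical client-side type.
type StageConfig struct {
	Profile       string            `json:"profile"`
	StageTimeouts map[string]string `json:"stage_timeouts"`
}

// ParseStageTimeouts decodes stage_config JSON and converts every
// stage_timeouts entry with time.ParseDuration.
func ParseStageTimeouts(raw []byte) (map[string]time.Duration, error) {
	var cfg StageConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("decode stage_config: %w", err)
	}
	out := make(map[string]time.Duration, len(cfg.StageTimeouts))
	for stage, s := range cfg.StageTimeouts {
		d, err := time.ParseDuration(s)
		if err != nil {
			return nil, fmt.Errorf("stage %s: %w", stage, err)
		}
		out[stage] = d
	}
	return out, nil
}

func main() {
	raw := []byte(`{"profile":"quick","stage_timeouts":{"CPUStress":"5m0s","Storage":"5m0s"}}`)
	timeouts, err := ParseStageTimeouts(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(timeouts["CPUStress"]) // prints 5m0s
}
```

Failing on the first malformed duration, rather than skipping it, keeps a bad profile from silently running a stage with no timeout.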
POST /api/v1/runs/{id}/heartbeat
Periodic liveness ping. The response body acts as a control channel.
Request body:
{}
Response (200):
{
"state": "CPUStress",
"cmd": "continue"
}
cmd values:
| cmd | When | Agent action |
|---|---|---|
| continue | Normal case (including FailedHolding) | No-op; keep running current stage or wait for override. |
| reboot | Run reached Completed | systemctl reboot (falls through iPXE to local disk). |
| abort | Run in Released | Stop heartbeat loop. |
| retry_stage | Operator pressed "Override wipe & retry" | Re-enter the named stage with override flags. Response includes stage and override_flags. |
| cancel_stage | Operator clicked Cancel mid-stage | Kill running stage subprocess, then power off. |
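The cmd table above maps naturally onto a switch in the agent's heartbeat loop. The sketch below is one possible shape, not the agent's actual code: the Action enum and Decide function are hypothetical, and override_flags is kept as raw JSON because its type is not specified here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HeartbeatResponse models the documented response body.
type HeartbeatResponse struct {
	State         string          `json:"state"`
	Cmd           string          `json:"cmd"`
	Stage         string          `json:"stage,omitempty"`
	OverrideFlags json.RawMessage `json:"override_flags,omitempty"`
}

// Action is a hypothetical agent-side enum for the five cmd values.
type Action int

const (
	KeepRunning Action = iota
	Reboot
	StopHeartbeat
	RetryStage
	CancelAndPowerOff
)

// Decide translates a heartbeat response body into an Action.
func Decide(body []byte) (Action, error) {
	var r HeartbeatResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return KeepRunning, fmt.Errorf("decode heartbeat: %w", err)
	}
	switch r.Cmd {
	case "continue":
		return KeepRunning, nil
	case "reboot":
		return Reboot, nil
	case "abort":
		return StopHeartbeat, nil
	case "retry_stage":
		if r.Stage == "" {
			return KeepRunning, fmt.Errorf("retry_stage without a stage name")
		}
		return RetryStage, nil
	case "cancel_stage":
		return CancelAndPowerOff, nil
	default:
		// Unknown commands are treated as an error, not silently ignored.
		return KeepRunning, fmt.Errorf("unknown cmd %q", r.Cmd)
	}
}

func main() {
	action, err := Decide([]byte(`{"state":"CPUStress","cmd":"continue"}`))
	fmt.Println(action == KeepRunning, err)
}
```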
POST /api/v1/runs/{id}/log
Batch of log lines from the agent. Written to per-run flat file and fanned out to SSE subscribers.
Request body:
{
"lines": [
{
"ts": "2026-04-21T15:32:18.123Z",
"level": "info",
"stage": "SMART",
"text": "smartctl -a /dev/sda: PASSED"
}
]
}
| Field | Type | Required | Description |
|---|---|---|---|
| ts | string | no | RFC 3339 timestamp. Server clock used if empty. |
| level | string | no | info, warn, error, debug. |
| stage | string | no | Stage tag for per-stage SSE fan-out. |
| text | string | yes | Log message. |
Response (200):
{ "ok": true, "written": 1 }
POST /api/v1/runs/{id}/sensor
Batch of numeric samples (thermals, fan RPM, PSU rails, iperf throughput, fio IOPS). Each sample is evaluated against the run's seeded thresholds — critical breaches fail the run immediately.
Request body:
{
"samples": [
{
"ts": "2026-04-21T15:32:18Z",
"kind": "temp",
"key": "cpu/0",
"value": 72.5,
"unit": "C"
}
]
}
| Field | Type | Required | Description |
|---|---|---|---|
| ts | string | no | RFC 3339 timestamp. Defaults to server-now. |
| kind | string | yes | temp, fan, psu_volt, iperf, fio, fio_p99_us, smart_attr, nic_retrans, edac_ue, edac_ce, mce. |
| key | string | yes | Identifies the source (e.g. cpu/0, +12V, throughput_mbps). |
| value | float | yes | Numeric sample value. |
| unit | string | no | Display unit (e.g. C, V, Mbps). |
Response (200):
{
"ok": true,
"written": 1,
"breach": false,
"breach_kind": ""
}
When a critical breach is detected, breach is true and
breach_kind contains a human-readable label like
"temp cpu/0=92.5 breached lt 92". The run transitions to
FailedHolding.
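A client-side sketch of the sensor exchange follows: it encodes a batch with the optional fields omitted when empty, and turns a breach response into an error the agent can act on. The type and function names (Sample, EncodeBatch, CheckBreach) are assumptions for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Sample mirrors one entry of the samples array. ts and unit are optional,
// so they are dropped from the JSON when empty.
type Sample struct {
	TS    string  `json:"ts,omitempty"`
	Kind  string  `json:"kind"`
	Key   string  `json:"key"`
	Value float64 `json:"value"`
	Unit  string  `json:"unit,omitempty"`
}

// SensorResponse mirrors the documented 200 response.
type SensorResponse struct {
	OK         bool   `json:"ok"`
	Written    int    `json:"written"`
	Breach     bool   `json:"breach"`
	BreachKind string `json:"breach_kind"`
}

// EncodeBatch wraps samples in the {"samples": [...]} envelope.
func EncodeBatch(samples []Sample) ([]byte, error) {
	return json.Marshal(struct {
		Samples []Sample `json:"samples"`
	}{Samples: samples})
}

// CheckBreach returns a non-nil error when the server reports a critical breach.
func CheckBreach(body []byte) error {
	var r SensorResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return fmt.Errorf("decode sensor response: %w", err)
	}
	if r.Breach {
		return fmt.Errorf("critical breach: %s", r.BreachKind)
	}
	return nil
}

func main() {
	out, err := EncodeBatch([]Sample{{Kind: "temp", Key: "cpu/0", Value: 72.5, Unit: "C"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	fmt.Println(CheckBreach([]byte(`{"ok":true,"written":1,"breach":false,"breach_kind":""}`)))
}
```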
POST /api/v1/runs/{id}/result
Stage outcome. Drives the state machine forward (pass) or into
FailedHolding (fail).
Request body:
{
"stage": "SMART",
"passed": true,
"summary": { "disks_checked": 2, "reallocated": 0 },
"message": "",
"inventory": null,
"firmware": [],
"sub_steps": []
}
| Field | Type | Required | Description |
|---|---|---|---|
| stage | string | yes | Stage name (must match DefaultStageOrder). |
| passed | bool | yes | true = advance; false = fail. |
| summary | object | no | Arbitrary JSON persisted in stages.summary_json. |
| message | string | no | Human-readable detail (shown in notifications on failure). |
| inventory | object | no | Only set for stage=Inventory. Full spec.Inventory JSON. |
| firmware | array | no | Only set for stage=Firmware. Array of firmware snapshots. |
| sub_steps | array | no | Per-disk/per-NIC/per-GPU granular results. |
firmware[] shape:
| Field | Type | Description |
|---|---|---|
| component | string | bios, bmc, nic, hba, microcode, nvme_fw. |
| identifier | string | Slot, serial, or device path that distinguishes this component. |
| version | string | Firmware version string. |
| vendor | string | Vendor name (optional). |
| raw | map | Additional key-value metadata (optional). |
sub_steps[] shape:
| Field | Type | Description |
|---|---|---|
| name | string | Human-readable label (e.g. sda SMART, eth0 iperf). |
| passed | bool | Sub-step result. |
| skipped | bool | true if the sub-step was skipped (e.g. no GPU). |
| started_at | string | RFC 3339 timestamp. |
| completed_at | string | RFC 3339 timestamp. |
| summary | object | Arbitrary JSON persisted in sub_steps.summary_json. |
Response (200, pass):
{ "ok": true, "next_state": "CPUStress" }
Response (200, fail):
{ "ok": true, "next_state": "FailedHolding" }
Response (409, stage mismatch):
Returned when the agent reports a stage that doesn't match the
orchestrator's expected state. The run is parked in FailedHolding.
{ "ok": false, "error": "stage mismatch: got SMART, expected CPUStress" }
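The three documented responses can be told apart by HTTP status first and body second. The following is a sketch of how an agent might interpret them; InterpretResult is a hypothetical name, and any status other than 200 or 409 is treated as an error because this reference does not describe one.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// ResultResponse covers both the 200 and the 409 body shapes.
type ResultResponse struct {
	OK        bool   `json:"ok"`
	NextState string `json:"next_state,omitempty"`
	Error     string `json:"error,omitempty"`
}

// InterpretResult returns the state the run is in after the call.
// On 409 the run is parked in FailedHolding and the server's error
// text is returned alongside it.
func InterpretResult(status int, body []byte) (string, error) {
	var r ResultResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", fmt.Errorf("decode result response: %w", err)
	}
	switch status {
	case http.StatusOK:
		return r.NextState, nil
	case http.StatusConflict:
		return "FailedHolding", fmt.Errorf("result rejected: %s", r.Error)
	default:
		return "", fmt.Errorf("unexpected status %d", status)
	}
}

func main() {
	state, err := InterpretResult(http.StatusOK, []byte(`{"ok":true,"next_state":"CPUStress"}`))
	fmt.Println(state, err)
}
```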
POST /api/v1/runs/{id}/hold
Request the per-run SSH key so the operator can SSH into a held host.
Request body:
{
"agent_ip": "192.168.1.42"
}
Response (200):
{
"authorized_key": "ssh-ed25519 AAAAC3... vetting-run-42",
"run_id": 42
}
The private key is written to
artifacts/run-<N>/hold.key on the orchestrator. The agent installs
the authorized_key into /root/.ssh/authorized_keys in the live
image.
Host API
LAN-trusted endpoints called by the host-mode agent. No bearer token. Same threat model as the browser UI.
POST /api/v1/hosts
JSON host registration. Called by the quick-register one-liner.
Request body:
{
"name": "node-01",
"mac": "aa:bb:cc:dd:ee:ff",
"wol_broadcast_ip": "192.168.1.255",
"wol_port": 9,
"expected_spec_yaml": "memory:\n total_gib: 64\ncpu:\n logical_cores: 16\n",
"notes": ""
}
Response (201):
{ "ok": true, "id": 5 }
POST /api/v1/hosts/{mac}/heartbeat
Host-mode agent liveness ping. Stamps hosts.last_seen_at and
triggers a dashboard tile refresh via SSE.
Request body: empty.
Response (200):
{ "ok": true }
When a run is queued for this host:
{ "ok": true, "cmd": "reboot_for_vetting", "run_id": 42 }
The agent reboots the host on receiving cmd=reboot_for_vetting.
The run_id is informational (for agent logging).
Browser UI routes
No auth. Bind to loopback or LAN only, or front with a reverse proxy.
| Method | Path | Description |
|---|---|---|
| GET | / | Dashboard: host tile grid. |
| GET | /hosts/new | New host registration form. |
| POST | /hosts | Create host (form submission). |
| GET | /hosts/{id} | Host detail page (summary, actions, run history). |
| POST | /hosts/{id}/delete | Delete host. |
| POST | /hosts/{id}/start | Start a vetting run (queue it). |
| POST | /hosts/{id}/cancel | Cancel the active run. |
| POST | /hosts/{id}/override-wipe | Override the wipe-probe guard and retry Storage. |
| GET | /runs/{runID} | Run detail page (stages, spec diffs, pipeline). |
| GET | /reports/{runID} | HTML report artifact. |
| GET | /register/quick.sh | Quick-register bash one-liner script. |
| GET | /events | SSE event stream (browser subscriptions). |
Static assets:
| Path | Description |
|---|---|
| /static/* | Embedded CSS + JS (internal/web/static/). |
| /live/* | Live image files (vmlinuz + initrd.img) served from pxe.live_dir. |
| /assets/* | Agent binary served from agent.asset_dir. |
SSE events
The browser connects to GET /events and receives server-sent events.
Each event has a name (the SSE event: field) and a data payload
containing a pre-rendered HTML fragment with hx-swap-oob attributes
that HTMX uses to swap the target DOM element.
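For tooling that wants to watch the stream without a browser, the wire format can be read with a few lines of Go. This is a simplified sketch of the server-sent events framing (event: and data: fields, blank line as terminator); it ignores id: and retry: fields, and the function names are assumptions for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// Event is one parsed server-sent event.
type Event struct {
	Name string // value of the event: field
	Data string // data: lines joined with "\n"
}

// ReadEvents consumes a text/event-stream body until EOF.
// Note: bufio.Scanner caps a line at 64 KiB by default, so very large
// HTML fragments would need a bigger buffer via Scanner.Buffer.
func ReadEvents(r io.Reader) ([]Event, error) {
	var events []Event
	var name string
	var data []string
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		switch {
		case line == "":
			// Blank line terminates an event; events without data are dropped.
			if len(data) > 0 {
				events = append(events, Event{Name: name, Data: strings.Join(data, "\n")})
			}
			name, data = "", nil
		case strings.HasPrefix(line, ":"):
			// Comment line, ignored.
		case strings.HasPrefix(line, "event:"):
			name = strings.TrimPrefix(strings.TrimPrefix(line, "event:"), " ")
		case strings.HasPrefix(line, "data:"):
			data = append(data, strings.TrimPrefix(strings.TrimPrefix(line, "data:"), " "))
		}
	}
	return events, sc.Err()
}

// ParseEventString is a convenience wrapper for in-memory input.
func ParseEventString(s string) ([]Event, error) {
	return ReadEvents(strings.NewReader(s))
}

func main() {
	evs, err := ParseEventString("event: hello\ndata: ok\n\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(evs[0].Name, evs[0].Data) // prints: hello ok
}
```

In a long-lived connection the loop would hand each event to a callback instead of accumulating a slice; the slice form is used here only to keep the example short.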
Connection events
| Event name | Payload | Description |
|---|---|---|
| hello | ok | Sent immediately on connection. |
| heartbeat | <span data-heartbeat="<unix-ts>"></span> | 15-second keep-alive. |
Dashboard events
| Event name | Payload | Description |
|---|---|---|
| tile-{hostID} | Host tile HTML fragment | Refreshed on state transitions, heartbeats, holds. |
Host detail page events
| Event name | Payload | Description |
|---|---|---|
| detail-summary-{hostID} | Summary section HTML | Host metadata + latest run status. |
| detail-actions-{hostID} | Actions row HTML | Start/Cancel/Override buttons. |
| detail-inflight-{hostID} | In-flight banner HTML | Active run progress indicator. |
| runrow-{runID} | Run history row HTML | Updated when a run completes or fails. |
Run detail page events
| Event name | Payload | Description |
|---|---|---|
| run-header-{runID} | Run metadata HTML | State, profile, timing. |
| detail-hold-{runID} | Hold banner HTML | SSH command + hold IP. |
| detail-specdiffs-{runID} | Spec diffs list HTML | Expected-vs-actual divergences. |
| pipeline-{runID} | Pipeline dot visualization HTML | Stage progress dots. |
| substep-{runID}-{stage}-{ordinal} | Sub-step row HTML | Per-disk, per-NIC, per-GPU detail. |
Log events
| Event name | Payload | Description |
|---|---|---|
| log-{runID} | Log line HTML | All log lines for a run. |
| log-{runID}-{stage} | Log line HTML | Stage-filtered log lines. |
Authentication
Agent bearer token lifecycle
- Issuance: when a registered host's iPXE script is fetched (GET /ipxe/{mac}), the orchestrator generates a random token, hashes it with SHA-256, and stores the hash in runs.agent_token_hash. The plaintext token is embedded in the iPXE kernel cmdline as token=<plaintext>.
- Rotation: each iPXE fetch rotates the token. Only the most recent PXE boot can claim the run.
- Verification: every /api/v1/runs/{id}/* endpoint extracts the Bearer header, SHA-256 hashes it, and compares against the stored hash using crypto/subtle.ConstantTimeCompare.
- Scope: the token authenticates a single run. It cannot be used to access other runs or host-level endpoints.
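The lifecycle above can be sketched in a few functions. This is an illustration under stated assumptions, not the orchestrator's code: the 32-byte token length, hex encoding of both token and hash, and the function names are all chosen for the example. Only SHA-256 hashing and crypto/subtle.ConstantTimeCompare come from the description.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// HashToken returns the hex-encoded SHA-256 digest of a plaintext token.
func HashToken(plaintext string) string {
	sum := sha256.Sum256([]byte(plaintext))
	return hex.EncodeToString(sum[:])
}

// IssueToken generates a random token and the hash that would be stored
// in runs.agent_token_hash. Calling it again models rotation: the new
// hash replaces the old one, so earlier tokens stop verifying.
func IssueToken() (plaintext, hash string, err error) {
	buf := make([]byte, 32) // assumption: 32 random bytes
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	plaintext = hex.EncodeToString(buf)
	return plaintext, HashToken(plaintext), nil
}

// VerifyBearer checks an Authorization header value against the stored hash
// using a constant-time comparison.
func VerifyBearer(header, storedHash string) bool {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return false
	}
	got := HashToken(strings.TrimPrefix(header, prefix))
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	tok, hash, err := IssueToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(VerifyBearer("Bearer "+tok, hash))

	// A second iPXE fetch rotates the token; the old one no longer matches.
	_, rotated, err := IssueToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(VerifyBearer("Bearer "+tok, rotated))
}
```

Storing only the hash means a leaked database row does not reveal a usable token, and an unsalted fast hash is reasonable here because the token is high-entropy random data, not a human-chosen password.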
LAN-trust model
Host-mode endpoints (POST /api/v1/hosts, POST /api/v1/hosts/{mac}/heartbeat)
and the browser UI have no authentication. They share a LAN-trust
assumption: anything that can reach the orchestrator's bind address is
trusted. To add a password, front the orchestrator with a reverse
proxy (Caddy, nginx, Traefik) that adds basic-auth or OIDC. See
operations.md § Exposing outside the LAN.