# API reference

Complete HTTP API for the vetting orchestrator. Routes are assembled in `internal/httpserver/router.go`; handler logic lives in `internal/api/agent_handlers.go` (agent-facing) and `internal/api/ui_handlers.go` (browser + host-mode).

---

## Agent API

These endpoints are called by the in-image vetting agent during a run. Every request except `GET /ipxe/{mac}` must carry an `Authorization: Bearer ` header. The token is issued per-run in the iPXE kernel cmdline and verified against a SHA-256 hash stored in `runs.agent_token_hash` (see [Authentication](#authentication)).

### `GET /ipxe/{mac}`

iPXE chainload script. Called by iPXE itself after dnsmasq hands it the chainload URL. No auth required: the MAC path parameter is the key.

**Responses:**

| Scenario | Script |
|----------|--------|
| Known MAC with an active run | Boot script: kernel + initrd + cmdline (run_id, mac, token, orchestrator_url, tls_fpr). Triggers `PXEObserved` transition. |
| Known MAC, no active run | Poweroff script. |
| Unknown MAC | Halt/error script. |

---

### `POST /api/v1/runs/{id}/hello`

First call the agent makes once userspace is up. Idempotent. Writes a log line; the authoritative transition comes from `/claim`.

**Request body:**

```json
{}
```

**Response (200):**

```json
{ "ok": true, "run_id": 42 }
```

---

### `POST /api/v1/runs/{id}/claim`

Binding call: the agent proves it holds the plaintext token for this run. In return the orchestrator seeds stage rows, transitions to `InventoryCheck`, and returns the stage list + per-profile config. Subsequent claims are idempotent (safe after transient network failures).

**Request body:**

```json
{
  "agent_ip": "192.168.1.42"  // optional; falls back to RemoteAddr
}
```

**Response (200):**

```json
{
  "ok": true,
  "run_id": 42,
  "stages": ["Inventory", "Firmware", "SpecValidate", "SMART", "CPUStress", "Storage", "Network", "Burn", "GPU", "PSU", "Reporting"],
  "expected_disks": [
    { "serial": "WD-ABC123", "size_gb": 500 }
  ],
  "iperf_port": 5201,
  "non_destructive": false,
  "current_state": "InventoryCheck",
  "stage_config": {
    "profile": "quick",
    "stage_timeouts": { "CPUStress": "5m0s", "Storage": "5m0s" },
    "cpustress": { "cpu_pass": "2m", "mem_pass": "2m", "edac_poll": "10s" },
    "storage": { "mode": "fio_sample", "fio_size": "1GiB", "fio_time": "3m", "fio_bs": "4k", "fio_rw": "randrw", "verify": "md5" },
    "network": { "duration": "60s" },
    "burn": { "duration": "2m", "cpu_workers": "all", "mem_pct": 50, "fio_on_spare": true, "iperf_parallel": 2 }
  }
}
```

**`stage_config` shape:**

| Field | Type | Description |
|-------|------|-------------|
| `profile` | string | `quick`, `deep`, or `soak`. |
| `stage_timeouts` | map[string]string | Per-stage timeout durations (Go duration strings). |
| `cpustress.cpu_pass` | string | stress-ng CPU pass duration. |
| `cpustress.mem_pass` | string | stress-ng memory pass duration. |
| `cpustress.edac_poll` | string | EDAC error counter polling interval. |
| `storage.mode` | string | `fio_sample` (skip badblocks) or `full_disk`. |
| `storage.fio_size` | string | fio test file size (fio_sample mode only). |
| `storage.fio_time` | string | fio runtime. |
| `storage.fio_bs` | string | fio block size. |
| `storage.fio_rw` | string | fio I/O pattern. |
| `storage.verify` | string | fio integrity mode (`md5` or empty). |
| `network.duration` | string | iperf3 test duration. |
| `burn.duration` | string | Total burn-in window. |
| `burn.cpu_workers` | string | `all` or a numeric string. |
| `burn.mem_pct` | int | Percentage of MemAvailable to stress. |
| `burn.fio_on_spare` | bool | Run fio inside Burn. |
| `burn.iperf_parallel` | int | iperf3 parallel stream count. |

---

### `POST /api/v1/runs/{id}/heartbeat`

Periodic liveness ping. The response body acts as a control channel.

**Request body:**

```json
{}
```

**Response (200):**

```json
{ "state": "CPUStress", "cmd": "continue" }
```

**`cmd` values:**

| cmd | When | Agent action |
|-----|------|--------------|
| `continue` | Normal case (including FailedHolding) | No-op; keep running current stage or wait for override. |
| `reboot` | Run reached `Completed` | `systemctl reboot` (falls through iPXE to local disk). |
| `abort` | Run in `Released` | Stop heartbeat loop. |
| `retry_stage` | Operator pressed "Override wipe & retry" | Re-enter the named stage with override flags. Response includes `stage` and `override_flags`. |
| `cancel_stage` | Operator clicked Cancel mid-stage | Kill running stage subprocess, then power off. |

---

### `POST /api/v1/runs/{id}/log`

Batch of log lines from the agent. Written to per-run flat file and fanned out to SSE subscribers.

**Request body:**

```json
{
  "lines": [
    { "ts": "2026-04-21T15:32:18.123Z", "level": "info", "stage": "SMART", "text": "smartctl -a /dev/sda: PASSED" }
  ]
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `ts` | string | no | RFC 3339 timestamp. Server clock used if empty. |
| `level` | string | no | `info`, `warn`, `error`, `debug`. |
| `stage` | string | no | Stage tag for per-stage SSE fan-out. |
| `text` | string | yes | Log message. |

**Response (200):**

```json
{ "ok": true, "written": 1 }
```

---

### `POST /api/v1/runs/{id}/sensor`

Batch of numeric samples (thermals, fan RPM, PSU rails, iperf throughput, fio IOPS). Each sample is evaluated against the run's seeded thresholds; critical breaches fail the run immediately.

**Request body:**

```json
{
  "samples": [
    { "ts": "2026-04-21T15:32:18Z", "kind": "temp", "key": "cpu/0", "value": 72.5, "unit": "C" }
  ]
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `ts` | string | no | RFC 3339 timestamp. Defaults to server-now. |
| `kind` | string | yes | `temp`, `fan`, `psu_volt`, `iperf`, `fio`, `fio_p99_us`, `smart_attr`, `nic_retrans`, `edac_ue`, `edac_ce`, `mce`. |
| `key` | string | yes | Identifies the source (e.g. `cpu/0`, `+12V`, `throughput_mbps`). |
| `value` | float | yes | Numeric sample value. |
| `unit` | string | no | Display unit (e.g. `C`, `V`, `Mbps`). |

**Response (200):**

```json
{ "ok": true, "written": 1, "breach": false, "breach_kind": "" }
```

When a critical breach is detected, `breach` is `true` and `breach_kind` contains a human-readable label like `"temp cpu/0=92.5 breached lt 92"`. The run transitions to `FailedHolding`.

---

### `POST /api/v1/runs/{id}/result`

Stage outcome. Drives the state machine forward (pass) or into `FailedHolding` (fail).

**Request body:**

```json
{
  "stage": "SMART",
  "passed": true,
  "summary": { "disks_checked": 2, "reallocated": 0 },
  "message": "",
  "inventory": null,
  "firmware": [],
  "sub_steps": []
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `stage` | string | yes | Stage name (must match `DefaultStageOrder`). |
| `passed` | bool | yes | `true` = advance; `false` = fail. |
| `summary` | object | no | Arbitrary JSON persisted in `stages.summary_json`. |
| `message` | string | no | Human-readable detail (shown in notifications on failure). |
| `inventory` | object | no | Only set for `stage=Inventory`. Full `spec.Inventory` JSON. |
| `firmware` | array | no | Only set for `stage=Firmware`. Array of firmware snapshots. |
| `sub_steps` | array | no | Per-disk/per-NIC/per-GPU granular results. |

**`firmware[]` shape:**

| Field | Type | Description |
|-------|------|-------------|
| `component` | string | `bios`, `bmc`, `nic`, `hba`, `microcode`, `nvme_fw`. |
| `identifier` | string | Slot, serial, or device path that distinguishes this component. |
| `version` | string | Firmware version string. |
| `vendor` | string | Vendor name (optional). |
| `raw` | map | Additional key-value metadata (optional). |

**`sub_steps[]` shape:**

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Human-readable label (e.g. `sda SMART`, `eth0 iperf`). |
| `passed` | bool | Sub-step result. |
| `skipped` | bool | `true` if the sub-step was skipped (e.g. no GPU). |
| `started_at` | string | RFC 3339 timestamp. |
| `completed_at` | string | RFC 3339 timestamp. |
| `summary` | object | Arbitrary JSON persisted in `sub_steps.summary_json`. |

**Response (200, pass):**

```json
{ "ok": true, "next_state": "CPUStress" }
```

**Response (200, fail):**

```json
{ "ok": true, "next_state": "FailedHolding" }
```

**Response (409, stage mismatch):**

Returned when the agent reports a stage that doesn't match the orchestrator's expected state. The run is parked in `FailedHolding`.

```json
{ "ok": false, "error": "stage mismatch: got SMART, expected CPUStress" }
```

---

### `POST /api/v1/runs/{id}/hold`

Request the per-run SSH key so the operator can SSH into a held host.

**Request body:**

```json
{ "agent_ip": "192.168.1.42" }
```

**Response (200):**

```json
{ "authorized_key": "ssh-ed25519 AAAAC3... vetting-run-42", "run_id": 42 }
```

The private key is written to `artifacts/run-/hold.key` on the orchestrator. The agent installs the `authorized_key` into `/root/.ssh/authorized_keys` in the live image.

---

## Host API

LAN-trusted endpoints called by the host-mode agent. No bearer token. Same threat model as the browser UI.

### `POST /api/v1/hosts`

JSON host registration. Called by the quick-register one-liner.

**Request body:**

```json
{
  "name": "node-01",
  "mac": "aa:bb:cc:dd:ee:ff",
  "wol_broadcast_ip": "192.168.1.255",
  "wol_port": 9,
  "expected_spec_yaml": "memory:\n total_gib: 64\ncpu:\n logical_cores: 16\n",
  "notes": ""
}
```

**Response (201):**

```json
{ "ok": true, "id": 5 }
```

### `POST /api/v1/hosts/{mac}/heartbeat`

Host-mode agent liveness ping. Stamps `hosts.last_seen_at` and triggers a dashboard tile refresh via SSE.

**Request body:** empty.

**Response (200):**

```json
{ "ok": true }
```

When a run is queued for this host:

```json
{ "ok": true, "cmd": "reboot_for_vetting", "run_id": 42 }
```

The agent reboots the host on receiving `cmd=reboot_for_vetting`. The `run_id` is informational (for agent logging).

---

## Browser UI routes

No auth. Bind to loopback or LAN only, or front with a reverse proxy.

| Method | Path | Description |
|--------|------|-------------|
| GET | `/` | Dashboard: host tile grid. |
| GET | `/hosts/new` | New host registration form. |
| POST | `/hosts` | Create host (form submission). |
| GET | `/hosts/{id}` | Host detail page (summary, actions, run history). |
| POST | `/hosts/{id}/delete` | Delete host. |
| POST | `/hosts/{id}/start` | Start a vetting run (queue it). |
| POST | `/hosts/{id}/cancel` | Cancel the active run. |
| POST | `/hosts/{id}/override-wipe` | Override the wipe-probe guard and retry Storage. |
| GET | `/runs/{runID}` | Run detail page (stages, spec diffs, pipeline). |
| GET | `/reports/{runID}` | HTML report artifact. |
| GET | `/register/quick.sh` | Quick-register bash one-liner script. |
| GET | `/events` | SSE event stream (browser subscriptions). |

**Static assets:**

| Path | Description |
|------|-------------|
| `/static/*` | Embedded CSS + JS (`internal/web/static/`). |
| `/live/*` | Live image files (vmlinuz + initrd.img) served from `pxe.live_dir`. |
| `/assets/*` | Agent binary served from `agent.asset_dir`. |

---

## SSE events

The browser connects to `GET /events` and receives server-sent events.
Each event has a `name` (the SSE `event:` field) and a `data` payload containing a pre-rendered HTML fragment with `hx-swap-oob` attributes that HTMX uses to swap the target DOM element.

### Connection events

| Event name | Payload | Description |
|------------|---------|-------------|
| `hello` | `ok` | Sent immediately on connection. |
| `heartbeat` | `` | 15-second keep-alive. |

### Dashboard events

| Event name | Payload | Description |
|------------|---------|-------------|
| `tile-{hostID}` | Host tile HTML fragment | Refreshed on state transitions, heartbeats, holds. |

### Host detail page events

| Event name | Payload | Description |
|------------|---------|-------------|
| `detail-summary-{hostID}` | Summary section HTML | Host metadata + latest run status. |
| `detail-actions-{hostID}` | Actions row HTML | Start/Cancel/Override buttons. |
| `detail-inflight-{hostID}` | In-flight banner HTML | Active run progress indicator. |
| `runrow-{runID}` | Run history row HTML | Updated when a run completes or fails. |

### Run detail page events

| Event name | Payload | Description |
|------------|---------|-------------|
| `run-header-{runID}` | Run metadata HTML | State, profile, timing. |
| `detail-hold-{runID}` | Hold banner HTML | SSH command + hold IP. |
| `detail-specdiffs-{runID}` | Spec diffs list HTML | Expected-vs-actual divergences. |
| `pipeline-{runID}` | Pipeline dot visualization HTML | Stage progress dots. |
| `substep-{runID}-{stage}-{ordinal}` | Sub-step row HTML | Per-disk, per-NIC, per-GPU detail. |

### Log events

| Event name | Payload | Description |
|------------|---------|-------------|
| `log-{runID}` | Log line HTML | All log lines for a run. |
| `log-{runID}-{stage}` | Log line HTML | Stage-filtered log lines. |

---

## Authentication

### Agent bearer token lifecycle

1. **Issuance**: when a registered host's iPXE script is fetched (`GET /ipxe/{mac}`), the orchestrator generates a random token, hashes it with SHA-256, and stores the hash in `runs.agent_token_hash`. The plaintext token is embedded in the iPXE kernel cmdline as `token=`.
2. **Rotation**: each iPXE fetch rotates the token. Only the most recent PXE boot can claim the run.
3. **Verification**: every `/api/v1/runs/{id}/*` endpoint extracts the `Bearer` header, SHA-256 hashes it, and compares against the stored hash using `crypto/subtle.ConstantTimeCompare`.
4. **Scope**: the token authenticates a single run. It cannot be used to access other runs or host-level endpoints.

### LAN-trust model

Host-mode endpoints (`POST /api/v1/hosts`, `POST /api/v1/hosts/{mac}/heartbeat`) and the browser UI have no authentication. They share a LAN-trust assumption: anything that can reach the orchestrator's bind address is trusted.

To add a password, front the orchestrator with a reverse proxy (Caddy, nginx, Traefik) that adds basic-auth or OIDC. See [operations.md § Exposing outside the LAN](operations.md#exposing-outside-the-lan).
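
### Sketch: issuing and verifying an agent token

The issuance and verification steps of the agent bearer token lifecycle can be sketched roughly as follows. This is a minimal illustration, not the orchestrator's actual code: the function names `issueToken` and `verifyBearer`, the 32-byte token length, and the hex encoding of the stored hash are all assumptions made for the example. Only the use of SHA-256 and `crypto/subtle.ConstantTimeCompare` comes from the description above.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// issueToken (hypothetical helper) returns a random plaintext token and the
// hex-encoded SHA-256 digest that would be stored in runs.agent_token_hash.
// Token length and hex encoding are assumptions of this sketch.
func issueToken() (plaintext, storedHash string, err error) {
	buf := make([]byte, 32)
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	plaintext = hex.EncodeToString(buf)
	sum := sha256.Sum256([]byte(plaintext))
	return plaintext, hex.EncodeToString(sum[:]), nil
}

// verifyBearer (hypothetical helper) hashes the token carried in an
// Authorization header value and compares it to the stored hash in
// constant time, so timing does not reveal how many bytes matched.
func verifyBearer(authHeader, storedHash string) bool {
	const prefix = "Bearer "
	if !strings.HasPrefix(authHeader, prefix) {
		return false
	}
	token := strings.TrimPrefix(authHeader, prefix)
	sum := sha256.Sum256([]byte(token))
	got := hex.EncodeToString(sum[:])
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	token, hash, err := issueToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(verifyBearer("Bearer "+token, hash)) // true: matching token
	fmt.Println(verifyBearer("Bearer wrong", hash))  // false: stale or forged token
	fmt.Println(verifyBearer(token, hash))           // false: missing Bearer prefix
}
```

Under these assumptions, rotation amounts to calling `issueToken` again on each iPXE fetch and overwriting the stored hash, which is why a token from an earlier PXE boot stops verifying.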