Vetting/docs/api-reference.md
josh 8367ec2a9f
docs: comprehensive documentation expansion
Add 4 new doc files (configuration reference, development guide, API
reference with full request/response schemas, database schema), expand
the README with a feature list and how-it-works walkthrough, fix
missing Firmware and Burn stages in architecture.md and test-suite.md,
add threshold engine and host-mode agent sections, and add godoc
comments to 11 packages and 6 model types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-23 18:37:26 -04:00

API reference

Complete HTTP API for the vetting orchestrator. Routes are assembled in internal/httpserver/router.go; handler logic lives in internal/api/agent_handlers.go (agent-facing) and internal/api/ui_handlers.go (browser + host-mode).


Agent API

These endpoints are called by the in-image vetting agent during a run. Every request except GET /ipxe/{mac} must carry an Authorization: Bearer <token> header. The token is issued per run in the iPXE kernel cmdline and verified against a SHA-256 hash stored in runs.agent_token_hash (see Authentication).
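As a rough illustration of the header requirement, the sketch below builds an authenticated POST for any of the per-run endpoints. The helper name newAgentRequest, the example base URL, and the Content-Type header are assumptions made for this example; they are not taken from the agent source.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newAgentRequest builds a POST to /api/v1/runs/{id}/{endpoint} that carries
// the per-run bearer token. The function name is hypothetical; baseURL and
// token would come from the kernel cmdline (orchestrator_url, token).
func newAgentRequest(baseURL string, runID int64, endpoint, token string, body any) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, fmt.Errorf("encode body: %w", err)
	}
	url := fmt.Sprintf("%s/api/v1/runs/%d/%s", baseURL, runID, endpoint)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Authorization", "Bearer "+token)
	// Assumption: the orchestrator expects JSON bodies to be labelled as such.
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// struct{}{} marshals to "{}", matching the empty hello/heartbeat bodies.
	req, err := newAgentRequest("https://orchestrator.example:8443", 42, "hello", "example-token", struct{}{})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path, req.Header.Get("Authorization"))
}
```

The request would then be sent with an http.Client configured to trust the orchestrator's certificate, which is outside the scope of this sketch.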

GET /ipxe/{mac}

iPXE chainload script. Called by iPXE itself after dnsmasq hands it the chainload URL. No auth required — the MAC path parameter is the key.

Responses:

| Scenario | Script |
| --- | --- |
| Known MAC with an active run | Boot script: kernel + initrd + cmdline (run_id, mac, token, orchestrator_url, tls_fpr). Triggers PXEObserved transition. |
| Known MAC, no active run | Poweroff script. |
| Unknown MAC | Halt/error script. |
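On the agent side, the values passed in the boot script's cmdline have to be read back once userspace is up. The sketch below parses a kernel command line (on Linux, the contents of /proc/cmdline) into the five keys listed above. The bootParams type, the parseCmdline name, and the choice of which keys are mandatory are assumptions for illustration, and the parser assumes unquoted key=value pairs without embedded spaces.

```go
package main

import (
	"fmt"
	"strings"
)

// bootParams holds the values the boot script passes on the kernel cmdline.
// The struct is hypothetical; only the key names come from the table above.
type bootParams struct {
	RunID           string
	MAC             string
	Token           string
	OrchestratorURL string
	TLSFingerprint  string
}

// parseCmdline extracts the known key=value pairs from a kernel command line.
// Bare flags (for example "quiet") and unrelated keys are ignored.
func parseCmdline(cmdline string) (bootParams, error) {
	var p bootParams
	for _, field := range strings.Fields(cmdline) {
		key, value, ok := strings.Cut(field, "=")
		if !ok {
			continue // bare flag without a value
		}
		switch key {
		case "run_id":
			p.RunID = value
		case "mac":
			p.MAC = value
		case "token":
			p.Token = value
		case "orchestrator_url":
			p.OrchestratorURL = value
		case "tls_fpr":
			p.TLSFingerprint = value
		}
	}
	// Assumption: these three are the minimum needed to call back at all.
	if p.RunID == "" || p.Token == "" || p.OrchestratorURL == "" {
		return p, fmt.Errorf("cmdline is missing run_id, token, or orchestrator_url")
	}
	return p, nil
}

func main() {
	p, err := parseCmdline("quiet run_id=42 mac=aa:bb:cc:dd:ee:ff token=example " +
		"orchestrator_url=https://10.0.0.1:8443 tls_fpr=abc123")
	if err != nil {
		panic(err)
	}
	fmt.Printf("run %s via %s\n", p.RunID, p.OrchestratorURL)
}
```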

POST /api/v1/runs/{id}/hello

First call the agent makes once userspace is up. Idempotent. Writes a log line; the authoritative transition comes from /claim.

Request body:

{}

Response (200):

{ "ok": true, "run_id": 42 }

POST /api/v1/runs/{id}/claim

Binding call: the agent proves it holds the plaintext token for this run. In return the orchestrator seeds stage rows, transitions to InventoryCheck, and returns the stage list + per-profile config. Subsequent claims are idempotent (safe after transient network failures).

Request body:

{
  "agent_ip": "192.168.1.42"   // optional; falls back to RemoteAddr
}

Response (200):

{
  "ok": true,
  "run_id": 42,
  "stages": ["Inventory", "Firmware", "SpecValidate", "SMART", "CPUStress",
             "Storage", "Network", "Burn", "GPU", "PSU", "Reporting"],
  "expected_disks": [
    { "serial": "WD-ABC123", "size_gb": 500 }
  ],
  "iperf_port": 5201,
  "non_destructive": false,
  "current_state": "InventoryCheck",
  "stage_config": {
    "profile": "quick",
    "stage_timeouts": { "CPUStress": "5m0s", "Storage": "5m0s" },
    "cpustress": { "cpu_pass": "2m", "mem_pass": "2m", "edac_poll": "10s" },
    "storage": { "mode": "fio_sample", "fio_size": "1GiB", "fio_time": "3m",
                 "fio_bs": "4k", "fio_rw": "randrw", "verify": "md5" },
    "network": { "duration": "60s" },
    "burn": { "duration": "2m", "cpu_workers": "all", "mem_pct": 50,
              "fio_on_spare": true, "iperf_parallel": 2 }
  }
}

stage_config shape:

| Field | Type | Description |
| --- | --- | --- |
| profile | string | quick, deep, or soak. |
| stage_timeouts | map[string]string | Per-stage timeout durations (Go duration strings). |
| cpustress.cpu_pass | string | stress-ng CPU pass duration. |
| cpustress.mem_pass | string | stress-ng memory pass duration. |
| cpustress.edac_poll | string | EDAC error counter polling interval. |
| storage.mode | string | fio_sample (skip badblocks) or full_disk. |
| storage.fio_size | string | fio test file size (fio_sample mode only). |
| storage.fio_time | string | fio runtime. |
| storage.fio_bs | string | fio block size. |
| storage.fio_rw | string | fio I/O pattern. |
| storage.verify | string | fio integrity mode (md5 or empty). |
| network.duration | string | iperf3 test duration. |
| burn.duration | string | Total burn-in window. |
| burn.cpu_workers | string | all or a numeric string. |
| burn.mem_pct | int | Percentage of MemAvailable to stress. |
| burn.fio_on_spare | bool | Run fio inside Burn. |
| burn.iperf_parallel | int | iperf3 parallel stream count. |
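Because the durations arrive as Go duration strings, a client can decode the claim response and convert them with time.ParseDuration. The sketch below models only a subset of the response; the claimResponse type, the parseClaim and stageTimeout helpers, and the fallback-when-missing behaviour are assumptions of this example, not the agent's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// claimResponse models a subset of the /claim response shown above.
// The Go type is hypothetical; the JSON field names come from the example.
type claimResponse struct {
	OK           bool     `json:"ok"`
	RunID        int64    `json:"run_id"`
	Stages       []string `json:"stages"`
	CurrentState string   `json:"current_state"`
	StageConfig  struct {
		Profile       string            `json:"profile"`
		StageTimeouts map[string]string `json:"stage_timeouts"`
		Network       struct {
			Duration string `json:"duration"`
		} `json:"network"`
	} `json:"stage_config"`
}

// parseClaim decodes a /claim response body.
func parseClaim(body []byte) (claimResponse, error) {
	var c claimResponse
	if err := json.Unmarshal(body, &c); err != nil {
		return c, fmt.Errorf("decode claim: %w", err)
	}
	return c, nil
}

// stageTimeout returns the timeout for stage, or fallback when the profile
// lists none. Falling back is an assumption made for this sketch.
func stageTimeout(c claimResponse, stage string, fallback time.Duration) (time.Duration, error) {
	raw, ok := c.StageConfig.StageTimeouts[stage]
	if !ok {
		return fallback, nil
	}
	return time.ParseDuration(raw)
}

func main() {
	body := []byte(`{"ok":true,"run_id":42,"stages":["Inventory","CPUStress"],
		"current_state":"InventoryCheck",
		"stage_config":{"profile":"quick",
			"stage_timeouts":{"CPUStress":"5m0s"},
			"network":{"duration":"60s"}}}`)
	c, err := parseClaim(body)
	if err != nil {
		panic(err)
	}
	d, err := stageTimeout(c, "CPUStress", time.Hour)
	if err != nil {
		panic(err)
	}
	fmt.Println(c.StageConfig.Profile, d)
}
```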

POST /api/v1/runs/{id}/heartbeat

Periodic liveness ping. The response body acts as a control channel.

Request body:

{}

Response (200):

{
  "state": "CPUStress",
  "cmd": "continue"
}

cmd values:

| cmd | When | Agent action |
| --- | --- | --- |
| continue | Normal case (including FailedHolding) | No-op; keep running current stage or wait for override. |
| reboot | Run reached Completed | systemctl reboot (falls through iPXE to local disk). |
| abort | Run in Released | Stop heartbeat loop. |
| retry_stage | Operator pressed "Override wipe & retry" | Re-enter the named stage with override flags. Response includes stage and override_flags. |
| cancel_stage | Operator clicked Cancel mid-stage | Kill running stage subprocess, then power off. |
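One way an agent might act on the control channel is a switch over cmd, as sketched below. The heartbeatResponse type and describeAction function are hypothetical. The shape of override_flags is not specified in this reference, so it is kept as raw JSON. Treating an unknown cmd as an error is a design choice of this sketch, not documented behaviour.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// heartbeatResponse mirrors the heartbeat response body. stage and
// override_flags are only present for retry_stage.
type heartbeatResponse struct {
	State         string          `json:"state"`
	Cmd           string          `json:"cmd"`
	Stage         string          `json:"stage,omitempty"`
	OverrideFlags json.RawMessage `json:"override_flags,omitempty"`
}

// describeAction decodes a heartbeat response and returns a description of
// the action from the cmd table. A real agent would execute the action
// instead of describing it.
func describeAction(body []byte) (string, error) {
	var hb heartbeatResponse
	if err := json.Unmarshal(body, &hb); err != nil {
		return "", fmt.Errorf("decode heartbeat: %w", err)
	}
	switch hb.Cmd {
	case "continue":
		return "keep running", nil
	case "reboot":
		return "reboot the machine", nil
	case "abort":
		return "stop the heartbeat loop", nil
	case "retry_stage":
		if hb.Stage == "" {
			return "", fmt.Errorf("retry_stage without a stage")
		}
		return "re-enter stage " + hb.Stage, nil
	case "cancel_stage":
		return "kill the stage subprocess, then power off", nil
	default:
		return "", fmt.Errorf("unknown cmd %q", hb.Cmd)
	}
}

func main() {
	action, err := describeAction([]byte(`{"state":"Storage","cmd":"retry_stage","stage":"Storage"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(action)
}
```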

POST /api/v1/runs/{id}/log

Batch of log lines from the agent. Each batch is written to a per-run flat file and fanned out to SSE subscribers.

Request body:

{
  "lines": [
    {
      "ts": "2026-04-21T15:32:18.123Z",
      "level": "info",
      "stage": "SMART",
      "text": "smartctl -a /dev/sda: PASSED"
    }
  ]
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| ts | string | no | RFC 3339 timestamp. Server clock used if empty. |
| level | string | no | info, warn, error, debug. |
| stage | string | no | Stage tag for per-stage SSE fan-out. |
| text | string | yes | Log message. |

Response (200):

{ "ok": true, "written": 1 }
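A minimal sketch of building the request body in Go follows. Optional fields are tagged omitempty so the server-side defaults described above apply, and the required text field is checked before encoding. The logLine and logBatch types and the encodeLogBatch helper are hypothetical names for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// logLine mirrors one entry of the "lines" array. Only text is required.
type logLine struct {
	TS    string `json:"ts,omitempty"`
	Level string `json:"level,omitempty"`
	Stage string `json:"stage,omitempty"`
	Text  string `json:"text"`
}

// logBatch is the request body for POST /api/v1/runs/{id}/log.
type logBatch struct {
	Lines []logLine `json:"lines"`
}

// encodeLogBatch validates the required field and returns the JSON body.
func encodeLogBatch(lines []logLine) ([]byte, error) {
	for i, l := range lines {
		if l.Text == "" {
			return nil, fmt.Errorf("line %d: text is required", i)
		}
	}
	return json.Marshal(logBatch{Lines: lines})
}

func main() {
	body, err := encodeLogBatch([]logLine{{
		TS:    time.Now().UTC().Format(time.RFC3339Nano),
		Level: "info",
		Stage: "SMART",
		Text:  "smartctl -a /dev/sda: PASSED",
	}})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```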

POST /api/v1/runs/{id}/sensor

Batch of numeric samples (thermals, fan RPM, PSU rails, iperf throughput, fio IOPS). Each sample is evaluated against the run's seeded thresholds — critical breaches fail the run immediately.

Request body:

{
  "samples": [
    {
      "ts": "2026-04-21T15:32:18Z",
      "kind": "temp",
      "key": "cpu/0",
      "value": 72.5,
      "unit": "C"
    }
  ]
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| ts | string | no | RFC 3339 timestamp. Defaults to server-now. |
| kind | string | yes | temp, fan, psu_volt, iperf, fio, fio_p99_us, smart_attr, nic_retrans, edac_ue, edac_ce, mce. |
| key | string | yes | Identifies the source (e.g. cpu/0, +12V, throughput_mbps). |
| value | float | yes | Numeric sample value. |
| unit | string | no | Display unit (e.g. C, V, Mbps). |

Response (200):

{
  "ok": true,
  "written": 1,
  "breach": false,
  "breach_kind": ""
}

When a critical breach is detected, breach is true and breach_kind contains a human-readable label like "temp cpu/0=92.5 breached lt 92". The run transitions to FailedHolding.
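The threshold engine itself is not described in this reference, so the following is only a guess at how such an evaluation could look. It assumes that an operator of lt means "the value must stay below the limit", which is the reading that makes the example label above consistent (92.5 is not below 92). The threshold type, its fields, the gt operator, and the evaluate function are all hypothetical.

```go
package main

import "fmt"

// sample mirrors the fields of a sensor sample that matter for evaluation.
type sample struct {
	Kind  string
	Key   string
	Value float64
}

// threshold is a hypothetical rule. Op "lt" requires Value < Limit and
// Op "gt" requires Value > Limit; any other Op never breaches.
type threshold struct {
	Kind     string
	Key      string
	Op       string
	Limit    float64
	Critical bool
}

// evaluate reports whether s violates any critical threshold and, if so,
// returns a label in the same style as breach_kind.
func evaluate(s sample, thresholds []threshold) (bool, string) {
	for _, t := range thresholds {
		if !t.Critical || t.Kind != s.Kind || t.Key != s.Key {
			continue
		}
		ok := true
		switch t.Op {
		case "lt":
			ok = s.Value < t.Limit
		case "gt":
			ok = s.Value > t.Limit
		}
		if !ok {
			label := fmt.Sprintf("%s %s=%v breached %s %v", s.Kind, s.Key, s.Value, t.Op, t.Limit)
			return true, label
		}
	}
	return false, ""
}

func main() {
	rules := []threshold{{Kind: "temp", Key: "cpu/0", Op: "lt", Limit: 92, Critical: true}}
	breach, label := evaluate(sample{Kind: "temp", Key: "cpu/0", Value: 92.5}, rules)
	fmt.Println(breach, label)
}
```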


POST /api/v1/runs/{id}/result

Stage outcome. Drives the state machine forward (pass) or into FailedHolding (fail).

Request body:

{
  "stage": "SMART",
  "passed": true,
  "summary": { "disks_checked": 2, "reallocated": 0 },
  "message": "",
  "inventory": null,
  "firmware": [],
  "sub_steps": []
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| stage | string | yes | Stage name (must match DefaultStageOrder). |
| passed | bool | yes | true = advance; false = fail. |
| summary | object | no | Arbitrary JSON persisted in stages.summary_json. |
| message | string | no | Human-readable detail (shown in notifications on failure). |
| inventory | object | no | Only set for stage=Inventory. Full spec.Inventory JSON. |
| firmware | array | no | Only set for stage=Firmware. Array of firmware snapshots. |
| sub_steps | array | no | Per-disk/per-NIC/per-GPU granular results. |

firmware[] shape:

| Field | Type | Description |
| --- | --- | --- |
| component | string | bios, bmc, nic, hba, microcode, nvme_fw. |
| identifier | string | Slot, serial, or device path that distinguishes this component. |
| version | string | Firmware version string. |
| vendor | string | Vendor name (optional). |
| raw | map | Additional key-value metadata (optional). |

sub_steps[] shape:

| Field | Type | Description |
| --- | --- | --- |
| name | string | Human-readable label (e.g. sda SMART, eth0 iperf). |
| passed | bool | Sub-step result. |
| skipped | bool | true if the sub-step was skipped (e.g. no GPU). |
| started_at | string | RFC 3339 timestamp. |
| completed_at | string | RFC 3339 timestamp. |
| summary | object | Arbitrary JSON persisted in sub_steps.summary_json. |

Response (200, pass):

{ "ok": true, "next_state": "CPUStress" }

Response (200, fail):

{ "ok": true, "next_state": "FailedHolding" }

Response (409, stage mismatch):

Returned when the agent reports a stage that doesn't match the orchestrator's expected state. The run is parked in FailedHolding.

{ "ok": false, "error": "stage mismatch: got SMART, expected CPUStress" }
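The three response shapes can be folded into one decision on the agent side. The sketch below takes the HTTP status and body and returns the next state, treating 409 as an error that still reports FailedHolding, since the text above says the run is parked there. The resultResponse type and interpretResult function are hypothetical, and rejecting every other status is an assumption of this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// resultResponse covers both the success and the stage-mismatch bodies.
type resultResponse struct {
	OK        bool   `json:"ok"`
	NextState string `json:"next_state"`
	Error     string `json:"error"`
}

// interpretResult maps a /result response to the run's next state.
func interpretResult(status int, body []byte) (string, error) {
	var r resultResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", fmt.Errorf("decode result response: %w", err)
	}
	switch status {
	case http.StatusOK:
		return r.NextState, nil
	case http.StatusConflict:
		// The orchestrator has already parked the run in FailedHolding.
		return "FailedHolding", fmt.Errorf("orchestrator rejected result: %s", r.Error)
	default:
		return "", fmt.Errorf("unexpected status %d", status)
	}
}

func main() {
	next, err := interpretResult(http.StatusOK, []byte(`{"ok":true,"next_state":"CPUStress"}`))
	fmt.Println(next, err)

	next, err = interpretResult(http.StatusConflict,
		[]byte(`{"ok":false,"error":"stage mismatch: got SMART, expected CPUStress"}`))
	fmt.Println(next, err)
}
```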

POST /api/v1/runs/{id}/hold

Request the per-run SSH key so the operator can SSH into a held host.

Request body:

{
  "agent_ip": "192.168.1.42"
}

Response (200):

{
  "authorized_key": "ssh-ed25519 AAAAC3... vetting-run-42",
  "run_id": 42
}

The private key is written to artifacts/run-<N>/hold.key on the orchestrator. The agent installs the authorized_key into /root/.ssh/authorized_keys in the live image.


Host API

LAN-trusted endpoints called by the host-mode agent. No bearer token. Same threat model as the browser UI.

POST /api/v1/hosts

JSON host registration. Called by the quick-register one-liner.

Request body:

{
  "name": "node-01",
  "mac": "aa:bb:cc:dd:ee:ff",
  "wol_broadcast_ip": "192.168.1.255",
  "wol_port": 9,
  "expected_spec_yaml": "memory:\n  total_gib: 64\ncpu:\n  logical_cores: 16\n",
  "notes": ""
}

Response (201):

{ "ok": true, "id": 5 }

POST /api/v1/hosts/{mac}/heartbeat

Host-mode agent liveness ping. Stamps hosts.last_seen_at and triggers a dashboard tile refresh via SSE.

Request body: empty.

Response (200):

{ "ok": true }

When a run is queued for this host:

{ "ok": true, "cmd": "reboot_for_vetting", "run_id": 42 }

The agent reboots the host on receiving cmd=reboot_for_vetting. The run_id is informational (for agent logging).
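A host-mode agent only needs to distinguish the two response shapes. The sketch below decodes the body and reports whether a reboot was requested; the hostHeartbeat type and wantsReboot function are hypothetical names, and ignoring any cmd other than reboot_for_vetting is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hostHeartbeat mirrors the host heartbeat response. cmd and run_id are
// absent when no run is queued.
type hostHeartbeat struct {
	OK    bool   `json:"ok"`
	Cmd   string `json:"cmd"`
	RunID int64  `json:"run_id"`
}

// wantsReboot reports whether the orchestrator asked the host to reboot into
// a vetting run, and returns the run ID for logging.
func wantsReboot(body []byte) (bool, int64, error) {
	var hb hostHeartbeat
	if err := json.Unmarshal(body, &hb); err != nil {
		return false, 0, fmt.Errorf("decode host heartbeat: %w", err)
	}
	if !hb.OK {
		return false, 0, fmt.Errorf("orchestrator reported ok=false")
	}
	return hb.Cmd == "reboot_for_vetting", hb.RunID, nil
}

func main() {
	reboot, runID, err := wantsReboot([]byte(`{"ok":true,"cmd":"reboot_for_vetting","run_id":42}`))
	if err != nil {
		panic(err)
	}
	if reboot {
		fmt.Printf("rebooting for run %d\n", runID)
	}
}
```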


Browser UI routes

No auth. Bind to loopback or LAN only, or front with a reverse proxy.

| Method | Path | Description |
| --- | --- | --- |
| GET | / | Dashboard: host tile grid. |
| GET | /hosts/new | New host registration form. |
| POST | /hosts | Create host (form submission). |
| GET | /hosts/{id} | Host detail page (summary, actions, run history). |
| POST | /hosts/{id}/delete | Delete host. |
| POST | /hosts/{id}/start | Start a vetting run (queue it). |
| POST | /hosts/{id}/cancel | Cancel the active run. |
| POST | /hosts/{id}/override-wipe | Override the wipe-probe guard and retry Storage. |
| GET | /runs/{runID} | Run detail page (stages, spec diffs, pipeline). |
| GET | /reports/{runID} | HTML report artifact. |
| GET | /register/quick.sh | Quick-register bash one-liner script. |
| GET | /events | SSE event stream (browser subscriptions). |

Static assets:

| Path | Description |
| --- | --- |
| /static/* | Embedded CSS + JS (internal/web/static/). |
| /live/* | Live image files (vmlinuz + initrd.img) served from pxe.live_dir. |
| /assets/* | Agent binary served from agent.asset_dir. |

SSE events

The browser connects to GET /events and receives server-sent events. Each event has a name (the SSE event: field) and a data payload containing a pre-rendered HTML fragment with hx-swap-oob attributes that HTMX uses to swap the target DOM element.

Connection events

| Event name | Payload | Description |
| --- | --- | --- |
| hello | ok | Sent immediately on connection. |
| heartbeat | `<span data-heartbeat="<unix-ts>"></span>` | 15-second keep-alive. |
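For non-browser consumers (tests, scripts), the stream can be read with a small parser for the server-sent events wire format. The sketch below handles event: and data: fields, comment lines, and blank-line dispatch; it ignores id: and retry:. The event type and parseEvents function are hypothetical, and the HTML payload in main is a made-up placeholder. A live client would wrap the HTTP response body in the scanner instead of a string, and may need a larger scanner buffer for big HTML fragments.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// event is one dispatched server-sent event.
type event struct {
	Name string
	Data string
}

// parseEvents splits a raw SSE stream into events. An event is dispatched on
// a blank line, and only if at least one data: line was seen.
func parseEvents(stream string) []event {
	var (
		events []event
		name   string
		data   []string
	)
	sc := bufio.NewScanner(strings.NewReader(stream))
	for sc.Scan() {
		line := sc.Text()
		switch {
		case line == "":
			if len(data) > 0 {
				events = append(events, event{Name: name, Data: strings.Join(data, "\n")})
			}
			name, data = "", nil
		case strings.HasPrefix(line, ":"):
			// Comment line; servers often use these as keep-alives.
		default:
			field, value, _ := strings.Cut(line, ":")
			value = strings.TrimPrefix(value, " ") // one optional leading space
			switch field {
			case "event":
				name = value
			case "data":
				data = append(data, value)
			}
		}
	}
	return events
}

func main() {
	stream := "event: hello\ndata: ok\n\n" +
		"event: tile-5\ndata: <div>placeholder tile</div>\n\n"
	for _, ev := range parseEvents(stream) {
		fmt.Printf("%s -> %s\n", ev.Name, ev.Data)
	}
}
```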

Dashboard events

| Event name | Payload | Description |
| --- | --- | --- |
| tile-{hostID} | Host tile HTML fragment | Refreshed on state transitions, heartbeats, holds. |

Host detail page events

| Event name | Payload | Description |
| --- | --- | --- |
| detail-summary-{hostID} | Summary section HTML | Host metadata + latest run status. |
| detail-actions-{hostID} | Actions row HTML | Start/Cancel/Override buttons. |
| detail-inflight-{hostID} | In-flight banner HTML | Active run progress indicator. |
| runrow-{runID} | Run history row HTML | Updated when a run completes or fails. |

Run detail page events

| Event name | Payload | Description |
| --- | --- | --- |
| run-header-{runID} | Run metadata HTML | State, profile, timing. |
| detail-hold-{runID} | Hold banner HTML | SSH command + hold IP. |
| detail-specdiffs-{runID} | Spec diffs list HTML | Expected-vs-actual divergences. |
| pipeline-{runID} | Pipeline dot visualization HTML | Stage progress dots. |
| substep-{runID}-{stage}-{ordinal} | Sub-step row HTML | Per-disk, per-NIC, per-GPU detail. |

Log events

| Event name | Payload | Description |
| --- | --- | --- |
| log-{runID} | Log line HTML | All log lines for a run. |
| log-{runID}-{stage} | Log line HTML | Stage-filtered log lines. |

Authentication

Agent bearer token lifecycle

  1. Issuance — when a registered host's iPXE script is fetched (GET /ipxe/{mac}), the orchestrator generates a random token, hashes it with SHA-256, and stores the hash in runs.agent_token_hash. The plaintext token is embedded in the iPXE kernel cmdline as token=<plaintext>.

  2. Rotation — each iPXE fetch rotates the token. Only the most recent PXE boot can claim the run.

  3. Verification — every /api/v1/runs/{id}/* endpoint extracts the Bearer header, SHA-256 hashes it, and compares against the stored hash using crypto/subtle.ConstantTimeCompare.

  4. Scope — the token authenticates a single run. It cannot be used to access other runs or host-level endpoints.
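The lifecycle above can be sketched in a few lines of Go using only the standard library. This is an illustration under stated assumptions, not the orchestrator's code: the 32-byte token length, the hex encoding, storing the raw hash bytes, and the issueToken and verifyBearer names are all choices made for this example. Rotation falls out naturally, because each issuance replaces the stored hash.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// issueToken returns a fresh plaintext token and the SHA-256 hash that would
// be stored in runs.agent_token_hash. Token size and encoding are assumptions.
func issueToken() (string, []byte, error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", nil, fmt.Errorf("generate token: %w", err)
	}
	plaintext := hex.EncodeToString(raw)
	sum := sha256.Sum256([]byte(plaintext))
	return plaintext, sum[:], nil
}

// verifyBearer hashes the token from an Authorization header value and
// compares it with the stored hash in constant time.
func verifyBearer(header string, storedHash []byte) bool {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return false
	}
	sum := sha256.Sum256([]byte(header[len(prefix):]))
	return subtle.ConstantTimeCompare(sum[:], storedHash) == 1
}

func main() {
	// First iPXE fetch issues a token.
	oldToken, storedHash, err := issueToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("first token valid:", verifyBearer("Bearer "+oldToken, storedHash))

	// A second fetch rotates it: the stored hash is replaced.
	newToken, storedHash, err := issueToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("old token after rotation:", verifyBearer("Bearer "+oldToken, storedHash))
	fmt.Println("new token after rotation:", verifyBearer("Bearer "+newToken, storedHash))
}
```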

LAN-trust model

Host-mode endpoints (POST /api/v1/hosts, POST /api/v1/hosts/{mac}/heartbeat) and the browser UI have no authentication. They share a LAN-trust assumption: anything that can reach the orchestrator's bind address is trusted. To add a password, front the orchestrator with a reverse proxy (Caddy, nginx, Traefik) that adds basic-auth or OIDC. See operations.md § Exposing outside the LAN.