Initial commit: full Phases 1-6 implementation

Post-repair hardware validation pipeline for Proxmox cluster hosts.
Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq
PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.
2026-04-17 21:32:10 -04:00
commit 9bb4b09a04
98 changed files with 11960 additions and 0 deletions
+178
@@ -0,0 +1,178 @@
# Architecture
A single Go binary runs the orchestrator. A second Go binary runs
inside a custom Debian live image (built with mkosi) and becomes the
per-run test agent. The two talk over HTTP + SSE.
```
Operator browser (HTMX + SSE, admin login)
│ HTTPS
┌───────────────────────────────────────────────────────────────┐
│ Orchestrator LXC — single Go binary `vetting` │
│ │
│ UI (Templ) ─┬─ Agent API ─┬─ SSE hub │
│ │ │ │
│ Orchestrator core (state machine, dispatcher sem=3, │
│ stage executors, WoL sender, token issuer) │
│ │ │
│ ┌─────┴─────┬──────────┐ │
│ ▼ ▼ ▼ │
│ SQLite flat-file logs dnsmasq subprocess │
│ (DHCP+TFTP+HTTP, MAC allowlist)│
│ │
│ Janitor goroutine (retention-based cleanup) │
│ Notifier registry (ntfy/discord/smtp) │
└─────────────────────────────────────────┬─────────────────────┘
│ LAN
Host under test (×23)
PXE → iPXE → Linux live image
└─ vetting-agent (HTTP+SSE back)
```
## Packages
| Package | Purpose |
|---|---|
| `cmd/vetting` | Orchestrator entrypoint. Wires config, stores, runner, dispatcher, iperf supervisor, PXE supervisor, janitor, HTTP router. |
| `cmd/vetting-agent` | In-image agent entrypoint. Reads kernel cmdline params, starts the agent loop. |
| `internal/config` | YAML loader + types. |
| `internal/db` | SQLite open + embedded migrations. Pure Go via modernc.org/sqlite. |
| `internal/model` | Plain structs: `Host`, `Run`, `Stage`, `Measurement`, `SpecDiff`, `Artifact`. |
| `internal/store` | Repository layer; SQL is hand-written. |
| `internal/orchestrator` | State machine, dispatcher, per-run runner, WoL sender, HMAC run tokens, iperf supervisor. |
| `internal/api` | HTTP handlers: `agent_handlers.go` (the agent-facing API) and `ui_handlers.go` (HTMX fragments + SSE). |
| `internal/httpserver` | chi router assembly — lives here to avoid `api ↔ orchestrator` cyclic imports. |
| `internal/web` | Embedded static assets + compiled Templ templates. |
| `internal/auth` | Single-admin bcrypt + signed-cookie sessions. |
| `internal/pxe` | dnsmasq subprocess supervisor + per-MAC iPXE script generator. |
| `internal/events` | In-process SSE hub (fan-out to live browser clients). |
| `internal/logs` | Per-run flat-file writer + SSE fan-out of live log tail. |
| `internal/spec` | Expected-vs-actual diff engine with severity classification. |
| `internal/notify` | Pluggable notifier registry (ntfy, Discord webhook, SMTP). |
| `internal/report` | HTML + JSON report generation (html/template, self-contained). |
| `internal/hold` | Per-run SSH key issuance for `FailedHolding`. |
| `internal/janitor` | Retention-based cleanup of old artifact files + log files. |
| `agent/` | In-image agent: claim loop, stage dispatch, heartbeat, log forwarder, thermal sidecar. |
| `agent/probes` | lshw, dmidecode, smartctl, lspci, hwmon, nvidia-smi wrappers. |
| `agent/tests` | Per-stage test implementations (SMART, CPUStress, Storage, Network, GPU, PSU). |
| `live-image/` | mkosi config + postinst for the Debian live image. |
| `deploy/` | systemd unit + example config + install.sh. |
| `test/e2e/` | Build-tagged (`-tags=e2e`) QEMU + PXE full-stack test. |
## State machine
Per-run state is the single source of truth; the UI is a pure
projection of DB + event stream.
```
Registered → Queued → WaitingWoL → Booting → InventoryCheck
→ SpecValidate → SMART → CPUStress → Storage → Network
→ GPU → PSU → Reporting → Completed
any stage → Failed → FailedHolding → Released
```
Key points:
- **Transitions are table-driven** (`internal/orchestrator/statemachine.go`).
Each `(state, event) → (next, action)` is encoded once.
- **Orchestrator-owned stages resolve inside `/result`:** `SpecValidate`
and `Reporting` flip state forward as part of the preceding stage's
result handler, so the agent never sees them as "its turn".
- **Stage rows persist before SSE fan-out** — the UI can re-derive
state by reading SQLite, and an SSE reconnect mid-run just fetches
fresh tile fragments.
## Agent ↔ orchestrator protocol
```
GET /ipxe/{MAC} → per-MAC iPXE script
POST /api/v1/runs/{id}/hello → "I booted; here's my address"
POST /api/v1/runs/{id}/claim → validate token, receive stage list
POST /api/v1/runs/{id}/heartbeat → liveness ping; response carries cmd
POST /api/v1/runs/{id}/log → batch of log lines
POST /api/v1/runs/{id}/sensor → batch of measurements (thermals, throughput)
POST /api/v1/runs/{id}/result → stage result; response says next_state
POST /api/v1/runs/{id}/hold → on FailedHolding, receive authorized_key
```
Auth on every `/api/v1/*` call: the bearer token is stored as a bcrypt
hash in `runs.agent_token_hash` and compared in constant time. The
plaintext travels only in the kernel cmdline of the per-MAC iPXE
script, and the MAC must already be in the dnsmasq allowlist, so a
host that is not on the trusted bridge has no way to obtain it.
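The token check can be sketched with the standard library alone. This is an illustration under stated assumptions: the message layout (run ID plus MAC) and the function names are hypothetical, and the bcrypt-at-rest step described above is left out because it needs `golang.org/x/crypto`. What the sketch does show is HMAC issuance and a constant-time comparison.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// IssueToken derives a per-run token from a server-side secret.
// The "runID|mac" message layout is an assumption for illustration.
func IssueToken(secret []byte, runID int64, mac string) string {
	m := hmac.New(sha256.New, secret)
	fmt.Fprintf(m, "%d|%s", runID, mac)
	return hex.EncodeToString(m.Sum(nil))
}

// VerifyToken recomputes the expected token and compares it with the
// presented one in constant time, so timing does not leak a prefix match.
func VerifyToken(secret []byte, runID int64, mac, presented string) bool {
	want := IssueToken(secret, runID, mac)
	return hmac.Equal([]byte(want), []byte(presented))
}

func main() {
	secret := []byte("example-secret") // placeholder, not a real key
	tok := IssueToken(secret, 42, "52:54:00:12:34:56")

	fmt.Println("same run:     ", VerifyToken(secret, 42, "52:54:00:12:34:56", tok))
	fmt.Println("different run:", VerifyToken(secret, 43, "52:54:00:12:34:56", tok))
}
```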
### Heartbeat control channel
The heartbeat response carries a `cmd` field the agent acts on:
| cmd | When fired | Agent action |
|---|---|---|
| `continue` | Normal case | No-op; keep running current stage |
| `shutdown` | Run reached `Completed` | `systemctl poweroff` |
| `abort` | Run in `FailedHolding` or `Released` | Stop heartbeat loop; let the operator drive |
| `retry_stage` | Operator pressed "Override wipe-probe" | Re-enter the named stage with `override_flags` armed |
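One way to read the table is as a pure function from run state to response. The sketch below assumes a JSON response shape (`stage` and the contents of `override_flags` are hypothetical) and assumes a pending retry takes precedence over `abort`, since the run is in `FailedHolding` when the operator overrides.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HeartbeatResponse is an assumed wire shape; only `cmd` and the name
// `override_flags` appear in the docs above.
type HeartbeatResponse struct {
	Cmd           string   `json:"cmd"`
	Stage         string   `json:"stage,omitempty"`          // hypothetical field
	OverrideFlags []string `json:"override_flags,omitempty"` // contents assumed
}

// CmdFor maps run state (plus an optional pending retry) to a command.
// retryStage is non-empty only after the operator asked for an override.
func CmdFor(state, retryStage string) HeartbeatResponse {
	switch {
	case retryStage != "":
		return HeartbeatResponse{
			Cmd:           "retry_stage",
			Stage:         retryStage,
			OverrideFlags: []string{"wipe_probe"}, // hypothetical flag name
		}
	case state == "Completed":
		return HeartbeatResponse{Cmd: "shutdown"}
	case state == "FailedHolding" || state == "Released":
		return HeartbeatResponse{Cmd: "abort"}
	default:
		return HeartbeatResponse{Cmd: "continue"}
	}
}

func main() {
	for _, s := range []string{"CPUStress", "Completed", "FailedHolding"} {
		b, _ := json.Marshal(CmdFor(s, ""))
		fmt.Println(s, string(b))
	}
	b, _ := json.Marshal(CmdFor("FailedHolding", "Storage"))
	fmt.Println("override", string(b))
}
```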
## Safety: destructive disk tests
Four layered gates:
1. **MAC allowlist** — dnsmasq only answers DHCP for registered MACs.
2. **Signed run token** — orchestrator issues a per-run HMAC token in
the iPXE kernel cmdline; the agent submits it on `/claim` and the
orchestrator verifies before handing back the stage list.
3. **Wipe probe** — before `badblocks`, the agent scans for filesystem
signatures / LVM metadata / partition tables. Anything found →
`FailedHolding` on `Storage`. The operator explicitly clicks
**Override wipe-probe** to proceed.
4. **Device allowlist** — the agent only targets block devices matching
the inventory's `expected_disks`. USB sticks and surprise disks are
skipped.
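The fourth gate amounts to an intersection between what the agent finds and what the inventory expects. A minimal sketch, assuming disks are matched by device path (the real agent may match on serial or another key) and using a hypothetical `SelectTargets` helper:

```go
package main

import "fmt"

// SelectTargets splits the block devices present on the host into those
// listed in expected_disks (eligible for destructive tests) and the rest.
// Matching on the device path is an assumption made for this sketch.
func SelectTargets(present, expected []string) (targets, skipped []string) {
	allowed := make(map[string]bool, len(expected))
	for _, d := range expected {
		allowed[d] = true
	}
	for _, d := range present {
		if allowed[d] {
			targets = append(targets, d)
		} else {
			skipped = append(skipped, d)
		}
	}
	return targets, skipped
}

func main() {
	present := []string{"/dev/sda", "/dev/sdb", "/dev/sdc"} // sdc: a USB stick
	expected := []string{"/dev/sda", "/dev/sdb"}

	targets, skipped := SelectTargets(present, expected)
	fmt.Println("targets:", targets)
	fmt.Println("skipped:", skipped)
}
```

The default is deny: a device that is not named in the expected list is never touched, whatever else is plugged in.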
## Notifications
Fire-and-forget. The orchestrator fires four event kinds:
| Kind | Severity | When |
|---|---|---|
| `StageFailed` | critical | Any stage returns `passed=false` |
| `SpecMismatch` | critical | `SpecValidate` finds critical diffs |
| `HoldingOpened` | critical | Agent POSTs `/hold` (operator can SSH in) |
| `RunCompleted` | info | Pipeline reaches `Completed` |
The config maps event kinds and severities to one or more notifiers
(ntfy, Discord webhook, SMTP). Each notifier gets one attempt per
event with a 10s timeout; delivery failures are logged, nothing is
persisted.
## Why a separate notify package?
Keeps the `/result` and `/hold` handlers non-blocking. Each dispatch
starts a goroutine per target; a slow ntfy server doesn't back up an
SMTP notifier or delay the HTTP response to the agent.
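The goroutine-per-target dispatch can be sketched as follows. The `Notifier` interface, the `recorder` and `broken` test doubles, and the returned `WaitGroup` are all assumptions for illustration; only the one-attempt, 10s-timeout, log-on-failure behaviour comes from the description above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"sync"
	"time"
)

const sendTimeout = 10 * time.Second // one attempt per event, per the docs

// Notifier is a hypothetical interface, not the package's real one.
type Notifier interface {
	Name() string
	Send(ctx context.Context, kind, msg string) error
}

// Dispatch starts one goroutine per target and returns immediately.
// The WaitGroup is returned only so demos and tests can wait; an HTTP
// handler would simply ignore it.
func Dispatch(targets []Notifier, kind, msg string) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	for _, n := range targets {
		wg.Add(1)
		go func(n Notifier) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), sendTimeout)
			defer cancel()
			if err := n.Send(ctx, kind, msg); err != nil {
				// Failures are logged and otherwise dropped.
				log.Printf("notify: %s: %s failed: %v", n.Name(), kind, err)
			}
		}(n)
	}
	return wg
}

// recorder remembers the event kinds it was sent.
type recorder struct {
	mu   sync.Mutex
	seen []string
}

func (r *recorder) Name() string { return "recorder" }
func (r *recorder) Send(_ context.Context, kind, _ string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.seen = append(r.seen, kind)
	return nil
}

// broken always fails, standing in for an unreachable server.
type broken struct{}

func (broken) Name() string { return "broken" }
func (broken) Send(context.Context, string, string) error {
	return errors.New("unreachable")
}

func main() {
	r := &recorder{}
	Dispatch([]Notifier{broken{}, r}, "RunCompleted", "run 7 passed").Wait()
	fmt.Println(r.seen)
}
```

Because each target gets its own goroutine and its own context, a failing or slow target affects only its own log line.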
## Data retention
The janitor goroutine (`internal/janitor`) runs a sweep every
`janitor.interval_minutes` (default 60) and deletes:
- artifact files older than `artifacts.retention_days`, plus their
`artifacts` table rows
- log files older than `logs.retention_days`
`runs`, `hosts`, `stages`, `measurements`, `spec_diffs` rows are
**never** deleted by the janitor — host histories and aggregate
metrics survive cleanups.
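A retention sweep over a directory can be sketched as below. This is not the janitor's actual code: `Sweep` and `demo` are hypothetical names, age is judged by file mtime (an assumption), and the deletion of matching `artifacts` table rows is omitted.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// Sweep removes regular files under root whose mtime is older than
// retentionDays, and reports how many it removed.
func Sweep(root string, retentionDays int, now time.Time) (removed int, err error) {
	cutoff := now.AddDate(0, 0, -retentionDays)
	err = filepath.WalkDir(root, func(path string, d fs.DirEntry, werr error) error {
		if werr != nil {
			return werr
		}
		if d.IsDir() {
			return nil
		}
		info, ierr := d.Info()
		if ierr != nil {
			return ierr
		}
		if info.ModTime().Before(cutoff) {
			if rerr := os.Remove(path); rerr != nil {
				return rerr
			}
			removed++
		}
		return nil
	})
	return removed, err
}

// demo builds a throwaway directory with one stale and one fresh file
// and sweeps it with a 30-day retention.
func demo() (int, error) {
	dir, err := os.MkdirTemp("", "janitor")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	stale := filepath.Join(dir, "run-1.log")
	fresh := filepath.Join(dir, "run-2.log")
	for _, p := range []string{stale, fresh} {
		if err := os.WriteFile(p, []byte("x"), 0o644); err != nil {
			return 0, err
		}
	}
	past := time.Now().AddDate(0, 0, -40)
	if err := os.Chtimes(stale, past, past); err != nil {
		return 0, err
	}
	return Sweep(dir, 30, time.Now())
}

func main() {
	n, err := demo()
	fmt.Println("removed:", n, "err:", err)
}
```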
## Reproducible builds
The orchestrator and agent are pure Go; `make orchestrator-linux`
cross-compiles to `linux-amd64` from Windows or macOS.
The live image requires Linux-side tooling (mkosi, debootstrap,
squashfs-tools) so `make live-image` fails loudly on Windows and
redirects to `wsl make live-image`. Pinning to snapshot.debian.org in
`live-image/mkosi.conf` keeps image bits stable across time for a
given git SHA.
+171
@@ -0,0 +1,171 @@
# Operations
Operator-facing runbook for the vetting orchestrator. If you're looking
for the "what does the system do" overview, see
[architecture.md](architecture.md). For what each test stage actually
measures, see [test-suite.md](test-suite.md).
## Install (Proxmox LXC)
Target: a Debian/Ubuntu LXC on a Proxmox host in the cluster whose
nodes you're vetting. The LXC must be on the same L2 segment as the
repaired nodes so DHCP and WoL work.
1. On your workstation, cross-build the binary:
```
make orchestrator-linux
```
This produces `bin/vetting-linux-amd64`.
2. Copy the repo tree (or just `bin/`, `deploy/`) into the LXC, then
from inside the LXC:
```
sudo ./deploy/install.sh
```
The installer:
- `apt install`s `dnsmasq`, `iperf3`, `ca-certificates`
- creates the `vetting` system user (home = `/var/lib/vetting`)
- installs the binary into `/usr/local/bin/vetting`
- drops `vetting.example.yaml` into `/etc/vetting/vetting.yaml`
(only if there's no existing config — existing configs are
preserved)
- drops `/etc/systemd/system/vetting.service`
- disables the distro-default dnsmasq (the orchestrator supervises
its own)
The installer does **not** enable the service, because the default
config has a placeholder bcrypt password that the binary refuses to
start with.
3. Generate an admin password hash and a session secret, then edit
`/etc/vetting/vetting.yaml`:
```
./bin/gen-admin-password 'your-password-here' # prints a bcrypt hash
openssl rand -hex 32 # prints a 64-char hex string
```
Required fields:
- `auth.admin_password_bcrypt` — the bcrypt hash
- `auth.session_secret_hex` — the 64-character hex string (32 random bytes)
- `server.public_url` — the URL your browser hits the LXC on
(e.g. `https://vetting.lan:8443`). This is used as the
click-through link in notifications, so it must be the *external*
URL, not the bind address.
4. (Optional) Configure notifiers in the same file — see the
commented-out example block for ntfy / Discord / SMTP.
5. Enable and start:
```
sudo systemctl enable --now vetting
sudo journalctl -fu vetting
```
## First vetting run
Against a QEMU VM first, before you point it at real hardware:
1. On the Proxmox host (or wherever your LXC lives), create the test bridge:
```
sudo ip link add br-vetting type bridge
sudo ip addr add 10.77.0.1/24 dev br-vetting
sudo ip link set br-vetting up
```
2. In the UI at `https://<lxc>:8443`, log in and register a host:
- Name: `qemu-test`
- MAC: `52:54:00:12:34:56`
- WoL broadcast IP: `10.77.0.255`
- Expected spec: paste a minimal YAML like
```yaml
memory: { total_gib: 4 }
cpu: { logical_cores: 4 }
```
3. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingWoL`.
4. Launch the QEMU VM on the bridge so it PXE-boots from dnsmasq:
```
sudo qemu-system-x86_64 \
-enable-kvm -cpu host -smp 4 -m 4096 \
-netdev bridge,id=n0,br=br-vetting \
-device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \
-drive file=/tmp/test-disk.img,format=raw,if=virtio \
-boot n -serial mon:stdio -display none
```
5. Watch the tile advance through stages. On success, the tile shows
**View report** and the VM auto-shuts-down.
For real repaired hardware: same flow, but register the node's actual
MAC + expected spec, and make sure the node's BIOS is set to PXE-boot
from the NIC that's on the `br-vetting` network.
## A failed run — SSH to the held host
When a stage fails, the pipeline halts at `FailedHolding` and the
agent installs an orchestrator-issued SSH key into the live-image's
`/root/.ssh/authorized_keys`. The UI tile surfaces the IP and the
exact `ssh` command.
The hold key is **per-run**. Once you're done:
1. Power the host off (`poweroff` from the SSH session).
2. In the UI, click **Override wipe-probe** only when the failure was
at the `Storage` stage *and* you're sure the disks are expendable.
Otherwise click **Start Vetting** on a fresh run from the host
dashboard after fixing the underlying issue.
## Log + artifact layout
```
/var/lib/vetting/
  vetting.db              # SQLite: hosts, runs, stages, artifacts, spec_diffs, measurements
  artifacts/
    run-<N>/
      report.html         # operator-facing summary
      report.json         # machine-readable summary
      inventory.json      # raw probe output
      fio-<disk>.log      # storage stage output
      iperf-<nic>.json    # network stage output
      hold-<N>.pub        # per-run SSH pubkey (only if held)
/var/log/vetting/
  run-<N>.log             # append-only per-run log tail
```
Retention is governed by the `artifacts.retention_days` and
`logs.retention_days` settings. DB rows (run history) are preserved
indefinitely; only on-disk files get pruned.
## Troubleshooting
| Symptom | First check |
|---|---|
| Service refuses to start with `auth.admin_password_bcrypt is the placeholder` | You didn't replace the bcrypt hash in the config. Run `gen-admin-password`. |
| PXE client gets no DHCP offer | `journalctl -u vetting` for dnsmasq errors; confirm the LXC has `CAP_NET_ADMIN` (the shipped systemd unit does); confirm the host MAC is actually registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). |
| Agent `/hello` never fires | Check the live image is actually loading the agent binary — SSH into the live env (use the hold key path), `systemctl status vetting-agent`. |
| Tile stuck on `Booting` | Most likely the live image booted but the agent can't reach the orchestrator. Verify `vetting.orchestrator=` in the kernel cmdline resolves from the host's network. |
| UI shows stale stage | Force a reload; the SSE reconnect is automatic but the browser keeps the last state on ephemeral network blips. |
| Notification didn't fire | `journalctl -u vetting \| grep notify:` — delivery is fire-and-forget and the failure reason is logged but not persisted. |
## Upgrading
1. `make orchestrator-linux` on your workstation.
2. `scp bin/vetting-linux-amd64 lxc:/tmp/vetting.new`
3. On the LXC:
```
sudo systemctl stop vetting
sudo install -m 0755 /tmp/vetting.new /usr/local/bin/vetting
sudo systemctl start vetting
```
The DB migration runs at startup and is append-only — no manual schema
work unless a release's notes call it out.
+166
@@ -0,0 +1,166 @@
# Test suite
What each stage measures, what "pass" means, and where the results
land. Stages run strictly in order. Any stage returning `passed=false`
halts the pipeline at `FailedHolding` — the operator decides whether
to fix, override, or abandon.
## Stage order
```
Inventory → SpecValidate → SMART → CPUStress → Storage
→ Network → GPU → PSU → Reporting
```
Stages marked *orchestrator-owned* resolve inside `/result` and never
show up as "the agent's turn".
---
## Inventory
**Owner:** agent.
**What it does:** `dmidecode`, `lscpu`, `lshw`, `lspci`, `smartctl -i`
over each block device, `nvidia-smi -q` if present. The raw output is
merged into a single JSON blob.
**Pass:** the probes run to completion; missing optional tools (e.g.
`nvidia-smi` on a GPU-less host) are tolerated.
**Artifacts:** `inventory.json` under `artifacts/run-<N>/`.
## SpecValidate *(orchestrator-owned)*
**Owner:** orchestrator (resolves inline inside the `/result` for the
preceding Inventory stage).
**What it does:** diffs the submitted inventory against the host's
`expected_spec_yaml`. The diff engine classifies each field as
`critical`, `warning`, or `info`.
**Pass:** zero `critical` diffs.
**Fail mode:** fires a `SpecMismatch` notification; transitions run
to `Failed → FailedHolding`.
**Artifacts:** `spec_diffs` table rows (one per divergence).
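A minimal sketch of an expected-vs-actual diff with severity classification. The flattened `map[string]string` representation, the per-field severity table, and the `warning` default are assumptions; the two field names are borrowed from the example spec in the operations doc.

```go
package main

import (
	"fmt"
	"sort"
)

// Diff is one expected-vs-actual divergence.
type Diff struct {
	Field, Expected, Actual, Severity string
}

// fieldSeverity is a hypothetical classification table.
var fieldSeverity = map[string]string{
	"memory.total_gib":  "critical",
	"cpu.logical_cores": "critical",
	"bios.version":      "info",
}

// Compare returns one Diff per expected field that is missing from, or
// different in, the actual inventory. Results are sorted by field name.
func Compare(expected, actual map[string]string) []Diff {
	var diffs []Diff
	for field, want := range expected {
		got, ok := actual[field]
		if ok && got == want {
			continue
		}
		sev, known := fieldSeverity[field]
		if !known {
			sev = "warning" // assumed default for unclassified fields
		}
		diffs = append(diffs, Diff{field, want, got, sev})
	}
	sort.Slice(diffs, func(i, j int) bool { return diffs[i].Field < diffs[j].Field })
	return diffs
}

// Passed applies the stage rule: zero critical diffs.
func Passed(diffs []Diff) bool {
	for _, d := range diffs {
		if d.Severity == "critical" {
			return false
		}
	}
	return true
}

func main() {
	expected := map[string]string{"memory.total_gib": "4", "cpu.logical_cores": "4"}
	actual := map[string]string{"memory.total_gib": "2", "cpu.logical_cores": "4"}

	diffs := Compare(expected, actual)
	fmt.Printf("%+v\n", diffs)
	fmt.Println("passed:", Passed(diffs))
}
```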
## SMART
**Owner:** agent.
**What it does:** `smartctl -a /dev/<disk>` for each disk in the
inventory's `expected_disks`. Parses reallocated-sector counts, pending
sectors, end-to-end error counters, overall-health attribute.
**Pass:** SMART overall-health is PASSED on every expected disk and
reallocated-sector count is below threshold.
**Artifacts:** `smart-<disk>.txt` raw output.
## CPUStress
**Owner:** agent.
**What it does:** runs `stress-ng --cpu N --vm M --vm-bytes 90% -t
120s` with `N = logical_cores` and `M ≈ logical_cores/2`. The `--vm`
flag is the **stand-in for Memtest86+**: it exercises the memory
subsystem under load and will fail if the RAM has latent faults that
surface under thermal + allocator pressure.
**Pass:** `stress-ng` exits 0 and thermal samples taken by the sidecar
stay below the configured per-host `max_temp_c`.
**Caveat:** weaker than a dedicated memtest pass; see
[architecture.md](architecture.md) for the reasoning (Memtest86+
can't be signalled back without IPMI serial).
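How the command line and the thermal check fit together can be sketched like this. `StressArgs` and `ThermalsOK` are hypothetical helpers; the floor of one `--vm` worker on single-core hosts is an assumption, as is treating a sample equal to `max_temp_c` as a failure.

```go
package main

import (
	"fmt"
	"strconv"
)

// StressArgs builds the stress-ng arguments described above:
// N = logical cores, M = roughly half of that (at least 1, assumed).
func StressArgs(logicalCores int) []string {
	vm := logicalCores / 2
	if vm < 1 {
		vm = 1
	}
	return []string{
		"--cpu", strconv.Itoa(logicalCores),
		"--vm", strconv.Itoa(vm),
		"--vm-bytes", "90%",
		"-t", "120s",
	}
}

// ThermalsOK reports whether every sidecar sample stayed strictly
// below the per-host limit.
func ThermalsOK(samplesC []float64, maxTempC float64) bool {
	for _, s := range samplesC {
		if s >= maxTempC {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("stress-ng", StressArgs(8))
	fmt.Println("thermals ok:", ThermalsOK([]float64{61.5, 72.0, 79.9}, 85))
}
```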
## Storage
**Owner:** agent (destructive).
**What it does:**
1. **Wipe probe** — scans for filesystem signatures, LVM metadata,
partition tables on the expected disks. Any hit → halt with
`UnexpectedData`; operator must click **Override wipe-probe**.
2. `badblocks -svw` (destructive read/write) on each expected disk.
3. `fio --rw=randrw --bs=4k --iodepth=32 --runtime=60 --size=1G` on
each disk; captures IOPS and p99 latency.
**Pass:** badblocks reports zero bad blocks; fio IOPS above a
per-class floor (configurable).
**Artifacts:** `fio-<disk>.json` per disk.
**Safety gate:** the wipe-probe + device allowlist are the third and
fourth of the four layered gates against wiping the wrong disk. See
[architecture.md § Safety](architecture.md#safety-destructive-disk-tests).
## Network
**Owner:** agent.
**What it does:** `iperf3 -c <orchestrator> -p <iperf_port> -t 10 -J`
to measure throughput to the orchestrator. The orchestrator-side
`iperf3 -s` is supervised by `internal/orchestrator/iperf.go` and
binds to the configured `network.iperf_port`.
**Pass:** throughput ≥ per-class floor (1 Gbps for 1GbE NICs, 9 Gbps
for 10GbE).
**Artifacts:** `iperf-<nic>.json`.
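The pass check can be sketched by reading the receiver-side summary from the `iperf3 -J` output. `end.sum_received.bits_per_second` is the field iperf3 emits for TCP tests; the helper names and the choice of the received figure (rather than the sent one) are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// iperfResult picks out only the field this sketch needs.
type iperfResult struct {
	End struct {
		SumReceived struct {
			BitsPerSecond float64 `json:"bits_per_second"`
		} `json:"sum_received"`
	} `json:"end"`
}

// ThroughputGbps extracts the received throughput from iperf3 JSON.
func ThroughputGbps(raw []byte) (float64, error) {
	var r iperfResult
	if err := json.Unmarshal(raw, &r); err != nil {
		return 0, err
	}
	bps := r.End.SumReceived.BitsPerSecond
	if bps <= 0 {
		return 0, fmt.Errorf("no received throughput in iperf3 output")
	}
	return bps / 1e9, nil
}

// MeetsFloor applies the stage rule: throughput >= per-class floor.
func MeetsFloor(raw []byte, floorGbps float64) (bool, error) {
	gbps, err := ThroughputGbps(raw)
	if err != nil {
		return false, err
	}
	return gbps >= floorGbps, nil
}

func main() {
	sample := []byte(`{"end":{"sum_received":{"bits_per_second":9.41e9}}}`)
	ok, err := MeetsFloor(sample, 9)
	fmt.Println("10GbE floor met:", ok, "err:", err)
}
```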
## GPU
**Owner:** agent.
**What it does:** runs `nvidia-smi -q` and a short compute workload
(`gpu-burn` if present, else `nvidia-smi dmon` during a `stress-ng
--gpu` burst). Skipped cleanly when no GPU is present.
**Pass:** no ECC errors reported; temperature below threshold; compute
workload exits 0.
## PSU
**Owner:** agent.
**What it does:** reads `/sys/class/hwmon/*/power*_average` and `in*_input`
during a synthetic load burst (CPU + disk + NIC simultaneously) to
look for voltage sag or wattage anomalies. Records the full envelope
as `measurements` rows with `kind=psu`.
**Pass:** no voltage dip below threshold across the load burst.
**Caveat:** only reports on what the BMC exposes via hwmon — servers
without exposed PSU telemetry pass trivially. Documented limitation.
## Reporting *(orchestrator-owned)*
**Owner:** orchestrator (resolves inline inside the `/result` for PSU).
**What it does:**
1. Gathers run, host, stages, spec_diffs, and measurement aggregates.
2. Renders `report.html` via `internal/report` (html/template with
inlined CSS; self-contained offline-viewable).
3. Writes `report.json` with the same data in machine-readable form.
4. Records both as `report_html` / `report_json` artifact rows.
5. Transitions run → `Completed`.
6. Fires `RunCompleted` notification.
7. The next agent heartbeat returns `cmd=shutdown`.
## Thermal sidecar
**Owner:** agent (always-on from `Booting` until the agent exits).
**What it does:** every 5 seconds, walks `/sys/class/hwmon/*` and
POSTs temperature samples as a batch to `/sensor`. Populates the
`measurements` table with `kind=thermal`.
**No pass/fail** on its own — stages that care about thermals read the
sidecar's data via `measurements`. A dead sensor just drops out of
the next batch.
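One sampling pass of the sidecar can be sketched as below. hwmon `temp*_input` files report millidegrees Celsius; the `ReadTemps` and `demo` names, the map keyed by relative path, and the throwaway directory standing in for `/sys/class/hwmon` are assumptions. The 5-second loop and the POST to `/sensor` are left out.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// ReadTemps collects every readable temp*_input under root/hwmon*.
// Sensors that cannot be read or parsed are skipped, so a dead sensor
// simply drops out of the batch.
func ReadTemps(root string) (map[string]float64, error) {
	paths, err := filepath.Glob(filepath.Join(root, "hwmon*", "temp*_input"))
	if err != nil {
		return nil, err
	}
	out := make(map[string]float64)
	for _, p := range paths {
		b, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		milli, err := strconv.ParseFloat(strings.TrimSpace(string(b)), 64)
		if err != nil {
			continue
		}
		rel, err := filepath.Rel(root, p)
		if err != nil {
			continue
		}
		out[rel] = milli / 1000 // millidegrees -> degrees Celsius
	}
	return out, nil
}

// demo fakes a hwmon tree with one healthy and one garbled sensor.
func demo() (map[string]float64, error) {
	root, err := os.MkdirTemp("", "hwmon")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)

	chip := filepath.Join(root, "hwmon0")
	if err := os.MkdirAll(chip, 0o755); err != nil {
		return nil, err
	}
	files := map[string]string{"temp1_input": "45000\n", "temp2_input": "garbage"}
	for name, body := range files {
		if err := os.WriteFile(filepath.Join(chip, name), []byte(body), 0o644); err != nil {
			return nil, err
		}
	}
	return ReadTemps(root)
}

func main() {
	temps, err := demo()
	fmt.Println(temps, err)
}
```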
---
## Where pass/fail lives
- `runs.state` — authoritative terminal state (`Completed`,
`FailedHolding`, `Released`).
- `runs.result` — `pass` or `fail` string once the run completes.
- `runs.failed_stage` — name of the stage that halted the pipeline, if
any. Cleared when the operator overrides and re-enters.
- `stages` — one row per attempted stage with `passed`, `started_at`,
`completed_at`, `summary_json`, `message`.
- `measurements` — time-series samples from the thermal sidecar and
from stages that capture numeric outputs.
- `artifacts` — on-disk files (report, fio logs, iperf logs, etc).
- `spec_diffs` — one row per expected-vs-actual divergence.
## Adding a new stage
1. Add the name to `store.DefaultStageOrder`.
2. Add a `model.State<Name>` const and wire it into
`internal/orchestrator/statemachine.go` (both the forward
transition table and the stage-for-state lookup).
3. Add a case to `agent/runner.go`'s `runStage` dispatch.
4. Drop the implementation into `agent/tests/`.
5. If the stage is orchestrator-owned, add a `resolve<Name>` helper to
`internal/api/agent_handlers.go` and invoke it from the `/result`
handler after the preceding stage's `NextState` resolves.