josh 8367ec2a9f
docs: comprehensive documentation expansion
Add 4 new doc files (configuration reference, development guide, API
reference with full request/response schemas, database schema), expand
the README with a feature list and how-it-works walkthrough, fix
missing Firmware and Burn stages in architecture.md and test-suite.md,
add threshold engine and host-mode agent sections, and add godoc
comments to 11 packages and 6 model types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-23 18:37:26 -04:00

Vetting

Post-repair hardware validation pipeline for Proxmox cluster hosts. Register a host, click Start Vetting, and the orchestrator will PXE-boot it into a custom Linux live image and run it through a consistent battery of tests (CPU stress, RAM stress, SMART, disk I/O, network throughput, GPU, PSU telemetry). Pass → auto-shutdown + HTML report. Fail → pipeline halts, a one-time SSH key is issued for triage, notification fires.

Built for solo-operator home labs: one Go binary, SQLite + flat files, HTMX + SSE UI, bundled dnsmasq, optional ntfy / Discord / SMTP notifications.

Features

  • Automated PXE boot — dnsmasq proxy-DHCP serves a disposable Debian live image to registered MACs. No VLAN, no dedicated bridge.
  • 11-stage validation pipeline — Inventory, Firmware, SpecValidate, SMART, CPUStress, Storage, Network, Burn, GPU, PSU, Reporting.
  • Three vetting profiles — quick (~10 min), deep (~8-12 h), soak (~36-40 h). Same probes and gates; only durations scale.
  • Server-side threshold engine — per-run rules evaluate every sensor batch in real time. Critical breaches (thermal runaway, EDAC UE, voltage sag) fail the run immediately.
  • FailedHolding with SSH — when a stage fails the pipeline parks the host and issues a one-time SSH key so you can triage in the live image.
  • Real-time dashboard — HTMX + SSE push tile updates, stage progress, sub-step detail, and live log tailing to the browser.
  • Pluggable notifications — ntfy, Discord webhooks, and SMTP with severity-routed delivery.
  • Non-destructive mode — skip badblocks + wipe for hosts with data you want to keep.
  • Host-mode agent — a persistent reporter that heartbeats from installed hosts and reboots into the live image on command.
  • Self-contained HTML reports — offline-viewable summaries with inlined CSS; machine-readable JSON alongside.
  • Four-layer safety gates — a MAC allowlist, a signed run token, a wipe probe, and a device allowlist together protect against accidental disk wipes.
  • Janitor — automatic retention-based cleanup of artifact files and log files.
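
The 11-stage pipeline and the FailedHolding behaviour described above can be pictured as a simple ordered walk that stops at the first failing stage. The following is a minimal sketch, not the project's actual code: it assumes the stages run in the order they are listed, and `StageFunc`, `RunPipeline`, `failAt`, and the `Passed` outcome are hypothetical names introduced only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// stages mirrors the list under Features. Treating the listed order as the
// execution order is an assumption of this sketch.
var stages = []string{
	"Inventory", "Firmware", "SpecValidate", "SMART", "CPUStress",
	"Storage", "Network", "Burn", "GPU", "PSU", "Reporting",
}

// Outcome is a hypothetical terminal state. Only FailedHolding is a name
// taken from the README; Passed is made up for this example.
type Outcome string

const (
	Passed        Outcome = "Passed"
	FailedHolding Outcome = "FailedHolding"
)

// StageFunc is a hypothetical hook that executes one stage by name.
type StageFunc func(stage string) error

// RunPipeline runs the stages in order and stops at the first error,
// returning FailedHolding together with the name of the failing stage.
func RunPipeline(run StageFunc) (Outcome, string, error) {
	for _, s := range stages {
		if err := run(s); err != nil {
			return FailedHolding, s, err
		}
	}
	return Passed, "", nil
}

// failAt returns a StageFunc that fails only at the named stage (demo helper).
func failAt(name string) StageFunc {
	return func(stage string) error {
		if stage == name {
			return errors.New("simulated failure in " + stage)
		}
		return nil
	}
}

func main() {
	outcome, failedAt, err := RunPipeline(failAt("SMART"))
	fmt.Println(outcome, failedAt, err)
}
```

Stopping at the first failure keeps later, possibly destructive stages (such as Storage) from running on a host that has already shown a fault.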

How it works

  1. Install the host-mode agent on each node (one-liner from the dashboard's quick-register script).
  2. Register the host in the web UI — name, MAC, expected hardware spec (YAML).
  3. Click Start Vetting and choose a profile (quick / deep / soak).
  4. The host-mode agent receives a reboot_for_vetting heartbeat command and reboots into PXE.
  5. dnsmasq serves the iPXE script; the host boots a disposable Linux live image containing the vetting agent.
  6. The agent claims the run (token auth), then walks through each stage — posting logs, sensor readings, and results back to the orchestrator.
  7. Thresholds are evaluated server-side on every sensor batch.
  8. Pass — auto-reboot to local disk, HTML report generated, notification fires.
  9. Fail — pipeline parks in FailedHolding, SSH key issued, notification fires. Operator triages and retries or releases.
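
Step 7 is the part that is easiest to misread, so here is a rough sketch of what evaluating rules against a sensor batch might look like. The types, field names, sensor names, and limits below are all assumptions made for illustration; the real rule format lives in the project's own docs and may differ.

```go
package main

import "fmt"

// Severity is a hypothetical two-level scale; the real engine may use more.
type Severity int

const (
	Warning Severity = iota
	Critical
)

// Rule is an illustrative threshold: a reading for Sensor breaches the rule
// when its value falls outside the closed range [Min, Max].
type Rule struct {
	Sensor   string
	Min, Max float64
	Severity Severity
}

// Reading is one sample in a sensor batch posted by the agent.
type Reading struct {
	Sensor string
	Value  float64
}

// Breach records which rule a reading violated.
type Breach struct {
	Rule    Rule
	Reading Reading
}

// Evaluate checks every reading in the batch against every rule for the same
// sensor. It returns all breaches, and failRun is true if at least one
// breached rule is Critical.
func Evaluate(rules []Rule, batch []Reading) (breaches []Breach, failRun bool) {
	for _, rd := range batch {
		for _, r := range rules {
			if r.Sensor != rd.Sensor {
				continue
			}
			if rd.Value < r.Min || rd.Value > r.Max {
				breaches = append(breaches, Breach{Rule: r, Reading: rd})
				if r.Severity == Critical {
					failRun = true
				}
			}
		}
	}
	return breaches, failRun
}

func main() {
	// Sensor names and limits are invented for the example.
	rules := []Rule{
		{Sensor: "cpu_temp_c", Min: 0, Max: 95, Severity: Critical},
		{Sensor: "psu_12v", Min: 11.4, Max: 12.6, Severity: Critical},
		{Sensor: "fan_rpm", Min: 500, Max: 20000, Severity: Warning},
	}
	batch := []Reading{
		{Sensor: "cpu_temp_c", Value: 71},
		{Sensor: "psu_12v", Value: 11.1}, // simulated voltage sag
	}
	breaches, failRun := Evaluate(rules, batch)
	fmt.Println(len(breaches), failRun) // prints "1 true"
}
```

Evaluating on the server rather than in the agent means a wedged or crashed live image cannot suppress a critical breach that was already reported.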

Documentation

Quick start (local, against QEMU)

make all
./bin/vetting --config deploy/vetting.example.yaml
# → http://localhost:8080

The UI has no built-in auth — bind to loopback or LAN only, or front the service with a reverse proxy (Caddy/nginx basic-auth) if you want a password. The agent↔orchestrator channel keeps its own bearer-token auth and is unaffected.
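
For readers unfamiliar with bearer-token auth, the check on the agent↔orchestrator channel is conceptually similar to the middleware below. This is a generic sketch built on the Go standard library, not the project's handler: `requireBearer`, `okHandler`, `status`, the `/api/claim` path, and the token value are all hypothetical.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// requireBearer wraps next and rejects any request whose Authorization
// header does not carry exactly "Bearer <token>". An empty presented token
// is always rejected.
func requireBearer(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		const prefix = "Bearer "
		hdr := r.Header.Get("Authorization")
		got := strings.TrimPrefix(hdr, prefix)
		// Constant-time comparison avoids leaking how many bytes matched.
		if !strings.HasPrefix(hdr, prefix) || got == "" ||
			subtle.ConstantTimeCompare([]byte(got), []byte(token)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// okHandler stands in for a protected endpoint and replies 204 No Content.
var okHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusNoContent)
})

// status sends one in-memory request with the given Authorization header
// and returns the response status code.
func status(h http.Handler, authHeader string) int {
	req := httptest.NewRequest(http.MethodPost, "/api/claim", nil) // hypothetical path
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := requireBearer("run-token-123", okHandler) // hypothetical token
	fmt.Println(status(h, "Bearer run-token-123")) // 204
	fmt.Println(status(h, "Bearer wrong"))         // 401
	fmt.Println(status(h, ""))                     // 401
}
```

Because this check applies only to agent traffic, putting basic-auth on the UI through a reverse proxy does not interfere with it, provided the proxy does not also demand credentials on the agent-facing routes.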

For a full end-to-end QEMU walk-through (bridge setup, host registration, PXE boot), see docs/operations.md § First vetting run.

Production install (Proxmox LXC)

On a fresh Debian/Ubuntu LXC, as root:

curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh | bash

That installs Go (if missing), clones the repo to /opt/vetting-src, builds vetting-linux-amd64, and hands off to deploy/install.sh — which lays down the binary, systemd unit, example config, and vetting service user. Then:

# Edit /etc/vetting/vetting.yaml (server.bind + server.public_url)
sudo systemctl enable --now vetting
journalctl -fu vetting

Prefer to build yourself? The manual path:

make orchestrator-linux
scp -r bin deploy lxc:/opt/vetting/
ssh lxc "cd /opt/vetting && sudo ./deploy/install.sh"
ssh lxc "sudo systemctl enable --now vetting"

See docs/operations.md § Install for the full walkthrough.

Repository layout

cmd/                  orchestrator + agent entrypoints
internal/             core packages (see docs/architecture.md for the map)
agent/                in-image agent logic (claim loop, stage dispatch, probes)
live-image/           mkosi config for the PXE-bootable Debian live image
deploy/               systemd unit + install.sh + example config
docs/                 operator + developer docs
test/e2e/             build-tag-gated QEMU + PXE full-stack test
tools/                small CLI helpers

Development

  • make test — Go unit + smoke tests (cross-platform)
  • make vet — go vet on the whole module
  • make live-image — Linux-only; run under WSL from Windows
  • make e2e — requires Linux root + live image + running orchestrator
  • make run — build + launch the orchestrator with the example config

Windows hosts: everything except live-image and e2e works natively. The live image build calls mkosi, which needs a real Linux userspace, so use WSL for those targets.

Status

All six phases in the original plan are implemented. The E2E QEMU harness is wired in test/e2e/qemu_test.go but requires a running orchestrator + registered host + queued run as preconditions — it's a developer-facing integration harness, not a unit test.