# Vetting

Post-repair hardware validation pipeline for Proxmox cluster hosts. Register a host, click **Start Vetting**, and the orchestrator will PXE-boot it into a custom Linux live image and run it through a consistent battery of tests (CPU stress, RAM stress, SMART, disk I/O, network throughput, GPU, PSU telemetry). Pass → auto-shutdown + HTML report. Fail → pipeline halts, SSH drops in, notification fires.

Built for solo-operator home labs: one Go binary, SQLite + flat files, HTMX + SSE UI, bundled dnsmasq, optional ntfy / Discord / SMTP notifications.

## Features

- **Automated PXE boot** — dnsmasq proxy-DHCP serves a disposable Debian live image to registered MACs. No VLAN, no dedicated bridge.
- **11-stage validation pipeline** — Inventory, Firmware, SpecValidate, SMART, CPUStress, Storage, Network, Burn, GPU, PSU, Reporting.
- **Three vetting profiles** — quick (~10 min), deep (~8-12 h), soak (~36-40 h). Same probes and gates; only durations scale.
- **Server-side threshold engine** — per-run rules evaluate every sensor batch in real time. Critical breaches (thermal runaway, EDAC UE, voltage sag) fail the run immediately.
- **FailedHolding with SSH** — when a stage fails the pipeline parks the host and issues a one-time SSH key so you can triage in the live image.
- **Real-time dashboard** — HTMX + SSE push tile updates, stage progress, sub-step detail, and live log tailing to the browser.
- **Pluggable notifications** — ntfy, Discord webhooks, and SMTP with severity-routed delivery.
- **Non-destructive mode** — skip badblocks + wipe for hosts with data you want to keep.
- **Host-mode agent** — a persistent reporter that heartbeats from installed hosts and reboots into the live image on command.
- **Self-contained HTML reports** — offline-viewable summaries with inlined CSS; machine-readable JSON alongside.
- **Four-layer safety gates** — MAC allowlist, signed run token, wipe probe, device allowlist protect against accidental disk wipes.
- **Janitor** — automatic retention-based cleanup of artifact files and log files.

## How it works

1. Install the host-mode agent on each node (one-liner from the dashboard's quick-register script).
2. Register the host in the web UI — name, MAC, expected hardware spec (YAML).
3. Click **Start Vetting** and choose a profile (quick / deep / soak).
4. The host-mode agent receives a `reboot_for_vetting` heartbeat command and reboots into PXE.
5. dnsmasq serves the iPXE script; the host boots a disposable Linux live image containing the vetting agent.
6. The agent claims the run (token auth), then walks through each stage — posting logs, sensor readings, and results back to the orchestrator.
7. Thresholds are evaluated server-side on every sensor batch.
8. **Pass** — auto-reboot to local disk, HTML report generated, notification fires.
9. **Fail** — pipeline parks in FailedHolding, SSH key issued, notification fires. Operator triages and retries or releases.

## Documentation

- [docs/operations.md](docs/operations.md) — install, first run, troubleshooting
- [docs/architecture.md](docs/architecture.md) — packages, state machine, protocol, safety model
- [docs/test-suite.md](docs/test-suite.md) — what each stage measures
- [docs/configuration.md](docs/configuration.md) — every YAML config knob, profiles, thresholds
- [docs/api-reference.md](docs/api-reference.md) — HTTP API with request/response schemas, SSE events
- [docs/database.md](docs/database.md) — SQLite schema, tables, entity relationships
- [docs/development.md](docs/development.md) — dev setup, building, testing, adding stages

## Quick start (local, against QEMU)

```bash
make all
./bin/vetting --config deploy/vetting.example.yaml
# → http://localhost:8080
```

The UI has no built-in auth — bind to loopback or LAN only, or front the service with a reverse proxy (Caddy/nginx basic-auth) if you want a password. The agent↔orchestrator channel keeps its own bearer-token auth and is unaffected.

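
To illustrate the split described above (open UI routes, token-guarded agent routes), here is a minimal Go sketch. It is not taken from the Vetting source: the `requireBearer` and `newMux` helpers, the `/agent/` path prefix, and the `s3cret` token are all hypothetical names chosen for the example. It only shows one common way such a check can be written with the standard library.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// requireBearer is a hypothetical middleware (not Vetting's actual code).
// It rejects any request whose Authorization header does not carry the
// expected bearer token, using a constant-time comparison.
func requireBearer(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		const prefix = "Bearer "
		h := r.Header.Get("Authorization")
		got := strings.TrimPrefix(h, prefix)
		if !strings.HasPrefix(h, prefix) ||
			subtle.ConstantTimeCompare([]byte(got), []byte(token)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newMux wires an unauthenticated UI route and a token-guarded agent route.
// The "/agent/" prefix is an assumption made for illustration only.
func newMux(token string) *http.ServeMux {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	mux := http.NewServeMux()
	mux.Handle("/", ok)                             // UI: no built-in auth
	mux.Handle("/agent/", requireBearer(token, ok)) // agent channel: bearer token
	return mux
}

// status sends one in-memory GET request and returns the response code.
func status(h http.Handler, path, auth string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if auth != "" {
		req.Header.Set("Authorization", auth)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	mux := newMux("s3cret")
	fmt.Println(status(mux, "/", ""))                         // UI without credentials
	fmt.Println(status(mux, "/agent/claim", ""))              // agent route, no token
	fmt.Println(status(mux, "/agent/claim", "Bearer s3cret")) // agent route, valid token
}
```

Under these assumptions, putting a reverse proxy with basic-auth in front of `/` would not interfere with the agent routes, since the bearer check is independent of whatever the proxy enforces for the UI.
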

For a full end-to-end QEMU walk-through (bridge setup, host registration, PXE boot), see [docs/operations.md § First vetting run](docs/operations.md#first-vetting-run).

## Production install (Proxmox LXC)

On a fresh Debian/Ubuntu LXC, as root:

```bash
curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh | bash
```

That installs Go (if missing), clones the repo to `/opt/vetting-src`, builds `vetting-linux-amd64`, and hands off to `deploy/install.sh` — which lays down the binary, systemd unit, example config, and `vetting` service user. Then:

```bash
# Edit /etc/vetting/vetting.yaml (server.bind + server.public_url)
sudo systemctl enable --now vetting
journalctl -fu vetting
```

Prefer to build yourself? The manual path:

```bash
make orchestrator-linux
scp -r bin deploy lxc:/opt/vetting/
ssh lxc "cd /opt/vetting && sudo ./deploy/install.sh"
ssh lxc "sudo systemctl enable --now vetting"
```

See [docs/operations.md § Install](docs/operations.md#install-proxmox-lxc) for the full walkthrough.

## Repository layout

```
cmd/          orchestrator + agent entrypoints
internal/     core packages (see docs/architecture.md for the map)
agent/        in-image agent logic (claim loop, stage dispatch, probes)
live-image/   mkosi config for the PXE-bootable Debian live image
deploy/       systemd unit + install.sh + example config
docs/         operator + developer docs
test/e2e/     build-tag-gated QEMU + PXE full-stack test
tools/        small CLI helpers
```

## Development

- `make test` — Go unit + smoke tests (cross-platform)
- `make vet` — `go vet` on the whole module
- `make live-image` — Linux-only; run under WSL from Windows
- `make e2e` — requires Linux root + live image + running orchestrator
- `make run` — build + launch the orchestrator with the example config

Windows hosts: everything except `live-image` and `e2e` works natively. The live image build calls `mkosi` which needs a real Linux userspace, so use WSL for those targets.

## Status

All six phases in the original plan are implemented. The E2E QEMU harness is wired in `test/e2e/qemu_test.go` but requires a running orchestrator + registered host + queued run as preconditions — it's a developer-facing integration harness, not a unit test.