Orion's run (log 20:49 → 20:54) shipped GREEN while silently skipping CPUStress. Two compounding bugs: 1. CPUStress ran --cpu N AND --vm N --vm-bytes 90% concurrently. On a 4-core 8 GiB N95, that's 360% RAM overcommit; the OOM-killer fired, usually on the agent itself. Replaced with two sequential passes — CPU (all methods, --verify) for 3 min, then RAM (--vm 1, --vm-bytes capped to MemAvailable − 1.5 GiB, floor 256 MiB, --verify) for 3 min. Each pass now also asserts elapsed ≥ target − 2s so a premature clean exit counts as failure instead of a silent pass. 2. On systemd-restart after the OOM, the agent hardcoded nextStage := "Inventory" and re-ran it. The orchestrator's /result handler advances run state via TriggerStageCompleted against the *current* RunState, not against body.Stage — so an Inventory result posted while the run was in StateCPUStress silently advanced CPUStress → Storage and marked CPUStress passed without it ever running. Two-layer defense for #2: - agent-side: /claim response now carries current_state; agent resumes at the matching stage on a re-claim (happy path). - server-side: new TriggerStageMismatch + StageNameForState helper backstop. If body.Stage doesn't match the run's current stage, /result parks the run in FailedHolding with failed_stage labeled "<got> (expected <expected>)" and returns 409. Other stages audited for similar unbounded concurrency — none found; only CPUStress was unsafe. Tests: - cpustress_test.go — parseMemAvailable parses real meminfo, errors on missing/malformed; cap calc hits floor on tiny boxes, uses 1.5 GiB headroom on normal/huge boxes. - statemachine_test.go — TriggerStageMismatch lands at FailedHolding from every stage state and is rejected from pre-stage/terminal states; StageNameForState round-trips the stageStates map. - agent_handlers_test.go — TestResult_RejectsMismatchedStage proves the Orion scenario now 409s + FailedHolding; TestResult_AcceptsMatchingStage proves the guard doesn't break the happy path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Vetting
Post-repair hardware validation pipeline for Proxmox cluster hosts. Register a host, click Start Vetting, and the orchestrator will PXE-boot it into a custom Linux live image and run it through a consistent battery of tests (CPU stress, RAM stress, SMART, disk I/O, network throughput, GPU, PSU telemetry). Pass → auto-shutdown + HTML report. Fail → pipeline halts, SSH drops in, notification fires.
Built for solo-operator home labs: one Go binary, SQLite + flat files, HTMX + SSE UI, bundled dnsmasq, optional ntfy / Discord / SMTP notifications.
Documentation
- docs/operations.md — install + first run + troubleshooting
- docs/architecture.md — packages, state machine, protocol
- docs/test-suite.md — what each stage measures
Quick start (local, against QEMU)
make all
./bin/vetting --config deploy/vetting.example.yaml
# → http://localhost:8080
The UI has no built-in auth — bind to loopback or LAN only, or front the service with a reverse proxy (Caddy/nginx basic-auth) if you want a password. The agent↔orchestrator channel keeps its own bearer-token auth and is unaffected.
For a full end-to-end QEMU walk-through (bridge setup, host registration, PXE boot), see docs/operations.md § First vetting run.
Production install (Proxmox LXC)
On a fresh Debian/Ubuntu LXC, as root:
curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh | bash
That installs Go (if missing), clones the repo to /opt/vetting-src,
builds vetting-linux-amd64, and hands off to deploy/install.sh —
which lays down the binary, systemd unit, example config, and
vetting service user. Then:
# Edit /etc/vetting/vetting.yaml (server.bind + server.public_url)
sudo systemctl enable --now vetting
journalctl -fu vetting
Prefer to build yourself? The manual path:
make orchestrator-linux
scp -r bin deploy lxc:/opt/vetting/
ssh lxc "cd /opt/vetting && sudo ./deploy/install.sh"
ssh lxc "sudo systemctl enable --now vetting"
See docs/operations.md § Install for the full walkthrough.
Repository layout
cmd/ orchestrator + agent entrypoints
internal/ core packages (see docs/architecture.md for the map)
agent/ in-image agent logic (claim loop, stage dispatch, probes)
live-image/ mkosi config for the PXE-bootable Debian live image
deploy/ systemd unit + install.sh + example config
docs/ operator + developer docs
test/e2e/ build-tag-gated QEMU + PXE full-stack test
tools/ small CLI helpers
Development
make test— Go unit + smoke tests (cross-platform)make vet—go veton the whole modulemake live-image— Linux-only; run under WSL from Windowsmake e2e— requires Linux root + live image + running orchestratormake run— build + launch the orchestrator with the example config
Windows hosts: everything except live-image and e2e works natively.
The live image build calls mkosi which needs a real Linux userspace,
so use WSL for those targets.
Status
All six phases in the original plan are implemented. The E2E QEMU
harness is wired in test/e2e/qemu_test.go but requires a running
orchestrator + registered host + queued run as preconditions — it's a
developer-facing integration harness, not a unit test.