josh/Vetting

josh 27098fc7ed

CI / Lint + build + test (push) Successful in 1m23s

Release / release (push) Successful in 6m2s

cpustress+orchestrator: serial CPU/RAM passes + silent-skip guard

Orion's run (log 20:49 → 20:54) shipped GREEN while silently skipping
CPUStress. Two compounding bugs:

1. CPUStress ran --cpu N AND --vm N --vm-bytes 90% concurrently.
   On a 4-core 8 GiB N95, that's 360% RAM overcommit; the OOM-killer
   fired, usually on the agent itself. Replaced with two sequential
   passes — CPU (all methods, --verify) for 3 min, then RAM (--vm 1,
   --vm-bytes capped to MemAvailable − 1.5 GiB, floor 256 MiB, --verify)
   for 3 min. Each pass now also asserts elapsed ≥ target − 2s so a
   premature clean exit counts as failure instead of a silent pass.

2. On systemd-restart after the OOM, the agent hardcoded nextStage :=
   "Inventory" and re-ran it. The orchestrator's /result handler
   advances run state via TriggerStageCompleted against the *current*
   RunState, not against body.Stage — so an Inventory result posted
   while the run was in StateCPUStress silently advanced CPUStress →
   Storage and marked CPUStress passed without it ever running.
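
The RAM-pass sizing and the elapsed-time assertion from item 1 can be sketched roughly as below. This is a minimal illustration, not the code in cpustress.go: only the name parseMemAvailable and the numbers (1.5 GiB headroom, 256 MiB floor, 2 s slack) come from this commit message, while vmBytesCap, ranFullDuration, and the constant names are hypothetical.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
	"time"
)

const (
	headroomBytes = 1536 << 20 // 1.5 GiB left for the agent and the kernel
	floorBytes    = 256 << 20  // never size the RAM pass below 256 MiB
)

// parseMemAvailable extracts the MemAvailable value from /proc/meminfo
// text and returns it in bytes. It errors on a missing or malformed line.
func parseMemAvailable(meminfo string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 || fields[0] != "MemAvailable:" {
			continue
		}
		kib, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return 0, fmt.Errorf("malformed MemAvailable value %q: %w", fields[1], err)
		}
		return kib * 1024, nil // meminfo reports the value in kB
	}
	return 0, errors.New("MemAvailable not found in meminfo")
}

// vmBytesCap (hypothetical name) returns MemAvailable minus the headroom,
// clamped so the result never drops below the floor.
func vmBytesCap(memAvailable uint64) uint64 {
	if memAvailable <= headroomBytes+floorBytes {
		return floorBytes
	}
	return memAvailable - headroomBytes
}

// ranFullDuration (hypothetical name) reports whether a pass lasted at
// least target minus 2 s, so an early clean exit can be treated as failure.
func ranFullDuration(elapsed, target time.Duration) bool {
	return elapsed >= target-2*time.Second
}

func main() {
	sample := "MemTotal:        8000000 kB\nMemAvailable:    6291456 kB\n"
	avail, err := parseMemAvailable(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("vm-bytes cap:", vmBytesCap(avail))
	fmt.Println("90s of a 3m pass counts as complete:",
		ranFullDuration(90*time.Second, 3*time.Minute))
}
```

On a real host the input string would presumably come from reading /proc/meminfo just before the RAM pass starts, so the cap reflects memory that is actually free at that moment.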

Two-layer defense for #2:
- agent-side: /claim response now carries current_state; agent resumes
  at the matching stage on a re-claim (happy path).
- server-side: new TriggerStageMismatch + StageNameForState helper
  backstop. If body.Stage doesn't match the run's current stage, /result
  parks the run in FailedHolding with failed_stage labeled
  "<got> (expected <expected>)" and returns 409.

Other stages audited for similar unbounded concurrency — none found;
only CPUStress was unsafe.
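
As a rough illustration of the two-layer defense for #2 described above, the sketch below shows an agent-side resume helper and a server-side mismatch check. It assumes a simplified model: the names StageNameForState, stageStates, FailedHolding, and the "<got> (expected <expected>)" label come from this commit message, while Run, applyResult, resumeFrom, the three-stage list, and the JSON field name are hypothetical stand-ins for the real state machine.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

type RunState string

const (
	StateInventory     RunState = "Inventory"
	StateCPUStress     RunState = "CPUStress"
	StateStorage       RunState = "Storage"
	StateFailedHolding RunState = "FailedHolding"
)

// stageOrder and stageStates are a hypothetical subset of the real pipeline.
var stageOrder = []string{"Inventory", "CPUStress", "Storage"}

var stageStates = map[string]RunState{
	"Inventory": StateInventory,
	"CPUStress": StateCPUStress,
	"Storage":   StateStorage,
}

// StageNameForState is the reverse lookup of stageStates.
func StageNameForState(s RunState) (string, bool) {
	for name, st := range stageStates {
		if st == s {
			return name, true
		}
	}
	return "", false
}

// Run is a hypothetical, minimal view of a run record.
type Run struct {
	State       RunState
	FailedStage string
}

// applyResult is the server-side backstop: a result for the wrong stage
// parks the run in FailedHolding and yields 409 instead of advancing it.
func applyResult(run *Run, stage string) int {
	expected, ok := StageNameForState(run.State)
	if !ok {
		// Assumption: results outside a stage state are also rejected with 409.
		return http.StatusConflict
	}
	if stage != expected {
		run.FailedStage = fmt.Sprintf("%s (expected %s)", stage, expected)
		run.State = StateFailedHolding
		return http.StatusConflict
	}
	// The real handler would fire TriggerStageCompleted here.
	return http.StatusOK
}

// resultHandler is a thin HTTP wrapper around applyResult.
func resultHandler(run *Run) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var body struct {
			Stage string `json:"stage"` // field name is an assumption
		}
		if err := json.NewDecoder(r.Body).Decode(&body); err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		w.WriteHeader(applyResult(run, body.Stage))
	}
}

// resumeFrom is the agent-side layer: given current_state from /claim, it
// returns the stages still to run. Unknown states restart from the top
// (an assumption), which the server-side check then catches if wrong.
func resumeFrom(currentState string) []string {
	for i, s := range stageOrder {
		if s == currentState {
			return stageOrder[i:]
		}
	}
	return stageOrder
}

func main() {
	run := &Run{State: StateCPUStress}
	req := httptest.NewRequest(http.MethodPost, "/result",
		strings.NewReader(`{"stage":"Inventory"}`))
	rec := httptest.NewRecorder()
	resultHandler(run).ServeHTTP(rec, req)
	fmt.Println(rec.Code, run.State, run.FailedStage)
	fmt.Println(resumeFrom("CPUStress"))
}
```

Replaying the Orion scenario (an Inventory result while the run sits in CPUStress), the sketch answers 409 and leaves the run in FailedHolding with the label "Inventory (expected CPUStress)", while a re-claiming agent that honors current_state would resume at CPUStress rather than Inventory.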

Tests:
- cpustress_test.go — parseMemAvailable parses real meminfo, errors on
  missing/malformed; cap calc hits floor on tiny boxes, uses 1.5 GiB
  headroom on normal/huge boxes.
- statemachine_test.go — TriggerStageMismatch lands at FailedHolding
  from every stage state and is rejected from pre-stage/terminal
  states; StageNameForState round-trips the stageStates map.
- agent_handlers_test.go — TestResult_RejectsMismatchedStage proves
  the Orion scenario now 409s + FailedHolding; TestResult_AcceptsMatchingStage
  proves the guard doesn't break the happy path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-04-18 17:29:13 -04:00

Path              Last updated               Last commit
.gitea/workflows  2026-04-18 14:14:08 -04:00  live-image: pack full rootfs as initrd so PXE actually boots userspace
agent             2026-04-18 17:29:13 -04:00  cpustress+orchestrator: serial CPU/RAM passes + silent-skip guard
cmd               2026-04-18 16:36:13 -04:00  ui: stream host-detail fragments over SSE so the page updates live
deploy            2026-04-18 15:04:26 -04:00  deploy: show speed + ETA in bundle-download progress meter
docs              2026-04-18 12:10:23 -04:00  install: stage pxe-setup.sh at /usr/local/sbin/vetting-pxe-setup
internal          2026-04-18 17:29:13 -04:00  cpustress+orchestrator: serial CPU/RAM passes + silent-skip guard
live-image        2026-04-18 16:39:28 -04:00  live-image: install stage tools and fail loudly if any are missing
test/e2e          2026-04-18 12:07:05 -04:00  docs+e2e: document proxy-DHCP topology; default e2e bridge to LAN
.gitattributes    2026-04-18 14:31:40 -04:00  live-image: real /init + verbose boot for first-boot diagnosis
.gitignore        2026-04-18 13:38:40 -04:00  live-image: fix firmware so i915 actually loads at boot
.golangci.yml     2026-04-17 21:32:10 -04:00  Initial commit: full Phases 1-6 implementation
go.mod            2026-04-17 21:32:10 -04:00  Initial commit: full Phases 1-6 implementation
go.sum            2026-04-18 02:38:13 -04:00  deps: add missing go.sum entry for golang.org/x/term v0.25.0
Makefile          2026-04-18 10:47:26 -04:00  live-image: generate initrd explicitly; fail release on missing files
README.md         2026-04-17 22:31:49 -04:00  Remove operator auth — trust the LAN

README.md

Vetting

Post-repair hardware validation pipeline for Proxmox cluster hosts. Register a host, click Start Vetting, and the orchestrator will PXE-boot it into a custom Linux live image and run it through a consistent battery of tests (CPU stress, RAM stress, SMART, disk I/O, network throughput, GPU, PSU telemetry). Pass → auto-shutdown + HTML report. Fail → pipeline halts, SSH drops in, notification fires.

Built for solo-operator home labs: one Go binary, SQLite + flat files, HTMX + SSE UI, bundled dnsmasq, optional ntfy / Discord / SMTP notifications.

Documentation

docs/operations.md — install + first run + troubleshooting
docs/architecture.md — packages, state machine, protocol
docs/test-suite.md — what each stage measures

Quick start (local, against QEMU)

    make all
    ./bin/vetting --config deploy/vetting.example.yaml
    # → http://localhost:8080

The UI has no built-in auth — bind to loopback or LAN only, or front the service with a reverse proxy (Caddy/nginx basic-auth) if you want a password. The agent↔orchestrator channel keeps its own bearer-token auth and is unaffected.
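
For readers unfamiliar with the pattern, a bearer-token check on agent-facing routes could look roughly like the sketch below. This is an assumption-laden illustration, not the orchestrator's actual code: validBearer, requireBearer, and the example token are hypothetical, and only the existence of bearer-token auth on the agent channel is taken from this README.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// validBearer (hypothetical) reports whether an Authorization header
// carries the expected token, using a constant-time comparison.
func validBearer(header, token string) bool {
	const prefix = "Bearer "
	if token == "" || !strings.HasPrefix(header, prefix) {
		return false
	}
	got := strings.TrimPrefix(header, prefix)
	return subtle.ConstantTimeCompare([]byte(got), []byte(token)) == 1
}

// requireBearer (hypothetical) wraps agent-facing handlers only;
// UI routes would be registered without it.
func requireBearer(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !validBearer(r.Header.Get("Authorization"), token) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	claim := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "claimed")
	})
	mux := http.NewServeMux()
	mux.Handle("/claim", requireBearer("example-token", claim))

	// First request has no Authorization header, second carries the token.
	req := httptest.NewRequest(http.MethodPost, "/claim", nil)
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, req)
	fmt.Println("without token:", rec.Code)

	req.Header.Set("Authorization", "Bearer example-token")
	rec = httptest.NewRecorder()
	mux.ServeHTTP(rec, req)
	fmt.Println("with token:", rec.Code)
}
```

Keeping the check in a wrapper like this is one way the agent channel can stay authenticated while the UI routes remain open to the LAN.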

For a full end-to-end QEMU walk-through (bridge setup, host registration, PXE boot), see docs/operations.md § First vetting run.

Production install (Proxmox LXC)

On a fresh Debian/Ubuntu LXC, as root:

    curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh | bash

That installs Go (if missing), clones the repo to /opt/vetting-src, builds vetting-linux-amd64, and hands off to deploy/install.sh — which lays down the binary, systemd unit, example config, and vetting service user. Then:

    # Edit /etc/vetting/vetting.yaml (server.bind + server.public_url)
    sudo systemctl enable --now vetting
    journalctl -fu vetting

Prefer to build yourself? The manual path:

    make orchestrator-linux
    scp -r bin deploy lxc:/opt/vetting/
    ssh lxc "cd /opt/vetting && sudo ./deploy/install.sh"
    ssh lxc "sudo systemctl enable --now vetting"

See docs/operations.md § Install for the full walkthrough.

Repository layout

    cmd/                  orchestrator + agent entrypoints
    internal/             core packages (see docs/architecture.md for the map)
    agent/                in-image agent logic (claim loop, stage dispatch, probes)
    live-image/           mkosi config for the PXE-bootable Debian live image
    deploy/               systemd unit + install.sh + example config
    docs/                 operator + developer docs
    test/e2e/             build-tag-gated QEMU + PXE full-stack test
    tools/                small CLI helpers

Development

make test — Go unit + smoke tests (cross-platform)
make vet — go vet on the whole module
make live-image — Linux-only; run under WSL from Windows
make e2e — requires Linux root + live image + running orchestrator
make run — build + launch the orchestrator with the example config

Windows hosts: everything except live-image and e2e works natively. The live image build calls mkosi which needs a real Linux userspace, so use WSL for those targets.

Status

All six phases in the original plan are implemented. The E2E QEMU harness is wired in test/e2e/qemu_test.go but requires a running orchestrator + registered host + queued run as preconditions — it's a developer-facing integration harness, not a unit test.

Languages

Go 81.1%, Shell 6.7%, templ 5.5%, CSS 3.8%, Go Template 1%, Other 1.9%