josh/Vetting

josh 017c3c38fe

feat(ui): 15-point UX overhaul — affordances, feedback, and navigation

Address friction points identified in a full interface audit:
- Re-add status badge to dashboard tiles so run state is visible at a glance
- Add active nav indicator and SSE connection health monitor (live/stale)
- Show manual registration form by default instead of hiding behind <details>
- Add copy-to-clipboard buttons on SSH hold command and quick-register one-liner
- Replace tooltip-only profile descriptions with inline visible text
- Clarify non-destructive toggle with explicit stage impact description
- Replace disabled "Start vetting" button with actionable offline guidance
- Swap browser confirm() dialogs for styled inline confirmations
- Add colored badge to spec diffs summary visible when collapsed
- Add distinct "cancelled" mood for cancelled runs (vs idle)
- Add match count to log search and aria-label for accessibility
- Add styled 404 page rendered inside the app shell

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-23 20:08:07 -04:00

.gitea/workflows

chore(release): add registry auth diagnostic to build-live-image

2026-04-20 21:27:23 -04:00

agent

fix(inventory): read GPU model from device field, not vendor field

2026-04-19 22:53:42 -04:00

cmd

docs: comprehensive documentation expansion

2026-04-23 18:37:26 -04:00

deploy

feat(install): polish install UX with banner, spinner, progress bar, summary

2026-04-20 22:29:44 -04:00

docs

docs: comprehensive documentation expansion

2026-04-23 18:37:26 -04:00

internal

feat(ui): 15-point UX overhaul — affordances, feedback, and navigation

2026-04-23 20:08:07 -04:00

live-image

bump live-image

2026-04-20 21:31:09 -04:00

test/e2e

docs+e2e: document proxy-DHCP topology; default e2e bridge to LAN

2026-04-18 12:07:05 -04:00

.gitattributes

live-image: real /init + verbose boot for first-boot diagnosis

2026-04-18 14:31:40 -04:00

.gitignore

live-image: fix firmware so i915 actually loads at boot

2026-04-18 13:38:40 -04:00

.golangci.yml

Initial commit: full Phases 1-6 implementation

2026-04-17 21:32:10 -04:00

go.mod

Initial commit: full Phases 1-6 implementation

2026-04-17 21:32:10 -04:00

go.sum

deps: add missing go.sum entry for golang.org/x/term v0.25.0

2026-04-18 02:38:13 -04:00

Makefile

feat(release): version live-image, skip rebuild+redownload when unchanged

2026-04-20 21:04:14 -04:00

README.md

docs: comprehensive documentation expansion

2026-04-23 18:37:26 -04:00

README.md

Vetting

Post-repair hardware validation pipeline for Proxmox cluster hosts. Register a host, click Start Vetting, and the orchestrator PXE-boots it into a custom Linux live image and runs it through a consistent battery of tests (CPU stress, RAM stress, SMART, disk I/O, network throughput, GPU, PSU telemetry). Pass → auto-shutdown + HTML report. Fail → pipeline halts, SSH access opens for triage, notification fires.

Built for solo-operator home labs: one Go binary, SQLite + flat files, HTMX + SSE UI, bundled dnsmasq, optional ntfy / Discord / SMTP notifications.

Features

Automated PXE boot: dnsmasq proxy-DHCP serves a disposable Debian live image to registered MACs. No VLAN or dedicated bridge is required.
11-stage validation pipeline: Inventory, Firmware, SpecValidate, SMART, CPUStress, Storage, Network, Burn, GPU, PSU, Reporting.
Three vetting profiles: quick (~10 min), deep (~8-12 h), soak (~36-40 h). Same probes and gates; only durations scale.
Server-side threshold engine: per-run rules evaluate every sensor batch in real time. Critical breaches (thermal runaway, EDAC UE, voltage sag) fail the run immediately.
FailedHolding with SSH: when a stage fails, the pipeline parks the host and issues a one-time SSH key so you can triage in the live image.
Real-time dashboard: HTMX + SSE push tile updates, stage progress, sub-step detail, and live log tailing to the browser.
Pluggable notifications: ntfy, Discord webhooks, and SMTP with severity-routed delivery.
Non-destructive mode: skip badblocks + wipe for hosts with data you want to keep.
Host-mode agent: a persistent reporter that heartbeats from installed hosts and reboots into the live image on command.
Self-contained HTML reports: offline-viewable summaries with inlined CSS; machine-readable JSON alongside.
Four-layer safety gates: a MAC allowlist, signed run token, wipe probe, and device allowlist protect against accidental disk wipes.
Janitor: automatic retention-based cleanup of artifact and log files.
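
To make the threshold engine concrete, here is a minimal sketch of how per-run rules might be applied to a sensor batch. It is not taken from the repository: the `Rule`, `Reading`, `Breach`, and `Evaluate` names, the sensor identifiers, and the limits are all assumptions chosen for illustration.

```go
package main

import "fmt"

// Severity ranks how serious a threshold breach is (hypothetical type).
type Severity int

const (
	Warning Severity = iota
	Critical
)

// Reading is one sample in a sensor batch. Field names are illustrative.
type Reading struct {
	Sensor string
	Value  float64
}

// Rule flags a breach when a matching sensor's value leaves [Min, Max].
type Rule struct {
	Sensor   string
	Min, Max float64
	Severity Severity
}

// Breach records which rule a reading violated.
type Breach struct {
	Rule    Rule
	Reading Reading
}

// Evaluate checks every reading in a batch against every rule. It returns
// all breaches and reports whether any of them is critical, in which case
// the caller would fail the run immediately.
func Evaluate(rules []Rule, batch []Reading) (breaches []Breach, failRun bool) {
	for _, rd := range batch {
		for _, r := range rules {
			if r.Sensor != rd.Sensor {
				continue
			}
			if rd.Value < r.Min || rd.Value > r.Max {
				breaches = append(breaches, Breach{Rule: r, Reading: rd})
				if r.Severity == Critical {
					failRun = true
				}
			}
		}
	}
	return breaches, failRun
}

func main() {
	// Assumed example limits; real values would come from the run's profile.
	rules := []Rule{
		{Sensor: "cpu_temp_c", Min: 0, Max: 95, Severity: Critical},
		{Sensor: "fan_rpm", Min: 500, Max: 20000, Severity: Warning},
	}
	batch := []Reading{
		{Sensor: "cpu_temp_c", Value: 101},
		{Sensor: "fan_rpm", Value: 1200},
	}
	breaches, failRun := Evaluate(rules, batch)
	fmt.Println(len(breaches), failRun)
}
```

In this sketch a warning-level breach is recorded without stopping the run, while a single critical breach sets `failRun`. How the real engine routes warnings is not described here, so treat that split as a guess.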

How it works

1. Install the host-mode agent on each node (one-liner from the dashboard's quick-register script).
2. Register the host in the web UI: name, MAC, expected hardware spec (YAML).
3. Click Start Vetting and choose a profile (quick / deep / soak).
4. The host-mode agent receives a reboot_for_vetting heartbeat command and reboots into PXE.
5. dnsmasq serves the iPXE script; the host boots a disposable Linux live image containing the vetting agent.
6. The agent claims the run (token auth), then walks through each stage, posting logs, sensor readings, and results back to the orchestrator.
7. Thresholds are evaluated server-side on every sensor batch.
8. Pass: auto-reboot to local disk, HTML report generated, notification fires.
9. Fail: pipeline parks in FailedHolding, SSH key issued, notification fires. Operator triages and retries or releases.

Documentation

docs/operations.md — install, first run, troubleshooting
docs/architecture.md — packages, state machine, protocol, safety model
docs/test-suite.md — what each stage measures
docs/configuration.md — every YAML config knob, profiles, thresholds
docs/api-reference.md — HTTP API with request/response schemas, SSE events
docs/database.md — SQLite schema, tables, entity relationships
docs/development.md — dev setup, building, testing, adding stages

Quick start (local, against QEMU)

make all
./bin/vetting --config deploy/vetting.example.yaml
# → http://localhost:8080

The UI has no built-in auth — bind to loopback or LAN only, or front the service with a reverse proxy (Caddy/nginx basic-auth) if you want a password. The agent↔orchestrator channel keeps its own bearer-token auth and is unaffected.

For a full end-to-end QEMU walk-through (bridge setup, host registration, PXE boot), see docs/operations.md § First vetting run.

Production install (Proxmox LXC)

On a fresh Debian/Ubuntu LXC, as root:

curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh | bash

That installs Go (if missing), clones the repo to /opt/vetting-src, builds vetting-linux-amd64, and hands off to deploy/install.sh — which lays down the binary, systemd unit, example config, and vetting service user. Then:

# Edit /etc/vetting/vetting.yaml (server.bind + server.public_url)
sudo systemctl enable --now vetting
journalctl -fu vetting

Prefer to build yourself? The manual path:

make orchestrator-linux
scp -r bin deploy lxc:/opt/vetting/
ssh lxc "cd /opt/vetting && sudo ./deploy/install.sh"
ssh lxc "sudo systemctl enable --now vetting"

See docs/operations.md § Install for the full walkthrough.

Repository layout

cmd/                  orchestrator + agent entrypoints
internal/             core packages (see docs/architecture.md for the map)
agent/                in-image agent logic (claim loop, stage dispatch, probes)
live-image/           mkosi config for the PXE-bootable Debian live image
deploy/               systemd unit + install.sh + example config
docs/                 operator + developer docs
test/e2e/             build-tag-gated QEMU + PXE full-stack test
tools/                small CLI helpers

Development

make test — Go unit + smoke tests (cross-platform)
make vet — go vet on the whole module
make live-image — Linux-only; run under WSL from Windows
make e2e — requires Linux root + live image + running orchestrator
make run — build + launch the orchestrator with the example config

Windows hosts: everything except live-image and e2e works natively. The live image build calls mkosi, which needs a real Linux userspace, so use WSL for those targets.

Status

All six phases in the original plan are implemented. The E2E QEMU harness is wired in test/e2e/qemu_test.go but requires a running orchestrator + registered host + queued run as preconditions — it's a developer-facing integration harness, not a unit test.

Languages

Go 81.1%, Shell 6.7%, templ 5.5%, CSS 3.8%, Go Template 1%, Other 1.9%