# Vetting

Post-repair hardware validation pipeline for Proxmox cluster hosts.
Register a host, click **Start Vetting**, and the orchestrator will
PXE-boot it into a custom Linux live image and run it through a
consistent battery of tests (CPU stress, RAM stress, SMART, disk I/O,
network throughput, GPU, PSU telemetry). Pass → auto-shutdown + HTML
report. Fail → pipeline halts, SSH drops in, notification fires.

Built for solo-operator home labs: one Go binary, SQLite + flat files,
HTMX + SSE UI, bundled dnsmasq, optional ntfy / Discord / SMTP
notifications.

## Features

- **Automated PXE boot**: dnsmasq proxy-DHCP serves a disposable
  Debian live image to registered MACs. No VLAN, no dedicated bridge.
- **11-stage validation pipeline**: Inventory, Firmware, SpecValidate,
  SMART, CPUStress, Storage, Network, Burn, GPU, PSU, Reporting.
- **Three vetting profiles**: quick (~10 min), deep (~8-12 h),
  soak (~36-40 h). Same probes and gates; only durations scale.
- **Server-side threshold engine**: per-run rules evaluate every
  sensor batch in real time. Critical breaches (thermal runaway,
  EDAC UE, voltage sag) fail the run immediately.
- **FailedHolding with SSH**: when a stage fails, the pipeline parks
  the host and issues a one-time SSH key so you can triage in the
  live image.
- **Real-time dashboard**: HTMX + SSE push tile updates, stage
  progress, sub-step detail, and live log tailing to the browser.
- **Pluggable notifications**: ntfy, Discord webhooks, and SMTP with
  severity-routed delivery.
- **Non-destructive mode**: skip badblocks + wipe for hosts with
  data you want to keep.
- **Host-mode agent**: a persistent reporter that heartbeats from
  installed hosts and reboots into the live image on command.
- **Self-contained HTML reports**: offline-viewable summaries with
  inlined CSS; machine-readable JSON alongside.
- **Four-layer safety gates**: MAC allowlist, signed run token,
  wipe probe, and device allowlist guard against accidental disk wipes.
- **Janitor**: automatic retention-based cleanup of artifact files
  and log files.

## How it works

1. Install the host-mode agent on each node (one-liner from the
   dashboard's quick-register script).
2. Register the host in the web UI: name, MAC, expected hardware
   spec (YAML).
3. Click **Start Vetting** and choose a profile (quick / deep / soak).
4. The host-mode agent receives a `reboot_for_vetting` heartbeat
   command and reboots into PXE.
5. dnsmasq serves the iPXE script; the host boots a disposable Linux
   live image containing the vetting agent.
6. The agent claims the run (token auth), then walks through each
   stage, posting logs, sensor readings, and results back to the
   orchestrator.
7. Thresholds are evaluated server-side on every sensor batch.
8. **Pass**: auto-reboot to local disk, HTML report generated,
   notification fires.
9. **Fail**: pipeline parks in FailedHolding, SSH key issued,
   notification fires. Operator triages and retries or releases.

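The server-side threshold check in step 7 can be pictured with a minimal sketch like the one below. Everything in it is hypothetical: the `Reading`, `Rule`, and `Breach` types, the `evaluate` function, the sensor names, and the limits are invented for illustration, and only upper-bound rules are modelled (a voltage-sag rule would need a lower bound). The actual rule format is covered in docs/configuration.md.

```go
package main

import "fmt"

// Reading is one sensor sample sent by the agent (hypothetical shape).
type Reading struct {
	Sensor string
	Value  float64
}

// Rule is a hypothetical upper-bound threshold for one sensor.
type Rule struct {
	Sensor   string
	Max      float64
	Critical bool // a critical breach fails the run immediately
}

// Breach records a reading that exceeded its rule.
type Breach struct {
	Rule    Rule
	Reading Reading
}

// evaluate checks every reading in a batch against the rules for its sensor.
// It returns all breaches and whether any of them was critical.
func evaluate(rules []Rule, batch []Reading) (breaches []Breach, failRun bool) {
	for _, rd := range batch {
		for _, r := range rules {
			if r.Sensor == rd.Sensor && rd.Value > r.Max {
				breaches = append(breaches, Breach{Rule: r, Reading: rd})
				failRun = failRun || r.Critical
			}
		}
	}
	return breaches, failRun
}

func main() {
	// Illustrative limits only; not the project's defaults.
	rules := []Rule{
		{Sensor: "cpu_temp_c", Max: 95, Critical: true},
		{Sensor: "fan_rpm", Max: 12000, Critical: false},
	}
	batch := []Reading{
		{Sensor: "cpu_temp_c", Value: 97.5},
		{Sensor: "fan_rpm", Value: 8000},
	}
	breaches, failRun := evaluate(rules, batch)
	fmt.Println(len(breaches), failRun) // prints "1 true"
}
```

Keeping the evaluation on the orchestrator means a wedged or overheating host cannot suppress its own failure: the run fails as soon as the offending batch arrives.
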
## Documentation

- [docs/operations.md](docs/operations.md): install, first run,
  troubleshooting
- [docs/architecture.md](docs/architecture.md): packages, state
  machine, protocol, safety model
- [docs/test-suite.md](docs/test-suite.md): what each stage measures
- [docs/configuration.md](docs/configuration.md): every YAML config
  knob, profiles, thresholds
- [docs/api-reference.md](docs/api-reference.md): HTTP API with
  request/response schemas, SSE events
- [docs/database.md](docs/database.md): SQLite schema, tables,
  entity relationships
- [docs/development.md](docs/development.md): dev setup, building,
  testing, adding stages

## Quick start (local, against QEMU)

```bash
make all
./bin/vetting --config deploy/vetting.example.yaml
# → http://localhost:8080
```

The UI has no built-in auth, so bind to loopback or LAN only, or front
the service with a reverse proxy (Caddy/nginx basic-auth) if you
want a password. The agent↔orchestrator channel keeps its own
bearer-token auth and is unaffected.

For a full end-to-end QEMU walk-through (bridge setup, host registration,
PXE boot), see [docs/operations.md § First vetting run](docs/operations.md#first-vetting-run).

## Production install (Proxmox LXC)

On a fresh Debian/Ubuntu LXC, as root:

```bash
curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh | bash
```

That installs Go (if missing), clones the repo to `/opt/vetting-src`,
builds `vetting-linux-amd64`, and hands off to `deploy/install.sh`,
which lays down the binary, systemd unit, example config, and
`vetting` service user. Then:

```bash
# Edit /etc/vetting/vetting.yaml (server.bind + server.public_url)
sudo systemctl enable --now vetting
journalctl -fu vetting
```

Prefer to build yourself? The manual path:

```bash
make orchestrator-linux
scp -r bin deploy lxc:/opt/vetting/
ssh lxc "cd /opt/vetting && sudo ./deploy/install.sh"
ssh lxc "sudo systemctl enable --now vetting"
```

See [docs/operations.md § Install](docs/operations.md#install-proxmox-lxc)
for the full walkthrough.

## Repository layout

```
cmd/          orchestrator + agent entrypoints
internal/     core packages (see docs/architecture.md for the map)
agent/        in-image agent logic (claim loop, stage dispatch, probes)
live-image/   mkosi config for the PXE-bootable Debian live image
deploy/       systemd unit + install.sh + example config
docs/         operator + developer docs
test/e2e/     build-tag-gated QEMU + PXE full-stack test
tools/        small CLI helpers
```

## Development

- `make test`: Go unit + smoke tests (cross-platform)
- `make vet`: `go vet` on the whole module
- `make live-image`: Linux-only; run under WSL from Windows
- `make e2e`: requires Linux root + live image + running orchestrator
- `make run`: build + launch the orchestrator with the example config

Windows hosts: everything except `live-image` and `e2e` works natively.
The live image build calls `mkosi`, which needs a real Linux userspace,
so use WSL for those targets.

## Status

All six phases in the original plan are implemented. The E2E QEMU
harness is wired in `test/e2e/qemu_test.go` but requires a running
orchestrator + registered host + queued run as preconditions; it's a
developer-facing integration harness, not a unit test.