Post-repair hardware validation pipeline for Proxmox cluster hosts. Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.
6.3 KiB
Operations
Operator-facing runbook for the vetting orchestrator. If you're looking for the "what does the system do" overview, see architecture.md. For what each test stage actually measures, see test-suite.md.
Install (Proxmox LXC)
Target: a Debian/Ubuntu LXC on the Proxmox host that holds the cluster you're vetting for. The LXC must be on the same L2 segment as the repaired nodes so DHCP and WoL work.
-
On your workstation, cross-build the binary:
make orchestrator-linuxThis produces
bin/vetting-linux-amd64. -
Copy the repo tree (or just
bin/,deploy/) into the LXC, then from inside the LXC:sudo ./deploy/install.shThe installer:
apt installsdnsmasq,iperf3,ca-certificates- creates the
vettingsystem user (home =/var/lib/vetting) - installs the binary into
/usr/local/bin/vetting - drops
vetting.example.yamlinto/etc/vetting/vetting.yaml(only if there's no existing config — existing configs are preserved) - drops
/etc/systemd/system/vetting.service - disables the distro-default dnsmasq (the orchestrator supervises its own)
The installer does not enable the service, because the default config has a placeholder bcrypt password that the binary refuses to start with.
-
Generate an admin password hash and a session secret, then edit
/etc/vetting/vetting.yaml:./bin/gen-admin-password 'your-password-here' # prints a bcrypt hash openssl rand -hex 32 # prints a 64-char hex stringRequired fields:
auth.admin_password_bcrypt— the bcrypt hashauth.session_secret_hex— the 32-byte hex stringserver.public_url— the URL your browser hits the LXC on (e.g.https://vetting.lan:8443). This is used as the click-through link in notifications, so it must be the external URL, not the bind address.
-
(Optional) Configure notifiers in the same file — see the commented-out example block for ntfy / Discord / SMTP.
-
Enable and start:
sudo systemctl enable --now vetting sudo journalctl -fu vetting
First vetting run
Against a QEMU VM first, before you point it at real hardware:
-
On the Proxmox host (or wherever your LXC lives):
sudo ip link add br-vetting type bridge sudo ip addr add 10.77.0.1/24 dev br-vetting sudo ip link set br-vetting up -
In the UI at
https://<lxc>:8443, log in and register a host:- Name:
qemu-test - MAC:
52:54:00:12:34:56 - WoL broadcast IP:
10.77.0.255 - Expected spec: paste a minimal YAML like
memory: { total_gib: 4 } cpu: { logical_cores: 4 }
- Name:
-
Click Start Vetting. The UI tile will sit at
Queued → WaitingWoL. -
Launch the QEMU VM on the bridge so it PXE-boots from dnsmasq:
sudo qemu-system-x86_64 \ -enable-kvm -cpu host -smp 4 -m 4096 \ -netdev bridge,id=n0,br=br-vetting \ -device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \ -drive file=/tmp/test-disk.img,format=raw,if=virtio \ -boot n -serial mon:stdio -display none -
Watch the tile advance through stages. On success, the tile shows View report and the VM auto-shuts-down.
For real repaired hardware: same flow, but register the node's actual
MAC + expected spec, and make sure the node's BIOS is set to PXE-boot
from the NIC that's on the br-vetting network.
A failed run — SSH to the held host
When a stage fails, the pipeline halts at FailedHolding and the
agent installs an orchestrator-issued SSH key into the live-image's
/root/.ssh/authorized_keys. The UI tile surfaces the IP and the
exact ssh command.
The hold key is per-run. Once you're done:
- Power the host off (
powerofffrom the SSH session). - In the UI, click Override wipe-probe only when the failure was
at the
Storagestage and you're sure the disks are expendable. Otherwise click Start vetting on a fresh run from the host dashboard after fixing the underlying issue.
Log + artifact layout
/var/lib/vetting/
vetting.db # SQLite: hosts, runs, stages, artifacts, spec_diffs, measurements
artifacts/
run-<N>/
report.html # operator-facing summary
report.json # machine-readable summary
inventory.json # raw probe output
fio-<disk>.log # storage stage output
iperf-<nic>.json # network stage output
hold-<N>.pub # per-run SSH pubkey (only if held)
/var/log/vetting/
run-<N>.log # append-only per-run log tail
Retention is governed by the artifacts.retention_days and
logs.retention_days settings. DB rows (run history) are preserved
indefinitely; only on-disk files get pruned.
Troubleshooting
| Symptom | First check |
|---|---|
Service refuses to start with auth.admin_password_bcrypt is the placeholder |
You didn't replace the bcrypt hash in the config. Run gen-admin-password. |
| PXE client gets no DHCP offer | journalctl -u vetting for dnsmasq errors; confirm the LXC has CAP_NET_ADMIN (the shipped systemd unit does); confirm the host MAC is actually registered (sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'). |
Agent /hello never fires |
Check the live image is actually loading the agent binary — SSH into the live env (use the hold key path), systemctl status vetting-agent. |
Tile stuck on Booting |
Most likely the live image booted but the agent can't reach the orchestrator. Verify vetting.orchestrator= in the kernel cmdline resolves from the host's network. |
| UI shows stale stage | Force a reload; the SSE reconnect is automatic but the browser keeps the last state on ephemeral network blips. |
| Notification didn't fire | journalctl -u vetting | grep notify: — delivery is fire-and-forget and the failure reason is logged but not persisted. |
Upgrading
make orchestrator-linuxon your workstation.scp bin/vetting-linux-amd64 lxc:/tmp/vetting.new- On the LXC:
sudo systemctl stop vetting sudo install -m 0755 /tmp/vetting.new /usr/local/bin/vetting sudo systemctl start vetting
The DB migration runs at startup and is append-only — no manual schema work unless a release's notes call it out.