Operations
Operator-facing runbook for the vetting orchestrator. If you're looking for the "what does the system do" overview, see architecture.md. For what each test stage actually measures, see test-suite.md.
Install (Proxmox LXC)
Target: a Debian/Ubuntu LXC on the Proxmox host that holds the cluster you're vetting for. The LXC must be on the same L2 segment as the repaired nodes so DHCP and WoL work.
One-shot release bundle (recommended)
On your dev workstation (Linux, or WSL on Windows):
make release
Produces bin/vetting-bundle-<sha>.tar.gz containing the orchestrator
binary, agent binary, live image (vmlinuz + initrd.img), install
scripts, vetting.service, the production yaml, and the pinned iPXE
SHA256 file.
Ship it to the LXC:
scp bin/vetting-bundle-<sha>.tar.gz lxc:/tmp/
ssh lxc 'cd /tmp && tar xzf vetting-bundle-*.tar.gz'
ssh lxc 'cd /tmp/vetting-bundle-<sha> && sudo ./install.sh'
install.sh does the base install (user, binaries, config, systemd
unit). If you don't need PXE (e.g. host-mode reporter only, no
automated live-boots), you can stop here — edit
/etc/vetting/vetting.yaml to tune server.bind / public_url,
then sudo systemctl enable --now vetting.
PXE enablement
PXE is gated behind a second script so non-PXE installs stay simple.
Prerequisite: dedicated PXE bridge on the Proxmox hypervisor. The LXC can't create bridges on its host, so do this once on the Proxmox node (not inside the LXC):
sudo ip link add br-vetting type bridge
sudo ip addr add 10.77.0.1/24 dev br-vetting
sudo ip link set br-vetting up
Attach a veth from the LXC onto br-vetting (e.g. eth1 inside the
LXC at 10.77.0.2/24). Repaired nodes PXE-boot from a NIC cabled or
bridged onto br-vetting only — keep this network isolated from your
household DHCP, or both DHCP servers will fight.
On the LXC, inside the extracted bundle:
sudo ./pxe-setup.sh \
--interface eth1 \
--dhcp-range 10.77.0.100,10.77.0.200,12h \
--orchestrator-url http://10.77.0.2:8080
The script:
- Fetches ipxe.efi + undionly.kpxe from boot.ipxe.org and verifies SHA256 against ipxe-shas.txt (fail-closed on mismatch).
- Places vmlinuz + initrd.img into /var/lib/vetting/live/.
- Rewrites the pxe: block of /etc/vetting/vetting.yaml to enable PXE with the flags you passed.
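The fail-closed check in the first bullet can be sketched roughly as follows. This is not the code from pxe-setup.sh: the function name is hypothetical, and it assumes the pin file uses the `sha256sum` text format (`<hex>  <name>` per line), which this runbook does not confirm.

```shell
# Hypothetical helper: fail-closed check of one file against a pin list.
# Assumes each pin line looks like "<hex-sha256>  <file-name>".
verify_pinned_sha() {
  local file="$1" pins="$2" name want got
  name="$(basename "$file")"
  # Look up the pinned hash by file name; a missing entry is a failure.
  want="$(awk -v n="$name" '$2 == n { print $1; exit }' "$pins")"
  if [ -z "$want" ]; then
    echo "no pinned SHA256 for $name" >&2
    return 1
  fi
  # An unreadable file yields an empty hash, which also fails the compare.
  got="$(sha256sum "$file" | awk '{ print $1 }')"
  if [ "$got" != "$want" ]; then
    echo "SHA256 mismatch for $name" >&2
    return 1
  fi
}
```

The point of the shape is that every path other than an exact match returns non-zero, so a caller such as `verify_pinned_sha ipxe.efi ipxe-shas.txt || exit 1` never proceeds with an unverified binary.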
It does not restart the service — review the rendered config, then:
sudo systemctl restart vetting
sudo journalctl -fu vetting
The orchestrator validates PXE preconditions at startup (interface
exists, iPXE binaries are on disk, dhcp_range parses) and exits
non-zero with a clear error if anything's wrong, instead of failing
silently when a host first PXE-boots.
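If you want to catch the same class of mistakes before restarting, a rough pre-flight along these lines mirrors the three checks named above. It is a sketch, not the orchestrator's validator: both function names are hypothetical, the iPXE directory is passed in because this runbook does not state where the binaries land, and the dhcp_range pattern only checks the `start,end,lease` shape rather than dnsmasq's full grammar.

```shell
# Rough shape check for "start,end,lease". Octets are not range-checked.
valid_dhcp_range() {
  printf '%s\n' "$1" | grep -Eq \
    '^([0-9]{1,3}\.){3}[0-9]{1,3},([0-9]{1,3}\.){3}[0-9]{1,3},[0-9]+[mhdw]?$'
}

# Hypothetical pre-flight: report every problem, then fail if any were found.
pxe_preflight() {
  local iface="$1" ipxe_dir="$2" range="$3" errs=0 f
  if [ ! -e "/sys/class/net/$iface" ]; then
    echo "interface not found: $iface" >&2; errs=$((errs + 1))
  fi
  for f in ipxe.efi undionly.kpxe; do
    if [ ! -f "$ipxe_dir/$f" ]; then
      echo "missing iPXE binary: $ipxe_dir/$f" >&2; errs=$((errs + 1))
    fi
  done
  if ! valid_dhcp_range "$range"; then
    echo "malformed dhcp_range: $range" >&2; errs=$((errs + 1))
  fi
  [ "$errs" -eq 0 ]
}
```

Collecting errors instead of stopping at the first one means a single run lists everything that needs fixing. The orchestrator's own startup validation remains the authoritative check.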
pxe-setup.sh is idempotent — safe to re-run. Pass --force to
overwrite a hand-edited pxe: block.
Manual install (no release tarball)
For dev-loop iteration on the LXC itself:
- On your workstation: make orchestrator-linux && make agent-linux
- Copy the repo tree (or just bin/ + deploy/) onto the LXC
- sudo ./deploy/install.sh → base install
- For PXE: wsl make live-image on your workstation, scp live-image/build/vmlinuz lxc:/tmp/ && scp live-image/build/initrd.img lxc:/tmp/, then run pxe-setup.sh --bundle-dir /tmp (or accept the default repo-tree detection when running from the repo root).
First vetting run
Against a QEMU VM first, before you point it at real hardware:
- Make sure the br-vetting bridge exists on the hypervisor (see above). From inside the LXC, confirm it's reachable on your PXE-side interface.
- In the UI at http://<lxc>:8080, register a host:
  - Name: qemu-test
  - MAC: 52:54:00:12:34:56
  - WoL broadcast IP: 10.77.0.255
  - Expected spec: paste a minimal YAML like
      memory: { total_gib: 4 }
      cpu: { logical_cores: 4 }
- Click Start Vetting. The UI tile will sit at Queued → WaitingReboot.
- Launch the QEMU VM on the bridge so it PXE-boots from dnsmasq:
    sudo qemu-system-x86_64 \
      -enable-kvm -cpu host -smp 4 -m 4096 \
      -netdev bridge,id=n0,br=br-vetting \
      -device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \
      -drive file=/tmp/test-disk.img,format=raw,if=virtio \
      -boot n -serial mon:stdio -display none
- Watch the tile advance through stages. On success, the tile shows View report and the VM auto-shuts-down.
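The QEMU command above expects /tmp/test-disk.img to exist already. One way to create it is a sparse raw file, as in this small sketch. The helper name and the 8G default are assumptions, not documented requirements.

```shell
# Hypothetical helper: create a sparse raw image for the throwaway QEMU VM.
# The 8G default is an arbitrary choice; pick whatever the test needs.
make_test_disk() {
  local path="${1:-/tmp/test-disk.img}" size="${2:-8G}"
  truncate -s "$size" "$path"
}
```

Run `make_test_disk` with no arguments before launching the VM. The file only consumes real disk space as the guest writes to it.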
For real repaired hardware: same flow, but register the node's actual
MAC + expected spec, and make sure the node's BIOS is set to PXE-boot
from the NIC that's on the br-vetting network.
A failed run — SSH to the held host
When a stage fails, the pipeline halts at FailedHolding and the
agent installs an orchestrator-issued SSH key into the live-image's
/root/.ssh/authorized_keys. The UI tile surfaces the IP and the
exact ssh command.
The hold key is per-run. Once you're done:
- Power the host off (poweroff from the SSH session).
- In the UI, click Override wipe-probe only when the failure was at the Storage stage and you're sure the disks are expendable. Otherwise click Start vetting on a fresh run from the host dashboard after fixing the underlying issue.
Log + artifact layout
/var/lib/vetting/
  vetting.db              # SQLite: hosts, runs, stages, artifacts, spec_diffs, measurements
  artifacts/
    run-<N>/
      report.html         # operator-facing summary
      report.json         # machine-readable summary
      inventory.json      # raw probe output
      fio-<disk>.log      # storage stage output
      iperf-<nic>.json    # network stage output
      hold-<N>.pub        # per-run SSH pubkey (only if held)
/var/log/vetting/
  run-<N>.log             # append-only per-run log tail
Retention is governed by the artifacts.retention_days and
logs.retention_days settings. DB rows (run history) are preserved
indefinitely; only on-disk files get pruned.
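Pruning itself is the orchestrator's job. To preview which files have aged past a given number of days, a dry-run listing like the one below can help. The function name is hypothetical, and it assumes age is judged by file mtime, which may not match the orchestrator's actual rule.

```shell
# Hypothetical dry-run: list files older than N days under a directory.
# Assumes age is judged by mtime; nothing is deleted.
list_prunable() {
  local dir="$1" days="$2"
  find "$dir" -type f -mtime +"$days" -print | sort
}
```

For example, `list_prunable /var/lib/vetting/artifacts 30` prints the artifact files last modified more than 30 days ago.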
Exposing outside the LAN
The orchestrator UI has no built-in auth. It's designed to live on a trusted home LAN and trust whatever reaches it. If you want to reach it from outside that LAN, don't expose the bind port directly — put it behind a reverse proxy (Caddy, nginx, Traefik) that terminates TLS and adds basic-auth or OIDC. The agent↔orchestrator bearer token auth is independent and keeps working either way.
Troubleshooting
| Symptom | First check |
|---|---|
| PXE client gets no DHCP offer | journalctl -u vetting for dnsmasq errors; confirm the LXC has CAP_NET_ADMIN (the shipped systemd unit does); confirm the host MAC is actually registered (sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'). |
| Agent /hello never fires | Check the live image is actually loading the agent binary: SSH into the live env (use the hold key path), systemctl status vetting-agent. |
| Tile stuck on Booting | Most likely the live image booted but the agent can't reach the orchestrator. Verify vetting.orchestrator= in the kernel cmdline resolves from the host's network. |
| UI shows stale stage | Force a reload; the SSE reconnect is automatic but the browser keeps the last state on ephemeral network blips. |
| Notification didn't fire | journalctl -u vetting \| grep notify: (delivery is fire-and-forget; the failure reason is logged but not persisted). |
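For the "Tile stuck on Booting" row, the value of vetting.orchestrator= can be pulled out of the kernel command line from inside the live environment with a helper like this one. The function name is hypothetical, and the sketch assumes the parameter is a plain space-separated key=value token.

```shell
# Print the value of the first key=value token on a kernel command line.
# Reads /proc/cmdline unless another file is given as the second argument.
cmdline_param() {
  local key="$1" file="${2:-/proc/cmdline}"
  tr ' ' '\n' < "$file" |
    awk -v k="$key=" 'index($0, k) == 1 { print substr($0, length(k) + 1); exit }'
}
```

Running `cmdline_param vetting.orchestrator` on the held host shows what the agent was told to contact. You can then test whether that address is reachable from the host's network.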
Upgrading
- make orchestrator-linux on your workstation.
- scp bin/vetting-linux-amd64 lxc:/tmp/vetting.new
- On the LXC:
    sudo systemctl stop vetting
    sudo install -m 0755 /tmp/vetting.new /usr/local/bin/vetting
    sudo systemctl start vetting
The DB migration runs at startup and is append-only — no manual schema work unless a release's notes call it out.
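If you want a quick way back to the previous binary, the install step can be wrapped so the old file is kept alongside the new one. This is an optional sketch, not part of the shipped scripts. The helper name and the .prev suffix are assumptions.

```shell
# Hypothetical helper: install a new binary, keeping the old one as <dest>.prev.
install_with_backup() {
  local new="$1" dest="$2"
  if [ -f "$dest" ]; then
    cp -p "$dest" "$dest.prev"
  fi
  install -m 0755 "$new" "$dest"
}
```

It would replace the plain install line between the stop and start commands, for example `sudo` running `install_with_backup /tmp/vetting.new /usr/local/bin/vetting` from a root shell. Because migrations are append-only, check the release notes before running an older binary against a migrated database.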