Rewrites the PXE section of the ops runbook around the new proxy-DHCP model (no dedicated bridge, coexists with UniFi/pfSense/etc.) and swaps the e2e test's default bridge + orchestrator URL to match. The e2e file now calls out the LAN-DHCP precondition in its header so future-me (or CI) doesn't hang at PXE wondering why nothing answers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9.3 KiB
Operations
Operator-facing runbook for the vetting orchestrator. If you're looking for the "what does the system do" overview, see architecture.md. For what each test stage actually measures, see test-suite.md.
Install (Proxmox LXC)
Target: a Debian/Ubuntu LXC on the Proxmox host that holds the cluster you're vetting for. The LXC must be on the same L2 segment as the repaired nodes so DHCP and WoL work.
One-liner install (recommended)
Every push to main kicks off a Gitea Actions run that builds a full
release bundle (orchestrator + agent + live image + install scripts +
pinned iPXE SHAs) and publishes it to the Gitea package registry. The
LXC installer fetches the prebuilt tarball — no source clone, no Go
toolchain, no make, no WSL.
On the LXC:
curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh \
| sudo bash
To pin a specific build instead of the rolling latest:
VETTING_VERSION=sha-abc1234 curl -fsSL .../proxmox-install.sh | sudo bash
proxmox-install.sh curls the bundle from
${REGISTRY_URL}/api/packages/${PACKAGE_OWNER}/generic/vetting/${VETTING_VERSION}/vetting-bundle.tar.gz,
extracts it, and hands off to the bundled install.sh for the base
install (user, binaries, config, systemd unit).
If you don't need PXE (e.g. host-mode reporter only, no automated
live-boots), you can stop here — edit /etc/vetting/vetting.yaml to
tune server.bind / public_url, then
sudo systemctl enable --now vetting.
Offline / air-gapped install
If the LXC can't reach the registry, build the tarball locally and
scp it across:
make release # on a Linux/WSL workstation
scp bin/vetting-bundle-<sha>.tar.gz lxc:/tmp/
ssh lxc 'cd /tmp && tar xzf vetting-bundle-*.tar.gz \
&& cd vetting-bundle-* && sudo ./install.sh'
Same bundle layout either way.
PXE enablement
PXE is gated behind a second script so non-PXE installs stay simple.
How it works on the network. dnsmasq runs in proxy-DHCP mode: it binds to the LXC's LAN interface and coexists with your existing DHCP server (UniFi, pfSense, Asus, etc.). The router still hands out LAN IPs the normal way; dnsmasq only answers the PXE options (boot server + filename) and only for MACs you've registered in the UI. A random laptop booting from network on the same LAN gets a LAN IP from the router and nothing from us — the MAC allowlist is the safety barrier.
That means no dedicated bridge, no VLAN, no cabling changes. The
LXC just needs an interface on the same L2 segment as the hosts
you're repairing — typically eth0 on the LAN bridge.
On the LXC, inside the extracted bundle:
sudo ./pxe-setup.sh \
--interface eth0 \
--subnet 192.168.1.0/24 \
--orchestrator-url http://<lxc-lan-ip>:8080
The script:
- Fetches
ipxe.efi+undionly.kpxefrom boot.ipxe.org and verifies SHA256 againstipxe-shas.txt(fail-closed on mismatch). - Places
vmlinuz+initrd.imginto/var/lib/vetting/live/. - Rewrites the
pxe:block of/etc/vetting/vetting.yamlto enable PXE with the flags you passed.
It does not restart the service — review the rendered config, then:
sudo systemctl restart vetting
sudo journalctl -fu vetting
The orchestrator validates PXE preconditions at startup (interface
exists, iPXE binaries are on disk, subnet parses as CIDR) and
exits non-zero with a clear error if anything's wrong, instead of
failing silently when a host first PXE-boots.
pxe-setup.sh is idempotent — safe to re-run. Pass --force to
overwrite a hand-edited pxe: block.
Router caveat. Most home/prosumer routers (UniFi, Asus, Netgear, etc.) don't send PXE options, so proxy mode just works. pfSense and OPNsense can serve PXE themselves — if yours does, disable its TFTP/netboot feature so there's only one PXE authority on the segment.
Dev-loop install (from a source checkout)
For iterating on the orchestrator without waiting for a CI publish:
- On your workstation:
make orchestrator-linux && make agent-linux - Copy the repo tree (or just
bin/+deploy/) onto the LXC sudo ./deploy/install.sh→ base install- For PXE:
wsl make live-imageon your workstation,scp live-image/build/vmlinuz lxc:/tmp/ && scp live-image/build/initrd.img lxc:/tmp/, then runpxe-setup.sh --bundle-dir /tmp(or accept the default repo-tree detection when running from the repo root).
First vetting run
Against a QEMU VM first, before you point it at real hardware:
-
In the UI at
http://<lxc-lan-ip>:8080, register a host:- Name:
qemu-test - MAC:
52:54:00:12:34:56 - WoL broadcast IP: your LAN broadcast, e.g.
192.168.1.255 - Expected spec: paste a minimal YAML like
memory: { total_gib: 4 } cpu: { logical_cores: 4 }
- Name:
-
Click Start Vetting. The UI tile will sit at
Queued → WaitingReboot. -
Launch the QEMU VM on the LAN bridge so it PXE-boots via the router's DHCP + our proxy-DHCP reply:
sudo qemu-system-x86_64 \ -enable-kvm -cpu host -smp 4 -m 4096 \ -netdev bridge,id=n0,br=vmbr0 \ -device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \ -drive file=/tmp/test-disk.img,format=raw,if=virtio \ -boot n -serial mon:stdio -display none(Swap
vmbr0for whatever your Proxmox LAN bridge is called.) -
Watch the tile advance through stages. On success, the tile shows View report and the VM auto-shuts-down.
For real repaired hardware: same flow, but register the node's actual LAN MAC + expected spec, and make sure the node's BIOS is set to PXE-boot from the NIC that's on the LAN.
A failed run — SSH to the held host
When a stage fails, the pipeline halts at FailedHolding and the
agent installs an orchestrator-issued SSH key into the live-image's
/root/.ssh/authorized_keys. The UI tile surfaces the IP and the
exact ssh command.
The hold key is per-run. Once you're done:
- Power the host off (
powerofffrom the SSH session). - In the UI, click Override wipe-probe only when the failure was
at the
Storagestage and you're sure the disks are expendable. Otherwise click Start vetting on a fresh run from the host dashboard after fixing the underlying issue.
Log + artifact layout
/var/lib/vetting/
vetting.db # SQLite: hosts, runs, stages, artifacts, spec_diffs, measurements
artifacts/
run-<N>/
report.html # operator-facing summary
report.json # machine-readable summary
inventory.json # raw probe output
fio-<disk>.log # storage stage output
iperf-<nic>.json # network stage output
hold-<N>.pub # per-run SSH pubkey (only if held)
/var/log/vetting/
run-<N>.log # append-only per-run log tail
Retention is governed by the artifacts.retention_days and
logs.retention_days settings. DB rows (run history) are preserved
indefinitely; only on-disk files get pruned.
Exposing outside the LAN
The orchestrator UI has no built-in auth. It's designed to live on a trusted home LAN and trust whatever reaches it. If you want to reach it from outside that LAN, don't expose the bind port directly — put it behind a reverse proxy (Caddy, nginx, Traefik) that terminates TLS and adds basic-auth or OIDC. The agent↔orchestrator bearer token auth is independent and keeps working either way.
Troubleshooting
| Symptom | First check |
|---|---|
| Host sits at PXE, no boot filename | Confirm the MAC is registered (sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'). If it is, sudo tcpdump -i <lan-iface> -n -e 'port 67 or port 68 or port 4011' while the host PXEs — if you see DISCOVER/OFFER from the router but no proxy reply from us, check journalctl -u vetting for dnsmasq errors. |
| PXE boots but iPXE can't fetch the script | Verify the LXC's LAN IP matches pxe.orchestrator_url in /etc/vetting/vetting.yaml — iPXE bakes that URL in at chainload. |
Agent /hello never fires |
Check the live image is actually loading the agent binary — SSH into the live env (use the hold key path), systemctl status vetting-agent. |
Tile stuck on Booting |
Most likely the live image booted but the agent can't reach the orchestrator. Verify vetting.orchestrator= in the kernel cmdline resolves from the host's network. |
| UI shows stale stage | Force a reload; the SSE reconnect is automatic but the browser keeps the last state on ephemeral network blips. |
| Notification didn't fire | journalctl -u vetting | grep notify: — delivery is fire-and-forget and the failure reason is logged but not persisted. |
Upgrading
Rerun the registry-fetch one-liner on the LXC:
curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh \
| sudo bash
That's it — install.sh auto-restarts vetting.service when it's
already enabled, and re-stages vmlinuz/initrd.img into
/var/lib/vetting/live/ so PXE-enabled LXCs come back up with the
fresh live image. Watch the logs with journalctl -fu vetting.
Pin to a specific build with VETTING_VERSION=sha-abc1234 if you
need to roll back or test a commit. The DB migration runs at startup
and is append-only — no manual schema work unless a release's notes
call it out.