josh f188c7add4
proxmox-install: fetch prebuilt bundle from Gitea package registry
Drops the per-install Go toolchain dance + source build. The installer
now just curls the bundle from
${REGISTRY_URL}/api/packages/${PACKAGE_OWNER}/generic/vetting/${VETTING_VERSION}/vetting-bundle.tar.gz,
extracts it, and hands off to the bundled install.sh with explicit
--binary / --agent-binary paths so the in-bundle layout is picked up.

Default version is `latest` (rolling alias, overwritten by release.yml
on each push to main). Pin via `VETTING_VERSION=sha-abc1234 curl ... |
bash` when rolling back or testing a specific commit.

Removes the `apt install build-essential git` + Go toolchain download
+ templ install + `make orchestrator-linux agent-linux` path — the CI
workflow already produced all of that. Install time on a cold LXC
drops from minutes to under a minute, and live-image kernel/initrd
now arrive with every install instead of requiring a separate WSL
build.

Also rewrites docs/operations.md's install section around the
one-liner, keeps the `make release` + scp path as the offline
fallback, and swaps the upgrade section to just "rerun the one-liner."

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 02:16:02 -04:00

Operations

Operator-facing runbook for the vetting orchestrator. If you're looking for the "what does the system do" overview, see architecture.md. For what each test stage actually measures, see test-suite.md.

Install (Proxmox LXC)

Target: a Debian/Ubuntu LXC on the Proxmox host that holds the cluster you're vetting nodes for. The LXC must be on the same L2 segment as the repaired nodes so that DHCP and Wake-on-LAN (WoL) broadcasts reach them.

Every push to main kicks off a Gitea Actions run that builds a full release bundle (orchestrator + agent + live image + install scripts + pinned iPXE SHAs) and publishes it to the Gitea package registry. The LXC installer fetches the prebuilt tarball — no source clone, no Go toolchain, no make, no WSL.

On the LXC:

curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh \
    | sudo bash

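To pin a specific build instead of the rolling latest, set VETTING_VERSION for the bash process that runs the installer. A variable prefixed to curl never reaches the script, and sudo resets the environment by default:

curl -fsSL .../proxmox-install.sh | sudo VETTING_VERSION=sha-abc1234 bash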

proxmox-install.sh curls the bundle from ${REGISTRY_URL}/api/packages/${PACKAGE_OWNER}/generic/vetting/${VETTING_VERSION}/vetting-bundle.tar.gz, extracts it, and hands off to the bundled install.sh for the base install (user, binaries, config, systemd unit).
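
As a rough illustration of how that URL is assembled, the sketch below builds it from the three variables and prints it without downloading anything. Only the latest default comes from this document. The REGISTRY_URL and PACKAGE_OWNER fallbacks are assumptions inferred from the one-liner's hostname and path, and bundle_url is a hypothetical helper, not a function taken from proxmox-install.sh.

```shell
#!/usr/bin/env bash
# Sketch only: the REGISTRY_URL and PACKAGE_OWNER fallbacks are assumptions;
# the `latest` default for VETTING_VERSION is the documented one.
bundle_url() {
    local registry="${REGISTRY_URL:-https://gitea.thewrightserver.net}"
    local owner="${PACKAGE_OWNER:-josh}"
    local version="${VETTING_VERSION:-latest}"
    # Strip one trailing slash from the registry so the path has no "//".
    printf '%s/api/packages/%s/generic/vetting/%s/vetting-bundle.tar.gz\n' \
        "${registry%/}" "$owner" "$version"
}

# Print the URL that would be fetched for the current environment.
bundle_url
```

Running it with VETTING_VERSION set to a pinned tag is a quick way to confirm which artifact a given install would pull before piping anything into bash.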

If you don't need PXE (e.g. host-mode reporter only, no automated live-boots), you can stop here — edit /etc/vetting/vetting.yaml to tune server.bind / public_url, then sudo systemctl enable --now vetting.

Offline / air-gapped install

If the LXC can't reach the registry, build the tarball locally and scp it across:

make release                                 # on a Linux/WSL workstation
scp bin/vetting-bundle-<sha>.tar.gz lxc:/tmp/
ssh lxc 'cd /tmp && tar xzf vetting-bundle-*.tar.gz \
    && cd vetting-bundle-*/ && sudo ./install.sh'

Same bundle layout either way.

PXE enablement

PXE is gated behind a second script so non-PXE installs stay simple.

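Prerequisite: a dedicated PXE bridge on the Proxmox hypervisor. The LXC can't create bridges on its host, so run this on the Proxmox node itself (not inside the LXC). A bridge created with ip does not survive a reboot, so either rerun these commands after the node restarts or define br-vetting in the node's persistent network configuration: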

sudo ip link add br-vetting type bridge
sudo ip addr add 10.77.0.1/24 dev br-vetting
sudo ip link set br-vetting up

Attach a veth from the LXC onto br-vetting (e.g. eth1 inside the LXC at 10.77.0.2/24). Repaired nodes PXE-boot from a NIC cabled or bridged onto br-vetting only — keep this network isolated from your household DHCP, or both DHCP servers will fight.

On the LXC, inside the extracted bundle:

sudo ./pxe-setup.sh \
    --interface eth1 \
    --dhcp-range 10.77.0.100,10.77.0.200,12h \
    --orchestrator-url http://10.77.0.2:8080

The script:

  • Fetches ipxe.efi + undionly.kpxe from boot.ipxe.org and verifies SHA256 against ipxe-shas.txt (fail-closed on mismatch).
  • Places vmlinuz + initrd.img into /var/lib/vetting/live/.
  • Rewrites the pxe: block of /etc/vetting/vetting.yaml to enable PXE with the flags you passed.

It does not restart the service — review the rendered config, then:

sudo systemctl restart vetting
sudo journalctl -fu vetting

The orchestrator validates PXE preconditions at startup (interface exists, iPXE binaries are on disk, dhcp_range parses) and exits non-zero with a clear error if anything's wrong, instead of failing silently when a host first PXE-boots.
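
If you would rather catch a malformed range before restarting, a check along these lines could be run by hand. It is a minimal sketch, not the orchestrator's actual validation. valid_ipv4 and valid_dhcp_range are hypothetical helpers, and the sketch only accepts the three-field START,END,LEASE form used in this document (dnsmasq itself accepts more variants, and the sketch does not check that START sorts before END).

```shell
#!/usr/bin/env bash
# Hypothetical operator-side check; the orchestrator's own validation may differ.

# Succeeds when $1 is a dotted-quad IPv4 address with every octet <= 255.
valid_ipv4() {
    local ip="$1" octet
    [[ "$ip" =~ ^[0-9]{1,3}(\.[0-9]{1,3}){3}$ ]] || return 1
    local IFS=.
    for octet in $ip; do
        # 10# forces base 10 so octets with leading zeros are not read as octal.
        (( 10#$octet <= 255 )) || return 1
    done
}

# Succeeds when $1 looks like START_IP,END_IP,LEASE,
# e.g. 10.77.0.100,10.77.0.200,12h. The lease field is required here.
valid_dhcp_range() {
    local start end lease extra
    IFS=, read -r start end lease extra <<< "$1"
    [[ -z "$extra" ]] || return 1
    valid_ipv4 "$start" && valid_ipv4 "$end" || return 1
    # Lease: plain seconds, a number with an m/h/d/w suffix, or "infinite".
    [[ "$lease" =~ ^([0-9]+[mhdw]?|infinite)$ ]]
}

if valid_dhcp_range "10.77.0.100,10.77.0.200,12h"; then
    echo "dhcp_range looks well-formed"
fi
```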

pxe-setup.sh is idempotent — safe to re-run. Pass --force to overwrite a hand-edited pxe: block.
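
The fail-closed SHA256 check on the iPXE binaries can be pictured roughly as follows. This is a sketch under assumptions, not the contents of pxe-setup.sh: it assumes ipxe-shas.txt uses the sha256sum output format (digest, whitespace, filename), and verify_pinned is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch of a fail-closed checksum gate. Assumes ipxe-shas.txt uses the
# sha256sum output format ("<hex digest>  <filename>"); the real file layout
# and the internals of pxe-setup.sh may differ.
verify_pinned() {
    local file="$1" sums="$2" name expected actual
    name="$(basename "$file")"
    # Look up the pinned digest for this filename (text- or binary-mode entry).
    expected="$(awk -v n="$name" '$2 == n || $2 == ("*" n) { print $1; exit }' "$sums")"
    if [[ -z "$expected" ]]; then
        echo "no pinned SHA256 for $name; refusing to continue" >&2
        return 1
    fi
    actual="$(sha256sum "$file" | awk '{ print $1 }')"
    # Compare case-insensitively; any difference (or a missing file) fails.
    if [[ "${actual,,}" != "${expected,,}" ]]; then
        echo "SHA256 mismatch for $name; refusing to continue" >&2
        return 1
    fi
}

# Example (paths are illustrative):
#   verify_pinned ./ipxe.efi ./ipxe-shas.txt || exit 1
```

Every failure path returns non-zero, including a missing pin or an unreadable file, so a caller that stops on error never installs an unverified binary.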

Dev-loop install (from a source checkout)

For iterating on the orchestrator without waiting for a CI publish:

  1. On your workstation: make orchestrator-linux && make agent-linux
  2. Copy the repo tree (or just bin/ + deploy/) onto the LXC
  3. sudo ./deploy/install.sh → base install
  4. For PXE: wsl make live-image on your workstation, scp live-image/build/vmlinuz lxc:/tmp/ && scp live-image/build/initrd.img lxc:/tmp/, then run pxe-setup.sh --bundle-dir /tmp (or accept the default repo-tree detection when running from the repo root).

First vetting run

Against a QEMU VM first, before you point it at real hardware:

  1. Make sure the br-vetting bridge exists on the hypervisor (see above). From inside the LXC, confirm the bridge address (10.77.0.1 in the example above) answers a ping over your PXE-side interface.

  2. In the UI at http://<lxc>:8080, register a host:

    • Name: qemu-test
    • MAC: 52:54:00:12:34:56
    • WoL broadcast IP: 10.77.0.255
    • Expected spec: paste a minimal YAML like
      memory: { total_gib: 4 }
      cpu: { logical_cores: 4 }
      
  3. Click Start Vetting. The UI tile will sit at Queued → WaitingReboot.

  4. Launch the QEMU VM on the bridge so it PXE-boots from dnsmasq:

    sudo qemu-system-x86_64 \
      -enable-kvm -cpu host -smp 4 -m 4096 \
      -netdev bridge,id=n0,br=br-vetting \
      -device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \
      -drive file=/tmp/test-disk.img,format=raw,if=virtio \
      -boot n -serial mon:stdio -display none
    
  5. Watch the tile advance through stages. On success, the tile shows View report and the VM auto-shuts-down.
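
The QEMU command in step 4 reads /tmp/test-disk.img, which has to exist before the VM starts. A minimal sketch for creating it as a sparse raw file, assuming GNU coreutils' truncate is available. make_test_disk is a hypothetical helper, and the 8G default is an arbitrary choice rather than a documented requirement.

```shell
#!/usr/bin/env bash
# Sketch: create the sparse raw disk that the QEMU command expects.
# The 8G default is an arbitrary assumption, not a documented requirement.
make_test_disk() {
    local path="${1:-/tmp/test-disk.img}" size="${2:-8G}"
    if [[ -e "$path" ]]; then
        echo "$path already exists; leaving it untouched" >&2
        return 0
    fi
    # Sparse file: no blocks are allocated until the guest writes to them.
    truncate -s "$size" "$path"
}

# Example: make_test_disk            # creates /tmp/test-disk.img at 8G
#          make_test_disk /tmp/big.img 32G
```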

For real repaired hardware: same flow, but register the node's actual MAC + expected spec, and make sure the node's BIOS is set to PXE-boot from the NIC that's on the br-vetting network.

A failed run — SSH to the held host

When a stage fails, the pipeline halts at FailedHolding and the agent installs an orchestrator-issued SSH key into the live-image's /root/.ssh/authorized_keys. The UI tile surfaces the IP and the exact ssh command.

The hold key is per-run. Once you're done:

  1. Power the host off (poweroff from the SSH session).
  2. In the UI, click Override wipe-probe only when the failure was at the Storage stage and you're sure the disks are expendable. Otherwise, fix the underlying issue first, then click Start Vetting on the host dashboard to begin a fresh run.

Log + artifact layout

/var/lib/vetting/
  vetting.db                 # SQLite: hosts, runs, stages, artifacts, spec_diffs, measurements
  artifacts/
    run-<N>/
      report.html            # operator-facing summary
      report.json            # machine-readable summary
      inventory.json         # raw probe output
      fio-<disk>.log         # storage stage output
      iperf-<nic>.json       # network stage output
      hold-<N>.pub           # per-run SSH pubkey (only if held)
/var/log/vetting/
  run-<N>.log                # append-only per-run log tail

Retention is governed by the artifacts.retention_days and logs.retention_days settings. DB rows (run history) are preserved indefinitely; only on-disk files get pruned.
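
To get a rough idea of which run directories are old enough to be pruned, something like the following could be used. It is only an approximation based on directory modification time. How the orchestrator actually decides what to prune is not described here, and stale_runs is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: list run-<N> directories last modified more than N whole days ago.
# The orchestrator's real pruning criteria are not documented here, so this
# is only an approximation based on directory mtime.
stale_runs() {
    local root="$1" days="$2"
    find "$root" -mindepth 1 -maxdepth 1 -type d -name 'run-*' -mtime "+$days" | sort
}

# Example: stale_runs /var/lib/vetting/artifacts 30
```

Note that find's -mtime +N counts whole 24-hour periods, so a directory has to be at least N+1 days old to be listed.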

Exposing outside the LAN

The orchestrator UI has no built-in auth. It's designed to live on a trusted home LAN and trust whatever reaches it. If you want to reach it from outside that LAN, don't expose the bind port directly — put it behind a reverse proxy (Caddy, nginx, Traefik) that terminates TLS and adds basic-auth or OIDC. The agent↔orchestrator bearer token auth is independent and keeps working either way.

Troubleshooting

| Symptom | First check |
| --- | --- |
| PXE client gets no DHCP offer | Check journalctl -u vetting for dnsmasq errors; confirm the LXC has CAP_NET_ADMIN (the shipped systemd unit does); confirm the host MAC is actually registered (sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'). |
| Agent /hello never fires | Check the live image is actually loading the agent binary: SSH into the live env (use the hold key path), then run systemctl status vetting-agent. |
| Tile stuck on Booting | Most likely the live image booted but the agent can't reach the orchestrator. Verify vetting.orchestrator= in the kernel cmdline resolves from the host's network. |
| UI shows stale stage | Force a reload; the SSE reconnect is automatic but the browser keeps the last state on ephemeral network blips. |
| Notification didn't fire | Pipe journalctl -u vetting through grep notify: to find the failure reason; delivery is fire-and-forget, and the reason is logged but not persisted. |

Upgrading

Rerun the registry-fetch one-liner on the LXC:

curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh \
    | sudo bash
sudo systemctl restart vetting

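To roll back or test a commit, pin a specific build by setting VETTING_VERSION for the bash side of the pipe (sudo VETTING_VERSION=sha-abc1234 bash); a variable prefixed to curl does not reach the installer. The DB migration runs at startup and is append-only, so no manual schema work is needed unless a release's notes call it out.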