Vetting/docs/operations.md
josh 42da48864f
Remove operator auth — trust the LAN
Can't log in from a fresh LXC deploy, and the service is LAN-only by
design. Rip out the whole bcrypt-password / signed-cookie session
layer: internal/auth, login templates, gen-admin-password binary +
Makefile targets, auth config block, login/logout routes and the
RequireSession middleware wrap. Agent bearer-token auth on
/api/v1/runs/{id}/* is untouched.

Operators who want a password can front the service with a reverse
proxy — noted in README and docs/operations.md.
2026-04-17 22:31:49 -04:00


Operations

Operator-facing runbook for the vetting orchestrator. If you're looking for the "what does the system do" overview, see architecture.md. For what each test stage actually measures, see test-suite.md.

Install (Proxmox LXC)

Target: a Debian/Ubuntu LXC on the Proxmox host that holds the cluster you're vetting for. The LXC must be on the same L2 segment as the repaired nodes so that DHCP and Wake-on-LAN (WoL) work.

  1. On your workstation, cross-build the binary:

    make orchestrator-linux
    

    This produces bin/vetting-linux-amd64.

  2. Copy the repo tree (or just bin/, deploy/) into the LXC, then from inside the LXC:

    sudo ./deploy/install.sh
    

    The installer:

    • apt installs dnsmasq, iperf3, ca-certificates
    • creates the vetting system user (home = /var/lib/vetting)
    • installs the binary into /usr/local/bin/vetting
    • drops vetting.example.yaml into /etc/vetting/vetting.yaml (only if there's no existing config — existing configs are preserved)
    • drops /etc/systemd/system/vetting.service
    • disables the distro-default dnsmasq (the orchestrator supervises its own)

    The installer does not enable the service. You'll want to edit the config first.

  3. Edit /etc/vetting/vetting.yaml:

    • server.bind — defaults to 127.0.0.1:8080. Switch to 0.0.0.0:8080 (or bind to a specific LAN IP) once you're ready to expose it. There is no built-in auth — see Exposing outside the LAN below.
    • server.public_url — the URL your browser hits the LXC on (e.g. http://vetting.lan:8080). Used as the click-through link in notifications.
  4. (Optional) Configure notifiers in the same file — see the commented-out example block for ntfy / Discord / SMTP.

  5. Enable and start:

    sudo systemctl enable --now vetting
    sudo journalctl -fu vetting
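
To confirm what the installer in step 2 left behind, a check along these lines can help. This is a minimal sketch, not something shipped in the repo: the function name vetting_check_install and its optional root-prefix argument are made up for illustration, while the paths are the ones listed above.

```shell
# Hypothetical helper (not part of the repo): verify that the files the
# installer is documented to create are present. The optional first argument
# is a root prefix, so the check also works against a staged or mounted tree.
vetting_check_install() {
  root="${1:-}"
  rc=0
  for f in \
    /usr/local/bin/vetting \
    /etc/vetting/vetting.yaml \
    /etc/systemd/system/vetting.service
  do
    if [ -e "${root}${f}" ]; then
      echo "ok       ${f}"
    else
      echo "MISSING  ${f}"
      rc=1
    fi
  done
  # Non-zero exit status if anything was missing.
  return "$rc"
}
```

Run vetting_check_install with no arguments on the LXC. The system user and the disabled distro dnsmasq can be checked separately with id vetting and systemctl is-enabled dnsmasq.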
    

First vetting run

Do the first run against a QEMU VM before you point the orchestrator at real hardware:

  1. On the Proxmox host (or wherever your LXC lives):

    sudo ip link add br-vetting type bridge
    sudo ip addr add 10.77.0.1/24 dev br-vetting
    sudo ip link set br-vetting up
    
  2. In the UI at http://<lxc>:8080, register a host:

    • Name: qemu-test
    • MAC: 52:54:00:12:34:56
    • WoL broadcast IP: 10.77.0.255
    • Expected spec: paste a minimal YAML like
      memory: { total_gib: 4 }
      cpu: { logical_cores: 4 }
      
  3. Click Start Vetting. The UI tile moves from Queued to WaitingWoL and stays there until the VM from the next step boots.

  4. Launch the QEMU VM on the bridge so it PXE-boots from dnsmasq:

    sudo qemu-system-x86_64 \
      -enable-kvm -cpu host -smp 4 -m 4096 \
      -netdev bridge,id=n0,br=br-vetting \
      -device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \
      -drive file=/tmp/test-disk.img,format=raw,if=virtio \
      -boot n -serial mon:stdio -display none
    
  5. Watch the tile advance through stages. On success, the tile shows View report and the VM auto-shuts-down.
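
The QEMU command in step 4 expects /tmp/test-disk.img to exist already. One way to create it is a sparse raw file, sketched below. The 8G default and the helper name make_test_disk are assumptions, not values from the repo.

```shell
# Hypothetical helper: create a sparse raw disk image for the QEMU test VM.
# The 8G default is an arbitrary assumption; pick a size that suits the
# storage stage you want to exercise.
make_test_disk() {
  path="${1:-/tmp/test-disk.img}"
  size="${2:-8G}"
  # Never clobber an existing image by accident.
  if [ -e "$path" ]; then
    echo "refusing to overwrite existing ${path}" >&2
    return 1
  fi
  # truncate allocates no blocks up front, so the file starts out sparse.
  truncate -s "$size" "$path"
}
```

Calling make_test_disk with no arguments creates the image at the path the QEMU command uses.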

For real repaired hardware: same flow, but register the node's actual MAC + expected spec, and make sure the node's BIOS is set to PXE-boot from the NIC that's on the br-vetting network.
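
On a real network the WoL broadcast IP follows from the subnet the node sits on, the same way 10.77.0.255 follows from 10.77.0.1/24 in the QEMU example. The sketch below derives it from an address in CIDR form. The function name broadcast_for is made up for illustration, and it assumes plain dotted-quad IPv4 input without leading zeros.

```shell
# Hypothetical helper: print the IPv4 broadcast address for ADDRESS/PREFIX.
# Assumes well-formed input such as 10.77.0.1/24 (no validation is done).
broadcast_for() {
  cidr="$1"
  ip="${cidr%/*}"        # part before the slash
  prefix="${cidr#*/}"    # part after the slash
  # Split the dotted quad into its four octets.
  IFS=. read -r a b c d <<EOF
$ip
EOF
  addr=$(( (a << 24) | (b << 16) | (c << 8) | d ))
  # Setting every host bit to 1 yields the broadcast address.
  host_mask=$(( (1 << (32 - prefix)) - 1 ))
  bcast=$(( addr | host_mask ))
  echo "$(( (bcast >> 24) & 255 )).$(( (bcast >> 16) & 255 )).$(( (bcast >> 8) & 255 )).$(( bcast & 255 ))"
}
```

For example, broadcast_for 10.77.0.1/24 prints 10.77.0.255, which matches the value registered for the QEMU test host.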

A failed run — SSH to the held host

When a stage fails, the pipeline halts at FailedHolding and the agent installs an orchestrator-issued SSH key into the live image's /root/.ssh/authorized_keys. The UI tile surfaces the IP and the exact ssh command.

The hold key is per-run. Once you're done:

  1. Power the host off (poweroff from the SSH session).
  2. In the UI, click Override wipe-probe, but only when the failure was at the Storage stage and you're sure the disks are expendable. Otherwise, fix the underlying issue and then click Start Vetting from the host dashboard to begin a fresh run.

Log + artifact layout

/var/lib/vetting/
  vetting.db                 # SQLite: hosts, runs, stages, artifacts, spec_diffs, measurements
  artifacts/
    run-<N>/
      report.html            # operator-facing summary
      report.json            # machine-readable summary
      inventory.json         # raw probe output
      fio-<disk>.log         # storage stage output
      iperf-<nic>.json       # network stage output
      hold-<N>.pub           # per-run SSH pubkey (only if held)
/var/log/vetting/
  run-<N>.log                # append-only per-run log tail

Retention is governed by the artifacts.retention_days and logs.retention_days settings. DB rows (run history) are preserved indefinitely; only on-disk files get pruned.
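
To get a rough idea of what a given retention window would remove, a dry run like the following can help. It is only an approximation: the orchestrator's actual pruning criteria are not described here, so this sketch assumes file modification time is what matters. The function name is made up for illustration.

```shell
# Hypothetical dry run: list artifact files older than DAYS by mtime.
# Nothing is deleted. The orchestrator's real pruning logic may differ.
list_expired_artifacts() {
  days="$1"
  dir="${2:-/var/lib/vetting/artifacts}"
  # -mtime +N matches files whose age, counted in whole days, exceeds N.
  find "$dir" -type f -mtime +"$days" -print
}
```

For instance, list_expired_artifacts 30 prints every artifact file last modified more than 30 whole days ago.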

Exposing outside the LAN

The orchestrator UI has no built-in auth. It's designed to live on a trusted home LAN and trust whatever reaches it. If you want to reach it from outside that LAN, don't expose the bind port directly — put it behind a reverse proxy (Caddy, nginx, Traefik) that terminates TLS and adds basic-auth or OIDC. The agent↔orchestrator bearer token auth is independent and keeps working either way.
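
If the reverse proxy runs on the same LXC, one reasonable arrangement (assumed here, not taken from the repo) is to leave server.bind on loopback so the proxy is the only way in. The small check below classifies a bind value; the function name is hypothetical.

```shell
# Hypothetical helper: succeed if a host:port bind value is loopback-only.
# Recognises 127.x.x.x, localhost and the bracketed IPv6 loopback form.
bind_is_loopback() {
  case "$1" in
    127.*:*|localhost:*|"[::1]":*) return 0 ;;
    *) return 1 ;;
  esac
}
```

With the default 127.0.0.1:8080 the check succeeds; with 0.0.0.0:8080 it fails, which is the signal that the UI is reachable without going through the proxy.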

Troubleshooting

| Symptom | First check |
| --- | --- |
| PXE client gets no DHCP offer | Check journalctl -u vetting for dnsmasq errors; confirm the LXC has CAP_NET_ADMIN (the shipped systemd unit grants it); confirm the host MAC is actually registered (sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'). |
| Agent /hello never fires | Check that the live image is actually loading the agent binary: SSH into the live env (use the hold key path), then run systemctl status vetting-agent. |
| Tile stuck on Booting | Most likely the live image booted but the agent can't reach the orchestrator. Verify that the vetting.orchestrator= value in the kernel cmdline resolves from the host's network. |
| UI shows stale stage | Force a reload; the SSE reconnect is automatic, but the browser can keep showing the last state after a brief network blip. |
| Notification didn't fire | journalctl -u vetting \| grep notify: (delivery is fire-and-forget, so the failure reason is logged but not persisted). |

Upgrading

  1. make orchestrator-linux on your workstation.
  2. scp bin/vetting-linux-amd64 lxc:/tmp/vetting.new
  3. On the LXC:
    sudo systemctl stop vetting
    sudo install -m 0755 /tmp/vetting.new /usr/local/bin/vetting
    sudo systemctl start vetting
    

The DB migration runs at startup and is append-only, so no manual schema work is needed unless a release's notes call it out.
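
Because the migration runs on the next start, it can be worth snapshotting the old binary and the database between the stop and install commands in step 3. The helper below is a sketch, not part of the repo: its name and the /var/backups/vetting default are assumptions, and copying the SQLite file this way is only safe while the service is stopped.

```shell
# Hypothetical helper: copy the current binary and DB into a timestamped
# directory and print that directory. Run it only while vetting is stopped.
vetting_backup() {
  bin="${1:-/usr/local/bin/vetting}"
  db="${2:-/var/lib/vetting/vetting.db}"
  dest="${3:-/var/backups/vetting}"   # assumed location, adjust to taste
  stamp="$(date +%Y%m%d-%H%M%S)"
  mkdir -p "${dest}/${stamp}" || return 1
  # -p preserves mode and timestamps so a rollback restores files as they were.
  cp -p "$bin" "${dest}/${stamp}/vetting" || return 1
  cp -p "$db" "${dest}/${stamp}/vetting.db" || return 1
  echo "${dest}/${stamp}"
}
```

To roll back, stop the service, copy both files from the printed directory to their original locations, and start the service again.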