josh f927a4a66b
install.sh: stage live image and auto-restart on upgrade
Single-command upgrades were leaving /var/lib/vetting/live/ stale on
PXE-enabled LXCs because install.sh explicitly punted live-image
staging to pxe-setup.sh. That was right when make-release ran on a
dev box, but the new registry-pull flow ships vmlinuz+initrd.img
inside the bundle — they should land in place during every install.

install.sh now:
  - auto-detects live-image/{vmlinuz,initrd.img} (release bundle
    layout) or ../live-image/build/ (repo dev checkout) and stages
    them into --live-dir (default /var/lib/vetting/live).
  - restarts vetting.service when already enabled, so the
    curl | sudo bash one-liner is the full upgrade loop. First-
    install path still leaves the service stopped for config edits.

pxe-setup.sh's own live-image copy is now redundant on upgrade but
still runs for first-time PXE setup (it also writes the pxe: block
of vetting.yaml, which install.sh has no business touching).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 10:38:34 -04:00

# Operations
Operator-facing runbook for the vetting orchestrator. If you're looking
for the "what does the system do" overview, see
[architecture.md](architecture.md). For what each test stage actually
measures, see [test-suite.md](test-suite.md).
## Install (Proxmox LXC)
Target: a Debian/Ubuntu LXC on the Proxmox host that holds the cluster
you're vetting for. The LXC must be on the same L2 segment as the
repaired nodes so DHCP and WoL work.
### One-liner install (recommended)
Every push to `main` kicks off a Gitea Actions run that builds a full
release bundle (orchestrator + agent + live image + install scripts +
pinned iPXE SHAs) and publishes it to the Gitea package registry. The
LXC installer fetches the prebuilt tarball — no source clone, no Go
toolchain, no `make`, no WSL.
On the LXC:
```
curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh \
| sudo bash
```
To pin a specific build instead of the rolling `latest`:
```
curl -fsSL .../proxmox-install.sh | sudo VETTING_VERSION=sha-abc1234 bash
```
`proxmox-install.sh` curls the bundle from
`${REGISTRY_URL}/api/packages/${PACKAGE_OWNER}/generic/vetting/${VETTING_VERSION}/vetting-bundle.tar.gz`,
extracts it, and hands off to the bundled `install.sh` for the base
install (user, binaries, config, systemd unit).
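
As a rough illustration of how that URL is assembled, the sketch below composes it from the three variables named above. The default values are assumptions inferred from the one-liner's host and the rolling `latest` tag, not values read from `proxmox-install.sh`, and `bundle_url` is a hypothetical helper name.

```shell
# Sketch only: bundle_url is a hypothetical helper, and the defaults are
# assumptions, not values taken from proxmox-install.sh.
bundle_url() {
    local registry owner version
    registry=${REGISTRY_URL:-https://gitea.thewrightserver.net}  # assumed default
    owner=${PACKAGE_OWNER:-josh}                                 # assumed default
    version=${VETTING_VERSION:-latest}                           # rolling tag per the text
    printf '%s/api/packages/%s/generic/vetting/%s/vetting-bundle.tar.gz\n' \
        "$registry" "$owner" "$version"
}
```

Running `bundle_url` with `VETTING_VERSION` exported shows which tarball a pinned install would request, which can help when a download fails with a 404.
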
If you don't need PXE (e.g. host-mode reporter only, no automated
live-boots), you can stop here — edit `/etc/vetting/vetting.yaml` to
tune `server.bind` / `public_url`, then
`sudo systemctl enable --now vetting`.
### Offline / air-gapped install
If the LXC can't reach the registry, build the tarball locally and
`scp` it across:
```
make release # on a Linux/WSL workstation
scp bin/vetting-bundle-<sha>.tar.gz lxc:/tmp/
ssh lxc 'cd /tmp && tar xzf vetting-bundle-*.tar.gz \
&& cd vetting-bundle-* && sudo ./install.sh'
```
Same bundle layout either way.
### PXE enablement
PXE is gated behind a second script so non-PXE installs stay simple.
**Prerequisite: dedicated PXE bridge on the Proxmox hypervisor.** The
LXC can't create bridges on its host, so do this once on the Proxmox
node (not inside the LXC):
```
sudo ip link add br-vetting type bridge
sudo ip addr add 10.77.0.1/24 dev br-vetting
sudo ip link set br-vetting up
```
Attach a veth from the LXC onto `br-vetting` (e.g. `eth1` inside the
LXC at `10.77.0.2/24`). Repaired nodes PXE-boot from a NIC cabled or
bridged onto `br-vetting` only — keep this network isolated from your
household DHCP, or both DHCP servers will fight.
On the LXC, inside the extracted bundle:
```
sudo ./pxe-setup.sh \
--interface eth1 \
--dhcp-range 10.77.0.100,10.77.0.200,12h \
--orchestrator-url http://10.77.0.2:8080
```
The script:
- Fetches `ipxe.efi` + `undionly.kpxe` from boot.ipxe.org and verifies
SHA256 against `ipxe-shas.txt` (fail-closed on mismatch).
- Places `vmlinuz` + `initrd.img` into `/var/lib/vetting/live/`.
- Rewrites the `pxe:` block of `/etc/vetting/vetting.yaml` to enable
PXE with the flags you passed.
It does **not** restart the service — review the rendered config,
then:
```
sudo systemctl restart vetting
sudo journalctl -fu vetting
```
The orchestrator validates PXE preconditions at startup (interface
exists, iPXE binaries are on disk, `dhcp_range` parses) and exits
non-zero with a clear error if anything's wrong, instead of failing
silently when a host first PXE-boots.
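
To catch a malformed `--dhcp-range` before restarting, a value can be checked by hand with something like the sketch below. This is not the orchestrator's own validation. It assumes the three-field `start,end,lease` form used in this runbook, and `is_ipv4` and `valid_dhcp_range` are hypothetical helper names.

```shell
# Sketch only: approximates a "does dhcp_range parse" check for the
# start,end,lease form; it is not the orchestrator's actual validator.

# is_ipv4 ADDR: succeed if ADDR is a dotted quad with octets 0-255.
is_ipv4() {
    local rest octet count
    rest=$1
    count=0
    case "$rest" in ''|*[!0-9.]*) return 1 ;; esac
    while :; do
        octet=${rest%%.*}
        case "$octet" in ''|????*) return 1 ;; esac   # empty or more than 3 digits
        [ "$octet" -le 255 ] || return 1
        count=$((count + 1))
        case "$rest" in *.*) rest=${rest#*.} ;; *) break ;; esac
    done
    [ "$count" -eq 4 ]
}

# valid_dhcp_range RANGE: succeed if RANGE is "ipv4,ipv4,lease".
valid_dhcp_range() {
    local start rest end lease num
    start=${1%%,*}
    rest=${1#*,}
    end=${rest%%,*}
    lease=${rest#*,}
    is_ipv4 "$start" && is_ipv4 "$end" || return 1
    [ "$lease" = infinite ] && return 0
    num=${lease%[mhdw]}                               # strip one unit suffix, if any
    case "$num" in ''|*[!0-9]*) return 1 ;; esac
}
```

For example, `valid_dhcp_range 10.77.0.100,10.77.0.200,12h` succeeds, while a range with a missing lease field or an octet above 255 returns non-zero.
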
`pxe-setup.sh` is idempotent — safe to re-run. Pass `--force` to
overwrite a hand-edited `pxe:` block.
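
The fail-closed checksum step can be pictured roughly as follows. This sketch assumes `ipxe-shas.txt` uses the `<sha256>  <filename>` layout that `sha256sum` writes, which the runbook does not confirm. `verify_pinned_sha256` is a hypothetical name, not a function from `pxe-setup.sh`.

```shell
# Sketch only: assumes a sha256sum-style pin file ("<hash>  <name>" per line).
# verify_pinned_sha256 FILE SHAS_FILE
# Succeeds only when SHAS_FILE pins FILE's basename and the digest matches.
verify_pinned_sha256() {
    local file shas name expected actual
    file=$1
    shas=$2
    name=$(basename "$file")
    # Accept both text-mode ("name") and binary-mode ("*name") entries.
    expected=$(awk -v n="$name" '$2 == n || $2 == "*" n { print $1; exit }' "$shas")
    if [ -z "$expected" ]; then
        echo "no pinned hash for $name" >&2
        return 1                                  # fail closed: unknown file
    fi
    actual=$(sha256sum "$file" | awk '{ print $1 }')
    if [ "$actual" != "$expected" ]; then
        echo "SHA256 mismatch for $name" >&2
        return 1                                  # fail closed: wrong digest
    fi
}
```

A missing entry is treated the same as a mismatch, so a binary that is not pinned never gets installed by accident.
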
### Dev-loop install (from a source checkout)
For iterating on the orchestrator without waiting for a CI publish:
1. On your workstation: `make orchestrator-linux && make agent-linux`
2. Copy the repo tree (or just `bin/` + `deploy/`) onto the LXC
3. `sudo ./deploy/install.sh` → base install
4. For PXE: `wsl make live-image` on your workstation,
`scp live-image/build/vmlinuz lxc:/tmp/ && scp live-image/build/initrd.img lxc:/tmp/`,
then run `pxe-setup.sh --bundle-dir /tmp` (or accept the default
repo-tree detection when running from the repo root).
## First vetting run
Against a QEMU VM first, before you point it at real hardware:
1. Make sure the `br-vetting` bridge exists on the hypervisor (see
   above). From inside the LXC, confirm it's reachable on your
   PXE-side interface.
2. In the UI at `http://<lxc>:8080`, register a host:
   - Name: `qemu-test`
   - MAC: `52:54:00:12:34:56`
   - WoL broadcast IP: `10.77.0.255`
   - Expected spec: paste a minimal YAML like
     ```yaml
     memory: { total_gib: 4 }
     cpu: { logical_cores: 4 }
     ```
3. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingReboot`.
4. Launch the QEMU VM on the bridge so it PXE-boots from dnsmasq:
   ```
   sudo qemu-system-x86_64 \
     -enable-kvm -cpu host -smp 4 -m 4096 \
     -netdev bridge,id=n0,br=br-vetting \
     -device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \
     -drive file=/tmp/test-disk.img,format=raw,if=virtio \
     -boot n -serial mon:stdio -display none
   ```
5. Watch the tile advance through stages. On success, the tile shows
   **View report** and the VM auto-shuts-down.
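
The QEMU command in step 4 expects `/tmp/test-disk.img` to exist already. One way to create it is a sparse raw file, sketched below. The 8G default size and the `make_test_disk` helper name are assumptions for illustration, not values from this runbook.

```shell
# Sketch only: the helper name and the 8G default are assumptions.
# make_test_disk PATH [SIZE]
# Creates a sparse raw image at PATH unless something already exists there.
make_test_disk() {
    local path size
    path=$1
    size=${2:-8G}
    [ -e "$path" ] && return 0      # leave an existing image untouched
    truncate -s "$size" "$path"     # sparse: takes no real space until written
}
```

Calling `make_test_disk /tmp/test-disk.img` before launching QEMU gives the VM a blank virtio disk.
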
For real repaired hardware: same flow, but register the node's actual
MAC + expected spec, and make sure the node's BIOS is set to PXE-boot
from the NIC that's on the `br-vetting` network.
## A failed run — SSH to the held host
When a stage fails, the pipeline halts at `FailedHolding` and the
agent installs an orchestrator-issued SSH key into the live-image's
`/root/.ssh/authorized_keys`. The UI tile surfaces the IP and the
exact `ssh` command.
The hold key is **per-run**. Once you're done:
1. Power the host off (`poweroff` from the SSH session).
2. In the UI, click **Override wipe-probe** only when the failure was
   at the `Storage` stage *and* you're sure the disks are expendable.
   Otherwise, fix the underlying issue and start a fresh run with
   **Start Vetting** from the host dashboard.
## Log + artifact layout
```
/var/lib/vetting/
vetting.db # SQLite: hosts, runs, stages, artifacts, spec_diffs, measurements
artifacts/
run-<N>/
report.html # operator-facing summary
report.json # machine-readable summary
inventory.json # raw probe output
fio-<disk>.log # storage stage output
iperf-<nic>.json # network stage output
hold-<N>.pub # per-run SSH pubkey (only if held)
/var/log/vetting/
run-<N>.log # append-only per-run log tail
```
Retention is governed by the `artifacts.retention_days` and
`logs.retention_days` settings. DB rows (run history) are preserved
indefinitely; only on-disk files get pruned.
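
To get a feel for what a given `artifacts.retention_days` value would affect, a dry-run listing like the one below may help. It uses directory modification time as a stand-in for run age, which is an assumption. The orchestrator's actual pruning criteria are not described here, and `list_expired` is a hypothetical helper.

```shell
# Sketch only: approximates "older than N days" by directory mtime;
# the orchestrator's real pruning logic may differ.
# list_expired DIR DAYS
# Prints run-* directories directly under DIR last modified more than DAYS days ago.
list_expired() {
    local dir days
    dir=$1
    days=$2
    find "$dir" -mindepth 1 -maxdepth 1 -type d -name 'run-*' -mtime +"$days" | sort
}
```

For instance, `list_expired /var/lib/vetting/artifacts 30` prints candidate directories without deleting anything.
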
## Exposing outside the LAN
The orchestrator UI has no built-in auth. It's designed to live on a
trusted home LAN and trust whatever reaches it. If you want to reach
it from outside that LAN, don't expose the bind port directly — put
it behind a reverse proxy (Caddy, nginx, Traefik) that terminates TLS
and adds basic-auth or OIDC. The agent↔orchestrator bearer token
auth is independent and keeps working either way.
## Troubleshooting
| Symptom | First check |
|---|---|
| PXE client gets no DHCP offer | `journalctl -u vetting` for dnsmasq errors; confirm the LXC has `CAP_NET_ADMIN` (the shipped systemd unit does); confirm the host MAC is actually registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). |
| Agent `/hello` never fires | Check the live image is actually loading the agent binary — SSH into the live env (use the hold key path), `systemctl status vetting-agent`. |
| Tile stuck on `Booting` | Most likely the live image booted but the agent can't reach the orchestrator. Verify `vetting.orchestrator=` in the kernel cmdline resolves from the host's network. |
| UI shows stale stage | Reload the page. The SSE stream reconnects automatically, but after a brief network blip the browser can keep showing the last state it received. |
| Notification didn't fire | `journalctl -u vetting \| grep notify:` — delivery is fire-and-forget and the failure reason is logged but not persisted. |
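
For the `Booting` row, the orchestrator URL the live image was handed can be pulled out of the kernel command line with a small helper like this sketch. The `vetting.orchestrator=` parameter comes from the table above; `orchestrator_from_cmdline` is a hypothetical name.

```shell
# Sketch only: the helper name is hypothetical.
# orchestrator_from_cmdline CMDLINE
# Prints the value of the first vetting.orchestrator= argument, or fails if absent.
orchestrator_from_cmdline() {
    local value
    value=$(printf '%s\n' "$1" | tr ' ' '\n' \
        | sed -n 's/^vetting\.orchestrator=//p' | head -n 1)
    [ -n "$value" ] && printf '%s\n' "$value"
}
```

On the held host, `orchestrator_from_cmdline "$(cat /proc/cmdline)"` prints the URL, which can then be passed to `curl -sS -o /dev/null` to check that it is reachable from that network.
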
## Upgrading
Rerun the registry-fetch one-liner on the LXC:
```
curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh \
| sudo bash
```
That's it — `install.sh` auto-restarts `vetting.service` when it's
already enabled, and re-stages `vmlinuz`/`initrd.img` into
`/var/lib/vetting/live/` so PXE-enabled LXCs come back up with the
fresh live image. Watch the logs with `journalctl -fu vetting`.
Pin to a specific build with `VETTING_VERSION=sha-abc1234` if you
need to roll back or test a commit. The DB migration runs at startup
and is append-only — no manual schema work unless a release's notes
call it out.