# Operations
Operator-facing runbook for the vetting orchestrator. If you're looking
for the "what does the system do" overview, see
[architecture.md](architecture.md). For what each test stage actually
measures, see [test-suite.md](test-suite.md).
## Install (Proxmox LXC)
Target: a Debian/Ubuntu LXC on the Proxmox host that holds the cluster
you're vetting for. The LXC must be on the same L2 segment as the
repaired nodes: the PXE DHCP exchange and Wake-on-LAN (WoL) magic
packets are broadcast traffic and do not cross routers by default.
### One-liner install (recommended)
Every push to `main` kicks off a Gitea Actions run that rebuilds the
slim release bundle (orchestrator + agent + install scripts + a
pointer file for the live image's version) and publishes it to the
Gitea package registry. The ~300 MB live image (`vmlinuz` + `initrd.img`)
is published separately under `live-image/<version>/` and only
rebuilds when [`live-image/VERSION`](../live-image/VERSION) changes.
The LXC installer fetches the slim bundle on every run (~30 MB,
fast), then fetches the live image files only when the bundle's
pointer differs from what's on disk. The LXC therefore needs no Go
toolchain, no `make`, and no WSL, and ordinary releases skip the
300 MB transfer.
On the LXC:
```
curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh \
  | sudo bash
```
Force-refresh the on-disk live image even when versions match
(useful if the staged files got corrupted):
```
curl -fsSL .../proxmox-install.sh | sudo bash -s -- --force-live-image
```
`proxmox-install.sh` curls the bundle from
`${REGISTRY_URL}/api/packages/${PACKAGE_OWNER}/generic/vetting/latest/vetting-bundle.tar.gz`,
extracts it, and hands off to the bundled `install.sh` for the base
install (user, binaries, config, systemd unit). `install.sh` then
compares `live-image/VERSION` inside the bundle against
`/var/lib/vetting/live/VERSION` and fetches
`live-image/<version>/{vmlinuz,initrd.img}` from the registry when
they differ.
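
The version check above boils down to comparing two small files. A minimal sketch, assuming each `VERSION` file holds a single version string on its first line; `needs_live_image_fetch` is a hypothetical name used for illustration, not a function taken from `install.sh`:

```shell
# Sketch only: decide whether the live image must be (re)fetched.
# Assumption: each VERSION file contains one version string on line 1.

# needs_live_image_fetch <bundle VERSION file> <staged VERSION file> [force]
# Returns 0 when a fetch is needed, 1 when the staged image is current.
needs_live_image_fetch() {
    local bundle_file="$1" staged_file="$2" force="${3:-0}"
    local want="" have=""

    # Forced refresh (the --force-live-image case): always fetch.
    [ "$force" = 1 ] && return 0

    # No staged VERSION file means nothing usable is on disk yet.
    [ -r "$staged_file" ] || return 0

    # read returns non-zero on a missing trailing newline, hence "|| true".
    IFS= read -r want < "$bundle_file" || true
    IFS= read -r have < "$staged_file" || true

    # Differing strings mean the bundle points at another live image.
    [ "$want" != "$have" ]
}
```

Called as `needs_live_image_fetch vetting-bundle/live-image/VERSION /var/lib/vetting/live/VERSION`, it returns 0 when a fetch is due. Passing `1` as the third argument mimics `--force-live-image`.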
If you don't need PXE (e.g. host-mode reporter only, no automated
live-boots), you can stop here — edit `/etc/vetting/vetting.yaml` to
tune `server.bind` / `public_url`, then
`sudo systemctl enable --now vetting`.
### Offline / air-gapped install
If the LXC can't reach the registry, build the slim bundle locally
and `scp` it across. The live image files must also be copied in
separately (either into the bundle's `live-image/` dir before running
install.sh, or into `/var/lib/vetting/live/` directly):
```
make release # on any host with Go + templ
scp bin/vetting-bundle.tar.gz lxc:/tmp/
ssh lxc 'cd /tmp && tar xzf vetting-bundle.tar.gz \
  && cp /path/to/vmlinuz /path/to/initrd.img vetting-bundle/live-image/ \
  && cd vetting-bundle && sudo ./install.sh'
```
`install.sh` recognizes local `vmlinuz`/`initrd.img` under
`live-image/` and stages them without a registry fetch.
### PXE enablement
PXE is gated behind a second script so non-PXE installs stay simple.
**How it works on the network.** dnsmasq runs in **proxy-DHCP mode**:
it binds to the LXC's LAN interface and *coexists* with your
existing DHCP server (UniFi, pfSense, Asus, etc.). The router still
hands out LAN IPs the normal way; dnsmasq only answers the PXE
options (boot server + filename) and only for MACs you've registered
in the UI. A random laptop booting from network on the same LAN gets
a LAN IP from the router and nothing from us — the MAC allowlist is
the safety barrier.
That means **no dedicated bridge, no VLAN, no cabling changes**. The
LXC just needs an interface on the same L2 segment as the hosts
you're repairing — typically `eth0` on the LAN bridge.
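
For orientation, the allowlist side of a proxy-DHCP setup can be expressed in a handful of dnsmasq directives. The function below is a sketch under assumptions: `render_proxy_dhcp_conf` and the `vetting` tag name are made up for illustration, the boot-file and TFTP directives are left out, and the configuration the orchestrator actually generates may look different.

```shell
# Sketch only: render the allowlist part of a proxy-DHCP dnsmasq config.
# Not the orchestrator's real output; boot-file/TFTP directives omitted.

# render_proxy_dhcp_conf <interface> <subnet-cidr> <mac>...
render_proxy_dhcp_conf() {
    local iface="$1" subnet="$2" mac
    shift 2

    printf 'port=0\n'                  # disable dnsmasq's DNS service
    printf 'interface=%s\n' "$iface"   # listen on the LAN interface only
    printf 'bind-interfaces\n'
    # "proxy" means: hand out no leases, only add PXE information.
    printf 'dhcp-range=%s,proxy\n' "${subnet%/*}"

    # Tag every registered MAC (tag name "vetting" is hypothetical).
    for mac in "$@"; do
        printf 'dhcp-host=%s,set:vetting\n' "$mac"
    done
    # Ignore every client that did not receive the tag.
    printf 'dhcp-ignore=tag:!vetting\n'
}
```

For example, `render_proxy_dhcp_conf eth0 192.168.1.0/24 52:54:00:12:34:56` prints a config in which only that one MAC is eligible for a PXE answer.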
On the LXC, after the one-liner install completes:
```
sudo vetting-pxe-setup \
  --interface eth0 \
  --subnet 192.168.1.0/24 \
  --orchestrator-url http://<lxc-lan-ip>:8080
```
(`vetting-pxe-setup` is a symlink installed into `/usr/local/sbin/` by
`install.sh`. It points at `pxe-setup.sh`, which is staged together with
`ipxe-shas.txt` under `/usr/local/share/vetting/`.)
The script:

- Fetches `ipxe.efi` + `undionly.kpxe` from boot.ipxe.org and verifies
  SHA256 against `ipxe-shas.txt` (fail-closed on mismatch).
- Places `vmlinuz` + `initrd.img` into `/var/lib/vetting/live/`.
- Rewrites the `pxe:` block of `/etc/vetting/vetting.yaml` to enable
  PXE with the flags you passed.

It does **not** restart the service; review the rendered config,
then:
```
sudo systemctl restart vetting
sudo journalctl -fu vetting
```
The orchestrator validates PXE preconditions at startup (interface
exists, iPXE binaries are on disk, `subnet` parses as CIDR) and
exits non-zero with a clear error if anything's wrong, instead of
failing silently when a host first PXE-boots.
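
To catch the same problems before restarting, the three startup checks can be approximated from a shell. This is a sketch, not the orchestrator's validation code: `is_ipv4_cidr` and `pxe_preflight` are hypothetical helpers, the iPXE directory is passed in because its location is not stated here, and the interface test relies on Linux's `/sys/class/net`.

```shell
# Sketch only: shell approximations of the startup precondition checks.

# is_ipv4_cidr <string>: succeed if the argument has the form a.b.c.d/len.
is_ipv4_cidr() {
    local addr len octet
    case "$1" in
        */*/* | */ | /*) return 1 ;;   # several slashes or an empty side
        */*) ;;
        *) return 1 ;;                 # no prefix length at all
    esac
    addr=${1%/*}
    len=${1#*/}
    case "$len" in *[!0-9]* | ???*) return 1 ;; esac
    [ "$len" -le 32 ] || return 1
    case "$addr" in *[!0-9.]* | .* | *. | *..*) return 1 ;; esac

    local IFS=.
    set -- $addr                       # split the address into octets
    [ "$#" -eq 4 ] || return 1
    for octet in "$@"; do
        case "$octet" in ????*) return 1 ;; esac
        [ "$octet" -le 255 ] || return 1
    done
}

# pxe_preflight <interface> <subnet-cidr> <ipxe-dir>
# Prints one line per problem; returns non-zero if any check fails.
pxe_preflight() {
    local iface="$1" subnet="$2" ipxe_dir="$3" rc=0 f
    [ -e "/sys/class/net/$iface" ] || { echo "interface $iface not found"; rc=1; }
    is_ipv4_cidr "$subnet" || { echo "subnet $subnet is not valid CIDR"; rc=1; }
    for f in ipxe.efi undionly.kpxe; do
        [ -s "$ipxe_dir/$f" ] || { echo "missing $ipxe_dir/$f"; rc=1; }
    done
    return "$rc"
}
```

The CIDR test is purely syntactic: it does not check that the host bits are zero or that the subnet matches the interface's address.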
`vetting-pxe-setup` is idempotent — safe to re-run. Pass `--force` to
overwrite a hand-edited `pxe:` block.
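
The fail-closed checksum step from the list above follows a common pattern: verify downloads in a staging directory and copy them into place only when every digest matches. A minimal sketch, assuming the manifest uses the `sha256sum` format (`<hex digest>  <filename>`); the real layout of `ipxe-shas.txt` may differ, and `install_verified` is a hypothetical helper, not part of `pxe-setup.sh`.

```shell
# Sketch only: fail-closed install of checksummed files (GNU coreutils).
# Assumption: the manifest is in sha256sum format, one file per line.

# install_verified <manifest> <staging-dir> <dest-dir>
# Copies the listed files to <dest-dir> only if ALL digests match.
install_verified() {
    local manifest staging="$2" dest="$3" digest name
    manifest=$(realpath "$1") || return 1

    # --strict also rejects malformed manifest lines.
    if ! (cd "$staging" && sha256sum --check --strict --quiet "$manifest"); then
        echo "checksum verification failed; nothing installed" >&2
        return 1
    fi

    # Everything verified: copy each listed file into place.
    while read -r digest name; do
        cp -- "$staging/${name#\*}" "$dest/" || return 1
    done < "$manifest"
}
```

Because the copy loop runs only after the whole manifest has verified, a single bad digest leaves the destination directory untouched.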
**Router caveat.** Most home/prosumer routers (UniFi, Asus, Netgear,
etc.) don't send PXE options, so proxy mode just works. pfSense and
OPNsense *can* serve PXE themselves — if yours does, disable its
TFTP/netboot feature so there's only one PXE authority on the
segment.
### Dev-loop install (from a source checkout)
For iterating on the orchestrator without waiting for a CI publish:
1. On your workstation: `make orchestrator-linux && make agent-linux`
2. Copy the repo tree (or just `bin/` + `deploy/`) onto the LXC
3. `sudo ./deploy/install.sh` → base install
4. For PXE: `wsl make live-image` on your workstation,
   `scp live-image/build/vmlinuz lxc:/tmp/ && scp live-image/build/initrd.img lxc:/tmp/`,
   then run `pxe-setup.sh --bundle-dir /tmp` (or accept the default
   repo-tree detection when running from the repo root).
## First vetting run
Against a QEMU VM first, before you point it at real hardware:
1. In the UI at `http://<lxc-lan-ip>:8080`, register a host:
   - Name: `qemu-test`
   - MAC: `52:54:00:12:34:56`
   - WoL broadcast IP: your LAN broadcast, e.g. `192.168.1.255`
   - Expected spec: paste a minimal YAML like
     ```yaml
     memory: { total_gib: 4 }
     cpu: { logical_cores: 4 }
     ```
2. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingReboot`.
3. Launch the QEMU VM on the LAN bridge so it PXE-boots via the
   router's DHCP + our proxy-DHCP reply:
   ```
   sudo qemu-system-x86_64 \
     -enable-kvm -cpu host -smp 4 -m 4096 \
     -netdev bridge,id=n0,br=vmbr0 \
     -device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \
     -drive file=/tmp/test-disk.img,format=raw,if=virtio \
     -boot n -serial mon:stdio -display none
   ```
   (Swap `vmbr0` for whatever your Proxmox LAN bridge is called.)
4. Watch the tile advance through stages. On success, the tile shows
   **View report** and the VM auto-shuts-down.

For real repaired hardware: same flow, but register the node's actual
LAN MAC + expected spec, and make sure the node's BIOS is set to
PXE-boot from the NIC that's on the LAN.
## A failed run — SSH to the held host
When a stage fails, the pipeline halts at `FailedHolding` and the
agent installs an orchestrator-issued SSH key into the live-image's
`/root/.ssh/authorized_keys`. The UI tile surfaces the IP and the
exact `ssh` command.
The hold key is **per-run**. Once you're done:
1. Power the host off (`poweroff` from the SSH session).
2. In the UI, click **Override wipe-probe** only when the failure was
   at the `Storage` stage *and* you're sure the disks are expendable.
   Otherwise click **Start vetting** on a fresh run from the host
   dashboard after fixing the underlying issue.
## Log + artifact layout
```
/var/lib/vetting/
  vetting.db            # SQLite: hosts, runs, stages, artifacts, spec_diffs, measurements
  artifacts/
    run-<N>/
      report.html       # operator-facing summary
      report.json       # machine-readable summary
      inventory.json    # raw probe output
      fio-<disk>.log    # storage stage output
      iperf-<nic>.json  # network stage output
      hold-<N>.pub      # per-run SSH pubkey (only if held)
/var/log/vetting/
  run-<N>.log           # append-only per-run log tail
```
Retention is governed by the `artifacts.retention_days` and
`logs.retention_days` settings. DB rows (run history) are preserved
indefinitely; only on-disk files get pruned.
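
Before changing a retention setting, it can help to preview which run directories fall outside a given window. The helper below is a dry-run sketch that deletes nothing. `list_expired_runs` is a hypothetical name, and using directory modification time as the age criterion is an assumption; the orchestrator's own pruning logic may decide differently.

```shell
# Sketch only: list run directories older than a retention window.
# Dry run: prints paths, deletes nothing. Age is judged by directory
# mtime, which is an assumption, not the orchestrator's documented rule.

# list_expired_runs <artifacts-dir> <retention-days>
list_expired_runs() {
    local dir="$1" days="$2"
    # -mtime +N matches entries modified more than N full days ago.
    find "$dir" -mindepth 1 -maxdepth 1 -type d -name 'run-*' \
        -mtime "+$days" | sort
}
```

For example, `list_expired_runs /var/lib/vetting/artifacts 30` lists the `run-<N>/` directories untouched for more than 30 days.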
## Exposing outside the LAN
The orchestrator UI has no built-in auth. It's designed to live on a
trusted home LAN and trust whatever reaches it. If you want to reach
it from outside that LAN, don't expose the bind port directly — put
it behind a reverse proxy (Caddy, nginx, Traefik) that terminates TLS
and adds basic-auth or OIDC. The agent↔orchestrator bearer token
auth is independent and keeps working either way.
## Troubleshooting
| Symptom | First check |
|---|---|
| Host sits at PXE, no boot filename | Confirm the MAC is registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). If it is, `sudo tcpdump -i <lan-iface> -n -e 'port 67 or port 68 or port 4011'` while the host PXEs — if you see DISCOVER/OFFER from the router but no proxy reply from us, check `journalctl -u vetting` for dnsmasq errors. |
| PXE boots but iPXE can't fetch the script | Verify the LXC's LAN IP matches `pxe.orchestrator_url` in `/etc/vetting/vetting.yaml` — iPXE bakes that URL in at chainload. |
| Agent `/hello` never fires | Check the live image is actually loading the agent binary — SSH into the live env (use the hold key path), `systemctl status vetting-agent`. |
| Tile stuck on `Booting` | Most likely the live image booted but the agent can't reach the orchestrator. Verify `vetting.orchestrator=` in the kernel cmdline resolves from the host's network. |
| UI shows stale stage | Force a reload; the SSE reconnect is automatic but the browser keeps the last state on ephemeral network blips. |
| Notification didn't fire | `journalctl -u vetting \| grep notify:` — delivery is fire-and-forget and the failure reason is logged but not persisted. |
## Upgrading
Rerun the registry-fetch one-liner on the LXC:
```
curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh \
  | sudo bash
```
That's it — `install.sh` auto-restarts `vetting.service` when it's
already enabled, and re-stages `vmlinuz`/`initrd.img` into
`/var/lib/vetting/live/` only when the bundle points at a new
`live-image/VERSION`. Watch the logs with `journalctl -fu vetting`.
The DB migration runs at startup and is append-only — no manual
schema work unless a release's notes call it out.