# Operations

Operator-facing runbook for the vetting orchestrator. If you're looking
for the "what does the system do" overview, see
[architecture.md](architecture.md). For what each test stage actually
measures, see [test-suite.md](test-suite.md).

## Install (Proxmox LXC)

Target: a Debian/Ubuntu LXC on the Proxmox host that holds the cluster
you're vetting for. The LXC must be on the same L2 segment as the
repaired nodes so DHCP and WoL work.

### One-liner install (recommended)

Every push to `main` kicks off a Gitea Actions run that rebuilds the
slim release bundle (orchestrator + agent + install scripts + a
pointer file for the live image's version) and publishes it to the
Gitea package registry. The ~300 MB live image (`vmlinuz` + `initrd.img`)
is published separately under `live-image/<version>/` and only
rebuilds when [`live-image/VERSION`](../live-image/VERSION) changes.

The LXC installer fetches the slim bundle on every run (~30 MB,
fast), then fetches the live image files only when the bundle's
pointer differs from what's on disk. That means no Go toolchain, no
`make`, no WSL, and no 300 MB transfer on ordinary releases.

On the LXC:

```
curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh \
  | sudo bash
```

Force-refresh the on-disk live image even when versions match
(useful if the staged files got corrupted):

```
curl -fsSL .../proxmox-install.sh | sudo bash -s -- --force-live-image
```

`proxmox-install.sh` curls the bundle from
`${REGISTRY_URL}/api/packages/${PACKAGE_OWNER}/generic/vetting/latest/vetting-bundle.tar.gz`,
extracts it, and hands off to the bundled `install.sh` for the base
install (user, binaries, config, systemd unit). `install.sh` then
compares `live-image/VERSION` inside the bundle against
`/var/lib/vetting/live/VERSION` and fetches
`live-image/<version>/{vmlinuz,initrd.img}` from the registry when
they differ.

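The pointer comparison can be sketched in shell. This is a minimal illustration, not the code in `install.sh`: the function name `live_image_needs_fetch`, its argument order, and the third `force` argument (standing in for `--force-live-image`) are assumptions made for the example.

```bash
# Hypothetical helper: exits 0 when the live image should be (re)fetched.
#   $1  VERSION pointer shipped inside the bundle
#   $2  VERSION file already on disk (defaults to the documented path)
#   $3  "1" to force a refresh even when the versions match
live_image_needs_fetch() {
  local bundle_file disk_file force want have
  bundle_file="$1"
  disk_file="${2:-/var/lib/vetting/live/VERSION}"
  force="${3:-0}"

  # A forced refresh always fetches.
  if [ "$force" = "1" ]; then
    return 0
  fi

  # Nothing staged yet: treat it as a mismatch.
  if [ ! -r "$disk_file" ]; then
    return 0
  fi

  # Compare the two pointers, ignoring whitespace and trailing newlines.
  want=$(tr -d '[:space:]' < "$bundle_file")
  have=$(tr -d '[:space:]' < "$disk_file")
  [ "$want" != "$have" ]
}
```

A caller would run the registry fetch only when the function succeeds, for example `live_image_needs_fetch live-image/VERSION && echo "fetch needed"`.
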
If you don't need PXE (e.g. host-mode reporter only, no automated
live-boots), you can stop here: edit `/etc/vetting/vetting.yaml` to
tune `server.bind` / `public_url`, then run
`sudo systemctl enable --now vetting`.

### Offline / air-gapped install

If the LXC can't reach the registry, build the slim bundle locally
and `scp` it across. The live image files must also be copied in
separately (either into the bundle's `live-image/` dir before running
`install.sh`, or into `/var/lib/vetting/live/` directly):

```
make release                      # on any host with Go + templ
scp bin/vetting-bundle.tar.gz lxc:/tmp/
ssh lxc 'cd /tmp && tar xzf vetting-bundle.tar.gz \
    && cp /path/to/vmlinuz /path/to/initrd.img vetting-bundle/live-image/ \
    && cd vetting-bundle && sudo ./install.sh'
```

`install.sh` recognizes local `vmlinuz`/`initrd.img` under
`live-image/` and stages them without a registry fetch.

### PXE enablement

PXE is gated behind a second script so non-PXE installs stay simple.

**How it works on the network.** dnsmasq runs in **proxy-DHCP mode**:
it binds to the LXC's LAN interface and *coexists* with your
existing DHCP server (UniFi, pfSense, Asus, etc.). The router still
hands out LAN IPs the normal way; dnsmasq only answers the PXE
options (boot server + filename) and only for MACs you've registered
in the UI. A random laptop booting from network on the same LAN gets
a LAN IP from the router and nothing from us. The MAC allowlist is
the safety barrier.

That means **no dedicated bridge, no VLAN, no cabling changes**. The
LXC just needs an interface on the same L2 segment as the hosts
you're repairing, typically `eth0` on the LAN bridge.

On the LXC, after the one-liner install completes:

```
sudo vetting-pxe-setup \
  --interface eth0 \
  --subnet 192.168.1.0/24 \
  --orchestrator-url http://<lxc-lan-ip>:8080
```

(`vetting-pxe-setup` is a symlink that `install.sh` installs into
`/usr/local/sbin/`. It points at the `pxe-setup.sh` script, which is
staged together with `ipxe-shas.txt` under `/usr/local/share/vetting/`.)

The script:

- Fetches `ipxe.efi` + `undionly.kpxe` from boot.ipxe.org and verifies
  SHA256 against `ipxe-shas.txt` (fail-closed on mismatch).
- Places `vmlinuz` + `initrd.img` into `/var/lib/vetting/live/`.
- Rewrites the `pxe:` block of `/etc/vetting/vetting.yaml` to enable
  PXE with the flags you passed.

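The fail-closed hash check in the first bullet can be approximated as follows. This sketch assumes `ipxe-shas.txt` uses the `sha256sum` output format (`<hash>  <filename>` per line), which this runbook does not confirm; `verify_pinned_sha256` is a hypothetical name, not a function from `pxe-setup.sh`.

```bash
# Hypothetical helper: exits 0 only when the file's SHA256 matches the
# hash pinned for its basename in the manifest.
#   $1  downloaded file (e.g. ipxe.efi)
#   $2  manifest in sha256sum format (assumed layout of ipxe-shas.txt)
verify_pinned_sha256() {
  local file manifest name expected actual
  file="$1"
  manifest="$2"
  name=$(basename "$file")

  # Look up the pinned hash; sha256sum marks binary mode with a "*" prefix.
  expected=$(awk -v n="$name" '($2 == n) || ($2 == ("*" n)) { print $1; exit }' "$manifest")
  if [ -z "$expected" ]; then
    echo "no pinned hash for $name, refusing to use it" >&2
    return 1
  fi

  actual=$(sha256sum "$file" | awk '{ print $1 }')
  if [ "$actual" != "$expected" ]; then
    echo "SHA256 mismatch for $name, refusing to use it" >&2
    return 1
  fi
}
```

Both an unknown filename and a wrong hash return non-zero, so a caller that stops on failure never stages an unverified binary.
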
It does **not** restart the service. Review the rendered config,
then:

```
sudo systemctl restart vetting
sudo journalctl -fu vetting
```

The orchestrator validates PXE preconditions at startup (interface
exists, iPXE binaries are on disk, `subnet` parses as CIDR) and
exits non-zero with a clear error if anything's wrong, instead of
failing silently when a host first PXE-boots.

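The same preconditions can be checked by hand before a restart. The sketch below mirrors the three startup checks in shell; it is not the orchestrator's own validation code. `pxe_preflight`, `is_ipv4_cidr`, and the `ipxe_dir` argument (wherever your iPXE binaries were staged) are names invented for this example, and the CIDR check covers IPv4 only.

```bash
# Hypothetical helper: exits 0 when $1 looks like a.b.c.d/prefix.
is_ipv4_cidr() {
  local ip prefix octet count IFS
  case "$1" in
    */*) ;;
    *) return 1 ;;
  esac
  ip="${1%/*}"
  prefix="${1#*/}"

  # Prefix: one or two digits, at most 32.
  case "$prefix" in
    ''|*[!0-9]*|???*) return 1 ;;
  esac
  [ "$prefix" -le 32 ] || return 1

  # Address: digits and dots only, no empty octets.
  case "$ip" in
    ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;
  esac
  count=0
  IFS=.
  for octet in $ip; do
    [ "${#octet}" -le 3 ] || return 1
    [ "$octet" -le 255 ] || return 1
    count=$((count + 1))
  done
  [ "$count" -eq 4 ]
}

# Hypothetical helper: interface exists, iPXE binaries present, subnet parses.
#   $1 interface   $2 subnet in CIDR form   $3 directory holding the iPXE binaries
pxe_preflight() {
  local iface subnet ipxe_dir f
  iface="$1"; subnet="$2"; ipxe_dir="$3"
  if [ ! -e "/sys/class/net/$iface" ]; then
    echo "interface $iface not found" >&2; return 1
  fi
  for f in ipxe.efi undionly.kpxe; do
    if [ ! -s "$ipxe_dir/$f" ]; then
      echo "missing iPXE binary: $ipxe_dir/$f" >&2; return 1
    fi
  done
  if ! is_ipv4_cidr "$subnet"; then
    echo "subnet '$subnet' is not valid CIDR" >&2; return 1
  fi
}
```

Run it with the same values you passed to `vetting-pxe-setup`; a non-zero exit names the first failing precondition on stderr.
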
`vetting-pxe-setup` is idempotent and safe to re-run. Pass `--force` to
overwrite a hand-edited `pxe:` block.

**Router caveat.** Most home/prosumer routers (UniFi, Asus, Netgear,
etc.) don't send PXE options, so proxy mode just works. pfSense and
OPNsense *can* serve PXE themselves. If yours does, disable its
TFTP/netboot feature so there's only one PXE authority on the
segment.

### Dev-loop install (from a source checkout)

For iterating on the orchestrator without waiting for a CI publish:

1. On your workstation, run `make orchestrator-linux && make agent-linux`.
2. Copy the repo tree (or just `bin/` + `deploy/`) onto the LXC.
3. Run `sudo ./deploy/install.sh` for the base install.
4. For PXE: run `wsl make live-image` on your workstation, then
   `scp live-image/build/vmlinuz lxc:/tmp/ && scp live-image/build/initrd.img lxc:/tmp/`,
   then run `pxe-setup.sh --bundle-dir /tmp` (or accept the default
   repo-tree detection when running from the repo root).

## First vetting run

Against a QEMU VM first, before you point it at real hardware:

1. In the UI at `http://<lxc-lan-ip>:8080`, register a host:
   - Name: `qemu-test`
   - MAC: `52:54:00:12:34:56`
   - WoL broadcast IP: your LAN broadcast, e.g. `192.168.1.255`
   - Expected spec: paste a minimal YAML like
     ```yaml
     memory: { total_gib: 4 }
     cpu: { logical_cores: 4 }
     ```

2. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingReboot`.

3. Launch the QEMU VM on the LAN bridge so it PXE-boots via the
   router's DHCP + our proxy-DHCP reply:

   ```
   sudo qemu-system-x86_64 \
     -enable-kvm -cpu host -smp 4 -m 4096 \
     -netdev bridge,id=n0,br=vmbr0 \
     -device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \
     -drive file=/tmp/test-disk.img,format=raw,if=virtio \
     -boot n -serial mon:stdio -display none
   ```

   (Swap `vmbr0` for whatever your Proxmox LAN bridge is called.)

4. Watch the tile advance through stages. On success, the tile shows
   **View report** and the VM auto-shuts-down.

For real repaired hardware: same flow, but register the node's actual
LAN MAC + expected spec, and make sure the node's BIOS is set to
PXE-boot from the NIC that's on the LAN.

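The WoL broadcast IP asked for in step 1 is the broadcast address of the LAN subnet. If you know the subnet in CIDR form (the same value passed to `--subnet`), it can be derived with shell arithmetic. `broadcast_for_cidr` is a hypothetical helper written for this runbook, IPv4 only, and it does no input validation.

```bash
# Hypothetical helper: prints the broadcast address of an IPv4 CIDR block.
#   $1  subnet or host address in CIDR form, e.g. 192.168.1.0/24
broadcast_for_cidr() {
  local ip prefix addr mask bcast IFS
  ip="${1%/*}"
  prefix="${1#*/}"

  # Split the dotted quad into $1..$4 (positional params are function-local).
  IFS=.
  set -- $ip

  # Pack the octets into one 32-bit integer.
  addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))

  # Netmask: the top $prefix bits set (4294967295 is 0xFFFFFFFF).
  mask=$(( (4294967295 << (32 - $prefix)) & 4294967295 ))

  # Broadcast: network bits kept, every host bit set to 1.
  bcast=$(( (addr & mask) | (mask ^ 4294967295) ))

  printf '%d.%d.%d.%d\n' \
    $(( (bcast >> 24) & 255 )) $(( (bcast >> 16) & 255 )) \
    $(( (bcast >> 8) & 255 ))  $(( bcast & 255 ))
}
```

For the example subnet used in this runbook, `broadcast_for_cidr 192.168.1.0/24` prints `192.168.1.255`.
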
## A failed run: SSH to the held host

When a stage fails, the pipeline halts at `FailedHolding` and the
agent installs an orchestrator-issued SSH key into the live-image's
`/root/.ssh/authorized_keys`. The UI tile surfaces the IP and the
exact `ssh` command.

The hold key is **per-run**. Once you're done:

1. Power the host off (`poweroff` from the SSH session).
2. In the UI, click **Override wipe-probe** only when the failure was
   at the `Storage` stage *and* you're sure the disks are expendable.
   Otherwise click **Start vetting** on a fresh run from the host
   dashboard after fixing the underlying issue.

## Log + artifact layout

```
/var/lib/vetting/
  vetting.db                # SQLite: hosts, runs, stages, artifacts, spec_diffs, measurements
  artifacts/
    run-<N>/
      report.html           # operator-facing summary
      report.json           # machine-readable summary
      inventory.json        # raw probe output
      fio-<disk>.log        # storage stage output
      iperf-<nic>.json      # network stage output
      hold-<N>.pub          # per-run SSH pubkey (only if held)
/var/log/vetting/
  run-<N>.log               # append-only per-run log tail
```

Retention is governed by the `artifacts.retention_days` and
`logs.retention_days` settings. DB rows (run history) are preserved
indefinitely; only on-disk files get pruned.

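To preview which run directories an age-based retention window would cover, a `find` over the artifacts tree gives a rough answer. This is only an approximation under stated assumptions: the orchestrator's actual pruning logic is not documented here, the sketch keys off each directory's modification time, and `list_expired_runs` is a hypothetical helper name.

```bash
# Hypothetical helper: lists run-* directories whose mtime is older than
# the retention window. It only prints; it never deletes anything.
#   $1  artifacts directory (e.g. /var/lib/vetting/artifacts)
#   $2  retention in days (the value of artifacts.retention_days)
list_expired_runs() {
  local artifacts_dir retention_days
  artifacts_dir="$1"
  retention_days="$2"

  # -mtime +N matches entries last modified more than N whole days ago.
  find "$artifacts_dir" -mindepth 1 -maxdepth 1 -type d -name 'run-*' \
    -mtime +"$retention_days" -print | sort
}
```

For example, `list_expired_runs /var/lib/vetting/artifacts 30` prints the run directories last modified more than 30 days ago.
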
## Exposing outside the LAN

The orchestrator UI has no built-in auth. It's designed to live on a
trusted home LAN and trust whatever reaches it. If you want to reach
it from outside that LAN, don't expose the bind port directly. Put
it behind a reverse proxy (Caddy, nginx, Traefik) that terminates TLS
and adds basic-auth or OIDC. The agent↔orchestrator bearer token
auth is independent and keeps working either way.

## Troubleshooting

| Symptom | First check |
|---|---|
| Host sits at PXE, no boot filename | Confirm the MAC is registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). If it is, run `sudo tcpdump -i <lan-iface> -n -e 'port 67 or port 68 or port 4011'` while the host PXEs. If you see DISCOVER/OFFER from the router but no proxy reply from us, check `journalctl -u vetting` for dnsmasq errors. |
| PXE boots but iPXE can't fetch the script | Verify the LXC's LAN IP matches `pxe.orchestrator_url` in `/etc/vetting/vetting.yaml`; iPXE bakes that URL in at chainload. |
| Agent `/hello` never fires | Check the live image is actually loading the agent binary: SSH into the live env (use the hold key path) and run `systemctl status vetting-agent`. |
| Tile stuck on `Booting` | Most likely the live image booted but the agent can't reach the orchestrator. Verify `vetting.orchestrator=` in the kernel cmdline resolves from the host's network. |
| UI shows stale stage | Force a reload; the SSE reconnect is automatic but the browser keeps the last state on ephemeral network blips. |
| Notification didn't fire | `journalctl -u vetting \| grep notify:`. Delivery is fire-and-forget, and the failure reason is logged but not persisted. |

## Upgrading

Rerun the registry-fetch one-liner on the LXC:

```
curl -fsSL https://gitea.thewrightserver.net/josh/Vetting/raw/branch/main/deploy/proxmox-install.sh \
  | sudo bash
```

That's it: `install.sh` auto-restarts `vetting.service` when it's
already enabled, and re-stages `vmlinuz`/`initrd.img` into
`/var/lib/vetting/live/` only when the bundle points at a new
`live-image/VERSION`. Watch the logs with `journalctl -fu vetting`.

The DB migration runs at startup and is append-only, so no manual
schema work is needed unless a release's notes call it out.