Vetting/docs/operations.md

# Operations

Operator-facing runbook for the vetting orchestrator. If you're looking
for the "what does the system do" overview, see
[architecture.md](architecture.md). For what each test stage actually
measures, see [test-suite.md](test-suite.md).

## Install (Proxmox LXC)

Target: a Debian/Ubuntu LXC on the Proxmox host that holds the cluster
you're vetting for. The LXC must be on the same L2 segment as the
repaired nodes so DHCP and WoL work.

1. On your workstation, cross-build the binary:

   ```
   make orchestrator-linux
   ```

   This produces `bin/vetting-linux-amd64`.

2. Copy the repo tree (or just `bin/`, `deploy/`) into the LXC, then
   from inside the LXC:

   ```
   sudo ./deploy/install.sh
   ```

   The installer:
   - `apt install`s `dnsmasq`, `iperf3`, `ca-certificates`
   - creates the `vetting` system user (home = `/var/lib/vetting`)
   - installs the binary into `/usr/local/bin/vetting`
   - drops `vetting.example.yaml` into `/etc/vetting/vetting.yaml`
     (only if there's no existing config — existing configs are
     preserved)
   - drops `/etc/systemd/system/vetting.service`
   - disables the distro-default dnsmasq (the orchestrator supervises
     its own)

   The installer does **not** enable the service. You'll want to edit
   the config first.

3. Edit `/etc/vetting/vetting.yaml`:

   - `server.bind` — defaults to `127.0.0.1:8080`. Switch to
     `0.0.0.0:8080` (or bind to a specific LAN IP) once you're ready
     to expose it. There is no built-in auth — see *Exposing outside
     the LAN* below.
   - `server.public_url` — the URL your browser hits the LXC on
     (e.g. `http://vetting.lan:8080`). Used as the click-through link
     in notifications.

4. (Optional) Configure notifiers in the same file — see the
   commented-out example block for ntfy / Discord / SMTP.

5. Enable and start:

   ```
   sudo systemctl enable --now vetting
   sudo journalctl -fu vetting
   ```

## First vetting run

Against a QEMU VM first, before you point it at real hardware:

1. On the Proxmox host (or wherever your LXC lives):

   ```
   sudo ip link add br-vetting type bridge
   sudo ip addr add 10.77.0.1/24 dev br-vetting
   sudo ip link set br-vetting up
   ```

2. In the UI at `http://<lxc>:8080`, register a host:
   - Name: `qemu-test`
   - MAC: `52:54:00:12:34:56`
   - WoL broadcast IP: `10.77.0.255`
   - Expected spec: paste a minimal YAML like
     ```yaml
     memory: { total_gib: 4 }
     cpu: { logical_cores: 4 }
     ```

3. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingWoL`.

4. Launch the QEMU VM on the bridge so it PXE-boots from dnsmasq:

   ```
   sudo qemu-system-x86_64 \
     -enable-kvm -cpu host -smp 4 -m 4096 \
     -netdev bridge,id=n0,br=br-vetting \
     -device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \
     -drive file=/tmp/test-disk.img,format=raw,if=virtio \
     -boot n -serial mon:stdio -display none
   ```

5. Watch the tile advance through stages. On success, the tile shows
   **View report** and the VM auto-shuts-down.

For real repaired hardware: same flow, but register the node's actual
MAC + expected spec, and make sure the node's BIOS is set to PXE-boot
from the NIC that's on the `br-vetting` network.

## A failed run — SSH to the held host

When a stage fails, the pipeline halts at `FailedHolding` and the
agent installs an orchestrator-issued SSH key into the live-image's
`/root/.ssh/authorized_keys`. The UI tile surfaces the IP and the
exact `ssh` command.

The hold key is **per-run**. Once you're done:

1. Power the host off (`poweroff` from the SSH session).
2. In the UI, click **Override wipe-probe** only when the failure was
   at the `Storage` stage *and* you're sure the disks are expendable.
   Otherwise click **Start vetting** on a fresh run from the host
   dashboard after fixing the underlying issue.

## Log + artifact layout

```
/var/lib/vetting/
  vetting.db                 # SQLite: hosts, runs, stages, artifacts, spec_diffs, measurements
  artifacts/
    run-<N>/
      report.html            # operator-facing summary
      report.json            # machine-readable summary
      inventory.json         # raw probe output
      fio-<disk>.log         # storage stage output
      iperf-<nic>.json       # network stage output
      hold-<N>.pub           # per-run SSH pubkey (only if held)
/var/log/vetting/
  run-<N>.log                # append-only per-run log tail
```

Retention is governed by the `artifacts.retention_days` and
`logs.retention_days` settings. DB rows (run history) are preserved
indefinitely; only on-disk files get pruned.

## Exposing outside the LAN

The orchestrator UI has no built-in auth. It's designed to live on a
trusted home LAN and trust whatever reaches it. If you want to reach
it from outside that LAN, don't expose the bind port directly — put
it behind a reverse proxy (Caddy, nginx, Traefik) that terminates TLS
and adds basic-auth or OIDC. The agent↔orchestrator bearer token
auth is independent and keeps working either way.

## Troubleshooting

| Symptom | First check |
|---|---|
| PXE client gets no DHCP offer | `journalctl -u vetting` for dnsmasq errors; confirm the LXC has `CAP_NET_ADMIN` (the shipped systemd unit does); confirm the host MAC is actually registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). |
| Agent `/hello` never fires | Check the live image is actually loading the agent binary — SSH into the live env (use the hold key path), `systemctl status vetting-agent`. |
| Tile stuck on `Booting` | Most likely the live image booted but the agent can't reach the orchestrator. Verify `vetting.orchestrator=` in the kernel cmdline resolves from the host's network. |
| UI shows stale stage | Force a reload; the SSE reconnect is automatic but the browser keeps the last state on ephemeral network blips. |
| Notification didn't fire | `journalctl -u vetting \| grep notify:` — delivery is fire-and-forget and the failure reason is logged but not persisted. |

## Upgrading

1. `make orchestrator-linux` on your workstation.
2. `scp bin/vetting-linux-amd64 lxc:/tmp/vetting.new`
3. On the LXC:
   ```
   sudo systemctl stop vetting
   sudo install -m 0755 /tmp/vetting.new /usr/local/bin/vetting
   sudo systemctl start vetting
   ```

The DB migration runs at startup and is append-only — no manual schema
work unless a release's notes call it out.