9bb4b09a04
CI / Lint + build + test (push) Has been cancelled
Post-repair hardware validation pipeline for Proxmox cluster hosts. Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.
172 lines
6.3 KiB
Markdown
172 lines
6.3 KiB
Markdown
# Operations
|
|
|
|
Operator-facing runbook for the vetting orchestrator. If you're looking
|
|
for the "what does the system do" overview, see
|
|
[architecture.md](architecture.md). For what each test stage actually
|
|
measures, see [test-suite.md](test-suite.md).
|
|
|
|
## Install (Proxmox LXC)
|
|
|
|
Target: a Debian/Ubuntu LXC on the Proxmox host that holds the cluster
|
|
you're vetting for. The LXC must be on the same L2 segment as the
|
|
repaired nodes so DHCP and WoL work.
|
|
|
|
1. On your workstation, cross-build the binary:
|
|
|
|
```
|
|
make orchestrator-linux
|
|
```
|
|
|
|
This produces `bin/vetting-linux-amd64`.
|
|
|
|
2. Copy the repo tree (or just `bin/`, `deploy/`) into the LXC, then
|
|
from inside the LXC:
|
|
|
|
```
|
|
sudo ./deploy/install.sh
|
|
```
|
|
|
|
The installer:
|
|
- `apt install`s `dnsmasq`, `iperf3`, `ca-certificates`
|
|
- creates the `vetting` system user (home = `/var/lib/vetting`)
|
|
- installs the binary into `/usr/local/bin/vetting`
|
|
- drops `vetting.example.yaml` into `/etc/vetting/vetting.yaml`
|
|
(only if there's no existing config — existing configs are
|
|
preserved)
|
|
- drops `/etc/systemd/system/vetting.service`
|
|
- disables the distro-default dnsmasq (the orchestrator supervises
|
|
its own)
|
|
|
|
The installer does **not** enable the service, because the default
|
|
config has a placeholder bcrypt password that the binary refuses to
|
|
start with.
|
|
|
|
3. Generate an admin password hash and a session secret, then edit
|
|
`/etc/vetting/vetting.yaml`:
|
|
|
|
```
|
|
./bin/gen-admin-password 'your-password-here' # prints a bcrypt hash
|
|
openssl rand -hex 32 # prints a 64-char hex string
|
|
```
|
|
|
|
Required fields:
|
|
- `auth.admin_password_bcrypt` — the bcrypt hash
|
|
- `auth.session_secret_hex` — the 32-byte hex string
|
|
- `server.public_url` — the URL your browser hits the LXC on
|
|
(e.g. `https://vetting.lan:8443`). This is used as the
|
|
click-through link in notifications, so it must be the *external*
|
|
URL, not the bind address.
|
|
|
|
4. (Optional) Configure notifiers in the same file — see the
|
|
commented-out example block for ntfy / Discord / SMTP.
|
|
|
|
5. Enable and start:
|
|
|
|
```
|
|
sudo systemctl enable --now vetting
|
|
sudo journalctl -fu vetting
|
|
```
|
|
|
|
## First vetting run
|
|
|
|
Against a QEMU VM first, before you point it at real hardware:
|
|
|
|
1. On the Proxmox host (or wherever your LXC lives):
|
|
|
|
```
|
|
sudo ip link add br-vetting type bridge
|
|
sudo ip addr add 10.77.0.1/24 dev br-vetting
|
|
sudo ip link set br-vetting up
|
|
```
|
|
|
|
2. In the UI at `https://<lxc>:8443`, log in and register a host:
|
|
- Name: `qemu-test`
|
|
- MAC: `52:54:00:12:34:56`
|
|
- WoL broadcast IP: `10.77.0.255`
|
|
- Expected spec: paste a minimal YAML like
|
|
```yaml
|
|
memory: { total_gib: 4 }
|
|
cpu: { logical_cores: 4 }
|
|
```
|
|
|
|
3. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingWoL`.
|
|
|
|
4. Launch the QEMU VM on the bridge so it PXE-boots from dnsmasq:
|
|
|
|
```
|
|
sudo qemu-system-x86_64 \
|
|
-enable-kvm -cpu host -smp 4 -m 4096 \
|
|
-netdev bridge,id=n0,br=br-vetting \
|
|
-device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \
|
|
-drive file=/tmp/test-disk.img,format=raw,if=virtio \
|
|
-boot n -serial mon:stdio -display none
|
|
```
|
|
|
|
5. Watch the tile advance through stages. On success, the tile shows
|
|
**View report** and the VM auto-shuts-down.
|
|
|
|
For real repaired hardware: same flow, but register the node's actual
|
|
MAC + expected spec, and make sure the node's BIOS is set to PXE-boot
|
|
from the NIC that's on the `br-vetting` network.
|
|
|
|
## A failed run — SSH to the held host
|
|
|
|
When a stage fails, the pipeline halts at `FailedHolding` and the
|
|
agent installs an orchestrator-issued SSH key into the live-image's
|
|
`/root/.ssh/authorized_keys`. The UI tile surfaces the IP and the
|
|
exact `ssh` command.
|
|
|
|
The hold key is **per-run**. Once you're done:
|
|
|
|
1. Power the host off (`poweroff` from the SSH session).
|
|
2. In the UI, click **Override wipe-probe** only when the failure was
|
|
at the `Storage` stage *and* you're sure the disks are expendable.
|
|
Otherwise click **Start vetting** on a fresh run from the host
|
|
dashboard after fixing the underlying issue.
|
|
|
|
## Log + artifact layout
|
|
|
|
```
|
|
/var/lib/vetting/
|
|
vetting.db # SQLite: hosts, runs, stages, artifacts, spec_diffs, measurements
|
|
artifacts/
|
|
run-<N>/
|
|
report.html # operator-facing summary
|
|
report.json # machine-readable summary
|
|
inventory.json # raw probe output
|
|
fio-<disk>.log # storage stage output
|
|
iperf-<nic>.json # network stage output
|
|
hold-<N>.pub # per-run SSH pubkey (only if held)
|
|
/var/log/vetting/
|
|
run-<N>.log # append-only per-run log tail
|
|
```
|
|
|
|
Retention is governed by the `artifacts.retention_days` and
|
|
`logs.retention_days` settings. DB rows (run history) are preserved
|
|
indefinitely; only on-disk files get pruned.
|
|
|
|
## Troubleshooting
|
|
|
|
| Symptom | First check |
|
|
|---|---|
|
|
| Service refuses to start with `auth.admin_password_bcrypt is the placeholder` | You didn't replace the bcrypt hash in the config. Run `gen-admin-password`. |
|
|
| PXE client gets no DHCP offer | `journalctl -u vetting` for dnsmasq errors; confirm the LXC has `CAP_NET_ADMIN` (the shipped systemd unit does); confirm the host MAC is actually registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). |
|
|
| Agent `/hello` never fires | Check the live image is actually loading the agent binary — SSH into the live env (use the hold key path), `systemctl status vetting-agent`. |
|
|
| Tile stuck on `Booting` | Most likely the live image booted but the agent can't reach the orchestrator. Verify `vetting.orchestrator=` in the kernel cmdline resolves from the host's network. |
|
|
| UI shows stale stage | Force a reload; the SSE reconnect is automatic but the browser keeps the last state on ephemeral network blips. |
|
|
| Notification didn't fire | `journalctl -u vetting \| grep notify:` — delivery is fire-and-forget and the failure reason is logged but not persisted. |
|
|
|
|
## Upgrading
|
|
|
|
1. `make orchestrator-linux` on your workstation.
|
|
2. `scp bin/vetting-linux-amd64 lxc:/tmp/vetting.new`
|
|
3. On the LXC:
|
|
```
|
|
sudo systemctl stop vetting
|
|
sudo install -m 0755 /tmp/vetting.new /usr/local/bin/vetting
|
|
sudo systemctl start vetting
|
|
```
|
|
|
|
The DB migration runs at startup and is append-only — no manual schema
|
|
work unless a release's notes call it out.
|