Automate PXE setup: release bundle + pxe-setup.sh + startup validation
CI / Lint + build + test (push) Has been cancelled

Collapses the LXC side of PXE enablement from a six-step manual dance
(build, fetch iPXE, scp, bridge, hand-edit yaml) into:

  make release                   # dev box (Linux/WSL)
  scp bundle.tar.gz lxc:/tmp/
  sudo ./install.sh              # base install, unchanged
  sudo ./pxe-setup.sh --interface ... --dhcp-range ... --orchestrator-url ...

pxe-setup.sh fetches iPXE from boot.ipxe.org, verifies against pinned
SHA256s in deploy/ipxe-shas.txt (fail-closed), places vmlinuz/initrd.img
from the bundle, and rewrites only the pxe: block of vetting.yaml.
Idempotent; --force gates overwriting a hand-edited block.

Adds Supervisor.Validate() — called before dnsmasq spawn — so typo'd
configs fail at orchestrator startup with clear errors naming the
missing file or yaml key, instead of silently serving broken TFTP
until a real host tries to PXE-boot. Nine tests cover missing files,
bogus interface, malformed dhcp_range, bad orchestrator_url, and
aggregate reporting.

Hypervisor bridge creation stays documented (LXC can't do it) but
everything downstream of the bridge is now scripted.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-04-18 01:38:43 -04:00
parent d245fa6235
commit a5055b3c7a
8 changed files with 660 additions and 54 deletions
+83 -45
View File
@@ -11,66 +11,104 @@ Target: a Debian/Ubuntu LXC on the Proxmox host that holds the cluster
you're vetting for. The LXC must be on the same L2 segment as the
repaired nodes so DHCP and WoL work.
1. On your workstation, cross-build the binary:
### One-shot release bundle (recommended)
```
make orchestrator-linux
```
On your dev workstation (Linux, or WSL on Windows):
This produces `bin/vetting-linux-amd64`.
```
make release
```
2. Copy the repo tree (or just `bin/`, `deploy/`) into the LXC, then
from inside the LXC:
Produces `bin/vetting-bundle-<sha>.tar.gz` containing the orchestrator
binary, agent binary, live image (`vmlinuz` + `initrd.img`), install
scripts, `vetting.service`, the production yaml, and the pinned iPXE
SHA256 file.
```
sudo ./deploy/install.sh
```
Ship it to the LXC:
The installer:
- `apt install`s `dnsmasq`, `iperf3`, `ca-certificates`
- creates the `vetting` system user (home = `/var/lib/vetting`)
- installs the binary into `/usr/local/bin/vetting`
- drops `vetting.example.yaml` into `/etc/vetting/vetting.yaml`
(only if there's no existing config — existing configs are
preserved)
- drops `/etc/systemd/system/vetting.service`
- disables the distro-default dnsmasq (the orchestrator supervises
its own)
```
scp bin/vetting-bundle-<sha>.tar.gz lxc:/tmp/
ssh lxc 'cd /tmp && tar xzf vetting-bundle-*.tar.gz'
ssh lxc 'cd /tmp/vetting-bundle-<sha> && sudo ./install.sh'
```
The installer does **not** enable the service. You'll want to edit
the config first.
`install.sh` does the base install (user, binaries, config, systemd
unit). If you don't need PXE (e.g. host-mode reporter only, no
automated live-boots), you can stop here — edit
`/etc/vetting/vetting.yaml` to tune `server.bind` / `public_url`,
then `sudo systemctl enable --now vetting`.
3. Edit `/etc/vetting/vetting.yaml`:
### PXE enablement
- `server.bind` — defaults to `127.0.0.1:8080`. Switch to
`0.0.0.0:8080` (or bind to a specific LAN IP) once you're ready
to expose it. There is no built-in auth — see *Exposing outside
the LAN* below.
- `server.public_url` — the URL your browser hits the LXC on
(e.g. `http://vetting.lan:8080`). Used as the click-through link
in notifications.
PXE is gated behind a second script so non-PXE installs stay simple.
4. (Optional) Configure notifiers in the same file — see the
commented-out example block for ntfy / Discord / SMTP.
**Prerequisite: dedicated PXE bridge on the Proxmox hypervisor.** The
LXC can't create bridges on its host, so do this once on the Proxmox
node (not inside the LXC):
5. Enable and start:
```
sudo ip link add br-vetting type bridge
sudo ip addr add 10.77.0.1/24 dev br-vetting
sudo ip link set br-vetting up
```
```
sudo systemctl enable --now vetting
sudo journalctl -fu vetting
```
Attach a veth from the LXC onto `br-vetting` (e.g. `eth1` inside the
LXC at `10.77.0.2/24`). Repaired nodes PXE-boot from a NIC cabled or
bridged onto `br-vetting` only — keep this network isolated from your
household DHCP, or both DHCP servers will fight.
On the LXC, inside the extracted bundle:
```
sudo ./pxe-setup.sh \
--interface eth1 \
--dhcp-range 10.77.0.100,10.77.0.200,12h \
--orchestrator-url http://10.77.0.2:8080
```
The script:
- Fetches `ipxe.efi` + `undionly.kpxe` from boot.ipxe.org and verifies
SHA256 against `ipxe-shas.txt` (fail-closed on mismatch).
- Places `vmlinuz` + `initrd.img` into `/var/lib/vetting/live/`.
- Rewrites the `pxe:` block of `/etc/vetting/vetting.yaml` to enable
PXE with the flags you passed.
It does **not** restart the service — review the rendered config,
then:
```
sudo systemctl restart vetting
sudo journalctl -fu vetting
```
The orchestrator validates PXE preconditions at startup (interface
exists, iPXE binaries are on disk, `dhcp_range` parses) and exits
non-zero with a clear error if anything's wrong, instead of failing
silently when a host first PXE-boots.
`pxe-setup.sh` is idempotent — safe to re-run. Pass `--force` to
overwrite a hand-edited `pxe:` block.
### Manual install (no release tarball)
For dev-loop iteration on the LXC itself:
1. On your workstation: `make orchestrator-linux && make agent-linux`
2. Copy the repo tree (or just `bin/` + `deploy/`) onto the LXC
3. `sudo ./deploy/install.sh` → base install
4. For PXE: `wsl make live-image` on your workstation,
`scp live-image/build/vmlinuz lxc:/tmp/ && scp live-image/build/initrd.img lxc:/tmp/`,
then run `pxe-setup.sh --bundle-dir /tmp` (or accept the default
repo-tree detection when running from the repo root).
## First vetting run
Against a QEMU VM first, before you point it at real hardware:
1. On the Proxmox host (or wherever your LXC lives):
```
sudo ip link add br-vetting type bridge
sudo ip addr add 10.77.0.1/24 dev br-vetting
sudo ip link set br-vetting up
```
1. Make sure the `br-vetting` bridge exists on the hypervisor (see
above). From inside the LXC, confirm it's reachable on your
PXE-side interface.
2. In the UI at `http://<lxc>:8080`, register a host:
- Name: `qemu-test`
@@ -82,7 +120,7 @@ Against a QEMU VM first, before you point it at real hardware:
cpu: { logical_cores: 4 }
```
3. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingWoL`.
3. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingReboot`.
4. Launch the QEMU VM on the bridge so it PXE-boots from dnsmasq: