docs+e2e: document proxy-DHCP topology; default e2e bridge to LAN

Rewrites the PXE section of the ops runbook around the new proxy-DHCP
model (no dedicated bridge, coexists with UniFi/pfSense/etc.) and
swaps the e2e test's default bridge + orchestrator URL to match. The
e2e file now calls out the LAN-DHCP precondition in its header so
future-me (or CI) doesn't hang at PXE wondering why nothing answers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 12:07:05 -04:00
parent 506c856046
commit bcbbc35489
2 changed files with 45 additions and 34 deletions
+36 -32
@@ -60,28 +60,26 @@ Same bundle layout either way.
PXE is gated behind a second script so non-PXE installs stay simple.
**Prerequisite: dedicated PXE bridge on the Proxmox hypervisor.** The
LXC can't create bridges on its host, so do this once on the Proxmox
node (not inside the LXC):
**How it works on the network.** dnsmasq runs in **proxy-DHCP mode**:
it binds to the LXC's LAN interface and *coexists* with your
existing DHCP server (UniFi, pfSense, Asus, etc.). The router still
hands out LAN IPs the normal way; dnsmasq only answers the PXE
options (boot server + filename) and only for MACs you've registered
in the UI. A random laptop booting from network on the same LAN gets
a LAN IP from the router and nothing from us — the MAC allowlist is
the safety barrier.
```
sudo ip link add br-vetting type bridge
sudo ip addr add 10.77.0.1/24 dev br-vetting
sudo ip link set br-vetting up
```
Attach a veth from the LXC onto `br-vetting` (e.g. `eth1` inside the
LXC at `10.77.0.2/24`). Repaired nodes PXE-boot from a NIC cabled or
bridged onto `br-vetting` only — keep this network isolated from your
household DHCP, or both DHCP servers will fight.
That means **no dedicated bridge, no VLAN, no cabling changes**. The
LXC just needs an interface on the same L2 segment as the hosts
you're repairing — typically `eth0` on the LAN bridge.
On the LXC, inside the extracted bundle:
```
sudo ./pxe-setup.sh \
--interface eth1 \
--dhcp-range 10.77.0.100,10.77.0.200,12h \
--orchestrator-url http://10.77.0.2:8080
--interface eth0 \
--subnet 192.168.1.0/24 \
--orchestrator-url http://<lxc-lan-ip>:8080
```
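The `<lxc-lan-ip>` placeholder is the address already assigned to the LXC's LAN interface. A minimal sketch for reading it, assuming the interface is `eth0` and that iproute2's `ip` is installed; `first_ipv4` is a hypothetical helper name, not something the bundle ships:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the bundle): read `ip -4 -o addr show`
# output on stdin and print the first IPv4 address without its prefix length.
first_ipv4() {
  awk '$3 == "inet" { split($4, parts, "/"); print parts[1]; exit }'
}

# Usage on the LXC (assumes the LAN interface is eth0):
#   ip -4 -o addr show dev eth0 | first_ipv4
```

The printed address is what goes into `--orchestrator-url http://<lxc-lan-ip>:8080`.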
The script:
@@ -101,13 +99,19 @@ sudo journalctl -fu vetting
```
The orchestrator validates PXE preconditions at startup (interface
exists, iPXE binaries are on disk, `dhcp_range` parses) and exits
non-zero with a clear error if anything's wrong, instead of failing
silently when a host first PXE-boots.
exists, iPXE binaries are on disk, `subnet` parses as CIDR) and
exits non-zero with a clear error if anything's wrong, instead of
failing silently when a host first PXE-boots.
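The orchestrator's own validation code is not shown here. As a rough shell approximation of two of those checks (interface exists, `subnet` parses as CIDR), something like the following could be run by hand before starting the service. Both function names are hypothetical, and the interface check assumes a Linux `/sys/class/net` layout:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; they approximate, not reproduce, the
# orchestrator's startup checks.

# Succeeds if the named network interface exists (Linux sysfs assumed).
iface_exists() {
  [ -e "/sys/class/net/$1" ]
}

# Succeeds if the argument looks like an IPv4 CIDR such as 192.168.1.0/24:
# four octets in 0..255 and a prefix length in 0..32.
is_ipv4_cidr() {
  local cidr=$1 ip prefix octet
  local re='^([0-9]{1,3}\.){3}[0-9]{1,3}/[0-9]{1,2}$'
  [[ $cidr =~ $re ]] || return 1
  ip=${cidr%/*}
  prefix=${cidr#*/}
  # 10# forces base 10 so a leading zero is not read as octal.
  (( 10#$prefix <= 32 )) || return 1
  local IFS=.
  for octet in $ip; do
    (( 10#$octet <= 255 )) || return 1
  done
  return 0
}

# Usage (eth0 and the subnet are the example values from the setup command):
#   iface_exists eth0 && is_ipv4_cidr 192.168.1.0/24 && echo "preconditions look OK"
```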
`pxe-setup.sh` is idempotent — safe to re-run. Pass `--force` to
overwrite a hand-edited `pxe:` block.
**Router caveat.** Most home/prosumer routers (UniFi, Asus, Netgear,
etc.) don't send PXE options, so proxy mode just works. pfSense and
OPNsense *can* serve PXE themselves — if yours does, disable its
TFTP/netboot feature so there's only one PXE authority on the
segment.
### Dev-loop install (from a source checkout)
For iterating on the orchestrator without waiting for a CI publish:
@@ -124,39 +128,38 @@ For iterating on the orchestrator without waiting for a CI publish:
Against a QEMU VM first, before you point it at real hardware:
1. Make sure the `br-vetting` bridge exists on the hypervisor (see
above). From inside the LXC, confirm it's reachable on your
PXE-side interface.
2. In the UI at `http://<lxc>:8080`, register a host:
1. In the UI at `http://<lxc-lan-ip>:8080`, register a host:
- Name: `qemu-test`
- MAC: `52:54:00:12:34:56`
- WoL broadcast IP: `10.77.0.255`
- WoL broadcast IP: your LAN broadcast, e.g. `192.168.1.255`
- Expected spec: paste a minimal YAML like
```yaml
memory: { total_gib: 4 }
cpu: { logical_cores: 4 }
```
3. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingReboot`.
2. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingReboot`.
4. Launch the QEMU VM on the bridge so it PXE-boots from dnsmasq:
3. Launch the QEMU VM on the LAN bridge so it PXE-boots via the
router's DHCP + our proxy-DHCP reply:
```
sudo qemu-system-x86_64 \
-enable-kvm -cpu host -smp 4 -m 4096 \
-netdev bridge,id=n0,br=br-vetting \
-netdev bridge,id=n0,br=vmbr0 \
-device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \
-drive file=/tmp/test-disk.img,format=raw,if=virtio \
-boot n -serial mon:stdio -display none
```
5. Watch the tile advance through stages. On success, the tile shows
(Swap `vmbr0` for whatever your Proxmox LAN bridge is called.)
4. Watch the tile advance through stages. On success, the tile shows
**View report** and the VM auto-shuts-down.
For real repaired hardware: same flow, but register the node's actual
MAC + expected spec, and make sure the node's BIOS is set to PXE-boot
from the NIC that's on the `br-vetting` network.
LAN MAC + expected spec, and make sure the node's BIOS is set to
PXE-boot from the NIC that's on the LAN.
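The WoL broadcast IP asked for in step 1 is normally the last address of the LAN subnet. A minimal sketch for deriving it from a CIDR such as the one passed to `--subnet`; `broadcast_of` is a hypothetical helper, and it assumes a well-formed IPv4 CIDR as input:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the IPv4 broadcast address for a CIDR,
# e.g. 192.168.1.0/24. No input validation is done here.
broadcast_of() {
  local cidr=$1 ip prefix a b c d addr mask bcast
  ip=${cidr%/*}
  prefix=${cidr#*/}
  IFS=. read -r a b c d <<<"$ip"
  # Pack the four octets into one 32-bit integer (10# avoids octal parsing).
  addr=$(( (10#$a << 24) | (10#$b << 16) | (10#$c << 8) | 10#$d ))
  # Netmask with the top $prefix bits set.
  mask=$(( (0xFFFFFFFF << (32 - 10#$prefix)) & 0xFFFFFFFF ))
  # Broadcast = network bits kept, all host bits set to 1.
  bcast=$(( (addr & mask) | (~mask & 0xFFFFFFFF) ))
  printf '%d.%d.%d.%d\n' \
    $(( (bcast >> 24) & 255 )) $(( (bcast >> 16) & 255 )) \
    $(( (bcast >> 8) & 255 )) $(( bcast & 255 ))
}

# Usage:
#   broadcast_of 192.168.1.0/24
```

For the example subnet this gives `192.168.1.255`, matching the value suggested in the host registration step.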
## A failed run — SSH to the held host
@@ -207,7 +210,8 @@ auth is independent and keeps working either way.
| Symptom | First check |
|---|---|
| PXE client gets no DHCP offer | `journalctl -u vetting` for dnsmasq errors; confirm the LXC has `CAP_NET_ADMIN` (the shipped systemd unit does); confirm the host MAC is actually registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). |
| Host sits at PXE, no boot filename | Confirm the MAC is registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). If it is, `sudo tcpdump -i <lan-iface> -n -e 'port 67 or port 68 or port 4011'` while the host PXEs — if you see DISCOVER/OFFER from the router but no proxy reply from us, check `journalctl -u vetting` for dnsmasq errors. |
| PXE boots but iPXE can't fetch the script | Verify the LXC's LAN IP matches `pxe.orchestrator_url` in `/etc/vetting/vetting.yaml` — iPXE bakes that URL in at chainload. |
| Agent `/hello` never fires | Check that the live image is actually loading the agent binary: SSH into the live env (use the hold key path) and run `systemctl status vetting-agent`. |
| Tile stuck on `Booting` | Most likely the live image booted but the agent can't reach the orchestrator. Verify `vetting.orchestrator=` in the kernel cmdline resolves from the host's network. |
| UI shows stale stage | Force a reload; the SSE reconnect is automatic, but the browser keeps showing the last state through brief network blips. |