docs+e2e: document proxy-DHCP topology; default e2e bridge to LAN

Rewrites the PXE section of the ops runbook around the new proxy-DHCP model (no dedicated bridge, coexists with UniFi/pfSense/etc.) and swaps the e2e test's default bridge + orchestrator URL to match. The e2e file now calls out the LAN-DHCP precondition in its header so future-me (or CI) doesn't hang at PXE wondering why nothing answers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
+36 -32
@@ -60,28 +60,26 @@ Same bundle layout either way.

 PXE is gated behind a second script so non-PXE installs stay simple.

-**Prerequisite: dedicated PXE bridge on the Proxmox hypervisor.** The
-LXC can't create bridges on its host, so do this once on the Proxmox
-node (not inside the LXC):
+**How it works on the network.** dnsmasq runs in **proxy-DHCP mode**:
+it binds to the LXC's LAN interface and *coexists* with your
+existing DHCP server (UniFi, pfSense, Asus, etc.). The router still
+hands out LAN IPs the normal way; dnsmasq only answers the PXE
+options (boot server + filename) and only for MACs you've registered
+in the UI. A random laptop booting from network on the same LAN gets
+a LAN IP from the router and nothing from us — the MAC allowlist is
+the safety barrier.

-```
-sudo ip link add br-vetting type bridge
-sudo ip addr add 10.77.0.1/24 dev br-vetting
-sudo ip link set br-vetting up
-```
-
-Attach a veth from the LXC onto `br-vetting` (e.g. `eth1` inside the
-LXC at `10.77.0.2/24`). Repaired nodes PXE-boot from a NIC cabled or
-bridged onto `br-vetting` only — keep this network isolated from your
-household DHCP, or both DHCP servers will fight.
+That means **no dedicated bridge, no VLAN, no cabling changes**. The
+LXC just needs an interface on the same L2 segment as the hosts
+you're repairing — typically `eth0` on the LAN bridge.

 On the LXC, inside the extracted bundle:

 ```
 sudo ./pxe-setup.sh \
-  --interface eth1 \
-  --dhcp-range 10.77.0.100,10.77.0.200,12h \
-  --orchestrator-url http://10.77.0.2:8080
+  --interface eth0 \
+  --subnet 192.168.1.0/24 \
+  --orchestrator-url http://<lxc-lan-ip>:8080
 ```

 The script:
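
The allowlist rule in the hunk above (answer PXE only for registered MACs, ignore everyone else) can be sketched in Go as follows. This is an illustration under stated assumptions, not the orchestrator's code: the `allowlist` type and the `register` and `shouldAnswerPXE` names are hypothetical, and the real lookup presumably goes through the hosts database rather than an in-memory map.

```go
package main

import (
	"fmt"
	"net"
)

// allowlist is a hypothetical in-memory set of registered MACs, keyed by
// the canonical lower-case form that net.HardwareAddr.String() produces.
type allowlist map[string]struct{}

// register parses and stores a MAC address; malformed input is rejected.
func (a allowlist) register(mac string) error {
	hw, err := net.ParseMAC(mac)
	if err != nil {
		return fmt.Errorf("register %q: %w", mac, err)
	}
	a[hw.String()] = struct{}{}
	return nil
}

// shouldAnswerPXE reports whether a boot request from mac should get a
// proxy-DHCP reply. Unknown or malformed MACs get no reply.
func (a allowlist) shouldAnswerPXE(mac string) bool {
	hw, err := net.ParseMAC(mac)
	if err != nil {
		return false
	}
	_, ok := a[hw.String()]
	return ok
}

func main() {
	hosts := allowlist{}
	if err := hosts.register("52:54:00:12:34:56"); err != nil {
		panic(err)
	}
	fmt.Println(hosts.shouldAnswerPXE("52:54:00:12:34:56")) // registered host: true
	fmt.Println(hosts.shouldAnswerPXE("aa:bb:cc:dd:ee:ff")) // random laptop: false
}
```

Normalizing through `net.ParseMAC` means the same address matches whether it was typed with colons or hyphens, which is the kind of mismatch that would otherwise look like "host sits at PXE, nothing answers".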
@@ -101,13 +99,19 @@ sudo journalctl -fu vetting
 ```

 The orchestrator validates PXE preconditions at startup (interface
-exists, iPXE binaries are on disk, `dhcp_range` parses) and exits
-non-zero with a clear error if anything's wrong, instead of failing
-silently when a host first PXE-boots.
+exists, iPXE binaries are on disk, `subnet` parses as CIDR) and
+exits non-zero with a clear error if anything's wrong, instead of
+failing silently when a host first PXE-boots.

 `pxe-setup.sh` is idempotent — safe to re-run. Pass `--force` to
 overwrite a hand-edited `pxe:` block.

+**Router caveat.** Most home/prosumer routers (UniFi, Asus, Netgear,
+etc.) don't send PXE options, so proxy mode just works. pfSense and
+OPNsense *can* serve PXE themselves — if yours does, disable its
+TFTP/netboot feature so there's only one PXE authority on the
+segment.
+
 ### Dev-loop install (from a source checkout)

 For iterating on the orchestrator without waiting for a CI publish:
@@ -124,39 +128,38 @@ For iterating on the orchestrator without waiting for a CI publish:

 Against a QEMU VM first, before you point it at real hardware:

-1. Make sure the `br-vetting` bridge exists on the hypervisor (see
-   above). From inside the LXC, confirm it's reachable on your
-   PXE-side interface.
-
-2. In the UI at `http://<lxc>:8080`, register a host:
+1. In the UI at `http://<lxc-lan-ip>:8080`, register a host:
    - Name: `qemu-test`
    - MAC: `52:54:00:12:34:56`
-   - WoL broadcast IP: `10.77.0.255`
+   - WoL broadcast IP: your LAN broadcast, e.g. `192.168.1.255`
    - Expected spec: paste a minimal YAML like
      ```yaml
      memory: { total_gib: 4 }
      cpu: { logical_cores: 4 }
      ```

-3. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingReboot`.
+2. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingReboot`.

-4. Launch the QEMU VM on the bridge so it PXE-boots from dnsmasq:
+3. Launch the QEMU VM on the LAN bridge so it PXE-boots via the
+   router's DHCP + our proxy-DHCP reply:

    ```
    sudo qemu-system-x86_64 \
      -enable-kvm -cpu host -smp 4 -m 4096 \
-     -netdev bridge,id=n0,br=br-vetting \
+     -netdev bridge,id=n0,br=vmbr0 \
      -device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \
      -drive file=/tmp/test-disk.img,format=raw,if=virtio \
      -boot n -serial mon:stdio -display none
    ```

-5. Watch the tile advance through stages. On success, the tile shows
+   (Swap `vmbr0` for whatever your Proxmox LAN bridge is called.)
+
+4. Watch the tile advance through stages. On success, the tile shows
    **View report** and the VM auto-shuts-down.

 For real repaired hardware: same flow, but register the node's actual
-MAC + expected spec, and make sure the node's BIOS is set to PXE-boot
-from the NIC that's on the `br-vetting` network.
+LAN MAC + expected spec, and make sure the node's BIOS is set to
+PXE-boot from the NIC that's on the LAN.

 ## A failed run — SSH to the held host

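
The "WoL broadcast IP" value in the hunk above follows directly from the `--subnet` value passed to `pxe-setup.sh`: it is the network address with every host bit set. A small Go sketch of that arithmetic, with `broadcastAddr` as a hypothetical helper name (nothing in the diff shows the orchestrator computing this itself):

```go
package main

import (
	"fmt"
	"net"
)

// broadcastAddr returns the IPv4 directed-broadcast address of a CIDR
// subnet: the network address OR-ed with the inverted netmask.
func broadcastAddr(cidr string) (net.IP, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, err
	}
	ip4 := ipnet.IP.To4()
	if ip4 == nil {
		return nil, fmt.Errorf("%s is not an IPv4 subnet", cidr)
	}
	bcast := make(net.IP, len(ip4))
	for i := range ip4 {
		bcast[i] = ip4[i] | ^ipnet.Mask[i]
	}
	return bcast, nil
}

func main() {
	b, err := broadcastAddr("192.168.1.0/24")
	if err != nil {
		panic(err)
	}
	fmt.Println(b) // 192.168.1.255
}
```

For the old dedicated-bridge network `10.77.0.0/24` the same function yields `10.77.0.255`, which matches the value the runbook used before this change.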
@@ -207,7 +210,8 @@ auth is independent and keeps working either way.

 | Symptom | First check |
 |---|---|
-| PXE client gets no DHCP offer | `journalctl -u vetting` for dnsmasq errors; confirm the LXC has `CAP_NET_ADMIN` (the shipped systemd unit does); confirm the host MAC is actually registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). |
+| Host sits at PXE, no boot filename | Confirm the MAC is registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). If it is, `sudo tcpdump -i <lan-iface> -n -e 'port 67 or port 68 or port 4011'` while the host PXEs — if you see DISCOVER/OFFER from the router but no proxy reply from us, check `journalctl -u vetting` for dnsmasq errors. |
+| PXE boots but iPXE can't fetch the script | Verify the LXC's LAN IP matches `pxe.orchestrator_url` in `/etc/vetting/vetting.yaml` — iPXE bakes that URL in at chainload. |
 | Agent `/hello` never fires | Check the live image is actually loading the agent binary — SSH into the live env (use the hold key path), `systemctl status vetting-agent`. |
 | Tile stuck on `Booting` | Most likely the live image booted but the agent can't reach the orchestrator. Verify `vetting.orchestrator=` in the kernel cmdline resolves from the host's network. |
 | UI shows stale stage | Force a reload; the SSE reconnect is automatic but the browser keeps the last state on ephemeral network blips. |
@@ -13,6 +13,13 @@
 //
 // sudo go test -tags=e2e -run TestQEMUFullRun ./test/e2e/...
 //
+// Network precondition: dnsmasq runs in proxy-DHCP mode on the LAN.
+// The QEMU VM attaches to the LAN bridge (default `vmbr0`) and gets
+// its IP from the LAN's real DHCP server (e.g. UniFi) while the
+// orchestrator's dnsmasq layers on the PXE options. There must be a
+// reachable DHCP server on that bridge — tests will hang at PXE
+// otherwise. Override the bridge with VETTING_E2E_BRIDGE.
+//
 // See docs/operations.md for the manual QEMU invocation equivalent.
 package e2e

@@ -34,11 +41,11 @@ import (
 // Tunables — overridable via env for CI, defaults match the manual
 // setup documented in docs/operations.md.
 var (
-	bridgeName = envOr("VETTING_E2E_BRIDGE", "br-vetting")
+	bridgeName = envOr("VETTING_E2E_BRIDGE", "vmbr0")
 	liveKernel = envOr("VETTING_E2E_KERNEL", "live-image/out/vmlinuz")
 	liveInitrd = envOr("VETTING_E2E_INITRD", "live-image/out/initrd.img")
 	testMAC    = envOr("VETTING_E2E_MAC", "52:54:00:12:34:56")
-	publicURL  = envOr("VETTING_E2E_URL", "http://10.77.0.1:8080")
+	publicURL  = envOr("VETTING_E2E_URL", "http://127.0.0.1:8080")
 	// Overall budget for the run to reach Completed. Stage timeouts in
 	// the config should be tuned down for E2E to well under this.
 	runBudget = 10 * time.Minute