diff --git a/docs/operations.md b/docs/operations.md
index ad50e39..d88d9e7 100644
--- a/docs/operations.md
+++ b/docs/operations.md
@@ -60,28 +60,26 @@ Same bundle layout either way.
 
 PXE is gated behind a second script so non-PXE installs stay simple.
 
-**Prerequisite: dedicated PXE bridge on the Proxmox hypervisor.** The
-LXC can't create bridges on its host, so do this once on the Proxmox
-node (not inside the LXC):
+**How it works on the network.** dnsmasq runs in **proxy-DHCP mode**:
+it binds to the LXC's LAN interface and *coexists* with your
+existing DHCP server (UniFi, pfSense, Asus, etc.). The router still
+hands out LAN IPs the normal way; dnsmasq only answers the PXE
+options (boot server + filename) and only for MACs you've registered
+in the UI. A random laptop booting from network on the same LAN gets
+a LAN IP from the router and nothing from us — the MAC allowlist is
+the safety barrier.
 
-```
-sudo ip link add br-vetting type bridge
-sudo ip addr add 10.77.0.1/24 dev br-vetting
-sudo ip link set br-vetting up
-```
-
-Attach a veth from the LXC onto `br-vetting` (e.g. `eth1` inside the
-LXC at `10.77.0.2/24`). Repaired nodes PXE-boot from a NIC cabled or
-bridged onto `br-vetting` only — keep this network isolated from your
-household DHCP, or both DHCP servers will fight.
+That means **no dedicated bridge, no VLAN, no cabling changes**. The
+LXC just needs an interface on the same L2 segment as the hosts
+you're repairing — typically `eth0` on the LAN bridge.
 
 On the LXC, inside the extracted bundle:
 
 ```
 sudo ./pxe-setup.sh \
-  --interface eth1 \
-  --dhcp-range 10.77.0.100,10.77.0.200,12h \
-  --orchestrator-url http://10.77.0.2:8080
+  --interface eth0 \
+  --subnet 192.168.1.0/24 \
+  --orchestrator-url http://:8080
 ```
 
 The script:
@@ -101,13 +99,19 @@ sudo journalctl -fu vetting
 ```
 
 The orchestrator validates PXE preconditions at startup (interface
-exists, iPXE binaries are on disk, `dhcp_range` parses) and exits
-non-zero with a clear error if anything's wrong, instead of failing
-silently when a host first PXE-boots.
+exists, iPXE binaries are on disk, `subnet` parses as CIDR) and
+exits non-zero with a clear error if anything's wrong, instead of
+failing silently when a host first PXE-boots.
 
 `pxe-setup.sh` is idempotent — safe to re-run. Pass `--force` to
 overwrite a hand-edited `pxe:` block.
 
+**Router caveat.** Most home/prosumer routers (UniFi, Asus, Netgear,
+etc.) don't send PXE options, so proxy mode just works. pfSense and
+OPNsense *can* serve PXE themselves — if yours does, disable its
+TFTP/netboot feature so there's only one PXE authority on the
+segment.
+
 ### Dev-loop install (from a source checkout)
 
 For iterating on the orchestrator without waiting for a CI publish:
@@ -124,39 +128,38 @@ For iterating on the orchestrator without waiting for a CI publish:
 
 Against a QEMU VM first, before you point it at real hardware:
 
-1. Make sure the `br-vetting` bridge exists on the hypervisor (see
-   above). From inside the LXC, confirm it's reachable on your
-   PXE-side interface.
-
-2. In the UI at `http://:8080`, register a host:
+1. In the UI at `http://:8080`, register a host:
    - Name: `qemu-test`
    - MAC: `52:54:00:12:34:56`
-   - WoL broadcast IP: `10.77.0.255`
+   - WoL broadcast IP: your LAN broadcast, e.g. `192.168.1.255`
    - Expected spec: paste a minimal YAML like
      ```yaml
      memory: { total_gib: 4 }
      cpu: { logical_cores: 4 }
      ```
 
-3. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingReboot`.
+2. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingReboot`.
 
-4. Launch the QEMU VM on the bridge so it PXE-boots from dnsmasq:
+3. Launch the QEMU VM on the LAN bridge so it PXE-boots via the
+   router's DHCP + our proxy-DHCP reply:
 
    ```
    sudo qemu-system-x86_64 \
      -enable-kvm -cpu host -smp 4 -m 4096 \
-     -netdev bridge,id=n0,br=br-vetting \
+     -netdev bridge,id=n0,br=vmbr0 \
      -device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \
      -drive file=/tmp/test-disk.img,format=raw,if=virtio \
      -boot n -serial mon:stdio -display none
    ```
 
-5. Watch the tile advance through stages. On success, the tile shows
+   (Swap `vmbr0` for whatever your Proxmox LAN bridge is called.)
+
+4. Watch the tile advance through stages. On success, the tile shows
    **View report** and the VM auto-shuts-down.
 
 For real repaired hardware: same flow, but register the node's actual
-MAC + expected spec, and make sure the node's BIOS is set to PXE-boot
-from the NIC that's on the `br-vetting` network.
+LAN MAC + expected spec, and make sure the node's BIOS is set to
+PXE-boot from the NIC that's on the LAN.
 
 ## A failed run — SSH to the held host
 
@@ -207,7 +210,8 @@ auth is independent and keeps working either way.
 
 | Symptom | First check |
 |---|---|
-| PXE client gets no DHCP offer | `journalctl -u vetting` for dnsmasq errors; confirm the LXC has `CAP_NET_ADMIN` (the shipped systemd unit does); confirm the host MAC is actually registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). |
+| Host sits at PXE, no boot filename | Confirm the MAC is registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). If it is, `sudo tcpdump -i -n -e 'port 67 or port 68 or port 4011'` while the host PXEs — if you see DISCOVER/OFFER from the router but no proxy reply from us, check `journalctl -u vetting` for dnsmasq errors. |
+| PXE boots but iPXE can't fetch the script | Verify the LXC's LAN IP matches `pxe.orchestrator_url` in `/etc/vetting/vetting.yaml` — iPXE bakes that URL in at chainload. |
 | Agent `/hello` never fires | Check the live image is actually loading the agent binary — SSH into the live env (use the hold key path), `systemctl status vetting-agent`. |
 | Tile stuck on `Booting` | Most likely the live image booted but the agent can't reach the orchestrator. Verify `vetting.orchestrator=` in the kernel cmdline resolves from the host's network. |
 | UI shows stale stage | Force a reload; the SSE reconnect is automatic but the browser keeps the last state on ephemeral network blips. |
diff --git a/test/e2e/qemu_test.go b/test/e2e/qemu_test.go
index 52a42a8..313ad79 100644
--- a/test/e2e/qemu_test.go
+++ b/test/e2e/qemu_test.go
@@ -13,6 +13,13 @@
 //
 //	sudo go test -tags=e2e -run TestQEMUFullRun ./test/e2e/...
 //
+// Network precondition: dnsmasq runs in proxy-DHCP mode on the LAN.
+// The QEMU VM attaches to the LAN bridge (default `vmbr0`) and gets
+// its IP from the LAN's real DHCP server (e.g. UniFi) while the
+// orchestrator's dnsmasq layers on the PXE options. There must be a
+// reachable DHCP server on that bridge — tests will hang at PXE
+// otherwise. Override the bridge with VETTING_E2E_BRIDGE.
+//
 // See docs/operations.md for the manual QEMU invocation equivalent.
 package e2e
 
@@ -34,11 +41,11 @@ import (
 // Tunables — overridable via env for CI, defaults match the manual
 // setup documented in docs/operations.md.
 var (
-	bridgeName = envOr("VETTING_E2E_BRIDGE", "br-vetting")
+	bridgeName = envOr("VETTING_E2E_BRIDGE", "vmbr0")
 	liveKernel = envOr("VETTING_E2E_KERNEL", "live-image/out/vmlinuz")
 	liveInitrd = envOr("VETTING_E2E_INITRD", "live-image/out/initrd.img")
 	testMAC    = envOr("VETTING_E2E_MAC", "52:54:00:12:34:56")
-	publicURL  = envOr("VETTING_E2E_URL", "http://10.77.0.1:8080")
+	publicURL  = envOr("VETTING_E2E_URL", "http://127.0.0.1:8080")
 	// Overall budget for the run to reach Completed. Stage timeouts in
 	// the config should be tuned down for E2E to well under this.
 	runBudget = 10 * time.Minute