docs+e2e: document proxy-DHCP topology; default e2e bridge to LAN
Rewrites the PXE section of the ops runbook around the new proxy-DHCP model (no dedicated bridge, coexists with UniFi/pfSense/etc.) and swaps the e2e test's default bridge + orchestrator URL to match. The e2e file now calls out the LAN-DHCP precondition in its header so future-me (or CI) doesn't hang at PXE wondering why nothing answers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
+36
-32
@@ -60,28 +60,26 @@ Same bundle layout either way.
 
 PXE is gated behind a second script so non-PXE installs stay simple.
 
-**Prerequisite: dedicated PXE bridge on the Proxmox hypervisor.** The
-LXC can't create bridges on its host, so do this once on the Proxmox
-node (not inside the LXC):
+**How it works on the network.** dnsmasq runs in **proxy-DHCP mode**:
+it binds to the LXC's LAN interface and *coexists* with your
+existing DHCP server (UniFi, pfSense, Asus, etc.). The router still
+hands out LAN IPs the normal way; dnsmasq only answers the PXE
+options (boot server + filename) and only for MACs you've registered
+in the UI. A random laptop booting from network on the same LAN gets
+a LAN IP from the router and nothing from us — the MAC allowlist is
+the safety barrier.
 
-```
-sudo ip link add br-vetting type bridge
-sudo ip addr add 10.77.0.1/24 dev br-vetting
-sudo ip link set br-vetting up
-```
-
-Attach a veth from the LXC onto `br-vetting` (e.g. `eth1` inside the
-LXC at `10.77.0.2/24`). Repaired nodes PXE-boot from a NIC cabled or
-bridged onto `br-vetting` only — keep this network isolated from your
-household DHCP, or both DHCP servers will fight.
+That means **no dedicated bridge, no VLAN, no cabling changes**. The
+LXC just needs an interface on the same L2 segment as the hosts
+you're repairing — typically `eth0` on the LAN bridge.
 
 On the LXC, inside the extracted bundle:
 
 ```
 sudo ./pxe-setup.sh \
---interface eth1 \
---dhcp-range 10.77.0.100,10.77.0.200,12h \
---orchestrator-url http://10.77.0.2:8080
+--interface eth0 \
+--subnet 192.168.1.0/24 \
+--orchestrator-url http://<lxc-lan-ip>:8080
 ```
 
 The script:
@@ -101,13 +99,19 @@ sudo journalctl -fu vetting
 ```
 
 The orchestrator validates PXE preconditions at startup (interface
-exists, iPXE binaries are on disk, `dhcp_range` parses) and exits
-non-zero with a clear error if anything's wrong, instead of failing
-silently when a host first PXE-boots.
+exists, iPXE binaries are on disk, `subnet` parses as CIDR) and
+exits non-zero with a clear error if anything's wrong, instead of
+failing silently when a host first PXE-boots.
 
 `pxe-setup.sh` is idempotent — safe to re-run. Pass `--force` to
 overwrite a hand-edited `pxe:` block.
 
+**Router caveat.** Most home/prosumer routers (UniFi, Asus, Netgear,
+etc.) don't send PXE options, so proxy mode just works. pfSense and
+OPNsense *can* serve PXE themselves — if yours does, disable its
+TFTP/netboot feature so there's only one PXE authority on the
+segment.
+
 ### Dev-loop install (from a source checkout)
 
 For iterating on the orchestrator without waiting for a CI publish:
@@ -124,39 +128,38 @@ For iterating on the orchestrator without waiting for a CI publish:
 
 Against a QEMU VM first, before you point it at real hardware:
 
-1. Make sure the `br-vetting` bridge exists on the hypervisor (see
-above). From inside the LXC, confirm it's reachable on your
-PXE-side interface.
-
-2. In the UI at `http://<lxc>:8080`, register a host:
+1. In the UI at `http://<lxc-lan-ip>:8080`, register a host:
 - Name: `qemu-test`
 - MAC: `52:54:00:12:34:56`
-- WoL broadcast IP: `10.77.0.255`
+- WoL broadcast IP: your LAN broadcast, e.g. `192.168.1.255`
 - Expected spec: paste a minimal YAML like
 ```yaml
 memory: { total_gib: 4 }
 cpu: { logical_cores: 4 }
 ```
 
-3. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingReboot`.
+2. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingReboot`.
 
-4. Launch the QEMU VM on the bridge so it PXE-boots from dnsmasq:
+3. Launch the QEMU VM on the LAN bridge so it PXE-boots via the
+router's DHCP + our proxy-DHCP reply:
 
 ```
 sudo qemu-system-x86_64 \
 -enable-kvm -cpu host -smp 4 -m 4096 \
--netdev bridge,id=n0,br=br-vetting \
+-netdev bridge,id=n0,br=vmbr0 \
 -device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \
 -drive file=/tmp/test-disk.img,format=raw,if=virtio \
 -boot n -serial mon:stdio -display none
 ```
 
-5. Watch the tile advance through stages. On success, the tile shows
+(Swap `vmbr0` for whatever your Proxmox LAN bridge is called.)
+
+4. Watch the tile advance through stages. On success, the tile shows
 **View report** and the VM auto-shuts-down.
 
 For real repaired hardware: same flow, but register the node's actual
-MAC + expected spec, and make sure the node's BIOS is set to PXE-boot
-from the NIC that's on the `br-vetting` network.
+LAN MAC + expected spec, and make sure the node's BIOS is set to
+PXE-boot from the NIC that's on the LAN.
 
 ## A failed run — SSH to the held host
 
@@ -207,7 +210,8 @@ auth is independent and keeps working either way.
 
 | Symptom | First check |
 |---|---|
-| PXE client gets no DHCP offer | `journalctl -u vetting` for dnsmasq errors; confirm the LXC has `CAP_NET_ADMIN` (the shipped systemd unit does); confirm the host MAC is actually registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). |
+| Host sits at PXE, no boot filename | Confirm the MAC is registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). If it is, `sudo tcpdump -i <lan-iface> -n -e 'port 67 or port 68 or port 4011'` while the host PXEs — if you see DISCOVER/OFFER from the router but no proxy reply from us, check `journalctl -u vetting` for dnsmasq errors. |
+| PXE boots but iPXE can't fetch the script | Verify the LXC's LAN IP matches `pxe.orchestrator_url` in `/etc/vetting/vetting.yaml` — iPXE bakes that URL in at chainload. |
 | Agent `/hello` never fires | Check the live image is actually loading the agent binary — SSH into the live env (use the hold key path), `systemctl status vetting-agent`. |
 | Tile stuck on `Booting` | Most likely the live image booted but the agent can't reach the orchestrator. Verify `vetting.orchestrator=` in the kernel cmdline resolves from the host's network. |
 | UI shows stale stage | Force a reload; the SSE reconnect is automatic but the browser keeps the last state on ephemeral network blips. |
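Reviewer sketch, not part of this commit: the runbook hunk above says the orchestrator checks at startup that the PXE interface exists and that `subnet` parses as CIDR. A minimal Go version of such a check might look like the following. The function names `validateInterface` and `validateSubnet` are hypothetical, the iPXE-binary check is left out, and the real orchestrator code may be structured quite differently.

```go
package main

import (
	"fmt"
	"net"
)

// validateInterface is a hypothetical helper: it reports an error when
// no network interface with the given name exists on this machine.
func validateInterface(name string) error {
	if _, err := net.InterfaceByName(name); err != nil {
		return fmt.Errorf("pxe interface %q: %w", name, err)
	}
	return nil
}

// validateSubnet is a hypothetical helper: it reports an error when the
// value does not parse as CIDR (for example an old-style dhcp-range string).
func validateSubnet(subnet string) error {
	if _, _, err := net.ParseCIDR(subnet); err != nil {
		return fmt.Errorf("pxe subnet %q: %w", subnet, err)
	}
	return nil
}

func main() {
	// Values taken from the pxe-setup.sh example in the runbook.
	if err := validateInterface("eth0"); err != nil {
		fmt.Println("precondition failed:", err)
	}
	if err := validateSubnet("192.168.1.0/24"); err != nil {
		fmt.Println("precondition failed:", err)
	}
}
```

A value left over from the old `--dhcp-range` flag, such as `10.77.0.100,10.77.0.200,12h`, fails `validateSubnet`, which is the kind of early, explicit failure the runbook describes.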
@@ -13,6 +13,13 @@
 //
 // sudo go test -tags=e2e -run TestQEMUFullRun ./test/e2e/...
 //
+// Network precondition: dnsmasq runs in proxy-DHCP mode on the LAN.
+// The QEMU VM attaches to the LAN bridge (default `vmbr0`) and gets
+// its IP from the LAN's real DHCP server (e.g. UniFi) while the
+// orchestrator's dnsmasq layers on the PXE options. There must be a
+// reachable DHCP server on that bridge — tests will hang at PXE
+// otherwise. Override the bridge with VETTING_E2E_BRIDGE.
+//
 // See docs/operations.md for the manual QEMU invocation equivalent.
 package e2e
 
@@ -34,11 +41,11 @@ import (
 // Tunables — overridable via env for CI, defaults match the manual
 // setup documented in docs/operations.md.
 var (
-bridgeName = envOr("VETTING_E2E_BRIDGE", "br-vetting")
+bridgeName = envOr("VETTING_E2E_BRIDGE", "vmbr0")
 liveKernel = envOr("VETTING_E2E_KERNEL", "live-image/out/vmlinuz")
 liveInitrd = envOr("VETTING_E2E_INITRD", "live-image/out/initrd.img")
 testMAC = envOr("VETTING_E2E_MAC", "52:54:00:12:34:56")
-publicURL = envOr("VETTING_E2E_URL", "http://10.77.0.1:8080")
+publicURL = envOr("VETTING_E2E_URL", "http://127.0.0.1:8080")
 // Overall budget for the run to reach Completed. Stage timeouts in
 // the config should be tuned down for E2E to well under this.
 runBudget = 10 * time.Minute