Commit Graph

15 Commits

Author SHA1 Message Date
josh 157b70f536 pxe: split subnet into network+netmask for dnsmasq proxy-DHCP
CI / Lint + build + test (push) Successful in 2m0s
Release / release (push) Successful in 3m35s
dnsmasq's proxy-DHCP syntax is `dhcp-range=<network-ip>,proxy[,<mask>]`,
not a CIDR. Passing "192.168.1.0/24,proxy" made dnsmasq refuse to start
with "bad dhcp-range at line 12". Parse the CIDR once in writeConf()
and render Network + Netmask as separate template fields.

The config surface (pxe.subnet) stays CIDR because that's the right
shape for humans; we just unpack it before handing to dnsmasq.
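
The unpacking step can be sketched as follows. This is a minimal illustration, not the actual writeConf() code: `splitCIDR` is a hypothetical helper name, and rejecting non-IPv4 input is an assumption.

```go
package main

import (
	"fmt"
	"net"
)

// splitCIDR (hypothetical name) unpacks an IPv4 CIDR into the network
// address and dotted-quad netmask used by dnsmasq's proxy-DHCP syntax.
func splitCIDR(cidr string) (network, netmask string, err error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return "", "", fmt.Errorf("pxe.subnet: %w", err)
	}
	if ipnet.IP.To4() == nil {
		// Assumption: only IPv4 subnets are meaningful here.
		return "", "", fmt.Errorf("pxe.subnet: %q is not IPv4", cidr)
	}
	return ipnet.IP.String(), net.IP(ipnet.Mask).String(), nil
}

func main() {
	network, netmask, err := splitCIDR("192.168.1.0/24")
	if err != nil {
		panic(err)
	}
	// Prints: dhcp-range=192.168.1.0,proxy,255.255.255.0
	fmt.Printf("dhcp-range=%s,proxy,%s\n", network, netmask)
}
```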

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 12:17:10 -04:00
josh 506c856046 pxe: switch dnsmasq to proxy-DHCP mode on the LAN
CI / Lint + build + test (push) Successful in 1m48s
Release / release (push) Successful in 2m22s
Previously the orchestrator ran a full DHCP server on a dedicated
br-vetting bridge (10.77.0.0/24), which required a hypervisor-level
bridge + physical cabling onto that bridge for every repaired host.
Real-world bite: the LXC's br-vetting had no L2 path to the target
host's PXE NIC, so DHCPDISCOVERs never reached eth1 and PXE silently
timed out.

dnsmasq's proxy-DHCP mode is the idiomatic answer: it coexists with
the LAN's existing DHCP server (UniFi, etc.), never assigns an IP
itself, and only supplements the PXE options. No dedicated bridge,
no VLAN, no cabling changes: dnsmasq binds to the LAN interface
and layers option 66/67 + the PXE BINL on top of the real DHCP
exchange. The MAC allowlist still gates replies, so unlisted LAN
clients that try to network-boot get nothing.

Template switches dhcp-range=<start,end,lease> to
dhcp-range=<cidr>,proxy and replaces dhcp-boot= for first-boot ROM
clients with pxe-service= directives (the correct proxy-mode
chainload form). Validation drops the dhcp_range regex for a
net.ParseCIDR check on pxe.subnet. Config, production/example yaml,
and pxe-setup.sh swap --dhcp-range for --subnet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 12:02:49 -04:00
josh 6a1d5c3bed pxe: route dnsmasq lease + pid files into RuntimeDir
CI / Lint + build + test (push) Successful in 1m39s
Release / release (push) Successful in 2m24s
Without explicit dhcp-leasefile and pid-file, dnsmasq reaches for
its distro defaults (/var/lib/misc/dnsmasq.leases,
/run/dnsmasq.pid) — both outside the systemd unit's
ReadWritePaths=/var/lib/vetting /var/log/vetting sandbox, causing
'Read-only file system' on every start.

RuntimeDir is already writable by construction (Supervisor.Start
mkdir's it), so writing both files there keeps dnsmasq entirely
inside the sandbox.
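
A rough sketch of the two config lines involved. `dhcp-leasefile` and `pid-file` are real dnsmasq options; the helper name, the file names, and the example RuntimeDir path are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// sandboxLines (hypothetical helper) returns the dnsmasq options that pin
// its lease and pid files under runtimeDir. File names are assumed.
func sandboxLines(runtimeDir string) []string {
	return []string{
		"dhcp-leasefile=" + filepath.Join(runtimeDir, "dnsmasq.leases"),
		"pid-file=" + filepath.Join(runtimeDir, "dnsmasq.pid"),
	}
}

func main() {
	// Example path only; the real RuntimeDir comes from the orchestrator config.
	for _, line := range sandboxLines("/var/lib/vetting/run") {
		fmt.Println(line)
	}
}
```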

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 11:31:37 -04:00
josh a5055b3c7a Automate PXE setup: release bundle + pxe-setup.sh + startup validation
CI / Lint + build + test (push) Has been cancelled
Collapses the LXC side of PXE enablement from a six-step manual dance
(build, fetch iPXE, scp, bridge, hand-edit yaml) into:

  make release                   # dev box (Linux/WSL)
  scp bundle.tar.gz lxc:/tmp/
  sudo ./install.sh              # base install, unchanged
  sudo ./pxe-setup.sh --interface ... --dhcp-range ... --orchestrator-url ...

pxe-setup.sh fetches iPXE from boot.ipxe.org, verifies against pinned
SHA256s in deploy/ipxe-shas.txt (fail-closed), places vmlinuz/initrd.img
from the bundle, and rewrites only the pxe: block of vetting.yaml.
Idempotent; --force gates overwriting a hand-edited block.
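
The fail-closed check is done in shell by pxe-setup.sh; the same idea in Go might look like the sketch below. The in-memory `pins` map stands in for deploy/ipxe-shas.txt (whose format is not shown here), and all function names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sha256Hex returns the lowercase hex SHA256 digest of data.
func sha256Hex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// verifyPinned fails closed: a missing pin is an error, not a pass.
func verifyPinned(name string, data []byte, pins map[string]string) error {
	want, ok := pins[name]
	if !ok {
		return fmt.Errorf("%s: no pinned SHA256, refusing to install", name)
	}
	if got := sha256Hex(data); got != want {
		return fmt.Errorf("%s: SHA256 mismatch: got %s, want %s", name, got, want)
	}
	return nil
}

func main() {
	payload := []byte("example payload") // stand-in for a downloaded iPXE binary
	pins := map[string]string{"undionly.kpxe": sha256Hex(payload)}

	fmt.Println(verifyPinned("undionly.kpxe", payload, pins))          // matches the pin
	fmt.Println(verifyPinned("undionly.kpxe", []byte("tamper"), pins)) // mismatch error
	fmt.Println(verifyPinned("ipxe.efi", payload, pins))               // no pin: refused
}
```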

Adds Supervisor.Validate() — called before dnsmasq spawn — so typo'd
configs fail at orchestrator startup with clear errors naming the
missing file or yaml key, instead of silently serving broken TFTP
until a real host tries to PXE-boot. Nine tests cover missing files,
bogus interface, malformed dhcp_range, bad orchestrator_url, and
aggregate reporting.

Hypervisor bridge creation stays documented (LXC can't do it) but
everything downstream of the bridge is now scripted.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 01:38:43 -04:00
josh d245fa6235 quick.sh: stage+install agent to avoid ETXTBSY, restart service
CI / Lint + build + test (push) Failing after 5m17s
Re-running quick.sh on a host where vetting-reporter was already
running failed with curl error 23 because curl can't overwrite a
busy executable. Download to a staging path, then use `install(1)`
which unlinks the target before writing. Swap `enable --now` for
`enable` + `restart` so the service picks up the new binary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 01:13:14 -04:00
josh d0bfae14c8 Heartbeat-first dispatch: retire WoL-as-default, add WaitingReboot
CI / Lint + build + test (push) Has been cancelled
Every supported host runs vetting-reporter in-OS and heartbeats every
30s. WoL was never the thing that started vetting — the heartbeat
response's reboot_for_vetting command was. Firing WoL first only
crowded the run log with misleading diagnostics when the real failure
mode is "reporter isn't installed."

- StartRun 409s if the host hasn't heartbeated within 60s, pointing
  the operator at /register/quick.sh.
- Dispatcher re-checks LastSeenAt at dispatch time (run may sit in
  Queued long enough for the host to go offline); stale hosts mark
  the run Failed with failed_stage=dispatch instead of looping.
- New StateWaitingReboot + TriggerRebootCommanded capture the actual
  semantics. StateWaitingWoL kept as the hook point for a future
  manual-override button.
- Tile disables the Start button with a quick.sh tooltip when the
  host is offline, matching the server-side 409.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 01:10:34 -04:00
josh c9927ca2bf config: default agent.asset_dir so old configs still serve /assets
CI / Lint + build + test (push) Failing after 5m12s
Operators who installed vetting before agent.asset_dir existed keep
their config preserved by install.sh on upgrade, which left them
with AssetDir="" — the router silently dropped the /assets/*
mount and the quick-register one-liner hit 404 fetching the agent
binary. Default AssetDir alongside the database file so the same
directory install.sh already creates + drops the agent binary into
is picked up automatically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 00:41:09 -04:00
josh 1694c20b12 Host detail v2: full pipeline + per-stage logs + WoL diagnostics
CI / Lint + build + test (push) Has been cancelled
Pipeline now always renders all 13 nodes (3 pre-stage + 9 stage +
Completed), synthesising ghosts from run state when stage rows
aren't seeded yet. Makes a WaitingWoL host show the full timeline
ahead of it instead of just 4 dots.

Agent tags each log line with its stage; logs.Hub fans out to both
log-{runID} and log-{runID}-{stage} SSE events so the detail page
can show per-stage tabs with a pure-CSS radio-sibling switch. Flat
run log prepends [stage] so grep still works.
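
A sketch of the naming scheme only, not of logs.Hub itself. The helper names, the string-typed run ID, and the example stage name are assumptions.

```go
package main

import "fmt"

// eventNames returns the SSE event names one log line is published under:
// always the flat per-run event, plus a per-stage event when tagged.
func eventNames(runID, stage string) []string {
	names := []string{"log-" + runID}
	if stage != "" {
		names = append(names, "log-"+runID+"-"+stage)
	}
	return names
}

// flatLine prefixes the stage tag so the flat run log stays greppable.
func flatLine(stage, line string) string {
	if stage == "" {
		return line
	}
	return "[" + stage + "] " + line
}

func main() {
	fmt.Println(eventNames("42", "memtest"))         // [log-42 log-42-memtest]
	fmt.Println(flatLine("memtest", "pass 1 of 4"))  // [memtest] pass 1 of 4
}
```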

Dispatcher writes picked/sent-WoL/heartbeat lines into the per-run
log — the operator opens the detail page, sees WaitingWoL stuck,
and reads exactly what the dispatcher did and why nothing's
progressing, instead of having to tail journalctl on the LXC.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 00:38:27 -04:00
josh bb658a8435 Host detail page + pipeline timeline
CI / Lint + build + test (push) Has been cancelled
Click a tile to open /hosts/{id} — the canonical control surface per
host. Timeline renders every pre-stage, stage, and terminal node in
order, with the current one pulsing, failed ones flagged, and
downstream ones dimmed as skipped. Detail page shows summary, hold
card (when holding), all action buttons, spec diffs, a full-height
log pane, and a collapsed expected-spec YAML.

Tile slims to name, last-seen, status, and one primary action; a
CSS-overlay <a> makes the whole card clickable while buttons stay
clickable above it via z-index.

Runner.publishTileUpdate now also emits pipeline-{runID} fragments,
and CompleteStage wraps Stages.CompleteByName so stage completions
advance the timeline live — without this the dots only moved on
state transitions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-17 23:59:43 -04:00
josh 9b16ed80e6 Heartbeat command channel: reboot_for_vetting skips WoL
CI / Lint + build + test (push) Failing after 5m13s
When the operator clicks Start vetting and the host is heartbeating,
the heartbeat response now carries cmd=reboot_for_vetting + run_id.
The handler drives the Queued → WaitingWoL transition via the existing
state machine, so if it races the 2s dispatcher poll, the state machine
refuses the second transition instead of double-dispatching. WaitingWoL retries for
10 minutes to cover a crashed-mid-reboot case, then falls back to
operator action.
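
Why the race is benign can be shown with a compare-and-set transition. This is an illustrative sketch, not the project's state machine: the `Run` type, `Transition` method, and `StateQueued` constant name are assumptions (only StateWaitingWoL is named in this log).

```go
package main

import (
	"fmt"
	"sync"
)

type State string

const (
	StateQueued     State = "Queued"
	StateWaitingWoL State = "WaitingWoL"
)

// Run holds the current state behind a mutex (hypothetical type).
type Run struct {
	mu    sync.Mutex
	state State
}

// Transition moves the run from -> to only if it is still in from.
// Whichever caller loses the race gets an error and must not dispatch.
func (r *Run) Transition(from, to State) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.state != from {
		return fmt.Errorf("refused %s -> %s: run is in %s", from, to, r.state)
	}
	r.state = to
	return nil
}

func main() {
	run := &Run{state: StateQueued}
	// Heartbeat handler wins the race.
	fmt.Println(run.Transition(StateQueued, StateWaitingWoL))
	// Dispatcher poll arrives second and is refused.
	fmt.Println(run.Transition(StateQueued, StateWaitingWoL))
}
```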

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-17 23:37:01 -04:00
josh a0c0fb114f Add host-mode heartbeat: vetting-agent host + last-seen badge
CI / Lint + build + test (push) Has been cancelled
vetting-agent gains a `host` subcommand that runs as a systemd service
installed by the quick-register one-liner, POSTing every 30s to
/api/v1/hosts/{mac}/heartbeat so the dashboard tile shows "online" or
"Nm ago" without waiting on WoL. Ships dormant client code for the
Phase 2 reboot_for_vetting command so the server can flip it on later
without a binary redeploy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-17 23:34:15 -04:00
josh d24207427f Fix quick-register broadcast detection on Proxmox bridges
CI / Lint + build + test (push) Failing after 5m17s
Two bugs compounded on Proxmox hosts: primary_iface walked
`ip link show` and picked the physical NIC (e.g. enp1s0), which has
no IPv4 on Proxmox because the address lives on vmbr0. Even if vmbr0
had been picked, the kernel reports its broadcast as 0.0.0.0, so the
script fell all the way back to 255.255.255.255.

Now we prefer the default-route interface (vmbr0 on Proxmox, eno1 on
bare metal) and, when `ip` doesn't surface a usable `brd`, compute
the broadcast from the inet CIDR instead of giving up.
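
The fallback arithmetic, sketched in Go for clarity (the real logic lives in the quick-register shell script; `broadcastFromCIDR` is a hypothetical name). The broadcast address is the interface address with every host bit set.

```go
package main

import (
	"fmt"
	"net"
)

// broadcastFromCIDR computes the IPv4 directed-broadcast address for an
// interface address in CIDR form, e.g. "192.168.1.10/24".
func broadcastFromCIDR(cidr string) (string, error) {
	ip, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return "", err
	}
	ip4 := ip.To4()
	if ip4 == nil || len(ipnet.Mask) != net.IPv4len {
		return "", fmt.Errorf("%q is not an IPv4 CIDR", cidr)
	}
	bcast := make(net.IP, net.IPv4len)
	for i := range ip4 {
		// Keep the network bits, set all host bits.
		bcast[i] = ip4[i] | ^ipnet.Mask[i]
	}
	return bcast.String(), nil
}

func main() {
	b, err := broadcastFromCIDR("192.168.1.10/24")
	if err != nil {
		panic(err)
	}
	fmt.Println(b) // 192.168.1.255
}
```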
2026-04-17 22:57:49 -04:00
josh 8b3d9a312e Add quick-register one-liner for target-host registration
CI / Lint + build + test (push) Failing after 5m15s
Operator pastes `curl -fsSL $ORCH/register/quick.sh | sudo bash` on the
target host (pre-wipe). The script probes MAC + CPU/RAM/disks/NICs/GPUs,
emits an expected-spec YAML, and POSTs to a new LAN-trusted JSON
endpoint /api/v1/hosts. The register page shows the command prefilled
with the orchestrator URL; the manual form moves into a collapsible
"Register manually" disclosure.
2026-04-17 22:50:54 -04:00
josh 42da48864f Remove operator auth — trust the LAN
CI / Lint + build + test (push) Failing after 5m15s
Can't log in from a fresh LXC deploy, and the service is LAN-only by
design. Rip out the whole bcrypt-password / signed-cookie session
layer: internal/auth, login templates, gen-admin-password binary +
Makefile targets, auth config block, login/logout routes and the
RequireSession middleware wrap. Agent bearer-token auth on
/api/v1/runs/{id}/* is untouched.

Operators who want a password can front the service with a reverse
proxy — noted in README and docs/operations.md.
2026-04-17 22:31:49 -04:00
josh 9bb4b09a04 Initial commit: full Phases 1-6 implementation
CI / Lint + build + test (push) Has been cancelled
Post-repair hardware validation pipeline for Proxmox cluster hosts.
Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq
PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.
2026-04-17 21:32:10 -04:00