Vetting

Author	SHA1	Message	Date
josh	27098fc7ed	cpustress+orchestrator: serial CPU/RAM passes + silent-skip guard CI / Lint + build + test (push) Successful in 1m23s Details Release / release (push) Successful in 6m2s Details Orion's run (log 20:49 → 20:54) shipped GREEN while silently skipping CPUStress. Two compounding bugs: 1. CPUStress ran --cpu N AND --vm N --vm-bytes 90% concurrently. On a 4-core 8 GiB N95, that's 360% RAM overcommit; the OOM-killer fired, usually on the agent itself. Replaced with two sequential passes — CPU (all methods, --verify) for 3 min, then RAM (--vm 1, --vm-bytes capped to MemAvailable − 1.5 GiB, floor 256 MiB, --verify) for 3 min. Each pass now also asserts elapsed ≥ target − 2s so a premature clean exit counts as failure instead of a silent pass. 2. On systemd-restart after the OOM, the agent hardcoded nextStage := "Inventory" and re-ran it. The orchestrator's /result handler advances run state via TriggerStageCompleted against the current RunState, not against body.Stage — so an Inventory result posted while the run was in StateCPUStress silently advanced CPUStress → Storage and marked CPUStress passed without it ever running. Two-layer defense for #2: - agent-side: /claim response now carries current_state; agent resumes at the matching stage on a re-claim (happy path). - server-side: new TriggerStageMismatch + StageNameForState helper backstop. If body.Stage doesn't match the run's current stage, /result parks the run in FailedHolding with failed_stage labeled "<got> (expected <expected>)" and returns 409. Other stages audited for similar unbounded concurrency — none found; only CPUStress was unsafe. Tests: - cpustress_test.go — parseMemAvailable parses real meminfo, errors on missing/malformed; cap calc hits floor on tiny boxes, uses 1.5 GiB headroom on normal/huge boxes. - statemachine_test.go — TriggerStageMismatch lands at FailedHolding from every stage state and is rejected from pre-stage/terminal states; StageNameForState round-trips the stageStates map. - agent_handlers_test.go — TestResult_RejectsMismatchedStage proves the Orion scenario now 409s + FailedHolding; TestResult_AcceptsMatchingStage proves the guard doesn't break the happy path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 17:29:13 -04:00
josh	cdd6cae3b0	ui: keep detail-page SSE swaps live after the first outerHTML replace CI / Lint + build + test (push) Successful in 1m28s Details Release / release (push) Successful in 6m29s Details Pipeline fragment payload was a bare <div class=pipeline>, but the sse-swap=pipeline-N wrapper lived only in the page shell. The first outerHTML swap destroyed the wrapper, so every subsequent pipeline event had nothing to target — forcing a manual refresh. RenderPipelineString now emits the full <section id=pipeline-N sse-swap=... hx-swap=outerHTML> wrapper, used from both the shell and the orchestrator publish path. Also drop the red-bar styling from the empty DetailHold placeholder: the wrapper's detail-hold class was painting an unconditional red band between Pipeline and Actions whenever no hold was active.	2026-04-18 17:03:39 -04:00
josh	0db790ae3e	ui: stream host-detail fragments over SSE so the page updates live CI / Lint + build + test (push) Successful in 1m29s Details Release / release (push) Has been cancelled Details The detail page was only partly live: Pipeline + LogTabs subscribed to SSE, but the summary header, actions row, spec-diffs list and hold-key block all froze at page-load and required a manual refresh to catch up with state changes. Extract each of those four regions into its own named templ component with a stable id and sse-swap target, add Render*String helpers so the orchestrator can publish pre-rendered fragments, and register a HostDetailRenderer alongside the existing Tile/Pipeline renderers. PublishHostDetail is folded into publishTileUpdate so every call site that already refreshes a tile now also refreshes the detail page — keeps the fan-out honest without scattering new publish calls. The empty-state wrappers for spec-diffs and hold are load-bearing: without the <section id=... sse-swap=...> present at initial GET, the first live event after SpecValidate or Hold writes would have no DOM node to swap into. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 16:36:13 -04:00
josh	026923075c	pxe: disable systemd-firstboot so the live image doesn't prompt CI / Lint + build + test (push) Successful in 1m22s Details Release / release (push) Has been cancelled Details systemd-firstboot.service is an interactive wizard that asks for locale, timezone, and root password when /etc/machine-id isn't populated — i.e. every PXE boot of a mkosi-built image. It sits on sysinit.target waiting for input that will never arrive, blocking the agent service and every other downstream unit indefinitely. systemd.firstboot=off on the kernel cmdline is the documented kill switch; no image-side changes needed.	2026-04-18 15:35:24 -04:00
josh	c45349f62c	pxe: mask serial-getty@ttyS0 so hosts without serial don't wait 90s CI / Lint + build + test (push) Successful in 1m47s Details Release / release (push) Successful in 5m16s Details systemd-getty-generator reads console=ttyS0 off the kernel cmdline and auto-creates serial-getty@ttyS0.service, which BindsTo dev-ttyS0.device. On hardware without a physical serial port the device node never shows up, systemd waits its full default 90s timeout, and only then proceeds. systemd.mask= on the kernel cmdline is a first-class option — masks the unit before the generator's link even gets activated. Kernel messages still go to ttyS0 if a port is present; we just don't try to spawn a login prompt there.	2026-04-18 14:47:03 -04:00
josh	a88e24bef4	live-image: real /init + verbose boot for first-boot diagnosis CI / Lint + build + test (push) Successful in 1m23s Details Release / release (push) Successful in 4m49s Details Host boots past kernel init and then stalls silently. ACPI DSDT error about TXHC.RHUB.SS01 is benign noise (Tiger Lake firmware bug) — the actual problem is that nothing between kernel handoff and (maybe) systemd is visible on the console. Two changes: 1. Replace the /init → sbin/init symlink with a real shell script (live-image/mkosi.extra/init) that mounts /proc /sys /dev /dev/pts /dev/shm /run before execing systemd. Systemd has fallback mount code for these, but when it fails the failure is silent. Doing it explicitly in /init keeps failures visible and avoids the fragile symlink-resolution trick. 2. Drop 'quiet' from the kernel cmdline and add loglevel=7 plus systemd.log_target=kmsg + journald.forward_to_console=1 so every early-boot message reaches both tty0 and ttyS0. Will be dialed back once boot is stable. Also: .gitattributes pins LF on live-image/, .gitea/, Makefile, and *.sh so Windows checkouts don't break shell scripts and Makefile recipes with CRLF. /init also gets chmod 0755 in repack-initrd as a belt-and-braces against mode loss on non-Linux checkouts.	2026-04-18 14:31:40 -04:00
josh	4524ab8dc0	runs: add non-destructive flag + operator Cancel button CI / Lint + build + test (push) Successful in 2m5s Details Release / release (push) Successful in 3m5s Details Non-destructive pre-declares "don't touch the disks" on Start: the Storage stage skips wipe-probe, badblocks -w, and write-mode fio, and reports a read-only summary. Runs gain a new non_destructive column, threaded through Claim → agent tests.Deps → Storage stage. Cancel halts an in-flight run. The orchestrator transitions to a new StateCancelled via TriggerOperatorCancelled (valid from any active state); the agent's next heartbeat returns cmd=cancel_stage, which fires a stored CancelFunc on the per-stage context. Stage subprocesses spawned with exec.CommandContext die with the context, the agent posts a cancelled outcome, then powers the host off. Destructive stages mid-run may leave the host in an intermediate state — the UI confirm dialog warns the operator; recovery is manual for now. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 13:01:42 -04:00
josh	2c440fce8a	pxe: move dhcp-host allowlist into a SIGHUP-reloadable file CI / Lint + build + test (push) Successful in 1m38s Details Release / release (push) Successful in 2m25s Details dnsmasq's SIGHUP re-reads /etc/ethers and any --dhcp-hostsfile= paths, but NOT dhcp-host= lines from the main conf. Reload() was faithfully rewriting dnsmasq.conf with the new MAC, sending SIGHUP, and then dnsmasq kept serving its startup view — so a freshly-registered host still showed up as "proxy-ignored, tags: eth0" with no "known" tag. Split the allowlist into ${RuntimeDir}/dhcp-hosts, referenced from the main conf via dhcp-hostsfile=. writeConf() is static-ish now; Reload just rewrites the hosts file and SIGHUPs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:41:27 -04:00
josh	bce6e08524	pxe: reload dnsmasq on host create/delete CI / Lint + build + test (push) Successful in 1m54s Details Release / release (push) Successful in 2m36s Details pxe.Supervisor.Reload() was defined but never wired up. After a host was registered in the UI or via the quick-register JSON endpoint, the dnsmasq conf still held only the hosts that existed at orchestrator startup. The new MAC wasn't tagged `known`, so when the host PXE'd, dnsmasq logged "PXE(eth0) <mac> proxy-ignored" and the boot timed out back to the BIOS. Add an optional PXEReloader interface to api.UI, wire it from main when pxe is enabled, and call u.reloadPXE() after successful Create and Delete. Logs-and-continues on failure — host registration itself has already committed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:31:00 -04:00
josh	157b70f536	pxe: split subnet into network+netmask for dnsmasq proxy-DHCP CI / Lint + build + test (push) Successful in 2m0s Details Release / release (push) Successful in 3m35s Details dnsmasq's proxy-DHCP syntax is `dhcp-range=<network-ip>,proxy[,<mask>]`, not a CIDR. Passing "192.168.1.0/24,proxy" made dnsmasq refuse to start with "bad dhcp-range at line 12". Parse the CIDR once in writeConf() and render Network + Netmask as separate template fields. The config surface (pxe.subnet) stays CIDR because that's the right shape for humans; we just unpack it before handing to dnsmasq. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:17:10 -04:00
josh	506c856046	pxe: switch dnsmasq to proxy-DHCP mode on the LAN CI / Lint + build + test (push) Successful in 1m48s Details Release / release (push) Successful in 2m22s Details Previously the orchestrator ran a full DHCP server on a dedicated br-vetting bridge (10.77.0.0/24), which required a hypervisor-level bridge + physical cabling onto that bridge for every repaired host. Real-world bite: the LXC's br-vetting had no L2 path to the target host's PXE NIC, so DHCPDISCOVERs never reached eth1 and PXE silently timed out. dnsmasq's proxy-DHCP mode is the idiomatic answer: it coexists with the LAN's existing DHCP server (UniFi, etc.), never assigns an IP itself, and only supplements the PXE options. No dedicated bridge, no VLAN, no cabling changes — dnsmasq binds to the LAN interface and layers option 66/67 + the PXE BINL on top of the real DHCP exchange. The MAC allowlist still gates replies, so random LAN clients booting from network get nothing. Template switches dhcp-range=<start,end,lease> to dhcp-range=<cidr>,proxy and replaces dhcp-boot= for first-boot ROM clients with pxe-service= directives (the correct proxy-mode chainload form). Validation drops the dhcp_range regex for a net.ParseCIDR check on pxe.subnet. Config, production/example yaml, and pxe-setup.sh swap --dhcp-range for --subnet. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:02:49 -04:00
josh	6a1d5c3bed	pxe: route dnsmasq lease + pid files into RuntimeDir CI / Lint + build + test (push) Successful in 1m39s Details Release / release (push) Successful in 2m24s Details Without explicit dhcp-leasefile and pid-file, dnsmasq reaches for its distro defaults (/var/lib/misc/dnsmasq.leases, /run/dnsmasq.pid) — both outside the systemd unit's ReadWritePaths=/var/lib/vetting /var/log/vetting sandbox, causing 'Read-only file system' on every start. RuntimeDir is already writable by construction (Supervisor.Start mkdir's it), so writing both files there keeps dnsmasq entirely inside the sandbox. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 11:31:37 -04:00
josh	a5055b3c7a	Automate PXE setup: release bundle + pxe-setup.sh + startup validation CI / Lint + build + test (push) Has been cancelled Details Collapses the LXC side of PXE enablement from a six-step manual dance (build, fetch iPXE, scp, bridge, hand-edit yaml) into: make release # dev box (Linux/WSL) scp bundle.tar.gz lxc:/tmp/ sudo ./install.sh # base install, unchanged sudo ./pxe-setup.sh --interface ... --dhcp-range ... --orchestrator-url ... pxe-setup.sh fetches iPXE from boot.ipxe.org, verifies against pinned SHA256s in deploy/ipxe-shas.txt (fail-closed), places vmlinuz/initrd.img from the bundle, and rewrites only the pxe: block of vetting.yaml. Idempotent; --force gates overwriting a hand-edited block. Adds Supervisor.Validate() — called before dnsmasq spawn — so typo'd configs fail at orchestrator startup with clear errors naming the missing file or yaml key, instead of silently serving broken TFTP until a real host tries to PXE-boot. Nine tests cover missing files, bogus interface, malformed dhcp_range, bad orchestrator_url, and aggregate reporting. Hypervisor bridge creation stays documented (LXC can't do it) but everything downstream of the bridge is now scripted. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 01:38:43 -04:00
josh	d245fa6235	quick.sh: stage+install agent to avoid ETXTBSY, restart service CI / Lint + build + test (push) Failing after 5m17s Details Re-running quick.sh on a host where vetting-reporter was already running failed with curl error 23 because curl can't overwrite a busy executable. Download to a staging path, then use `install(1)` which unlinks the target before writing. Swap `enable --now` for `enable` + `restart` so the service picks up the new binary. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 01:13:14 -04:00
josh	d0bfae14c8	Heartbeat-first dispatch: retire WoL-as-default, add WaitingReboot CI / Lint + build + test (push) Has been cancelled Details Every supported host runs vetting-reporter in-OS and heartbeats every 30s. WoL was never the thing that started vetting — the heartbeat response's reboot_for_vetting command was. Firing WoL first only crowded the run log with misleading diagnostics when the real failure mode is "reporter isn't installed." - StartRun 409s if the host hasn't heartbeated within 60s, pointing the operator at /register/quick.sh. - Dispatcher re-checks LastSeenAt at dispatch time (run may sit in Queued long enough for the host to go offline); stale hosts mark the run Failed with failed_stage=dispatch instead of looping. - New StateWaitingReboot + TriggerRebootCommanded capture the actual semantics. StateWaitingWoL kept as the hook point for a future manual-override button. - Tile disables the Start button with a quick.sh tooltip when the host is offline, matching the server-side 409. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 01:10:34 -04:00
josh	c9927ca2bf	config: default agent.asset_dir so old configs still serve /assets CI / Lint + build + test (push) Failing after 5m12s Details Operators who installed vetting before agent.asset_dir existed keep their config preserved by install.sh on upgrade, which left them with AssetDir="" — the router silently dropped the /assets/* mount and the quick-register one-liner hit 404 fetching the agent binary. Default AssetDir alongside the database file so the same directory install.sh already creates + drops the agent binary into is picked up automatically. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 00:41:09 -04:00
josh	1694c20b12	Host detail v2: full pipeline + per-stage logs + WoL diagnostics CI / Lint + build + test (push) Has been cancelled Details Pipeline now always renders all 13 nodes (3 pre-stage + 9 stage + Completed), synthesising ghosts from run state when stage rows aren't seeded yet. Makes a WaitingWoL host show the full timeline ahead of it instead of just 4 dots. Agent tags each log line with its stage; logs.Hub fans out to both log-{runID} and log-{runID}-{stage} SSE events so the detail page can show per-stage tabs with a pure-CSS radio-sibling switch. Flat run log prepends [stage] so grep still works. Dispatcher writes picked/sent-WoL/heartbeat lines into the per-run log — the operator opens the detail page, sees WaitingWoL stuck, and reads exactly what the dispatcher did and why nothing's progressing, instead of having to tail journalctl on the LXC. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 00:38:27 -04:00
josh	bb658a8435	Host detail page + pipeline timeline CI / Lint + build + test (push) Has been cancelled Details Click a tile to open /hosts/{id} — the canonical control surface per host. Timeline renders every pre-stage, stage, and terminal node in order, with the current one pulsing, failed ones flagged, and downstream ones dimmed as skipped. Detail page shows summary, hold card (when holding), all action buttons, spec diffs, a full-height log pane, and a collapsed expected-spec YAML. Tile slims to name, last-seen, status, and one primary action; a CSS-overlay <a> makes the whole card clickable while buttons stay receptive via z-index. Runner.publishTileUpdate now also emits pipeline-{runID} fragments, and CompleteStage wraps Stages.CompleteByName so stage completions advance the timeline live — without this the dots only moved on state transitions. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-17 23:59:43 -04:00
josh	9b16ed80e6	Heartbeat command channel: reboot_for_vetting skips WoL CI / Lint + build + test (push) Failing after 5m13s Details When the operator clicks Start vetting and the host is heartbeating, the heartbeat response now carries cmd=reboot_for_vetting + run_id. The handler drives the Queued → WaitingWoL transition via the existing state machine, so a benign race with the 2s dispatcher poll is refused by the state machine (not double-dispatched). WaitingWoL retries for 10 minutes to cover a crashed-mid-reboot case, then falls back to operator action. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-17 23:37:01 -04:00
josh	a0c0fb114f	Add host-mode heartbeat: vetting-agent host + last-seen badge CI / Lint + build + test (push) Has been cancelled Details vetting-agent gains a `host` subcommand that runs as a systemd service installed by the quick-register one-liner, POSTing every 30s to /api/v1/hosts/{mac}/heartbeat so the dashboard tile shows "online" or "Nm ago" without waiting on WoL. Ships dormant client code for the Phase 2 reboot_for_vetting command so the server can flip it on later without a binary redeploy. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-17 23:34:15 -04:00
josh	d24207427f	Fix quick-register broadcast detection on Proxmox bridges CI / Lint + build + test (push) Failing after 5m17s Details Two bugs compounded on Proxmox hosts: primary_iface walked `ip link show` and picked the physical NIC (e.g. enp1s0), which has no IPv4 on Proxmox because the address lives on vmbr0. Even if vmbr0 had been picked, the kernel reports its broadcast as 0.0.0.0, so the script fell all the way back to 255.255.255.255. Now we prefer the default-route interface (vmbr0 on Proxmox, eno1 on bare metal) and, when `ip` doesn't surface a usable `brd`, compute the broadcast from the inet CIDR instead of giving up.	2026-04-17 22:57:49 -04:00
josh	8b3d9a312e	Add quick-register one-liner for target-host registration CI / Lint + build + test (push) Failing after 5m15s Details Operator pastes `curl -fsSL $ORCH/register/quick.sh | sudo bash` on the target host (pre-wipe). The script probes MAC + CPU/RAM/disks/NICs/GPUs, emits an expected-spec YAML, and POSTs to a new LAN-trusted JSON endpoint /api/v1/hosts. The register page shows the command prefilled with the orchestrator URL; the manual form moves into a collapsible "Register manually" disclosure.	2026-04-18 22:50:54 -04:00
josh	42da48864f	Remove operator auth — trust the LAN CI / Lint + build + test (push) Failing after 5m15s Details Can't log in from a fresh LXC deploy, and the service is LAN-only by design. Rip out the whole bcrypt-password / signed-cookie session layer: internal/auth, login templates, gen-admin-password binary + Makefile targets, auth config block, login/logout routes and the RequireSession middleware wrap. Agent bearer-token auth on /api/v1/runs/{id}/* is untouched. Operators who want a password can front the service with a reverse proxy — noted in README and docs/operations.md.	2026-04-17 22:31:49 -04:00
josh	9bb4b09a04	Initial commit: full Phases 1-6 implementation CI / Lint + build + test (push) Has been cancelled Details Post-repair hardware validation pipeline for Proxmox cluster hosts. Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.	2026-04-17 21:32:10 -04:00
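The RAM-pass sizing described in commit 27098fc7ed (cap --vm-bytes to MemAvailable minus 1.5 GiB, with a 256 MiB floor) can be sketched roughly as below. Only the name parseMemAvailable comes from the commit message; its body, the capVMBytes helper, and the constants are assumptions made for illustration and are not taken from the repository.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

const (
	kib          = uint64(1024)
	mib          = 1024 * kib
	gib          = 1024 * mib
	headroom     = 3 * gib / 2 // 1.5 GiB left for the agent and the OS
	floorVMBytes = 256 * mib   // never ask for less than this
)

// parseMemAvailable returns the MemAvailable value from /proc/meminfo-style
// text, converted from kB to bytes. It errors if the field is missing or
// not a number. (Hypothetical body; only the name appears in the commit.)
func parseMemAvailable(meminfo string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 || fields[0] != "MemAvailable:" {
			continue
		}
		kb, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return 0, fmt.Errorf("malformed MemAvailable %q: %w", fields[1], err)
		}
		return kb * kib, nil
	}
	return 0, errors.New("MemAvailable not found")
}

// capVMBytes (hypothetical helper) subtracts the headroom from the available
// memory and falls back to the floor on small machines.
func capVMBytes(memAvailable uint64) uint64 {
	if memAvailable <= headroom+floorVMBytes {
		return floorVMBytes
	}
	return memAvailable - headroom
}

func main() {
	sample := "MemTotal:        8000000 kB\nMemAvailable:    7340032 kB\n"
	avail, err := parseMemAvailable(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("--vm-bytes %dM\n", capVMBytes(avail)/mib) // prints "--vm-bytes 5632M"
}
```

With 7 GiB available the cap works out to 5.5 GiB; a box reporting 1 GiB available would get the 256 MiB floor instead of a negative or tiny allocation.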

24 Commits
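
Commit 157b70f536 notes that dnsmasq's proxy-DHCP range takes a network address and a netmask rather than a CIDR. A minimal sketch of that unpacking using Go's net.ParseCIDR follows; the function name proxyDHCPRange and the IPv4-only check are assumptions for illustration, not the repository's writeConf() code.

```go
package main

import (
	"fmt"
	"net"
)

// proxyDHCPRange (hypothetical helper) turns a CIDR such as "192.168.1.0/24"
// into a dnsmasq line of the form dhcp-range=<network-ip>,proxy,<mask>.
func proxyDHCPRange(cidr string) (string, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return "", fmt.Errorf("pxe.subnet: %w", err)
	}
	if ipnet.IP.To4() == nil {
		// Assumption: this sketch only handles IPv4 subnets.
		return "", fmt.Errorf("pxe.subnet: %q is not an IPv4 subnet", cidr)
	}
	// ipnet.IP is already masked to the network address; the mask is
	// rendered in dotted-quad form by viewing its bytes as an IP.
	mask := net.IP(ipnet.Mask).String()
	return fmt.Sprintf("dhcp-range=%s,proxy,%s", ipnet.IP.String(), mask), nil
}

func main() {
	line, err := proxyDHCPRange("192.168.1.0/24")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(line) // prints "dhcp-range=192.168.1.0,proxy,255.255.255.0"
}
```

Keeping the config key in CIDR form and converting only at render time matches the commit's stated intent: the human-facing value stays compact, while dnsmasq receives the two fields it expects.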