Vetting

Author	SHA1	Message	Date
josh	27098fc7ed	cpustress+orchestrator: serial CPU/RAM passes + silent-skip guard CI / Lint + build + test (push) Successful in 1m23s Details Release / release (push) Successful in 6m2s Details Orion's run (log 20:49 → 20:54) shipped GREEN while silently skipping CPUStress. Two compounding bugs: 1. CPUStress ran --cpu N AND --vm N --vm-bytes 90% concurrently. On a 4-core 8 GiB N95, that's 360% RAM overcommit; the OOM-killer fired, usually on the agent itself. Replaced with two sequential passes — CPU (all methods, --verify) for 3 min, then RAM (--vm 1, --vm-bytes capped to MemAvailable − 1.5 GiB, floor 256 MiB, --verify) for 3 min. Each pass now also asserts elapsed ≥ target − 2s so a premature clean exit counts as failure instead of a silent pass. 2. On systemd-restart after the OOM, the agent hardcoded nextStage := "Inventory" and re-ran it. The orchestrator's /result handler advances run state via TriggerStageCompleted against the current RunState, not against body.Stage — so an Inventory result posted while the run was in StateCPUStress silently advanced CPUStress → Storage and marked CPUStress passed without it ever running. Two-layer defense for #2: - agent-side: /claim response now carries current_state; agent resumes at the matching stage on a re-claim (happy path). - server-side: new TriggerStageMismatch + StageNameForState helper backstop. If body.Stage doesn't match the run's current stage, /result parks the run in FailedHolding with failed_stage labeled "<got> (expected <expected>)" and returns 409. Other stages audited for similar unbounded concurrency — none found; only CPUStress was unsafe. Tests: - cpustress_test.go — parseMemAvailable parses real meminfo, errors on missing/malformed; cap calc hits floor on tiny boxes, uses 1.5 GiB headroom on normal/huge boxes. - statemachine_test.go — TriggerStageMismatch lands at FailedHolding from every stage state and is rejected from pre-stage/terminal states; StageNameForState round-trips the stageStates map. - agent_handlers_test.go — TestResult_RejectsMismatchedStage proves the Orion scenario now 409s + FailedHolding; TestResult_AcceptsMatchingStage proves the guard doesn't break the happy path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 17:29:13 -04:00
josh	cdd6cae3b0	ui: keep detail-page SSE swaps live after the first outerHTML replace CI / Lint + build + test (push) Successful in 1m28s Details Release / release (push) Successful in 6m29s Details Pipeline fragment payload was a bare <div class=pipeline>, but the sse-swap=pipeline-N wrapper lived only in the page shell. The first outerHTML swap destroyed the wrapper, so every subsequent pipeline event had nothing to target — forcing a manual refresh. RenderPipelineString now emits the full <section id=pipeline-N sse-swap=... hx-swap=outerHTML> wrapper, used from both the shell and the orchestrator publish path. Also drop the red-bar styling from the empty DetailHold placeholder: the wrapper's detail-hold class was painting an unconditional red band between Pipeline and Actions whenever no hold was active.	2026-04-18 17:03:39 -04:00
josh	e73e31af92	live-image: install stage tools and fail loudly if any are missing CI / Lint + build + test (push) Successful in 1m32s Details Release / release (push) Successful in 6m28s Details The live image was still carrying the Phase 2 package list, so SMART, CPUStress, and Network each hit a LookPath miss and returned pass-with-skip. A run that skipped every real check still ended in "completed" — nothing on the report said the image was broken. Add smartmontools, stress-ng, fio, iperf3, lshw, lm-sensors, e2fsprogs, and util-linux to mkosi.conf. Flip the three stages from skip-pass to fail when their binary is missing so any future packaging regression blocks the run instead of whispering past it. Legitimate "no hardware" skips (no GPU, no hwmon, no disks, non-destructive) are untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 16:39:28 -04:00
josh	0db790ae3e	ui: stream host-detail fragments over SSE so the page updates live CI / Lint + build + test (push) Successful in 1m29s Details Release / release (push) Has been cancelled Details The detail page was only partly live: Pipeline + LogTabs subscribed to SSE, but the summary header, actions row, spec-diffs list and hold-key block all froze at page-load and required a manual refresh to catch up with state changes. Extract each of those four regions into its own named templ component with a stable id and sse-swap target, add Render*String helpers so the orchestrator can publish pre-rendered fragments, and register a HostDetailRenderer alongside the existing Tile/Pipeline renderers. PublishHostDetail is folded into publishTileUpdate so every call site that already refreshes a tile now also refreshes the detail page — keeps the fan-out honest without scattering new publish calls. The empty-state wrappers for spec-diffs and hold are load-bearing: without the <section id=... sse-swap=...> present at initial GET, the first live event after SpecValidate or Hold writes would have no DOM node to swap into. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 16:36:13 -04:00
josh	5e9ad7f569	probes: sanitize disk serials and normalize GPU model for stable spec keys CI / Lint + build + test (push) Successful in 1m25s Details Release / release (push) Successful in 5m38s Details Two related bugs were producing different map keys for identical hardware depending on whether the inventory probe ran in the reporter on the Proxmox host or in the live-image agent after PXE boot. 1. diskSerial read /sys/block/<dev>/device/{serial,vpd_pg80} and only TrimSpace'd the result. vpd_pg80 is a binary SCSI VPD page with a 4-byte header, and some SSDs leak NUL/control bytes into the text serial file. Those bytes survive into the Go string, lowercase unchanged, and become a garbage map key that the reporter's cleaner read can't match. Sanitize to ASCII-printable range at ingest. 2. probeGPUs built the model slug from fields[2] + " " + fields[3] of `lspci -mm -nnk` output. fields[3] is subsystem vendor/device info, which varies between otherwise-identical cards and carries the `-rXX` revision marker — stable-enough for display but not for identity. Use fields[2] alone, strip the trailing `[NNNN]` PCI device-ID that lspci -nn appends, and sanitize for consistency. After deploying the new orchestrator + re-running the configure step on each registered host, SpecValidate will match cleanly. Disk diffs self-resolve because the reporter already stored clean serials; GPU diffs need one reporter re-run because the old expected slug still carries subsystem noise.	2026-04-18 16:06:18 -04:00
josh	d48cf146f4	live-image: mask systemd-firstboot at image-build time CI / Lint + build + test (push) Successful in 1m24s Details Release / release (push) Successful in 5m53s Details Belt-and-braces for the kernel-cmdline systemd.firstboot=off fix. mkosi ships /etc/machine-id empty, which triggers firstboot's interactive locale/timezone/root-password prompt on every PXE boot; with the agent running unattended there's nobody to answer and sysinit.target blocks indefinitely. Mask via a /dev/null symlink in /etc/systemd/system so the service is unstartable regardless of cmdline — rules out the failure mode where an older orchestrator binary serves an iPXE script without the off-switch arg.	2026-04-18 15:41:46 -04:00
josh	026923075c	pxe: disable systemd-firstboot so the live image doesn't prompt CI / Lint + build + test (push) Successful in 1m22s Details Release / release (push) Has been cancelled Details systemd-firstboot.service is an interactive wizard that asks for locale, timezone, and root password when /etc/machine-id isn't populated — i.e. every PXE boot of a mkosi-built image. It sits on sysinit.target waiting for input that will never arrive, blocking the agent service and every other downstream unit indefinitely. systemd.firstboot=off on the kernel cmdline is the documented kill switch; no image-side changes needed.	2026-04-18 15:35:24 -04:00
josh	956120b80e	deploy: show speed + ETA in bundle-download progress meter CI / Lint + build + test (push) Successful in 1m24s Details Release / release (push) Successful in 5m30s Details Drop --progress-bar (curl's minimal hash meter) in favor of the default progress output, which includes transfer rate and time remaining. Bundles grew from ~30 MB to ~300 MB with the full-rootfs initrd, and a percentage-only bar with no speed hint makes a slow registry look indistinguishable from a hang.	2026-04-18 15:04:26 -04:00
josh	c45349f62c	pxe: mask serial-getty@ttyS0 so hosts without serial don't wait 90s CI / Lint + build + test (push) Successful in 1m47s Details Release / release (push) Successful in 5m16s Details systemd-getty-generator reads console=ttyS0 off the kernel cmdline and auto-creates serial-getty@ttyS0.service, which BindsTo dev-ttyS0.device. On hardware without a physical serial port the device node never shows up, systemd waits its full default 90s timeout, and only then proceeds. systemd.mask= on the kernel cmdline is a first-class option — masks the unit before the generator's link even gets activated. Kernel messages still go to ttyS0 if a port is present; we just don't try to spawn a login prompt there.	2026-04-18 14:47:03 -04:00
josh	a88e24bef4	live-image: real /init + verbose boot for first-boot diagnosis CI / Lint + build + test (push) Successful in 1m23s Details Release / release (push) Successful in 4m49s Details Host boots past kernel init and then stalls silently. ACPI DSDT error about TXHC.RHUB.SS01 is benign noise (Tiger Lake firmware bug) — the actual problem is that nothing between kernel handoff and (maybe) systemd is visible on the console. Two changes: 1. Replace the /init → sbin/init symlink with a real shell script (live-image/mkosi.extra/init) that mounts /proc /sys /dev /dev/pts /dev/shm /run before execing systemd. Systemd has fallback mount code for these, but when it fails the failure is silent. Doing it explicitly in /init keeps failures visible and avoids the fragile symlink-resolution trick. 2. Drop 'quiet' from the kernel cmdline and add loglevel=7 plus systemd.log_target=kmsg + journald.forward_to_console=1 so every early-boot message reaches both tty0 and ttyS0. Will be dialed back once boot is stable. Also: .gitattributes pins LF on live-image/, .gitea/, Makefile, and *.sh so Windows checkouts don't break shell scripts and Makefile recipes with CRLF. /init also gets chmod 0755 in repack-initrd as a belt-and-braces against mode loss on non-Linux checkouts.	2026-04-18 14:31:40 -04:00
josh	43ea845ac0	live-image: pack full rootfs as initrd so PXE actually boots userspace CI / Lint + build + test (push) Successful in 1m54s Details Release / release (push) Successful in 5m10s Details update-initramfs produces a boot stub (~50 MB) that expects to mount a separate rootfs over squashfs/disk/NFS. Our PXE channel only ships vmlinuz+initrd.img, so the stub had nothing to pivot to — kernel finished hand-off and the system wedged with firmware, modules, and userspace stranded in the 545 MB rootfs dir we never delivered. Replace with an everything-in-initramfs build: cpio.zst the full rootfs (minus /boot) as the initrd, add /init -> sbin/init for the kernel's runtime entrypoint, materialize the kernel symlink into a real file. Bump check-initrd floor to 200 MB and switch the firmware grep from unmkinitramfs (boot-stub-specific) to zstd | cpio -t. Also add cpio to the CI apt deps.	2026-04-18 14:14:08 -04:00
josh	6c6d20710f	live-image: fix check-initrd size measurement; add zstd to image CI / Lint + build + test (push) Successful in 1m28s Details Release / release (push) Failing after 4m10s Details Previous run actually built the 518 MB rootfs with firmware-misc-nonfree et al. installed — the real payload is working. Two follow-ups: - check-initrd was reading stat on a symlink path and getting 30 bytes (the symlink's own size), not the 6.1.0-44-amd64 kernel initrd it points to. Switched to wc -c, which follows symlinks, and to du -hL for the OK message. - Add zstd to Packages= so COMPRESS=zstd in initramfs.conf can be honored; without it update-initramfs falls back to gzip with a "No zstd in PATH" warning. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 14:00:07 -04:00
josh	0a5e5d0b39	ci: add bubblewrap dep and bump mkosi to v25.3 CI / Lint + build + test (push) Successful in 1m31s Details Release / release (push) Failing after 3m47s Details v24.3 crashed in cp_version() during the copy-package-manager-trees step because its sandbox needs bubblewrap (not present in the runner apt list), and cp --version returned empty output inside the broken sandbox. Installing bubblewrap and bumping to v25.3 which has tighter sandbox fallback behavior. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 13:53:09 -04:00
josh	488a0d1052	ci: install mkosi from upstream git tag, not PyPI Release / release (push) Failing after 1m54s Details CI / Lint + build + test (push) Has been cancelled Details Previous commit pinned mkosi==24.3 via pip but mkosi isn't published on PyPI past ancient versions — the runner hit "Could not find a version that satisfies the requirement mkosi==24.3". Install from the upstream git tag v24.3 instead; added git to the apt dep list for pip's VCS fetch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 13:44:51 -04:00
josh	28918bad15	live-image: fix firmware so i915 actually loads at boot CI / Lint + build + test (push) Successful in 1m35s Details Release / release (push) Failing after 22s Details Previous attempt (`c962d6d`) added firmware-linux-nonfree to mkosi.conf, but the CI bundle was still 63 MB and Tiger Lake wedged on tgl_guc. Two reasons: (1) firmware-linux-nonfree on bookworm is a thin metapackage that doesn't include firmware-misc-nonfree, which is where i915 GuC/HuC blobs actually live; (2) Ubuntu's apt-packaged mkosi is old enough that Repositories=non-free-firmware shorthand likely isn't wired through to the debootstrap invocation, so firmware packages silently miss the bootstrap step entirely. Changes: - Enumerate firmware packages explicitly in mkosi.conf (firmware-misc-nonfree, firmware-iwlwifi, firmware-realtek, firmware-amd-graphics, firmware-intel-sound, intel/amd64-microcode). - Ship mkosi.sources.d/debian.sources with explicit deb822 so the non-free-firmware component is unambiguously available. - Install mkosi 24.3 via pip in CI instead of apt's older build. - Pin MODULES=most and COMPRESS=zstd via a tracked initramfs-tools config under mkosi.extra/. - Narrow .gitignore so only the generated agent binary is ignored, not the whole mkosi.extra/ tree. - New check-initrd Makefile target asserts both size (>=150 MB) and actual presence of i915/tgl_guc_*.bin inside the built initrd, so a silent firmware-drop regression fails the build loudly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 13:38:40 -04:00
josh	c962d6d8ab	live-image: bundle nonfree firmware (i915 GuC et al.) CI / Lint + build + test (push) Successful in 2m19s Details Release / release (push) Successful in 3m28s Details Tiger Lake and later Intel iGPUs need i915/tgl_guc_*.bin; without it the i915 init wedges and floods the console. Same story on most modern wifi/NIC hardware. Pull firmware-linux-nonfree (metapackage covering misc-nonfree, iwlwifi, realtek, amd-graphics, …) from the bookworm non-free-firmware repo — single line fix, ~500MB cost to the squashfs, worth it for booting arbitrary repaired hosts. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 13:14:19 -04:00
josh	4524ab8dc0	runs: add non-destructive flag + operator Cancel button CI / Lint + build + test (push) Successful in 2m5s Details Release / release (push) Successful in 3m5s Details Non-destructive pre-declares "don't touch the disks" on Start: the Storage stage skips wipe-probe, badblocks -w, and write-mode fio, and reports a read-only summary. Runs gain a new non_destructive column, threaded through Claim → agent tests.Deps → Storage stage. Cancel halts an in-flight run. The orchestrator transitions to a new StateCancelled via TriggerOperatorCancelled (valid from any active state); the agent's next heartbeat returns cmd=cancel_stage, which fires a stored CancelFunc on the per-stage context. Stage subprocesses spawned with exec.CommandContext die with the context, the agent posts a cancelled outcome, then powers the host off. Destructive stages mid-run may leave the host in an intermediate state — the UI confirm dialog warns the operator; recovery is manual for now. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 13:01:42 -04:00
josh	2c440fce8a	pxe: move dhcp-host allowlist into a SIGHUP-reloadable file CI / Lint + build + test (push) Successful in 1m38s Details Release / release (push) Successful in 2m25s Details dnsmasq's SIGHUP re-reads /etc/ethers and any --dhcp-hostsfile= paths, but NOT dhcp-host= lines from the main conf. Reload() was faithfully rewriting dnsmasq.conf with the new MAC, sending SIGHUP, and then dnsmasq kept serving its startup view — so a freshly-registered host still showed up as "proxy-ignored, tags: eth0" with no "known" tag. Split the allowlist into ${RuntimeDir}/dhcp-hosts, referenced from the main conf via dhcp-hostsfile=. writeConf() is static-ish now; Reload just rewrites the hosts file and SIGHUPs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:41:27 -04:00
josh	bce6e08524	pxe: reload dnsmasq on host create/delete CI / Lint + build + test (push) Successful in 1m54s Details Release / release (push) Successful in 2m36s Details pxe.Supervisor.Reload() was defined but never wired up. After a host was registered in the UI or via the quick-register JSON endpoint, the dnsmasq conf still held only the hosts that existed at orchestrator startup. The new MAC wasn't tagged `known`, so when the host PXE'd, dnsmasq logged "PXE(eth0) <mac> proxy-ignored" and the boot timed out back to the BIOS. Add an optional PXEReloader interface to api.UI, wire it from main when pxe is enabled, and call u.reloadPXE() after successful Create and Delete. Logs-and-continues on failure — host registration itself has already committed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:31:00 -04:00
josh	157b70f536	pxe: split subnet into network+netmask for dnsmasq proxy-DHCP CI / Lint + build + test (push) Successful in 2m0s Details Release / release (push) Successful in 3m35s Details dnsmasq's proxy-DHCP syntax is `dhcp-range=<network-ip>,proxy[,<mask>]`, not a CIDR. Passing "192.168.1.0/24,proxy" made dnsmasq refuse to start with "bad dhcp-range at line 12". Parse the CIDR once in writeConf() and render Network + Netmask as separate template fields. The config surface (pxe.subnet) stays CIDR because that's the right shape for humans; we just unpack it before handing to dnsmasq. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:17:10 -04:00
josh	cf3a75591c	install: stage pxe-setup.sh at /usr/local/sbin/vetting-pxe-setup CI / Lint + build + test (push) Successful in 1m36s Details Release / release (push) Successful in 2m29s Details proxmox-install.sh tarball-extracts into a tempdir that gets wiped on EXIT, so after the one-liner there's no pxe-setup.sh on disk for the operator to run. Have install.sh drop the script + ipxe-shas.txt into /usr/local/share/vetting/ and symlink it as /usr/local/sbin/vetting-pxe-setup (in PATH). pxe-setup.sh now readlink -f's BASH_SOURCE so the symlink resolves to the share dir where ipxe-shas.txt lives, and gracefully handles the case where install.sh already staged vmlinuz + initrd.img into LIVE_DIR (no bundle live-image/ needed at that point). Update the trailing hint in proxmox-install.sh and the operations runbook to surface the new `sudo vetting-pxe-setup ...` command. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:10:23 -04:00
josh	bcbbc35489	docs+e2e: document proxy-DHCP topology; default e2e bridge to LAN CI / Lint + build + test (push) Successful in 1m37s Details Release / release (push) Has been cancelled Details Rewrites the PXE section of the ops runbook around the new proxy-DHCP model (no dedicated bridge, coexists with UniFi/pfSense/etc.) and swaps the e2e test's default bridge + orchestrator URL to match. The e2e file now calls out the LAN-DHCP precondition in its header so future-me (or CI) doesn't hang at PXE wondering why nothing answers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:07:05 -04:00
josh	506c856046	pxe: switch dnsmasq to proxy-DHCP mode on the LAN CI / Lint + build + test (push) Successful in 1m48s Details Release / release (push) Successful in 2m22s Details Previously the orchestrator ran a full DHCP server on a dedicated br-vetting bridge (10.77.0.0/24), which required a hypervisor-level bridge + physical cabling onto that bridge for every repaired host. Real-world bite: the LXC's br-vetting had no L2 path to the target host's PXE NIC, so DHCPDISCOVERs never reached eth1 and PXE silently timed out. dnsmasq's proxy-DHCP mode is the idiomatic answer: it coexists with the LAN's existing DHCP server (UniFi, etc.), never assigns an IP itself, and only supplements the PXE options. No dedicated bridge, no VLAN, no cabling changes — dnsmasq binds to the LAN interface and layers option 66/67 + the PXE BINL on top of the real DHCP exchange. The MAC allowlist still gates replies, so random LAN clients booting from network get nothing. Template switches dhcp-range=<start,end,lease> to dhcp-range=<cidr>,proxy and replaces dhcp-boot= for first-boot ROM clients with pxe-service= directives (the correct proxy-mode chainload form). Validation drops the dhcp_range regex for a net.ParseCIDR check on pxe.subnet. Config, production/example yaml, and pxe-setup.sh swap --dhcp-range for --subnet. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:02:49 -04:00
josh	b809bf5f3e	proxmox-install: show download progress bar for the bundle fetch CI / Lint + build + test (push) Successful in 1m37s Details Release / release (push) Successful in 2m44s Details -fsSL suppresses all output during the ~30 MB download, which leaves the operator staring at 'fetching bundle...' for up to a minute on a cold registry. Drop -s and add --progress-bar so there is a live indicator; keep -fL so we still fail on HTTP errors and follow redirects. Print the downloaded size alongside 'extracting' for quick sanity-checking. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 11:43:51 -04:00
josh	6a1d5c3bed	pxe: route dnsmasq lease + pid files into RuntimeDir CI / Lint + build + test (push) Successful in 1m39s Details Release / release (push) Successful in 2m24s Details Without explicit dhcp-leasefile and pid-file, dnsmasq reaches for its distro defaults (/var/lib/misc/dnsmasq.leases, /run/dnsmasq.pid) — both outside the systemd unit's ReadWritePaths=/var/lib/vetting /var/log/vetting sandbox, causing 'Read-only file system' on every start. RuntimeDir is already writable by construction (Supervisor.Start mkdir's it), so writing both files there keeps dnsmasq entirely inside the sandbox. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 11:31:37 -04:00
josh	9d17859992	orchestrator: anchor pxe+tftp runtime dirs under artifacts parent CI / Lint + build + test (push) Successful in 1m38s Details Release / release (push) Successful in 2m43s Details Previously tftp_root defaulted to logs.dir/../tftp and the pxe runtime dir to logs.dir/../pxe. On a production install that resolves to /var/log/tftp and /var/log/pxe, both outside the systemd unit's ReadWritePaths=/var/lib/vetting /var/log/vetting sandbox. The service crash-looped with "mkdir /var/log/pxe: read-only file system" as soon as PXE was enabled. Switch the anchor to filepath.Dir(cfg.Artifacts.Dir) — typically /var/lib/vetting — which the sandbox already allows. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 11:14:11 -04:00
josh	caebd00d8d	live-image: symlink /initrd.img to match /vmlinuz CI / Lint + build + test (push) Successful in 1m47s Details Release / release (push) Successful in 2m14s Details The linux-image-amd64 postinst creates /vmlinuz but the paired /initrd.img symlink only shows up via an initramfs-tools hook that doesn't fire when we call update-initramfs ourselves. Without it, the top-level Makefile's `cp live-image/build/initrd.img` fails and `make release` aborts with a broken bundle. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 10:54:25 -04:00
josh	41a273b47f	live-image: generate initrd explicitly; fail release on missing files CI / Lint + build + test (push) Successful in 1m47s Details Release / release (push) Failing after 2m28s Details Two bugs chained together to ship a broken bundle: 1. With Bootable=no, mkosi skips update-initramfs, so no /boot/initrd.img-<kver> ever gets generated inside the rootfs. The postinst now runs update-initramfs via chroot to produce it. 2. The `make release` recipe chained its `cp` calls with `;`, so a missing live-image/build/initrd.img silently failed and the bundle still got tarred + uploaded. Adding `set -e` at the top of the recipe makes any missing component fail the build loudly instead of shipping a half-bundle. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 10:47:26 -04:00
josh	f927a4a66b	install.sh: stage live image and auto-restart on upgrade CI / Lint + build + test (push) Successful in 1m38s Details Release / release (push) Successful in 1m45s Details Single-command upgrades were leaving /var/lib/vetting/live/ stale on PXE-enabled LXCs because install.sh explicitly punted live-image staging to pxe-setup.sh. That was right when make-release ran on a dev box, but the new registry-pull flow ships vmlinuz+initrd.img inside the bundle — they should land in place during every install. install.sh now: - auto-detects live-image/{vmlinuz,initrd.img} (release bundle layout) or ../live-image/build/ (repo dev checkout) and stages them into --live-dir (default /var/lib/vetting/live). - restarts vetting.service when already enabled, so the curl | sudo bash one-liner is the full upgrade loop. First-install path still leaves the service stopped for config edits. pxe-setup.sh's own live-image copy is now redundant on upgrade but still runs for first-time PXE setup (it also writes the pxe: block of vetting.yaml, which install.sh has no business touching). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 10:38:34 -04:00
josh	5aa245cd85	live-image: disable mkosi Bootable (PXE doesn't need a bootloader) CI / Lint + build + test (push) Successful in 1m36s Details Release / release (push) Successful in 1m56s Details mkosi was failing with "systemd-boot was not found at usr/lib/systemd/boot/efi" because Bootable=yes expects systemd-boot installed inside the image for EFI boot. This image is only ever PXE-booted — iPXE loads vmlinuz+initrd from TFTP directly, so the rootfs itself needs no bootloader. Switching to Bootable=no drops the EFI-image assembly step; the linux-image-amd64 postinst still creates /vmlinuz and /initrd.img symlinks that the top-level Makefile copies into the bundle. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 10:18:49 -04:00
josh	a893b0d817	live-image: stage agent binary via mkosi.extra CI / Lint + build + test (push) Successful in 1m33s Details Release / release (push) Failing after 1m43s Details mkosi only mounts live-image/ as /work/src, so the postinst couldn't reach the repo-root bin/vetting-agent.linux-amd64 — the build failed in CI with `install: cannot stat '/work/src/bin/vetting-agent.linux-amd64'`. The Makefile now copies the prebuilt agent into mkosi.extra/, which mkosi merges into the image root automatically. The postinst is reduced to creating the multi-user.target.wants symlink. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 03:13:38 -04:00
josh	d6cdb7caa9	ci: install kmod for mkosi depmod CI / Lint + build + test (push) Successful in 1m35s Details Release / release (push) Failing after 1m38s Details After installing the kernel package into the live image, mkosi runs depmod on the host against the image's module tree. depmod ships in the kmod package, which isn't in the runner container by default.	2026-04-18 03:05:55 -04:00
josh	e6aa57e839	ci: install systemd-boot for mkosi bootctl CI / Lint + build + test (push) Successful in 1m38s Details Release / release (push) Failing after 1m31s Details mkosi Bootable=yes shells out to bootctl kernel-identify on the host, which ships in the systemd-boot package on Ubuntu (not in systemd itself). Without it, the live-image build fails at the end with "bootctl: not found" after successfully installing all packages.	2026-04-18 03:01:30 -04:00
josh	3dc0ca0bc2	ci: install debian-archive-keyring for mkosi bootstrap CI / Lint + build + test (push) Successful in 1m34s Details Release / release (push) Failing after 1m29s Details mkosi's apt-get (inside the mkosi workspace) couldn't verify Debian's InRelease signatures because the act_runner's Ubuntu base image ships Ubuntu's keyring, not Debian's. Adding `debian-archive-keyring` to the apt install list exposes /usr/share/keyrings/debian-archive-keyring.gpg which debootstrap and apt need for the bookworm repos. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 02:54:04 -04:00
josh	a427640608	ci: install systemd-ukify so mkosi's Bootable=yes step succeeds CI / Lint + build + test (push) Successful in 1m35s Details Release / release (push) Failing after 1m1s Details mkosi refused with "Could not find 'ukify'". The live image's mkosi.conf sets Bootable=yes, and mkosi invokes ukify to package the Unified Kernel Image alongside vmlinuz+initrd.img. On Debian/Ubuntu, ukify ships in the `systemd-ukify` apt package (not in `systemd`). Added to both release.yml and e2e.yml's live-image dep lists. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 02:50:39 -04:00
josh	4dda1dad83	live-image: mark mkosi.postinst executable in git index CI / Lint + build + test (push) Successful in 1m38s Details Release / release (push) Failing after 1m4s Details mkosi refuses to run a non-executable postinst. git was tracking it as 100644 because it was added from Windows (no POSIX exec bit on the FS), so CI saw a non-executable file even though WSL/Linux had been treating it fine locally. Same fix applied earlier to install.sh + pxe-setup.sh. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 02:41:40 -04:00
josh	180a5212a4	deps: add missing go.sum entry for golang.org/x/term v0.25.0 CI / Lint + build + test (push) Successful in 1m37s Details Release / release (push) Failing after 1m4s Details CI's `go mod tidy` check caught the drift. The module was used transitively but never recorded. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 02:38:13 -04:00
josh	74c09e9596	ci: disable setup-go cache to skip 4m Gitea cache server timeout CI / Lint + build + test (push) Failing after 32s Details Release / release (push) Has been cancelled Details The action tries to restore from 172.18.0.2:36061 (Gitea's cache server), times out, falls through to a fresh download anyway. Pure waste since the runner already has the toolchain in /opt/hostedtoolcache. Turn cache off. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 02:37:16 -04:00
josh	869cd78d0b	ci: quote e2e.yml input description so Gitea's YAML parser accepts it CI / Lint + build + test (push) Has been cancelled Details Release / release (push) Has been cancelled Details Unquoted `(default: main)` trips Gitea Actions' strict YAML parser with "mapping values are not allowed in this context" because the inline colon reads as a nested mapping. GitHub Actions' parser was lenient about this; Gitea's isn't. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 02:34:35 -04:00
josh	03dcf33686	ci: switch runs-on to ubuntu-latest to match runner label [CI / Lint + build + test (push): Failing after 8m44s; Release / release (push): Has been cancelled] The self-hosted Gitea runner advertises itself as `ubuntu-latest`, not `self-hosted`, so the jobs were never getting picked up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 02:25:22 -04:00
josh	f188c7add4	proxmox-install: fetch prebuilt bundle from Gitea package registry [CI / Lint + build + test (push): Has been cancelled; Release / release (push): Has been cancelled] Drops the per-install Go toolchain dance + source build. The installer now just curls the bundle from ${REGISTRY_URL}/api/packages/${PACKAGE_OWNER}/generic/vetting/${VETTING_VERSION}/vetting-bundle.tar.gz, extracts it, and hands off to the bundled install.sh with explicit --binary / --agent-binary paths so the in-bundle layout is picked up. Default version is `latest` (rolling alias, overwritten by release.yml on each push to main). Pin via `VETTING_VERSION=sha-abc1234 curl ... | bash` when rolling back or testing a specific commit. Removes the `apt install build-essential git` + Go toolchain download + templ install + `make orchestrator-linux agent-linux` path — the CI workflow already produced all of that. Install time on a cold LXC drops from minutes to under a minute, and live-image kernel/initrd now arrive with every install instead of requiring a separate WSL build. Also rewrites docs/operations.md's install section around the one-liner, keeps the `make release` + scp path as the offline fallback, and swaps the upgrade section to just "rerun the one-liner." Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 02:16:02 -04:00
josh	609ad2e383	ci: migrate to Gitea Actions + publish release bundle to package registry [CI / Lint + build + test (push): Has been cancelled; Release / release (push): Has been cancelled] Adds `.gitea/workflows/{ci,e2e,release}.yml` and removes the old `.github/workflows/` counterparts. Gitea reads both paths, so keeping them would double-run every job on every push. - ci.yml / e2e.yml are 1:1 ports of the GitHub versions, just with `runs-on: self-hosted` (Gitea has no hosted runners). - release.yml is new: fires on push to main, runs `make release`, then publishes `vetting-bundle.tar.gz` to the Gitea generic package registry under two versions — `sha-<short-sha>` (immutable, pinnable) and `latest` (rolling alias, DELETE+PUT on each run). Auth via a REGISTRY_TOKEN secret + REGISTRY_URL variable configured on the Gitea side. The runner is being reconfigured to privileged so `mkosi` + `debootstrap` can build the live image inside CI. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 02:14:08 -04:00
josh	05bd88b016	pxe-setup: handle quoted defaults whose comments contain quotes [CI / Lint + build + test (push): Failing after 5m14s] The production yaml ships `interface: "" # e.g. "eth0"`. The old extractor did `gsub(/^"|"$/, "")` which only strips outer quotes, so with an inline comment containing quotes it produced garbage like `" # e.g. "eth0`, tripping the idempotency check. Replaces the two inline extractors with one `extract_yaml_value` helper that first tries to match `"[^"]*"` (grabbing only the first quoted value), falling back to strip-trailing-comment + trim for unquoted values. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 01:52:31 -04:00
josh	6ce95547f4	deploy: mark install.sh + pxe-setup.sh executable in git index [CI / Lint + build + test (push): Failing after 5m13s] Git on Windows dropped the exec bit when the files were first committed, so `sudo ./pxe-setup.sh` on the LXC errored with "command not found". Fix via `git update-index --chmod=+x`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 01:43:02 -04:00
josh	a5055b3c7a	Automate PXE setup: release bundle + pxe-setup.sh + startup validation [CI / Lint + build + test (push): Has been cancelled] Collapses the LXC side of PXE enablement from a six-step manual dance (build, fetch iPXE, scp, bridge, hand-edit yaml) into: `make release` (dev box, Linux/WSL); `scp bundle.tar.gz lxc:/tmp/`; `sudo ./install.sh` (base install, unchanged); `sudo ./pxe-setup.sh --interface ... --dhcp-range ... --orchestrator-url ...`. pxe-setup.sh fetches iPXE from boot.ipxe.org, verifies against pinned SHA256s in deploy/ipxe-shas.txt (fail-closed), places vmlinuz/initrd.img from the bundle, and rewrites only the pxe: block of vetting.yaml. Idempotent; --force gates overwriting a hand-edited block. Adds Supervisor.Validate() — called before dnsmasq spawn — so typo'd configs fail at orchestrator startup with clear errors naming the missing file or yaml key, instead of silently serving broken TFTP until a real host tries to PXE-boot. Nine tests cover missing files, bogus interface, malformed dhcp_range, bad orchestrator_url, and aggregate reporting. Hypervisor bridge creation stays documented (LXC can't do it) but everything downstream of the bridge is now scripted. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 01:38:43 -04:00
josh	d245fa6235	quick.sh: stage+install agent to avoid ETXTBSY, restart service [CI / Lint + build + test (push): Failing after 5m17s] Re-running quick.sh on a host where vetting-reporter was already running failed with curl error 23 because curl can't overwrite a busy executable. Download to a staging path, then use `install(1)` which unlinks the target before writing. Swap `enable --now` for `enable` + `restart` so the service picks up the new binary. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 01:13:14 -04:00
josh	d0bfae14c8	Heartbeat-first dispatch: retire WoL-as-default, add WaitingReboot [CI / Lint + build + test (push): Has been cancelled] Every supported host runs vetting-reporter in-OS and heartbeats every 30s. WoL was never the thing that started vetting — the heartbeat response's reboot_for_vetting command was. Firing WoL first only crowded the run log with misleading diagnostics when the real failure mode is "reporter isn't installed." - StartRun 409s if the host hasn't heartbeated within 60s, pointing the operator at /register/quick.sh. - Dispatcher re-checks LastSeenAt at dispatch time (run may sit in Queued long enough for the host to go offline); stale hosts mark the run Failed with failed_stage=dispatch instead of looping. - New StateWaitingReboot + TriggerRebootCommanded capture the actual semantics. StateWaitingWoL kept as the hook point for a future manual-override button. - Tile disables the Start button with a quick.sh tooltip when the host is offline, matching the server-side 409. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 01:10:34 -04:00
josh	c9927ca2bf	config: default agent.asset_dir so old configs still serve /assets [CI / Lint + build + test (push): Failing after 5m12s] Operators who installed vetting before agent.asset_dir existed keep their config preserved by install.sh on upgrade, which left them with AssetDir="" — the router silently dropped the /assets/* mount and the quick-register one-liner hit 404 fetching the agent binary. Default AssetDir alongside the database file so the same directory install.sh already creates + drops the agent binary into is picked up automatically. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 00:41:09 -04:00
josh	1694c20b12	Host detail v2: full pipeline + per-stage logs + WoL diagnostics [CI / Lint + build + test (push): Has been cancelled] Pipeline now always renders all 13 nodes (3 pre-stage + 9 stage + Completed), synthesising ghosts from run state when stage rows aren't seeded yet. Makes a WaitingWoL host show the full timeline ahead of it instead of just 4 dots. Agent tags each log line with its stage; logs.Hub fans out to both log-{runID} and log-{runID}-{stage} SSE events so the detail page can show per-stage tabs with a pure-CSS radio-sibling switch. Flat run log prepends [stage] so grep still works. Dispatcher writes picked/sent-WoL/heartbeat lines into the per-run log — the operator opens the detail page, sees WaitingWoL stuck, and reads exactly what the dispatcher did and why nothing's progressing, instead of having to tail journalctl on the LXC. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 00:38:27 -04:00
josh	a3d5e2d0a4	proxmox-install: build agent binary for serving [CI / Lint + build + test (push): Failing after 5m22s] The agent binary is never run on the LXC, but it has to be present so /assets/vetting-agent-linux-amd64 can serve it to target hosts via the quick-register one-liner. Install was failing because only orchestrator-linux was being built. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 00:12:41 -04:00
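
Commit 05bd88b016 above describes an `extract_yaml_value` helper that tries a quoted match first and falls back to stripping the trailing comment. A minimal sketch of that matching order in POSIX shell might look like the following. The function body is an assumption reconstructed from the commit message, not the code shipped in pxe-setup.sh, and the sample `dhcp_range` value is hypothetical.

```shell
# Sketch only: an assumed re-creation of the helper named in commit 05bd88b016.
# Takes one "key: value" yaml line as $1 and prints the value.
extract_yaml_value() {
  # Drop everything up to and including the first colon, then leading blanks.
  # (rest is a plain global here; POSIX sh has no "local".)
  rest=$(printf '%s\n' "$1" | sed -e 's/^[^:]*:[[:space:]]*//')
  case $rest in
    '"'*'"'*)
      # Quoted value: keep only the text between the first pair of quotes,
      # so quotes inside an inline comment are never considered.
      printf '%s\n' "$rest" | sed -e 's/^"\([^"]*\)".*$/\1/'
      ;;
    *)
      # Unquoted value: strip a trailing comment, then trailing blanks.
      printf '%s\n' "$rest" | sed -e 's/[[:space:]]*#.*$//' -e 's/[[:space:]]*$//'
      ;;
  esac
}

extract_yaml_value 'interface: "" # e.g. "eth0"'              # prints an empty line
extract_yaml_value 'dhcp_range: 10.0.0.10,10.0.0.50 # pool'   # prints 10.0.0.10,10.0.0.50
```

Matching the first quoted run before looking for `#` is what keeps the comment's own quotes out of the result; the old outer-quote strip had no way to tell the two apart.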

61 Commits