Commit Graph

55 Commits

Author SHA1 Message Date
josh 026923075c pxe: disable systemd-firstboot so the live image doesn't prompt
CI / Lint + build + test (push) Successful in 1m22s
Release / release (push) Has been cancelled
systemd-firstboot.service is an interactive wizard that asks for
locale, timezone, and root password when /etc/machine-id isn't
populated — i.e. every PXE boot of a mkosi-built image. It sits on
sysinit.target waiting for input that will never arrive, blocking
the agent service and every other downstream unit indefinitely.

systemd.firstboot=off on the kernel cmdline is the documented kill
switch; no image-side changes needed.
2026-04-18 15:35:24 -04:00
josh 956120b80e deploy: show speed + ETA in bundle-download progress meter
CI / Lint + build + test (push) Successful in 1m24s
Release / release (push) Successful in 5m30s
Drop --progress-bar (curl's minimal hash meter) in favor of the default
progress output, which includes transfer rate and time remaining.
Bundles grew from ~30 MB to ~300 MB with the full-rootfs initrd, and
a percentage-only bar with no speed hint makes a slow registry look
indistinguishable from a hang.
2026-04-18 15:04:26 -04:00
josh c45349f62c pxe: mask serial-getty@ttyS0 so hosts without serial don't wait 90s
CI / Lint + build + test (push) Successful in 1m47s
Release / release (push) Successful in 5m16s
systemd-getty-generator reads console=ttyS0 off the kernel cmdline and
auto-creates serial-getty@ttyS0.service, which BindsTo dev-ttyS0.device.
On hardware without a physical serial port the device node never shows
up, systemd waits its full default 90s timeout, and only then proceeds.

systemd.mask= on the kernel cmdline is a first-class option — masks
the unit before the generator's link even gets activated. Kernel
messages still go to ttyS0 if a port is present; we just don't try
to spawn a login prompt there.
2026-04-18 14:47:03 -04:00
josh a88e24bef4 live-image: real /init + verbose boot for first-boot diagnosis
CI / Lint + build + test (push) Successful in 1m23s
Release / release (push) Successful in 4m49s
Host boots past kernel init and then stalls silently. ACPI DSDT error
about TXHC.RHUB.SS01 is benign noise (Tiger Lake firmware bug) — the
actual problem is that nothing between kernel handoff and (maybe)
systemd is visible on the console.

Two changes:

1. Replace the /init → sbin/init symlink with a real shell script
   (live-image/mkosi.extra/init) that mounts /proc /sys /dev /dev/pts
   /dev/shm /run before execing systemd. Systemd has fallback mount
   code for these, but when it fails the failure is silent. Doing it
   explicitly in /init keeps failures visible and avoids the fragile
   symlink-resolution trick.

2. Drop 'quiet' from the kernel cmdline and add loglevel=7 plus
   systemd.log_target=kmsg + journald.forward_to_console=1 so every
   early-boot message reaches both tty0 and ttyS0. Will be dialed
   back once boot is stable.

Also: .gitattributes pins LF on live-image/, .gitea/, Makefile, and
*.sh so Windows checkouts don't break shell scripts and Makefile
recipes with CRLF. /init also gets chmod 0755 in repack-initrd as a
belt-and-braces against mode loss on non-Linux checkouts.
2026-04-18 14:31:40 -04:00
josh 43ea845ac0 live-image: pack full rootfs as initrd so PXE actually boots userspace
CI / Lint + build + test (push) Successful in 1m54s
Release / release (push) Successful in 5m10s
update-initramfs produces a boot stub (~50 MB) that expects to mount a
separate rootfs over squashfs/disk/NFS. Our PXE channel only ships
vmlinuz+initrd.img, so the stub had nothing to pivot to — kernel
finished hand-off and the system wedged with firmware, modules, and
userspace stranded in the 545 MB rootfs dir we never delivered.

Replace with an everything-in-initramfs build: cpio.zst the full
rootfs (minus /boot) as the initrd, add /init -> sbin/init for the
kernel's runtime entrypoint, materialize the kernel symlink into a
real file. Bump check-initrd floor to 200 MB and switch the firmware
grep from unmkinitramfs (boot-stub-specific) to zstd | cpio -t.

Also add cpio to the CI apt deps.
2026-04-18 14:14:08 -04:00
josh 6c6d20710f live-image: fix check-initrd size measurement; add zstd to image
CI / Lint + build + test (push) Successful in 1m28s
Release / release (push) Failing after 4m10s
Previous run actually built the 518 MB rootfs with firmware-misc-nonfree
et al. installed — the real payload is working. Two follow-ups:

- check-initrd was reading stat on a symlink path and getting 30 bytes
  (the symlink's own size), not the 6.1.0-44-amd64 kernel initrd it
  points to. Switched to wc -c, which follows symlinks, and to du -hL
  for the OK message.
- Add zstd to Packages= so COMPRESS=zstd in initramfs.conf can be
  honored; without it update-initramfs falls back to gzip with a
  "No zstd in PATH" warning.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 14:00:07 -04:00
josh 0a5e5d0b39 ci: add bubblewrap dep and bump mkosi to v25.3
CI / Lint + build + test (push) Successful in 1m31s
Release / release (push) Failing after 3m47s
v24.3 crashed in cp_version() during the copy-package-manager-trees
step because its sandbox needs bubblewrap (not present in the runner
apt list), and cp --version returned empty output inside the broken
sandbox. Installing bubblewrap and bumping to v25.3 which has tighter
sandbox fallback behavior.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 13:53:09 -04:00
josh 488a0d1052 ci: install mkosi from upstream git tag, not PyPI
Release / release (push) Failing after 1m54s
CI / Lint + build + test (push) Has been cancelled
Previous commit pinned mkosi==24.3 via pip but mkosi isn't published
on PyPI past ancient versions — the runner hit
"Could not find a version that satisfies the requirement mkosi==24.3".
Install from the upstream git tag v24.3 instead; added git to the apt
dep list for pip's VCS fetch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 13:44:51 -04:00
josh 28918bad15 live-image: fix firmware so i915 actually loads at boot
CI / Lint + build + test (push) Successful in 1m35s
Release / release (push) Failing after 22s
Previous attempt (c962d6d) added firmware-linux-nonfree to mkosi.conf,
but the CI bundle was still 63 MB and Tiger Lake wedged on tgl_guc.
Two reasons: (1) firmware-linux-nonfree on bookworm is a thin
metapackage that doesn't include firmware-misc-nonfree, which is where
i915 GuC/HuC blobs actually live; (2) Ubuntu's apt-packaged mkosi is
old enough that Repositories=non-free-firmware shorthand likely isn't
wired through to the debootstrap invocation, so firmware packages
silently miss the bootstrap step entirely.

Changes:
- Enumerate firmware packages explicitly in mkosi.conf (firmware-
  misc-nonfree, firmware-iwlwifi, firmware-realtek, firmware-amd-
  graphics, firmware-intel-sound, intel/amd64-microcode).
- Ship mkosi.sources.d/debian.sources with explicit deb822 so the
  non-free-firmware component is unambiguously available.
- Install mkosi 24.3 via pip in CI instead of apt's older build.
- Pin MODULES=most and COMPRESS=zstd via a tracked initramfs-tools
  config under mkosi.extra/.
- Narrow .gitignore so only the generated agent binary is ignored,
  not the whole mkosi.extra/ tree.
- New check-initrd Makefile target asserts both size (>=150 MB) and
  actual presence of i915/tgl_guc_*.bin inside the built initrd, so
  a silent firmware-drop regression fails the build loudly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 13:38:40 -04:00
josh c962d6d8ab live-image: bundle nonfree firmware (i915 GuC et al.)
CI / Lint + build + test (push) Successful in 2m19s
Release / release (push) Successful in 3m28s
Tiger Lake and later Intel iGPUs need i915/tgl_guc_*.bin; without
it the i915 init wedges and floods the console. Same story on most
modern wifi/NIC hardware. Pull firmware-linux-nonfree (metapackage
covering misc-nonfree, iwlwifi, realtek, amd-graphics, …) from the
bookworm non-free-firmware repo — single line fix, ~500MB cost to
the squashfs, worth it for booting arbitrary repaired hosts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 13:14:19 -04:00
josh 4524ab8dc0 runs: add non-destructive flag + operator Cancel button
CI / Lint + build + test (push) Successful in 2m5s
Release / release (push) Successful in 3m5s
Non-destructive pre-declares "don't touch the disks" on Start: the
Storage stage skips wipe-probe, badblocks -w, and write-mode fio,
and reports a read-only summary. Runs a new non_destructive column;
threaded through Claim → agent tests.Deps → Storage stage.

Cancel halts an in-flight run. The orchestrator transitions to a
new StateCancelled via TriggerOperatorCancelled (valid from any
active state); the agent's next heartbeat returns cmd=cancel_stage,
which fires a stored CancelFunc on the per-stage context. Stage
subprocesses spawned with exec.CommandContext die with the context,
the agent posts a cancelled outcome, then powers the host off.

Destructive stages mid-run may leave the host in an intermediate
state — the UI confirm dialog warns the operator; recovery is
manual for now.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 13:01:42 -04:00
josh 2c440fce8a pxe: move dhcp-host allowlist into a SIGHUP-reloadable file
CI / Lint + build + test (push) Successful in 1m38s
Release / release (push) Successful in 2m25s
dnsmasq's SIGHUP re-reads /etc/ethers and any --dhcp-hostsfile= paths,
but NOT dhcp-host= lines from the main conf. Reload() was faithfully
rewriting dnsmasq.conf with the new MAC, sending SIGHUP, and then
dnsmasq kept serving its startup view — so a freshly-registered host
still showed up as "proxy-ignored, tags: eth0" with no "known" tag.

Split the allowlist into ${RuntimeDir}/dhcp-hosts, referenced from the
main conf via dhcp-hostsfile=. writeConf() is static-ish now; Reload
just rewrites the hosts file and SIGHUPs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 12:41:27 -04:00
josh bce6e08524 pxe: reload dnsmasq on host create/delete
CI / Lint + build + test (push) Successful in 1m54s
Release / release (push) Successful in 2m36s
pxe.Supervisor.Reload() was defined but never wired up. After a host
was registered in the UI or via the quick-register JSON endpoint, the
dnsmasq conf still held only the hosts that existed at orchestrator
startup. The new MAC wasn't tagged `known`, so when the host PXE'd,
dnsmasq logged "PXE(eth0) <mac> proxy-ignored" and the boot timed out
back to the BIOS.

Add an optional PXEReloader interface to api.UI, wire it from main
when pxe is enabled, and call u.reloadPXE() after successful Create
and Delete. Logs-and-continues on failure — host registration itself
has already committed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 12:31:00 -04:00
josh 157b70f536 pxe: split subnet into network+netmask for dnsmasq proxy-DHCP
CI / Lint + build + test (push) Successful in 2m0s
Release / release (push) Successful in 3m35s
dnsmasq's proxy-DHCP syntax is `dhcp-range=<network-ip>,proxy[,<mask>]`,
not a CIDR. Passing "192.168.1.0/24,proxy" made dnsmasq refuse to start
with "bad dhcp-range at line 12". Parse the CIDR once in writeConf()
and render Network + Netmask as separate template fields.

The config surface (pxe.subnet) stays CIDR because that's the right
shape for humans; we just unpack it before handing to dnsmasq.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 12:17:10 -04:00
josh cf3a75591c install: stage pxe-setup.sh at /usr/local/sbin/vetting-pxe-setup
CI / Lint + build + test (push) Successful in 1m36s
Release / release (push) Successful in 2m29s
proxmox-install.sh tarball-extracts into a tempdir that gets wiped on
EXIT, so after the one-liner there's no pxe-setup.sh on disk for the
operator to run. Have install.sh drop the script + ipxe-shas.txt into
/usr/local/share/vetting/ and symlink it as
/usr/local/sbin/vetting-pxe-setup (in PATH).

pxe-setup.sh now readlink -f's BASH_SOURCE so the symlink resolves to
the share dir where ipxe-shas.txt lives, and gracefully handles the
case where install.sh already staged vmlinuz + initrd.img into
LIVE_DIR (no bundle live-image/ needed at that point).

Update the trailing hint in proxmox-install.sh and the operations
runbook to surface the new `sudo vetting-pxe-setup ...` command.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 12:10:23 -04:00
josh bcbbc35489 docs+e2e: document proxy-DHCP topology; default e2e bridge to LAN
CI / Lint + build + test (push) Successful in 1m37s
Release / release (push) Has been cancelled
Rewrites the PXE section of the ops runbook around the new proxy-DHCP
model (no dedicated bridge, coexists with UniFi/pfSense/etc.) and
swaps the e2e test's default bridge + orchestrator URL to match. The
e2e file now calls out the LAN-DHCP precondition in its header so
future-me (or CI) doesn't hang at PXE wondering why nothing answers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 12:07:05 -04:00
josh 506c856046 pxe: switch dnsmasq to proxy-DHCP mode on the LAN
CI / Lint + build + test (push) Successful in 1m48s
Release / release (push) Successful in 2m22s
Previously the orchestrator ran a full DHCP server on a dedicated
br-vetting bridge (10.77.0.0/24), which required a hypervisor-level
bridge + physical cabling onto that bridge for every repaired host.
Real-world bite: the LXC's br-vetting had no L2 path to the target
host's PXE NIC, so DHCPDISCOVERs never reached eth1 and PXE silently
timed out.

dnsmasq's proxy-DHCP mode is the idiomatic answer: it coexists with
the LAN's existing DHCP server (UniFi, etc.), never assigns an IP
itself, and only supplements the PXE options. No dedicated bridge,
no VLAN, no cabling changes \u2014 dnsmasq binds to the LAN interface
and layers option 66/67 + the PXE BINL on top of the real DHCP
exchange. The MAC allowlist still gates replies, so random LAN
clients booting from network get nothing.

Template switches dhcp-range=<start,end,lease> to
dhcp-range=<cidr>,proxy and replaces dhcp-boot= for first-boot ROM
clients with pxe-service= directives (the correct proxy-mode
chainload form). Validation drops the dhcp_range regex for a
net.ParseCIDR check on pxe.subnet. Config, production/example yaml,
and pxe-setup.sh swap --dhcp-range for --subnet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 12:02:49 -04:00
josh b809bf5f3e proxmox-install: show download progress bar for the bundle fetch
CI / Lint + build + test (push) Successful in 1m37s
Release / release (push) Successful in 2m44s
-fsSL suppresses all output during the ~30 MB download, which
leaves the operator staring at 'fetching bundle...' for up to a
minute on a cold registry. Drop -s and add --progress-bar so there
is a live indicator; keep -fL so we still fail on HTTP errors and
follow redirects. Print the downloaded size alongside 'extracting'
for quick sanity-checking.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 11:43:51 -04:00
josh 6a1d5c3bed pxe: route dnsmasq lease + pid files into RuntimeDir
CI / Lint + build + test (push) Successful in 1m39s
Release / release (push) Successful in 2m24s
Without explicit dhcp-leasefile and pid-file, dnsmasq reaches for
its distro defaults (/var/lib/misc/dnsmasq.leases,
/run/dnsmasq.pid) — both outside the systemd unit's
ReadWritePaths=/var/lib/vetting /var/log/vetting sandbox, causing
'Read-only file system' on every start.

RuntimeDir is already writable by construction (Supervisor.Start
mkdir's it), so writing both files there keeps dnsmasq entirely
inside the sandbox.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 11:31:37 -04:00
josh 9d17859992 orchestrator: anchor pxe+tftp runtime dirs under artifacts parent
CI / Lint + build + test (push) Successful in 1m38s
Release / release (push) Successful in 2m43s
Previously tftp_root defaulted to logs.dir/../tftp and the pxe
runtime dir to logs.dir/../pxe. On a production install that
resolves to /var/log/tftp and /var/log/pxe, both outside the
systemd unit's ReadWritePaths=/var/lib/vetting /var/log/vetting
sandbox. The service crash-looped with "mkdir /var/log/pxe:
read-only file system" as soon as PXE was enabled.

Switch the anchor to filepath.Dir(cfg.Artifacts.Dir) — typically
/var/lib/vetting — which the sandbox already allows.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 11:14:11 -04:00
josh caebd00d8d live-image: symlink /initrd.img to match /vmlinuz
CI / Lint + build + test (push) Successful in 1m47s
Release / release (push) Successful in 2m14s
The linux-image-amd64 postinst creates /vmlinuz but the paired
/initrd.img symlink only shows up via an initramfs-tools hook that
doesn't fire when we call update-initramfs ourselves. Without it,
the top-level Makefile's `cp live-image/build/initrd.img` fails and
`make release` aborts with a broken bundle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 10:54:25 -04:00
josh 41a273b47f live-image: generate initrd explicitly; fail release on missing files
CI / Lint + build + test (push) Successful in 1m47s
Release / release (push) Failing after 2m28s
Two bugs chained together to ship a broken bundle:

1. With Bootable=no, mkosi skips update-initramfs, so no
   /boot/initrd.img-<kver> ever gets generated inside the rootfs.
   The postinst now runs update-initramfs via chroot to produce it.

2. The `make release` recipe chained its `cp` calls with `;`, so
   a missing live-image/build/initrd.img silently failed and the
   bundle still got tarred + uploaded. Adding `set -e` at the top
   of the recipe makes any missing component fail the build loudly
   instead of shipping a half-bundle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 10:47:26 -04:00
josh f927a4a66b install.sh: stage live image and auto-restart on upgrade
CI / Lint + build + test (push) Successful in 1m38s
Release / release (push) Successful in 1m45s
Single-command upgrades were leaving /var/lib/vetting/live/ stale on
PXE-enabled LXCs because install.sh explicitly punted live-image
staging to pxe-setup.sh. That was right when make-release ran on a
dev box, but the new registry-pull flow ships vmlinuz+initrd.img
inside the bundle — they should land in place during every install.

install.sh now:
  - auto-detects live-image/{vmlinuz,initrd.img} (release bundle
    layout) or ../live-image/build/ (repo dev checkout) and stages
    them into --live-dir (default /var/lib/vetting/live).
  - restarts vetting.service when already enabled, so the
    curl | sudo bash one-liner is the full upgrade loop. First-
    install path still leaves the service stopped for config edits.

pxe-setup.sh's own live-image copy is now redundant on upgrade but
still runs for first-time PXE setup (it also writes the pxe: block
of vetting.yaml, which install.sh has no business touching).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 10:38:34 -04:00
josh 5aa245cd85 live-image: disable mkosi Bootable (PXE doesn't need a bootloader)
CI / Lint + build + test (push) Successful in 1m36s
Release / release (push) Successful in 1m56s
mkosi was failing with "systemd-boot was not found at
usr/lib/systemd/boot/efi" because Bootable=yes expects systemd-boot
installed *inside* the image for EFI boot. This image is only ever
PXE-booted — iPXE loads vmlinuz+initrd from TFTP directly, so the
rootfs itself needs no bootloader.

Switching to Bootable=no drops the EFI-image assembly step; the
linux-image-amd64 postinst still creates /vmlinuz and /initrd.img
symlinks that the top-level Makefile copies into the bundle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 10:18:49 -04:00
josh a893b0d817 live-image: stage agent binary via mkosi.extra
CI / Lint + build + test (push) Successful in 1m33s
Release / release (push) Failing after 1m43s
mkosi only mounts live-image/ as /work/src, so the postinst couldn't
reach the repo-root bin/vetting-agent.linux-amd64 — the build failed
in CI with `install: cannot stat '/work/src/bin/vetting-agent.linux-amd64'`.

The Makefile now copies the prebuilt agent into mkosi.extra/, which
mkosi merges into the image root automatically. The postinst is
reduced to creating the multi-user.target.wants symlink.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 03:13:38 -04:00
josh d6cdb7caa9 ci: install kmod for mkosi depmod
CI / Lint + build + test (push) Successful in 1m35s
Release / release (push) Failing after 1m38s
After installing the kernel package into the live image, mkosi runs
depmod on the host against the image's module tree. depmod ships in
the kmod package, which isn't in the runner container by default.
2026-04-18 03:05:55 -04:00
josh e6aa57e839 ci: install systemd-boot for mkosi bootctl
CI / Lint + build + test (push) Successful in 1m38s
Release / release (push) Failing after 1m31s
mkosi Bootable=yes shells out to bootctl kernel-identify on the host,
which ships in the systemd-boot package on Ubuntu (not in systemd
itself). Without it, the live-image build fails at the end with
"bootctl: not found" after successfully installing all packages.
2026-04-18 03:01:30 -04:00
josh 3dc0ca0bc2 ci: install debian-archive-keyring for mkosi bootstrap
CI / Lint + build + test (push) Successful in 1m34s
Release / release (push) Failing after 1m29s
mkosi's apt-get (inside the mkosi workspace) couldn't verify Debian's
InRelease signatures because the act_runner's Ubuntu base image ships
Ubuntu's keyring, not Debian's. Adding `debian-archive-keyring` to the
apt install list exposes /usr/share/keyrings/debian-archive-keyring.gpg
which debootstrap and apt need for the bookworm repos.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 02:54:04 -04:00
josh a427640608 ci: install systemd-ukify so mkosi's Bootable=yes step succeeds
CI / Lint + build + test (push) Successful in 1m35s
Release / release (push) Failing after 1m1s
mkosi refused with "Could not find 'ukify'". The live image's mkosi.conf
sets Bootable=yes, and mkosi invokes ukify to package the Unified
Kernel Image alongside vmlinuz+initrd.img. On Debian/Ubuntu, ukify
ships in the `systemd-ukify` apt package (not in `systemd`).

Added to both release.yml and e2e.yml's live-image dep lists.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 02:50:39 -04:00
josh 4dda1dad83 live-image: mark mkosi.postinst executable in git index
CI / Lint + build + test (push) Successful in 1m38s
Release / release (push) Failing after 1m4s
mkosi refuses to run a non-executable postinst. git was tracking it
as 100644 because it was added from Windows (no POSIX exec bit on the
FS), so CI saw a non-executable file even though WSL/Linux had been
treating it fine locally. Same fix applied earlier to install.sh +
pxe-setup.sh.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 02:41:40 -04:00
josh 180a5212a4 deps: add missing go.sum entry for golang.org/x/term v0.25.0
CI / Lint + build + test (push) Successful in 1m37s
Release / release (push) Failing after 1m4s
CI's `go mod tidy` check caught the drift. The module was used
transitively but never recorded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 02:38:13 -04:00
josh 74c09e9596 ci: disable setup-go cache to skip 4m Gitea cache server timeout
CI / Lint + build + test (push) Failing after 32s
Release / release (push) Has been cancelled
The action tries to restore from 172.18.0.2:36061 (Gitea's cache
server), times out, falls through to a fresh download anyway. Pure
waste since the runner already has the toolchain in
/opt/hostedtoolcache. Turn cache off.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 02:37:16 -04:00
josh 869cd78d0b ci: quote e2e.yml input description so Gitea's YAML parser accepts it
CI / Lint + build + test (push) Has been cancelled
Release / release (push) Has been cancelled
Unquoted `(default: main)` trips Gitea Actions' strict YAML parser
with "mapping values are not allowed in this context" because the
inline colon reads as a nested mapping. GitHub Actions' parser was
lenient about this; Gitea's isn't.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 02:34:35 -04:00
josh 03dcf33686 ci: switch runs-on to ubuntu-latest to match runner label
CI / Lint + build + test (push) Failing after 8m44s
Release / release (push) Has been cancelled
The self-hosted Gitea runner advertises itself as `ubuntu-latest`,
not `self-hosted`, so the jobs were never getting picked up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 02:25:22 -04:00
josh f188c7add4 proxmox-install: fetch prebuilt bundle from Gitea package registry
CI / Lint + build + test (push) Has been cancelled
Release / release (push) Has been cancelled
Drops the per-install Go toolchain dance + source build. The installer
now just curls the bundle from
${REGISTRY_URL}/api/packages/${PACKAGE_OWNER}/generic/vetting/${VETTING_VERSION}/vetting-bundle.tar.gz,
extracts it, and hands off to the bundled install.sh with explicit
--binary / --agent-binary paths so the in-bundle layout is picked up.

Default version is `latest` (rolling alias, overwritten by release.yml
on each push to main). Pin via `VETTING_VERSION=sha-abc1234 curl ... |
bash` when rolling back or testing a specific commit.

Removes the `apt install build-essential git` + Go toolchain download
+ templ install + `make orchestrator-linux agent-linux` path — the CI
workflow already produced all of that. Install time on a cold LXC
drops from minutes to under a minute, and live-image kernel/initrd
now arrive with every install instead of requiring a separate WSL
build.

Also rewrites docs/operations.md's install section around the
one-liner, keeps the `make release` + scp path as the offline
fallback, and swaps the upgrade section to just "rerun the one-liner."

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 02:16:02 -04:00
josh 609ad2e383 ci: migrate to Gitea Actions + publish release bundle to package registry
CI / Lint + build + test (push) Has been cancelled
Release / release (push) Has been cancelled
Adds `.gitea/workflows/{ci,e2e,release}.yml` and removes the old
`.github/workflows/` counterparts. Gitea reads both paths, so keeping
them would double-run every job on every push.

- ci.yml / e2e.yml are 1:1 ports of the GitHub versions, just with
  `runs-on: self-hosted` (Gitea has no hosted runners).
- release.yml is new: fires on push to main, runs `make release`, then
  publishes `vetting-bundle.tar.gz` to the Gitea generic package
  registry under two versions — `sha-<short-sha>` (immutable, pinnable)
  and `latest` (rolling alias, DELETE+PUT on each run). Auth via a
  REGISTRY_TOKEN secret + REGISTRY_URL variable configured on the Gitea
  side.

The runner is being reconfigured to privileged so `mkosi` + `debootstrap`
can build the live image inside CI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 02:14:08 -04:00
josh 05bd88b016 pxe-setup: handle quoted defaults whose comments contain quotes
CI / Lint + build + test (push) Failing after 5m14s
The production yaml ships `interface: ""                          # e.g. "eth0"`.
The old extractor did `gsub(/^"|"$/, "")` which only strips outer quotes, so
with an inline comment containing quotes it produced garbage like
`"                          # e.g. "eth0`, tripping the idempotency check.

Replaces the two inline extractors with one `extract_yaml_value` helper
that first tries to match `"[^"]*"` (grabbing only the first quoted
value), falling back to strip-trailing-comment + trim for unquoted
values.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 01:52:31 -04:00
josh 6ce95547f4 deploy: mark install.sh + pxe-setup.sh executable in git index
CI / Lint + build + test (push) Failing after 5m13s
Git on Windows dropped the exec bit when the files were first committed,
so `sudo ./pxe-setup.sh` on the LXC errored with "command not found".
Fix via `git update-index --chmod=+x`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 01:43:02 -04:00
josh a5055b3c7a Automate PXE setup: release bundle + pxe-setup.sh + startup validation
CI / Lint + build + test (push) Has been cancelled
Collapses the LXC side of PXE enablement from a six-step manual dance
(build, fetch iPXE, scp, bridge, hand-edit yaml) into:

  make release                   # dev box (Linux/WSL)
  scp bundle.tar.gz lxc:/tmp/
  sudo ./install.sh              # base install, unchanged
  sudo ./pxe-setup.sh --interface ... --dhcp-range ... --orchestrator-url ...

pxe-setup.sh fetches iPXE from boot.ipxe.org, verifies against pinned
SHA256s in deploy/ipxe-shas.txt (fail-closed), places vmlinuz/initrd.img
from the bundle, and rewrites only the pxe: block of vetting.yaml.
Idempotent; --force gates overwriting a hand-edited block.

Adds Supervisor.Validate() — called before dnsmasq spawn — so typo'd
configs fail at orchestrator startup with clear errors naming the
missing file or yaml key, instead of silently serving broken TFTP
until a real host tries to PXE-boot. Nine tests cover missing files,
bogus interface, malformed dhcp_range, bad orchestrator_url, and
aggregate reporting.

Hypervisor bridge creation stays documented (LXC can't do it) but
everything downstream of the bridge is now scripted.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 01:38:43 -04:00
josh d245fa6235 quick.sh: stage+install agent to avoid ETXTBSY, restart service
CI / Lint + build + test (push) Failing after 5m17s
Re-running quick.sh on a host where vetting-reporter was already
running failed with curl error 23 because curl can't overwrite a
busy executable. Download to a staging path, then use `install(1)`
which unlinks the target before writing. Swap `enable --now` for
`enable` + `restart` so the service picks up the new binary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 01:13:14 -04:00
josh d0bfae14c8 Heartbeat-first dispatch: retire WoL-as-default, add WaitingReboot
CI / Lint + build + test (push) Has been cancelled
Every supported host runs vetting-reporter in-OS and heartbeats every
30s. WoL was never the thing that started vetting — the heartbeat
response's reboot_for_vetting command was. Firing WoL first only
crowded the run log with misleading diagnostics when the real failure
mode is "reporter isn't installed."

- StartRun 409s if the host hasn't heartbeated within 60s, pointing
  the operator at /register/quick.sh.
- Dispatcher re-checks LastSeenAt at dispatch time (run may sit in
  Queued long enough for the host to go offline); stale hosts mark
  the run Failed with failed_stage=dispatch instead of looping.
- New StateWaitingReboot + TriggerRebootCommanded capture the actual
  semantics. StateWaitingWoL kept as the hook point for a future
  manual-override button.
- Tile disables the Start button with a quick.sh tooltip when the
  host is offline, matching the server-side 409.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 01:10:34 -04:00
josh c9927ca2bf config: default agent.asset_dir so old configs still serve /assets
CI / Lint + build + test (push) Failing after 5m12s
Operators who installed vetting before agent.asset_dir existed keep
their config preserved by install.sh on upgrade, which left them
with AssetDir="" — the router silently dropped the /assets/*
mount and the quick-register one-liner hit 404 fetching the agent
binary. Default AssetDir alongside the database file so the same
directory install.sh already creates + drops the agent binary into
is picked up automatically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 00:41:09 -04:00
josh 1694c20b12 Host detail v2: full pipeline + per-stage logs + WoL diagnostics
CI / Lint + build + test (push) Has been cancelled
Pipeline now always renders all 13 nodes (3 pre-stage + 9 stage +
Completed), synthesising ghosts from run state when stage rows
aren't seeded yet. Makes a WaitingWoL host show the full timeline
ahead of it instead of just 4 dots.

Agent tags each log line with its stage; logs.Hub fans out to both
log-{runID} and log-{runID}-{stage} SSE events so the detail page
can show per-stage tabs with a pure-CSS radio-sibling switch. Flat
run log prepends [stage] so grep still works.

Dispatcher writes picked/sent-WoL/heartbeat lines into the per-run
log — the operator opens the detail page, sees WaitingWoL stuck,
and reads exactly what the dispatcher did and why nothing's
progressing, instead of having to tail journalctl on the LXC.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 00:38:27 -04:00
josh a3d5e2d0a4 proxmox-install: build agent binary for serving
CI / Lint + build + test (push) Failing after 5m22s
The agent binary is never run on the LXC, but it has to be present
so /assets/vetting-agent-linux-amd64 can serve it to target hosts
via the quick-register one-liner. Install was failing because only
orchestrator-linux was being built.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 00:12:41 -04:00
josh af41644929 proxmox-install: reset hard instead of checkout
CI / Lint + build + test (push) Failing after 5m25s
Regenerated _templ.go files embed the templ source path at runtime,
which differs between the dev machine and /opt/vetting-src on the
target. That left tracked files modified after every build, and the
next upgrade-run hit "local changes would be overwritten by checkout"
and aborted. /opt/vetting-src is script-managed, so `git reset --hard
origin/<branch>` is the right semantics.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 00:02:01 -04:00
josh bb658a8435 Host detail page + pipeline timeline
CI / Lint + build + test (push) Has been cancelled
Click a tile to open /hosts/{id} — the canonical control surface per
host. Timeline renders every pre-stage, stage, and terminal node in
order, with the current one pulsing, failed ones flagged, and
downstream ones dimmed as skipped. Detail page shows summary, hold
card (when holding), all action buttons, spec diffs, a full-height
log pane, and a collapsed expected-spec YAML.

Tile slims to name, last-seen, status, and one primary action; a
CSS-overlay <a> makes the whole card clickable while buttons stay
receptive via z-index.

Runner.publishTileUpdate now also emits pipeline-{runID} fragments,
and CompleteStage wraps Stages.CompleteByName so stage completions
advance the timeline live — without this the dots only moved on
state transitions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-17 23:59:43 -04:00
josh 9b16ed80e6 Heartbeat command channel: reboot_for_vetting skips WoL
CI / Lint + build + test (push) Failing after 5m13s
When the operator clicks Start vetting and the host is heartbeating,
the heartbeat response now carries cmd=reboot_for_vetting + run_id.
The handler drives the Queued → WaitingWoL transition via the existing
state machine, so a benign race with the 2s dispatcher poll is refused
by the state machine (not double-dispatched). WaitingWoL retries for
10 minutes to cover a crashed-mid-reboot case, then falls back to
operator action.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-17 23:37:01 -04:00
josh a0c0fb114f Add host-mode heartbeat: vetting-agent host + last-seen badge
CI / Lint + build + test (push) Has been cancelled
vetting-agent gains a `host` subcommand that runs as a systemd service
installed by the quick-register one-liner, POSTing every 30s to
/api/v1/hosts/{mac}/heartbeat so the dashboard tile shows "online" or
"Nm ago" without waiting on WoL. Ships dormant client code for the
Phase 2 reboot_for_vetting command so the server can flip it on later
without a binary redeploy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-17 23:34:15 -04:00
josh d24207427f Fix quick-register broadcast detection on Proxmox bridges
CI / Lint + build + test (push) Failing after 5m17s
Two bugs compounded on Proxmox hosts: primary_iface walked
`ip link show` and picked the physical NIC (e.g. enp1s0), which has
no IPv4 on Proxmox because the address lives on vmbr0. Even if vmbr0
had been picked, the kernel reports its broadcast as 0.0.0.0, so the
script fell all the way back to 255.255.255.255.

Now we prefer the default-route interface (vmbr0 on Proxmox, eno1 on
bare metal) and, when `ip` doesn't surface a usable `brd`, compute
the broadcast from the inet CIDR instead of giving up.
2026-04-17 22:57:49 -04:00
josh 8b3d9a312e Add quick-register one-liner for target-host registration
CI / Lint + build + test (push) Failing after 5m15s
Operator pastes `curl -fsSL $ORCH/register/quick.sh | sudo bash` on the
target host (pre-wipe). The script probes MAC + CPU/RAM/disks/NICs/GPUs,
emits an expected-spec YAML, and POSTs to a new LAN-trusted JSON
endpoint /api/v1/hosts. The register page shows the command prefilled
with the orchestrator URL; the manual form moves into a collapsible
"Register manually" disclosure.
2026-04-17 22:50:54 -04:00