Previously tftp_root defaulted to logs.dir/../tftp and the pxe
runtime dir to logs.dir/../pxe. On a production install that
resolves to /var/log/tftp and /var/log/pxe, both outside the
systemd unit's ReadWritePaths=/var/lib/vetting /var/log/vetting
sandbox. The service crash-looped with "mkdir /var/log/pxe:
read-only file system" as soon as PXE was enabled.
Switch the anchor to filepath.Dir(cfg.Artifacts.Dir) — typically
/var/lib/vetting — which the sandbox already allows.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The linux-image-amd64 postinst creates /vmlinuz but the paired
/initrd.img symlink only shows up via an initramfs-tools hook that
doesn't fire when we call update-initramfs ourselves. Without it,
the top-level Makefile's `cp live-image/build/initrd.img` fails and
`make release` aborts with a broken bundle.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two bugs chained together to ship a broken bundle:
1. With Bootable=no, mkosi skips update-initramfs, so no
/boot/initrd.img-<kver> ever gets generated inside the rootfs.
The postinst now runs update-initramfs via chroot to produce it.
2. The `make release` recipe chained its `cp` calls with `;`, so
a missing live-image/build/initrd.img silently failed and the
bundle still got tarred + uploaded. Adding `set -e` at the top
of the recipe makes any missing component fail the build loudly
instead of shipping a half-bundle.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Single-command upgrades were leaving /var/lib/vetting/live/ stale on
PXE-enabled LXCs because install.sh explicitly punted live-image
staging to pxe-setup.sh. That was right when make-release ran on a
dev box, but the new registry-pull flow ships vmlinuz+initrd.img
inside the bundle — they should land in place during every install.
install.sh now:
- auto-detects live-image/{vmlinuz,initrd.img} (release bundle
layout) or ../live-image/build/ (repo dev checkout) and stages
them into --live-dir (default /var/lib/vetting/live).
- restarts vetting.service when already enabled, so the
curl | sudo bash one-liner is the full upgrade loop. First-
install path still leaves the service stopped for config edits.
pxe-setup.sh's own live-image copy is now redundant on upgrade but
still runs for first-time PXE setup (it also writes the pxe: block
of vetting.yaml, which install.sh has no business touching).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mkosi was failing with "systemd-boot was not found at
usr/lib/systemd/boot/efi" because Bootable=yes expects systemd-boot
installed *inside* the image for EFI boot. This image is only ever
PXE-booted — iPXE loads vmlinuz+initrd from TFTP directly, so the
rootfs itself needs no bootloader.
Switching to Bootable=no drops the EFI-image assembly step; the
linux-image-amd64 postinst still creates /vmlinuz and /initrd.img
symlinks that the top-level Makefile copies into the bundle.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mkosi only mounts live-image/ as /work/src, so the postinst couldn't
reach the repo-root bin/vetting-agent.linux-amd64 — the build failed
in CI with `install: cannot stat '/work/src/bin/vetting-agent.linux-amd64'`.
The Makefile now copies the prebuilt agent into mkosi.extra/, which
mkosi merges into the image root automatically. The postinst is
reduced to creating the multi-user.target.wants symlink.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After installing the kernel package into the live image, mkosi runs
depmod on the host against the image's module tree. depmod ships in
the kmod package, which isn't in the runner container by default.
mkosi Bootable=yes shells out to bootctl kernel-identify on the host,
which ships in the systemd-boot package on Ubuntu (not in systemd
itself). Without it, the live-image build fails at the end with
"bootctl: not found" after successfully installing all packages.
mkosi's apt-get (inside the mkosi workspace) couldn't verify Debian's
InRelease signatures because the act_runner's Ubuntu base image ships
Ubuntu's keyring, not Debian's. Adding `debian-archive-keyring` to the
apt install list exposes /usr/share/keyrings/debian-archive-keyring.gpg
which debootstrap and apt need for the bookworm repos.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mkosi refused with "Could not find 'ukify'". The live image's mkosi.conf
sets Bootable=yes, and mkosi invokes ukify to package the Unified
Kernel Image alongside vmlinuz+initrd.img. On Debian/Ubuntu, ukify
ships in the `systemd-ukify` apt package (not in `systemd`).
Added to both release.yml and e2e.yml's live-image dep lists.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mkosi refuses to run a non-executable postinst. git was tracking it
as 100644 because it was added from Windows (no POSIX exec bit on the
FS), so CI saw a non-executable file even though WSL/Linux had been
treating it fine locally. Same fix applied earlier to install.sh +
pxe-setup.sh.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI's `go mod tidy` check caught the drift. The module was used
transitively but never recorded.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The action tries to restore from 172.18.0.2:36061 (Gitea's cache
server), times out, falls through to a fresh download anyway. Pure
waste since the runner already has the toolchain in
/opt/hostedtoolcache. Turn cache off.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Unquoted `(default: main)` trips Gitea Actions' strict YAML parser
with "mapping values are not allowed in this context" because the
inline colon reads as a nested mapping. GitHub Actions' parser was
lenient about this; Gitea's isn't.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The self-hosted Gitea runner advertises itself as `ubuntu-latest`,
not `self-hosted`, so the jobs were never getting picked up.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drops the per-install Go toolchain dance + source build. The installer
now just curls the bundle from
${REGISTRY_URL}/api/packages/${PACKAGE_OWNER}/generic/vetting/${VETTING_VERSION}/vetting-bundle.tar.gz,
extracts it, and hands off to the bundled install.sh with explicit
--binary / --agent-binary paths so the in-bundle layout is picked up.
Default version is `latest` (rolling alias, overwritten by release.yml
on each push to main). Pin via `VETTING_VERSION=sha-abc1234 curl ... |
bash` when rolling back or testing a specific commit.
Removes the `apt install build-essential git` + Go toolchain download
+ templ install + `make orchestrator-linux agent-linux` path — the CI
workflow already produced all of that. Install time on a cold LXC
drops from minutes to under a minute, and live-image kernel/initrd
now arrive with every install instead of requiring a separate WSL
build.
Also rewrites docs/operations.md's install section around the
one-liner, keeps the `make release` + scp path as the offline
fallback, and swaps the upgrade section to just "rerun the one-liner."
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds `.gitea/workflows/{ci,e2e,release}.yml` and removes the old
`.github/workflows/` counterparts. Gitea reads both paths, so keeping
them would double-run every job on every push.
- ci.yml / e2e.yml are 1:1 ports of the GitHub versions, just with
`runs-on: self-hosted` (Gitea has no hosted runners).
- release.yml is new: fires on push to main, runs `make release`, then
publishes `vetting-bundle.tar.gz` to the Gitea generic package
registry under two versions — `sha-<short-sha>` (immutable, pinnable)
and `latest` (rolling alias, DELETE+PUT on each run). Auth via a
REGISTRY_TOKEN secret + REGISTRY_URL variable configured on the Gitea
side.
The runner is being reconfigured to privileged so `mkosi` + `debootstrap`
can build the live image inside CI.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The production yaml ships `interface: "" # e.g. "eth0"`.
The old extractor did `gsub(/^"|"$/, "")` which only strips outer quotes, so
with an inline comment containing quotes it produced garbage like
`" # e.g. "eth0`, tripping the idempotency check.
Replaces the two inline extractors with one `extract_yaml_value` helper
that first tries to match `"[^"]*"` (grabbing only the first quoted
value), falling back to strip-trailing-comment + trim for unquoted
values.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Git on Windows dropped the exec bit when the files were first committed,
so `sudo ./pxe-setup.sh` on the LXC errored with "command not found".
Fix via `git update-index --chmod=+x`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Collapses the LXC side of PXE enablement from a six-step manual dance
(build, fetch iPXE, scp, bridge, hand-edit yaml) into:
make release # dev box (Linux/WSL)
scp bundle.tar.gz lxc:/tmp/
sudo ./install.sh # base install, unchanged
sudo ./pxe-setup.sh --interface ... --dhcp-range ... --orchestrator-url ...
pxe-setup.sh fetches iPXE from boot.ipxe.org, verifies against pinned
SHA256s in deploy/ipxe-shas.txt (fail-closed), places vmlinuz/initrd.img
from the bundle, and rewrites only the pxe: block of vetting.yaml.
Idempotent; --force gates overwriting a hand-edited block.
Adds Supervisor.Validate() — called before dnsmasq spawn — so typo'd
configs fail at orchestrator startup with clear errors naming the
missing file or yaml key, instead of silently serving broken TFTP
until a real host tries to PXE-boot. Nine tests cover missing files,
bogus interface, malformed dhcp_range, bad orchestrator_url, and
aggregate reporting.
Hypervisor bridge creation stays documented (LXC can't do it) but
everything downstream of the bridge is now scripted.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Re-running quick.sh on a host where vetting-reporter was already
running failed with curl error 23 because curl can't overwrite a
busy executable. Download to a staging path, then use `install(1)`
which unlinks the target before writing. Swap `enable --now` for
`enable` + `restart` so the service picks up the new binary.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every supported host runs vetting-reporter in-OS and heartbeats every
30s. WoL was never the thing that started vetting — the heartbeat
response's reboot_for_vetting command was. Firing WoL first only
crowded the run log with misleading diagnostics when the real failure
mode is "reporter isn't installed."
- StartRun 409s if the host hasn't heartbeated within 60s, pointing
the operator at /register/quick.sh.
- Dispatcher re-checks LastSeenAt at dispatch time (run may sit in
Queued long enough for the host to go offline); stale hosts mark
the run Failed with failed_stage=dispatch instead of looping.
- New StateWaitingReboot + TriggerRebootCommanded capture the actual
semantics. StateWaitingWoL kept as the hook point for a future
manual-override button.
- Tile disables the Start button with a quick.sh tooltip when the
host is offline, matching the server-side 409.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Operators who installed vetting before agent.asset_dir existed keep
their config preserved by install.sh on upgrade, which left them
with AssetDir="" — the router silently dropped the /assets/*
mount and the quick-register one-liner hit 404 fetching the agent
binary. Default AssetDir alongside the database file so the same
directory install.sh already creates + drops the agent binary into
is picked up automatically.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pipeline now always renders all 13 nodes (3 pre-stage + 9 stage +
Completed), synthesising ghosts from run state when stage rows
aren't seeded yet. Makes a WaitingWoL host show the full timeline
ahead of it instead of just 4 dots.
Agent tags each log line with its stage; logs.Hub fans out to both
log-{runID} and log-{runID}-{stage} SSE events so the detail page
can show per-stage tabs with a pure-CSS radio-sibling switch. Flat
run log prepends [stage] so grep still works.
Dispatcher writes picked/sent-WoL/heartbeat lines into the per-run
log — the operator opens the detail page, sees WaitingWoL stuck,
and reads exactly what the dispatcher did and why nothing's
progressing, instead of having to tail journalctl on the LXC.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The agent binary is never run on the LXC, but it has to be present
so /assets/vetting-agent-linux-amd64 can serve it to target hosts
via the quick-register one-liner. Install was failing because only
orchestrator-linux was being built.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Regenerated _templ.go files embed the templ source path at runtime,
which differs between the dev machine and /opt/vetting-src on the
target. That left tracked files modified after every build, and the
next upgrade-run hit "local changes would be overwritten by checkout"
and aborted. /opt/vetting-src is script-managed, so `git reset --hard
origin/<branch>` is the right semantics.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Click a tile to open /hosts/{id} — the canonical control surface per
host. Timeline renders every pre-stage, stage, and terminal node in
order, with the current one pulsing, failed ones flagged, and
downstream ones dimmed as skipped. Detail page shows summary, hold
card (when holding), all action buttons, spec diffs, a full-height
log pane, and a collapsed expected-spec YAML.
Tile slims to name, last-seen, status, and one primary action; a
CSS-overlay <a> makes the whole card clickable while buttons stay
receptive via z-index.
Runner.publishTileUpdate now also emits pipeline-{runID} fragments,
and CompleteStage wraps Stages.CompleteByName so stage completions
advance the timeline live — without this the dots only moved on
state transitions.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the operator clicks Start vetting and the host is heartbeating,
the heartbeat response now carries cmd=reboot_for_vetting + run_id.
The handler drives the Queued → WaitingWoL transition via the existing
state machine, so a benign race with the 2s dispatcher poll is refused
by the state machine (not double-dispatched). WaitingWoL retries for
10 minutes to cover a crashed-mid-reboot case, then falls back to
operator action.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
vetting-agent gains a `host` subcommand that runs as a systemd service
installed by the quick-register one-liner, POSTing every 30s to
/api/v1/hosts/{mac}/heartbeat so the dashboard tile shows "online" or
"Nm ago" without waiting on WoL. Ships dormant client code for the
Phase 2 reboot_for_vetting command so the server can flip it on later
without a binary redeploy.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two bugs compounded on Proxmox hosts: primary_iface walked
`ip link show` and picked the physical NIC (e.g. enp1s0), which has
no IPv4 on Proxmox because the address lives on vmbr0. Even if vmbr0
had been picked, the kernel reports its broadcast as 0.0.0.0, so the
script fell all the way back to 255.255.255.255.
Now we prefer the default-route interface (vmbr0 on Proxmox, eno1 on
bare metal) and, when `ip` doesn't surface a usable `brd`, compute
the broadcast from the inet CIDR instead of giving up.
Operator pastes `curl -fsSL $ORCH/register/quick.sh | sudo bash` on the
target host (pre-wipe). The script probes MAC + CPU/RAM/disks/NICs/GPUs,
emits an expected-spec YAML, and POSTs to a new LAN-trusted JSON
endpoint /api/v1/hosts. The register page shows the command prefilled
with the orchestrator URL; the manual form moves into a collapsible
"Register manually" disclosure.
Can't log in from a fresh LXC deploy, and the service is LAN-only by
design. Rip out the whole bcrypt-password / signed-cookie session
layer: internal/auth, login templates, gen-admin-password binary +
Makefile targets, auth config block, login/logout routes and the
RequireSession middleware wrap. Agent bearer-token auth on
/api/v1/runs/{id}/* is untouched.
Operators who want a password can front the service with a reverse
proxy — noted in README and docs/operations.md.
Service was crashing on every boot because vetting.example.yaml uses
./var/... relative paths that resolve to / under ProtectSystem=strict.
Ship a separate vetting.production.yaml with absolute /var/lib/vetting
+ /var/log/vetting paths that match the unit's ReadWritePaths, and
have install.sh copy that one. Also move StartLimit* keys into [Unit]
to silence the 'Unknown key' warning on modern systemd.
proxmox-install.sh + install.sh left operators with no way to
generate the bcrypt hash on the LXC — 'vetting gen-admin-password'
was suggested in the post-install message but the binary has no
subcommands. Cross-build gen-admin-password-linux-amd64 during the
one-liner flow and drop it into /usr/local/bin.
deploy/proxmox-install.sh bootstraps a fresh LXC end-to-end: apt
prereqs, Go toolchain (if missing), git clone, build, then hands off
to deploy/install.sh. README documents the curl|bash invocation.