Remove ~126 lines of orphaned CSS from tile slim-down and old detail
layout. Consolidate 4 duplicate duration formatters into shared
elapsed()/fmtElapsed() helpers. Break 160-line Result handler into
focused sub-functions. Implement real Hub.Shutdown() (was a no-op).
Standardize agent error responses to JSON. Replace panic() in router
init with error return. Extract magic numbers as named constants.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Run status, Start/Cancel/View controls, and non-destructive toggle all
live on /hosts/{id} — duplicating them on the dashboard tile clogged
the grid and wouldn't scale past a handful of hosts.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wrap the three install scripts in a shared inline style block (TTY/UTF-8/
NO_COLOR-aware) so the one-liner install looks and feels intentional:
banner on start, timed step lines, braille spinner over silent apt/
systemctl calls with failure log dumps, single-line curl progress bars
with size-prefixed headers, and a summary box at the end with live-image
version + service state + next steps. install.sh defers banner/summary
to proxmox-install.sh when VETTING_INSTALL_WRAPPED is set so the two
scripts compose without duplication.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Echoes OWNER, token length, and whoami before the upload so a 401
disambiguates: missing/empty token, bad OWNER resolution, or token
authenticating as a different user.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Splits the release workflow into three jobs (detect, build-live-image,
bundle) so the ~9 min mkosi build only runs when live-image/VERSION
bumps. The slim bundle (~30 MB: orchestrator + agent + deploy scripts
+ a live-image/VERSION pointer) rebuilds every push; the ~300 MB
vmlinuz+initrd.img are published separately under the immutable
live-image/<version>/ path. install.sh compares the pointer to
/var/lib/vetting/live/VERSION and fetches the files only on mismatch,
cutting repeat-install wall-clock from ~30 s + 300 MB to ~10 s + 0 MB
on the common no-live-image-change release.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Catches the generated file up to the .templ source committed in 599fd15.
No behavior change — the generator just hadn't been re-run.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A stale /etc/vetting/vetting.yaml (e.g. pxe.interface=eth1 after an
LXC rebuild renamed the NIC to eth0) blocks vetting.service startup
with "pxe.interface 'eth1' not found on host", requiring the operator
to ssh in and hand-edit the yaml after every rebuild.
install.sh now validates the pxe block against the host's actual
network state on every install/upgrade run. If pxe.enabled is true and
pxe.interface doesn't exist (or pxe.subnet is missing/malformed), the
script auto-detects the primary NIC via the default route, reads its
subnet from the kernel-scope route, and patches both values in place.
Valid configs are left exactly as the operator had them; fresh
installs with pxe.enabled=false skip the check entirely.
The one-liner install/update is now self-healing for the most common
stale-config failure mode.
Before, a run that failed, held for operator review, and was then
cancelled showed up on the tile and run header as plain "Cancelled"
with an idle-grey mood — indistinguishable from a mid-stage cancel of
a healthy run. That hides the actual failure from the dashboard.
Now: when State=Cancelled with FailedStage still set (the hold-cancel
signature the heartbeat handler already uses to pick reboot vs
cancel_stage), the badge reads "Failed (cancelled)" with a fail-
colored mood. Mid-stage cancels keep reading as plain "Cancelled".
The heartbeat handler was returning cmd=abort for FailedHolding, which
caused the agent's heartbeat goroutine to exit after ~10s in hold.
Subsequent state changes (Cancel -> reboot, Override -> retry_stage)
then had no recipient, so the host sat idle at the SSH hold prompt
forever. Narrowed cmd=abort to StateReleased only; FailedHolding falls
through to cmd=continue so the loop keeps polling and can receive the
operator's eventual command.
A held run sits indefinitely at an SSH prompt waiting for operator
investigation. Previously the only exits were Override (re-enter the
failed stage) or leaving the host on forever — Cancel rejected any
terminal state, including FailedHolding, and there was no button in
the UI anyway.
Add a dedicated exit path:
- statemachine: TriggerOperatorCancelled now accepts FailedHolding
as a valid source, transitioning to Cancelled like any other
live state.
- CancelRun handler: treats FailedHolding as cancellable even
though IsTerminal reports true.
- heartbeat: Cancelled runs fork on FailedStage. Set means the
agent is parked in waitForOverride with no subprocess in
flight, so cmd=reboot tells it to systemctl reboot; the host
falls through iPXE's no-active-run script to the local disk.
Empty FailedStage keeps the pre-existing cmd=cancel_stage path
for mid-stage cancels (kill stage ctx, then power off).
- UI: canCancel now returns true for FailedHolding, and the
run-detail page renders a distinct "Cancel & reboot" button
with a hold-specific confirm message so the action doesn't
look identical to a mid-run cancel.
Tests cover the new statemachine transition, the heartbeat fork
(reboot vs cancel_stage), and keep the pre-existing mid-run cancel
behaviour locked in.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`lspci -D -mm -nn` prefixes every line with the PCI address as a bare
token before the three quoted class/vendor/device fields, so the
device name sits at fields[3] — not fields[2], which is the vendor.
The probe was indexing [2] and recording every GPU's model as its
vendor string ("Intel Corporation" instead of "Alder Lake-N [UHD
Graphics]"), which made every SpecValidate mismatch on real hosts
once the expected spec named the device.
Extract the per-line parse into parseLspciMMLine, handle both the
modern -D layout (addr + class/vendor/device) and the legacy
layout without an address prefix (class/vendor/device), and cover
both paths plus the non-GPU-class skip in inventory_test.go.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
15 nodes (3 pre-stage + 11 stage + Completed) exceeded the 1280px main
container's usable width, producing a horizontal scrollbar under the
pipeline on the run page. Widen main to 1440px, tighten per-node min
widths, drop the scrollbar, and split camelCase labels so multi-word
stages ("WaitingReboot", "SpecValidate", "CPUStress") wrap onto two
lines instead of forcing node width.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Completed runs now reboot the host and fall through iPXE to the next
boot device (local disk) instead of powering off. Three coordinated
changes:
- pxe/ipxe: NoActiveRunScript exits iPXE (drops to next boot entry)
instead of `sleep 10; poweroff`. Without this, a Completed reboot
just loops through PXE and gets told to poweroff.
- api/agent_handlers: heartbeat returns cmd=reboot (was cmd=shutdown)
when the run reaches Completed.
- agent/runner: runs `systemctl reboot` (with `shutdown -r now`
fallback) in response to cmd=reboot.
Operator cancel still powers off — powerOffAndReturn is unchanged
because a cancel means the operator wants the host idle so they can
walk up to it, not back in rotation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extend Inventory stage from a one-liner summary to a per-probe substep
emitter with ~20-30 narrative log lines per run.
- spec: per-DIMM memory (slot/size/speed/manufacturer/part_number),
richer CPU (vendor/stepping/physical_cores/flags), disk
model/transport/rotational, NIC driver/pci_addr, GPU vram/pci/driver,
new System/Baseboard/PSU/OS top-level sections. All fields omitempty
so existing expected-spec YAML and artifacts stay compatible.
- spec.Diff: new diffDIMMs/diffSystem/diffBaseboard/diffPSU/diffOS
helpers; extended diffDisks/diffNICs/diffGPUs for new fields. GPU
diff gains PCIAddr-pinned matching alongside count-by-model.
- agent/probes/inventory: CPU (/proc/cpuinfo extended), Memory
(dmidecode -t 17 multi-block), Disks (+model/transport/rotational),
NICs (+driver/pci from sysfs), GPUs (VRAM from lspci -vv),
new System/Baseboard (dmidecode -t system/baseboard), PSU
(dmidecode -t 39), OS (/proc/sys/kernel/osrelease + /etc/os-release).
All probes accept a Logger and emit per-finding info/warn lines.
- agent/probes/firmware: parseDmidecodeAllSections for multi-block
fixtures (memory / PSU).
- agent/runner: Inventory case becomes 9 substep rows (CPU / Memory /
Disks / NICs / GPUs / System / Baseboard / PSU / OS) with per-probe
start/complete timestamps.
- report: new Inventory HTML section between Stages and Firmware;
resolveReporting loads the inventory.json artifact.
- agent/tests/fakes/dmidecode: dispatches on -t flag to serve bios /
memory / system / baseboard / 39 fixtures for unit tests.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mkosi.conf: add ipmitool, ethtool, nvme-cli so the Firmware stage
can actually read BMC revisions, NIC firmware versions, and fall
back to nvme-cli when sysfs firmware_rev is missing.
firmware.go: probeNICFirmware and probeHBAFirmware now return
(snapshots, warning) so a missing ethtool/lspci surfaces in the
stage log the same way probeBIOS/probeBMC already do. Before, a
host without ethtool silently reported "bios=1 nvme_fw=1
microcode=1" with no hint that nic coverage was dropped.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a 1s client-side ticker that rewrites .run-duration text from a
data-started-at attribute, so the header timer on /runs/{id}
increments every second while the run is active. When an SSE swap
lands a fresh header the new server-rendered value seamlessly takes
over; when the run goes terminal the template drops the attribute
and the ticker silently skips the node, leaving the final elapsed in
place.
Other templ_*.go churn is cosmetic — regenerator versions differ
between CI and local and only the filename field in templ.Error
callsites changed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a paths-ignore filter to the push trigger so README tweaks,
*_test.go edits, other workflows, and fake-binary scaffolding no
longer spend 45 min debootstrapping + republishing an identical
bundle to the package registry. Adds workflow_dispatch as a manual
escape hatch for the cases where paths-ignore swallows something
that needs republishing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CancelRun goes through Runner.Transition → Runs.SetState, which was a
bare UPDATE state=? with no completed_at write. The host-page
runDuration helper treats nil CompletedAt as "still running", so a
cancelled run kept ticking forever. MarkCompleted / MarkFailed /
MarkDispatchFailed already stamp completed_at; SetState now does the
same for any terminal target state, using COALESCE so we never
clobber an already-set timestamp.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sibling Run-scan sites (Get, LatestForHost, ListForHost, Active) were
updated to include COALESCE(profile,'quick') in the SELECT when the
Phase 1 migration added the column; FindActiveByMAC was missed, so
Scan got 14 destination args for a 13-column row. The symptom is
/ipxe/{mac} returning 500 and the host booting nothing, since that
handler is what returns the live-image script.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Gitea's act_runner rejects @actions/artifact v2 (the engine behind
upload-artifact@v4). v3 is the last GHES-compatible major and still
supports the path: glob + retention-days we need.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ships all five phases of the deep-profile overhaul together. Runs now
carry a profile (quick/deep/soak); every profile walks the same
11-stage order — Inventory → Firmware → SpecValidate → SMART →
CPUStress → Storage → Network → Burn → GPU → PSU → Reporting —
with only per-stage durations and concurrency scaled.
Phase 1: profiles.ProfileRegistry loaded from vetting.yaml; runs.profile
column + CreateWithProfile; threshold table + evaluator seeded per-run
from the shared vetting.thresholds block; breach flips result at
/sensor + /result.
Phase 2: upgraded CPUStress (stress-ng --cpu-method=all --verify +
EDAC/MCE poll), Storage (fio --verify=md5 + SMART start/end delta),
Network (sustained iperf + /proc/net/dev deltas) with per-profile
knobs from Deps.
Phase 3: Burn super-stage with goroutine fan-out for CPU + memory +
fio + iperf, PSU rails sampled across the Burn window, SensorMux
(2 s flush, 500-sample cap) to absorb backpressure.
Phase 4: Firmware stage + firmware_snapshots table; probes dmidecode
(BIOS), ipmitool (BMC), ethtool -i (NIC), nvme (sysfs + id-ctrl),
lspci (HBA), /proc/cpuinfo (microcode). spec.DiffFirmware folds into
SpecValidate with pin-by-identifier and fan-out-across-component
matching; mismatches park the run in FailedHolding.
Phase 5: profile radio on the host start form, profile chip on the
run header, Firmware section in the HTML report, coverage artifact
uploaded from CI, agent/tests/fakes/ scaffold with Deps.LookPath
seam + stress_ng and dmidecode example fakes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
File-level DELETE leaves a ghost version directory that makes the
subsequent PUT 404 after a full 9-minute upload. Delete the whole
'latest' version, log the status code, and wait briefly before PUT.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Host page owns host metadata, full runs table with per-row stage strip,
in-flight banner, and empty-state CTA. Run page owns pipeline, active
step, logs, sub-steps, spec diffs, and hold banner with a breadcrumb
back to the host. Dashboard tile reverts to host-only.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The .log-line grid was templated with 5 columns (anchor/ln/lvl/ts/text),
but renderLogSSE inserts an optional log-stage span, making 6 children.
The 6th child wrapped to row 2 column 1 (24px wide), which forced the
message text to break one character per line. Flexbox with min-width:0
on the text span scales cleanly with or without the stage element.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reshapes the detail page into a run-view: hybrid horizontal pipeline
+ expanded active-step pane with sub-steps, a per-step log pane with
line-numbered permalinks and client-side search, and a runs-history
sidebar that navigates via ?run=N. Default step is server-picked
(running → failed → Reporting) so the operator lands on the thing
that's moving.
Adds a sub_steps table + SSE topic (substep-{run}-{stage}-{ordinal})
so per-disk and per-pass work (SMART, CPUStress CPU/RAM, Storage,
GPU) is visible in the UI instead of buried in stage summary JSON.
Agent emits sub-step reports from existing per-iteration loops.
Dashboard tiles become a mini run-view with a 9-dot step strip so
the operator reads run health across the whole grid at a glance.
Register page gets the same card shell + button styling.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Detail-page pipeline + log panes weren't updating without a manual
refresh. Root cause: the integrity attribute on htmx-ext-sse@2.2.2
in layout.templ was wrong, so the browser refused to execute the
script (SRI enforcement is silent — no user-visible error unless
you open devtools). htmx core loaded, boosted nav worked, forms
worked — but sse-connect/sse-swap were inert because the extension
never registered, so no EventSource was ever opened.
Replaced the claimed hash (Y4gc0CK6...) with the real one
(fw+eTlCc...) computed via
curl -sL https://unpkg.com/htmx-ext-sse@2.2.2 |
openssl dgst -sha384 -binary | openssl base64 -A
Added sse_e2e_test.go as a regression canary that mounts the real
chi router (RealIP + Recoverer + Logger middleware), opens
GET /events, publishes a tile-update via Runner, and asserts the
event lands on the wire. Server-side unit tests only verified
rendered HTML — this one covers the full publish→wire path, which
is what the next regression in this area will hit.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Orion's run (log 20:49 → 20:54) shipped GREEN while silently skipping
CPUStress. Two compounding bugs:
1. CPUStress ran --cpu N AND --vm N --vm-bytes 90% concurrently.
On a 4-core 8 GiB N95, that's 360% RAM overcommit; the OOM-killer
fired, usually on the agent itself. Replaced with two sequential
passes — CPU (all methods, --verify) for 3 min, then RAM (--vm 1,
--vm-bytes capped to MemAvailable − 1.5 GiB, floor 256 MiB, --verify)
for 3 min. Each pass now also asserts elapsed ≥ target − 2s so a
premature clean exit counts as failure instead of a silent pass.
2. On systemd-restart after the OOM, the agent hardcoded nextStage :=
"Inventory" and re-ran it. The orchestrator's /result handler
advances run state via TriggerStageCompleted against the *current*
RunState, not against body.Stage — so an Inventory result posted
while the run was in StateCPUStress silently advanced CPUStress →
Storage and marked CPUStress passed without it ever running.
Two-layer defense for #2:
- agent-side: /claim response now carries current_state; agent resumes
at the matching stage on a re-claim (happy path).
- server-side: new TriggerStageMismatch + StageNameForState helper
backstop. If body.Stage doesn't match the run's current stage, /result
parks the run in FailedHolding with failed_stage labeled
"<got> (expected <expected>)" and returns 409.
Other stages audited for similar unbounded concurrency — none found;
only CPUStress was unsafe.
Tests:
- cpustress_test.go — parseMemAvailable parses real meminfo, errors on
missing/malformed; cap calc hits floor on tiny boxes, uses 1.5 GiB
headroom on normal/huge boxes.
- statemachine_test.go — TriggerStageMismatch lands at FailedHolding
from every stage state and is rejected from pre-stage/terminal
states; StageNameForState round-trips the stageStates map.
- agent_handlers_test.go — TestResult_RejectsMismatchedStage proves
the Orion scenario now 409s + FailedHolding; TestResult_AcceptsMatchingStage
proves the guard doesn't break the happy path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pipeline fragment payload was a bare <div class=pipeline>, but the
sse-swap=pipeline-N wrapper lived only in the page shell. The first
outerHTML swap destroyed the wrapper, so every subsequent pipeline
event had nothing to target — forcing a manual refresh. RenderPipelineString
now emits the full <section id=pipeline-N sse-swap=... hx-swap=outerHTML>
wrapper, used from both the shell and the orchestrator publish path.
Also drop the red-bar styling from the empty DetailHold placeholder:
the wrapper's detail-hold class was painting an unconditional red band
between Pipeline and Actions whenever no hold was active.
The live image was still carrying the Phase 2 package list, so SMART,
CPUStress, and Network each hit a LookPath miss and returned
pass-with-skip. A run that skipped every real check still ended in
"completed" — nothing on the report said the image was broken.
Add smartmontools, stress-ng, fio, iperf3, lshw, lm-sensors,
e2fsprogs, and util-linux to mkosi.conf. Flip the three stages from
skip-pass to fail when their binary is missing so any future
packaging regression blocks the run instead of whispering past it.
Legitimate "no hardware" skips (no GPU, no hwmon, no disks,
non-destructive) are untouched.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The detail page was only partly live: Pipeline + LogTabs subscribed to
SSE, but the summary header, actions row, spec-diffs list and hold-key
block all froze at page-load and required a manual refresh to catch up
with state changes.
Extract each of those four regions into its own named templ component
with a stable id and sse-swap target, add Render*String helpers so the
orchestrator can publish pre-rendered fragments, and register a
HostDetailRenderer alongside the existing Tile/Pipeline renderers.
PublishHostDetail is folded into publishTileUpdate so every call site
that already refreshes a tile now also refreshes the detail page —
keeps the fan-out honest without scattering new publish calls.
The empty-state wrappers for spec-diffs and hold are load-bearing:
without the <section id=... sse-swap=...> present at initial GET, the
first live event after SpecValidate or Hold writes would have no DOM
node to swap into.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two related bugs were producing different map keys for identical
hardware depending on whether the inventory probe ran in the reporter
on the Proxmox host or in the live-image agent after PXE boot.
1. diskSerial read /sys/block/<dev>/device/{serial,vpd_pg80} and only
TrimSpace'd the result. vpd_pg80 is a binary SCSI VPD page with a
4-byte header, and some SSDs leak NUL/control bytes into the text
serial file. Those bytes survive into the Go string, lowercase
unchanged, and become a garbage map key that the reporter's cleaner
read can't match. Sanitize to ASCII-printable range at ingest.
2. probeGPUs built the model slug from fields[2] + " " + fields[3] of
`lspci -mm -nnk` output. fields[3] is subsystem vendor/device info,
which varies between otherwise-identical cards and carries the
`-rXX` revision marker — stable-enough for display but not for
identity. Use fields[2] alone, strip the trailing `[NNNN]` PCI
device-ID that lspci -nn appends, and sanitize for consistency.
After deploying the new orchestrator + re-running the configure step
on each registered host, SpecValidate will match cleanly. Disk diffs
self-resolve because the reporter already stored clean serials; GPU
diffs need one reporter re-run because the old expected slug still
carries subsystem noise.
Belt-and-braces for the kernel-cmdline systemd.firstboot=off fix.
mkosi ships /etc/machine-id empty, which triggers firstboot's
interactive locale/timezone/root-password prompt on every PXE boot;
with the agent running unattended there's nobody to answer and
sysinit.target blocks indefinitely.
Mask via a /dev/null symlink in /etc/systemd/system so the service
is unstartable regardless of cmdline — rules out the failure mode
where an older orchestrator binary serves an iPXE script without
the off-switch arg.
systemd-firstboot.service is an interactive wizard that asks for
locale, timezone, and root password when /etc/machine-id isn't
populated — i.e. every PXE boot of a mkosi-built image. It sits on
sysinit.target waiting for input that will never arrive, blocking
the agent service and every other downstream unit indefinitely.
systemd.firstboot=off on the kernel cmdline is the documented kill
switch; no image-side changes needed.
Drop --progress-bar (curl's minimal hash meter) in favor of the default
progress output, which includes transfer rate and time remaining.
Bundles grew from ~30 MB to ~300 MB with the full-rootfs initrd, and
a percentage-only bar with no speed hint makes a slow registry look
indistinguishable from a hang.
systemd-getty-generator reads console=ttyS0 off the kernel cmdline and
auto-creates serial-getty@ttyS0.service, which BindsTo dev-ttyS0.device.
On hardware without a physical serial port the device node never shows
up, systemd waits its full default 90s timeout, and only then proceeds.
systemd.mask= on the kernel cmdline is a first-class option — masks
the unit before the generator's link even gets activated. Kernel
messages still go to ttyS0 if a port is present; we just don't try
to spawn a login prompt there.
Host boots past kernel init and then stalls silently. ACPI DSDT error
about TXHC.RHUB.SS01 is benign noise (Tiger Lake firmware bug) — the
actual problem is that nothing between kernel handoff and (maybe)
systemd is visible on the console.
Two changes:
1. Replace the /init → sbin/init symlink with a real shell script
(live-image/mkosi.extra/init) that mounts /proc /sys /dev /dev/pts
/dev/shm /run before execing systemd. Systemd has fallback mount
code for these, but when it fails the failure is silent. Doing it
explicitly in /init keeps failures visible and avoids the fragile
symlink-resolution trick.
2. Drop 'quiet' from the kernel cmdline and add loglevel=7 plus
systemd.log_target=kmsg + journald.forward_to_console=1 so every
early-boot message reaches both tty0 and ttyS0. Will be dialed
back once boot is stable.
Also: .gitattributes pins LF on live-image/, .gitea/, Makefile, and
*.sh so Windows checkouts don't break shell scripts and Makefile
recipes with CRLF. /init also gets chmod 0755 in repack-initrd as a
belt-and-braces against mode loss on non-Linux checkouts.
update-initramfs produces a boot stub (~50 MB) that expects to mount a
separate rootfs over squashfs/disk/NFS. Our PXE channel only ships
vmlinuz+initrd.img, so the stub had nothing to pivot to — kernel
finished hand-off and the system wedged with firmware, modules, and
userspace stranded in the 545 MB rootfs dir we never delivered.
Replace with an everything-in-initramfs build: cpio.zst the full
rootfs (minus /boot) as the initrd, add /init -> sbin/init for the
kernel's runtime entrypoint, materialize the kernel symlink into a
real file. Bump check-initrd floor to 200 MB and switch the firmware
grep from unmkinitramfs (boot-stub-specific) to zstd | cpio -t.
Also add cpio to the CI apt deps.
Previous run actually built the 518 MB rootfs with firmware-misc-nonfree
et al. installed — the real payload is working. Two follow-ups:
- check-initrd was reading stat on a symlink path and getting 30 bytes
(the symlink's own size), not the 6.1.0-44-amd64 kernel initrd it
points to. Switched to wc -c, which follows symlinks, and to du -hL
for the OK message.
- Add zstd to Packages= so COMPRESS=zstd in initramfs.conf can be
honored; without it update-initramfs falls back to gzip with a
"No zstd in PATH" warning.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
v24.3 crashed in cp_version() during the copy-package-manager-trees
step because its sandbox needs bubblewrap (not present in the runner
apt list), and cp --version returned empty output inside the broken
sandbox. Installing bubblewrap and bumping to v25.3 which has tighter
sandbox fallback behavior.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous commit pinned mkosi==24.3 via pip but mkosi isn't published
on PyPI past ancient versions — the runner hit
"Could not find a version that satisfies the requirement mkosi==24.3".
Install from the upstream git tag v24.3 instead; added git to the apt
dep list for pip's VCS fetch.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous attempt (c962d6d) added firmware-linux-nonfree to mkosi.conf,
but the CI bundle was still 63 MB and Tiger Lake wedged on tgl_guc.
Two reasons: (1) firmware-linux-nonfree on bookworm is a thin
metapackage that doesn't include firmware-misc-nonfree, which is where
i915 GuC/HuC blobs actually live; (2) Ubuntu's apt-packaged mkosi is
old enough that Repositories=non-free-firmware shorthand likely isn't
wired through to the debootstrap invocation, so firmware packages
silently miss the bootstrap step entirely.
Changes:
- Enumerate firmware packages explicitly in mkosi.conf (firmware-
misc-nonfree, firmware-iwlwifi, firmware-realtek, firmware-amd-
graphics, firmware-intel-sound, intel/amd64-microcode).
- Ship mkosi.sources.d/debian.sources with explicit deb822 so the
non-free-firmware component is unambiguously available.
- Install mkosi 24.3 via pip in CI instead of apt's older build.
- Pin MODULES=most and COMPRESS=zstd via a tracked initramfs-tools
config under mkosi.extra/.
- Narrow .gitignore so only the generated agent binary is ignored,
not the whole mkosi.extra/ tree.
- New check-initrd Makefile target asserts both size (>=150 MB) and
actual presence of i915/tgl_guc_*.bin inside the built initrd, so
a silent firmware-drop regression fails the build loudly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tiger Lake and later Intel iGPUs need i915/tgl_guc_*.bin; without
it the i915 init wedges and floods the console. Same story on most
modern wifi/NIC hardware. Pull firmware-linux-nonfree (metapackage
covering misc-nonfree, iwlwifi, realtek, amd-graphics, …) from the
bookworm non-free-firmware repo — single line fix, ~500MB cost to
the squashfs, worth it for booting arbitrary repaired hosts.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Non-destructive pre-declares "don't touch the disks" on Start: the
Storage stage skips wipe-probe, badblocks -w, and write-mode fio,
and reports a read-only summary. Runs a new non_destructive column;
threaded through Claim → agent tests.Deps → Storage stage.
Cancel halts an in-flight run. The orchestrator transitions to a
new StateCancelled via TriggerOperatorCancelled (valid from any
active state); the agent's next heartbeat returns cmd=cancel_stage,
which fires a stored CancelFunc on the per-stage context. Stage
subprocesses spawned with exec.CommandContext die with the context,
the agent posts a cancelled outcome, then powers the host off.
Destructive stages mid-run may leave the host in an intermediate
state — the UI confirm dialog warns the operator; recovery is
manual for now.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dnsmasq's SIGHUP re-reads /etc/ethers and any --dhcp-hostsfile= paths,
but NOT dhcp-host= lines from the main conf. Reload() was faithfully
rewriting dnsmasq.conf with the new MAC, sending SIGHUP, and then
dnsmasq kept serving its startup view — so a freshly-registered host
still showed up as "proxy-ignored, tags: eth0" with no "known" tag.
Split the allowlist into ${RuntimeDir}/dhcp-hosts, referenced from the
main conf via dhcp-hostsfile=. writeConf() is static-ish now; Reload
just rewrites the hosts file and SIGHUPs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pxe.Supervisor.Reload() was defined but never wired up. After a host
was registered in the UI or via the quick-register JSON endpoint, the
dnsmasq conf still held only the hosts that existed at orchestrator
startup. The new MAC wasn't tagged `known`, so when the host PXE'd,
dnsmasq logged "PXE(eth0) <mac> proxy-ignored" and the boot timed out
back to the BIOS.
Add an optional PXEReloader interface to api.UI, wire it from main
when pxe is enabled, and call u.reloadPXE() after successful Create
and Delete. Logs-and-continues on failure — host registration itself
has already committed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dnsmasq's proxy-DHCP syntax is `dhcp-range=<network-ip>,proxy[,<mask>]`,
not a CIDR. Passing "192.168.1.0/24,proxy" made dnsmasq refuse to start
with "bad dhcp-range at line 12". Parse the CIDR once in writeConf()
and render Network + Netmask as separate template fields.
The config surface (pxe.subnet) stays CIDR because that's the right
shape for humans; we just unpack it before handing to dnsmasq.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
proxmox-install.sh tarball-extracts into a tempdir that gets wiped on
EXIT, so after the one-liner there's no pxe-setup.sh on disk for the
operator to run. Have install.sh drop the script + ipxe-shas.txt into
/usr/local/share/vetting/ and symlink it as
/usr/local/sbin/vetting-pxe-setup (in PATH).
pxe-setup.sh now readlink -f's BASH_SOURCE so the symlink resolves to
the share dir where ipxe-shas.txt lives, and gracefully handles the
case where install.sh already staged vmlinuz + initrd.img into
LIVE_DIR (no bundle live-image/ needed at that point).
Update the trailing hint in proxmox-install.sh and the operations
runbook to surface the new `sudo vetting-pxe-setup ...` command.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rewrites the PXE section of the ops runbook around the new proxy-DHCP
model (no dedicated bridge, coexists with UniFi/pfSense/etc.) and
swaps the e2e test's default bridge + orchestrator URL to match. The
e2e file now calls out the LAN-DHCP precondition in its header so
future-me (or CI) doesn't hang at PXE wondering why nothing answers.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously the orchestrator ran a full DHCP server on a dedicated
br-vetting bridge (10.77.0.0/24), which required a hypervisor-level
bridge + physical cabling onto that bridge for every repaired host.
Real-world bite: the LXC's br-vetting had no L2 path to the target
host's PXE NIC, so DHCPDISCOVERs never reached eth1 and PXE silently
timed out.
dnsmasq's proxy-DHCP mode is the idiomatic answer: it coexists with
the LAN's existing DHCP server (UniFi, etc.), never assigns an IP
itself, and only supplements the PXE options. No dedicated bridge,
no VLAN, no cabling changes \u2014 dnsmasq binds to the LAN interface
and layers option 66/67 + the PXE BINL on top of the real DHCP
exchange. The MAC allowlist still gates replies, so random LAN
clients booting from network get nothing.
Template switches dhcp-range=<start,end,lease> to
dhcp-range=<cidr>,proxy and replaces dhcp-boot= for first-boot ROM
clients with pxe-service= directives (the correct proxy-mode
chainload form). Validation drops the dhcp_range regex for a
net.ParseCIDR check on pxe.subnet. Config, production/example yaml,
and pxe-setup.sh swap --dhcp-range for --subnet.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>