Sibling Run-scan sites (Get, LatestForHost, ListForHost, Active) were
updated to include COALESCE(profile,'quick') in the SELECT when the
Phase 1 migration added the column; FindActiveByMAC was missed, so
Scan got 14 destination args for a 13-column row. The symptom is
/ipxe/{mac} returning 500 and the host booting nothing, since that
handler is what returns the live-image script.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ships all five phases of the deep-profile overhaul together. Runs now
carry a profile (quick/deep/soak); every profile walks the same
11-stage order — Inventory → Firmware → SpecValidate → SMART →
CPUStress → Storage → Network → Burn → GPU → PSU → Reporting —
with only per-stage durations and concurrency scaled.
Phase 1: profiles.ProfileRegistry loaded from vetting.yaml; runs.profile
column + CreateWithProfile; threshold table + evaluator seeded per-run
from the shared vetting.thresholds block; breach flips result at
/sensor + /result.
Phase 2: upgraded CPUStress (stress-ng --cpu-method=all --verify +
EDAC/MCE poll), Storage (fio --verify=md5 + SMART start/end delta),
Network (sustained iperf + /proc/net/dev deltas) with per-profile
knobs from Deps.
Phase 3: Burn super-stage with goroutine fan-out for CPU + memory +
fio + iperf, PSU rails sampled across the Burn window, SensorMux
(2 s flush, 500-sample cap) to absorb backpressure.
Phase 4: Firmware stage + firmware_snapshots table; probes dmidecode
(BIOS), ipmitool (BMC), ethtool -i (NIC), nvme (sysfs + id-ctrl),
lspci (HBA), /proc/cpuinfo (microcode). spec.DiffFirmware folds into
SpecValidate with pin-by-identifier and fan-out-across-component
matching; mismatches park the run in FailedHolding.
Phase 5: profile radio on the host start form, profile chip on the
run header, Firmware section in the HTML report, coverage artifact
uploaded from CI, agent/tests/fakes/ scaffold with Deps.LookPath
seam + stress_ng and dmidecode example fakes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Host page owns host metadata, full runs table with per-row stage strip,
in-flight banner, and empty-state CTA. Run page owns pipeline, active
step, logs, sub-steps, spec diffs, and hold banner with a breadcrumb
back to the host. Dashboard tile reverts to host-only.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The .log-line grid was templated with 5 columns (anchor/ln/lvl/ts/text),
but renderLogSSE inserts an optional log-stage span, making 6 children.
The 6th child wrapped to row 2 column 1 (24px wide), which forced the
message text to break one character per line. Flexbox with min-width:0
on the text span scales cleanly with or without the stage element.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reshapes the detail page into a run-view: hybrid horizontal pipeline
+ expanded active-step pane with sub-steps, a per-step log pane with
line-numbered permalinks and client-side search, and a runs-history
sidebar that navigates via ?run=N. Default step is server-picked
(running → failed → Reporting) so the operator lands on the thing
that's moving.
Adds a sub_steps table + SSE topic (substep-{run}-{stage}-{ordinal})
so per-disk and per-pass work (SMART, CPUStress CPU/RAM, Storage,
GPU) is visible in the UI instead of buried in stage summary JSON.
Agent emits sub-step reports from existing per-iteration loops.
Dashboard tiles become a mini run-view with a 9-dot step strip so
the operator reads run health across the whole grid at a glance.
Register page gets the same card shell + button styling.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Detail-page pipeline + log panes weren't updating without a manual
refresh. Root cause: the integrity attribute on htmx-ext-sse@2.2.2
in layout.templ was wrong, so the browser refused to execute the
script (SRI enforcement is silent — no user-visible error unless
you open devtools). htmx core loaded, boosted nav worked, forms
worked — but sse-connect/sse-swap were inert because the extension
never registered, so no EventSource was ever opened.
Replaced the claimed hash (Y4gc0CK6...) with the real one
(fw+eTlCc...) computed via
curl -sL https://unpkg.com/htmx-ext-sse@2.2.2 |
openssl dgst -sha384 -binary | openssl base64 -A
Added sse_e2e_test.go as a regression canary that mounts the real
chi router (RealIP + Recoverer + Logger middleware), opens
GET /events, publishes a tile-update via Runner, and asserts the
event lands on the wire. Server-side unit tests only verified
rendered HTML — this one covers the full publish→wire path, which
is what the next regression in this area will hit.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Orion's run (log 20:49 → 20:54) shipped GREEN while silently skipping
CPUStress. Two compounding bugs:
1. CPUStress ran --cpu N AND --vm N --vm-bytes 90% concurrently.
On a 4-core 8 GiB N95, that's 360% RAM overcommit; the OOM-killer
fired, usually on the agent itself. Replaced with two sequential
passes — CPU (all methods, --verify) for 3 min, then RAM (--vm 1,
--vm-bytes capped to MemAvailable − 1.5 GiB, floor 256 MiB, --verify)
for 3 min. Each pass now also asserts elapsed ≥ target − 2s so a
premature clean exit counts as failure instead of a silent pass.
2. On systemd-restart after the OOM, the agent hardcoded nextStage :=
"Inventory" and re-ran it. The orchestrator's /result handler
advances run state via TriggerStageCompleted against the *current*
RunState, not against body.Stage — so an Inventory result posted
while the run was in StateCPUStress silently advanced CPUStress →
Storage and marked CPUStress passed without it ever running.
Two-layer defense for #2:
- agent-side: /claim response now carries current_state; agent resumes
at the matching stage on a re-claim (happy path).
- server-side: new TriggerStageMismatch + StageNameForState helper
backstop. If body.Stage doesn't match the run's current stage, /result
parks the run in FailedHolding with failed_stage labeled
"<got> (expected <expected>)" and returns 409.
Other stages audited for similar unbounded concurrency — none found;
only CPUStress was unsafe.
Tests:
- cpustress_test.go — parseMemAvailable parses real meminfo, errors on
missing/malformed; cap calc hits floor on tiny boxes, uses 1.5 GiB
headroom on normal/huge boxes.
- statemachine_test.go — TriggerStageMismatch lands at FailedHolding
from every stage state and is rejected from pre-stage/terminal
states; StageNameForState round-trips the stageStates map.
- agent_handlers_test.go — TestResult_RejectsMismatchedStage proves
the Orion scenario now 409s + FailedHolding; TestResult_AcceptsMatchingStage
proves the guard doesn't break the happy path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pipeline fragment payload was a bare <div class=pipeline>, but the
sse-swap=pipeline-N wrapper lived only in the page shell. The first
outerHTML swap destroyed the wrapper, so every subsequent pipeline
event had nothing to target — forcing a manual refresh. RenderPipelineString
now emits the full <section id=pipeline-N sse-swap=... hx-swap=outerHTML>
wrapper, used from both the shell and the orchestrator publish path.
Also drop the red-bar styling from the empty DetailHold placeholder:
the wrapper's detail-hold class was painting an unconditional red band
between Pipeline and Actions whenever no hold was active.
The detail page was only partly live: Pipeline + LogTabs subscribed to
SSE, but the summary header, actions row, spec-diffs list and hold-key
block all froze at page-load and required a manual refresh to catch up
with state changes.
Extract each of those four regions into its own named templ component
with a stable id and sse-swap target, add Render*String helpers so the
orchestrator can publish pre-rendered fragments, and register a
HostDetailRenderer alongside the existing Tile/Pipeline renderers.
PublishHostDetail is folded into publishTileUpdate so every call site
that already refreshes a tile now also refreshes the detail page —
keeps the fan-out honest without scattering new publish calls.
The empty-state wrappers for spec-diffs and hold are load-bearing:
without the <section id=... sse-swap=...> present at initial GET, the
first live event after SpecValidate or Hold writes would have no DOM
node to swap into.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
systemd-firstboot.service is an interactive wizard that asks for
locale, timezone, and root password when /etc/machine-id isn't
populated — i.e. every PXE boot of a mkosi-built image. It sits on
sysinit.target waiting for input that will never arrive, blocking
the agent service and every other downstream unit indefinitely.
systemd.firstboot=off on the kernel cmdline is the documented kill
switch; no image-side changes needed.
systemd-getty-generator reads console=ttyS0 off the kernel cmdline and
auto-creates serial-getty@ttyS0.service, which BindsTo dev-ttyS0.device.
On hardware without a physical serial port the device node never shows
up, systemd waits its full default 90s timeout, and only then proceeds.
systemd.mask= on the kernel cmdline is a first-class option — masks
the unit before the generator's link even gets activated. Kernel
messages still go to ttyS0 if a port is present; we just don't try
to spawn a login prompt there.
Host boots past kernel init and then stalls silently. ACPI DSDT error
about TXHC.RHUB.SS01 is benign noise (Tiger Lake firmware bug) — the
actual problem is that nothing between kernel handoff and (maybe)
systemd is visible on the console.
Two changes:
1. Replace the /init → sbin/init symlink with a real shell script
(live-image/mkosi.extra/init) that mounts /proc /sys /dev /dev/pts
/dev/shm /run before execing systemd. Systemd has fallback mount
code for these, but when it fails the failure is silent. Doing it
explicitly in /init keeps failures visible and avoids the fragile
symlink-resolution trick.
2. Drop 'quiet' from the kernel cmdline and add loglevel=7 plus
systemd.log_target=kmsg + journald.forward_to_console=1 so every
early-boot message reaches both tty0 and ttyS0. Will be dialed
back once boot is stable.
Also: .gitattributes pins LF on live-image/, .gitea/, Makefile, and
*.sh so Windows checkouts don't break shell scripts and Makefile
recipes with CRLF. /init also gets chmod 0755 in repack-initrd as a
belt-and-braces against mode loss on non-Linux checkouts.
Non-destructive pre-declares "don't touch the disks" on Start: the
Storage stage skips wipe-probe, badblocks -w, and write-mode fio,
and reports a read-only summary. Runs a new non_destructive column;
threaded through Claim → agent tests.Deps → Storage stage.
Cancel halts an in-flight run. The orchestrator transitions to a
new StateCancelled via TriggerOperatorCancelled (valid from any
active state); the agent's next heartbeat returns cmd=cancel_stage,
which fires a stored CancelFunc on the per-stage context. Stage
subprocesses spawned with exec.CommandContext die with the context,
the agent posts a cancelled outcome, then powers the host off.
Destructive stages mid-run may leave the host in an intermediate
state — the UI confirm dialog warns the operator; recovery is
manual for now.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dnsmasq's SIGHUP re-reads /etc/ethers and any --dhcp-hostsfile= paths,
but NOT dhcp-host= lines from the main conf. Reload() was faithfully
rewriting dnsmasq.conf with the new MAC, sending SIGHUP, and then
dnsmasq kept serving its startup view — so a freshly-registered host
still showed up as "proxy-ignored, tags: eth0" with no "known" tag.
Split the allowlist into ${RuntimeDir}/dhcp-hosts, referenced from the
main conf via dhcp-hostsfile=. writeConf() is static-ish now; Reload
just rewrites the hosts file and SIGHUPs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pxe.Supervisor.Reload() was defined but never wired up. After a host
was registered in the UI or via the quick-register JSON endpoint, the
dnsmasq conf still held only the hosts that existed at orchestrator
startup. The new MAC wasn't tagged `known`, so when the host PXE'd,
dnsmasq logged "PXE(eth0) <mac> proxy-ignored" and the boot timed out
back to the BIOS.
Add an optional PXEReloader interface to api.UI, wire it from main
when pxe is enabled, and call u.reloadPXE() after successful Create
and Delete. Logs-and-continues on failure — host registration itself
has already committed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dnsmasq's proxy-DHCP syntax is `dhcp-range=<network-ip>,proxy[,<mask>]`,
not a CIDR. Passing "192.168.1.0/24,proxy" made dnsmasq refuse to start
with "bad dhcp-range at line 12". Parse the CIDR once in writeConf()
and render Network + Netmask as separate template fields.
The config surface (pxe.subnet) stays CIDR because that's the right
shape for humans; we just unpack it before handing to dnsmasq.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously the orchestrator ran a full DHCP server on a dedicated
br-vetting bridge (10.77.0.0/24), which required a hypervisor-level
bridge + physical cabling onto that bridge for every repaired host.
Real-world bite: the LXC's br-vetting had no L2 path to the target
host's PXE NIC, so DHCPDISCOVERs never reached eth1 and PXE silently
timed out.
dnsmasq's proxy-DHCP mode is the idiomatic answer: it coexists with
the LAN's existing DHCP server (UniFi, etc.), never assigns an IP
itself, and only supplements the PXE options. No dedicated bridge,
no VLAN, no cabling changes \u2014 dnsmasq binds to the LAN interface
and layers option 66/67 + the PXE BINL on top of the real DHCP
exchange. The MAC allowlist still gates replies, so random LAN
clients booting from network get nothing.
Template switches dhcp-range=<start,end,lease> to
dhcp-range=<cidr>,proxy and replaces dhcp-boot= for first-boot ROM
clients with pxe-service= directives (the correct proxy-mode
chainload form). Validation drops the dhcp_range regex for a
net.ParseCIDR check on pxe.subnet. Config, production/example yaml,
and pxe-setup.sh swap --dhcp-range for --subnet.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Without explicit dhcp-leasefile and pid-file, dnsmasq reaches for
its distro defaults (/var/lib/misc/dnsmasq.leases,
/run/dnsmasq.pid) — both outside the systemd unit's
ReadWritePaths=/var/lib/vetting /var/log/vetting sandbox, causing
'Read-only file system' on every start.
RuntimeDir is already writable by construction (Supervisor.Start
mkdir's it), so writing both files there keeps dnsmasq entirely
inside the sandbox.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Collapses the LXC side of PXE enablement from a six-step manual dance
(build, fetch iPXE, scp, bridge, hand-edit yaml) into:
make release # dev box (Linux/WSL)
scp bundle.tar.gz lxc:/tmp/
sudo ./install.sh # base install, unchanged
sudo ./pxe-setup.sh --interface ... --dhcp-range ... --orchestrator-url ...
pxe-setup.sh fetches iPXE from boot.ipxe.org, verifies against pinned
SHA256s in deploy/ipxe-shas.txt (fail-closed), places vmlinuz/initrd.img
from the bundle, and rewrites only the pxe: block of vetting.yaml.
Idempotent; --force gates overwriting a hand-edited block.
Adds Supervisor.Validate() — called before dnsmasq spawn — so typo'd
configs fail at orchestrator startup with clear errors naming the
missing file or yaml key, instead of silently serving broken TFTP
until a real host tries to PXE-boot. Nine tests cover missing files,
bogus interface, malformed dhcp_range, bad orchestrator_url, and
aggregate reporting.
Hypervisor bridge creation stays documented (LXC can't do it) but
everything downstream of the bridge is now scripted.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Re-running quick.sh on a host where vetting-reporter was already
running failed with curl error 23 because curl can't overwrite a
busy executable. Download to a staging path, then use `install(1)`
which unlinks the target before writing. Swap `enable --now` for
`enable` + `restart` so the service picks up the new binary.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every supported host runs vetting-reporter in-OS and heartbeats every
30s. WoL was never the thing that started vetting — the heartbeat
response's reboot_for_vetting command was. Firing WoL first only
crowded the run log with misleading diagnostics when the real failure
mode is "reporter isn't installed."
- StartRun 409s if the host hasn't heartbeated within 60s, pointing
the operator at /register/quick.sh.
- Dispatcher re-checks LastSeenAt at dispatch time (run may sit in
Queued long enough for the host to go offline); stale hosts mark
the run Failed with failed_stage=dispatch instead of looping.
- New StateWaitingReboot + TriggerRebootCommanded capture the actual
semantics. StateWaitingWoL kept as the hook point for a future
manual-override button.
- Tile disables the Start button with a quick.sh tooltip when the
host is offline, matching the server-side 409.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Operators who installed vetting before agent.asset_dir existed keep
their config preserved by install.sh on upgrade, which left them
with AssetDir="" — the router silently dropped the /assets/*
mount and the quick-register one-liner hit 404 fetching the agent
binary. Default AssetDir alongside the database file so the same
directory install.sh already creates + drops the agent binary into
is picked up automatically.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pipeline now always renders all 13 nodes (3 pre-stage + 9 stage +
Completed), synthesising ghosts from run state when stage rows
aren't seeded yet. Makes a WaitingWoL host show the full timeline
ahead of it instead of just 4 dots.
Agent tags each log line with its stage; logs.Hub fans out to both
log-{runID} and log-{runID}-{stage} SSE events so the detail page
can show per-stage tabs with a pure-CSS radio-sibling switch. Flat
run log prepends [stage] so grep still works.
Dispatcher writes picked/sent-WoL/heartbeat lines into the per-run
log — the operator opens the detail page, sees WaitingWoL stuck,
and reads exactly what the dispatcher did and why nothing's
progressing, instead of having to tail journalctl on the LXC.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Click a tile to open /hosts/{id} — the canonical control surface per
host. Timeline renders every pre-stage, stage, and terminal node in
order, with the current one pulsing, failed ones flagged, and
downstream ones dimmed as skipped. Detail page shows summary, hold
card (when holding), all action buttons, spec diffs, a full-height
log pane, and a collapsed expected-spec YAML.
Tile slims to name, last-seen, status, and one primary action; a
CSS-overlay <a> makes the whole card clickable while buttons stay
receptive via z-index.
Runner.publishTileUpdate now also emits pipeline-{runID} fragments,
and CompleteStage wraps Stages.CompleteByName so stage completions
advance the timeline live — without this the dots only moved on
state transitions.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the operator clicks Start vetting and the host is heartbeating,
the heartbeat response now carries cmd=reboot_for_vetting + run_id.
The handler drives the Queued → WaitingWoL transition via the existing
state machine, so a benign race with the 2s dispatcher poll is refused
by the state machine (not double-dispatched). WaitingWoL retries for
10 minutes to cover a crashed-mid-reboot case, then falls back to
operator action.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
vetting-agent gains a `host` subcommand that runs as a systemd service
installed by the quick-register one-liner, POSTing every 30s to
/api/v1/hosts/{mac}/heartbeat so the dashboard tile shows "online" or
"Nm ago" without waiting on WoL. Ships dormant client code for the
Phase 2 reboot_for_vetting command so the server can flip it on later
without a binary redeploy.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two bugs compounded on Proxmox hosts: primary_iface walked
`ip link show` and picked the physical NIC (e.g. enp1s0), which has
no IPv4 on Proxmox because the address lives on vmbr0. Even if vmbr0
had been picked, the kernel reports its broadcast as 0.0.0.0, so the
script fell all the way back to 255.255.255.255.
Now we prefer the default-route interface (vmbr0 on Proxmox, eno1 on
bare metal) and, when `ip` doesn't surface a usable `brd`, compute
the broadcast from the inet CIDR instead of giving up.
Operator pastes `curl -fsSL $ORCH/register/quick.sh | sudo bash` on the
target host (pre-wipe). The script probes MAC + CPU/RAM/disks/NICs/GPUs,
emits an expected-spec YAML, and POSTs to a new LAN-trusted JSON
endpoint /api/v1/hosts. The register page shows the command prefilled
with the orchestrator URL; the manual form moves into a collapsible
"Register manually" disclosure.
Can't log in from a fresh LXC deploy, and the service is LAN-only by
design. Rip out the whole bcrypt-password / signed-cookie session
layer: internal/auth, login templates, gen-admin-password binary +
Makefile targets, auth config block, login/logout routes and the
RequireSession middleware wrap. Agent bearer-token auth on
/api/v1/runs/{id}/* is untouched.
Operators who want a password can front the service with a reverse
proxy — noted in README and docs/operations.md.