Commit Graph

89 Commits

Author SHA1 Message Date
josh 8367ec2a9f docs: comprehensive documentation expansion
CI / Lint + build + test (push) Successful in 1m36s
Release / detect (push) Successful in 5s
Release / build-live-image (push) Has been skipped
Release / bundle (push) Successful in 49s
Add 4 new doc files (configuration reference, development guide, API
reference with full request/response schemas, database schema), expand
the README with a feature list and how-it-works walkthrough, fix
missing Firmware and Burn stages in architecture.md and test-suite.md,
add threshold engine and host-mode agent sections, and add godoc
comments to 11 packages and 6 model types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-23 18:37:26 -04:00
josh 17ec55cb85 chore: cleanup sprint — dead CSS, dedup helpers, handler refactor
CI / Lint + build + test (push) Successful in 1m34s
Release / detect (push) Successful in 4s
Release / build-live-image (push) Has been skipped
Release / bundle (push) Successful in 1m5s
Remove ~126 lines of orphaned CSS from tile slim-down and old detail
layout. Consolidate 4 duplicate duration formatters into shared
elapsed()/fmtElapsed() helpers. Break 160-line Result handler into
focused sub-functions. Implement real Hub.Shutdown() (was a no-op).
Standardize agent error responses to JSON. Replace panic() in router
init with error return. Extract magic numbers as named constants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-21 20:39:38 -04:00
josh c11573eeeb feat(ui): slim dashboard tile to hostname + online/offline only
CI / Lint + build + test (push) Successful in 1m33s
Release / detect (push) Successful in 5s
Release / build-live-image (push) Has been skipped
Release / bundle (push) Successful in 53s
Run status, Start/Cancel/View controls, and non-destructive toggle all
live on /hosts/{id} — duplicating them on the dashboard tile clogged
the grid and wouldn't scale past a handful of hosts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-20 22:56:05 -04:00
josh 6d50f3a804 feat(install): polish install UX with banner, spinner, progress bar, summary
CI / Lint + build + test (push) Successful in 1m38s
Release / detect (push) Successful in 7s
Release / build-live-image (push) Has been skipped
Release / bundle (push) Successful in 55s
Wrap the three install scripts in a shared inline style block (TTY/UTF-8/
NO_COLOR-aware) so the one-liner install looks and feels intentional:
banner on start, timed step lines, braille spinner over silent apt/
systemctl calls with failure log dumps, single-line curl progress bars
with size-prefixed headers, and a summary box at the end with live-image
version + service state + next steps. install.sh defers banner/summary
to proxmox-install.sh when VETTING_INSTALL_WRAPPED is set so the two
scripts compose without duplication.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-20 22:29:44 -04:00
josh 48f992a451 bump live-image
CI / Lint + build + test (push) Successful in 1m35s
Release / detect (push) Successful in 6s
Release / build-live-image (push) Successful in 7m40s
Release / bundle (push) Successful in 50s
2026-04-20 21:31:09 -04:00
josh 98cdd95b50 chore(release): add registry auth diagnostic to build-live-image
CI / Lint + build + test (push) Successful in 1m38s
Release / detect (push) Successful in 5s
Release / build-live-image (push) Has been skipped
Release / bundle (push) Failing after 53s
Echoes OWNER, token length, and whoami before the upload so a 401
disambiguates: missing/empty token, bad OWNER resolution, or token
authenticating as a different user.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-20 21:27:23 -04:00
josh 211abdf08f feat(release): version live-image, skip rebuild+redownload when unchanged
CI / Lint + build + test (push) Successful in 1m41s
Release / detect (push) Successful in 7s
Release / build-live-image (push) Failing after 3m58s
Release / bundle (push) Has been skipped
Splits the release workflow into three jobs (detect, build-live-image,
bundle) so the ~9 min mkosi build only runs when live-image/VERSION
bumps. The slim bundle (~30 MB: orchestrator + agent + deploy scripts
+ a live-image/VERSION pointer) rebuilds every push; the ~300 MB
vmlinuz+initrd.img are published separately under the immutable
live-image/<version>/ path. install.sh compares the pointer to
/var/lib/vetting/live/VERSION and fetches the files only on mismatch,
cutting repeat-install wall-clock from ~30 s + 300 MB to ~10 s + 0 MB
on the common no-live-image-change release.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-20 21:04:14 -04:00
josh 4c153bb115 chore(templ): regenerate host_tile_templ.go for cancelledFromHold
Catches the generated file up to the .templ source committed in 599fd15.
No behavior change — the generator just hadn't been re-run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-20 21:04:01 -04:00
josh a01db63952 feat(install): auto-heal pxe.interface/pxe.subnet against the host
CI / Lint + build + test (push) Successful in 1m42s
Release / release (push) Successful in 19m30s
A stale /etc/vetting/vetting.yaml (e.g. pxe.interface=eth1 after an
LXC rebuild renamed the NIC to eth0) blocks vetting.service startup
with "pxe.interface 'eth1' not found on host", requiring the operator
to ssh in and hand-edit the yaml after every rebuild.

install.sh now validates the pxe block against the host's actual
network state on every install/upgrade run. If pxe.enabled is true and
pxe.interface doesn't exist (or pxe.subnet is missing/malformed), the
script auto-detects the primary NIC via the default route, reads its
subnet from the kernel-scope route, and patches both values in place.
Valid configs are left exactly as the operator had them; fresh
installs with pxe.enabled=false skip the check entirely.

The one-liner install/update is now self-healing for the most common
stale-config failure mode.
2026-04-20 19:56:39 -04:00
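Sketch for a01db63952: the detection itself lives in install.sh (shell), so the Go below only illustrates the parsing logic, assuming input shaped like `ip -4 route show` output. `detectPXE` and `fieldAfter` are hypothetical names, not functions from the repository.

```go
package main

import (
	"fmt"
	"strings"
)

// detectPXE (hypothetical) takes the text of `ip -4 route show` and returns
// the NIC behind the default route plus that NIC's kernel-scope subnet.
func detectPXE(routes string) (iface, subnet string, ok bool) {
	lines := strings.Split(routes, "\n")
	// Pass 1: find the device named on the default route.
	for _, ln := range lines {
		f := strings.Fields(ln)
		if len(f) > 0 && f[0] == "default" {
			iface = fieldAfter(f, "dev")
			break
		}
	}
	if iface == "" {
		return "", "", false
	}
	// Pass 2: find the "proto kernel" route in CIDR form on that device.
	for _, ln := range lines {
		f := strings.Fields(ln)
		if len(f) == 0 || f[0] == "default" {
			continue
		}
		if fieldAfter(f, "dev") == iface &&
			fieldAfter(f, "proto") == "kernel" &&
			strings.Contains(f[0], "/") {
			return iface, f[0], true
		}
	}
	return "", "", false
}

// fieldAfter returns the token that follows key, or "" if key is absent.
func fieldAfter(f []string, key string) string {
	for i := 0; i+1 < len(f); i++ {
		if f[i] == key {
			return f[i+1]
		}
	}
	return ""
}

func main() {
	out := "default via 10.0.0.1 dev eth0 proto dhcp src 10.0.0.5 metric 100\n" +
		"10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.5\n"
	iface, subnet, ok := detectPXE(out)
	fmt.Println(iface, subnet, ok)
}
```

A caller would patch pxe.interface and pxe.subnet only when `ok` is true and the existing values fail validation, which matches the "valid configs are left exactly as the operator had them" rule in the commit message.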
josh 599fd156d0 feat(ui): distinguish cancel-from-hold as "Failed (cancelled)"
CI / Lint + build + test (push) Successful in 1m33s
Release / release (push) Successful in 12m41s
Before, a run that failed, held for operator review, and was then
cancelled showed up on the tile and run header as plain "Cancelled"
with an idle-grey mood — indistinguishable from a mid-stage cancel of
a healthy run. That hides the actual failure from the dashboard.

Now: when State=Cancelled with FailedStage still set (the hold-cancel
signature the heartbeat handler already uses to pick reboot vs
cancel_stage), the badge reads "Failed (cancelled)" with a fail-
colored mood. Mid-stage cancels keep reading as plain "Cancelled".
2026-04-20 18:54:04 -04:00
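Sketch for 599fd156d0: a minimal illustration of the badge rule, assuming states are plain strings and moods are the labels "fail", "idle", and "neutral". The `Run` struct and `badgeLabel` are hypothetical; the real templ helper is not shown in the log.

```go
package main

import "fmt"

// Run carries only the two fields the rule needs (hypothetical shape).
type Run struct {
	State       string
	FailedStage string
}

// badgeLabel returns the badge text and a mood name for styling.
// Cancelled with FailedStage still set is the hold-cancel signature.
func badgeLabel(r Run) (label, mood string) {
	if r.State == "Cancelled" {
		if r.FailedStage != "" {
			return "Failed (cancelled)", "fail"
		}
		return "Cancelled", "idle"
	}
	return r.State, "neutral"
}

func main() {
	label, mood := badgeLabel(Run{State: "Cancelled", FailedStage: "SMART"})
	fmt.Println(label, mood)
}
```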
josh 73f727b4c1 fix(agent): keep heartbeat loop alive during FailedHolding
CI / Lint + build + test (push) Successful in 1m51s
Release / release (push) Failing after 4m28s
The heartbeat handler was returning cmd=abort for FailedHolding, which
caused the agent's heartbeat goroutine to exit after ~10s in hold.
Subsequent state changes (Cancel -> reboot, Override -> retry_stage)
then had no recipient, so the host sat idle at the SSH hold prompt
forever. Narrowed cmd=abort to StateReleased only; FailedHolding falls
through to cmd=continue so the loop keeps polling and can receive the
operator's eventual command.
2026-04-20 18:28:43 -04:00
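Sketch for 73f727b4c1, read together with the older commits 62bddac110 and 3656af9823 further down the log: one way the heartbeat command selection could look once all three are in place. State names are taken from the commit messages but modelled as strings, and `heartbeatCommand` is a hypothetical name; the real handler may be structured differently.

```go
package main

import "fmt"

// heartbeatCommand picks the cmd value returned to the agent.
// Assumption: any state not listed falls through to "continue".
func heartbeatCommand(state, failedStage string) string {
	switch state {
	case "Released":
		// Only Released ends the agent's heartbeat loop.
		return "abort"
	case "Completed":
		// Completed runs reboot and fall through iPXE to local disk.
		return "reboot"
	case "Cancelled":
		if failedStage != "" {
			// Cancel from hold: no subprocess in flight, so reboot.
			return "reboot"
		}
		// Mid-stage cancel: kill the stage context, then power off.
		return "cancel_stage"
	default:
		// Includes FailedHolding, so the loop keeps polling.
		return "continue"
	}
}

func main() {
	fmt.Println(heartbeatCommand("FailedHolding", "SMART"))
	fmt.Println(heartbeatCommand("Cancelled", "SMART"))
	fmt.Println(heartbeatCommand("Cancelled", ""))
}
```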
josh 62bddac110 feat(cancel): allow cancel from FailedHolding, reboot to local disk
CI / Lint + build + test (push) Successful in 1m38s
Release / release (push) Successful in 6m10s
A held run sits indefinitely at an SSH prompt waiting for operator
investigation. Previously the only exits were Override (re-enter the
failed stage) or leaving the host on forever — Cancel rejected any
terminal state, including FailedHolding, and there was no button in
the UI anyway.

Add a dedicated exit path:
  - statemachine: TriggerOperatorCancelled now accepts FailedHolding
    as a valid source, transitioning to Cancelled like any other
    live state.
  - CancelRun handler: treats FailedHolding as cancellable even
    though IsTerminal reports true.
  - heartbeat: Cancelled runs fork on FailedStage. Set means the
    agent is parked in waitForOverride with no subprocess in
    flight, so cmd=reboot tells it to systemctl reboot; the host
    falls through iPXE's no-active-run script to the local disk.
    Empty FailedStage keeps the pre-existing cmd=cancel_stage path
    for mid-stage cancels (kill stage ctx, then power off).
  - UI: canCancel now returns true for FailedHolding, and the
    run-detail page renders a distinct "Cancel & reboot" button
    with a hold-specific confirm message so the action doesn't
    look identical to a mid-run cancel.

Tests cover the new statemachine transition, the heartbeat fork
(reboot vs cancel_stage), and keep the pre-existing mid-run cancel
behaviour locked in.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 22:59:34 -04:00
josh 21014c1268 fix(inventory): read GPU model from device field, not vendor field
CI / Lint + build + test (push) Successful in 1m37s
Release / release (push) Successful in 11m43s
`lspci -D -mm -nn` prefixes every line with the PCI address as a bare
token before the three quoted class/vendor/device fields, so the
device name sits at fields[3] — not fields[2], which is the vendor.
The probe was indexing [2] and recording every GPU's model as its
vendor string ("Intel Corporation" instead of "Alder Lake-N [UHD
Graphics]"), which made SpecValidate report a mismatch on every real
host once the expected spec named the device.

Extract the per-line parse into parseLspciMMLine, handle both the
modern -D layout (addr + class/vendor/device) and the legacy
layout without an address prefix (class/vendor/device), and cover
both paths plus the non-GPU-class skip in inventory_test.go.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 22:53:42 -04:00
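Sketch for 21014c1268: a minimal per-line parser that accepts both layouts described above. The name parseLspciMMLine comes from the commit message, but the signature, the `token` type, and the tokenizer are assumptions, and backslash escapes inside quoted fields are deliberately ignored. The sample line in `main` is illustrative, not captured output.

```go
package main

import (
	"fmt"
	"strings"
)

// token is one field of an `lspci -mm` line and whether it was quoted.
type token struct {
	text   string
	quoted bool
}

// splitLspciMM splits a line on whitespace while keeping double-quoted
// fields intact. Escaped quotes are not handled in this sketch.
func splitLspciMM(line string) []token {
	var toks []token
	i := 0
	for i < len(line) {
		switch {
		case line[i] == ' ' || line[i] == '\t':
			i++
		case line[i] == '"':
			end := strings.IndexByte(line[i+1:], '"')
			if end < 0 { // unterminated quote: take the rest of the line
				return append(toks, token{line[i+1:], true})
			}
			toks = append(toks, token{line[i+1 : i+1+end], true})
			i += end + 2
		default:
			end := strings.IndexAny(line[i:], " \t")
			if end < 0 {
				end = len(line) - i
			}
			toks = append(toks, token{line[i : i+end], false})
			i += end
		}
	}
	return toks
}

// parseLspciMMLine returns class, vendor, and device. A leading bare
// (unquoted) token is treated as the PCI address and skipped, so the
// same code handles lines with and without the address prefix.
func parseLspciMMLine(line string) (class, vendor, device string, ok bool) {
	toks := splitLspciMM(line)
	if len(toks) > 0 && !toks[0].quoted {
		toks = toks[1:]
	}
	if len(toks) < 3 {
		return "", "", "", false
	}
	return toks[0].text, toks[1].text, toks[2].text, true
}

func main() {
	line := `0000:00:02.0 "VGA compatible controller [0300]" "Intel Corporation [8086]" "Alder Lake-N [UHD Graphics] [46d0]" -r00 "" ""`
	_, vendor, device, ok := parseLspciMMLine(line)
	fmt.Println(ok, vendor, "|", device)
}
```

Keying the address check on "was the first token quoted" avoids hard-coding field indexes, which is the class of bug this commit fixed.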
josh e73b221a8c fix(ui): fit pipeline timeline without horizontal scroll
CI / Lint + build + test (push) Successful in 1m39s
Release / release (push) Successful in 7m30s
15 nodes (3 pre-stage + 11 stage + Completed) exceeded the 1280px main
container's usable width, producing a horizontal scrollbar under the
pipeline on the run page. Widen main to 1440px, tighten per-node min
widths, drop the scrollbar, and split camelCase labels so multi-word
stages ("WaitingReboot", "SpecValidate", "CPUStress") wrap onto two
lines instead of forcing node width.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 22:51:10 -04:00
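Sketch for e73b221a8c: one possible way to split camelCase stage labels so they can wrap. It assumes a plain space is the break point (the real template might use a different separator) and `splitCamel` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitCamel inserts a space at lower-to-upper boundaries and before the
// last capital of an acronym that is followed by a lowercase letter.
func splitCamel(s string) string {
	r := []rune(s)
	var b strings.Builder
	for i, c := range r {
		if i > 0 && unicode.IsUpper(c) {
			prevLower := unicode.IsLower(r[i-1])
			nextLower := i+1 < len(r) && unicode.IsLower(r[i+1])
			if prevLower || (unicode.IsUpper(r[i-1]) && nextLower) {
				b.WriteByte(' ')
			}
		}
		b.WriteRune(c)
	}
	return b.String()
}

func main() {
	for _, s := range []string{"WaitingReboot", "SpecValidate", "CPUStress", "SMART"} {
		fmt.Println(splitCamel(s))
	}
}
```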
josh 3656af9823 feat(end-of-run): reboot to local disk instead of powering off
CI / Lint + build + test (push) Successful in 1m47s
Release / release (push) Successful in 10m8s
Completed runs now reboot the host and fall through iPXE to the next
boot device (local disk) instead of powering off. Three coordinated
changes:

- pxe/ipxe: NoActiveRunScript exits iPXE (drops to next boot entry)
  instead of `sleep 10; poweroff`. Without this, a Completed reboot
  just loops through PXE and gets told to poweroff.
- api/agent_handlers: heartbeat returns cmd=reboot (was cmd=shutdown)
  when the run reaches Completed.
- agent/runner: runs `systemctl reboot` (with `shutdown -r now`
  fallback) in response to cmd=reboot.

Operator cancel still powers off — powerOffAndReturn is unchanged
because a cancel means the operator wants the host idle so they can
walk up to it, not back in rotation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 22:45:11 -04:00
josh 8acef92a60 feat(inventory): deep hardware capture + per-probe substeps + verbose logs
CI / Lint + build + test (push) Successful in 1m35s
Release / release (push) Successful in 9m34s
Extend Inventory stage from a one-liner summary to a per-probe substep
emitter with ~20-30 narrative log lines per run.

- spec: per-DIMM memory (slot/size/speed/manufacturer/part_number),
  richer CPU (vendor/stepping/physical_cores/flags), disk
  model/transport/rotational, NIC driver/pci_addr, GPU vram/pci/driver,
  new System/Baseboard/PSU/OS top-level sections. All fields omitempty
  so existing expected-spec YAML and artifacts stay compatible.
- spec.Diff: new diffDIMMs/diffSystem/diffBaseboard/diffPSU/diffOS
  helpers; extended diffDisks/diffNICs/diffGPUs for new fields. GPU
  diff gains PCIAddr-pinned matching alongside count-by-model.
- agent/probes/inventory: CPU (/proc/cpuinfo extended), Memory
  (dmidecode -t 17 multi-block), Disks (+model/transport/rotational),
  NICs (+driver/pci from sysfs), GPUs (VRAM from lspci -vv),
  new System/Baseboard (dmidecode -t system/baseboard), PSU
  (dmidecode -t 39), OS (/proc/sys/kernel/osrelease + /etc/os-release).
  All probes accept a Logger and emit per-finding info/warn lines.
- agent/probes/firmware: parseDmidecodeAllSections for multi-block
  fixtures (memory / PSU).
- agent/runner: Inventory case becomes 9 substep rows (CPU / Memory /
  Disks / NICs / GPUs / System / Baseboard / PSU / OS) with per-probe
  start/complete timestamps.
- report: new Inventory HTML section between Stages and Firmware;
  resolveReporting loads the inventory.json artifact.
- agent/tests/fakes/dmidecode: dispatches on -t flag to serve bios /
  memory / system / baseboard / 39 fixtures for unit tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 22:21:17 -04:00
josh 481b67fb69 feat(firmware): install probe tools in live image + surface nic/hba gaps
CI / Lint + build + test (push) Successful in 1m42s
Release / release (push) Successful in 11m25s
mkosi.conf: add ipmitool, ethtool, nvme-cli so the Firmware stage
can actually read BMC revisions, NIC firmware versions, and fall
back to nvme-cli when sysfs firmware_rev is missing.

firmware.go: probeNICFirmware and probeHBAFirmware now return
(snapshots, warning) so a missing ethtool/lspci surfaces in the
stage log the same way probeBIOS/probeBMC already do. Before, a
host without ethtool silently reported "bios=1 nvme_fw=1
microcode=1" with no hint that nic coverage was dropped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 21:56:18 -04:00
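Sketch for 481b67fb69: the (snapshots, warning) return shape, shown with an injectable lookup in the spirit of the Deps.LookPath seam mentioned elsewhere in this log. The `Snapshot` type, the `read` callback, the warning text, and `errNotFound` are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Snapshot is a hypothetical firmware record.
type Snapshot struct{ Component, Version string }

var errNotFound = errors.New("executable not found")

// probeNICFirmware returns snapshots plus a warning string. A missing
// ethtool yields a warning the stage log can surface instead of a
// silent drop in coverage.
func probeNICFirmware(
	lookPath func(string) (string, error),
	read func(ethtoolPath string) []Snapshot,
) ([]Snapshot, string) {
	path, err := lookPath("ethtool")
	if err != nil {
		return nil, "ethtool not found; NIC firmware not probed"
	}
	return read(path), ""
}

func main() {
	missing := func(string) (string, error) { return "", errNotFound }
	snaps, warn := probeNICFirmware(missing, nil)
	fmt.Println(len(snaps), warn)
}
```

In production code the lookup would typically be `exec.LookPath` from `os/exec`; the fake keeps the example runnable on a machine without ethtool.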
josh c545028903 feat(run-page): tick the run-duration timer between SSE pushes
CI / Lint + build + test (push) Successful in 1m34s
Release / release (push) Has been cancelled
Adds a 1s client-side ticker that rewrites .run-duration text from a
data-started-at attribute, so the header timer on /runs/{id}
increments every second while the run is active. When an SSE swap
lands a fresh header the new server-rendered value seamlessly takes
over; when the run goes terminal the template drops the attribute
and the ticker silently skips the node, leaving the final elapsed in
place.

Other templ_*.go churn is cosmetic — regenerator versions differ
between CI and local and only the filename field in templ.Error
callsites changed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 21:53:40 -04:00
josh 05ceb8e042 ci(release): skip release workflow for non-bundle changes
CI / Lint + build + test (push) Successful in 1m41s
Release / release (push) Successful in 16m47s
Adds a paths-ignore filter to the push trigger so README tweaks,
*_test.go edits, other workflows, and fake-binary scaffolding no
longer spend 45 min debootstrapping + republishing an identical
bundle to the package registry. Adds workflow_dispatch as a manual
escape hatch for the cases where paths-ignore swallows something
that needs republishing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 21:08:26 -04:00
josh 988448664a fix(runs): stamp completed_at on cancel/terminal SetState transitions
CI / Lint + build + test (push) Successful in 1m35s
Release / release (push) Successful in 11m53s
CancelRun goes through Runner.Transition → Runs.SetState, which was a
bare UPDATE state=? with no completed_at write. The host-page
runDuration helper treats nil CompletedAt as "still running", so a
cancelled run kept ticking forever. MarkCompleted / MarkFailed /
MarkDispatchFailed already stamp completed_at; SetState now does the
same for any terminal target state, using COALESCE so we never
clobber an already-set timestamp.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 20:21:39 -04:00
josh bbe1b19819 store: fix FindActiveByMAC scanning profile column that wasn't selected
CI / Lint + build + test (push) Successful in 1m40s
Release / release (push) Has been cancelled
Sibling Run-scan sites (Get, LatestForHost, ListForHost, Active) were
updated to include COALESCE(profile,'quick') in the SELECT when the
Phase 1 migration added the column; FindActiveByMAC was missed, so
Scan got 14 destination args for a 13-column row. The symptom is
/ipxe/{mac} returning 500 and the host booting nothing, since that
handler is what returns the live-image script.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 20:14:16 -04:00
josh 75c29bb31a ci: pin upload-artifact to v3 for Gitea compatibility
CI / Lint + build + test (push) Successful in 1m56s
Release / release (push) Successful in 10m13s
Gitea's act_runner rejects @actions/artifact v2 (the engine behind
upload-artifact@v4). v3 is the last GHES-compatible major and still
supports the path: glob + retention-days we need.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 22:58:59 -04:00
josh 23c689aa5b deep profile + threshold gating + firmware stage + Burn super-stage
CI / Lint + build + test (push) Failing after 1m57s
Release / release (push) Has been cancelled
Ships all five phases of the deep-profile overhaul together. Runs now
carry a profile (quick/deep/soak); every profile walks the same
11-stage order — Inventory → Firmware → SpecValidate → SMART →
CPUStress → Storage → Network → Burn → GPU → PSU → Reporting —
with only per-stage durations and concurrency scaled.

Phase 1: profiles.ProfileRegistry loaded from vetting.yaml; runs.profile
column + CreateWithProfile; threshold table + evaluator seeded per-run
from the shared vetting.thresholds block; breach flips result at
/sensor + /result.

Phase 2: upgraded CPUStress (stress-ng --cpu-method=all --verify +
EDAC/MCE poll), Storage (fio --verify=md5 + SMART start/end delta),
Network (sustained iperf + /proc/net/dev deltas) with per-profile
knobs from Deps.

Phase 3: Burn super-stage with goroutine fan-out for CPU + memory +
fio + iperf, PSU rails sampled across the Burn window, SensorMux
(2 s flush, 500-sample cap) to absorb backpressure.

Phase 4: Firmware stage + firmware_snapshots table; probes dmidecode
(BIOS), ipmitool (BMC), ethtool -i (NIC), nvme (sysfs + id-ctrl),
lspci (HBA), /proc/cpuinfo (microcode). spec.DiffFirmware folds into
SpecValidate with pin-by-identifier and fan-out-across-component
matching; mismatches park the run in FailedHolding.

Phase 5: profile radio on the host start form, profile chip on the
run header, Firmware section in the HTML report, coverage artifact
uploaded from CI, agent/tests/fakes/ scaffold with Deps.LookPath
seam + stress_ng and dmidecode example fakes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 22:50:57 -04:00
josh fbb21cbafd ci: delete latest version, not the file, before re-uploading
Release / release (push) Waiting to run
CI / Lint + build + test (push) Successful in 1m42s
File-level DELETE leaves a ghost version directory that makes the
subsequent PUT 404 after a full 9-minute upload. Delete the whole
'latest' version, log the status code, and wait briefly before PUT.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 21:07:24 -04:00
josh 19608bef1b ui: split /hosts/{id} into host page + /runs/{runID} run page
CI / Lint + build + test (push) Successful in 1m35s
Release / release (push) Successful in 23m47s
Host page owns host metadata, full runs table with per-row stage strip,
in-flight banner, and empty-state CTA. Run page owns pipeline, active
step, logs, sub-steps, spec diffs, and hold banner with a breadcrumb
back to the host. Dashboard tile reverts to host-only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 20:37:57 -04:00
josh 5c6bfa5ffa ui: fix log lines rendering vertically when stage prefix is present
CI / Lint + build + test (push) Successful in 1m39s
Release / release (push) Has been cancelled
The .log-line grid was templated with 5 columns (anchor/ln/lvl/ts/text),
but renderLogSSE inserts an optional log-stage span, making 6 children.
The 6th child wrapped to row 2 column 1 (24px wide), which forced the
message text to break one character per line. Flexbox with min-width:0
on the text span scales cleanly with or without the stage element.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 19:20:51 -04:00
josh f79fe0f0db ui: GitHub-Actions-style detail page, sub-steps, mini-tile run-view
CI / Lint + build + test (push) Successful in 1m26s
Release / release (push) Successful in 6m47s
Reshapes the detail page into a run-view: hybrid horizontal pipeline
+ expanded active-step pane with sub-steps, a per-step log pane with
line-numbered permalinks and client-side search, and a runs-history
sidebar that navigates via ?run=N. Default step is server-picked
(running → failed → Reporting) so the operator lands on the thing
that's moving.

Adds a sub_steps table + SSE topic (substep-{run}-{stage}-{ordinal})
so per-disk and per-pass work (SMART, CPUStress CPU/RAM, Storage,
GPU) is visible in the UI instead of buried in stage summary JSON.
Agent emits sub-step reports from existing per-iteration loops.

Dashboard tiles become a mini run-view with a 9-dot step strip so
the operator reads run health across the whole grid at a glance.
Register page gets the same card shell + button styling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 19:00:11 -04:00
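Sketch for f79fe0f0db: the server-side default-step rule (running, then failed, then Reporting) and the sub-step topic format. Status strings, the `Step` type, and both function names are assumptions; only the priority order and the `substep-{run}-{stage}-{ordinal}` pattern come from the commit message.

```go
package main

import "fmt"

// Step is a hypothetical view-model row for one pipeline step.
type Step struct{ Name, Status string }

// defaultStep picks the first running step, else the first failed step,
// else falls back to Reporting.
func defaultStep(steps []Step) string {
	for _, want := range []string{"running", "failed"} {
		for _, s := range steps {
			if s.Status == want {
				return s.Name
			}
		}
	}
	return "Reporting"
}

// substepTopic builds the SSE topic name for one sub-step row.
func substepTopic(run int64, stage string, ordinal int) string {
	return fmt.Sprintf("substep-%d-%s-%d", run, stage, ordinal)
}

func main() {
	steps := []Step{{"Inventory", "passed"}, {"SMART", "failed"}, {"CPUStress", "running"}}
	fmt.Println(defaultStep(steps))
	fmt.Println(substepTopic(7, "SMART", 2))
}
```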
josh 5c00edd7b6 ui: fix htmx-ext-sse integrity hash (was silently blocked by browser)
CI / Lint + build + test (push) Successful in 1m20s
Release / release (push) Successful in 5m48s
Detail-page pipeline + log panes weren't updating without a manual
refresh. Root cause: the integrity attribute on htmx-ext-sse@2.2.2
in layout.templ was wrong, so the browser refused to execute the
script (SRI enforcement is silent — no user-visible error unless
you open devtools). htmx core loaded, boosted nav worked, forms
worked — but sse-connect/sse-swap were inert because the extension
never registered, so no EventSource was ever opened.

Replaced the claimed hash (Y4gc0CK6...) with the real one
(fw+eTlCc...) computed via
  curl -sL https://unpkg.com/htmx-ext-sse@2.2.2 |
  openssl dgst -sha384 -binary | openssl base64 -A

Added sse_e2e_test.go as a regression canary that mounts the real
chi router (RealIP + Recoverer + Logger middleware), opens
GET /events, publishes a tile-update via Runner, and asserts the
event lands on the wire. Server-side unit tests only verified
rendered HTML — this one covers the full publish→wire path, which
is what the next regression in this area will hit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 17:51:58 -04:00
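Sketch for 5c00edd7b6: the same SHA-384 plus base64 computation as the curl/openssl pipeline above, done in Go over bytes already in memory. `sriSHA384` is a hypothetical helper; it also prepends the `sha384-` prefix that an integrity attribute value carries, which the openssl pipeline does not print.

```go
package main

import (
	"crypto/sha512"
	"encoding/base64"
	"fmt"
)

// sriSHA384 returns a Subresource Integrity value for body:
// "sha384-" followed by the standard base64 of the SHA-384 digest.
func sriSHA384(body []byte) string {
	sum := sha512.Sum384(body)
	return "sha384-" + base64.StdEncoding.EncodeToString(sum[:])
}

func main() {
	// In practice body would be the downloaded script bytes.
	fmt.Println(sriSHA384([]byte("console.log('hi');")))
}
```

A check like this could run in CI against the pinned script to catch a wrong hash before the browser silently blocks it.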
josh 27098fc7ed cpustress+orchestrator: serial CPU/RAM passes + silent-skip guard
CI / Lint + build + test (push) Successful in 1m23s
Release / release (push) Successful in 6m2s
Orion's run (log 20:49 → 20:54) shipped GREEN while silently skipping
CPUStress. Two compounding bugs:

1. CPUStress ran --cpu N AND --vm N --vm-bytes 90% concurrently.
   On a 4-core 8 GiB N95, that's 360% RAM overcommit; the OOM-killer
   fired, usually on the agent itself. Replaced with two sequential
   passes — CPU (all methods, --verify) for 3 min, then RAM (--vm 1,
   --vm-bytes capped to MemAvailable − 1.5 GiB, floor 256 MiB, --verify)
   for 3 min. Each pass now also asserts elapsed ≥ target − 2s so a
   premature clean exit counts as failure instead of a silent pass.

2. On systemd-restart after the OOM, the agent hardcoded nextStage :=
   "Inventory" and re-ran it. The orchestrator's /result handler
   advances run state via TriggerStageCompleted against the *current*
   RunState, not against body.Stage — so an Inventory result posted
   while the run was in StateCPUStress silently advanced CPUStress →
   Storage and marked CPUStress passed without it ever running.

Two-layer defense for #2:
- agent-side: /claim response now carries current_state; agent resumes
  at the matching stage on a re-claim (happy path).
- server-side: new TriggerStageMismatch + StageNameForState helper
  backstop. If body.Stage doesn't match the run's current stage, /result
  parks the run in FailedHolding with failed_stage labeled
  "<got> (expected <expected>)" and returns 409.

Other stages audited for similar unbounded concurrency — none found;
only CPUStress was unsafe.

Tests:
- cpustress_test.go — parseMemAvailable parses real meminfo, errors on
  missing/malformed; cap calc hits floor on tiny boxes, uses 1.5 GiB
  headroom on normal/huge boxes.
- statemachine_test.go — TriggerStageMismatch lands at FailedHolding
  from every stage state and is rejected from pre-stage/terminal
  states; StageNameForState round-trips the stageStates map.
- agent_handlers_test.go — TestResult_RejectsMismatchedStage proves
  the Orion scenario now 409s + FailedHolding; TestResult_AcceptsMatchingStage
  proves the guard doesn't break the happy path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 17:29:13 -04:00
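Sketch for 27098fc7ed: the RAM-pass sizing rule (MemAvailable minus 1.5 GiB, floor 256 MiB). parseMemAvailable is named in the commit's test notes, but its signature here is an assumption, and `vmBytesCap` and the constants are hypothetical names.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

const (
	mib      = int64(1) << 20
	headroom = 1536 * mib // 1.5 GiB left for the agent and the OS
	floor    = 256 * mib  // never ask stress-ng for less than this
)

// parseMemAvailable extracts MemAvailable from /proc/meminfo text and
// returns it in bytes (the file reports the value in kB).
func parseMemAvailable(meminfo string) (int64, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) >= 2 && f[0] == "MemAvailable:" {
			kb, err := strconv.ParseInt(f[1], 10, 64)
			if err != nil {
				return 0, fmt.Errorf("malformed MemAvailable: %w", err)
			}
			return kb * 1024, nil
		}
	}
	return 0, errors.New("MemAvailable not found")
}

// vmBytesCap applies the headroom, then the floor.
func vmBytesCap(avail int64) int64 {
	c := avail - headroom
	if c < floor {
		return floor
	}
	return c
}

func main() {
	avail, err := parseMemAvailable("MemTotal: 8000000 kB\nMemAvailable: 4194304 kB\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(vmBytesCap(avail))
}
```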
josh cdd6cae3b0 ui: keep detail-page SSE swaps live after the first outerHTML replace
CI / Lint + build + test (push) Successful in 1m28s
Release / release (push) Successful in 6m29s
Pipeline fragment payload was a bare <div class=pipeline>, but the
sse-swap=pipeline-N wrapper lived only in the page shell. The first
outerHTML swap destroyed the wrapper, so every subsequent pipeline
event had nothing to target — forcing a manual refresh. RenderPipelineString
now emits the full <section id=pipeline-N sse-swap=... hx-swap=outerHTML>
wrapper, used from both the shell and the orchestrator publish path.

Also drop the red-bar styling from the empty DetailHold placeholder:
the wrapper's detail-hold class was painting an unconditional red band
between Pipeline and Actions whenever no hold was active.
2026-04-18 17:03:39 -04:00
josh e73e31af92 live-image: install stage tools and fail loudly if any are missing
CI / Lint + build + test (push) Successful in 1m32s
Release / release (push) Successful in 6m28s
The live image was still carrying the Phase 2 package list, so SMART,
CPUStress, and Network each hit a LookPath miss and returned
pass-with-skip. A run that skipped every real check still ended in
"completed" — nothing on the report said the image was broken.

Add smartmontools, stress-ng, fio, iperf3, lshw, lm-sensors,
e2fsprogs, and util-linux to mkosi.conf. Flip the three stages from
skip-pass to fail when their binary is missing so any future
packaging regression blocks the run instead of whispering past it.
Legitimate "no hardware" skips (no GPU, no hwmon, no disks,
non-destructive) are untouched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 16:39:28 -04:00
josh 0db790ae3e ui: stream host-detail fragments over SSE so the page updates live
CI / Lint + build + test (push) Successful in 1m29s
Release / release (push) Has been cancelled
The detail page was only partly live: Pipeline + LogTabs subscribed to
SSE, but the summary header, actions row, spec-diffs list and hold-key
block all froze at page-load and required a manual refresh to catch up
with state changes.

Extract each of those four regions into its own named templ component
with a stable id and sse-swap target, add Render*String helpers so the
orchestrator can publish pre-rendered fragments, and register a
HostDetailRenderer alongside the existing Tile/Pipeline renderers.
PublishHostDetail is folded into publishTileUpdate so every call site
that already refreshes a tile now also refreshes the detail page —
keeps the fan-out honest without scattering new publish calls.

The empty-state wrappers for spec-diffs and hold are load-bearing:
without the <section id=... sse-swap=...> present at initial GET, the
first live event after SpecValidate or Hold writes would have no DOM
node to swap into.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 16:36:13 -04:00
josh 5e9ad7f569 probes: sanitize disk serials and normalize GPU model for stable spec keys
CI / Lint + build + test (push) Successful in 1m25s
Release / release (push) Successful in 5m38s
Two related bugs were producing different map keys for identical
hardware depending on whether the inventory probe ran in the reporter
on the Proxmox host or in the live-image agent after PXE boot.

1. diskSerial read /sys/block/<dev>/device/{serial,vpd_pg80} and only
   TrimSpace'd the result. vpd_pg80 is a binary SCSI VPD page with a
   4-byte header, and some SSDs leak NUL/control bytes into the text
   serial file. Those bytes survive into the Go string, lowercase
   unchanged, and become a garbage map key that the reporter's cleaner
   read can't match. Sanitize to ASCII-printable range at ingest.

2. probeGPUs built the model slug from fields[2] + " " + fields[3] of
   `lspci -mm -nnk` output. fields[3] is subsystem vendor/device info,
   which varies between otherwise-identical cards and carries the
   `-rXX` revision marker — stable-enough for display but not for
   identity. Use fields[2] alone, strip the trailing `[NNNN]` PCI
   device-ID that lspci -nn appends, and sanitize for consistency.

After deploying the new orchestrator + re-running the configure step
on each registered host, SpecValidate will match cleanly. Disk diffs
self-resolve because the reporter already stored clean serials; GPU
diffs need one reporter re-run because the old expected slug still
carries subsystem noise.
2026-04-18 16:06:18 -04:00
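Sketch for 5e9ad7f569: the two normalizations described above, assuming "ASCII-printable" means bytes 0x20 through 0x7e and that the trailing PCI device-ID is four hex digits in square brackets. `sanitizeASCII` and `normalizeGPUModel` are hypothetical names, and the sample serial is made up.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sanitizeASCII drops every byte outside the printable ASCII range and
// trims surrounding whitespace, so NUL/control bytes cannot end up in a
// map key.
func sanitizeASCII(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		if c := s[i]; c >= 0x20 && c <= 0x7e {
			b.WriteByte(c)
		}
	}
	return strings.TrimSpace(b.String())
}

// trailingPCIID matches the "[NNNN]" device-ID that lspci -nn appends.
var trailingPCIID = regexp.MustCompile(`\s*\[[0-9a-fA-F]{4}\]$`)

// normalizeGPUModel strips the trailing device-ID from the device field
// and sanitizes the result.
func normalizeGPUModel(device string) string {
	return sanitizeASCII(trailingPCIID.ReplaceAllString(strings.TrimSpace(device), ""))
}

func main() {
	// A 4-byte binary header followed by a serial and a stray NUL.
	fmt.Println(sanitizeASCII("\x00\x80\x00\x0cS4EWNX0M1234\x00"))
	fmt.Println(normalizeGPUModel("Alder Lake-N [UHD Graphics] [46d0]"))
}
```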
josh d48cf146f4 live-image: mask systemd-firstboot at image-build time
CI / Lint + build + test (push) Successful in 1m24s
Release / release (push) Successful in 5m53s
Belt-and-braces for the kernel-cmdline systemd.firstboot=off fix.
mkosi ships /etc/machine-id empty, which triggers firstboot's
interactive locale/timezone/root-password prompt on every PXE boot;
with the agent running unattended there's nobody to answer and
sysinit.target blocks indefinitely.

Mask via a /dev/null symlink in /etc/systemd/system so the service
is unstartable regardless of cmdline — rules out the failure mode
where an older orchestrator binary serves an iPXE script without
the off-switch arg.
2026-04-18 15:41:46 -04:00
josh 026923075c pxe: disable systemd-firstboot so the live image doesn't prompt
CI / Lint + build + test (push) Successful in 1m22s
Release / release (push) Has been cancelled
systemd-firstboot.service is an interactive wizard that asks for
locale, timezone, and root password when /etc/machine-id isn't
populated — i.e. every PXE boot of a mkosi-built image. It sits on
sysinit.target waiting for input that will never arrive, blocking
the agent service and every other downstream unit indefinitely.

systemd.firstboot=off on the kernel cmdline is the documented kill
switch; no image-side changes needed.
2026-04-18 15:35:24 -04:00
josh 956120b80e deploy: show speed + ETA in bundle-download progress meter
CI / Lint + build + test (push) Successful in 1m24s
Release / release (push) Successful in 5m30s
Drop --progress-bar (curl's minimal hash meter) in favor of the default
progress output, which includes transfer rate and time remaining.
Bundles grew from ~30 MB to ~300 MB with the full-rootfs initrd, and
a percentage-only bar with no speed hint makes a slow registry look
indistinguishable from a hang.
2026-04-18 15:04:26 -04:00
josh c45349f62c pxe: mask serial-getty@ttyS0 so hosts without serial don't wait 90s
CI / Lint + build + test (push) Successful in 1m47s
Release / release (push) Successful in 5m16s
systemd-getty-generator reads console=ttyS0 off the kernel cmdline and
auto-creates serial-getty@ttyS0.service, which BindsTo dev-ttyS0.device.
On hardware without a physical serial port the device node never shows
up, systemd waits its full default 90s timeout, and only then proceeds.

systemd.mask= on the kernel cmdline is a first-class option — masks
the unit before the generator's link even gets activated. Kernel
messages still go to ttyS0 if a port is present; we just don't try
to spawn a login prompt there.
2026-04-18 14:47:03 -04:00
josh a88e24bef4 live-image: real /init + verbose boot for first-boot diagnosis
CI / Lint + build + test (push) Successful in 1m23s
Release / release (push) Successful in 4m49s
Host boots past kernel init and then stalls silently. ACPI DSDT error
about TXHC.RHUB.SS01 is benign noise (Tiger Lake firmware bug) — the
actual problem is that nothing between kernel handoff and (maybe)
systemd is visible on the console.

Two changes:

1. Replace the /init → sbin/init symlink with a real shell script
   (live-image/mkosi.extra/init) that mounts /proc /sys /dev /dev/pts
   /dev/shm /run before execing systemd. Systemd has fallback mount
   code for these, but when it fails the failure is silent. Doing it
   explicitly in /init keeps failures visible and avoids the fragile
   symlink-resolution trick.

2. Drop 'quiet' from the kernel cmdline and add loglevel=7 plus
   systemd.log_target=kmsg + journald.forward_to_console=1 so every
   early-boot message reaches both tty0 and ttyS0. Will be dialed
   back once boot is stable.

Also: .gitattributes pins LF on live-image/, .gitea/, Makefile, and
*.sh so Windows checkouts don't break shell scripts and Makefile
recipes with CRLF. /init also gets chmod 0755 in repack-initrd as a
belt-and-braces against mode loss on non-Linux checkouts.
2026-04-18 14:31:40 -04:00
josh 43ea845ac0 live-image: pack full rootfs as initrd so PXE actually boots userspace
CI / Lint + build + test (push) Successful in 1m54s
Release / release (push) Successful in 5m10s
update-initramfs produces a boot stub (~50 MB) that expects to mount a
separate rootfs over squashfs/disk/NFS. Our PXE channel only ships
vmlinuz+initrd.img, so the stub had nothing to pivot to — kernel
finished hand-off and the system wedged with firmware, modules, and
userspace stranded in the 545 MB rootfs dir we never delivered.

Replace with an everything-in-initramfs build: cpio.zst the full
rootfs (minus /boot) as the initrd, add /init -> sbin/init for the
kernel's runtime entrypoint, materialize the kernel symlink into a
real file. Bump check-initrd floor to 200 MB and switch the firmware
grep from unmkinitramfs (boot-stub-specific) to zstd | cpio -t.

Also add cpio to the CI apt deps.
2026-04-18 14:14:08 -04:00
josh 6c6d20710f live-image: fix check-initrd size measurement; add zstd to image
CI / Lint + build + test (push) Successful in 1m28s
Release / release (push) Failing after 4m10s
Previous run actually built the 518 MB rootfs with firmware-misc-nonfree
et al. installed — the real payload is working. Two follow-ups:

- check-initrd was reading stat on a symlink path and getting 30 bytes
  (the symlink's own size), not the 6.1.0-44-amd64 kernel initrd it
  points to. Switched to wc -c, which follows symlinks, and to du -hL
  for the OK message.
- Add zstd to Packages= so COMPRESS=zstd in initramfs.conf can be
  honored; without it update-initramfs falls back to gzip with a
  "No zstd in PATH" warning.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 14:00:07 -04:00
josh 0a5e5d0b39 ci: add bubblewrap dep and bump mkosi to v25.3
CI / Lint + build + test (push) Successful in 1m31s
Release / release (push) Failing after 3m47s
v24.3 crashed in cp_version() during the copy-package-manager-trees
step because its sandbox needs bubblewrap (not present in the runner
apt list), and cp --version returned empty output inside the broken
sandbox. Install bubblewrap and bump to v25.3, which has tighter
sandbox fallback behavior.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 13:53:09 -04:00
josh 488a0d1052 ci: install mkosi from upstream git tag, not PyPI
Release / release (push) Failing after 1m54s
CI / Lint + build + test (push) Has been cancelled
Previous commit pinned mkosi==24.3 via pip but mkosi isn't published
on PyPI past ancient versions — the runner hit
"Could not find a version that satisfies the requirement mkosi==24.3".
Install from the upstream git tag v24.3 instead; added git to the apt
dep list for pip's VCS fetch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 13:44:51 -04:00
josh 28918bad15 live-image: fix firmware so i915 actually loads at boot
CI / Lint + build + test (push) Successful in 1m35s
Release / release (push) Failing after 22s
Previous attempt (c962d6d) added firmware-linux-nonfree to mkosi.conf,
but the CI bundle was still 63 MB and Tiger Lake wedged on tgl_guc.
Two reasons: (1) firmware-linux-nonfree on bookworm is a thin
metapackage that doesn't include firmware-misc-nonfree, which is where
i915 GuC/HuC blobs actually live; (2) Ubuntu's apt-packaged mkosi is
old enough that Repositories=non-free-firmware shorthand likely isn't
wired through to the debootstrap invocation, so firmware packages
silently miss the bootstrap step entirely.

Changes:
- Enumerate firmware packages explicitly in mkosi.conf (firmware-
  misc-nonfree, firmware-iwlwifi, firmware-realtek, firmware-amd-
  graphics, firmware-intel-sound, intel/amd64-microcode).
- Ship mkosi.sources.d/debian.sources with explicit deb822 so the
  non-free-firmware component is unambiguously available.
- Install mkosi 24.3 via pip in CI instead of apt's older build.
- Pin MODULES=most and COMPRESS=zstd via a tracked initramfs-tools
  config under mkosi.extra/.
- Narrow .gitignore so only the generated agent binary is ignored,
  not the whole mkosi.extra/ tree.
- New check-initrd Makefile target asserts both size (>=150 MB) and
  actual presence of i915/tgl_guc_*.bin inside the built initrd, so
  a silent firmware-drop regression fails the build loudly.
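
The two check-initrd assertions could be sketched in Go as follows. This is a hypothetical rendering, not the Makefile target itself: it assumes the caller already has the initrd size and a listing of archive paths, that the 150 MB floor means MiB, and the firmware path in main is invented for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// minInitrdSize is the size floor named in the commit, taken here as MiB
// (an assumption; the commit only says ">=150 MB").
const minInitrdSize int64 = 150 << 20

// checkInitrd fails when the initrd is smaller than minSize or when no
// i915/tgl_guc_*.bin entry appears in the archive listing.
func checkInitrd(size, minSize int64, listing []string) error {
	if size < minSize {
		return fmt.Errorf("initrd is %d bytes, want at least %d", size, minSize)
	}
	for _, p := range listing {
		if strings.Contains(p, "i915/tgl_guc_") && strings.HasSuffix(p, ".bin") {
			return nil
		}
	}
	return fmt.Errorf("no i915/tgl_guc_*.bin entry in initrd listing")
}

func main() {
	// Hypothetical listing; a real one would come from the built initrd.
	listing := []string{"usr/bin/sh", "usr/lib/firmware/i915/tgl_guc_70.bin"}
	fmt.Println(checkInitrd(200<<20, minInitrdSize, listing))
	fmt.Println(checkInitrd(63<<20, minInitrdSize, listing))
}
```

Checking size and content separately means a bundle that is large for the wrong reasons still fails on the missing blob.
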

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 13:38:40 -04:00
josh c962d6d8ab live-image: bundle nonfree firmware (i915 GuC et al.)
CI / Lint + build + test (push) Successful in 2m19s
Release / release (push) Successful in 3m28s
Tiger Lake and later Intel iGPUs need i915/tgl_guc_*.bin; without
it the i915 init wedges and floods the console. Same story on most
modern wifi/NIC hardware. Pull firmware-linux-nonfree (metapackage
covering misc-nonfree, iwlwifi, realtek, amd-graphics, …) from the
bookworm non-free-firmware repo: a single-line fix at ~500 MB cost to
the squashfs, worth it for booting arbitrary repaired hosts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 13:14:19 -04:00
josh 4524ab8dc0 runs: add non-destructive flag + operator Cancel button
CI / Lint + build + test (push) Successful in 2m5s
Release / release (push) Successful in 3m5s
Non-destructive pre-declares "don't touch the disks" on Start: the
Storage stage skips wipe-probe, badblocks -w, and write-mode fio,
and reports a read-only summary. Runs gain a new non_destructive column,
threaded through Claim → agent tests.Deps → Storage stage.

Cancel halts an in-flight run. The orchestrator transitions to a
new StateCancelled via TriggerOperatorCancelled (valid from any
active state); the agent's next heartbeat returns cmd=cancel_stage,
which fires a stored CancelFunc on the per-stage context. Stage
subprocesses spawned with exec.CommandContext die with the context,
the agent posts a cancelled outcome, then powers the host off.
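
A minimal sketch of the agent-side cancel path described above, under stated assumptions: stageRunner, Run, HandleHeartbeat, and cancelDemo are hypothetical names, and the stage body stands in for a subprocess that a real stage would start with exec.CommandContext.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// stageRunner remembers the CancelFunc of the stage currently running.
type stageRunner struct {
	mu     sync.Mutex
	cancel context.CancelFunc
}

// Run executes one stage under its own cancellable context and clears the
// stored CancelFunc when the stage returns.
func (r *stageRunner) Run(parent context.Context, stage func(context.Context) error) error {
	ctx, cancel := context.WithCancel(parent)
	r.mu.Lock()
	r.cancel = cancel
	r.mu.Unlock()
	defer func() {
		r.mu.Lock()
		r.cancel = nil
		r.mu.Unlock()
		cancel()
	}()
	return stage(ctx)
}

// HandleHeartbeat fires the stored CancelFunc when the heartbeat response
// carries cmd=cancel_stage; any other command is ignored here.
func (r *stageRunner) HandleHeartbeat(cmd string) {
	if cmd != "cancel_stage" {
		return
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.cancel != nil {
		r.cancel()
	}
}

// cancelDemo runs a stage that blocks until cancelled, delivers a cancel
// heartbeat, and reports whether the stage ended with context.Canceled.
func cancelDemo() bool {
	r := &stageRunner{}
	started := make(chan struct{})
	done := make(chan error, 1)
	go func() {
		done <- r.Run(context.Background(), func(ctx context.Context) error {
			close(started)
			// A real stage would wait on exec.CommandContext(ctx, ...) here.
			<-ctx.Done()
			return ctx.Err()
		})
	}()
	<-started
	r.HandleHeartbeat("cancel_stage")
	return errors.Is(<-done, context.Canceled)
}

func main() {
	fmt.Println("stage cancelled:", cancelDemo())
}
```

Keeping the CancelFunc per stage rather than per run means a cancel only tears down the work in flight.
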

Destructive stages mid-run may leave the host in an intermediate
state — the UI confirm dialog warns the operator; recovery is
manual for now.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 13:01:42 -04:00
josh 2c440fce8a pxe: move dhcp-host allowlist into a SIGHUP-reloadable file
CI / Lint + build + test (push) Successful in 1m38s
Release / release (push) Successful in 2m25s
dnsmasq's SIGHUP re-reads /etc/ethers and any --dhcp-hostsfile= paths,
but NOT dhcp-host= lines from the main conf. Reload() was faithfully
rewriting dnsmasq.conf with the new MAC, sending SIGHUP, and then
dnsmasq kept serving its startup view — so a freshly-registered host
still showed up as "proxy-ignored, tags: eth0" with no "known" tag.

Split the allowlist into ${RuntimeDir}/dhcp-hosts, referenced from the
main conf via dhcp-hostsfile=. writeConf() is static-ish now; Reload
just rewrites the hosts file and SIGHUPs.
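
The split can be sketched as below. The names renderHosts, writeHostsFile, and sighup are hypothetical, and the line format (MAC followed by set:known) is an assumption based on the "known" tag mentioned above, not a copy of the real template.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
	"syscall"
)

// renderHosts builds the allowlist body: one line per MAC, lower-cased,
// sorted, and tagged "known" (assumed format).
func renderHosts(macs []string) string {
	lower := make([]string, 0, len(macs))
	for _, m := range macs {
		lower = append(lower, strings.ToLower(m))
	}
	sort.Strings(lower)
	var b strings.Builder
	for _, m := range lower {
		fmt.Fprintf(&b, "%s,set:known\n", m)
	}
	return b.String()
}

// writeHostsFile replaces <dir>/dhcp-hosts via a temp file and rename so
// dnsmasq never reads a half-written allowlist.
func writeHostsFile(dir string, macs []string) (string, error) {
	path := filepath.Join(dir, "dhcp-hosts")
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, []byte(renderHosts(macs)), 0o644); err != nil {
		return "", err
	}
	if err := os.Rename(tmp, path); err != nil {
		return "", err
	}
	return path, nil
}

// sighup asks a running dnsmasq to re-read its dhcp-hostsfile paths.
// It is not called in main because the demo has no dnsmasq process.
func sighup(p *os.Process) error {
	return p.Signal(syscall.SIGHUP)
}

func main() {
	dir, err := os.MkdirTemp("", "pxe-demo")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer os.RemoveAll(dir)
	path, err := writeHostsFile(dir, []string{"BB:00:00:00:00:02", "aa:00:00:00:00:01"})
	if err != nil {
		fmt.Println(err)
		return
	}
	body, _ := os.ReadFile(path)
	fmt.Print(string(body))
}
```
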

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 12:41:27 -04:00
josh bce6e08524 pxe: reload dnsmasq on host create/delete
CI / Lint + build + test (push) Successful in 1m54s
Release / release (push) Successful in 2m36s
pxe.Supervisor.Reload() was defined but never wired up. After a host
was registered in the UI or via the quick-register JSON endpoint, the
dnsmasq conf still held only the hosts that existed at orchestrator
startup. The new MAC wasn't tagged `known`, so when the host PXE'd,
dnsmasq logged "PXE(eth0) <mac> proxy-ignored" and the boot timed out
back to the BIOS.

Add an optional PXEReloader interface to api.UI, wire it from main
when pxe is enabled, and call u.reloadPXE() after successful Create
and Delete. Logs-and-continues on failure — host registration itself
has already committed.
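
A sketch of the optional-interface wiring, with assumptions called out: the PXEReloader method set (a single Reload() error), the UI field name, and createHost are illustrative, and fakeReloader exists only for the demo.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// PXEReloader is the optional hook; this one-method shape is an assumption.
type PXEReloader interface {
	Reload() error
}

// UI holds the reloader; the field stays nil when PXE is disabled.
type UI struct {
	pxe PXEReloader
}

// reloadPXE logs and continues on failure, because the host change has
// already been committed by the time it runs.
func (u *UI) reloadPXE() {
	if u.pxe == nil {
		return
	}
	if err := u.pxe.Reload(); err != nil {
		log.Printf("pxe reload failed (host change already committed): %v", err)
	}
}

// createHost stands in for the real Create handler (hypothetical).
func (u *UI) createHost(mac string) error {
	// ... persist the host record here ...
	u.reloadPXE()
	return nil
}

// fakeReloader counts calls and can be told to fail.
type fakeReloader struct {
	calls int
	fail  bool
}

func (f *fakeReloader) Reload() error {
	f.calls++
	if f.fail {
		return errors.New("dnsmasq not running")
	}
	return nil
}

func main() {
	f := &fakeReloader{fail: true}
	u := &UI{pxe: f}
	fmt.Println("create error:", u.createHost("aa:00:00:00:00:01"), "reload calls:", f.calls)
}
```

Leaving the field nil when PXE is disabled keeps the handlers free of feature-flag checks.
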

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 12:31:00 -04:00
josh 157b70f536 pxe: split subnet into network+netmask for dnsmasq proxy-DHCP
CI / Lint + build + test (push) Successful in 2m0s
Release / release (push) Successful in 3m35s
dnsmasq's proxy-DHCP syntax is `dhcp-range=<network-ip>,proxy[,<mask>]`,
not a CIDR. Passing "192.168.1.0/24,proxy" made dnsmasq refuse to start
with "bad dhcp-range at line 12". Parse the CIDR once in writeConf()
and render Network + Netmask as separate template fields.

The config surface (pxe.subnet) stays CIDR because that's the right
shape for humans; we just unpack it before handing to dnsmasq.
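
The unpacking step could look like the sketch below. splitCIDR and dhcpRange are hypothetical names, and the rendered line follows the dhcp-range syntax quoted above rather than the project's actual template.

```go
package main

import (
	"fmt"
	"net"
)

// splitCIDR parses an IPv4 CIDR and returns the network address and the
// dotted-quad netmask as separate strings.
func splitCIDR(cidr string) (network, netmask string, err error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return "", "", err
	}
	if ipnet.IP.To4() == nil {
		return "", "", fmt.Errorf("not an IPv4 subnet: %s", cidr)
	}
	return ipnet.IP.String(), net.IP(ipnet.Mask).String(), nil
}

// dhcpRange renders the proxy-DHCP line from a CIDR-shaped config value.
func dhcpRange(cidr string) (string, error) {
	network, netmask, err := splitCIDR(cidr)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("dhcp-range=%s,proxy,%s", network, netmask), nil
}

func main() {
	line, err := dhcpRange("192.168.1.0/24")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(line) // dhcp-range=192.168.1.0,proxy,255.255.255.0
}
```
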

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 12:17:10 -04:00
josh cf3a75591c install: stage pxe-setup.sh at /usr/local/sbin/vetting-pxe-setup
CI / Lint + build + test (push) Successful in 1m36s
Release / release (push) Successful in 2m29s
proxmox-install.sh tarball-extracts into a tempdir that gets wiped on
EXIT, so after the one-liner there's no pxe-setup.sh on disk for the
operator to run. Have install.sh drop the script + ipxe-shas.txt into
/usr/local/share/vetting/ and symlink it as
/usr/local/sbin/vetting-pxe-setup (in PATH).

pxe-setup.sh now readlink -f's BASH_SOURCE so the symlink resolves to
the share dir where ipxe-shas.txt lives, and gracefully handles the
case where install.sh already staged vmlinuz + initrd.img into
LIVE_DIR (no bundle live-image/ needed at that point).

Update the trailing hint in proxmox-install.sh and the operations
runbook to surface the new `sudo vetting-pxe-setup ...` command.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 12:10:23 -04:00
josh bcbbc35489 docs+e2e: document proxy-DHCP topology; default e2e bridge to LAN
CI / Lint + build + test (push) Successful in 1m37s
Release / release (push) Has been cancelled
Rewrites the PXE section of the ops runbook around the new proxy-DHCP
model (no dedicated bridge, coexists with UniFi/pfSense/etc.) and
swaps the e2e test's default bridge + orchestrator URL to match. The
e2e file now calls out the LAN-DHCP precondition in its header so
future-me (or CI) doesn't hang at PXE wondering why nothing answers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 12:07:05 -04:00