Vetting

Author	SHA1	Message	Date
josh	17ec55cb85	chore: cleanup sprint — dead CSS, dedup helpers, handler refactor CI / Lint + build + test (push) Successful in 1m34s Details Release / detect (push) Successful in 4s Details Release / build-live-image (push) Has been skipped Details Release / bundle (push) Successful in 1m5s Details Remove ~126 lines of orphaned CSS from tile slim-down and old detail layout. Consolidate 4 duplicate duration formatters into shared elapsed()/fmtElapsed() helpers. Break 160-line Result handler into focused sub-functions. Implement real Hub.Shutdown() (was a no-op). Standardize agent error responses to JSON. Replace panic() in router init with error return. Extract magic numbers as named constants. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-21 20:39:38 -04:00
josh	c11573eeeb	feat(ui): slim dashboard tile to hostname + online/offline only CI / Lint + build + test (push) Successful in 1m33s Details Release / detect (push) Successful in 5s Details Release / build-live-image (push) Has been skipped Details Release / bundle (push) Successful in 53s Details Run status, Start/Cancel/View controls, and non-destructive toggle all live on /hosts/{id} — duplicating them on the dashboard tile clogged the grid and wouldn't scale past a handful of hosts. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-20 22:56:05 -04:00
josh	6d50f3a804	feat(install): polish install UX with banner, spinner, progress bar, summary CI / Lint + build + test (push) Successful in 1m38s Details Release / detect (push) Successful in 7s Details Release / build-live-image (push) Has been skipped Details Release / bundle (push) Successful in 55s Details Wrap the three install scripts in a shared inline style block (TTY/UTF-8/NO_COLOR-aware) so the one-liner install looks and feels intentional: banner on start, timed step lines, braille spinner over silent apt/systemctl calls with failure log dumps, single-line curl progress bars with size-prefixed headers, and a summary box at the end with live-image version + service state + next steps. install.sh defers banner/summary to proxmox-install.sh when VETTING_INSTALL_WRAPPED is set so the two scripts compose without duplication. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-20 22:29:44 -04:00
josh	48f992a451	bump live-image CI / Lint + build + test (push) Successful in 1m35s Details Release / detect (push) Successful in 6s Details Release / build-live-image (push) Successful in 7m40s Details Release / bundle (push) Successful in 50s Details	2026-04-20 21:31:09 -04:00
josh	98cdd95b50	chore(release): add registry auth diagnostic to build-live-image CI / Lint + build + test (push) Successful in 1m38s Details Release / detect (push) Successful in 5s Details Release / build-live-image (push) Has been skipped Details Release / bundle (push) Failing after 53s Details Echoes OWNER, token length, and whoami before the upload so a 401 disambiguates: missing/empty token, bad OWNER resolution, or token authenticating as a different user. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-20 21:27:23 -04:00
josh	211abdf08f	feat(release): version live-image, skip rebuild+redownload when unchanged CI / Lint + build + test (push) Successful in 1m41s Details Release / detect (push) Successful in 7s Details Release / build-live-image (push) Failing after 3m58s Details Release / bundle (push) Has been skipped Details Splits the release workflow into three jobs (detect, build-live-image, bundle) so the ~9 min mkosi build only runs when live-image/VERSION bumps. The slim bundle (~30 MB: orchestrator + agent + deploy scripts + a live-image/VERSION pointer) rebuilds every push; the ~300 MB vmlinuz+initrd.img are published separately under the immutable live-image/<version>/ path. install.sh compares the pointer to /var/lib/vetting/live/VERSION and fetches the files only on mismatch, cutting repeat-install wall-clock from ~30 s + 300 MB to ~10 s + 0 MB on the common no-live-image-change release. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-20 21:04:14 -04:00
josh	4c153bb115	chore(templ): regenerate host_tile_templ.go for cancelledFromHold Catches the generated file up to the .templ source committed in `599fd15`. No behavior change — the generator just hadn't been re-run. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-20 21:04:01 -04:00
josh	a01db63952	feat(install): auto-heal pxe.interface/pxe.subnet against the host CI / Lint + build + test (push) Successful in 1m42s Details Release / release (push) Successful in 19m30s Details A stale /etc/vetting/vetting.yaml (e.g. pxe.interface=eth1 after an LXC rebuild renamed the NIC to eth0) blocks vetting.service startup with "pxe.interface 'eth1' not found on host", requiring the operator to ssh in and hand-edit the yaml after every rebuild. install.sh now validates the pxe block against the host's actual network state on every install/upgrade run. If pxe.enabled is true and pxe.interface doesn't exist (or pxe.subnet is missing/malformed), the script auto-detects the primary NIC via the default route, reads its subnet from the kernel-scope route, and patches both values in place. Valid configs are left exactly as the operator had them; fresh installs with pxe.enabled=false skip the check entirely. The one-liner install/update is now self-healing for the most common stale-config failure mode.	2026-04-20 19:56:39 -04:00
josh	599fd156d0	feat(ui): distinguish cancel-from-hold as "Failed (cancelled)" CI / Lint + build + test (push) Successful in 1m33s Details Release / release (push) Successful in 12m41s Details Before, a run that failed, held for operator review, and was then cancelled showed up on the tile and run header as plain "Cancelled" with an idle-grey mood — indistinguishable from a mid-stage cancel of a healthy run. That hides the actual failure from the dashboard. Now: when State=Cancelled with FailedStage still set (the hold-cancel signature the heartbeat handler already uses to pick reboot vs cancel_stage), the badge reads "Failed (cancelled)" with a fail-colored mood. Mid-stage cancels keep reading as plain "Cancelled".	2026-04-20 18:54:04 -04:00
josh	73f727b4c1	fix(agent): keep heartbeat loop alive during FailedHolding CI / Lint + build + test (push) Successful in 1m51s Details Release / release (push) Failing after 4m28s Details The heartbeat handler was returning cmd=abort for FailedHolding, which caused the agent's heartbeat goroutine to exit after ~10s in hold. Subsequent state changes (Cancel -> reboot, Override -> retry_stage) then had no recipient, so the host sat idle at the SSH hold prompt forever. Narrowed cmd=abort to StateReleased only; FailedHolding falls through to cmd=continue so the loop keeps polling and can receive the operator's eventual command.	2026-04-20 18:28:43 -04:00
josh	62bddac110	feat(cancel): allow cancel from FailedHolding, reboot to local disk CI / Lint + build + test (push) Successful in 1m38s Details Release / release (push) Successful in 6m10s Details A held run sits indefinitely at an SSH prompt waiting for operator investigation. Previously the only exits were Override (re-enter the failed stage) or leaving the host on forever — Cancel rejected any terminal state, including FailedHolding, and there was no button in the UI anyway. Add a dedicated exit path: - statemachine: TriggerOperatorCancelled now accepts FailedHolding as a valid source, transitioning to Cancelled like any other live state. - CancelRun handler: treats FailedHolding as cancellable even though IsTerminal reports true. - heartbeat: Cancelled runs fork on FailedStage. Set means the agent is parked in waitForOverride with no subprocess in flight, so cmd=reboot tells it to systemctl reboot; the host falls through iPXE's no-active-run script to the local disk. Empty FailedStage keeps the pre-existing cmd=cancel_stage path for mid-stage cancels (kill stage ctx, then power off). - UI: canCancel now returns true for FailedHolding, and the run-detail page renders a distinct "Cancel & reboot" button with a hold-specific confirm message so the action doesn't look identical to a mid-run cancel. Tests cover the new statemachine transition, the heartbeat fork (reboot vs cancel_stage), and keep the pre-existing mid-run cancel behaviour locked in. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-19 22:59:34 -04:00
josh	21014c1268	fix(inventory): read GPU model from device field, not vendor field CI / Lint + build + test (push) Successful in 1m37s Details Release / release (push) Successful in 11m43s Details `lspci -D -mm -nn` prefixes every line with the PCI address as a bare token before the three quoted class/vendor/device fields, so the device name sits at fields[3] — not fields[2], which is the vendor. The probe was indexing [2] and recording every GPU's model as its vendor string ("Intel Corporation" instead of "Alder Lake-N [UHD Graphics]"), which made every SpecValidate mismatch on real hosts once the expected spec named the device. Extract the per-line parse into parseLspciMMLine, handle both the modern -D layout (addr + class/vendor/device) and the legacy layout without an address prefix (class/vendor/device), and cover both paths plus the non-GPU-class skip in inventory_test.go. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-19 22:53:42 -04:00
josh	e73b221a8c	fix(ui): fit pipeline timeline without horizontal scroll CI / Lint + build + test (push) Successful in 1m39s Details Release / release (push) Successful in 7m30s Details 15 nodes (3 pre-stage + 11 stage + Completed) exceeded the 1280px main container's usable width, producing a horizontal scrollbar under the pipeline on the run page. Widen main to 1440px, tighten per-node min widths, drop the scrollbar, and split camelCase labels so multi-word stages ("WaitingReboot", "SpecValidate", "CPUStress") wrap onto two lines instead of forcing node width. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-19 22:51:10 -04:00
josh	3656af9823	feat(end-of-run): reboot to local disk instead of powering off CI / Lint + build + test (push) Successful in 1m47s Details Release / release (push) Successful in 10m8s Details Completed runs now reboot the host and fall through iPXE to the next boot device (local disk) instead of powering off. Three coordinated changes: - pxe/ipxe: NoActiveRunScript exits iPXE (drops to next boot entry) instead of `sleep 10; poweroff`. Without this, a Completed reboot just loops through PXE and gets told to poweroff. - api/agent_handlers: heartbeat returns cmd=reboot (was cmd=shutdown) when the run reaches Completed. - agent/runner: runs `systemctl reboot` (with `shutdown -r now` fallback) in response to cmd=reboot. Operator cancel still powers off — powerOffAndReturn is unchanged because a cancel means the operator wants the host idle so they can walk up to it, not back in rotation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-19 22:45:11 -04:00
josh	8acef92a60	feat(inventory): deep hardware capture + per-probe substeps + verbose logs CI / Lint + build + test (push) Successful in 1m35s Details Release / release (push) Successful in 9m34s Details Extend Inventory stage from a one-liner summary to a per-probe substep emitter with ~20-30 narrative log lines per run. - spec: per-DIMM memory (slot/size/speed/manufacturer/part_number), richer CPU (vendor/stepping/physical_cores/flags), disk model/transport/rotational, NIC driver/pci_addr, GPU vram/pci/driver, new System/Baseboard/PSU/OS top-level sections. All fields omitempty so existing expected-spec YAML and artifacts stay compatible. - spec.Diff: new diffDIMMs/diffSystem/diffBaseboard/diffPSU/diffOS helpers; extended diffDisks/diffNICs/diffGPUs for new fields. GPU diff gains PCIAddr-pinned matching alongside count-by-model. - agent/probes/inventory: CPU (/proc/cpuinfo extended), Memory (dmidecode -t 17 multi-block), Disks (+model/transport/rotational), NICs (+driver/pci from sysfs), GPUs (VRAM from lspci -vv), new System/Baseboard (dmidecode -t system/baseboard), PSU (dmidecode -t 39), OS (/proc/sys/kernel/osrelease + /etc/os-release). All probes accept a Logger and emit per-finding info/warn lines. - agent/probes/firmware: parseDmidecodeAllSections for multi-block fixtures (memory / PSU). - agent/runner: Inventory case becomes 9 substep rows (CPU / Memory / Disks / NICs / GPUs / System / Baseboard / PSU / OS) with per-probe start/complete timestamps. - report: new Inventory HTML section between Stages and Firmware; resolveReporting loads the inventory.json artifact. - agent/tests/fakes/dmidecode: dispatches on -t flag to serve bios / memory / system / baseboard / 39 fixtures for unit tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-19 22:21:17 -04:00
josh	481b67fb69	feat(firmware): install probe tools in live image + surface nic/hba gaps CI / Lint + build + test (push) Successful in 1m42s Details Release / release (push) Successful in 11m25s Details mkosi.conf: add ipmitool, ethtool, nvme-cli so the Firmware stage can actually read BMC revisions, NIC firmware versions, and fall back to nvme-cli when sysfs firmware_rev is missing. firmware.go: probeNICFirmware and probeHBAFirmware now return (snapshots, warning) so a missing ethtool/lspci surfaces in the stage log the same way probeBIOS/probeBMC already do. Before, a host without ethtool silently reported "bios=1 nvme_fw=1 microcode=1" with no hint that nic coverage was dropped. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-19 21:56:18 -04:00
josh	c545028903	feat(run-page): tick the run-duration timer between SSE pushes CI / Lint + build + test (push) Successful in 1m34s Details Release / release (push) Has been cancelled Details Adds a 1s client-side ticker that rewrites .run-duration text from a data-started-at attribute, so the header timer on /runs/{id} increments every second while the run is active. When an SSE swap lands a fresh header the new server-rendered value seamlessly takes over; when the run goes terminal the template drops the attribute and the ticker silently skips the node, leaving the final elapsed in place. Other templ_*.go churn is cosmetic — regenerator versions differ between CI and local and only the filename field in templ.Error callsites changed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-19 21:53:40 -04:00
josh	05ceb8e042	ci(release): skip release workflow for non-bundle changes CI / Lint + build + test (push) Successful in 1m41s Details Release / release (push) Successful in 16m47s Details Adds a paths-ignore filter to the push trigger so README tweaks, *_test.go edits, other workflows, and fake-binary scaffolding no longer spend 45 min debootstrapping + republishing an identical bundle to the package registry. Adds workflow_dispatch as a manual escape hatch for the cases where paths-ignore swallows something that needs republishing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-19 21:08:26 -04:00
josh	988448664a	fix(runs): stamp completed_at on cancel/terminal SetState transitions CI / Lint + build + test (push) Successful in 1m35s Details Release / release (push) Successful in 11m53s Details CancelRun goes through Runner.Transition → Runs.SetState, which was a bare UPDATE state=? with no completed_at write. The host-page runDuration helper treats nil CompletedAt as "still running", so a cancelled run kept ticking forever. MarkCompleted / MarkFailed / MarkDispatchFailed already stamp completed_at; SetState now does the same for any terminal target state, using COALESCE so we never clobber an already-set timestamp. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-19 20:21:39 -04:00
josh	bbe1b19819	store: fix FindActiveByMAC scanning profile column that wasn't selected CI / Lint + build + test (push) Successful in 1m40s Details Release / release (push) Has been cancelled Details Sibling Run-scan sites (Get, LatestForHost, ListForHost, Active) were updated to include COALESCE(profile,'quick') in the SELECT when the Phase 1 migration added the column; FindActiveByMAC was missed, so Scan got 14 destination args for a 13-column row. The symptom is /ipxe/{mac} returning 500 and the host booting nothing, since that handler is what returns the live-image script. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-19 20:14:16 -04:00
josh	75c29bb31a	ci: pin upload-artifact to v3 for Gitea compatibility CI / Lint + build + test (push) Successful in 1m56s Details Release / release (push) Successful in 10m13s Details Gitea's act_runner rejects @actions/artifact v2 (the engine behind upload-artifact@v4). v3 is the last GHES-compatible major and still supports the path: glob + retention-days we need. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 22:58:59 -04:00
josh	23c689aa5b	deep profile + threshold gating + firmware stage + Burn super-stage CI / Lint + build + test (push) Failing after 1m57s Details Release / release (push) Has been cancelled Details Ships all five phases of the deep-profile overhaul together. Runs now carry a profile (quick/deep/soak); every profile walks the same 11-stage order — Inventory → Firmware → SpecValidate → SMART → CPUStress → Storage → Network → Burn → GPU → PSU → Reporting — with only per-stage durations and concurrency scaled. Phase 1: profiles.ProfileRegistry loaded from vetting.yaml; runs.profile column + CreateWithProfile; threshold table + evaluator seeded per-run from the shared vetting.thresholds block; breach flips result at /sensor + /result. Phase 2: upgraded CPUStress (stress-ng --cpu-method=all --verify + EDAC/MCE poll), Storage (fio --verify=md5 + SMART start/end delta), Network (sustained iperf + /proc/net/dev deltas) with per-profile knobs from Deps. Phase 3: Burn super-stage with goroutine fan-out for CPU + memory + fio + iperf, PSU rails sampled across the Burn window, SensorMux (2 s flush, 500-sample cap) to absorb backpressure. Phase 4: Firmware stage + firmware_snapshots table; probes dmidecode (BIOS), ipmitool (BMC), ethtool -i (NIC), nvme (sysfs + id-ctrl), lspci (HBA), /proc/cpuinfo (microcode). spec.DiffFirmware folds into SpecValidate with pin-by-identifier and fan-out-across-component matching; mismatches park the run in FailedHolding. Phase 5: profile radio on the host start form, profile chip on the run header, Firmware section in the HTML report, coverage artifact uploaded from CI, agent/tests/fakes/ scaffold with Deps.LookPath seam + stress_ng and dmidecode example fakes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 22:50:57 -04:00
josh	fbb21cbafd	ci: delete latest version, not the file, before re-uploading Release / release (push) Waiting to run Details CI / Lint + build + test (push) Successful in 1m42s Details File-level DELETE leaves a ghost version directory that makes the subsequent PUT 404 after a full 9-minute upload. Delete the whole 'latest' version, log the status code, and wait briefly before PUT. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 21:07:24 -04:00
josh	19608bef1b	ui: split /hosts/{id} into host page + /runs/{runID} run page CI / Lint + build + test (push) Successful in 1m35s Details Release / release (push) Successful in 23m47s Details Host page owns host metadata, full runs table with per-row stage strip, in-flight banner, and empty-state CTA. Run page owns pipeline, active step, logs, sub-steps, spec diffs, and hold banner with a breadcrumb back to the host. Dashboard tile reverts to host-only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 20:37:57 -04:00
josh	5c6bfa5ffa	ui: fix log lines rendering vertically when stage prefix is present CI / Lint + build + test (push) Successful in 1m39s Details Release / release (push) Has been cancelled Details The .log-line grid was templated with 5 columns (anchor/ln/lvl/ts/text), but renderLogSSE inserts an optional log-stage span, making 6 children. The 6th child wrapped to row 2 column 1 (24px wide), which forced the message text to break one character per line. Flexbox with min-width:0 on the text span scales cleanly with or without the stage element. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 19:20:51 -04:00
josh	f79fe0f0db	ui: GitHub-Actions-style detail page, sub-steps, mini-tile run-view CI / Lint + build + test (push) Successful in 1m26s Details Release / release (push) Successful in 6m47s Details Reshapes the detail page into a run-view: hybrid horizontal pipeline + expanded active-step pane with sub-steps, a per-step log pane with line-numbered permalinks and client-side search, and a runs-history sidebar that navigates via ?run=N. Default step is server-picked (running → failed → Reporting) so the operator lands on the thing that's moving. Adds a sub_steps table + SSE topic (substep-{run}-{stage}-{ordinal}) so per-disk and per-pass work (SMART, CPUStress CPU/RAM, Storage, GPU) is visible in the UI instead of buried in stage summary JSON. Agent emits sub-step reports from existing per-iteration loops. Dashboard tiles become a mini run-view with a 9-dot step strip so the operator reads run health across the whole grid at a glance. Register page gets the same card shell + button styling. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 19:00:11 -04:00
josh	5c00edd7b6	ui: fix htmx-ext-sse integrity hash (was silently blocked by browser) CI / Lint + build + test (push) Successful in 1m20s Details Release / release (push) Successful in 5m48s Details Detail-page pipeline + log panes weren't updating without a manual refresh. Root cause: the integrity attribute on htmx-ext-sse@2.2.2 in layout.templ was wrong, so the browser refused to execute the script (SRI enforcement is silent — no user-visible error unless you open devtools). htmx core loaded, boosted nav worked, forms worked — but sse-connect/sse-swap were inert because the extension never registered, so no EventSource was ever opened. Replaced the claimed hash (Y4gc0CK6...) with the real one (fw+eTlCc...) computed via curl -sL https://unpkg.com/htmx-ext-sse@2.2.2 | openssl dgst -sha384 -binary | openssl base64 -A Added sse_e2e_test.go as a regression canary that mounts the real chi router (RealIP + Recoverer + Logger middleware), opens GET /events, publishes a tile-update via Runner, and asserts the event lands on the wire. Server-side unit tests only verified rendered HTML — this one covers the full publish→wire path, which is what the next regression in this area will hit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 17:51:58 -04:00
josh	27098fc7ed	cpustress+orchestrator: serial CPU/RAM passes + silent-skip guard CI / Lint + build + test (push) Successful in 1m23s Details Release / release (push) Successful in 6m2s Details Orion's run (log 20:49 → 20:54) shipped GREEN while silently skipping CPUStress. Two compounding bugs: 1. CPUStress ran --cpu N AND --vm N --vm-bytes 90% concurrently. On a 4-core 8 GiB N95, that's 360% RAM overcommit; the OOM-killer fired, usually on the agent itself. Replaced with two sequential passes — CPU (all methods, --verify) for 3 min, then RAM (--vm 1, --vm-bytes capped to MemAvailable − 1.5 GiB, floor 256 MiB, --verify) for 3 min. Each pass now also asserts elapsed ≥ target − 2s so a premature clean exit counts as failure instead of a silent pass. 2. On systemd-restart after the OOM, the agent hardcoded nextStage := "Inventory" and re-ran it. The orchestrator's /result handler advances run state via TriggerStageCompleted against the current RunState, not against body.Stage — so an Inventory result posted while the run was in StateCPUStress silently advanced CPUStress → Storage and marked CPUStress passed without it ever running. Two-layer defense for #2: - agent-side: /claim response now carries current_state; agent resumes at the matching stage on a re-claim (happy path). - server-side: new TriggerStageMismatch + StageNameForState helper backstop. If body.Stage doesn't match the run's current stage, /result parks the run in FailedHolding with failed_stage labeled "<got> (expected <expected>)" and returns 409. Other stages audited for similar unbounded concurrency — none found; only CPUStress was unsafe. Tests: - cpustress_test.go — parseMemAvailable parses real meminfo, errors on missing/malformed; cap calc hits floor on tiny boxes, uses 1.5 GiB headroom on normal/huge boxes. - statemachine_test.go — TriggerStageMismatch lands at FailedHolding from every stage state and is rejected from pre-stage/terminal states; StageNameForState round-trips the stageStates map. - agent_handlers_test.go — TestResult_RejectsMismatchedStage proves the Orion scenario now 409s + FailedHolding; TestResult_AcceptsMatchingStage proves the guard doesn't break the happy path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 17:29:13 -04:00
josh	cdd6cae3b0	ui: keep detail-page SSE swaps live after the first outerHTML replace CI / Lint + build + test (push) Successful in 1m28s Details Release / release (push) Successful in 6m29s Details Pipeline fragment payload was a bare <div class=pipeline>, but the sse-swap=pipeline-N wrapper lived only in the page shell. The first outerHTML swap destroyed the wrapper, so every subsequent pipeline event had nothing to target — forcing a manual refresh. RenderPipelineString now emits the full <section id=pipeline-N sse-swap=... hx-swap=outerHTML> wrapper, used from both the shell and the orchestrator publish path. Also drop the red-bar styling from the empty DetailHold placeholder: the wrapper's detail-hold class was painting an unconditional red band between Pipeline and Actions whenever no hold was active.	2026-04-18 17:03:39 -04:00
josh	e73e31af92	live-image: install stage tools and fail loudly if any are missing CI / Lint + build + test (push) Successful in 1m32s Details Release / release (push) Successful in 6m28s Details The live image was still carrying the Phase 2 package list, so SMART, CPUStress, and Network each hit a LookPath miss and returned pass-with-skip. A run that skipped every real check still ended in "completed" — nothing on the report said the image was broken. Add smartmontools, stress-ng, fio, iperf3, lshw, lm-sensors, e2fsprogs, and util-linux to mkosi.conf. Flip the three stages from skip-pass to fail when their binary is missing so any future packaging regression blocks the run instead of whispering past it. Legitimate "no hardware" skips (no GPU, no hwmon, no disks, non-destructive) are untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 16:39:28 -04:00
josh	0db790ae3e	ui: stream host-detail fragments over SSE so the page updates live CI / Lint + build + test (push) Successful in 1m29s Details Release / release (push) Has been cancelled Details The detail page was only partly live: Pipeline + LogTabs subscribed to SSE, but the summary header, actions row, spec-diffs list and hold-key block all froze at page-load and required a manual refresh to catch up with state changes. Extract each of those four regions into its own named templ component with a stable id and sse-swap target, add Render*String helpers so the orchestrator can publish pre-rendered fragments, and register a HostDetailRenderer alongside the existing Tile/Pipeline renderers. PublishHostDetail is folded into publishTileUpdate so every call site that already refreshes a tile now also refreshes the detail page — keeps the fan-out honest without scattering new publish calls. The empty-state wrappers for spec-diffs and hold are load-bearing: without the <section id=... sse-swap=...> present at initial GET, the first live event after SpecValidate or Hold writes would have no DOM node to swap into. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 16:36:13 -04:00
josh	5e9ad7f569	probes: sanitize disk serials and normalize GPU model for stable spec keys CI / Lint + build + test (push) Successful in 1m25s Details Release / release (push) Successful in 5m38s Details Two related bugs were producing different map keys for identical hardware depending on whether the inventory probe ran in the reporter on the Proxmox host or in the live-image agent after PXE boot. 1. diskSerial read /sys/block/<dev>/device/{serial,vpd_pg80} and only TrimSpace'd the result. vpd_pg80 is a binary SCSI VPD page with a 4-byte header, and some SSDs leak NUL/control bytes into the text serial file. Those bytes survive into the Go string, lowercase unchanged, and become a garbage map key that the reporter's cleaner read can't match. Sanitize to ASCII-printable range at ingest. 2. probeGPUs built the model slug from fields[2] + " " + fields[3] of `lspci -mm -nnk` output. fields[3] is subsystem vendor/device info, which varies between otherwise-identical cards and carries the `-rXX` revision marker — stable-enough for display but not for identity. Use fields[2] alone, strip the trailing `[NNNN]` PCI device-ID that lspci -nn appends, and sanitize for consistency. After deploying the new orchestrator + re-running the configure step on each registered host, SpecValidate will match cleanly. Disk diffs self-resolve because the reporter already stored clean serials; GPU diffs need one reporter re-run because the old expected slug still carries subsystem noise.	2026-04-18 16:06:18 -04:00
josh	d48cf146f4	live-image: mask systemd-firstboot at image-build time CI / Lint + build + test (push) Successful in 1m24s Details Release / release (push) Successful in 5m53s Details Belt-and-braces for the kernel-cmdline systemd.firstboot=off fix. mkosi ships /etc/machine-id empty, which triggers firstboot's interactive locale/timezone/root-password prompt on every PXE boot; with the agent running unattended there's nobody to answer and sysinit.target blocks indefinitely. Mask via a /dev/null symlink in /etc/systemd/system so the service is unstartable regardless of cmdline — rules out the failure mode where an older orchestrator binary serves an iPXE script without the off-switch arg.	2026-04-18 15:41:46 -04:00
josh	026923075c	pxe: disable systemd-firstboot so the live image doesn't prompt CI / Lint + build + test (push) Successful in 1m22s Details Release / release (push) Has been cancelled Details systemd-firstboot.service is an interactive wizard that asks for locale, timezone, and root password when /etc/machine-id isn't populated — i.e. every PXE boot of a mkosi-built image. It sits on sysinit.target waiting for input that will never arrive, blocking the agent service and every other downstream unit indefinitely. systemd.firstboot=off on the kernel cmdline is the documented kill switch; no image-side changes needed.	2026-04-18 15:35:24 -04:00
josh	956120b80e	deploy: show speed + ETA in bundle-download progress meter CI / Lint + build + test (push) Successful in 1m24s Details Release / release (push) Successful in 5m30s Details Drop --progress-bar (curl's minimal hash meter) in favor of the default progress output, which includes transfer rate and time remaining. Bundles grew from ~30 MB to ~300 MB with the full-rootfs initrd, and a percentage-only bar with no speed hint makes a slow registry look indistinguishable from a hang.	2026-04-18 15:04:26 -04:00
josh	c45349f62c	pxe: mask serial-getty@ttyS0 so hosts without serial don't wait 90s CI / Lint + build + test (push) Successful in 1m47s Details Release / release (push) Successful in 5m16s Details systemd-getty-generator reads console=ttyS0 off the kernel cmdline and auto-creates serial-getty@ttyS0.service, which BindsTo dev-ttyS0.device. On hardware without a physical serial port the device node never shows up, systemd waits its full default 90s timeout, and only then proceeds. systemd.mask= on the kernel cmdline is a first-class option — masks the unit before the generator's link even gets activated. Kernel messages still go to ttyS0 if a port is present; we just don't try to spawn a login prompt there.	2026-04-18 14:47:03 -04:00
josh	a88e24bef4	live-image: real /init + verbose boot for first-boot diagnosis CI / Lint + build + test (push) Successful in 1m23s Details Release / release (push) Successful in 4m49s Details Host boots past kernel init and then stalls silently. ACPI DSDT error about TXHC.RHUB.SS01 is benign noise (Tiger Lake firmware bug) — the actual problem is that nothing between kernel handoff and (maybe) systemd is visible on the console. Two changes: 1. Replace the /init → sbin/init symlink with a real shell script (live-image/mkosi.extra/init) that mounts /proc /sys /dev /dev/pts /dev/shm /run before execing systemd. Systemd has fallback mount code for these, but when it fails the failure is silent. Doing it explicitly in /init keeps failures visible and avoids the fragile symlink-resolution trick. 2. Drop 'quiet' from the kernel cmdline and add loglevel=7 plus systemd.log_target=kmsg + journald.forward_to_console=1 so every early-boot message reaches both tty0 and ttyS0. Will be dialed back once boot is stable. Also: .gitattributes pins LF on live-image/, .gitea/, Makefile, and *.sh so Windows checkouts don't break shell scripts and Makefile recipes with CRLF. /init also gets chmod 0755 in repack-initrd as a belt-and-braces against mode loss on non-Linux checkouts.	2026-04-18 14:31:40 -04:00
josh	43ea845ac0	live-image: pack full rootfs as initrd so PXE actually boots userspace CI / Lint + build + test (push) Successful in 1m54s Details Release / release (push) Successful in 5m10s Details update-initramfs produces a boot stub (~50 MB) that expects to mount a separate rootfs over squashfs/disk/NFS. Our PXE channel only ships vmlinuz+initrd.img, so the stub had nothing to pivot to — kernel finished hand-off and the system wedged with firmware, modules, and userspace stranded in the 545 MB rootfs dir we never delivered. Replace with an everything-in-initramfs build: cpio.zst the full rootfs (minus /boot) as the initrd, add /init -> sbin/init for the kernel's runtime entrypoint, materialize the kernel symlink into a real file. Bump check-initrd floor to 200 MB and switch the firmware grep from unmkinitramfs (boot-stub-specific) to zstd | cpio -t. Also add cpio to the CI apt deps.	2026-04-18 14:14:08 -04:00
josh	6c6d20710f	live-image: fix check-initrd size measurement; add zstd to image CI / Lint + build + test (push) Successful in 1m28s Details Release / release (push) Failing after 4m10s Details Previous run actually built the 518 MB rootfs with firmware-misc-nonfree et al. installed — the real payload is working. Two follow-ups: - check-initrd was reading stat on a symlink path and getting 30 bytes (the symlink's own size), not the 6.1.0-44-amd64 kernel initrd it points to. Switched to wc -c, which follows symlinks, and to du -hL for the OK message. - Add zstd to Packages= so COMPRESS=zstd in initramfs.conf can be honored; without it update-initramfs falls back to gzip with a "No zstd in PATH" warning. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 14:00:07 -04:00
josh	0a5e5d0b39	ci: add bubblewrap dep and bump mkosi to v25.3 CI / Lint + build + test (push) Successful in 1m31s Details Release / release (push) Failing after 3m47s Details v24.3 crashed in cp_version() during the copy-package-manager-trees step because its sandbox needs bubblewrap (not present in the runner apt list), and cp --version returned empty output inside the broken sandbox. Installing bubblewrap and bumping to v25.3 which has tighter sandbox fallback behavior. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 13:53:09 -04:00
josh	488a0d1052	ci: install mkosi from upstream git tag, not PyPI Release / release (push) Failing after 1m54s Details CI / Lint + build + test (push) Has been cancelled Details Previous commit pinned mkosi==24.3 via pip but mkosi isn't published on PyPI past ancient versions — the runner hit "Could not find a version that satisfies the requirement mkosi==24.3". Install from the upstream git tag v24.3 instead; added git to the apt dep list for pip's VCS fetch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 13:44:51 -04:00
josh	28918bad15	live-image: fix firmware so i915 actually loads at boot CI / Lint + build + test (push) Successful in 1m35s Details Release / release (push) Failing after 22s Details Previous attempt (`c962d6d`) added firmware-linux-nonfree to mkosi.conf, but the CI bundle was still 63 MB and Tiger Lake wedged on tgl_guc. Two reasons: (1) firmware-linux-nonfree on bookworm is a thin metapackage that doesn't include firmware-misc-nonfree, which is where i915 GuC/HuC blobs actually live; (2) Ubuntu's apt-packaged mkosi is old enough that Repositories=non-free-firmware shorthand likely isn't wired through to the debootstrap invocation, so firmware packages silently miss the bootstrap step entirely. Changes: - Enumerate firmware packages explicitly in mkosi.conf (firmware-misc-nonfree, firmware-iwlwifi, firmware-realtek, firmware-amd-graphics, firmware-intel-sound, intel/amd64-microcode). - Ship mkosi.sources.d/debian.sources with explicit deb822 so the non-free-firmware component is unambiguously available. - Install mkosi 24.3 via pip in CI instead of apt's older build. - Pin MODULES=most and COMPRESS=zstd via a tracked initramfs-tools config under mkosi.extra/. - Narrow .gitignore so only the generated agent binary is ignored, not the whole mkosi.extra/ tree. - New check-initrd Makefile target asserts both size (>=150 MB) and actual presence of i915/tgl_guc_*.bin inside the built initrd, so a silent firmware-drop regression fails the build loudly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 13:38:40 -04:00
josh	c962d6d8ab	live-image: bundle nonfree firmware (i915 GuC et al.) CI / Lint + build + test (push) Successful in 2m19s Details Release / release (push) Successful in 3m28s Details Tiger Lake and later Intel iGPUs need i915/tgl_guc_*.bin; without it the i915 init wedges and floods the console. Same story on most modern wifi/NIC hardware. Pull firmware-linux-nonfree (metapackage covering misc-nonfree, iwlwifi, realtek, amd-graphics, …) from the bookworm non-free-firmware repo — single line fix, ~500MB cost to the squashfs, worth it for booting arbitrary repaired hosts. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 13:14:19 -04:00
josh	4524ab8dc0	runs: add non-destructive flag + operator Cancel button CI / Lint + build + test (push) Successful in 2m5s Details Release / release (push) Successful in 3m5s Details Non-destructive pre-declares "don't touch the disks" on Start: the Storage stage skips wipe-probe, badblocks -w, and write-mode fio, and reports a read-only summary. Runs gains a new non_destructive column; threaded through Claim → agent tests.Deps → Storage stage. Cancel halts an in-flight run. The orchestrator transitions to a new StateCancelled via TriggerOperatorCancelled (valid from any active state); the agent's next heartbeat returns cmd=cancel_stage, which fires a stored CancelFunc on the per-stage context. Stage subprocesses spawned with exec.CommandContext die with the context, the agent posts a cancelled outcome, then powers the host off. Destructive stages mid-run may leave the host in an intermediate state — the UI confirm dialog warns the operator; recovery is manual for now. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 13:01:42 -04:00
josh	2c440fce8a	pxe: move dhcp-host allowlist into a SIGHUP-reloadable file CI / Lint + build + test (push) Successful in 1m38s Details Release / release (push) Successful in 2m25s Details dnsmasq's SIGHUP re-reads /etc/ethers and any --dhcp-hostsfile= paths, but NOT dhcp-host= lines from the main conf. Reload() was faithfully rewriting dnsmasq.conf with the new MAC, sending SIGHUP, and then dnsmasq kept serving its startup view — so a freshly-registered host still showed up as "proxy-ignored, tags: eth0" with no "known" tag. Split the allowlist into ${RuntimeDir}/dhcp-hosts, referenced from the main conf via dhcp-hostsfile=. writeConf() is static-ish now; Reload just rewrites the hosts file and SIGHUPs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:41:27 -04:00
josh	bce6e08524	pxe: reload dnsmasq on host create/delete CI / Lint + build + test (push) Successful in 1m54s Details Release / release (push) Successful in 2m36s Details pxe.Supervisor.Reload() was defined but never wired up. After a host was registered in the UI or via the quick-register JSON endpoint, the dnsmasq conf still held only the hosts that existed at orchestrator startup. The new MAC wasn't tagged `known`, so when the host PXE'd, dnsmasq logged "PXE(eth0) <mac> proxy-ignored" and the boot timed out back to the BIOS. Add an optional PXEReloader interface to api.UI, wire it from main when pxe is enabled, and call u.reloadPXE() after successful Create and Delete. Logs-and-continues on failure — host registration itself has already committed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:31:00 -04:00
josh	157b70f536	pxe: split subnet into network+netmask for dnsmasq proxy-DHCP CI / Lint + build + test (push) Successful in 2m0s Details Release / release (push) Successful in 3m35s Details dnsmasq's proxy-DHCP syntax is `dhcp-range=<network-ip>,proxy[,<mask>]`, not a CIDR. Passing "192.168.1.0/24,proxy" made dnsmasq refuse to start with "bad dhcp-range at line 12". Parse the CIDR once in writeConf() and render Network + Netmask as separate template fields. The config surface (pxe.subnet) stays CIDR because that's the right shape for humans; we just unpack it before handing to dnsmasq. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:17:10 -04:00
josh	cf3a75591c	install: stage pxe-setup.sh at /usr/local/sbin/vetting-pxe-setup CI / Lint + build + test (push) Successful in 1m36s Details Release / release (push) Successful in 2m29s Details proxmox-install.sh tarball-extracts into a tempdir that gets wiped on EXIT, so after the one-liner there's no pxe-setup.sh on disk for the operator to run. Have install.sh drop the script + ipxe-shas.txt into /usr/local/share/vetting/ and symlink it as /usr/local/sbin/vetting-pxe-setup (in PATH). pxe-setup.sh now readlink -f's BASH_SOURCE so the symlink resolves to the share dir where ipxe-shas.txt lives, and gracefully handles the case where install.sh already staged vmlinuz + initrd.img into LIVE_DIR (no bundle live-image/ needed at that point). Update the trailing hint in proxmox-install.sh and the operations runbook to surface the new `sudo vetting-pxe-setup ...` command. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:10:23 -04:00
josh	bcbbc35489	docs+e2e: document proxy-DHCP topology; default e2e bridge to LAN CI / Lint + build + test (push) Successful in 1m37s Details Release / release (push) Has been cancelled Details Rewrites the PXE section of the ops runbook around the new proxy-DHCP model (no dedicated bridge, coexists with UniFi/pfSense/etc.) and swaps the e2e test's default bridge + orchestrator URL to match. The e2e file now calls out the LAN-DHCP precondition in its header so future-me (or CI) doesn't hang at PXE wondering why nothing answers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:07:05 -04:00
josh	506c856046	pxe: switch dnsmasq to proxy-DHCP mode on the LAN CI / Lint + build + test (push) Successful in 1m48s Details Release / release (push) Successful in 2m22s Details Previously the orchestrator ran a full DHCP server on a dedicated br-vetting bridge (10.77.0.0/24), which required a hypervisor-level bridge + physical cabling onto that bridge for every repaired host. Real-world bite: the LXC's br-vetting had no L2 path to the target host's PXE NIC, so DHCPDISCOVERs never reached eth1 and PXE silently timed out. dnsmasq's proxy-DHCP mode is the idiomatic answer: it coexists with the LAN's existing DHCP server (UniFi, etc.), never assigns an IP itself, and only supplements the PXE options. No dedicated bridge, no VLAN, no cabling changes — dnsmasq binds to the LAN interface and layers option 66/67 + the PXE BINL on top of the real DHCP exchange. The MAC allowlist still gates replies, so random LAN clients booting from network get nothing. Template switches dhcp-range=<start,end,lease> to dhcp-range=<cidr>,proxy and replaces dhcp-boot= for first-boot ROM clients with pxe-service= directives (the correct proxy-mode chainload form). Validation drops the dhcp_range regex for a net.ParseCIDR check on pxe.subnet. Config, production/example yaml, and pxe-setup.sh swap --dhcp-range for --subnet. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:02:49 -04:00

88 Commits