Vetting

Author	SHA1	Message	Date
josh	8367ec2a9f	docs: comprehensive documentation expansion. Add 4 new doc files (configuration reference, development guide, API reference with full request/response schemas, database schema), expand the README with a feature list and how-it-works walkthrough, fix missing Firmware and Burn stages in architecture.md and test-suite.md, add threshold engine and host-mode agent sections, and add godoc comments to 11 packages and 6 model types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-23 18:37:26 -04:00
josh	17ec55cb85	chore: cleanup sprint (dead CSS, dedup helpers, handler refactor). Remove ~126 lines of orphaned CSS from tile slim-down and old detail layout. Consolidate 4 duplicate duration formatters into shared elapsed()/fmtElapsed() helpers. Break 160-line Result handler into focused sub-functions. Implement real Hub.Shutdown() (was a no-op). Standardize agent error responses to JSON. Replace panic() in router init with error return. Extract magic numbers as named constants. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-21 20:39:38 -04:00
josh	23c689aa5b	deep profile + threshold gating + firmware stage + Burn super-stage. Ships all five phases of the deep-profile overhaul together. Runs now carry a profile (quick/deep/soak); every profile walks the same 11-stage order (Inventory → Firmware → SpecValidate → SMART → CPUStress → Storage → Network → Burn → GPU → PSU → Reporting), with only per-stage durations and concurrency scaled. Phase 1: profiles.ProfileRegistry loaded from vetting.yaml; runs.profile column + CreateWithProfile; threshold table + evaluator seeded per-run from the shared vetting.thresholds block; breach flips result at /sensor + /result. Phase 2: upgraded CPUStress (stress-ng --cpu-method=all --verify + EDAC/MCE poll), Storage (fio --verify=md5 + SMART start/end delta), Network (sustained iperf + /proc/net/dev deltas) with per-profile knobs from Deps. Phase 3: Burn super-stage with goroutine fan-out for CPU + memory + fio + iperf, PSU rails sampled across the Burn window, SensorMux (2 s flush, 500-sample cap) to absorb backpressure. Phase 4: Firmware stage + firmware_snapshots table; probes dmidecode (BIOS), ipmitool (BMC), ethtool -i (NIC), nvme (sysfs + id-ctrl), lspci (HBA), /proc/cpuinfo (microcode). spec.DiffFirmware folds into SpecValidate with pin-by-identifier and fan-out-across-component matching; mismatches park the run in FailedHolding. Phase 5: profile radio on the host start form, profile chip on the run header, Firmware section in the HTML report, coverage artifact uploaded from CI, agent/tests/fakes/ scaffold with Deps.LookPath seam + stress_ng and dmidecode example fakes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 22:50:57 -04:00
josh	19608bef1b	ui: split /hosts/{id} into host page + /runs/{runID} run page. Host page owns host metadata, full runs table with per-row stage strip, in-flight banner, and empty-state CTA. Run page owns pipeline, active step, logs, sub-steps, spec diffs, and hold banner with a breadcrumb back to the host. Dashboard tile reverts to host-only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 20:37:57 -04:00
josh	f79fe0f0db	ui: GitHub-Actions-style detail page, sub-steps, mini-tile run-view. Reshapes the detail page into a run-view: hybrid horizontal pipeline + expanded active-step pane with sub-steps, a per-step log pane with line-numbered permalinks and client-side search, and a runs-history sidebar that navigates via ?run=N. Default step is server-picked (running → failed → Reporting) so the operator lands on the thing that's moving. Adds a sub_steps table + SSE topic (substep-{run}-{stage}-{ordinal}) so per-disk and per-pass work (SMART, CPUStress CPU/RAM, Storage, GPU) is visible in the UI instead of buried in stage summary JSON. Agent emits sub-step reports from existing per-iteration loops. Dashboard tiles become a mini run-view with a 9-dot step strip so the operator reads run health across the whole grid at a glance. Register page gets the same card shell + button styling. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 19:00:11 -04:00
josh	0db790ae3e	ui: stream host-detail fragments over SSE so the page updates live. The detail page was only partly live: Pipeline + LogTabs subscribed to SSE, but the summary header, actions row, spec-diffs list and hold-key block all froze at page-load and required a manual refresh to catch up with state changes. Extract each of those four regions into its own named templ component with a stable id and sse-swap target, add Render*String helpers so the orchestrator can publish pre-rendered fragments, and register a HostDetailRenderer alongside the existing Tile/Pipeline renderers. PublishHostDetail is folded into publishTileUpdate so every call site that already refreshes a tile now also refreshes the detail page, which keeps the fan-out honest without scattering new publish calls. The empty-state wrappers for spec-diffs and hold are load-bearing: without the <section id=... sse-swap=...> present at initial GET, the first live event after SpecValidate or Hold writes would have no DOM node to swap into. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 16:36:13 -04:00
josh	bce6e08524	pxe: reload dnsmasq on host create/delete. pxe.Supervisor.Reload() was defined but never wired up. After a host was registered in the UI or via the quick-register JSON endpoint, the dnsmasq conf still held only the hosts that existed at orchestrator startup. The new MAC wasn't tagged `known`, so when the host PXE'd, dnsmasq logged "PXE(eth0) <mac> proxy-ignored" and the boot timed out back to the BIOS. Add an optional PXEReloader interface to api.UI, wire it from main when pxe is enabled, and call u.reloadPXE() after successful Create and Delete. Logs-and-continues on failure; host registration itself has already committed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:31:00 -04:00
josh	506c856046	pxe: switch dnsmasq to proxy-DHCP mode on the LAN. Previously the orchestrator ran a full DHCP server on a dedicated br-vetting bridge (10.77.0.0/24), which required a hypervisor-level bridge + physical cabling onto that bridge for every repaired host. Real-world bite: the LXC's br-vetting had no L2 path to the target host's PXE NIC, so DHCPDISCOVERs never reached eth1 and PXE silently timed out. dnsmasq's proxy-DHCP mode is the idiomatic answer: it coexists with the LAN's existing DHCP server (UniFi, etc.), never assigns an IP itself, and only supplements the PXE options. No dedicated bridge, no VLAN, no cabling changes: dnsmasq binds to the LAN interface and layers option 66/67 + the PXE BINL on top of the real DHCP exchange. The MAC allowlist still gates replies, so random LAN clients booting from network get nothing. Template switches dhcp-range=<start,end,lease> to dhcp-range=<cidr>,proxy and replaces dhcp-boot= for first-boot ROM clients with pxe-service= directives (the correct proxy-mode chainload form). Validation drops the dhcp_range regex for a net.ParseCIDR check on pxe.subnet. Config, production/example yaml, and pxe-setup.sh swap --dhcp-range for --subnet. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 12:02:49 -04:00
josh	9d17859992	orchestrator: anchor pxe+tftp runtime dirs under artifacts parent. Previously tftp_root defaulted to logs.dir/../tftp and the pxe runtime dir to logs.dir/../pxe. On a production install that resolves to /var/log/tftp and /var/log/pxe, both outside the systemd unit's ReadWritePaths=/var/lib/vetting /var/log/vetting sandbox. The service crash-looped with "mkdir /var/log/pxe: read-only file system" as soon as PXE was enabled. Switch the anchor to filepath.Dir(cfg.Artifacts.Dir) (typically /var/lib/vetting), which the sandbox already allows. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 11:14:11 -04:00
josh	a5055b3c7a	Automate PXE setup: release bundle + pxe-setup.sh + startup validation. Collapses the LXC side of PXE enablement from a six-step manual dance (build, fetch iPXE, scp, bridge, hand-edit yaml) into: `make release` (dev box, Linux/WSL); `scp bundle.tar.gz lxc:/tmp/`; `sudo ./install.sh` (base install, unchanged); `sudo ./pxe-setup.sh --interface ... --dhcp-range ... --orchestrator-url ...`. pxe-setup.sh fetches iPXE from boot.ipxe.org, verifies against pinned SHA256s in deploy/ipxe-shas.txt (fail-closed), places vmlinuz/initrd.img from the bundle, and rewrites only the pxe: block of vetting.yaml. Idempotent; --force gates overwriting a hand-edited block. Adds Supervisor.Validate(), called before dnsmasq spawn, so typo'd configs fail at orchestrator startup with clear errors naming the missing file or yaml key, instead of silently serving broken TFTP until a real host tries to PXE-boot. Nine tests cover missing files, bogus interface, malformed dhcp_range, bad orchestrator_url, and aggregate reporting. Hypervisor bridge creation stays documented (LXC can't do it) but everything downstream of the bridge is now scripted. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 01:38:43 -04:00
josh	1694c20b12	Host detail v2: full pipeline + per-stage logs + WoL diagnostics. Pipeline now always renders all 13 nodes (3 pre-stage + 9 stage + Completed), synthesising ghosts from run state when stage rows aren't seeded yet. Makes a WaitingWoL host show the full timeline ahead of it instead of just 4 dots. Agent tags each log line with its stage; logs.Hub fans out to both log-{runID} and log-{runID}-{stage} SSE events so the detail page can show per-stage tabs with a pure-CSS radio-sibling switch. Flat run log prepends [stage] so grep still works. Dispatcher writes picked/sent-WoL/heartbeat lines into the per-run log: the operator opens the detail page, sees WaitingWoL stuck, and reads exactly what the dispatcher did and why nothing's progressing, instead of having to tail journalctl on the LXC. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 00:38:27 -04:00
josh	bb658a8435	Host detail page + pipeline timeline. Click a tile to open /hosts/{id}, the canonical control surface per host. Timeline renders every pre-stage, stage, and terminal node in order, with the current one pulsing, failed ones flagged, and downstream ones dimmed as skipped. Detail page shows summary, hold card (when holding), all action buttons, spec diffs, a full-height log pane, and a collapsed expected-spec YAML. Tile slims to name, last-seen, status, and one primary action; a CSS-overlay <a> makes the whole card clickable while buttons stay receptive via z-index. Runner.publishTileUpdate now also emits pipeline-{runID} fragments, and CompleteStage wraps Stages.CompleteByName so stage completions advance the timeline live; without this the dots only moved on state transitions. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-17 23:59:43 -04:00
josh	a0c0fb114f	Add host-mode heartbeat: vetting-agent host + last-seen badge. vetting-agent gains a `host` subcommand that runs as a systemd service installed by the quick-register one-liner, POSTing every 30s to /api/v1/hosts/{mac}/heartbeat so the dashboard tile shows "online" or "Nm ago" without waiting on WoL. Ships dormant client code for the Phase 2 reboot_for_vetting command so the server can flip it on later without a binary redeploy. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-17 23:34:15 -04:00
josh	8b3d9a312e	Add quick-register one-liner for target-host registration. Operator pastes `curl -fsSL $ORCH/register/quick.sh | sudo bash` on the target host (pre-wipe). The script probes MAC + CPU/RAM/disks/NICs/GPUs, emits an expected-spec YAML, and POSTs to a new LAN-trusted JSON endpoint /api/v1/hosts. The register page shows the command prefilled with the orchestrator URL; the manual form moves into a collapsible "Register manually" disclosure.	2026-04-17 22:50:54 -04:00
josh	42da48864f	Remove operator auth: trust the LAN. Can't log in from a fresh LXC deploy, and the service is LAN-only by design. Rip out the whole bcrypt-password / signed-cookie session layer: internal/auth, login templates, gen-admin-password binary + Makefile targets, auth config block, login/logout routes and the RequireSession middleware wrap. Agent bearer-token auth on /api/v1/runs/{id}/* is untouched. Operators who want a password can front the service with a reverse proxy (noted in README and docs/operations.md).	2026-04-17 22:31:49 -04:00
josh	9bb4b09a04	Initial commit: full Phases 1-6 implementation. Post-repair hardware validation pipeline for Proxmox cluster hosts. Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.	2026-04-17 21:32:10 -04:00

16 Commits