Vetting

Author	SHA1	Message	Date
josh	21014c1268	fix(inventory): read GPU model from device field, not vendor field [CI / Lint + build + test (push): successful in 1m37s; Release / release (push): successful in 11m43s] `lspci -D -mm -nn` prefixes every line with the PCI address as a bare token before the three quoted class/vendor/device fields, so the device name sits at fields[3], not fields[2], which is the vendor. The probe was indexing [2] and recording every GPU's model as its vendor string ("Intel Corporation" instead of "Alder Lake-N [UHD Graphics]"), which made SpecValidate report a mismatch on every real host once the expected spec named the device. Extract the per-line parse into parseLspciMMLine, handle both the modern -D layout (addr + class/vendor/device) and the legacy layout without an address prefix (class/vendor/device), and cover both paths plus the non-GPU-class skip in inventory_test.go. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-19 22:53:42 -04:00
josh	3656af9823	feat(end-of-run): reboot to local disk instead of powering off [CI / Lint + build + test (push): successful in 1m47s; Release / release (push): successful in 10m8s] Completed runs now reboot the host and fall through iPXE to the next boot device (local disk) instead of powering off. Three coordinated changes: - pxe/ipxe: NoActiveRunScript exits iPXE (drops to next boot entry) instead of `sleep 10; poweroff`. Without this, a Completed reboot just loops through PXE and gets told to poweroff. - api/agent_handlers: heartbeat returns cmd=reboot (was cmd=shutdown) when the run reaches Completed. - agent/runner: runs `systemctl reboot` (with `shutdown -r now` fallback) in response to cmd=reboot. Operator cancel still powers off: powerOffAndReturn is unchanged because a cancel means the operator wants the host idle so they can walk up to it, not back in rotation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-19 22:45:11 -04:00
josh	8acef92a60	feat(inventory): deep hardware capture + per-probe substeps + verbose logs [CI / Lint + build + test (push): successful in 1m35s; Release / release (push): successful in 9m34s] Extend Inventory stage from a one-liner summary to a per-probe substep emitter with ~20-30 narrative log lines per run. - spec: per-DIMM memory (slot/size/speed/manufacturer/part_number), richer CPU (vendor/stepping/physical_cores/flags), disk model/transport/rotational, NIC driver/pci_addr, GPU vram/pci/driver, new System/Baseboard/PSU/OS top-level sections. All fields omitempty so existing expected-spec YAML and artifacts stay compatible. - spec.Diff: new diffDIMMs/diffSystem/diffBaseboard/diffPSU/diffOS helpers; extended diffDisks/diffNICs/diffGPUs for new fields. GPU diff gains PCIAddr-pinned matching alongside count-by-model. - agent/probes/inventory: CPU (/proc/cpuinfo extended), Memory (dmidecode -t 17 multi-block), Disks (+model/transport/rotational), NICs (+driver/pci from sysfs), GPUs (VRAM from lspci -vv), new System/Baseboard (dmidecode -t system/baseboard), PSU (dmidecode -t 39), OS (/proc/sys/kernel/osrelease + /etc/os-release). All probes accept a Logger and emit per-finding info/warn lines. - agent/probes/firmware: parseDmidecodeAllSections for multi-block fixtures (memory / PSU). - agent/runner: Inventory case becomes 9 substep rows (CPU / Memory / Disks / NICs / GPUs / System / Baseboard / PSU / OS) with per-probe start/complete timestamps. - report: new Inventory HTML section between Stages and Firmware; resolveReporting loads the inventory.json artifact. - agent/tests/fakes/dmidecode: dispatches on -t flag to serve bios / memory / system / baseboard / 39 fixtures for unit tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-19 22:21:17 -04:00
josh	481b67fb69	feat(firmware): install probe tools in live image + surface nic/hba gaps [CI / Lint + build + test (push): successful in 1m42s; Release / release (push): successful in 11m25s] mkosi.conf: add ipmitool, ethtool, nvme-cli so the Firmware stage can actually read BMC revisions, NIC firmware versions, and fall back to nvme-cli when sysfs firmware_rev is missing. firmware.go: probeNICFirmware and probeHBAFirmware now return (snapshots, warning) so a missing ethtool/lspci surfaces in the stage log the same way probeBIOS/probeBMC already do. Before, a host without ethtool silently reported "bios=1 nvme_fw=1 microcode=1" with no hint that nic coverage was dropped. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-19 21:56:18 -04:00
josh	23c689aa5b	deep profile + threshold gating + firmware stage + Burn super-stage [CI / Lint + build + test (push): failing after 1m57s; Release / release (push): cancelled] Ships all five phases of the deep-profile overhaul together. Runs now carry a profile (quick/deep/soak); every profile walks the same 11-stage order (Inventory → Firmware → SpecValidate → SMART → CPUStress → Storage → Network → Burn → GPU → PSU → Reporting), with only per-stage durations and concurrency scaled. Phase 1: profiles.ProfileRegistry loaded from vetting.yaml; runs.profile column + CreateWithProfile; threshold table + evaluator seeded per-run from the shared vetting.thresholds block; breach flips result at /sensor + /result. Phase 2: upgraded CPUStress (stress-ng --cpu-method=all --verify + EDAC/MCE poll), Storage (fio --verify=md5 + SMART start/end delta), Network (sustained iperf + /proc/net/dev deltas) with per-profile knobs from Deps. Phase 3: Burn super-stage with goroutine fan-out for CPU + memory + fio + iperf, PSU rails sampled across the Burn window, SensorMux (2 s flush, 500-sample cap) to absorb backpressure. Phase 4: Firmware stage + firmware_snapshots table; probes dmidecode (BIOS), ipmitool (BMC), ethtool -i (NIC), nvme (sysfs + id-ctrl), lspci (HBA), /proc/cpuinfo (microcode). spec.DiffFirmware folds into SpecValidate with pin-by-identifier and fan-out-across-component matching; mismatches park the run in FailedHolding. Phase 5: profile radio on the host start form, profile chip on the run header, Firmware section in the HTML report, coverage artifact uploaded from CI, agent/tests/fakes/ scaffold with Deps.LookPath seam + stress_ng and dmidecode example fakes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 22:50:57 -04:00
josh	f79fe0f0db	ui: GitHub-Actions-style detail page, sub-steps, mini-tile run-view [CI / Lint + build + test (push): successful in 1m26s; Release / release (push): successful in 6m47s] Reshapes the detail page into a run-view: hybrid horizontal pipeline + expanded active-step pane with sub-steps, a per-step log pane with line-numbered permalinks and client-side search, and a runs-history sidebar that navigates via ?run=N. Default step is server-picked (running → failed → Reporting) so the operator lands on the thing that's moving. Adds a sub_steps table + SSE topic (substep-{run}-{stage}-{ordinal}) so per-disk and per-pass work (SMART, CPUStress CPU/RAM, Storage, GPU) is visible in the UI instead of buried in stage summary JSON. Agent emits sub-step reports from existing per-iteration loops. Dashboard tiles become a mini run-view with a 9-dot step strip so the operator reads run health across the whole grid at a glance. Register page gets the same card shell + button styling. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 19:00:11 -04:00
josh	27098fc7ed	cpustress+orchestrator: serial CPU/RAM passes + silent-skip guard [CI / Lint + build + test (push): successful in 1m23s; Release / release (push): successful in 6m2s] Orion's run (log 20:49 → 20:54) shipped GREEN while silently skipping CPUStress. Two compounding bugs: 1. CPUStress ran --cpu N AND --vm N --vm-bytes 90% concurrently. On a 4-core 8 GiB N95, that's 360% RAM overcommit; the OOM-killer fired, usually on the agent itself. Replaced with two sequential passes: CPU (all methods, --verify) for 3 min, then RAM (--vm 1, --vm-bytes capped to MemAvailable − 1.5 GiB, floor 256 MiB, --verify) for 3 min. Each pass now also asserts elapsed ≥ target − 2s so a premature clean exit counts as failure instead of a silent pass. 2. On systemd-restart after the OOM, the agent hardcoded nextStage := "Inventory" and re-ran it. The orchestrator's /result handler advances run state via TriggerStageCompleted against the current RunState, not against body.Stage, so an Inventory result posted while the run was in StateCPUStress silently advanced CPUStress → Storage and marked CPUStress passed without it ever running. Two-layer defense for #2: - agent-side: /claim response now carries current_state; agent resumes at the matching stage on a re-claim (happy path). - server-side: new TriggerStageMismatch + StageNameForState helper backstop. If body.Stage doesn't match the run's current stage, /result parks the run in FailedHolding with failed_stage labeled "<got> (expected <expected>)" and returns 409. Other stages audited for similar unbounded concurrency: none found; only CPUStress was unsafe. Tests: - cpustress_test.go: parseMemAvailable parses real meminfo, errors on missing/malformed; cap calc hits floor on tiny boxes, uses 1.5 GiB headroom on normal/huge boxes. - statemachine_test.go: TriggerStageMismatch lands at FailedHolding from every stage state and is rejected from pre-stage/terminal states; StageNameForState round-trips the stageStates map. - agent_handlers_test.go: TestResult_RejectsMismatchedStage proves the Orion scenario now 409s + FailedHolding; TestResult_AcceptsMatchingStage proves the guard doesn't break the happy path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 17:29:13 -04:00
josh	e73e31af92	live-image: install stage tools and fail loudly if any are missing [CI / Lint + build + test (push): successful in 1m32s; Release / release (push): successful in 6m28s] The live image was still carrying the Phase 2 package list, so SMART, CPUStress, and Network each hit a LookPath miss and returned pass-with-skip. A run that skipped every real check still ended in "completed"; nothing on the report said the image was broken. Add smartmontools, stress-ng, fio, iperf3, lshw, lm-sensors, e2fsprogs, and util-linux to mkosi.conf. Flip the three stages from skip-pass to fail when their binary is missing so any future packaging regression blocks the run instead of whispering past it. Legitimate "no hardware" skips (no GPU, no hwmon, no disks, non-destructive) are untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 16:39:28 -04:00
josh	5e9ad7f569	probes: sanitize disk serials and normalize GPU model for stable spec keys [CI / Lint + build + test (push): successful in 1m25s; Release / release (push): successful in 5m38s] Two related bugs were producing different map keys for identical hardware depending on whether the inventory probe ran in the reporter on the Proxmox host or in the live-image agent after PXE boot. 1. diskSerial read /sys/block/<dev>/device/{serial,vpd_pg80} and only TrimSpace'd the result. vpd_pg80 is a binary SCSI VPD page with a 4-byte header, and some SSDs leak NUL/control bytes into the text serial file. Those bytes survive into the Go string, lowercase unchanged, and become a garbage map key that the reporter's cleaner read can't match. Sanitize to ASCII-printable range at ingest. 2. probeGPUs built the model slug from fields[2] + " " + fields[3] of `lspci -mm -nnk` output. fields[3] is subsystem vendor/device info, which varies between otherwise-identical cards and carries the `-rXX` revision marker: stable enough for display but not for identity. Use fields[2] alone, strip the trailing `[NNNN]` PCI device-ID that lspci -nn appends, and sanitize for consistency. After deploying the new orchestrator + re-running the configure step on each registered host, SpecValidate will match cleanly. Disk diffs self-resolve because the reporter already stored clean serials; GPU diffs need one reporter re-run because the old expected slug still carries subsystem noise.	2026-04-18 16:06:18 -04:00
josh	4524ab8dc0	runs: add non-destructive flag + operator Cancel button [CI / Lint + build + test (push): successful in 2m5s; Release / release (push): successful in 3m5s] Non-destructive pre-declares "don't touch the disks" on Start: the Storage stage skips wipe-probe, badblocks -w, and write-mode fio, and reports a read-only summary. Runs gain a new non_destructive column, threaded through Claim → agent tests.Deps → Storage stage. Cancel halts an in-flight run. The orchestrator transitions to a new StateCancelled via TriggerOperatorCancelled (valid from any active state); the agent's next heartbeat returns cmd=cancel_stage, which fires a stored CancelFunc on the per-stage context. Stage subprocesses spawned with exec.CommandContext die with the context, the agent posts a cancelled outcome, then powers the host off. Destructive stages mid-run may leave the host in an intermediate state; the UI confirm dialog warns the operator, and recovery is manual for now. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 13:01:42 -04:00
josh	1694c20b12	Host detail v2: full pipeline + per-stage logs + WoL diagnostics [CI / Lint + build + test (push): cancelled] Pipeline now always renders all 13 nodes (3 pre-stage + 9 stage + Completed), synthesising ghosts from run state when stage rows aren't seeded yet. Makes a WaitingWoL host show the full timeline ahead of it instead of just 4 dots. Agent tags each log line with its stage; logs.Hub fans out to both log-{runID} and log-{runID}-{stage} SSE events so the detail page can show per-stage tabs with a pure-CSS radio-sibling switch. Flat run log prepends [stage] so grep still works. Dispatcher writes picked/sent-WoL/heartbeat lines into the per-run log, so the operator opens the detail page, sees WaitingWoL stuck, and reads exactly what the dispatcher did and why nothing's progressing, instead of having to tail journalctl on the LXC. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 00:38:27 -04:00
josh	a0c0fb114f	Add host-mode heartbeat: vetting-agent host + last-seen badge [CI / Lint + build + test (push): cancelled] vetting-agent gains a `host` subcommand that runs as a systemd service installed by the quick-register one-liner, POSTing every 30s to /api/v1/hosts/{mac}/heartbeat so the dashboard tile shows "online" or "Nm ago" without waiting on WoL. Ships dormant client code for the Phase 2 reboot_for_vetting command so the server can flip it on later without a binary redeploy. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-17 23:34:15 -04:00
josh	9bb4b09a04	Initial commit: full Phases 1-6 implementation [CI / Lint + build + test (push): cancelled] Post-repair hardware validation pipeline for Proxmox cluster hosts. Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.	2026-04-17 21:32:10 -04:00
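
Commit 21014c1268 describes parsing `lspci -mm` lines in two layouts: a bare PCI address followed by quoted class/vendor/device fields, or the three quoted fields alone. The repository's parseLspciMMLine is not shown on this page, so the following is only a minimal sketch of that idea. The names `splitMM`, `parseLine`, and `pciLine` are hypothetical, the sample line and its numeric IDs are illustrative, and backslash-escaped quotes inside fields are deliberately not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// token is one field of an lspci -mm line. Quoted records whether the
// field was wrapped in double quotes (hypothetical helper type).
type token struct {
	Text   string
	Quoted bool
}

// splitMM splits a line on whitespace, keeping each double-quoted run
// as a single field. Simplification: escaped quotes are not handled.
func splitMM(line string) []token {
	var out []token
	i := 0
	for i < len(line) {
		switch {
		case line[i] == ' ' || line[i] == '\t':
			i++
		case line[i] == '"':
			end := strings.IndexByte(line[i+1:], '"')
			if end < 0 { // unterminated quote: take the rest of the line
				return append(out, token{line[i+1:], true})
			}
			out = append(out, token{line[i+1 : i+1+end], true})
			i += end + 2 // skip past the closing quote
		default:
			end := strings.IndexAny(line[i:], " \t")
			if end < 0 {
				end = len(line) - i
			}
			out = append(out, token{line[i : i+end], false})
			i += end
		}
	}
	return out
}

// pciLine holds the fields this sketch extracts from one line.
type pciLine struct {
	Addr, Class, Vendor, Device string
}

// parseLine accepts both layouts: a leading unquoted token is treated
// as the PCI address; the next three quoted tokens are class, vendor,
// and device, in that order.
func parseLine(line string) (pciLine, bool) {
	toks := splitMM(line)
	var p pciLine
	if len(toks) > 0 && !toks[0].Quoted {
		p.Addr = toks[0].Text
		toks = toks[1:]
	}
	if len(toks) < 3 || !toks[0].Quoted || !toks[1].Quoted || !toks[2].Quoted {
		return pciLine{}, false
	}
	p.Class, p.Vendor, p.Device = toks[0].Text, toks[1].Text, toks[2].Text
	return p, true
}

func main() {
	// Illustrative input only; real output varies by host.
	line := `0000:00:02.0 "VGA compatible controller [0300]" "Intel Corporation [8086]" "Alder Lake-N [UHD Graphics] [46d0]" -r00 "Intel Corporation [8086]" "Device [7270]"`
	if p, ok := parseLine(line); ok {
		fmt.Println("vendor:", p.Vendor)
		fmt.Println("device:", p.Device)
	}
}
```

Keying on whether the first token is quoted, rather than on a fixed index, is one way to avoid the vendor/device off-by-one that the commit message describes.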

13 Commits
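
Commit 27098fc7ed caps the RAM pass at MemAvailable − 1.5 GiB with a 256 MiB floor and mentions a parseMemAvailable helper. The actual implementation is not shown here, so this is a rough sketch under assumptions: the function signatures, the constant names, and the `vmBytesCap` helper are hypothetical, and the meminfo text is taken as a string rather than read from /proc/meminfo.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

const (
	headroomBytes uint64 = 1536 << 20 // 1.5 GiB left for the agent and OS
	floorBytes    uint64 = 256 << 20  // never request less than 256 MiB
)

// parseMemAvailable extracts the MemAvailable value from meminfo-style
// text and returns it in bytes. The kernel reports the value in kB
// (units of 1024 bytes).
func parseMemAvailable(meminfo string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "MemAvailable:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, fmt.Errorf("malformed MemAvailable: %w", err)
			}
			return kb * 1024, nil
		}
	}
	return 0, errors.New("MemAvailable not found")
}

// vmBytesCap returns available memory minus the headroom, clamped to
// the floor so tiny hosts still get a non-trivial test size.
func vmBytesCap(avail uint64) uint64 {
	if avail <= headroomBytes+floorBytes {
		return floorBytes
	}
	return avail - headroomBytes
}

func main() {
	// Illustrative meminfo excerpt, not captured from a real host.
	sample := "MemTotal:        8000000 kB\nMemAvailable:    4194304 kB\n"
	avail, err := parseMemAvailable(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("--vm-bytes %d\n", vmBytesCap(avail))
}
```

Comparing against headroom plus floor before subtracting keeps the unsigned arithmetic from wrapping on hosts with less than 1.5 GiB available.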