`lspci -D -mm -nn` prefixes every line with the PCI address as a bare
token before the three quoted class/vendor/device fields, so the
device name sits at fields[3] — not fields[2], which is the vendor.
The probe was indexing [2] and recording every GPU's model as its
vendor string ("Intel Corporation" instead of "Alder Lake-N [UHD
Graphics]"), which made SpecValidate report a mismatch on every real
host once the expected spec named the device.
Extract the per-line parse into parseLspciMMLine, handle both the
modern -D layout (addr + class/vendor/device) and the legacy
layout without an address prefix (class/vendor/device), and cover
both paths plus the non-GPU-class skip in inventory_test.go.
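For reference, a minimal sketch of the two-layout parse described
above. This is not the committed parseLspciMMLine: the names gpuLine,
splitMM and parseMMLine are hypothetical, the sample line in main is
illustrative, and the GPU test (class code starting with 03, the PCI
display-controller base class) is an assumption about how the probe
filters. Backslash escapes inside quoted fields are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// gpuLine is a hypothetical result type for this sketch.
type gpuLine struct {
	Addr, Class, Vendor, Device string
}

// splitMM tokenizes one `lspci -mm` line: whitespace-separated tokens,
// with double-quoted tokens kept whole and their quotes removed.
func splitMM(line string) []string {
	var fields []string
	var cur strings.Builder
	inQuote, have := false, false
	for _, r := range line {
		switch {
		case r == '"':
			inQuote = !inQuote
			have = true // an empty "" still counts as a field
		case (r == ' ' || r == '\t') && !inQuote:
			if have {
				fields = append(fields, cur.String())
				cur.Reset()
				have = false
			}
		default:
			cur.WriteRune(r)
			have = true
		}
	}
	if have {
		fields = append(fields, cur.String())
	}
	return fields
}

// parseMMLine accepts both layouts: a bare address token followed by
// class/vendor/device, or class/vendor/device alone. It returns false
// for short lines and for non-display classes (assumes -nn output).
func parseMMLine(line string) (gpuLine, bool) {
	line = strings.TrimSpace(line)
	f := splitMM(line)
	addr, off := "", 0
	if len(f) > 0 && !strings.HasPrefix(line, `"`) {
		addr, off = f[0], 1 // first token is unquoted: address prefix
	}
	if len(f) < off+3 {
		return gpuLine{}, false
	}
	if !strings.Contains(f[off], "[03") {
		return gpuLine{}, false // not a display-controller class
	}
	return gpuLine{Addr: addr, Class: f[off], Vendor: f[off+1], Device: f[off+2]}, true
}

func main() {
	g, ok := parseMMLine(`0000:00:02.0 "VGA compatible controller [0300]" "Intel Corporation [8086]" "Alder Lake-N [UHD Graphics] [46d1]" -r00 "" ""`)
	fmt.Println(ok, g.Addr, g.Device)
}
```

Detecting the layout from the leading quote rather than from the field
count keeps the index arithmetic in one place (the offset), so the
device is always read relative to the class field.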
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extend the Inventory stage from a one-line summary to a per-probe
substep emitter that logs roughly 20-30 narrative lines per run.
- spec: per-DIMM memory (slot/size/speed/manufacturer/part_number),
richer CPU (vendor/stepping/physical_cores/flags), disk
model/transport/rotational, NIC driver/pci_addr, GPU vram/pci/driver,
new System/Baseboard/PSU/OS top-level sections. All fields omitempty
so existing expected-spec YAML and artifacts stay compatible.
- spec.Diff: new diffDIMMs/diffSystem/diffBaseboard/diffPSU/diffOS
helpers; extended diffDisks/diffNICs/diffGPUs for new fields. GPU
diff gains PCIAddr-pinned matching alongside count-by-model.
- agent/probes/inventory: CPU (/proc/cpuinfo extended), Memory
(dmidecode -t 17 multi-block), Disks (+model/transport/rotational),
NICs (+driver/pci from sysfs), GPUs (VRAM from lspci -vv),
new System/Baseboard (dmidecode -t system/baseboard), PSU
(dmidecode -t 39), OS (/proc/sys/kernel/osrelease + /etc/os-release).
All probes accept a Logger and emit per-finding info/warn lines.
- agent/probes/firmware: parseDmidecodeAllSections for multi-block
fixtures (memory / PSU).
- agent/runner: Inventory case becomes 9 substep rows (CPU / Memory /
Disks / NICs / GPUs / System / Baseboard / PSU / OS) with per-probe
start/complete timestamps.
- report: new Inventory HTML section between Stages and Firmware;
resolveReporting loads the inventory.json artifact.
- agent/tests/fakes/dmidecode: dispatches on -t flag to serve bios /
memory / system / baseboard / 39 fixtures for unit tests.
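As an illustration of the multi-block parsing mentioned above for
memory and PSU output, here is a rough sketch. It is not the committed
parseDmidecodeAllSections: splitBlocks and the sample text are
hypothetical, and the sketch assumes each record starts with a
"Handle " line and carries single-line "Key: Value" pairs (multi-line
list values are ignored).

```go
package main

import (
	"fmt"
	"strings"
)

// sample is illustrative text in the shape of `dmidecode -t 17` output.
const sample = "Handle 0x0040, DMI type 17, 40 bytes\n" +
	"Memory Device\n" +
	"\tSize: 16 GB\n" +
	"\tLocator: DIMM_A1\n" +
	"\tSpeed: 3200 MT/s\n" +
	"\n" +
	"Handle 0x0041, DMI type 17, 40 bytes\n" +
	"Memory Device\n" +
	"\tSize: No Module Installed\n" +
	"\tLocator: DIMM_A2\n"

// splitBlocks returns one key/value map per "Handle ..." record.
// Lines before the first handle (the dmidecode banner) are skipped.
func splitBlocks(out string) []map[string]string {
	var blocks []map[string]string
	var cur map[string]string
	for _, line := range strings.Split(out, "\n") {
		if strings.HasPrefix(line, "Handle ") {
			cur = map[string]string{}
			blocks = append(blocks, cur)
			continue
		}
		if cur == nil {
			continue
		}
		// Cut at the first colon so values such as "3200 MT/s" stay whole.
		key, val, ok := strings.Cut(strings.TrimSpace(line), ":")
		if !ok {
			continue // section title or blank line
		}
		cur[key] = strings.TrimSpace(val)
	}
	return blocks
}

func main() {
	for i, b := range splitBlocks(sample) {
		fmt.Println(i, b["Locator"], b["Size"])
	}
}
```

Returning every block, including empty slots, leaves the decision to
skip "No Module Installed" entries to the caller.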
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mkosi.conf: add ipmitool, ethtool, nvme-cli so the Firmware stage
can actually read BMC revisions, NIC firmware versions, and fall
back to nvme-cli when sysfs firmware_rev is missing.
firmware.go: probeNICFirmware and probeHBAFirmware now return
(snapshots, warning) so a missing ethtool/lspci surfaces in the
stage log the same way probeBIOS/probeBMC already do. Previously, a
host without ethtool silently reported "bios=1 nvme_fw=1
microcode=1" with no hint that NIC coverage had been dropped.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ships all five phases of the deep-profile overhaul together. Runs now
carry a profile (quick/deep/soak); every profile walks the same
11-stage order — Inventory → Firmware → SpecValidate → SMART →
CPUStress → Storage → Network → Burn → GPU → PSU → Reporting —
with only per-stage durations and concurrency scaled.
Phase 1: profiles.ProfileRegistry loaded from vetting.yaml; runs.profile
column + CreateWithProfile; threshold table + evaluator seeded per-run
from the shared vetting.thresholds block; breach flips result at
/sensor + /result.
Phase 2: upgraded CPUStress (stress-ng --cpu-method=all --verify +
EDAC/MCE poll), Storage (fio --verify=md5 + SMART start/end delta),
Network (sustained iperf + /proc/net/dev deltas) with per-profile
knobs from Deps.
Phase 3: Burn super-stage with goroutine fan-out for CPU + memory +
fio + iperf, PSU rails sampled across the Burn window, SensorMux
(2 s flush, 500-sample cap) to absorb backpressure.
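A rough sketch of a capped, periodically flushed sample buffer in the
spirit of the SensorMux above. This is not the shipped type: Sample,
Mux, NewMux and the drop-oldest policy on overflow are assumptions
made for illustration; only the 2 s interval and 500-sample cap come
from this change.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Sample is a hypothetical sensor reading.
type Sample struct {
	Name  string
	Value float64
}

// Mux buffers samples from concurrent producers up to a fixed limit.
type Mux struct {
	mu      sync.Mutex
	buf     []Sample
	limit   int
	dropped int
}

func NewMux(limit int) *Mux { return &Mux{limit: limit} }

// Add never blocks the producer. When the buffer is full it drops the
// oldest sample (assumed policy) and counts the drop.
func (m *Mux) Add(s Sample) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if len(m.buf) >= m.limit {
		m.buf = m.buf[1:]
		m.dropped++
	}
	m.buf = append(m.buf, s)
}

// Flush hands back the buffered batch and resets the buffer.
func (m *Mux) Flush() []Sample {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := m.buf
	m.buf = nil
	return out
}

// Run flushes on every tick and once more when ctx is cancelled.
func (m *Mux) Run(ctx context.Context, every time.Duration, send func([]Sample)) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			if b := m.Flush(); len(b) > 0 {
				send(b)
			}
			return
		case <-t.C:
			if b := m.Flush(); len(b) > 0 {
				send(b)
			}
		}
	}
}

func main() {
	m := NewMux(500)
	for i := 0; i < 600; i++ {
		m.Add(Sample{Name: "psu_12v", Value: float64(i)})
	}
	fmt.Println(len(m.Flush()), m.dropped)
	// In the agent this would run as: go m.Run(ctx, 2*time.Second, send)
}
```

Capping the buffer bounds memory when the upload path stalls, at the
cost of losing the oldest readings in that window.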
Phase 4: Firmware stage + firmware_snapshots table; probes dmidecode
(BIOS), ipmitool (BMC), ethtool -i (NIC), nvme (sysfs + id-ctrl),
lspci (HBA), /proc/cpuinfo (microcode). spec.DiffFirmware folds into
SpecValidate with pin-by-identifier and fan-out-across-component
matching; mismatches park the run in FailedHolding.
Phase 5: profile radio on the host start form, profile chip on the
run header, Firmware section in the HTML report, coverage artifact
uploaded from CI, agent/tests/fakes/ scaffold with Deps.LookPath
seam + stress_ng and dmidecode example fakes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two related bugs were producing different map keys for identical
hardware depending on whether the inventory probe ran in the reporter
on the Proxmox host or in the live-image agent after PXE boot.
1. diskSerial read /sys/block/<dev>/device/{serial,vpd_pg80} and only
TrimSpace'd the result. vpd_pg80 is a binary SCSI VPD page with a
4-byte header, and some SSDs leak NUL/control bytes into the text
serial file. Those bytes survive into the Go string, pass through
lowercasing unchanged, and become a garbage map key that never matches
the reporter's cleaner read. Sanitize to ASCII-printable range at ingest.
2. probeGPUs built the model slug from fields[2] + " " + fields[3] of
`lspci -mm -nnk` output. fields[3] is subsystem vendor/device info,
which varies between otherwise-identical cards and carries the
`-rXX` revision marker: stable enough for display but not for
identity. Use fields[2] alone, strip the trailing `[NNNN]` PCI
device-ID that lspci -nn appends, and sanitize for consistency.
After deploying the new orchestrator and re-running the configure step
on each registered host, SpecValidate will match cleanly. Disk diffs
self-resolve because the reporter already stored clean serials; GPU
diffs need one reporter re-run because the old expected slug still
carries subsystem noise.
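For illustration, the two normalizations described in items 1 and 2
could look roughly like this. sanitizeASCII and gpuModel are
hypothetical names, not the committed helpers, and the sample inputs
are made up. The byte filter alone does not parse vpd_pg80: a header
byte that happens to fall in the printable range would survive, so a
real reader would likely skip the 4-byte header explicitly.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sanitizeASCII keeps only printable ASCII (0x20..0x7e) and trims the
// surrounding spaces, so NUL and control bytes never reach a map key.
func sanitizeASCII(raw string) string {
	var b strings.Builder
	for i := 0; i < len(raw); i++ {
		if c := raw[i]; c >= 0x20 && c <= 0x7e {
			b.WriteByte(c)
		}
	}
	return strings.TrimSpace(b.String())
}

// trailingID matches the "[NNNN]" hex device ID that `lspci -nn`
// appends to a name, anchored at the end of the string.
var trailingID = regexp.MustCompile(`\s*\[[0-9a-fA-F]{4}\]\s*$`)

// gpuModel strips the trailing device ID and sanitizes the rest.
// Bracketed text earlier in the name, such as "[UHD Graphics]", is
// left alone because it is not at the end of the string.
func gpuModel(device string) string {
	return sanitizeASCII(trailingID.ReplaceAllString(device, ""))
}

func main() {
	fmt.Printf("%q\n", sanitizeASCII("\x00\x80\x00\x09SERIAL123\x00"))
	fmt.Printf("%q\n", gpuModel("Alder Lake-N [UHD Graphics] [46d1]"))
}
```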