deep profile + threshold gating + firmware stage + Burn super-stage
CI / Lint + build + test (push) Failing after 1m57s
Release / release (push) Has been cancelled

Ships all five phases of the deep-profile overhaul together. Runs now
carry a profile (quick/deep/soak); every profile walks the same
11-stage order — Inventory → Firmware → SpecValidate → SMART →
CPUStress → Storage → Network → Burn → GPU → PSU → Reporting —
with only per-stage durations and concurrency scaled.

Phase 1: profiles.ProfileRegistry loaded from vetting.yaml; runs.profile
column + CreateWithProfile; threshold table + evaluator seeded per-run
from the shared vetting.thresholds block; breach flips result at
/sensor + /result.

Phase 2: upgraded CPUStress (stress-ng --cpu-method=all --verify +
EDAC/MCE poll), Storage (fio --verify=md5 + SMART start/end delta),
Network (sustained iperf + /proc/net/dev deltas) with per-profile
knobs from Deps.

Phase 3: Burn super-stage with goroutine fan-out for CPU + memory +
fio + iperf, PSU rails sampled across the Burn window, SensorMux
(2 s flush, 500-sample cap) to absorb backpressure.

Phase 4: Firmware stage + firmware_snapshots table; probes dmidecode
(BIOS), ipmitool (BMC), ethtool -i (NIC), nvme (sysfs + id-ctrl),
lspci (HBA), /proc/cpuinfo (microcode). spec.DiffFirmware folds into
SpecValidate with pin-by-identifier and fan-out-across-component
matching; mismatches park the run in FailedHolding.

Phase 5: profile radio on the host start form, profile chip on the
run header, Firmware section in the HTML report, coverage artifact
uploaded from CI, agent/tests/fakes/ scaffold with Deps.LookPath
seam + stress_ng and dmidecode example fakes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-04-18 22:50:57 -04:00
parent fbb21cbafd
commit 23c689aa5b
60 changed files with 5911 additions and 527 deletions
+57
View File
@@ -59,6 +59,11 @@ func (o Outcome) MarshalSummary() (json.RawMessage, error) {
// Deps bundles what stages need without pulling in the whole agent.
// Logger methods print to stdout + forward to the orchestrator; Sensor
// drops numeric samples; OverrideFlags carries operator-set bypasses.
//
// CPUStressKnobs / StorageKnobs / NetworkKnobs are Phase-2 profile
// knobs. Zero-valued fields mean "fall back to the compile-time
// default" — that keeps the stages runnable even when the runner can't
// materialize a profile (tests, legacy orchestrator, etc).
type Deps struct {
Info func(string)
Warn func(string)
@@ -68,6 +73,58 @@ type Deps struct {
NonDestructive bool // skip wipe-probe + writes in Storage
ExpectedDisks []ExpectedDisk // serials + sizes from host.expected_spec
StageTimeout time.Duration
CPUStressKnobs CPUStressKnobs
StorageKnobs StorageKnobs
NetworkKnobs NetworkKnobs
BurnKnobs BurnKnobs
// LookPath is the unit-test seam for swapping a real external
// binary (stress-ng, fio, iperf3, dmidecode, …) for a fake. When
// nil the stage falls back to os/exec.LookPath — production and
// existing tests keep working unchanged. Tests under
// agent/tests/fakes/ populate this to redirect lookups to a built
// fake binary in a tempdir.
LookPath func(name string) (string, error)
}
// CPUStressKnobs parameterizes the CPUStress stage. Zero durations fall
// back to the package's compile-time defaults (cpuPassDuration etc).
type CPUStressKnobs struct {
CPUPass time.Duration
MemPass time.Duration
EDACPoll time.Duration
}
// StorageKnobs parameterizes the Storage stage. Mode picks between
// "fio_sample" (bounded tempfile inside the device, quick profile) and
// "full_disk" (whole-device write verify, deep/soak). Empty strings
// fall back to the stage's safe defaults.
type StorageKnobs struct {
Mode string
FioSize string
FioTime time.Duration
FioBS string
FioRW string
Verify string
}
// NetworkKnobs parameterizes the Network stage.
type NetworkKnobs struct {
Duration time.Duration
}
// BurnKnobs parameterizes the Burn super-stage. Duration is the total
// Burn window; sub-workloads run concurrently inside that window.
// CPUWorkers is "all" (runtime.NumCPU) or a numeric string. MemPct is a
// percentage of MemAvailable to allocate for the memory burner (clamped
// 0-90 by the stage). IperfParallel feeds iperf3 -P to generate sustained
// NIC load. FioOnSpare gates the storage sub-workload: true = fio runs
// against the allow-listed disks for the same window; false = skip fio.
type BurnKnobs struct {
Duration time.Duration
CPUWorkers string
MemPct int
FioOnSpare bool
IperfParallel int
}
// Sample mirrors the server's SensorSample but lives in the tests