deep profile + threshold gating + firmware stage + Burn super-stage
Ships all five phases of the deep-profile overhaul together. Runs now carry a profile (quick/deep/soak); every profile walks the same 11-stage order (Inventory → Firmware → SpecValidate → SMART → CPUStress → Storage → Network → Burn → GPU → PSU → Reporting), with only per-stage durations and concurrency scaled.

Phase 1: profiles.ProfileRegistry loaded from vetting.yaml; runs.profile column + CreateWithProfile; threshold table + evaluator seeded per-run from the shared vetting.thresholds block; breach flips result at /sensor + /result.

Phase 2: upgraded CPUStress (stress-ng --cpu-method=all --verify + EDAC/MCE poll), Storage (fio --verify=md5 + SMART start/end delta), Network (sustained iperf + /proc/net/dev deltas) with per-profile knobs from Deps.

Phase 3: Burn super-stage with goroutine fan-out for CPU + memory + fio + iperf, PSU rails sampled across the Burn window, SensorMux (2 s flush, 500-sample cap) to absorb backpressure.

Phase 4: Firmware stage + firmware_snapshots table; probes dmidecode (BIOS), ipmitool (BMC), ethtool -i (NIC), nvme (sysfs + id-ctrl), lspci (HBA), /proc/cpuinfo (microcode). spec.DiffFirmware folds into SpecValidate with pin-by-identifier and fan-out-across-component matching; mismatches park the run in FailedHolding.

Phase 5: profile radio on the host start form, profile chip on the run header, Firmware section in the HTML report, coverage artifact uploaded from CI, agent/tests/fakes/ scaffold with Deps.LookPath seam + stress_ng and dmidecode example fakes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -11,7 +11,10 @@ import (
 	"runtime"
 	"strconv"
 	"strings"
+	"sync"
 	"time"
+
+	"vetting/agent/probes"
 )
 
 // CPUStress runs stress-ng as two serial passes. The previous shape
@@ -55,11 +58,28 @@ func CPUStress(ctx context.Context, d Deps) Outcome {
 	extras := map[string]any{"cores": cores}
 	var subs []SubStepReport
 
+	// EDAC sidecar runs for the lifetime of the stage; cancelled on
+	// return. It polls /sys/devices/system/edac/mc/*/{ce,ue}_count and
+	// posts the current counters so the server-side threshold evaluator
+	// can gate edac_ue > 0 → fail the run. Zero-valued poll falls back
+	// to 10s, the same cadence rasdaemon uses by default.
+	sideCtx, sideCancel := context.WithCancel(ctx)
+	defer sideCancel()
+	var sideWG sync.WaitGroup
+	sideWG.Add(1)
+	go runEDACSidecar(sideCtx, &sideWG, d)
+
+	// Per-profile durations come from Deps; zero values (missing knobs
+	// or legacy orchestrator) fall back to the package default so the
+	// stage always has a defined budget.
+	cpuDur := nonzeroDur(d.CPUStressKnobs.CPUPass, cpuPassDuration)
+	memDur := nonzeroDur(d.CPUStressKnobs.MemPass, memPassDuration)
+
 	// Pass 1: CPU
-	cpu := runStressPass(ctx, d, "CPU", cpuPassDuration, []string{
+	cpu := runStressPass(ctx, d, "CPU", cpuDur, []string{
 		"--cpu", strconv.Itoa(cores),
 		"--cpu-method", "all",
-		"--timeout", durationSeconds(cpuPassDuration),
+		"--timeout", durationSeconds(cpuDur),
 		"--metrics-brief",
 		"--verify",
 	})
@@ -104,11 +124,11 @@ func CPUStress(ctx context.Context, d Deps) Outcome {
 			SubSteps: subs,
 		}
 	}
-	mem := runStressPass(ctx, d, "memory", memPassDuration, []string{
+	mem := runStressPass(ctx, d, "memory", memDur, []string{
 		"--vm", "1",
 		"--vm-bytes", strconv.FormatInt(cap, 10),
 		"--vm-keep",
-		"--timeout", durationSeconds(memPassDuration),
+		"--timeout", durationSeconds(memDur),
 		"--metrics-brief",
 		"--verify",
 	})
@@ -133,6 +153,64 @@ func CPUStress(ctx context.Context, d Deps) Outcome {
 	}
 }
 
+// runEDACSidecar polls /sys EDAC counters on d.CPUStressKnobs.EDACPoll
+// cadence (or 10s fallback) for the lifetime of the stage ctx, emitting
+// one sample per (memory-controller × {ce,ue}) pair on each tick. A
+// single failing read is tolerated: the next tick picks up the counter.
+//
+// This is where the critical edac_ue threshold becomes a hard-fail: as
+// soon as a UE counter advances past 0, the server-side evaluator trips
+// and flips the run into FailedHolding. The sidecar emits whether or
+// not stress-ng is still running; that keeps the signal live during
+// inter-pass gaps.
+//
+// MCE counts are intentionally not sampled here: they require
+// rasdaemon or mcelog and vary by live-image packaging. The threshold
+// rule for mce stays seeded (so the DB shape is stable) but only fires
+// once a matching kind lands, which is a follow-up.
+func runEDACSidecar(ctx context.Context, wg *sync.WaitGroup, d Deps) {
+	defer wg.Done()
+	if d.Sensor == nil {
+		return
+	}
+	poll := d.CPUStressKnobs.EDACPoll
+	if poll <= 0 {
+		poll = 10 * time.Second
+	}
+	t := time.NewTicker(poll)
+	defer t.Stop()
+	for {
+		select {
+		case <-ctx.Done():
+			return
+		case <-t.C:
+			edac := probes.EDAC()
+			if len(edac) == 0 {
+				continue
+			}
+			batch := make([]Sample, 0, len(edac))
+			for _, s := range edac {
+				batch = append(batch, Sample{Kind: s.Kind, Key: s.Key, Value: s.Value, Unit: s.Unit})
+			}
+			sendCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
+			if err := d.Sensor(sendCtx, batch); err != nil {
+				d.Warn("CPUStress: edac sample post: " + err.Error())
+			}
+			cancel()
+		}
+	}
+}
+
+// nonzeroDur picks override over fallback, but only when override is
+// strictly positive. Lets callers pass a zero-value duration to mean
+// "no override; use fallback" without a separate ok return.
+func nonzeroDur(override, fallback time.Duration) time.Duration {
+	if override > 0 {
+		return override
+	}
+	return fallback
+}
+
 // subStepFromPass projects a stressPass into a SubStepReport, shared by
 // both passes and by the mid-stage early-return paths so the UI always
 // sees exactly one row per pass, even on failure.