deep profile + threshold gating + firmware stage + Burn super-stage

Ships all five phases of the deep-profile overhaul together. Runs now carry a profile (quick/deep/soak); every profile walks the same 11-stage order (Inventory → Firmware → SpecValidate → SMART → CPUStress → Storage → Network → Burn → GPU → PSU → Reporting), with only per-stage durations and concurrency scaled.

Phase 1: profiles.ProfileRegistry loaded from vetting.yaml; runs.profile column + CreateWithProfile; threshold table + evaluator seeded per-run from the shared vetting.thresholds block; breach flips result at /sensor + /result.

Phase 2: upgraded CPUStress (stress-ng --cpu-method=all --verify + EDAC/MCE poll), Storage (fio --verify=md5 + SMART start/end delta), Network (sustained iperf + /proc/net/dev deltas) with per-profile knobs from Deps.

Phase 3: Burn super-stage with goroutine fan-out for CPU + memory + fio + iperf, PSU rails sampled across the Burn window, SensorMux (2 s flush, 500-sample cap) to absorb backpressure.

Phase 4: Firmware stage + firmware_snapshots table; probes dmidecode (BIOS), ipmitool (BMC), ethtool -i (NIC), nvme (sysfs + id-ctrl), lspci (HBA), /proc/cpuinfo (microcode). spec.DiffFirmware folds into SpecValidate with pin-by-identifier and fan-out-across-component matching; mismatches park the run in FailedHolding.

Phase 5: profile radio on the host start form, profile chip on the run header, Firmware section in the HTML report, coverage artifact uploaded from CI, agent/tests/fakes/ scaffold with Deps.LookPath seam + stress_ng and dmidecode example fakes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -0,0 +1,70 @@
package probes

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// EDACSample is one counter reading from /sys/devices/system/edac/mc/.
// Kind is "edac_ce" (correctable ECC errors) or "edac_ue"
// (uncorrectable, always a critical signal). Key identifies the memory
// controller (e.g. "mc0"). Value is the cumulative count since boot;
// the threshold evaluator flags it the moment it exceeds 0.
type EDACSample struct {
	Kind  string
	Key   string
	Value float64
	Unit  string
}

// EDAC returns one EDACSample per (memory-controller × {ce,ue}) pair
// that /sys exposes. Returns an empty slice when EDAC isn't available
// (virtualized host, missing kernel driver, mdadm-style boards without
// a controller node); callers treat an empty return as "no data",
// not "passed". Errors are swallowed for the same reason: a hot-
// swapped DIMM that makes /sys blink briefly shouldn't fail the stage
// before the real counter can be read.
//
// This is intentionally small: the sidecar polls periodically, so one
// bad read is recovered on the next tick. The counters are monotonic,
// so emitting the current raw value is correct.
func EDAC() []EDACSample {
	root := "/sys/devices/system/edac/mc"
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil
	}
	var out []EDACSample
	for _, e := range entries {
		name := e.Name()
		if !strings.HasPrefix(name, "mc") {
			continue
		}
		base := filepath.Join(root, name)
		if ce, ok := readCount(filepath.Join(base, "ce_count")); ok {
			out = append(out, EDACSample{Kind: "edac_ce", Key: name, Value: ce, Unit: "count"})
		}
		if ue, ok := readCount(filepath.Join(base, "ue_count")); ok {
			out = append(out, EDACSample{Kind: "edac_ue", Key: name, Value: ue, Unit: "count"})
		}
	}
	return out
}

// readCount reads a single decimal integer from a sysfs file and
// returns it as a float. Returns (0, false) on any failure so callers
// can skip the sample without a diagnostic.
func readCount(path string) (float64, bool) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, false
	}
	s := strings.TrimSpace(string(b))
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, false
	}
	return float64(n), true
}