Vetting/agent/probes/edac.go
josh 23c689aa5b
deep profile + threshold gating + firmware stage + Burn super-stage
Ships all five phases of the deep-profile overhaul together. Runs now
carry a profile (quick/deep/soak); every profile walks the same
11-stage order — Inventory → Firmware → SpecValidate → SMART →
CPUStress → Storage → Network → Burn → GPU → PSU → Reporting —
with only per-stage durations and concurrency scaled.

Phase 1: profiles.ProfileRegistry loaded from vetting.yaml; runs.profile
column + CreateWithProfile; threshold table + evaluator seeded per-run
from the shared vetting.thresholds block; breach flips result at
/sensor + /result.
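
The breach rule described in Phase 1 can be pictured with a small sketch. Everything here is an assumption made for illustration: the `threshold` and `evaluator` types, the `Breached` method, and the idea that a limit is a simple per-kind maximum are hypothetical and are not taken from the actual threshold table or evaluator.

```go
package main

import "fmt"

// threshold is a hypothetical row of a per-run threshold table:
// a sample kind and the largest value that still counts as healthy.
type threshold struct {
	Kind string
	Max  float64
}

// evaluator is a hypothetical per-run evaluator seeded from a list of
// thresholds. It is not the project's real type.
type evaluator struct {
	max map[string]float64
}

// newEvaluator copies the thresholds into a lookup map keyed by kind.
func newEvaluator(ts []threshold) *evaluator {
	m := make(map[string]float64, len(ts))
	for _, t := range ts {
		m[t.Kind] = t.Max
	}
	return &evaluator{max: m}
}

// Breached reports whether value exceeds the limit configured for kind.
// Kinds without a configured limit never breach.
func (e *evaluator) Breached(kind string, value float64) bool {
	limit, ok := e.max[kind]
	return ok && value > limit
}

func main() {
	// A limit of 0 means the first error already counts as a breach.
	ev := newEvaluator([]threshold{{Kind: "edac_ue", Max: 0}})
	fmt.Println(ev.Breached("edac_ue", 0), ev.Breached("edac_ue", 1))
}
```

With a maximum of 0, a counter-style sample breaches as soon as it reads 1, which matches the "flags it the moment it exceeds 0" wording used for EDAC samples in the probe file.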

Phase 2: upgraded CPUStress (stress-ng --cpu-method=all --verify +
EDAC/MCE poll), Storage (fio --verify=md5 + SMART start/end delta),
Network (sustained iperf + /proc/net/dev deltas) with per-profile
knobs from Deps.

Phase 3: Burn super-stage with goroutine fan-out for CPU + memory +
fio + iperf, PSU rails sampled across the Burn window, SensorMux
(2 s flush, 500-sample cap) to absorb backpressure.
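
One way to read "2 s flush, 500-sample cap" is a bounded buffer that producers append to without blocking and that a ticker drains in batches. The sketch below is illustrative only: the `mux` type, its methods, and in particular the drop-oldest policy when the cap is reached are assumptions, not the actual SensorMux behavior.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sample is a minimal stand-in for a sensor reading.
type sample struct {
	Kind  string
	Value float64
}

// mux is a hypothetical bounded sample buffer (not the real SensorMux).
type mux struct {
	mu      sync.Mutex
	buf     []sample
	limit   int // maximum number of buffered samples
	dropped int // samples discarded because the buffer was full
}

func newMux(limit int) *mux { return &mux{limit: limit} }

// Add never blocks the producer. ASSUMPTION: when the buffer is full the
// oldest sample is discarded to make room for the new one.
func (m *mux) Add(s sample) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if len(m.buf) >= m.limit {
		m.buf = m.buf[1:]
		m.dropped++
	}
	m.buf = append(m.buf, s)
}

// Drain returns everything buffered so far and empties the buffer.
func (m *mux) Drain() []sample {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := m.buf
	m.buf = nil
	return out
}

// Run sends one batch per tick until stop is closed, then sends whatever
// is left and returns. Empty batches are skipped.
func (m *mux) Run(every time.Duration, stop <-chan struct{}, send func([]sample)) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			if b := m.Drain(); len(b) > 0 {
				send(b)
			}
		case <-stop:
			if b := m.Drain(); len(b) > 0 {
				send(b)
			}
			return
		}
	}
}

func main() {
	m := newMux(500)
	stop, done := make(chan struct{}), make(chan struct{})
	go func() {
		m.Run(2*time.Second, stop, func(b []sample) { fmt.Println("flushed", len(b)) })
		close(done)
	}()
	for i := 0; i < 600; i++ {
		m.Add(sample{Kind: "cpu_temp", Value: float64(i)})
	}
	close(stop)
	<-done
}
```

The point of the design is that a slow upload path can never stall the stress workers: the cost of backpressure is bounded memory plus a count of dropped samples, rather than a blocked producer.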

Phase 4: Firmware stage + firmware_snapshots table; probes dmidecode
(BIOS), ipmitool (BMC), ethtool -i (NIC), nvme (sysfs + id-ctrl),
lspci (HBA), /proc/cpuinfo (microcode). spec.DiffFirmware folds into
SpecValidate with pin-by-identifier and fan-out-across-component
matching; mismatches park the run in FailedHolding.

Phase 5: profile radio on the host start form, profile chip on the
run header, Firmware section in the HTML report, coverage artifact
uploaded from CI, agent/tests/fakes/ scaffold with Deps.LookPath
seam + stress_ng and dmidecode example fakes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 22:50:57 -04:00

package probes

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// EDACSample is one counter reading from /sys/devices/system/edac/mc/.
// Kind is "edac_ce" (correctable ECC errors) or "edac_ue"
// (uncorrectable errors, always a critical signal). Key identifies the
// memory controller (e.g. "mc0"). Value is the cumulative count since
// the EDAC driver loaded or its counters were last reset; the threshold
// evaluator flags it the moment it exceeds 0. Unit is always "count".
type EDACSample struct {
	Kind  string
	Key   string
	Value float64
	Unit  string
}

// EDAC returns one EDACSample per (memory-controller × {ce,ue}) pair
// that /sys exposes. It returns nil when EDAC isn't available
// (virtualized host, missing kernel driver, boards that expose no
// controller node); callers treat an empty return as "no data", not
// "passed". Read and parse errors are swallowed for the same reason: a
// hot-swapped DIMM that makes /sys blink briefly shouldn't fail the
// stage before the real counter can be read.
//
// This is intentionally small: the sidecar polls periodically, so one
// bad read is recovered on the next tick. The counters only increase
// unless they are explicitly reset, so emitting the current raw value
// is correct.
func EDAC() []EDACSample {
	root := "/sys/devices/system/edac/mc"
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil
	}
	var out []EDACSample
	for _, e := range entries {
		name := e.Name()
		if !strings.HasPrefix(name, "mc") {
			continue
		}
		base := filepath.Join(root, name)
		if ce, ok := readCount(filepath.Join(base, "ce_count")); ok {
			out = append(out, EDACSample{Kind: "edac_ce", Key: name, Value: ce, Unit: "count"})
		}
		if ue, ok := readCount(filepath.Join(base, "ue_count")); ok {
			out = append(out, EDACSample{Kind: "edac_ue", Key: name, Value: ue, Unit: "count"})
		}
	}
	return out
}

// readCount reads a single decimal integer from a sysfs file and
// returns it as a float. Returns (0, false) on any failure so callers
// can skip the sample without a diagnostic.
func readCount(path string) (float64, bool) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, false
	}
	s := strings.TrimSpace(string(b))
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, false
	}
	return float64(n), true
}
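
Because the sysfs root is hard-coded in EDAC, the probe can only be exercised on a machine with a live EDAC driver. The sketch below shows one possible way to make the same logic testable by passing the root in and pointing it at a temporary directory. The names `edacFrom`, `fakeEDAC`, and `reading` are hypothetical and exist only for this example; they are not part of the package.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// reading is a local stand-in for EDACSample so the sketch compiles alone.
type reading struct {
	Kind  string
	Key   string
	Value float64
}

// edacFrom is a hypothetical variant of EDAC that takes the sysfs root as
// a parameter. Unreadable or unparseable counters are skipped silently.
func edacFrom(root string) []reading {
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil
	}
	counters := []struct{ file, kind string }{
		{"ce_count", "edac_ce"},
		{"ue_count", "edac_ue"},
	}
	var out []reading
	for _, e := range entries {
		name := e.Name()
		if !strings.HasPrefix(name, "mc") {
			continue
		}
		for _, c := range counters {
			b, err := os.ReadFile(filepath.Join(root, name, c.file))
			if err != nil {
				continue
			}
			n, err := strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
			if err != nil {
				continue
			}
			out = append(out, reading{Kind: c.kind, Key: name, Value: float64(n)})
		}
	}
	return out
}

// fakeEDAC builds a throwaway tree shaped like /sys/devices/system/edac/mc
// with a single controller, mc0. It returns the root and a cleanup func.
func fakeEDAC(ce, ue string) (string, func(), error) {
	root, err := os.MkdirTemp("", "edac")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.RemoveAll(root) }
	mc := filepath.Join(root, "mc0")
	if err := os.Mkdir(mc, 0o755); err != nil {
		cleanup()
		return "", nil, err
	}
	files := map[string]string{"ce_count": ce, "ue_count": ue}
	for name, body := range files {
		if err := os.WriteFile(filepath.Join(mc, name), []byte(body), 0o644); err != nil {
			cleanup()
			return "", nil, err
		}
	}
	return root, cleanup, nil
}

func main() {
	root, cleanup, err := fakeEDAC("2\n", "0\n")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()
	for _, r := range edacFrom(root) {
		fmt.Println(r.Kind, r.Key, r.Value)
	}
}
```

A fake tree like this also makes the "swallow errors" contract easy to check: writing non-numeric text into one counter file should drop only that sample and leave the other intact.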