deep profile + threshold gating + firmware stage + Burn super-stage

Ships all five phases of the deep-profile overhaul together. Runs now carry a profile (quick/deep/soak); every profile walks the same 11-stage order (Inventory → Firmware → SpecValidate → SMART → CPUStress → Storage → Network → Burn → GPU → PSU → Reporting), with only per-stage durations and concurrency scaled.

Phase 1: profiles.ProfileRegistry loaded from vetting.yaml; runs.profile column + CreateWithProfile; threshold table + evaluator seeded per-run from the shared vetting.thresholds block; breach flips result at /sensor + /result.

Phase 2: upgraded CPUStress (stress-ng --cpu-method=all --verify + EDAC/MCE poll), Storage (fio --verify=md5 + SMART start/end delta), Network (sustained iperf + /proc/net/dev deltas) with per-profile knobs from Deps.

Phase 3: Burn super-stage with goroutine fan-out for CPU + memory + fio + iperf, PSU rails sampled across the Burn window, SensorMux (2 s flush, 500-sample cap) to absorb backpressure.

Phase 4: Firmware stage + firmware_snapshots table; probes dmidecode (BIOS), ipmitool (BMC), ethtool -i (NIC), nvme (sysfs + id-ctrl), lspci (HBA), /proc/cpuinfo (microcode). spec.DiffFirmware folds into SpecValidate with pin-by-identifier and fan-out-across-component matching; mismatches park the run in FailedHolding.

Phase 5: profile radio on the host start form, profile chip on the run header, Firmware section in the HTML report, coverage artifact uploaded from CI, agent/tests/fakes/ scaffold with Deps.LookPath seam + stress_ng and dmidecode example fakes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 22:50:57 -04:00
parent fbb21cbafd
commit 23c689aa5b
60 changed files with 5911 additions and 527 deletions
@@ -0,0 +1,58 @@
+package tests
+
+import (
+	"runtime"
+	"testing"
+)
+
+// TestResolveCPUWorkers covers the parse branches: empty or "all" (any
+// case) falls back to NumCPU, a positive integer is used verbatim, and
+// zero, negative, or garbage input also falls back to NumCPU. Zero
+// workers would make stress-ng a no-op and silently defeat Burn's CPU load.
+func TestResolveCPUWorkers(t *testing.T) {
+	np := runtime.NumCPU()
+	cases := []struct {
+		name string
+		in   string
+		want int
+	}{
+		{"empty defaults to NumCPU", "", np},
+		{"all defaults to NumCPU", "all", np},
+		{"ALL is case-insensitive", "ALL", np},
+		{"explicit integer", "3", 3},
+		{"negative falls back", "-1", np},
+		{"zero falls back", "0", np},
+		{"garbage falls back", "lots", np},
+	}
+	for _, tc := range cases {
+		t.Run(tc.name, func(t *testing.T) {
+			if got := resolveCPUWorkers(tc.in); got != tc.want {
+				t.Errorf("resolveCPUWorkers(%q) = %d, want %d", tc.in, got, tc.want)
+			}
+		})
+	}
+}
+
+// TestClampMemPct ensures the mem_pct knob never drives the memory
+// burner into OOM territory (upper clamp) or into uselessness (lower
+// clamp). Zero or negative is treated as "use default 50" so a missing
+// knob in an older orchestrator's claim response doesn't collapse the workload.
+func TestClampMemPct(t *testing.T) {
+	cases := []struct {
+		in, want int
+	}{
+		{0, 50},   // default
+		{-10, 50}, // negative treated as default
+		{5, 10},   // below lower band → clamp up
+		{10, 10},
+		{50, 50},
+		{90, 90},
+		{95, 90}, // above upper band → clamp down
+		{1000, 90},
+	}
+	for _, tc := range cases {
+		if got := clampMemPct(tc.in); got != tc.want {
+			t.Errorf("clampMemPct(%d) = %d, want %d", tc.in, got, tc.want)
+		}
+	}
+}