deep profile + threshold gating + firmware stage + Burn super-stage
CI / Lint + build + test (push) Failing after 1m57s
Release / release (push) Has been cancelled

Ships all five phases of the deep-profile overhaul together. Runs now
carry a profile (quick/deep/soak); every profile walks the same
11-stage order — Inventory → Firmware → SpecValidate → SMART →
CPUStress → Storage → Network → Burn → GPU → PSU → Reporting —
with only per-stage durations and concurrency scaled.

Phase 1: profiles.ProfileRegistry loaded from vetting.yaml; runs.profile
column + CreateWithProfile; threshold table + evaluator seeded per-run
from the shared vetting.thresholds block; breach flips result at
/sensor + /result.
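
A minimal Go sketch of a per-run threshold evaluator, assuming the shared `vetting.thresholds` block reduces to metric/upper-bound pairs. `Threshold`, `Evaluator`, `Observe`, the metric name `cpu_temp_c`, and the "pass"/"fail" strings are hypothetical; the real table schema and how /sensor and /result consume it are not in this diff.

```go
package main

import "fmt"

// Threshold is a hypothetical row in a per-run threshold table.
type Threshold struct {
	Metric string  // metric name, e.g. "cpu_temp_c" (illustrative)
	Max    float64 // inclusive upper bound; values above it are breaches
}

// Evaluator holds a per-run copy of the shared thresholds plus any breaches.
type Evaluator struct {
	limits   map[string]float64
	breaches []string
}

// NewEvaluator seeds a run from the shared table. Copying into a fresh map
// keeps one run's state from leaking into another.
func NewEvaluator(shared []Threshold) *Evaluator {
	e := &Evaluator{limits: make(map[string]float64, len(shared))}
	for _, t := range shared {
		e.limits[t.Metric] = t.Max
	}
	return e
}

// Observe records a breach when a sample exceeds its metric's limit.
// Metrics with no threshold are ignored.
func (e *Evaluator) Observe(metric string, v float64) {
	if limit, ok := e.limits[metric]; ok && v > limit {
		e.breaches = append(e.breaches, fmt.Sprintf("%s=%.1f > %.1f", metric, v, limit))
	}
}

// Result flips from "pass" to "fail" once any breach has been recorded.
func (e *Evaluator) Result() string {
	if len(e.breaches) > 0 {
		return "fail"
	}
	return "pass"
}
```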

Phase 2: upgraded CPUStress (stress-ng --cpu-method=all --verify +
EDAC/MCE poll), Storage (fio --verify=md5 + SMART start/end delta),
Network (sustained iperf + /proc/net/dev deltas) with per-profile
knobs from Deps.
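
The /proc/net/dev half of the Network check amounts to snapshotting interface counters before and after the iperf window and subtracting. A minimal Go sketch, assuming only RX/TX bytes and errors are tracked; `Counters`, `ParseNetDev`, and `Delta` are hypothetical names. Field positions follow the kernel's /proc/net/dev layout: after the `iface:` prefix come 8 receive columns, then 8 transmit columns.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// Counters holds the subset of /proc/net/dev fields this sketch tracks.
type Counters struct {
	RxBytes, RxErrs, TxBytes, TxErrs uint64
}

// ParseNetDev parses /proc/net/dev text into per-interface counters.
// The two header lines contain no colon and are skipped.
func ParseNetDev(text string) (map[string]Counters, error) {
	out := map[string]Counters{}
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		name, rest, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue
		}
		f := strings.Fields(rest)
		if len(f) < 16 {
			continue
		}
		// Column indexes: rx bytes, rx errs, tx bytes, tx errs.
		var v [4]uint64
		for i, idx := range []int{0, 2, 8, 10} {
			n, err := strconv.ParseUint(f[idx], 10, 64)
			if err != nil {
				return nil, err
			}
			v[i] = n
		}
		out[strings.TrimSpace(name)] = Counters{v[0], v[1], v[2], v[3]}
	}
	return out, sc.Err()
}

// Delta subtracts start from end per interface. Interfaces absent from
// start are skipped, and counter wraparound is not handled (assumptions).
func Delta(start, end map[string]Counters) map[string]Counters {
	d := map[string]Counters{}
	for name, e := range end {
		s, ok := start[name]
		if !ok {
			continue
		}
		d[name] = Counters{e.RxBytes - s.RxBytes, e.RxErrs - s.RxErrs, e.TxBytes - s.TxBytes, e.TxErrs - s.TxErrs}
	}
	return d
}
```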

Phase 3: Burn super-stage with goroutine fan-out for CPU + memory +
fio + iperf, PSU rails sampled across the Burn window, SensorMux
(2 s flush, 500-sample cap) to absorb backpressure.

Phase 4: Firmware stage + firmware_snapshots table; probes dmidecode
(BIOS), ipmitool (BMC), ethtool -i (NIC), nvme (sysfs + id-ctrl),
lspci (HBA), /proc/cpuinfo (microcode). spec.DiffFirmware folds into
SpecValidate with pin-by-identifier and fan-out-across-component
matching; mismatches park the run in FailedHolding.
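
A minimal Go sketch of the two matching modes, assuming a spec entry with an identifier pins exactly one probed component and an entry without one fans out to every component of the same kind. `FirmwareFact`, `Expect`, and `diffFirmware` are hypothetical stand-ins for spec.DiffFirmware, whose real signature is not in this diff; the PCI addresses and versions used with it are invented.

```go
package main

import "fmt"

// FirmwareFact is one probed component version (e.g. from ethtool -i).
type FirmwareFact struct {
	Kind    string // e.g. "bios", "bmc", "nic", "nvme", "hba", "microcode"
	ID      string // component identifier, e.g. a PCI address (assumption)
	Version string
}

// Expect is one spec entry. With ID set it pins a single component;
// with ID empty it fans out to every component of that Kind.
type Expect struct {
	Kind, ID, Version string
}

// diffFirmware returns one message per mismatch. An expectation that
// matches no probed component is also reported (assumption).
func diffFirmware(spec []Expect, facts []FirmwareFact) []string {
	var out []string
	for _, e := range spec {
		matched := false
		for _, f := range facts {
			if f.Kind != e.Kind || (e.ID != "" && f.ID != e.ID) {
				continue
			}
			matched = true
			if f.Version != e.Version {
				out = append(out, fmt.Sprintf("%s[%s]: have %s, want %s", f.Kind, f.ID, f.Version, e.Version))
			}
		}
		if !matched {
			out = append(out, fmt.Sprintf("%s[%s]: not found", e.Kind, e.ID))
		}
	}
	return out
}
```

In this sketch, a non-empty result corresponds to the mismatch case that parks the run in FailedHolding.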

Phase 5: profile radio on the host start form, profile chip on the
run header, Firmware section in the HTML report, coverage artifact
uploaded from CI, agent/tests/fakes/ scaffold with Deps.LookPath
seam + stress_ng and dmidecode example fakes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 22:50:57 -04:00
parent fbb21cbafd
commit 23c689aa5b
60 changed files with 5911 additions and 527 deletions
@@ -0,0 +1,58 @@
package tests

import (
	"runtime"
	"testing"
)

// TestResolveCPUWorkers covers the three parse branches: empty or "all"
// (case-insensitive) falls back to NumCPU, a positive integer is used
// verbatim, and anything else (zero, negative, or unparseable) also falls
// back to NumCPU rather than returning zero. Zero workers would make
// stress-ng a no-op and silently defeat Burn's CPU load.
func TestResolveCPUWorkers(t *testing.T) {
	np := runtime.NumCPU()
	cases := []struct {
		name string
		in   string
		want int
	}{
		{"empty defaults to NumCPU", "", np},
		{"all defaults to NumCPU", "all", np},
		{"ALL is case-insensitive", "ALL", np},
		{"explicit integer", "3", 3},
		{"negative falls back", "-1", np},
		{"zero falls back", "0", np},
		{"garbage falls back", "lots", np},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := resolveCPUWorkers(tc.in); got != tc.want {
				t.Errorf("resolveCPUWorkers(%q) = %d, want %d", tc.in, got, tc.want)
			}
		})
	}
}

// TestClampMemPct ensures the mem_pct knob never drives the memory
// burner into OOM territory (upper clamp at 90) or into uselessness
// (lower clamp at 10). Zero and negative values are treated as "use
// default 50", so a missing knob in an older orchestrator's claim
// response doesn't collapse the workload.
func TestClampMemPct(t *testing.T) {
	cases := []struct {
		in, want int
	}{
		{0, 50},   // default
		{-10, 50}, // negative treated as default
		{5, 10},   // below lower band → clamp up
		{10, 10},
		{50, 50},
		{90, 90},
		{95, 90}, // above upper band → clamp down
		{1000, 90},
	}
	for _, tc := range cases {
		if got := clampMemPct(tc.in); got != tc.want {
			t.Errorf("clampMemPct(%d) = %d, want %d", tc.in, got, tc.want)
		}
	}
}