Vetting/agent/tests/cpustress_test.go
josh 27098fc7ed
CI / Lint + build + test (push) Successful in 1m23s
Release / release (push) Successful in 6m2s
cpustress+orchestrator: serial CPU/RAM passes + silent-skip guard
Orion's run (log 20:49 → 20:54) shipped GREEN while silently skipping
CPUStress. Two compounding bugs:

1. CPUStress ran --cpu N AND --vm N --vm-bytes 90% concurrently.
   On a 4-core 8 GiB N95, that's 4 workers × 90% = 360% of RAM
   requested; the OOM-killer fired, usually on the agent itself.
   Replaced with two sequential
   passes — CPU (all methods, --verify) for 3 min, then RAM (--vm 1,
   --vm-bytes capped to MemAvailable − 1.5 GiB, floor 256 MiB, --verify)
   for 3 min. Each pass now also asserts elapsed ≥ target − 2s so a
   premature clean exit counts as failure instead of a silent pass.

2. On systemd-restart after the OOM, the agent hardcoded nextStage :=
   "Inventory" and re-ran it. The orchestrator's /result handler
   advances run state via TriggerStageCompleted against the *current*
   RunState, not against body.Stage — so an Inventory result posted
   while the run was in StateCPUStress silently advanced CPUStress →
   Storage and marked CPUStress passed without it ever running.

Two-layer defense for #2:
- agent-side: /claim response now carries current_state; agent resumes
  at the matching stage on a re-claim (happy path).
- server-side: new TriggerStageMismatch + StageNameForState helper
  backstop. If body.Stage doesn't match the run's current stage, /result
  parks the run in FailedHolding with failed_stage labeled
  "<got> (expected <expected>)" and returns 409.
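The server-side backstop can be illustrated with a stripped-down sketch. Everything here is hypothetical: the run type, the three-stage nextState map, and handleResult stand in for the orchestrator's real /result handler, and the sketch collapses state names and stage names into one string instead of using TriggerStageMismatch and StageNameForState, whose signatures are not shown in this commit view.

```go
package main

import (
	"fmt"
	"net/http"
)

// run is a hypothetical stand-in for the orchestrator's run record.
type run struct {
	State       string
	FailedStage string
}

// nextState is an assumed three-stage pipeline; the real state machine
// has more states than this.
var nextState = map[string]string{
	"Inventory": "CPUStress",
	"CPUStress": "Storage",
	"Storage":   "Done",
}

// handleResult advances the run only when the posted stage matches the
// run's current stage. On a mismatch it parks the run in FailedHolding
// and returns 409 instead of silently advancing.
func handleResult(r *run, bodyStage string) int {
	next, inStage := nextState[r.State]
	if !inStage {
		// Pre-stage or terminal state: reject without changing anything.
		return http.StatusConflict
	}
	if bodyStage != r.State {
		r.FailedStage = fmt.Sprintf("%s (expected %s)", bodyStage, r.State)
		r.State = "FailedHolding"
		return http.StatusConflict
	}
	r.State = next
	return http.StatusOK
}

func main() {
	// The Orion scenario: an Inventory result arrives while in CPUStress.
	r := &run{State: "CPUStress"}
	code := handleResult(r, "Inventory")
	fmt.Println(code, r.State, r.FailedStage)
	// prints: 409 FailedHolding Inventory (expected CPUStress)
}
```

Checking body.Stage before triggering the transition is what keeps a stale result from completing a stage that never ran.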

Other stages audited for similar unbounded concurrency — none found;
only CPUStress was unsafe.

Tests:
- cpustress_test.go — parseMemAvailable parses real meminfo, errors on
  missing/malformed; cap calc hits floor on tiny boxes, uses 1.5 GiB
  headroom on normal/huge boxes.
- statemachine_test.go — TriggerStageMismatch lands at FailedHolding
  from every stage state and is rejected from pre-stage/terminal
  states; StageNameForState round-trips the stageStates map.
- agent_handlers_test.go — TestResult_RejectsMismatchedStage proves
  the Orion scenario now 409s + FailedHolding; TestResult_AcceptsMatchingStage
  proves the guard doesn't break the happy path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 17:29:13 -04:00

89 lines
2.8 KiB
Go
package tests

import (
	"strings"
	"testing"
)

// TestParseMemAvailable_RealSample exercises parseMemAvailable on a
// real /proc/meminfo snippet. The unit is always kB and the value is
// always the second field; we just want to confirm we parse it correctly.
func TestParseMemAvailable_RealSample(t *testing.T) {
	sample := `MemTotal: 8053292 kB
MemFree: 3205104 kB
MemAvailable: 6742180 kB
Buffers: 145332 kB
Cached: 2934064 kB
`
	got, err := parseMemAvailable(strings.NewReader(sample))
	if err != nil {
		t.Fatalf("parseMemAvailable: %v", err)
	}
	want := int64(6742180) * 1024
	if got != want {
		t.Errorf("MemAvailable = %d bytes, want %d", got, want)
	}
}

func TestParseMemAvailable_Missing(t *testing.T) {
	sample := "MemTotal: 8053292 kB\nMemFree: 3205104 kB\n"
	if _, err := parseMemAvailable(strings.NewReader(sample)); err == nil {
		t.Errorf("expected error when MemAvailable absent")
	}
}

func TestParseMemAvailable_Malformed(t *testing.T) {
	sample := "MemAvailable:\n"
	if _, err := parseMemAvailable(strings.NewReader(sample)); err == nil {
		t.Errorf("expected error on single-field MemAvailable line")
	}
}

// TestMemCap_Normal: on a healthy 8GiB box with ~6.4GiB available,
// cap lands at ~4.9GiB — well above floor, well below total.
func TestMemCap_Normal(t *testing.T) {
	avail := int64(6742180) * 1024 // ~6.4 GiB
	cap := avail - memHeadroomBytes
	if cap < memFloorBytes {
		t.Errorf("cap=%d should be ≥ floor=%d on 6.4GiB available", cap, memFloorBytes)
	}
	// Sanity: headroom carved off 1.5 GiB.
	if got := avail - cap; got != memHeadroomBytes {
		t.Errorf("headroom = %d, want %d", got, memHeadroomBytes)
	}
}

// TestMemCap_FloorHit: a box with <1.75 GiB available should fall
// below the floor so CPUStress refuses the memory pass rather than
// running a window too small to be meaningful.
func TestMemCap_FloorHit(t *testing.T) {
	avail := int64(1_500_000_000) // 1.4 GiB
	cap := avail - memHeadroomBytes
	if cap >= memFloorBytes {
		t.Errorf("cap=%d should be < floor=%d on 1.4GiB available (cap pre-clamp)", cap, memFloorBytes)
	}
}

// TestMemCap_HugeBox: a 128 GiB box still honors the 1.5 GiB
// headroom (no weird upper clamp that would cap us at a tiny
// fraction of the RAM).
func TestMemCap_HugeBox(t *testing.T) {
	avail := int64(128) * 1024 * 1024 * 1024
	cap := avail - memHeadroomBytes
	if cap < avail-2*memHeadroomBytes {
		t.Errorf("cap=%d unexpectedly below avail=%d - 2×headroom", cap, avail)
	}
	// Should be comfortably above 100 GiB.
	if cap < 100*1024*1024*1024 {
		t.Errorf("cap=%d should exceed 100 GiB on 128 GiB box", cap)
	}
}

// TestDurationSeconds_BelowOne floors at "1s"; stress-ng rejects 0.
func TestDurationSeconds_BelowOne(t *testing.T) {
	got := durationSeconds(0)
	if got != "1s" {
		t.Errorf("durationSeconds(0) = %q, want 1s", got)
	}
}