Orion's run (log 20:49 → 20:54) shipped GREEN while silently skipping
CPUStress. Two compounding bugs:
1. CPUStress ran --cpu N AND --vm N --vm-bytes 90% concurrently.
On a 4-core 8 GiB N95, that's 360% RAM overcommit; the OOM-killer
fired, usually on the agent itself. Replaced with two sequential
passes — CPU (all methods, --verify) for 3 min, then RAM (--vm 1,
--vm-bytes capped to MemAvailable − 1.5 GiB, floor 256 MiB, --verify)
for 3 min. Each pass now also asserts elapsed ≥ target − 2s so a
premature clean exit counts as failure instead of a silent pass.
2. On systemd-restart after the OOM, the agent hardcoded nextStage :=
"Inventory" and re-ran it. The orchestrator's /result handler
advances run state via TriggerStageCompleted against the *current*
RunState, not against body.Stage — so an Inventory result posted
while the run was in StateCPUStress silently advanced CPUStress →
Storage and marked CPUStress passed without it ever running.
Two-layer defense for #2:
- agent-side: /claim response now carries current_state; agent resumes
at the matching stage on a re-claim (happy path).
- server-side: new TriggerStageMismatch + StageNameForState helper
backstop. If body.Stage doesn't match the run's current stage, /result
parks the run in FailedHolding with failed_stage labeled
"<got> (expected <expected>)" and returns 409.
Other stages audited for similar unbounded concurrency — none found;
only CPUStress was unsafe.
Tests:
- cpustress_test.go — parseMemAvailable parses real meminfo, errors on
missing/malformed; cap calc hits floor on tiny boxes, uses 1.5 GiB
headroom on normal/huge boxes.
- statemachine_test.go — TriggerStageMismatch lands at FailedHolding
from every stage state and is rejected from pre-stage/terminal
states; StageNameForState round-trips the stageStates map.
- agent_handlers_test.go — TestResult_RejectsMismatchedStage proves
the Orion scenario now 409s + FailedHolding; TestResult_AcceptsMatchingStage
proves the guard doesn't break the happy path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The live image was still carrying the Phase 2 package list, so SMART,
CPUStress, and Network each hit a LookPath miss and returned
pass-with-skip. A run that skipped every real check still ended in
"completed" — nothing on the report said the image was broken.
Add smartmontools, stress-ng, fio, iperf3, lshw, lm-sensors,
e2fsprogs, and util-linux to mkosi.conf. Flip the three stages from
skip-pass to fail when their binary is missing so any future
packaging regression blocks the run instead of whispering past it.
Legitimate "no hardware" skips (no GPU, no hwmon, no disks,
non-destructive) are untouched.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Non-destructive pre-declares "don't touch the disks" on Start: the
Storage stage skips wipe-probe, badblocks -w, and write-mode fio,
and reports a read-only summary. Runs a new non_destructive column;
threaded through Claim → agent tests.Deps → Storage stage.
Cancel halts an in-flight run. The orchestrator transitions to a
new StateCancelled via TriggerOperatorCancelled (valid from any
active state); the agent's next heartbeat returns cmd=cancel_stage,
which fires a stored CancelFunc on the per-stage context. Stage
subprocesses spawned with exec.CommandContext die with the context,
the agent posts a cancelled outcome, then powers the host off.
Destructive stages mid-run may leave the host in an intermediate
state — the UI confirm dialog warns the operator; recovery is
manual for now.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>