Orion's run (log 20:49 → 20:54) shipped GREEN while silently skipping
CPUStress. Two compounding bugs:
1. CPUStress ran --cpu N AND --vm N --vm-bytes 90% concurrently.
On a 4-core 8 GiB N95 (N = 4), that's 4 × 90% = 360% RAM overcommit;
the OOM-killer fired, usually on the agent itself. Replaced with two sequential
passes — CPU (all methods, --verify) for 3 min, then RAM (--vm 1,
--vm-bytes capped to MemAvailable − 1.5 GiB, floor 256 MiB, --verify)
for 3 min. Each pass now also asserts elapsed ≥ target − 2s so a
premature clean exit counts as failure instead of a silent pass.
2. On systemd-restart after the OOM, the agent hardcoded nextStage :=
"Inventory" and re-ran it. The orchestrator's /result handler
advances run state via TriggerStageCompleted against the *current*
RunState, not against body.Stage — so an Inventory result posted
while the run was in StateCPUStress silently advanced CPUStress →
Storage and marked CPUStress passed without it ever running.
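For bug 1, the RAM-pass sizing rule (MemAvailable − 1.5 GiB, floor 256 MiB) can be sketched roughly as follows. This is a minimal illustration, not the shipped code: parseMemAvailable is named in the tests below, but its signature here is a guess, and vmBytesCap, vmHeadroom, and vmFloor are hypothetical names.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

const (
	kib        = uint64(1024)
	mib        = 1024 * kib
	gib        = 1024 * mib
	vmHeadroom = 3 * gib / 2 // 1.5 GiB left for the agent and the OS
	vmFloor    = 256 * mib   // never size the RAM pass below this
)

// parseMemAvailable returns MemAvailable in bytes from /proc/meminfo text.
// The signature is an assumption; only the name appears in the commit.
func parseMemAvailable(meminfo string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 || fields[0] != "MemAvailable:" {
			continue
		}
		kb, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return 0, fmt.Errorf("malformed MemAvailable %q: %w", fields[1], err)
		}
		return kb * kib, nil // meminfo reports the value in kB (1024 bytes)
	}
	return 0, errors.New("MemAvailable not found in meminfo")
}

// vmBytesCap applies the headroom, then the floor (hypothetical helper).
func vmBytesCap(memAvailable uint64) uint64 {
	if memAvailable < vmHeadroom+vmFloor {
		return vmFloor
	}
	return memAvailable - vmHeadroom
}
```

Checking the floor before subtracting avoids unsigned underflow on boxes with less than 1.5 GiB available.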
Two-layer defense for #2:
- agent-side: /claim response now carries current_state; agent resumes
at the matching stage on a re-claim (happy path).
- server-side: new TriggerStageMismatch + StageNameForState helper
backstop. If body.Stage doesn't match the run's current stage, /result
parks the run in FailedHolding with failed_stage labeled
"<got> (expected <expected>)" and returns 409.
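The server-side backstop above amounts to comparing body.Stage with the stage implied by the run's current state before advancing anything. A simplified sketch, assuming a three-stage subset and a hypothetical checkResultStage helper (the real handler, state type, and stageStates contents are not shown in this commit):

```go
package main

import "fmt"

type runState int

const (
	stateInventory runState = iota
	stateCPUStress
	stateStorage
	stateFailedHolding
)

// stageStates maps stage names to states; the contents here are an
// illustrative subset, not the orchestrator's real table.
var stageStates = map[string]runState{
	"Inventory": stateInventory,
	"CPUStress": stateCPUStress,
	"Storage":   stateStorage,
}

// stageNameForState is the reverse lookup over stageStates.
func stageNameForState(s runState) (string, bool) {
	for name, st := range stageStates {
		if st == s {
			return name, true
		}
	}
	return "", false
}

// checkResultStage returns the HTTP status for a posted result and, on a
// mismatch, the failed_stage label to record before parking the run.
func checkResultStage(current runState, bodyStage string) (status int, failedStage string) {
	expected, ok := stageNameForState(current)
	if !ok {
		// Not in a stage state: reject without a label (assumption).
		return 409, ""
	}
	if bodyStage != expected {
		return 409, fmt.Sprintf("%s (expected %s)", bodyStage, expected)
	}
	return 200, ""
}
```

With this check, the Orion scenario (an Inventory result posted while the run is in CPUStress) yields 409 and the label "Inventory (expected CPUStress)" rather than advancing the run.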
Other stages audited for similar unbounded concurrency — none found;
only CPUStress was unsafe.
Tests:
- cpustress_test.go — parseMemAvailable parses real meminfo, errors on
missing/malformed; cap calc hits floor on tiny boxes, uses 1.5 GiB
headroom on normal/huge boxes.
- statemachine_test.go — TriggerStageMismatch lands at FailedHolding
from every stage state and is rejected from pre-stage/terminal
states; StageNameForState round-trips the stageStates map.
- agent_handlers_test.go — TestResult_RejectsMismatchedStage proves
the Orion scenario now 409s + FailedHolding; TestResult_AcceptsMatchingStage
proves the guard doesn't break the happy path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Non-destructive pre-declares "don't touch the disks" on Start: the
Storage stage skips wipe-probe, badblocks -w, and write-mode fio,
and reports a read-only summary. Runs gain a new non_destructive column,
threaded through Claim → agent tests.Deps → Storage stage.
Cancel halts an in-flight run. The orchestrator transitions to a
new StateCancelled via TriggerOperatorCancelled (valid from any
active state); the agent's next heartbeat returns cmd=cancel_stage,
which fires a stored CancelFunc on the per-stage context. Stage
subprocesses spawned with exec.CommandContext die with the context,
the agent posts a cancelled outcome, then powers the host off.
Destructive stages mid-run may leave the host in an intermediate
state — the UI confirm dialog warns the operator; recovery is
manual for now.
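The agent-side cancel path described above can be sketched as a stored CancelFunc plus exec.CommandContext. This is an illustration under assumptions: stageRunner, run, and onHeartbeat are hypothetical names, and posting the cancelled outcome and powering off are left out.

```go
package main

import (
	"context"
	"os/exec"
	"sync"
)

// stageRunner holds the CancelFunc of the stage currently in flight.
type stageRunner struct {
	mu     sync.Mutex
	cancel context.CancelFunc
}

// run executes one stage subprocess under a per-stage context and reports
// whether it ended because the context was cancelled.
func (r *stageRunner) run(name string, args ...string) (cancelled bool, err error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	r.mu.Lock()
	r.cancel = cancel // stored so a later heartbeat can fire it
	r.mu.Unlock()

	// CommandContext kills the process once ctx is done.
	err = exec.CommandContext(ctx, name, args...).Run()
	return ctx.Err() != nil, err
}

// onHeartbeat reacts to the command carried in a heartbeat response.
func (r *stageRunner) onHeartbeat(cmd string) {
	if cmd != "cancel_stage" {
		return
	}
	r.mu.Lock()
	cancel := r.cancel
	r.mu.Unlock()
	if cancel != nil {
		cancel() // calling a CancelFunc more than once is harmless
	}
}
```

By default CommandContext kills only the direct child, so a stage that forks its own workers may need process-group handling on top of this.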
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every supported host runs vetting-reporter in-OS and heartbeats every
30s. WoL was never the thing that started vetting — the heartbeat
response's reboot_for_vetting command was. Firing WoL first only
crowded the run log with misleading diagnostics when the real failure
mode was "reporter isn't installed."
- StartRun 409s if the host hasn't heartbeated within 60s, pointing
the operator at /register/quick.sh.
- Dispatcher re-checks LastSeenAt at dispatch time (run may sit in
Queued long enough for the host to go offline); stale hosts mark
the run Failed with failed_stage=dispatch instead of looping.
- New StateWaitingReboot + TriggerRebootCommanded capture the actual
semantics. StateWaitingWoL kept as the hook point for a future
manual-override button.
- Tile disables the Start button with a quick.sh tooltip when the
host is offline, matching the server-side 409.
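The freshness rule shared by StartRun and the dispatcher can be sketched as below. The host type, online, and dispatch are hypothetical names, and using the same 60s window for the dispatch-time re-check is an assumption; the commit only states the window for StartRun.

```go
package main

import "time"

// heartbeatWindow is the StartRun staleness limit; the reporter
// heartbeats every 30s, so this tolerates one missed beat.
const heartbeatWindow = 60 * time.Second

// host carries only the field this check needs.
type host struct {
	LastSeenAt time.Time
}

// online reports whether the host has heartbeated within the window.
// A host that has never been seen counts as offline.
func (h host) online(now time.Time) bool {
	return !h.LastSeenAt.IsZero() && now.Sub(h.LastSeenAt) <= heartbeatWindow
}

// dispatch re-checks freshness at dispatch time, since a run may have sat
// in Queued long enough for the host to go offline.
func dispatch(h host, now time.Time) (state, failedStage string) {
	if !h.online(now) {
		return "Failed", "dispatch"
	}
	return "WaitingReboot", ""
}
```

Passing now in explicitly keeps the check deterministic under test.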
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>