fix(agent): keep heartbeat loop alive during FailedHolding

The heartbeat handler was returning cmd=abort for FailedHolding, which
caused the agent's heartbeat goroutine to exit after ~10s in hold.
Subsequent state changes (Cancel -> reboot, Override -> retry_stage)
then had no recipient, so the host sat idle at the SSH hold prompt
forever. Narrowed cmd=abort to StateReleased only; FailedHolding falls
through to cmd=continue so the loop keeps polling and can receive the
operator's eventual command.
2026-04-20 18:28:43 -04:00
parent 62bddac110
commit 73f727b4c1
2 changed files with 38 additions and 1 deletions
+7 -1
@@ -286,8 +286,14 @@ func (a *Agent) Heartbeat(w http.ResponseWriter, r *http.Request) {
 		} else {
 			cmd = "cancel_stage"
 		}
-	case run.State == model.StateFailedHolding || run.State == model.StateReleased:
+	case run.State == model.StateReleased:
+		// Operator accepted the failure outcome. No further agent
+		// action is possible — stop the heartbeat loop.
 		cmd = "abort"
+		// FailedHolding intentionally falls through to cmd=continue: the
+		// agent is parked in waitForOverride awaiting operator action
+		// (Cancel → reboot, Override → retry_stage). Keeping the
+		// heartbeat loop alive is what lets those commands reach it.
 	case run.FailedStage == "Storage" && overrideWipeSet(run.OverrideFlagsJSON):
 		// Operator pressed "Override wipe & retry". Agent should
 		// re-enter Storage with the wipe-probe bypass armed.