fix(agent): keep heartbeat loop alive during FailedHolding

The heartbeat handler returned cmd=abort for both FailedHolding and StateReleased. For FailedHolding this made the agent's heartbeat goroutine exit after ~10s in hold, so later operator actions (Cancel -> reboot, Override -> retry_stage) had no recipient and the host sat idle at the SSH hold prompt forever.

Narrow cmd=abort to StateReleased only. FailedHolding now falls through to cmd=continue, so the loop keeps polling and can receive the operator's eventual command.
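
For illustration only, the narrowed mapping can be sketched as a standalone function. This is a minimal sketch, not the handler itself: the State type, its constants, and heartbeatCommand are hypothetical stand-ins for the model package and the switch inside Heartbeat, and every case other than Released is collapsed into "continue".

```go
package main

import "fmt"

// State is a hypothetical stand-in for the run states in the model package.
type State string

const (
	StateFailedHolding State = "FailedHolding"
	StateReleased      State = "Released"
)

// heartbeatCommand is a hypothetical, simplified version of the command
// selection in the heartbeat handler. Only Released ends the agent's loop;
// FailedHolding (like every other state in this sketch) keeps it polling.
func heartbeatCommand(s State) string {
	switch s {
	case StateReleased:
		// Operator accepted the failure outcome: stop the heartbeat loop.
		return "abort"
	default:
		// Includes FailedHolding: the agent stays reachable for a later
		// reboot or retry_stage command.
		return "continue"
	}
}

func main() {
	fmt.Println(heartbeatCommand(StateFailedHolding)) // continue
	fmt.Println(heartbeatCommand(StateReleased))      // abort
}
```

Before this change, the equivalent of the first call would have returned "abort", which is what ended the goroutine while the host was still in hold.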
2026-04-20 18:28:43 -04:00
parent 62bddac110
commit 73f727b4c1
2 changed files with 38 additions and 1 deletion
@@ -286,8 +286,14 @@ func (a *Agent) Heartbeat(w http.ResponseWriter, r *http.Request) {
 		} else {
 			cmd = "cancel_stage"
 		}
-	case run.State == model.StateFailedHolding || run.State == model.StateReleased:
+	case run.State == model.StateReleased:
+		// Operator accepted the failure outcome. No further agent
+		// action is possible — stop the heartbeat loop.
 		cmd = "abort"
+	// FailedHolding intentionally falls through to cmd=continue: the
+	// agent is parked in waitForOverride awaiting operator action
+	// (Cancel → reboot, Override → retry_stage). Keeping the
+	// heartbeat loop alive is what lets those commands reach it.
 	case run.FailedStage == "Storage" && overrideWipeSet(run.OverrideFlagsJSON):
 		// Operator pressed "Override wipe & retry". Agent should
 		// re-enter Storage with the wipe-probe bypass armed.