fix(agent): keep heartbeat loop alive during FailedHolding

The heartbeat handler returned cmd=abort for FailedHolding, which caused the agent's heartbeat goroutine to exit after ~10s in hold. Subsequent state changes (Cancel -> reboot, Override -> retry_stage) then had no recipient, so the host sat idle at the SSH hold prompt forever.

Narrowed cmd=abort to StateReleased only. FailedHolding now falls through to cmd=continue, so the loop keeps polling and can receive the operator's eventual command.
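
To illustrate the failure mode from the agent's side, here is a minimal sketch of a polling loop that stops on "abort". The function name pollLoop and the idea of feeding replies as a slice are assumptions made for illustration; the real agent loop is not shown in this commit and may differ.

```go
package main

import "fmt"

// pollLoop is a hypothetical model of the agent's heartbeat loop.
// It walks the server's replies in order, stops at the first "abort",
// and returns every actionable command it saw before stopping.
func pollLoop(replies []string) []string {
	var handled []string
	for _, cmd := range replies {
		switch cmd {
		case "abort":
			// Loop exits; any later reply has no recipient.
			return handled
		case "continue":
			// Nothing to do; keep polling.
		default:
			// An operator command such as "reboot" or "retry_stage".
			handled = append(handled, cmd)
		}
	}
	return handled
}

func main() {
	// Old behaviour: FailedHolding produced "abort", so "reboot" was lost.
	before := pollLoop([]string{"abort", "reboot"})
	// New behaviour: FailedHolding produces "continue", so "reboot" arrives.
	after := pollLoop([]string{"continue", "continue", "reboot"})
	fmt.Println(len(before), after)
}
```

In this model the old reply sequence delivers no operator command, while the new one delivers the reboot.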
@@ -286,8 +286,14 @@ func (a *Agent) Heartbeat(w http.ResponseWriter, r *http.Request) {
 		} else {
 			cmd = "cancel_stage"
 		}
-	case run.State == model.StateFailedHolding || run.State == model.StateReleased:
+	case run.State == model.StateReleased:
+		// Operator accepted the failure outcome. No further agent
+		// action is possible — stop the heartbeat loop.
 		cmd = "abort"
+		// FailedHolding intentionally falls through to cmd=continue: the
+		// agent is parked in waitForOverride awaiting operator action
+		// (Cancel → reboot, Override → retry_stage). Keeping the
+		// heartbeat loop alive is what lets those commands reach it.
 	case run.FailedStage == "Storage" && overrideWipeSet(run.OverrideFlagsJSON):
 		// Operator pressed "Override wipe & retry". Agent should
 		// re-enter Storage with the wipe-probe bypass armed.
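
The narrowed case can be sketched in isolation as follows. This is a simplified model, not the handler itself: the State type, its constants, and commandFor are hypothetical stand-ins for the model package and the switch inside Heartbeat, and the other cases of the real switch are omitted.

```go
package main

import "fmt"

// State is a hypothetical stand-in for the run states in the model package.
type State string

const (
	StateRunning       State = "Running"
	StateFailedHolding State = "FailedHolding"
	StateReleased      State = "Released"
)

// commandFor mirrors the narrowed case: only StateReleased yields "abort".
// Every other state, including FailedHolding, gets "continue" here.
func commandFor(s State) string {
	switch {
	case s == StateReleased:
		return "abort"
	default:
		return "continue"
	}
}

func main() {
	for _, s := range []State{StateRunning, StateFailedHolding, StateReleased} {
		fmt.Printf("%s -> %s\n", s, commandFor(s))
	}
}
```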