[Bounty #3926] fix(runtime): avoid race when resuming paused agents — resume runtime#3928
Open
neuralmint wants to merge 2 commits into
…pt, revision, and lifecycle checks

Implements a CompensationPlanner that enforces failure fan-out invariants before committing scheduling state. Guards include:
- Lifecycle state machine (PENDING → COMPENSATING → COMPENSATED|FAILED)
- Attempt monotonicity (stale attempt detection)
- Revision tracking (stale state detection)
- Duplicate compensation gate (once COMPENSATED, no further transitions)

The planner is integrated into OrchestrationEngine's error-handling path so failure fan-out triggers go through guarded transitions.

Closes orchestration-agent#3924

Test plan:
- 31 deterministic regression tests covering all lifecycle transitions, stale attempt/rejection detection, duplicate gate, serialization, metrics, and auditing logs.
- All 62 existing tests pass with no regressions.
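The four guards listed in this commit message could be combined roughly as follows. This is only an illustrative sketch: the PR does not show the planner's code, so `CompensationRecord`, `GuardRejected`, the `transition()` signature, the order of the checks, and the choice to treat FAILED as terminal are all assumptions.

```python
import enum
from dataclasses import dataclass


class Lifecycle(enum.Enum):
    PENDING = "pending"
    COMPENSATING = "compensating"
    COMPENSATED = "compensated"
    FAILED = "failed"


# Lifecycle from the commit message: PENDING -> COMPENSATING -> COMPENSATED|FAILED.
# Treating FAILED as terminal is an assumption of this sketch.
_NEXT = {
    Lifecycle.PENDING: {Lifecycle.COMPENSATING},
    Lifecycle.COMPENSATING: {Lifecycle.COMPENSATED, Lifecycle.FAILED},
    Lifecycle.COMPENSATED: set(),
    Lifecycle.FAILED: set(),
}


class GuardRejected(Exception):
    """Hypothetical error raised when any guard refuses a transition."""


@dataclass
class CompensationRecord:
    # Hypothetical per-task record holding the guarded fields.
    state: Lifecycle = Lifecycle.PENDING
    attempt: int = 0
    revision: int = 0


class CompensationPlanner:
    def __init__(self):
        self._records = {}

    def transition(self, task_id, target, attempt, expected_revision):
        rec = self._records.setdefault(task_id, CompensationRecord())
        # Duplicate gate: once COMPENSATED, nothing else may happen.
        if rec.state is Lifecycle.COMPENSATED:
            raise GuardRejected("already compensated")
        # Attempt monotonicity: an older attempt number is stale.
        if attempt < rec.attempt:
            raise GuardRejected("stale attempt")
        # Revision tracking: the caller must have seen the latest revision.
        if expected_revision != rec.revision:
            raise GuardRejected("stale revision")
        # Lifecycle state machine: only listed edges are legal.
        if target not in _NEXT[rec.state]:
            raise GuardRejected("illegal lifecycle transition")
        rec.state = target
        rec.attempt = attempt
        rec.revision += 1
        return rec.revision
```

Each accepted transition bumps the revision, so a caller holding an outdated revision is rejected even when its attempt number is current.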
Adds a durable state machine guard to AgentRuntime and AgentRegistry to prevent race conditions when resuming paused agents.

Changes:
- Add RuntimeState.PAUSED and _VALID_RUNTIME_TRANSITIONS matrix to AgentRuntime, ensuring every state transition is validated before being applied.
- Add threading.Lock to AgentRuntime so concurrent resume/stop/start calls are serialised.
- Add resume() and pause() methods to AgentRuntime with explicit guard: resume only works from PAUSED, pause only from RUNNING.
- Add AgentStatusError and _VALID_STATUS_TRANSITIONS to AgentRegistry so invalid status transitions (e.g. PAUSED -> PENDING) raise instead of silently corrupting state.
- Add resume_agent() to OrchestrationEngine that validates both the registry status and runtime state before committing the transition.
- Add concurrency tests verifying that only one of N concurrent resume calls succeeds and that state is consistent under load.

Fixes race where multiple resume calls could simultaneously transition a paused agent to RUNNING, causing duplicate work and inconsistent state.

Closes orchestration-agent#3926
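The registry-side guard named in this commit message might look something like the sketch below. Only the six status names, `AgentStatusError`, `_VALID_STATUS_TRANSITIONS`, `update_status()`, and the rejection of PAUSED -> PENDING come from the PR; the remaining matrix entries and the `register()` and `get_status()` helpers are assumptions made for illustration.

```python
import enum


class AgentStatus(enum.Enum):
    PENDING = "pending"
    RUNNING = "running"
    PAUSED = "paused"
    STOPPED = "stopped"
    FAILED = "failed"
    TERMINATED = "terminated"


class AgentStatusError(Exception):
    """Raised when a status transition is not in the allowed matrix."""


S = AgentStatus
# Assumed matrix; the PR only confirms that PAUSED -> PENDING is rejected.
_VALID_STATUS_TRANSITIONS = {
    S.PENDING: {S.RUNNING, S.FAILED, S.TERMINATED},
    S.RUNNING: {S.PAUSED, S.STOPPED, S.FAILED, S.TERMINATED},
    S.PAUSED: {S.RUNNING, S.STOPPED, S.FAILED, S.TERMINATED},
    S.STOPPED: {S.TERMINATED},
    S.FAILED: {S.TERMINATED},
    S.TERMINATED: set(),
}


class AgentRegistry:
    def __init__(self):
        self._status = {}

    def register(self, agent_id):
        # Hypothetical helper: new agents start in PENDING.
        self._status[agent_id] = S.PENDING

    def get_status(self, agent_id):
        return self._status[agent_id]

    def update_status(self, agent_id, new_status):
        current = self._status[agent_id]
        # Validate before writing, so a rejected call leaves state untouched.
        if new_status not in _VALID_STATUS_TRANSITIONS[current]:
            raise AgentStatusError(
                f"{agent_id}: {current.name} -> {new_status.name} is not allowed"
            )
        self._status[agent_id] = new_status
```

Raising before the write is what keeps an invalid request from silently corrupting the stored status: the caller sees the error and the registry keeps its previous value.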
Summary
Adds a durable state machine guard to prevent race conditions when resuming paused agents. This ensures that concurrent resume operations are serialised and that invalid state transitions are rejected before they can cause duplicate work or inconsistent run state.
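As a rough illustration of how a lock plus a transition matrix serialises resume calls, consider the minimal sketch below. It is not the PR's code: the STOPPED state, the contents of the matrix, and the `expected` parameter of `_transition()` are assumptions; only the names `RuntimeState.PAUSED`, `_VALID_RUNTIME_TRANSITIONS`, `RuntimeStateError`, and the method names come from the description.

```python
import enum
import threading


class RuntimeState(enum.Enum):
    STOPPED = "stopped"  # assumed initial state
    RUNNING = "running"
    PAUSED = "paused"


class RuntimeStateError(Exception):
    """Raised when a transition is not legal from the current state."""


# Assumed matrix: the PR names _VALID_RUNTIME_TRANSITIONS but does not list it.
_VALID_RUNTIME_TRANSITIONS = {
    RuntimeState.STOPPED: {RuntimeState.RUNNING},
    RuntimeState.RUNNING: {RuntimeState.PAUSED, RuntimeState.STOPPED},
    RuntimeState.PAUSED: {RuntimeState.RUNNING, RuntimeState.STOPPED},
}


class AgentRuntime:
    def __init__(self):
        self._lock = threading.Lock()
        self._state = RuntimeState.STOPPED

    def get_state(self):
        with self._lock:
            return self._state

    def _transition(self, target, expected=None):
        # The check and the write share one lock acquisition, so two callers
        # cannot both observe PAUSED and both move the runtime to RUNNING.
        with self._lock:
            current = self._state
            if expected is not None and current is not expected:
                raise RuntimeStateError(
                    f"expected {expected.name}, found {current.name}"
                )
            if target not in _VALID_RUNTIME_TRANSITIONS[current]:
                raise RuntimeStateError(
                    f"illegal transition {current.name} -> {target.name}"
                )
            self._state = target

    def start(self):
        self._transition(RuntimeState.RUNNING, expected=RuntimeState.STOPPED)

    def stop(self):
        self._transition(RuntimeState.STOPPED)

    def pause(self):
        self._transition(RuntimeState.PAUSED, expected=RuntimeState.RUNNING)

    def resume(self):
        self._transition(RuntimeState.RUNNING, expected=RuntimeState.PAUSED)
```

With this shape, the first `resume()` on a paused runtime succeeds and any later `resume()` finds the state already RUNNING and raises `RuntimeStateError` instead of starting duplicate work.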
Changes
AgentRuntime (`src/agent/runtime.py`)
- `RuntimeState.PAUSED` and a `_VALID_RUNTIME_TRANSITIONS` matrix enumerating all legal state transitions
- `threading.Lock` around all state mutations (start, stop, resume, pause, get_state)
- `resume()` method: only valid when current state is PAUSED; raises `RuntimeStateError` otherwise
- `pause()` method: only valid when current state is RUNNING
- `_transition()` and `resume()` validate the transition before committing

AgentRegistry (`src/agent/registry.py`)
- `AgentStatusError` exception for invalid lifecycle transitions
- `_VALID_STATUS_TRANSITIONS` matrix for the six agent statuses (PENDING, RUNNING, PAUSED, STOPPED, FAILED, TERMINATED)
- `update_status()` now raises `AgentStatusError` on invalid transitions (e.g. PAUSED → PENDING is rejected)

OrchestrationEngine (`src/orchestrator/engine.py`)
- `resume_agent()` method that validates both the registry status (must be PAUSED) and runtime state (must be PAUSED) before committing the transition to RUNNING
- `ResumeStateError` when guard conditions are not met

Tests
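A concurrency test of the kind described for this PR, where only one of N concurrent resume calls succeeds, might be structured like the sketch below. The PR's actual tests are not shown here, so `PausedRuntime` is a hypothetical stand-in for a paused `AgentRuntime`, and `count_successful_resumes()` is an illustrative helper name.

```python
import threading


class RuntimeStateError(Exception):
    """Raised when resume() is called from a state other than PAUSED."""


class PausedRuntime:
    # Hypothetical stand-in that starts out paused and guards resume() with a lock.
    def __init__(self):
        self._lock = threading.Lock()
        self._state = "PAUSED"

    def resume(self):
        with self._lock:
            if self._state != "PAUSED":
                raise RuntimeStateError(f"cannot resume from {self._state}")
            self._state = "RUNNING"

    def get_state(self):
        with self._lock:
            return self._state


def count_successful_resumes(n_threads=16):
    runtime = PausedRuntime()
    # The barrier releases all workers together to maximise contention.
    barrier = threading.Barrier(n_threads)
    results = []
    results_lock = threading.Lock()

    def worker():
        barrier.wait()
        try:
            runtime.resume()
            ok = True
        except RuntimeStateError:
            ok = False
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count(True), results.count(False), runtime.get_state()


if __name__ == "__main__":
    successes, failures, state = count_successful_resumes(16)
    assert successes == 1, "exactly one resume should win"
    assert failures == 15
    assert state == "RUNNING"
```

Because the state check and the write happen inside the same lock, the outcome does not depend on thread scheduling: exactly one worker sees PAUSED, and the rest are rejected.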
Test Output
Closes #3926