
[Bounty #3926] fix(runtime): avoid race when resuming paused agents — resume runtime #3928

Open
neuralmint wants to merge 2 commits into orchestration-agent:main from neuralmint:fix/resume-runtime-race


@neuralmint

Summary

Adds a durable state machine guard to prevent race conditions when resuming paused agents. This ensures that concurrent resume operations are serialised and that invalid state transitions are rejected before they can cause duplicate work or inconsistent run state.

Changes

AgentRuntime (src/agent/runtime.py)

  • Added RuntimeState.PAUSED and a _VALID_RUNTIME_TRANSITIONS matrix enumerating all legal state transitions
  • Added a threading.Lock around every state access: the mutating methods (start, stop, resume, pause) and the get_state read
  • Added resume() method — only valid when current state is PAUSED; raises RuntimeStateError otherwise
  • Added pause() method — only valid when current state is RUNNING
  • Both _transition() and resume() validate the transition before committing
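
A minimal sketch of the guard described above, not the code in src/agent/runtime.py. The `IDLE` state, the exact edges in the matrix, and the `expected` parameter of `_transition()` are assumptions made for illustration; only `RuntimeState.PAUSED`, `_VALID_RUNTIME_TRANSITIONS`, `RuntimeStateError`, and the method names come from this PR.

```python
import enum
import threading


class RuntimeState(enum.Enum):
    IDLE = "idle"          # assumed initial state, not named in the PR
    RUNNING = "running"
    PAUSED = "paused"
    STOPPED = "stopped"


class RuntimeStateError(RuntimeError):
    """Raised when a requested state transition is not legal."""


# Illustrative matrix: each state maps to the set of states it may move to.
_VALID_RUNTIME_TRANSITIONS = {
    RuntimeState.IDLE: {RuntimeState.RUNNING, RuntimeState.STOPPED},
    RuntimeState.RUNNING: {RuntimeState.PAUSED, RuntimeState.STOPPED},
    RuntimeState.PAUSED: {RuntimeState.RUNNING, RuntimeState.STOPPED},
    RuntimeState.STOPPED: set(),
}


class AgentRuntime:
    def __init__(self):
        self._lock = threading.Lock()
        self._state = RuntimeState.IDLE

    def _transition(self, target, expected=None):
        # Check and commit happen under one lock acquisition, so two callers
        # can never both observe PAUSED and both move to RUNNING.
        with self._lock:
            current = self._state
            if expected is not None and current is not expected:
                raise RuntimeStateError(
                    f"expected {expected.name}, found {current.name}"
                )
            if target not in _VALID_RUNTIME_TRANSITIONS[current]:
                raise RuntimeStateError(
                    f"illegal transition {current.name} -> {target.name}"
                )
            self._state = target

    def start(self):
        self._transition(RuntimeState.RUNNING, expected=RuntimeState.IDLE)

    def pause(self):
        self._transition(RuntimeState.PAUSED, expected=RuntimeState.RUNNING)

    def resume(self):
        self._transition(RuntimeState.RUNNING, expected=RuntimeState.PAUSED)

    def stop(self):
        self._transition(RuntimeState.STOPPED)

    def get_state(self):
        with self._lock:
            return self._state
```

The `expected` check matters because the matrix alone would let `resume()` succeed from `IDLE` in this sketch; pinning the source state keeps `resume()` valid only from `PAUSED`.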

AgentRegistry (src/agent/registry.py)

  • Added AgentStatusError exception for invalid lifecycle transitions
  • Added _VALID_STATUS_TRANSITIONS matrix for the six agent statuses (PENDING, RUNNING, PAUSED, STOPPED, FAILED, TERMINATED)
  • update_status() now raises AgentStatusError on invalid transitions (e.g. PAUSED → PENDING is rejected)
  • Same-status transitions are allowed (idempotent no-op)
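
For reviewers, a rough sketch of how `update_status()` could apply the matrix. The specific edges below are assumed (the only one taken from this PR is that PAUSED → PENDING is rejected), and `register()` and `get_status()` are hypothetical helpers added so the example runs on its own.

```python
import enum


class AgentStatus(enum.Enum):
    PENDING = "pending"
    RUNNING = "running"
    PAUSED = "paused"
    STOPPED = "stopped"
    FAILED = "failed"
    TERMINATED = "terminated"


class AgentStatusError(ValueError):
    """Raised when a lifecycle transition is not allowed."""


S = AgentStatus
# Assumed edges for illustration; the real matrix is in src/agent/registry.py.
_VALID_STATUS_TRANSITIONS = {
    S.PENDING: {S.RUNNING, S.FAILED, S.TERMINATED},
    S.RUNNING: {S.PAUSED, S.STOPPED, S.FAILED, S.TERMINATED},
    S.PAUSED: {S.RUNNING, S.STOPPED, S.TERMINATED},
    S.STOPPED: {S.TERMINATED},
    S.FAILED: {S.TERMINATED},
    S.TERMINATED: set(),
}


class AgentRegistry:
    def __init__(self):
        self._statuses = {}

    def register(self, agent_id):
        # Hypothetical helper: new agents start as PENDING.
        self._statuses[agent_id] = S.PENDING

    def get_status(self, agent_id):
        return self._statuses[agent_id]

    def update_status(self, agent_id, new_status):
        current = self._statuses[agent_id]
        if new_status is current:
            return  # same-status update is an idempotent no-op
        if new_status not in _VALID_STATUS_TRANSITIONS[current]:
            raise AgentStatusError(
                f"{agent_id}: {current.name} -> {new_status.name} is not allowed"
            )
        self._statuses[agent_id] = new_status
```

Handling the same-status case before the matrix lookup means the matrix never needs self-loops, which keeps it readable as a list of real lifecycle moves.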

OrchestrationEngine (src/orchestrator/engine.py)

  • Added resume_agent() method that validates both the registry status (must be PAUSED) and runtime state (must be PAUSED) before committing the transition to RUNNING
  • Raises ResumeStateError when guard conditions are not met
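
The double check can be pictured roughly as below. This is a sketch under assumptions: plain dicts stand in for AgentRegistry and AgentRuntime, the engine-level lock is not something this PR states, and `add_agent()` and `status_of()` are hypothetical helpers for the example.

```python
import threading


class ResumeStateError(RuntimeError):
    """Raised when an agent cannot be resumed from its current state."""


class OrchestrationEngine:
    def __init__(self):
        self._lock = threading.Lock()   # assumed; serialises the check-and-commit
        self._registry_status = {}      # stand-in for AgentRegistry
        self._runtime_state = {}        # stand-in for AgentRuntime

    def add_agent(self, agent_id, status, state):
        # Hypothetical helper that seeds both views of the agent.
        self._registry_status[agent_id] = status
        self._runtime_state[agent_id] = state

    def status_of(self, agent_id):
        return self._registry_status[agent_id], self._runtime_state[agent_id]

    def resume_agent(self, agent_id):
        with self._lock:
            status = self._registry_status.get(agent_id)
            state = self._runtime_state.get(agent_id)
            # Both views must agree that the agent is paused before anything
            # is written, so a failed guard leaves no partial update behind.
            if status != "PAUSED":
                raise ResumeStateError(f"{agent_id}: registry status is {status}")
            if state != "PAUSED":
                raise ResumeStateError(f"{agent_id}: runtime state is {state}")
            self._runtime_state[agent_id] = "RUNNING"
            self._registry_status[agent_id] = "RUNNING"
```

Validating both sources before the first write is the point of the guard: if only one of them says PAUSED, the call fails and neither is modified.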

Tests

  • test_agent_runtime.py: 22 new tests covering:
    • Basic state machine transitions (start, stop, pause, resume)
    • Invalid transitions rejected (resume from RUNNING/STOPPED, pause from PAUSED)
    • Concurrent resume: 5 threads racing — exactly one succeeds, the rest are rejected
    • Concurrent resume + pause on different agents: no deadlock
    • State consistency under concurrent reads/writes
    • Parametrized transition matrix covering 12 valid/invalid paths
    • Engine-level resume_agent guard tests

Test Output

tests/test_agent_runtime.py     ✓ 22 passed
tests/test_agent_registry.py    ✓ 13 passed (3 new guard tests)
tests/test_compensation.py      ✓ 27 passed (unchanged)
tests/test_scheduler.py         ✓ 5 passed (unchanged)
tests/test_config.py            ✓ 5 passed (unchanged)

Closes #3926

…pt, revision, and lifecycle checks

Implements a CompensationPlanner that enforces failure fan-out invariants
before committing scheduling state. Guards include:

- Lifecycle state machine (PENDING → COMPENSATING → COMPENSATED|FAILED)
- Attempt monotonicity (stale attempt detection)
- Revision tracking (stale state detection)
- Duplicate compensation gate (once COMPENSATED, no further transitions)

The planner is integrated into OrchestrationEngine's error-handling path
so failure fan-out triggers go through guarded transitions.
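
A rough sketch of how the four guards could combine in one check, for illustration only. The `transition()` signature, the `CompensationGuardError` name, the revision-increment scheme, and treating FAILED as terminal are all assumptions; the commit only names the lifecycle states and the guard categories.

```python
import enum


class CompensationState(enum.Enum):
    PENDING = "pending"
    COMPENSATING = "compensating"
    COMPENSATED = "compensated"
    FAILED = "failed"


class CompensationGuardError(RuntimeError):
    """Hypothetical error type for any rejected transition."""


C = CompensationState
_VALID_COMPENSATION_TRANSITIONS = {
    C.PENDING: {C.COMPENSATING},
    C.COMPENSATING: {C.COMPENSATED, C.FAILED},
    C.COMPENSATED: set(),
    C.FAILED: set(),  # assumed terminal in this sketch
}


class CompensationPlanner:
    def __init__(self):
        self.state = C.PENDING
        self.attempt = 0
        self.revision = 0

    def transition(self, target, attempt, expected_revision):
        # Duplicate gate: once compensated, nothing else may happen.
        if self.state is C.COMPENSATED:
            raise CompensationGuardError("already compensated")
        # Attempt monotonicity: an older attempt must not overwrite a newer one.
        if attempt < self.attempt:
            raise CompensationGuardError(f"stale attempt {attempt}")
        # Revision tracking: the caller must have seen the latest state.
        if expected_revision != self.revision:
            raise CompensationGuardError(f"stale revision {expected_revision}")
        # Lifecycle state machine.
        if target not in _VALID_COMPENSATION_TRANSITIONS[self.state]:
            raise CompensationGuardError(
                f"illegal transition {self.state.name} -> {target.name}"
            )
        self.state = target
        self.attempt = attempt
        self.revision += 1
        return self.revision
```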

Closes orchestration-agent#3924

Test plan:
- 31 deterministic regression tests covering all lifecycle transitions,
  stale attempt/rejection detection, duplicate gate, serialization,
  metrics, and auditing logs.
- All 62 existing tests pass with no regressions.

Adds a durable state machine guard to AgentRuntime and AgentRegistry to
prevent race conditions when resuming paused agents.

Changes:
- Add RuntimeState.PAUSED and _VALID_RUNTIME_TRANSITIONS matrix to
  AgentRuntime, ensuring every state transition is validated before
  being applied.
- Add threading.Lock to AgentRuntime so concurrent resume/stop/start
  calls are serialised.
- Add resume() and pause() methods to AgentRuntime with explicit guard:
  resume only works from PAUSED, pause only from RUNNING.
- Add AgentStatusError and _VALID_STATUS_TRANSITIONS to AgentRegistry
  so invalid status transitions (e.g. PAUSED -> PENDING) raise instead
  of silently corrupting state.
- Add resume_agent() to OrchestrationEngine that validates both the
  registry status and runtime state before committing the transition.
- Add concurrency tests verifying that only one of N concurrent resume
  calls succeeds and that state is consistent under load.

Fixes race where multiple resume calls could simultaneously transition
a paused agent to RUNNING, causing duplicate work and inconsistent state.

Closes orchestration-agent#3926

Development

Successfully merging this pull request may close these issues.

[ Bounty $8k ] [ Runtime ] Avoid race when resuming paused agents — resume runtime
