Control-agent blocks on synchronous worker spawns, causing complete unresponsiveness #185

@vltbaudbot

Summary

The control-agent can become completely unresponsive for extended periods (20+ minutes observed) when worker spawn commands block the main event loop. This creates a critical outage where all incoming messages are queued but never processed.

Problem Statement

Symptom: Control-agent stops processing inbox messages while the bridge continues to receive and forward them.

Root Cause: The control-agent executes pi session spawn commands synchronously via the bash tool. If the spawn process hangs or takes a long time, the control-agent cannot process any other messages until it completes.

Impact:

  • Complete loss of responsiveness to incoming requests
  • Messages queue up with no acknowledgment or processing
  • Users experience total silence from the system
  • No automatic detection or recovery
  • Requires manual intervention to restore service

Reproduction

  1. Run a control-agent with messages arriving in its queue
  2. Execute worker spawn: pi session spawn --name "dev-agent-..." --skill dev-agent ...
  3. If spawn fails/hangs (e.g., missing API key, config error):
    • Spawn retries multiple times
    • Each retry takes 30-60 seconds
    • Total: 5-10 minutes of blocking
  4. Meanwhile: All incoming messages queue up unprocessed

Proposed Solutions

1. Non-Blocking Worker Spawns (Critical Priority)

Change: Always run worker spawns in the background and verify them separately

Before (blocks):

pi session spawn --name "dev-agent-..." --skill dev-agent --model X 2>&1 | tail -5

After (non-blocking):

# Spawn in background
(pi session spawn \
  --name "dev-agent-..." \
  --skill dev-agent \
  --model X \
  > /tmp/spawn-${WORKER_ID}.log 2>&1 &)

# Wait briefly for socket
sleep 3

# Verify spawn succeeded
if [ -S ~/.pi/session-control/${WORKER_NAME}.sock ]; then
  echo "Worker spawned successfully"
  # Send task immediately
  send_to_session --sessionName "${WORKER_NAME}" --message "..."
else
  echo "Worker spawn failed - check logs"
fi
# Continue processing inbox immediately!

Implementation suggestions:

  • Update control-agent skill documentation with non-blocking spawn pattern
  • Add helper function/script for safe worker spawning
  • Warn in dev-agent docs about spawn timing
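The suggested helper could look like the following in the control-agent's own runtime. This is a minimal sketch, not an existing API: `spawnWorkerDetached` and the log path convention are illustrative names. It mirrors the shell pattern above: background the spawn, capture its output in a log, and return control immediately.

```typescript
import { spawn } from "node:child_process";
import { openSync } from "node:fs";

// Hypothetical helper: start a worker spawn command detached from the
// control-agent's event loop, redirect its output to a log file for
// later verification, and return immediately with the child's pid.
function spawnWorkerDetached(
  command: string,
  args: string[],
  logPath: string,
): number | undefined {
  const log = openSync(logPath, "a");
  const child = spawn(command, args, {
    detached: true,              // own process group: outlives the caller
    stdio: ["ignore", log, log], // stdout/stderr -> log file
  });
  child.unref(); // don't keep the control-agent's event loop alive for this child
  return child.pid;
}
```

The caller then checks for the worker's socket (as in the shell example) instead of waiting on the spawn command itself.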

2. Default Timeouts on Bash Tool (Critical Priority)

Problem: Bash commands can run indefinitely

Solution: Add configurable timeout with reasonable default

interface BashToolOptions {
  command: string;
  timeout?: number; // milliseconds, default: 300000 (5 min)
}

Implementation:

  • Default timeout: 5 minutes
  • Allow override for known long operations (CI polling, builds)
  • Kill process tree on timeout
  • Return timeout error to caller
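One way the interface above could be implemented in Node.js. This is a sketch under the assumption that the bash tool shells out via `child_process`; `runWithTimeout` is an illustrative name. The key detail is `detached: true`, which puts the child in its own process group so the whole tree can be killed on timeout.

```typescript
import { spawn } from "node:child_process";

interface BashResult {
  stdout: string;
  timedOut: boolean;
  exitCode: number | null;
}

// Run a shell command with a hard timeout (default 5 minutes).
// On timeout, signal the child's entire process group (negative pid)
// so grandchildren like a hung `pi session spawn` die too.
function runWithTimeout(command: string, timeoutMs = 300_000): Promise<BashResult> {
  return new Promise((resolve) => {
    const child = spawn("bash", ["-c", command], { detached: true });
    let stdout = "";
    let timedOut = false;
    child.stdout?.on("data", (chunk) => (stdout += chunk));
    const timer = setTimeout(() => {
      timedOut = true;
      try {
        process.kill(-child.pid!, "SIGKILL"); // negative pid = whole group
      } catch {
        // process already exited between timeout and kill
      }
    }, timeoutMs);
    child.on("exit", (code) => {
      clearTimeout(timer);
      resolve({ stdout, timedOut, exitCode: code });
    });
  });
}
```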

3. Async Message Processing (High Priority)

Current: Serial message processing - one message blocks all others

Proposed: Concurrent message handling with worker pool

┌──────────────┐
│ Message Queue│
└──────┬───────┘
       │
   ┌───▼────┬────────┬────────┐
   │Worker 1│Worker 2│Worker 3│  ← Process messages concurrently
   └────────┴────────┴────────┘

Benefits:

  • One slow message doesn't block others
  • Natural load balancing
  • Better resource utilization
  • Graceful degradation under load

Considerations:

  • Message ordering (when required)
  • Shared state management
  • Resource limits (max concurrent)
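The worker-pool diagram could be sketched as a bounded-concurrency drain loop. Names here are illustrative; the point is that with a concurrency limit, one slow message delays at most its own worker rather than the whole queue.

```typescript
// Drain a shared queue with up to `limit` concurrent handlers.
// Each "worker" is an async loop pulling from the queue; Node's
// single-threaded model makes the queue.shift() here race-free.
async function processConcurrently<T>(
  queue: T[],
  handler: (msg: T) => Promise<void>,
  limit = 3,
): Promise<void> {
  const workers = Array.from({ length: limit }, async () => {
    while (queue.length > 0) {
      const msg = queue.shift()!;
      await handler(msg);
    }
  });
  await Promise.all(workers);
}
```

Ordering guarantees, when required, would need a keyed queue (e.g. per-conversation serialization) layered on top of this.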

4. Health Monitoring & Auto-Recovery (High Priority)

Add: Watchdog process that monitors control-agent health

Check every 30-60 seconds:

  • Time since last message processed
  • Queue depth
  • Process responsiveness
  • Active child processes

Auto-recovery when stuck:

  1. Log state to incident file
  2. Identify blocking child processes
  3. Kill blocking processes
  4. Verify recovery
  5. Alert operator

Metrics to expose:

  • Messages processed per minute
  • Average processing time
  • Current queue depth
  • Worker count
  • Time since last activity
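The "time since last message processed" check could be tracked with a small heartbeat object like the sketch below (names assumed): the control-agent calls `heartbeat()` after each processed message, and the watchdog flags it as stuck when no heartbeat arrives within the stall window.

```typescript
// Minimal stall detector for the watchdog loop. A separate process
// would poll isStuck() every 30-60 seconds and trigger recovery.
class Watchdog {
  private lastActivity = Date.now();

  constructor(private stallMs: number) {}

  // Called by the control-agent after each processed message.
  heartbeat(): void {
    this.lastActivity = Date.now();
  }

  isStuck(now = Date.now()): boolean {
    return now - this.lastActivity > this.stallMs;
  }

  // Exposable as the "time since last activity" metric.
  msSinceLastActivity(now = Date.now()): number {
    return now - this.lastActivity;
  }
}
```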

5. Circuit Breaker for Operations (Medium Priority)

Pattern: Fail fast when operations consistently fail

Operation fails 3 times → 
  Circuit opens (skip attempts for 5 min) → 
  Retry after timeout → 
  Success → Circuit closes

Apply to:

  • Worker spawns
  • API calls
  • External service calls
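The fail-fast flow above can be sketched as a small state machine; the threshold and cooldown values are the illustrative ones from the diagram, not fixed requirements.

```typescript
type CircuitState = "closed" | "open" | "half-open";

// Circuit breaker matching the flow above: 3 consecutive failures open
// the circuit; after the cooldown one trial attempt is allowed
// (half-open); a success closes it, another failure reopens it.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: CircuitState = "closed";

  constructor(private threshold = 3, private cooldownMs = 300_000) {}

  canAttempt(now = Date.now()): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // allow one trial attempt
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(now = Date.now()): void {
    this.failures += 1;
    if (this.failures >= this.threshold || this.state === "half-open") {
      this.state = "open";
      this.openedAt = now;
    }
  }
}
```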

6. Graceful Degradation (Medium Priority)

Strategies when under load:

  1. Immediate acknowledgment: always respond within 5 seconds
    • "Got it! Working on this..."
  2. Queue position updates: tell waiting senders where they stand in the queue
  3. Auto-fallback responses:
    • After 2 min: "System under heavy load, will respond shortly"
  4. Priority queue:
    • Some messages (admins, urgent) jump the queue

7. Observability Improvements (Medium Priority)

Add status endpoints/tools:

  • /health - Is control-agent responsive?
  • /metrics - Queue depth, processing rate, worker count
  • /status - Current state, last activity timestamp

Structured logging:

{
  "timestamp": "2026-02-27T04:08:01Z",
  "level": "warn",
  "component": "control-agent",
  "event": "spawn_timeout",
  "worker_id": "dev-agent-xyz",
  "duration_ms": 300000
}

8. Worker Lifecycle Tracking (Low Priority)

Track state transitions:

SPAWNING → READY → ASSIGNED → WORKING → REPORTING → COMPLETE/FAILED

Benefits:

  • Identify slow spawns
  • Detect stuck workers
  • Optimize performance
  • Better resource cleanup
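The lifecycle above amounts to a linear state machine with FAILED reachable from every non-terminal state; a transition table like this sketch would let the tracker reject impossible transitions (and thereby surface stuck or misbehaving workers):

```typescript
type WorkerState =
  | "SPAWNING" | "READY" | "ASSIGNED"
  | "WORKING" | "REPORTING" | "COMPLETE" | "FAILED";

// Allowed transitions for the lifecycle above. COMPLETE and FAILED
// are terminal; FAILED is reachable from any active state so stuck
// workers can always be marked.
const TRANSITIONS: Record<WorkerState, WorkerState[]> = {
  SPAWNING: ["READY", "FAILED"],
  READY: ["ASSIGNED", "FAILED"],
  ASSIGNED: ["WORKING", "FAILED"],
  WORKING: ["REPORTING", "FAILED"],
  REPORTING: ["COMPLETE", "FAILED"],
  COMPLETE: [],
  FAILED: [],
};

function canTransition(from: WorkerState, to: WorkerState): boolean {
  return TRANSITIONS[from].includes(to);
}
```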

Testing Recommendations

Load Testing

  1. Spawn stress: Spawn 10 workers rapidly, verify no blocking
  2. Message flood: 20 messages in 10 seconds, verify all processed
  3. Long operations: Start 5-min operation, verify other work continues

Chaos Testing

  1. Kill random worker processes
  2. Block network for 5 minutes
  3. Fill disk to 99%
  4. Send malformed messages

Recovery Testing

  1. Verify auto-recovery from blocked state
  2. Verify graceful shutdown
  3. Verify worker cleanup on crash

Success Criteria

  • Control-agent never blocks for >30 seconds
  • All incoming messages acknowledged within 5 seconds
  • Messages processed concurrently when independent
  • System auto-recovers from blocked states
  • Health metrics visible and monitored
  • P95 response time < 30 seconds
  • Uptime > 99.9%

Additional Context

This issue was discovered during production operation when a worker spawn hung for 22 minutes, causing complete control-agent unresponsiveness. The worker itself functioned correctly once spawned, but the control-agent remained blocked waiting for the spawn command to return.

Manual intervention (killing blocking processes) immediately restored service, but highlighted the need for:

  1. Asynchronous operations by default
  2. Automatic health monitoring
  3. Self-recovery mechanisms
  4. Better visibility into system state

Related Work

  • Consider similar patterns in other agents (sentry-agent, dev-agent)
  • Review all bash tool usage for potential blocking operations
  • Audit all external process spawning
  • Review timeout usage across codebase

Priority: Critical
Effort: 2-3 weeks (all high-priority items)
Impact: Eliminates entire class of outage scenarios
