Summary
The control-agent can become completely unresponsive for extended periods (20+ minutes observed) when worker spawn commands block the main event loop. This creates a critical outage where all incoming messages are queued but never processed.
Problem Statement
Symptom: Control-agent stops processing inbox messages while bridge continues to receive and forward them.
Root Cause: The control-agent executes pi session spawn commands synchronously via the bash tool. If the spawn process hangs or takes a long time, the control-agent cannot process any other messages until it completes.
Impact:
- Complete loss of responsiveness to incoming requests
- Messages queue up with no acknowledgment or processing
- Users experience total silence from the system
- No automatic detection or recovery
- Requires manual intervention to restore service
Reproduction
- Start a control-agent with messages in its queue
- Execute a worker spawn:
  pi session spawn --name "dev-agent-..." --skill dev-agent ...
- If the spawn fails or hangs (e.g., missing API key, config error):
  - The spawn retries multiple times
  - Each retry takes 30-60 seconds
  - Total: 5-10 minutes of blocking
- Meanwhile, all incoming messages queue up unprocessed
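The failure mode can be reproduced in isolation without pi at all; in this sketch, a hypothetical `slow_spawn` stands in for a hung `pi session spawn`, and the serial loop shows every queued message waiting behind it:

```shell
#!/usr/bin/env bash
# slow_spawn stands in for a hung `pi session spawn` call (hypothetical).
slow_spawn() { sleep 2; }

process_message() { echo "processed: $1"; }

start=$(date +%s)
slow_spawn                     # synchronous: nothing else runs until it returns
for msg in msg1 msg2 msg3; do
  process_message "$msg"       # every message waits behind the spawn
done
echo "elapsed: $(( $(date +%s) - start ))s"
```

With the real spawn retrying for 5-10 minutes, the same structure produces the observed multi-minute silence.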
Proposed Solutions
1. Non-Blocking Worker Spawns (Critical Priority)
Change: Always background worker spawns and verify separately
Before (blocks):
pi session spawn --name "dev-agent-..." --skill dev-agent --model X 2>&1 | tail -5
After (non-blocking):
# Spawn in background
(pi session spawn \
--name "dev-agent-..." \
--skill dev-agent \
--model X \
> /tmp/spawn-${WORKER_ID}.log 2>&1 &)
# Wait briefly for socket
sleep 3
# Verify spawn succeeded
if [ -S ~/.pi/session-control/${WORKER_NAME}.sock ]; then
  echo "Worker spawned successfully"
  # Send task immediately
  send_to_session --sessionName "${WORKER_NAME}" --message "..."
else
  echo "Worker spawn failed - check logs"
fi
# Continue processing inbox immediately!
Implementation suggestions:
- Update control-agent skill documentation with non-blocking spawn pattern
- Add helper function/script for safe worker spawning
- Warn in dev-agent docs about spawn timing
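One shape the suggested helper could take (a sketch; `spawn_nonblocking` and its polling interval are illustrative, not existing pi tooling) replaces the fixed `sleep 3` with a short poll for the worker's socket:

```shell
#!/usr/bin/env bash
# Hypothetical helper wrapping the non-blocking spawn pattern above.
# spawn_nonblocking NAME SOCKET_PATH CMD... : backgrounds CMD, then polls
# for SOCKET_PATH instead of sleeping a fixed interval.
spawn_nonblocking() {
  local name=$1 socket=$2; shift 2
  local tries=${SPAWN_TRIES:-20}         # 20 x 0.5s = 10s budget by default
  ( "$@" > "/tmp/spawn-${name}.log" 2>&1 & )   # background, detached
  for _ in $(seq 1 "$tries"); do
    if [ -S "$socket" ]; then
      echo "worker ${name} ready"
      return 0
    fi
    sleep 0.5
  done
  echo "worker ${name} spawn failed - see /tmp/spawn-${name}.log" >&2
  return 1
}
```

A call might look like `spawn_nonblocking "$WORKER_NAME" ~/.pi/session-control/${WORKER_NAME}.sock pi session spawn --name "$WORKER_NAME" --skill dev-agent`; the caller returns to inbox processing within the polling budget either way.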
2. Default Timeouts on Bash Tool (Critical Priority)
Problem: Bash commands can run indefinitely
Solution: Add configurable timeout with reasonable default
interface BashToolOptions {
command: string;
timeout?: number; // milliseconds, default: 300000 (5 min)
}
Implementation:
- Default timeout: 5 minutes
- Allow override for known long operations (CI polling, builds)
- Kill process tree on timeout
- Return timeout error to caller
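Until the bash tool gains a native timeout option, coreutils `timeout` can approximate this from the caller's side. A sketch (`run_with_timeout` is a hypothetical wrapper; `timeout` requires GNU coreutils):

```shell
#!/usr/bin/env bash
# Interim mitigation: cap any command at TIMEOUT seconds (default 5 min),
# escalating from SIGTERM to SIGKILL 10s later.
run_with_timeout() {
  local limit=${TIMEOUT:-300}   # override per call, e.g. TIMEOUT=3600 for builds
  timeout --kill-after=10 "$limit" "$@"
  local status=$?
  # GNU `timeout` exits with 124 when the time limit was hit
  if [ "$status" -eq 124 ]; then
    echo "command timed out after ${limit}s: $*" >&2
  fi
  return "$status"
}
```

This covers the "allow override for known long operations" point via the `TIMEOUT` variable, though a proper fix belongs in the bash tool itself so every call is covered by default.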
3. Async Message Processing (High Priority)
Current: Serial message processing - one message blocks all others
Proposed: Concurrent message handling with worker pool
┌──────────────┐
│ Message Queue│
└──────┬───────┘
│
┌───▼────┬────────┬────────┐
│Worker 1│Worker 2│Worker 3│ ← Process messages concurrently
└────────┴────────┴────────┘
Benefits:
- One slow message doesn't block others
- Natural load balancing
- Better resource utilization
- Graceful degradation under load
Considerations:
- Message ordering (when required)
- Shared state management
- Resource limits (max concurrent)
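As a rough illustration of the pool idea in shell terms (a sketch assuming bash >= 4.3 for `wait -n`; the real fix would live in the agent runtime, and `MAX_CONCURRENT` is an assumed knob):

```shell
#!/usr/bin/env bash
# Bounded concurrent processing with plain bash jobs: each message is
# handled in the background, with at most MAX_CONCURRENT in flight, so
# one slow message no longer stalls the rest.
MAX_CONCURRENT=3

handle_message() {
  echo "handling: $1"   # placeholder for real message handling
}

process_queue() {
  local running=0
  for msg in "$@"; do
    handle_message "$msg" &
    running=$((running + 1))
    if [ "$running" -ge "$MAX_CONCURRENT" ]; then
      wait -n                      # block only until *some* handler finishes
      running=$((running - 1))
    fi
  done
  wait                             # drain remaining handlers
}
```

Note this drops strict ordering, which is exactly the first consideration above; messages that must be ordered would need a per-conversation serial lane.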
4. Health Monitoring & Auto-Recovery (High Priority)
Add: Watchdog process that monitors control-agent health
Check every 30-60 seconds:
- Time since last message processed
- Queue depth
- Process responsiveness
- Active child processes
Auto-recovery when stuck:
- Log state to incident file
- Identify blocking child processes
- Kill blocking processes
- Verify recovery
- Alert operator
Metrics to expose:
- Messages processed per minute
- Average processing time
- Current queue depth
- Worker count
- Time since last activity
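A minimal watchdog check could look like this sketch. It assumes a hypothetical convention (not an existing pi feature) that the control-agent touches a heartbeat file each time it processes a message; staleness then signals the blocked state. `stat -c %Y` is GNU-specific (macOS would need `stat -f %m`):

```shell
#!/usr/bin/env bash
# Watchdog sketch: flag the control-agent as stuck when the heartbeat
# file has not been touched within STALE_AFTER seconds.
HEARTBEAT_FILE=${HEARTBEAT_FILE:-/tmp/control-agent.heartbeat}
STALE_AFTER=${STALE_AFTER:-120}

check_health() {
  local now last age
  now=$(date +%s)
  last=$(stat -c %Y "$HEARTBEAT_FILE" 2>/dev/null || echo 0)
  age=$((now - last))
  if [ "$age" -gt "$STALE_AFTER" ]; then
    echo "STUCK: no activity for ${age}s"
    return 1
  fi
  echo "OK: last activity ${age}s ago"
}
```

Run from cron or a loop every 30-60 seconds, a STUCK result would trigger the recovery steps above (log state, identify and kill blocking children, alert).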
5. Circuit Breaker for Operations (Medium Priority)
Pattern: Fail fast when operations consistently fail
Operation fails 3 times →
Circuit opens (skip attempts for 5 min) →
Retry after timeout →
Success → Circuit closes
Apply to:
- Worker spawns
- API calls
- External service calls
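The breaker state can be kept as simply as a file per operation. A hypothetical sketch (state path, thresholds, and the `with_breaker` name are all illustrative):

```shell
#!/usr/bin/env bash
# File-based circuit breaker: after MAX_FAILURES consecutive failures the
# circuit opens and calls are skipped until COOLDOWN seconds have passed.
MAX_FAILURES=3
COOLDOWN=300

with_breaker() {
  local name=$1; shift
  local state="/tmp/breaker-${name}"
  local fails opened now
  now=$(date +%s)
  fails=$(cut -d' ' -f1 "$state" 2>/dev/null || echo 0)
  opened=$(cut -d' ' -f2 "$state" 2>/dev/null || echo 0)
  if [ "$fails" -ge "$MAX_FAILURES" ] && [ $((now - opened)) -lt "$COOLDOWN" ]; then
    echo "circuit open for ${name}, skipping" >&2
    return 2
  fi
  if "$@"; then
    rm -f "$state"                      # success closes the circuit
  else
    echo "$((fails + 1)) $now" > "$state"   # record failure count + timestamp
    return 1
  fi
}
```

A spawn guarded as `with_breaker spawn pi session spawn ...` would fail fast after three consecutive spawn failures instead of burning 5-10 minutes on retries each time.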
6. Graceful Degradation (Medium Priority)
Strategies when under load:
- Immediate acknowledgment: Always respond within 5 seconds
  - "Got it! Working on this..."
- Queue position updates:
  - "Request queued, position #3 of 7"
- Auto-fallback responses:
  - After 2 min: "System under heavy load, will respond shortly"
- Priority queue:
  - Some messages (admins, urgent) jump the queue
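The immediate-acknowledgment strategy amounts to replying before doing the work. A minimal sketch (`do_work` is a placeholder for real handling):

```shell
#!/usr/bin/env bash
# Acknowledge within the 5-second budget, then process in the background.
do_work() { :; }   # placeholder for real message handling

handle_request() {
  local msg=$1
  echo "Got it! Working on: ${msg}"   # immediate acknowledgment
  do_work "$msg" &                    # real processing continues asynchronously
}
```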
7. Observability Improvements (Medium Priority)
Add status endpoints/tools:
- /health - Is the control-agent responsive?
- /metrics - Queue depth, processing rate, worker count
- /status - Current state, last activity timestamp
Structured logging:
{
"timestamp": "2026-02-27T04:08:01Z",
"level": "warn",
"component": "control-agent",
"event": "spawn_timeout",
"worker_id": "dev-agent-xyz",
"duration_ms": 300000
}
8. Worker Lifecycle Tracking (Low Priority)
Track state transitions:
SPAWNING → READY → ASSIGNED → WORKING → REPORTING → COMPLETE/FAILED
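A minimal way to record these transitions (hypothetical convention: one append-only state file per worker, latest line wins):

```shell
#!/usr/bin/env bash
# Append a timestamped transition; the last line is the current state.
set_worker_state() {
  local worker=$1 state=$2
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) ${state}" >> "/tmp/worker-${worker}.state"
}

worker_state() {
  tail -n1 "/tmp/worker-$1.state" 2>/dev/null | awk '{print $2}'
}
```

Because each transition carries a timestamp, slow spawns (long SPAWNING→READY gaps) and stuck workers (no transition for too long) fall out of the same data.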
Benefits:
- Identify slow spawns
- Detect stuck workers
- Optimize performance
- Better resource cleanup
Testing Recommendations
Load Testing
- Spawn stress: Spawn 10 workers rapidly, verify no blocking
- Message flood: 20 messages in 10 seconds, verify all processed
- Long operations: Start 5-min operation, verify other work continues
Chaos Testing
- Kill random worker processes
- Block network for 5 minutes
- Fill disk to 99%
- Send malformed messages
Recovery Testing
- Verify auto-recovery from blocked state
- Verify graceful shutdown
- Verify worker cleanup on crash
Success Criteria
- Control-agent never blocks for >30 seconds
- All incoming messages acknowledged within 5 seconds
- Messages processed concurrently when independent
- System auto-recovers from blocked states
- Health metrics visible and monitored
- P95 response time < 30 seconds
- Uptime > 99.9%
Additional Context
This issue was discovered during production operation when a worker spawn hung for 22 minutes, causing complete control-agent unresponsiveness. The worker itself functioned correctly once spawned, but the control-agent remained blocked waiting for the spawn command to return.
Manual intervention (killing blocking processes) immediately restored service, but highlighted the need for:
- Asynchronous operations by default
- Automatic health monitoring
- Self-recovery mechanisms
- Better visibility into system state
Related Work
- Consider similar patterns in other agents (sentry-agent, dev-agent)
- Review all bash tool usage for potential blocking operations
- Audit all external process spawning
- Review timeout usage across codebase
Priority: Critical
Effort: 2-3 weeks (all high-priority items)
Impact: Eliminates entire class of outage scenarios