Control-agent blocks on synchronous worker spawns, causing complete unresponsiveness #185

@vltbaudbot

Summary

The control-agent can become completely unresponsive for extended periods (20+ minutes observed) when worker spawn commands block the main event loop. This creates a critical outage where all incoming messages are queued but never processed.

Problem Statement

Symptom: Control-agent stops processing inbox messages while the bridge continues to receive and forward them.

Root Cause: The control-agent executes pi session spawn commands synchronously via the bash tool. If the spawn process hangs or takes a long time, the control-agent cannot process any other messages until it completes.

Impact:

  • Complete loss of responsiveness to incoming requests
  • Messages queue up with no acknowledgment or processing
  • Users experience total silence from the system
  • No automatic detection or recovery
  • Requires manual intervention to restore service

Reproduction

  1. Run a control-agent with messages arriving in its queue
  2. Execute worker spawn: pi session spawn --name "dev-agent-..." --skill dev-agent ...
  3. If spawn fails/hangs (e.g., missing API key, config error):
    • Spawn retries multiple times
    • Each retry takes 30-60 seconds
    • Total: 5-10 minutes of blocking
  4. Meanwhile: All incoming messages queue up unprocessed

Proposed Solutions

1. Non-Blocking Worker Spawns (Critical Priority)

Change: Always run worker spawns in the background and verify them separately

Before (blocks):

pi session spawn --name "dev-agent-..." --skill dev-agent --model X 2>&1 | tail -5

After (non-blocking):

# Spawn in background
(pi session spawn \
  --name "dev-agent-..." \
  --skill dev-agent \
  --model X \
  > /tmp/spawn-${WORKER_ID}.log 2>&1 &)

# Wait briefly for socket
sleep 3

# Verify spawn succeeded
if [ -S ~/.pi/session-control/${WORKER_NAME}.sock ]; then
  echo "Worker spawned successfully"
  # Send task immediately
  send_to_session --sessionName "${WORKER_NAME}" --message "..."
else
  echo "Worker spawn failed - check logs"
fi
# Continue processing inbox immediately!

Implementation suggestions:

  • Update control-agent skill documentation with non-blocking spawn pattern
  • Add helper function/script for safe worker spawning
  • Warn in dev-agent docs about spawn timing
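The suggested helper could look like the following in the control-agent's own runtime. This is a minimal sketch, not an existing API: `spawnWorkerDetached` and the log path convention are illustrative names. It mirrors the shell pattern above: background the spawn, capture its output in a log, and return control immediately.

```typescript
import { spawn } from "node:child_process";
import { openSync } from "node:fs";

// Hypothetical helper: start a worker spawn command detached from the
// control-agent's event loop, redirect its output to a log file for
// later verification, and return immediately with the child's pid.
function spawnWorkerDetached(
  command: string,
  args: string[],
  logPath: string,
): number | undefined {
  const log = openSync(logPath, "a");
  const child = spawn(command, args, {
    detached: true,              // own process group: outlives the caller
    stdio: ["ignore", log, log], // stdout/stderr -> log file
  });
  child.unref(); // don't keep the control-agent's event loop alive for this child
  return child.pid;
}
```

The caller then checks for the worker's socket (as in the shell example) instead of waiting on the spawn command itself.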

2. Default Timeouts on Bash Tool (Critical Priority)

Problem: Bash commands can run indefinitely

Solution: Add configurable timeout with reasonable default

interface BashToolOptions {
  command: string;
  timeout?: number; // milliseconds, default: 300000 (5 min)
}

Implementation:

  • Default timeout: 5 minutes
  • Allow override for known long operations (CI polling, builds)
  • Kill process tree on timeout
  • Return timeout error to caller
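One way the interface above could be implemented in Node.js. This is a sketch under the assumption that the bash tool shells out via `child_process`; `runWithTimeout` is an illustrative name. The key detail is `detached: true`, which puts the child in its own process group so the whole tree can be killed on timeout.

```typescript
import { spawn } from "node:child_process";

interface BashResult {
  stdout: string;
  timedOut: boolean;
  exitCode: number | null;
}

// Run a shell command with a hard timeout (default 5 minutes).
// On timeout, signal the child's entire process group (negative pid)
// so grandchildren like a hung `pi session spawn` die too.
function runWithTimeout(command: string, timeoutMs = 300_000): Promise<BashResult> {
  return new Promise((resolve) => {
    const child = spawn("bash", ["-c", command], { detached: true });
    let stdout = "";
    let timedOut = false;
    child.stdout?.on("data", (chunk) => (stdout += chunk));
    const timer = setTimeout(() => {
      timedOut = true;
      try {
        process.kill(-child.pid!, "SIGKILL"); // negative pid = whole group
      } catch {
        // process already exited between timeout and kill
      }
    }, timeoutMs);
    child.on("exit", (code) => {
      clearTimeout(timer);
      resolve({ stdout, timedOut, exitCode: code });
    });
  });
}
```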

3. Async Message Processing (High Priority)

Current: Serial message processing - one message blocks all others

Proposed: Concurrent message handling with worker pool

┌──────────────┐
│ Message Queue│
└──────┬───────┘
       │
   ┌───▼────┬────────┬────────┐
   │Worker 1│Worker 2│Worker 3│  ← Process messages concurrently
   └────────┴────────┴────────┘

Benefits:

  • One slow message doesn't block others
  • Natural load balancing
  • Better resource utilization
  • Graceful degradation under load

Considerations:

  • Message ordering (when required)
  • Shared state management
  • Resource limits (max concurrent)
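The worker-pool diagram could be sketched as a bounded-concurrency drain loop. Names here are illustrative; the point is that with a concurrency limit, one slow message delays at most its own worker rather than the whole queue.

```typescript
// Drain a shared queue with up to `limit` concurrent handlers.
// Each "worker" is an async loop pulling from the queue; Node's
// single-threaded model makes the queue.shift() here race-free.
async function processConcurrently<T>(
  queue: T[],
  handler: (msg: T) => Promise<void>,
  limit = 3,
): Promise<void> {
  const workers = Array.from({ length: limit }, async () => {
    while (queue.length > 0) {
      const msg = queue.shift()!;
      await handler(msg);
    }
  });
  await Promise.all(workers);
}
```

Ordering guarantees, when required, would need a keyed queue (e.g. per-conversation serialization) layered on top of this.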

4. Health Monitoring & Auto-Recovery (High Priority)

Add: Watchdog process that monitors control-agent health

Check every 30-60 seconds:

  • Time since last message processed
  • Queue depth
  • Process responsiveness
  • Active child processes

Auto-recovery when stuck:

  1. Log state to incident file
  2. Identify blocking child processes
  3. Kill blocking processes
  4. Verify recovery
  5. Alert operator

Metrics to expose:

  • Messages processed per minute
  • Average processing time
  • Current queue depth
  • Worker count
  • Time since last activity
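The "time since last message processed" check could be tracked with a small heartbeat object like the sketch below (names assumed): the control-agent calls `heartbeat()` after each processed message, and the watchdog flags it as stuck when no heartbeat arrives within the stall window.

```typescript
// Minimal stall detector for the watchdog loop. A separate process
// would poll isStuck() every 30-60 seconds and trigger recovery.
class Watchdog {
  private lastActivity = Date.now();

  constructor(private stallMs: number) {}

  // Called by the control-agent after each processed message.
  heartbeat(): void {
    this.lastActivity = Date.now();
  }

  isStuck(now = Date.now()): boolean {
    return now - this.lastActivity > this.stallMs;
  }

  // Exposable as the "time since last activity" metric.
  msSinceLastActivity(now = Date.now()): number {
    return now - this.lastActivity;
  }
}
```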

5. Circuit Breaker for Operations (Medium Priority)

Pattern: Fail fast when operations consistently fail

Operation fails 3 times → 
  Circuit opens (skip attempts for 5 min) → 
  Retry after timeout → 
  Success → Circuit closes

Apply to:

  • Worker spawns
  • API calls
  • External service calls
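The fail-fast flow above can be sketched as a small state machine; the threshold and cooldown values are the illustrative ones from the diagram, not fixed requirements.

```typescript
type CircuitState = "closed" | "open" | "half-open";

// Circuit breaker matching the flow above: 3 consecutive failures open
// the circuit; after the cooldown one trial attempt is allowed
// (half-open); a success closes it, another failure reopens it.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: CircuitState = "closed";

  constructor(private threshold = 3, private cooldownMs = 300_000) {}

  canAttempt(now = Date.now()): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // allow one trial attempt
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(now = Date.now()): void {
    this.failures += 1;
    if (this.failures >= this.threshold || this.state === "half-open") {
      this.state = "open";
      this.openedAt = now;
    }
  }
}
```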

6. Graceful Degradation (Medium Priority)

Strategies when under load:

  1. Immediate acknowledgment: always respond within 5 seconds
    • "Got it! Working on this..."
  2. Queue position updates: tell waiting senders where they stand in the queue
  3. Auto-fallback responses:
    • After 2 min: "System under heavy load, will respond shortly"
  4. Priority queue:
    • Some messages (admins, urgent) jump the queue

7. Observability Improvements (Medium Priority)

Add status endpoints/tools:

  • /health - Is control-agent responsive?
  • /metrics - Queue depth, processing rate, worker count
  • /status - Current state, last activity timestamp

Structured logging:

{
  "timestamp": "2026-02-27T04:08:01Z",
  "level": "warn",
  "component": "control-agent",
  "event": "spawn_timeout",
  "worker_id": "dev-agent-xyz",
  "duration_ms": 300000
}

8. Worker Lifecycle Tracking (Low Priority)

Track state transitions:

SPAWNING → READY → ASSIGNED → WORKING → REPORTING → COMPLETE/FAILED

Benefits:

  • Identify slow spawns
  • Detect stuck workers
  • Optimize performance
  • Better resource cleanup
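The lifecycle above amounts to a linear state machine with FAILED reachable from every non-terminal state; a transition table like this sketch would let the tracker reject impossible transitions (and thereby surface stuck or misbehaving workers):

```typescript
type WorkerState =
  | "SPAWNING" | "READY" | "ASSIGNED"
  | "WORKING" | "REPORTING" | "COMPLETE" | "FAILED";

// Allowed transitions for the lifecycle above. COMPLETE and FAILED
// are terminal; FAILED is reachable from any active state so stuck
// workers can always be marked.
const TRANSITIONS: Record<WorkerState, WorkerState[]> = {
  SPAWNING: ["READY", "FAILED"],
  READY: ["ASSIGNED", "FAILED"],
  ASSIGNED: ["WORKING", "FAILED"],
  WORKING: ["REPORTING", "FAILED"],
  REPORTING: ["COMPLETE", "FAILED"],
  COMPLETE: [],
  FAILED: [],
};

function canTransition(from: WorkerState, to: WorkerState): boolean {
  return TRANSITIONS[from].includes(to);
}
```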

Testing Recommendations

Load Testing

  1. Spawn stress: Spawn 10 workers rapidly, verify no blocking
  2. Message flood: 20 messages in 10 seconds, verify all processed
  3. Long operations: Start 5-min operation, verify other work continues

Chaos Testing

  1. Kill random worker processes
  2. Block network for 5 minutes
  3. Fill disk to 99%
  4. Send malformed messages

Recovery Testing

  1. Verify auto-recovery from blocked state
  2. Verify graceful shutdown
  3. Verify worker cleanup on crash

Success Criteria

  • Control-agent never blocks for >30 seconds
  • All incoming messages acknowledged within 5 seconds
  • Messages processed concurrently when independent
  • System auto-recovers from blocked states
  • Health metrics visible and monitored
  • P95 response time < 30 seconds
  • Uptime > 99.9%

Additional Context

This issue was discovered during production operation when a worker spawn hung for 22 minutes, causing complete control-agent unresponsiveness. The worker itself functioned correctly once spawned, but the control-agent remained blocked waiting for the spawn command to return.

Manual intervention (killing blocking processes) immediately restored service, but highlighted the need for:

  1. Asynchronous operations by default
  2. Automatic health monitoring
  3. Self-recovery mechanisms
  4. Better visibility into system state

Related Work

  • Consider similar patterns in other agents (sentry-agent, dev-agent)
  • Review all bash tool usage for potential blocking operations
  • Audit all external process spawning
  • Review timeout usage across codebase

Priority: Critical
Effort: 2-3 weeks (all high-priority items)
Impact: Eliminates entire class of outage scenarios
