An autonomous SRE agent system with chaos engineering testing.
In modern cloud-native environments, Site Reliability Engineers (SREs) face several critical challenges:
- 24/7 Incident Response: Production systems fail at any time, requiring constant human monitoring and immediate response, leading to burnout and high operational costs.
- Manual Diagnostics: Engineers spend significant time reading through logs, correlating events, and identifying root causes during incidents - time that could be better spent on prevention and innovation.
- Slow Mean Time to Recovery (MTTR): Even with runbooks and documentation, the time between incident detection and resolution remains high due to human involvement in the diagnostic and remediation loop.
- Alert Fatigue: Teams are overwhelmed with alerts, many of which could be resolved automatically with the right context and decision-making capabilities.
- Inconsistent Response: Different engineers may handle the same incident differently, leading to varying resolution times and outcomes.
- Lack of Proactive Testing: Chaos engineering and failure injection are often manual processes, making it difficult to validate system resilience continuously.
CodeWeaver is an autonomous AI-powered SRE agent that combines intelligent log analysis, self-healing capabilities, and chaos engineering to deliver true autonomous incident response:
🤖 Autonomous Incident Response
- Receives alerts via webhooks and immediately springs into action
- No human intervention required for common failure scenarios
- Continuous learning from incident patterns
🔍 AI-Powered Diagnostics
- Uses an LLM served through the Groq API to analyze logs and identify root causes
- Correlates errors across distributed systems
- Provides intelligent remediation recommendations
🛠️ Self-Healing Execution
- Generates and executes Python scripts to resolve issues autonomously
- Interacts with service APIs to trigger recovery actions
- Validates successful remediation
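As a rough illustration of the self-healing step, a generated script could be run in a separate interpreter with a timeout so that a bad script cannot hang the agent. This is a minimal sketch, not CodeWeaver's actual executor: the function name `run_remediation_script` and the timeout default are assumptions.

```python
import subprocess
import sys
from typing import Tuple


def run_remediation_script(script: str, timeout_s: float = 10.0) -> Tuple[bool, str]:
    """Run a generated Python script in a child interpreter (hypothetical helper).

    Returns (succeeded, combined stdout/stderr text).
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", script],  # child process, not exec() in the agent
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    # A non-zero exit code is treated as a failed remediation.
    return result.returncode == 0, (result.stdout + result.stderr).strip()
```

Running the script in a child process keeps a crash or an infinite loop in generated code from taking the agent down with it.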
🧪 Built-in Chaos Engineering
- Integrated chaos-app for continuous resilience testing
- Simulates real-world failures (network issues, service crashes, etc.)
- Validates that the autonomous agent can handle failures before they occur in production
📊 Full Observability
- Shared log volumes for seamless log access
- Real-time monitoring through structured logging
- Transparent decision-making process
- Detection: CodeWeaver receives an alert via webhook when a service degrades
- Diagnosis: AI agent reads logs from shared volumes and analyzes error patterns
- Planning: LLM generates an execution plan with Python scripts to resolve the issue
- Execution: Agent automatically executes the remediation scripts
- Validation: Verifies that the service has been restored to healthy state
- Learning: Logs the entire process for future reference and improvement
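The six steps above can be sketched as a single function that wires the stages together. This is a hedged outline under assumptions: `IncidentRecord`, `handle_incident`, and the callable parameters are illustrative names, and the real agent splits this work across `diagnoser.py`, `planner.py`, and `executor.py`.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class IncidentRecord:
    """Everything the agent did for one alert (hypothetical shape)."""
    alert: dict
    diagnosis: str = ""
    plan: List[str] = field(default_factory=list)
    executed: List[str] = field(default_factory=list)
    healthy: bool = False


def handle_incident(
    alert: dict,
    read_logs: Callable[[], str],
    diagnose: Callable[[str], str],
    make_plan: Callable[[str], List[str]],
    execute: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> IncidentRecord:
    record = IncidentRecord(alert=alert)        # 1. Detection: alert arrived
    record.diagnosis = diagnose(read_logs())    # 2. Diagnosis: analyze shared logs
    record.plan = make_plan(record.diagnosis)   # 3. Planning: list of actions
    for step in record.plan:                    # 4. Execution: run each action
        execute(step)
        record.executed.append(step)
    record.healthy = is_healthy()               # 5. Validation: re-check the service
    return record                               # 6. Learning: caller logs the record
```

Passing each stage in as a callable keeps the flow testable with fakes, without a running LLM or service.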
- ⚡ Reduced MTTR: From minutes/hours to seconds

- 💰 Lower Operational Costs: Reduces need for 24/7 on-call rotations
- 🎯 Consistent Response: Same high-quality resolution every time
- 🛡️ Proactive Resilience: Continuous chaos testing ensures readiness
- 😌 Reduced Burnout: Engineers focus on innovation, not firefighting
- Docker and Docker Compose installed
- Groq API key (get from https://console.groq.com/keys)
Create a .env file in the Core directory:
GROQ_API_KEY=your_groq_api_key_here
docker compose up --build
This will start:
- chaos-app on port 8000 - Service that can simulate failures
- codeweaver-agent on port 8001 - Autonomous SRE agent
Trigger a failure:
curl -X POST http://localhost:8000/chaos/trigger
Send alert to trigger autonomous recovery:
curl -X POST http://localhost:8001/webhook/alert \
-H "Content-Type: application/json" \
-d '{"data": {"message": "Service Down", "severity": "critical"}}'
Watch the magic happen:
- CodeWeaver reads logs from shared volume
- AI analyzes the error (ConnectionRefused)
- Plans a restart action
- Executes POST /chaos/resolve
- Service recovers automatically! ✨
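The diagnose-then-plan portion of this flow can be illustrated with a deterministic stand-in for the LLM. This sketch only pattern-matches the error named above and maps it to the resolve call; the function names and the plan's dictionary shape are assumptions, and the real agent delegates this reasoning to the model.

```python
import re
from typing import Optional

# Error signature mentioned in this README; the regex itself is an assumption.
ERROR_PATTERN = re.compile(r"ConnectionRefused(Error)?")


def find_error(log_text: str) -> Optional[str]:
    """Return the most recent log line matching the error pattern, or None."""
    matches = [line for line in log_text.splitlines() if ERROR_PATTERN.search(line)]
    return matches[-1] if matches else None


def plan_action(error_line: Optional[str]) -> Optional[dict]:
    """Map a detected error to a recovery request (hypothetical plan shape)."""
    if error_line is None:
        return None  # nothing to fix
    return {"method": "POST", "url": "http://chaos-app:8000/chaos/resolve"}
```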
┌─────────────────────┐         ┌─────────────────────┐
│      Chaos App      │         │  CodeWeaver Agent   │
│     Port: 8000      │◄────────│     Port: 8001      │
│                     │         │                     │
│ Simulates failures  │         │  Monitors & Fixes   │
│ Writes logs         │         │  Reads logs via     │
│ /var/log/chaos-app  │────────►│  shared volume      │
└─────────────────────┘         └─────────────────────┘
           │                               │
           └───────────┬───────────────────┘
                       │
                 Shared Volume
                 (shared-logs)
codeweaver/
├── docker-compose.yml      # Orchestration config
├── Core/                   # SRE Agent
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── .env                # API keys
│   └── src/
│       ├── main.py         # FastAPI app
│       ├── diagnoser.py    # AI log analysis
│       ├── planner.py      # Action planning
│       └── executor.py     # Action execution
└── chaos-app/              # Test service
    ├── Dockerfile
    ├── requirements.txt
    └── main.py
- Endpoints (chaos-app):
  - GET / - Health check
  - GET /buy - Payment endpoint (fails when broken)
  - GET /status - Check if in chaos mode
  - POST /chaos/trigger - Activate chaos mode
  - POST /chaos/resolve - Deactivate chaos mode
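The chaos-app endpoints amount to a small state machine around one flag. The sketch below models that behavior without a web framework so it can be read and tested in isolation; the class name `ChaosApp` and every response body are assumptions, and only the routes and the 500 status on a broken /buy come from this README.

```python
from typing import Tuple


class ChaosApp:
    """In-memory model of the chaos-app routes (hypothetical, framework-free)."""

    def __init__(self) -> None:
        self.chaos_mode = False

    def handle(self, method: str, path: str) -> Tuple[int, dict]:
        route = (method.upper(), path)
        if route == ("GET", "/"):
            return 200, {"status": "ok"}
        if route == ("GET", "/status"):
            return 200, {"chaos_mode": self.chaos_mode}
        if route == ("POST", "/chaos/trigger"):
            self.chaos_mode = True   # start simulating failures
            return 200, {"chaos_mode": True}
        if route == ("POST", "/chaos/resolve"):
            self.chaos_mode = False  # recover
            return 200, {"chaos_mode": False}
        if route == ("GET", "/buy"):
            if self.chaos_mode:
                return 500, {"error": "ConnectionRefusedError"}
            return 200, {"result": "purchase ok"}
        return 404, {"error": "not found"}
```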
- Endpoints (codeweaver-agent):
  - GET / - Health check
  - POST /webhook/alert - Receive alerts and trigger autonomous recovery
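The alert examples in this README send a JSON body with a `data` object holding `message` and, in one example, `severity`. A webhook handler might validate that shape roughly as follows. The `Alert` class, `parse_alert`, and the `"unknown"` default are assumptions for illustration, not the agent's actual schema.

```python
import json
from dataclasses import dataclass


@dataclass
class Alert:
    message: str
    severity: str


def parse_alert(body: str) -> Alert:
    """Parse a webhook body of the form {"data": {"message": ..., "severity": ...}}."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("alert body must be a JSON object")
    data = payload.get("data")
    if not isinstance(data, dict) or "message" not in data:
        raise ValueError("alert must contain data.message")
    # severity is optional in the examples, so fall back to an assumed default
    return Alert(message=str(data["message"]), severity=str(data.get("severity", "unknown")))
```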
- Shared Logs: Volume shared-logs allows agent to read chaos-app logs
- Custom Network: codeweaver-net bridge for service communication
- Health Checks: chaos-app must be healthy before agent starts
- Environment: GROQ_API_KEY passed from .env file
cd Core
python -m venv venv
.\venv\Scripts\Activate.ps1 # Windows
source venv/bin/activate # Linux/Mac
pip install -r requirements.txt
uvicorn src.main:app --host 0.0.0.0 --port 8001 --reload
cd chaos-app
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
View logs:
# All services
docker-compose logs -f
# Specific service
docker-compose logs -f codeweaver-agent
docker-compose logs -f chaos-app
# Shared logs volume
docker exec -it codeweaver-agent cat /logs/chaos-app/service.log
- Start the system:
  docker-compose up -d
- Verify services are running:
  curl http://localhost:8000/
  curl http://localhost:8001/
- Trigger chaos:
  curl -X POST http://localhost:8000/chaos/trigger
- Verify service is broken:
  curl http://localhost:8000/buy # Should return 500 error
- Send alert to CodeWeaver:
  curl -X POST http://localhost:8001/webhook/alert \
  -H "Content-Type: application/json" \
  -d '{"data": {"message": "Critical failure"}}'
- CodeWeaver will automatically:
  - Read logs from /logs/chaos-app/service.log
  - Detect ConnectionRefusedError
  - Plan restart action
  - Execute POST http://chaos-app:8000/chaos/resolve
  - Service recovers!
- Verify recovery:
  curl http://localhost:8000/buy # Should return success
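The final verification step can be automated by polling /buy until it succeeds instead of re-running curl by hand. This is a minimal sketch: `wait_for_recovery`, `buy_succeeds`, and the timeout values are assumed names and defaults, and only the URL comes from the steps above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def wait_for_recovery(check: Callable[[], bool],
                      timeout_s: float = 30.0,
                      interval_s: float = 1.0) -> bool:
    """Call check() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def buy_succeeds(url: str = "http://localhost:8000/buy") -> bool:
    """True if the endpoint answers 200; HTTP errors and refused connections count as failure."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

With the stack running, `wait_for_recovery(buy_succeeds)` would return True once the agent has resolved the chaos mode, or False if recovery does not happen within the timeout.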
Built for autonomous SRE operations with AI-powered incident response.