rooney011/CodeWeaver

CodeWeaver System

An autonomous SRE agent system with chaos engineering testing.

🎯 Problem Statement

In modern cloud-native environments, Site Reliability Engineers (SREs) face several critical challenges:

  • 24/7 Incident Response: Production systems fail at any time, requiring constant human monitoring and immediate response, leading to burnout and high operational costs.
  • Manual Diagnostics: Engineers spend significant time reading through logs, correlating events, and identifying root causes during incidents, time that could be better spent on prevention and innovation.
  • Slow Mean Time to Recovery (MTTR): Even with runbooks and documentation, the time between incident detection and resolution remains high due to human involvement in the diagnostic and remediation loop.
  • Alert Fatigue: Teams are overwhelmed with alerts, many of which could be resolved automatically with the right context and decision-making capabilities.
  • Inconsistent Response: Different engineers may handle the same incident differently, leading to varying resolution times and outcomes.
  • Lack of Proactive Testing: Chaos engineering and failure injection are often manual processes, making it difficult to validate system resilience continuously.

💡 Our Solution: CodeWeaver

CodeWeaver is an autonomous AI-powered SRE agent that combines intelligent log analysis, self-healing capabilities, and chaos engineering to deliver autonomous incident response.

Key Features

🤖 Autonomous Incident Response

  • Receives alerts via webhooks and immediately springs into action
  • No human intervention required for common failure scenarios
  • Continuous learning from incident patterns

🔍 AI-Powered Diagnostics

  • Uses an LLM served through the Groq API to analyze logs and identify root causes
  • Correlates errors across distributed systems
  • Provides intelligent remediation recommendations

🛠️ Self-Healing Execution

  • Generates and executes Python scripts to resolve issues autonomously
  • Interacts with service APIs to trigger recovery actions
  • Validates successful remediation

🧪 Built-in Chaos Engineering

  • Integrated chaos-app for continuous resilience testing
  • Simulates real-world failures (network issues, service crashes, etc.)
  • Validates that the autonomous agent can handle failures before they occur in production

📊 Full Observability

  • Shared log volumes for seamless log access
  • Real-time monitoring through structured logging
  • Transparent decision-making process

How It Works

  1. Detection: CodeWeaver receives an alert via webhook when a service degrades
  2. Diagnosis: AI agent reads logs from shared volumes and analyzes error patterns
  3. Planning: LLM generates an execution plan with Python scripts to resolve the issue
  4. Execution: Agent automatically executes the remediation scripts
  5. Validation: Verifies that the service has been restored to healthy state
  6. Learning: Logs the entire process for future reference and improvement
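The six steps above can be sketched as a single function that threads an incident record through each stage. This is a minimal illustration, not the project's implementation: `IncidentRecord`, `handle_alert`, and the stage callables are hypothetical names, and the stages are injected so the flow can be exercised without an LLM or a running service.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class IncidentRecord:
    """Audit trail for one alert, filled in as each stage runs."""
    alert: dict
    diagnosis: str = ""
    plan: List[str] = field(default_factory=list)
    executed: List[str] = field(default_factory=list)
    recovered: bool = False


def handle_alert(
    alert: dict,
    read_logs: Callable[[], str],
    diagnose: Callable[[str], str],
    make_plan: Callable[[str], List[str]],
    execute: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> IncidentRecord:
    record = IncidentRecord(alert=alert)        # 1. detection: the alert arrived
    record.diagnosis = diagnose(read_logs())    # 2. diagnosis from the shared logs
    record.plan = make_plan(record.diagnosis)   # 3. planning
    for step in record.plan:                    # 4. execution, one step at a time
        execute(step)
        record.executed.append(step)
    record.recovered = is_healthy()             # 5. validation
    return record                               # 6. the caller persists the record


if __name__ == "__main__":
    # Stand-in stages (all hypothetical) that mimic the chaos-app scenario.
    state = {"broken": True}
    print(handle_alert(
        alert={"data": {"message": "Service Down"}},
        read_logs=lambda: "ConnectionRefusedError",
        diagnose=lambda logs: "connection_refused" if "ConnectionRefusedError" in logs else "unknown",
        make_plan=lambda d: ["POST /chaos/resolve"] if d == "connection_refused" else [],
        execute=lambda step: state.update(broken=False),
        is_healthy=lambda: not state["broken"],
    ))
```

Keeping the stages as plain callables makes it straightforward to swap the rule-based stand-ins for LLM-backed ones later.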

Impact

  • Reduced MTTR: From minutes/hours to seconds
  • 💰 Lower Operational Costs: Reduces need for 24/7 on-call rotations
  • 🎯 Consistent Response: Same high-quality resolution every time
  • 🛡️ Proactive Resilience: Continuous chaos testing ensures readiness
  • 😌 Reduced Burnout: Engineers focus on innovation, not firefighting

🚀 Quick Start with Docker Compose

Prerequisites

  • Docker with Docker Compose
  • A Groq API key

1. Set up environment variables

Create a .env file in the Core directory:

GROQ_API_KEY=your_groq_api_key_here

2. Start the system

docker compose up --build

This will start:

  • chaos-app on port 8000 - Service that can simulate failures
  • codeweaver-agent on port 8001 - Autonomous SRE agent

3. Test the system

Trigger a failure:

curl -X POST http://localhost:8000/chaos/trigger

Send alert to trigger autonomous recovery:

curl -X POST http://localhost:8001/webhook/alert \
  -H "Content-Type: application/json" \
  -d '{"data": {"message": "Service Down", "severity": "critical"}}'

Watch the magic happen:

  1. CodeWeaver reads logs from shared volume
  2. AI analyzes the error (ConnectionRefusedError)
  3. Plans a restart action
  4. Executes POST /chaos/resolve
  5. Service recovers automatically! ✨
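The walkthrough above boils down to mapping an error found in the logs to a remediation request. The sketch below shows that mapping with a fixed rule table in place of the LLM, so it is a simplified stand-in rather than how CodeWeaver actually decides. `RULES`, `diagnose`, and `plan` are hypothetical names; the resolve URL is the one used in the integration test in this README.

```python
from typing import Optional

# Hypothetical rule table: error marker in the logs -> remediation request.
RULES = {
    "ConnectionRefusedError": {
        "method": "POST",
        "url": "http://chaos-app:8000/chaos/resolve",
    },
}


def diagnose(log_text: str) -> Optional[str]:
    """Return the known error marker from the most recent matching log line."""
    for line in reversed(log_text.splitlines()):
        for marker in RULES:
            if marker in line:
                return marker
    return None


def plan(marker: Optional[str]) -> Optional[dict]:
    """Look up the remediation for a diagnosed marker, if one is known."""
    return RULES.get(marker) if marker else None


if __name__ == "__main__":
    # The log line format here is an assumption for illustration.
    sample = "INFO started\nERROR ConnectionRefusedError: connection refused\n"
    print(plan(diagnose(sample)))
```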

🏗️ Architecture

┌─────────────────────┐         ┌─────────────────────┐
│   Chaos App         │         │  CodeWeaver Agent   │
│   Port: 8000        │◄────────│   Port: 8001        │
│                     │         │                     │
│  Simulates failures │         │  Monitors & Fixes   │
│  Writes logs        │         │  Reads logs via     │
│  /var/log/chaos-app │────────►│  shared volume      │
└─────────────────────┘         └─────────────────────┘
         │                               │
         └───────────┬───────────────────┘
                     │
              Shared Volume
            (shared-logs)

📁 Project Structure

codeweaver/
├── docker-compose.yml       # Orchestration config
├── Core/                    # SRE Agent
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── .env                # API keys
│   └── src/
│       ├── main.py         # FastAPI app
│       ├── diagnoser.py    # AI log analysis
│       ├── planner.py      # Action planning
│       └── executor.py     # Action execution
└── chaos-app/              # Test service
    ├── Dockerfile
    ├── requirements.txt
    └── main.py

🔧 Services

Chaos App

  • Endpoints:
    • GET / - Health check
    • GET /buy - Payment endpoint (fails when broken)
    • GET /status - Check if in chaos mode
    • POST /chaos/trigger - Activate chaos mode
    • POST /chaos/resolve - Deactivate chaos mode
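The endpoint list above describes a small state machine: one flag that `/chaos/trigger` sets, `/chaos/resolve` clears, and `/buy` consults. The following in-memory model illustrates that behavior under stated assumptions. It is not the project's FastAPI code, `ChaosApp` is a hypothetical name, and the response bodies are invented for the example (only the 500 on `/buy` during chaos mode comes from this README).

```python
from typing import Tuple


class ChaosApp:
    """In-memory model of the chaos-app routes."""

    def __init__(self) -> None:
        self.chaos_mode = False

    def handle(self, method: str, path: str) -> Tuple[int, dict]:
        route = (method.upper(), path)
        if route == ("GET", "/"):
            return 200, {"status": "ok"}
        if route == ("GET", "/status"):
            return 200, {"chaos_mode": self.chaos_mode}
        if route == ("POST", "/chaos/trigger"):
            self.chaos_mode = True
            return 200, {"chaos_mode": True}
        if route == ("POST", "/chaos/resolve"):
            self.chaos_mode = False
            return 200, {"chaos_mode": False}
        if route == ("GET", "/buy"):
            # The payment endpoint fails only while chaos mode is active.
            if self.chaos_mode:
                return 500, {"error": "payment failed"}
            return 200, {"result": "purchase ok"}
        return 404, {"error": "not found"}


if __name__ == "__main__":
    app = ChaosApp()
    app.handle("POST", "/chaos/trigger")
    print(app.handle("GET", "/buy"))
    app.handle("POST", "/chaos/resolve")
    print(app.handle("GET", "/buy"))
```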

CodeWeaver Agent

  • Endpoints:
    • GET / - Health check
    • POST /webhook/alert - Receive alerts and trigger autonomous recovery
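The alert examples in this README wrap the fields in a `data` object, and one of them omits `severity`. A handler therefore probably needs to treat `severity` as optional. The sketch below shows one way to validate such a payload; `Alert`, `parse_alert`, and the `"unknown"` default are assumptions made for illustration, not the agent's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    message: str
    severity: str


def parse_alert(payload: dict) -> Alert:
    """Validate a webhook body shaped like {"data": {"message": ..., "severity": ...}}."""
    data = payload.get("data") if isinstance(payload, dict) else None
    if not isinstance(data, dict):
        raise ValueError("alert payload must contain a 'data' object")
    message = data.get("message")
    if not isinstance(message, str) or not message.strip():
        raise ValueError("alert 'data.message' must be a non-empty string")
    # Assumption: severity is optional and falls back to "unknown".
    severity = data.get("severity", "unknown")
    return Alert(message=message.strip(), severity=str(severity))


if __name__ == "__main__":
    print(parse_alert({"data": {"message": "Service Down", "severity": "critical"}}))
    print(parse_alert({"data": {"message": "Critical failure"}}))
```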

🐳 Docker Compose Features

  • Shared Logs: Volume shared-logs allows agent to read chaos-app logs
  • Custom Network: codeweaver-net bridge for service communication
  • Health Checks: chaos-app must be healthy before agent starts
  • Environment: GROQ_API_KEY passed from .env file
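Compose enforces the "healthy before start" ordering for containers, but a local run without Compose has no such guard. A small polling helper can fill that role. The sketch below is one possible approach; `wait_until_healthy`, its retry count, and its interval are hypothetical and do not mirror the values in the project's docker-compose.yml.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    retries: int = 5,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `retries` times; return True as soon as it succeeds."""
    for attempt in range(retries):
        try:
            if probe():
                return True
        except OSError:
            # Connection refused or timed out: the service is not up yet.
            pass
        if attempt < retries - 1:
            sleep(interval_s)
    return False
```

A probe for the chaos-app could, for example, request `http://localhost:8000/` and return whether the status is 200.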

🛠️ Local Development

CodeWeaver Agent

cd Core
python -m venv venv
.\venv\Scripts\Activate.ps1  # Windows
source venv/bin/activate      # Linux/Mac
pip install -r requirements.txt
uvicorn src.main:app --host 0.0.0.0 --port 8001 --reload

Chaos App

cd chaos-app
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

📊 Monitoring

View logs:

# All services
docker compose logs -f

# Specific service
docker compose logs -f codeweaver-agent
docker compose logs -f chaos-app

# Shared logs volume
docker exec -it codeweaver-agent cat /logs/chaos-app/service.log

🧪 Full Integration Test

  1. Start the system:

    docker compose up -d
  2. Verify services are running:

    curl http://localhost:8000/
    curl http://localhost:8001/
  3. Trigger chaos:

    curl -X POST http://localhost:8000/chaos/trigger
  4. Verify service is broken:

    curl -i http://localhost:8000/buy
    # Should return a 500 status (-i prints the status line and headers)
  5. Send alert to CodeWeaver:

    curl -X POST http://localhost:8001/webhook/alert \
      -H "Content-Type: application/json" \
      -d '{"data": {"message": "Critical failure"}}'
  6. CodeWeaver will automatically:

    • Read logs from /logs/chaos-app/service.log
    • Detect ConnectionRefusedError
    • Plan restart action
    • Execute POST http://chaos-app:8000/chaos/resolve
    • Service recovers!
  7. Verify recovery:

    curl -i http://localhost:8000/buy
    # Should return a success status
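The manual steps above can also be scripted. The sketch below uses only the standard library and collects mismatches instead of stopping at the first one. Several things are assumptions: the function names are hypothetical, the trigger and alert endpoints are expected to answer 200, and recovery is assumed to be complete by the time the alert request returns. If the agent remediates asynchronously, the final check would need to poll.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, List, Optional

CHAOS = "http://localhost:8000"
AGENT = "http://localhost:8001"


def http_request(method: str, url: str, body: Optional[dict] = None) -> int:
    """Send one request and return the HTTP status code, including error codes."""
    data = json.dumps(body).encode() if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    req = urllib.request.Request(url, data=data, headers=headers, method=method)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def run_integration_test(request: Callable[..., int] = http_request) -> List[str]:
    """Walk through the steps in order and return a list of mismatches."""
    failures: List[str] = []

    def check(label: str, actual: int, expected: int) -> None:
        if actual != expected:
            failures.append(f"{label}: expected {expected}, got {actual}")

    check("chaos-app health", request("GET", f"{CHAOS}/"), 200)
    check("agent health", request("GET", f"{AGENT}/"), 200)
    check("trigger chaos", request("POST", f"{CHAOS}/chaos/trigger"), 200)
    check("buy while broken", request("GET", f"{CHAOS}/buy"), 500)
    alert = {"data": {"message": "Critical failure"}}
    check("send alert", request("POST", f"{AGENT}/webhook/alert", alert), 200)
    check("buy after recovery", request("GET", f"{CHAOS}/buy"), 200)
    return failures


if __name__ == "__main__":
    problems = run_integration_test()
    print("PASS" if not problems else "\n".join(problems))
```

The `request` parameter exists so the flow can be checked against a fake transport before pointing it at running containers.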

📝 License

Built for autonomous SRE operations with AI-powered incident response.

About

CodeWeaver is an autonomous AI incident response agent that detects outages, diagnoses root causes (e.g., “null pointer in PR#392”), and safely executes fixes—like rollbacks or scaling—in seconds. It acts as a tireless, always-on SRE: sandboxed, human-approved when needed, and self-documenting. Built for engineers who’d rather build than burn out.
