devyash2930/aws_infra

Autonomous Software Recovery System

Autonomous Software Recovery System is a self-healing reliability platform that watches a running ECS application, detects failure signals in CloudWatch, builds a remediation plan with retrieval-augmented reasoning (RAG), generates a code fix with AI, and redeploys the service automatically on AWS.

What This App Does

This app closes the loop between incident detection and production recovery. Instead of stopping at alerting, it captures failure context, decides on a remediation strategy, creates a patch, pushes the change through GitHub, and lets CI/CD redeploy the fixed version back to ECS.

Figure: Autonomous recovery workflow

The system was designed to close the reliability gap between detection and recovery by reducing manual incident response steps.

Problem Statement

Even when code passes CI/CD, production failures still occur due to:

  • Runtime state drift
  • Dependency/API instability
  • Slow manual incident response, which lengthens mean time to recovery (MTTR)

This project automates the full loop from fault detection to validated redeployment.

High-Level Architecture

Application (ECS/Fargate + ALB)
  -> CloudWatch Logs (fault marker)
  -> Fault Router Lambda (incident context extraction)
  -> Backboard.io (RAG remediation strategy)
  -> GitHub Tool Lambda + Gemini (patch generation + git push)
  -> GitHub Actions CI/CD
      - Docker build
      - ECR push
      - ECS task/service update
      - Blue/Green deploy on Fargate
      - Health verification

Core Components

1. Application Layer

  • Flask-based application running on ECS/Fargate behind an ALB
  • Controlled fault injection route: /test-fault
  • Emits deterministic fault markers to CloudWatch logs (for repeatable recovery tests)
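
A minimal sketch of how a deterministic fault marker could be written to a log stream, using only the standard library. The marker value comes from the example FAULT_MARKER setting in this README. The names `build_logger` and `emit_fault_marker` and the logger name are hypothetical, and the sketch assumes the real /test-fault Flask route would call something similar while the ECS log driver ships stdout to CloudWatch.

```python
import io
import logging

# Assumption: marker value taken from the example FAULT_MARKER setting.
FAULT_MARKER = "INTENTIONAL_INVALID_SQL"


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes one formatted line per record to `stream`."""
    logger = logging.getLogger("fault-demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def emit_fault_marker(logger: logging.Logger, detail: str) -> str:
    """Log one message that begins with the marker so a log filter can match it."""
    line = f"{FAULT_MARKER} {detail}"
    logger.error(line)
    return line


# Capture the output in memory instead of stdout so the result can be inspected.
buffer = io.StringIO()
emit_fault_marker(build_logger(buffer), "route=/test-fault")
print(buffer.getvalue().strip())
```

Keeping the marker as a fixed string at the start of the message makes every injected fault look identical to the detection layer, which is what makes recovery tests repeatable.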

2. Detection Layer

  • CloudWatch log stream monitoring
  • Fault Router Lambda filters for known fault markers
  • Captures and packages incident logs + metadata as recovery context
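
The filtering step above can be sketched as follows, assuming the Fault Router Lambda is invoked through a CloudWatch Logs subscription filter, which delivers a base64-encoded, gzip-compressed JSON payload under `awslogs.data`. The function names and the shape of the returned context are hypothetical; the actual handler in this project may differ.

```python
import base64
import gzip
import json
from typing import Optional


def extract_incident_context(event: dict, marker: str) -> Optional[dict]:
    """Decode a subscription event and keep only log lines containing the marker."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    matches = [e for e in payload.get("logEvents", []) if marker in e.get("message", "")]
    if not matches:
        return None  # nothing to route
    return {
        "log_group": payload.get("logGroup"),
        "log_stream": payload.get("logStream"),
        "first_seen_ms": min(e["timestamp"] for e in matches),
        "messages": [e["message"] for e in matches],
    }


def make_event(payload: dict) -> dict:
    """Build a synthetic subscription event for local testing."""
    raw = gzip.compress(json.dumps(payload).encode("utf-8"))
    return {"awslogs": {"data": base64.b64encode(raw).decode("ascii")}}


sample = make_event({
    "logGroup": "/ecs/cream-task",
    "logStream": "demo-stream",  # hypothetical stream name
    "logEvents": [
        {"id": "1", "timestamp": 1000, "message": "INFO request ok"},
        {"id": "2", "timestamp": 2000, "message": "ERROR INTENTIONAL_INVALID_SQL route=/test-fault"},
    ],
})
context = extract_incident_context(sample, "INTENTIONAL_INVALID_SQL")
print(context)
```

Returning `None` when no line matches lets the handler exit early without calling the strategy layer.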

3. Strategy Layer (RAG)

  • Backboard.io retrieves related historical incidents and known fix patterns
  • Produces a structured remediation plan
  • Uses deterministic knowledge first, with reasoning escalation only when needed
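
One way to picture "deterministic knowledge first, escalation only when needed" is a lookup that falls back to a reasoning call. This is a sketch under stated assumptions: `KNOWN_FIXES`, `build_plan`, and `fake_reasoner` are hypothetical names, the playbook entry is illustrative, and the fallback callable is a stand-in rather than the Backboard.io API.

```python
from typing import Callable

# Hypothetical playbook; a real deployment would retrieve entries from the knowledge base.
KNOWN_FIXES = {
    "INTENTIONAL_INVALID_SQL": {
        "action": "patch_file",
        "summary": "Replace the invalid SQL statement with a valid query",
    },
}


def build_plan(marker: str, escalate: Callable[[str], dict]) -> dict:
    """Return a known fix when one exists; otherwise defer to the reasoning step."""
    known = KNOWN_FIXES.get(marker)
    if known is not None:
        return {**known, "marker": marker, "source": "deterministic"}
    return {**escalate(marker), "marker": marker, "source": "escalated"}


def fake_reasoner(marker: str) -> dict:
    """Stand-in for the retrieval + reasoning call (not a real API)."""
    return {"action": "open_issue", "summary": f"No playbook entry for {marker}"}


known_plan = build_plan("INTENTIONAL_INVALID_SQL", fake_reasoner)
novel_plan = build_plan("UNSEEN_FAULT", fake_reasoner)
print(known_plan["source"], novel_plan["source"])
```

Recording the `source` field on each plan makes it possible to audit later how often recovery relied on escalation.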

4. Execution Layer

  • GitHub Tool Lambda receives remediation plan
  • Gemini transforms plan into repository code changes
  • Commits and pushes patch to GitHub for full auditability

5. Recovery Loop

  • GitHub Actions pipeline triggers automatically on patch push
  • Builds container image and pushes to ECR
  • Updates ECS service/task definition
  • Performs blue/green Fargate rollout
  • Runs final health checks to confirm recovery
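
The final health verification could look something like the polling loop below. It is a sketch, not the project's pipeline step: `wait_until_healthy` and its parameters are hypothetical, and the probe is injected as a callable so the logic can be exercised without a live ALB endpoint.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay_s: float = 2.0,
    required_successes: int = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until `required_successes` consecutive probes pass or attempts run out."""
    streak = 0
    for attempt in range(attempts):
        try:
            ok = probe()
        except Exception:
            ok = False  # a failing probe counts as unhealthy rather than aborting
        streak = streak + 1 if ok else 0
        if streak >= required_successes:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Simulated probe results: one failure while the new task starts, then healthy.
results = iter([False, True, True])
recovered = wait_until_healthy(lambda: next(results), sleep=lambda _s: None)
print(recovered)
```

Requiring consecutive successes guards against declaring recovery on a single lucky response while tasks are still cycling.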

What Makes It Different

  • Decoupled strategy and execution: planning and code mutation are isolated for safety
  • Whitelisted operations: automation is restricted to approved Git/recovery actions
  • Deterministic validation: repeatable fault injection through /test-fault
  • End-to-end observability: each recovery step is logged and auditable
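
As an illustration of whitelisted operations, the sketch below rejects any plan step whose action or target path falls outside an approved set. The action names, the `app` root directory, and the plan structure are all assumptions made for the example; the project's real allowlist is not shown in this README.

```python
from pathlib import PurePosixPath

ALLOWED_ACTIONS = {"patch_file", "commit", "push"}  # hypothetical action names
ALLOWED_ROOT = PurePosixPath("app")  # hypothetical directory the automation may edit


def validate_plan(plan: dict) -> list:
    """Return a list of violations; an empty list means the plan may run."""
    errors = []
    for step in plan.get("steps", []):
        action = step.get("action")
        if action not in ALLOWED_ACTIONS:
            errors.append(f"action not allowed: {action}")
        path = step.get("path")
        if path is not None:
            candidate = PurePosixPath(path)
            outside = (
                candidate.is_absolute()
                or ".." in candidate.parts
                or candidate.parts[:1] != ALLOWED_ROOT.parts
            )
            if outside:
                errors.append(f"path outside allowed root: {path}")
    return errors


safe_plan = {"steps": [{"action": "patch_file", "path": "app/routes.py"}, {"action": "push"}]}
unsafe_plan = {"steps": [{"action": "delete_branch"}, {"action": "patch_file", "path": "app/../.github/deploy.yml"}]}
print(validate_plan(safe_plan))
print(validate_plan(unsafe_plan))
```

Running the check in the execution layer, separately from plan generation, keeps a faulty or manipulated plan from reaching Git.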

Performance

  • 88% faster recovery in demo runs (measured from fault trigger to Fargate redeploy)

Tech Stack

  • Cloud: AWS ECS, Fargate, Lambda, CloudWatch, ECR, ALB
  • Application: Python, Flask
  • AI/RAG: Backboard.io, Gemini 1.5 Pro
  • DevOps: GitHub, GitHub Actions, Docker

End-to-End Flow

  1. A controlled fault is triggered from /test-fault (or occurs naturally in production).
  2. CloudWatch captures logs with a known marker.
  3. Fault Router Lambda detects the marker and assembles incident context.
  4. Backboard.io generates a remediation strategy from retrieval + reasoning.
  5. GitHub Tool Lambda calls Gemini to convert strategy to concrete code edits.
  6. The patch is committed and pushed to GitHub.
  7. GitHub Actions builds and deploys the updated container to ECS/Fargate.
  8. Health checks verify service recovery.

Repository Structure

This is the actual aws_infra layout a new collaborator should use as their map:

aws_infra/
├── README.md                           # Project overview, deployment flow, and onboarding notes
├── provider.tf                         # Root-level AWS provider declaration
├── ecs/
│   └── task-definition.json            # Snapshot/export of the deployed ECS task definition for debugging/reference
├── lambda/
│   ├── FaultRouter/
│   │   └── template.yml                # AWS SAM template for FaultRouter Lambda config, IAM, and env vars
│   └── GitHubTool/
│       └── template.yml                # AWS SAM template for GitHubTool Lambda config, IAM, and env vars
└── terraform/
    ├── main.tf                         # Primary source of truth for AWS infra: VPC, ECS, ECR, IAM, Lambdas, logs, outputs
    ├── terraform.tfvars                # Environment-specific variable values such as repo info and secrets
    ├── .terraform.lock.hcl             # Terraform provider version lock file
    ├── lambda_zips/
    │   ├── fault_router.zip            # Generated Lambda package artifact
    │   └── github_tool.zip             # Generated Lambda package artifact
    ├── terraform.tfstate               # Local Terraform state
    ├── terraform.tfstate.backup        # Backup copy of local Terraform state
    └── tfplan                          # Saved Terraform plan artifact

What Each File/Folder Does

  • terraform/main.tf: Start here for almost every infrastructure change. It defines the VPC, subnets, security groups, ALB, ECS cluster/service, ECR repos, IAM roles, CloudWatch log groups, Lambda packaging, and outputs.
  • terraform/terraform.tfvars: Stores the variable values Terraform expects at apply time, such as GitHub owner/repo and secret-backed inputs. This is environment-specific configuration.
  • provider.tf: Declares the AWS provider region at the repo root. Keep this aligned with the Terraform configuration.
  • ecs/task-definition.json: Useful when you want to inspect the currently deployed ECS task definition, container env vars, health checks, or image tags. Treat this as a reference snapshot, not the main source of truth.
  • lambda/FaultRouter/template.yml: SAM template for the FaultRouter Lambda deployment shape, including runtime, memory, IAM permissions, and environment variables.
  • lambda/GitHubTool/template.yml: SAM template for the GitHubTool Lambda deployment shape, including repo wiring and secret references.
  • terraform/lambda_zips/: Generated build artifacts that Terraform creates when packaging Lambda code. These are outputs, not files you normally edit manually.
  • terraform/terraform.tfstate, terraform/terraform.tfstate.backup, and terraform/tfplan: Generated Terraform state/plan files used during provisioning. These help Terraform track deployed resources but are not the files where feature work should happen.

Where To Find Common Changes

  • Need to change networking, ECS, ALB, IAM, or ECR resources: update terraform/main.tf.
  • Need to change Lambda infrastructure settings such as timeout, runtime, memory, or env vars: update the corresponding file in lambda/.
  • Need to change the actual Lambda application logic: edit the sibling project files ../hack_ncstate/fault_router_lambda_function.py and ../hack_ncstate/GithubTool_lambda_function.py, because terraform/main.tf packages those files into the Lambda zip artifacts.
  • Need to inspect what is currently running in ECS: check ecs/task-definition.json.
  • Need to update environment-specific Terraform inputs: edit terraform/terraform.tfvars.

Setup Guide

Update this section with your exact implementation details. A practical baseline:

Prerequisites

  • AWS account with permissions for ECS/Fargate, Lambda, CloudWatch, ECR, IAM
  • Docker installed
  • GitHub repository with Actions enabled
  • Python 3.10+ for Flask app/lambda tooling
  • API access/configuration for Backboard.io and Gemini

Environment Variables (Example)

AWS_REGION=us-east-1
ECS_CLUSTER=your-cluster
ECS_SERVICE=your-service
ECR_REPOSITORY=your-repo
FAULT_LOG_GROUP=/ecs/cream-task
FAULT_MARKER=INTENTIONAL_INVALID_SQL
BACKBOARD_API_KEY=...
GEMINI_API_KEY=...
GITHUB_TOKEN=...
GITHUB_REPO=org/repo
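
A small loader can fail fast when any of the variables above is missing. This is a sketch assuming the variable names from the example; `RecoveryConfig` and `load_config` are hypothetical names, and the sample values are placeholders.

```python
import os
from dataclasses import dataclass, field

REQUIRED = (
    "AWS_REGION", "ECS_CLUSTER", "ECS_SERVICE", "ECR_REPOSITORY",
    "FAULT_LOG_GROUP", "FAULT_MARKER", "BACKBOARD_API_KEY",
    "GEMINI_API_KEY", "GITHUB_TOKEN", "GITHUB_REPO",
)


@dataclass(frozen=True)
class RecoveryConfig:
    aws_region: str
    ecs_cluster: str
    ecs_service: str
    ecr_repository: str
    fault_log_group: str
    fault_marker: str
    github_repo: str
    # Secrets are excluded from repr() so they do not leak into logs.
    backboard_api_key: str = field(repr=False)
    gemini_api_key: str = field(repr=False)
    github_token: str = field(repr=False)


def load_config(env=None) -> RecoveryConfig:
    """Build the config from a mapping (defaults to os.environ), listing all missing names."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return RecoveryConfig(**{name.lower(): env[name] for name in REQUIRED})


example_env = {name: f"placeholder-{name.lower()}" for name in REQUIRED}
config = load_config(example_env)
print(config)
```

Reporting every missing name in one error avoids the fix-one-rerun cycle during first-time setup.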

Deployment Steps (Typical)

  1. Provision cloud resources with Terraform (terraform/)
  2. Deploy Flask app to ECS/Fargate
  3. Configure CloudWatch log subscription/event trigger to Fault Router Lambda
  4. Configure Backboard.io knowledge base and remediation templates
  5. Deploy GitHub Tool Lambda with least-privilege IAM
  6. Configure GitHub Actions workflow for build/deploy/verification
  7. Run a deterministic recovery test via /test-fault

Safety and Governance

  • Least-privilege IAM roles for both Lambdas
  • Whitelisted recovery commands only
  • Full Git commit history for every AI-generated change
  • CI gates + health checks before stable traffic cutover

Testing Strategy

  • Unit tests: parser, marker detection, remediation plan formatting
  • Integration tests: lambda-to-lambda handoff, GitHub API operations
  • Resilience tests: repeated /test-fault injections under load
  • Deployment tests: blue/green success + rollback validation

Operational Metrics to Track

  • MTTR (trigger -> healthy deploy)
  • Detection latency (log emission -> router trigger)
  • Strategy generation latency
  • Patch success rate
  • Deployment success rate
  • False positive/negative incident detection rate
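
Several of these metrics can be derived from per-run timestamps. The sketch below assumes each recovery run records when the fault was logged, when the router fired, when the plan was ready, and when the service became healthy. `RecoveryRun`, `summarize`, and the sample numbers are illustrative only and are not measurements from this project.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RecoveryRun:
    fault_emitted_s: float
    router_triggered_s: float
    plan_ready_s: float
    healthy_s: Optional[float]  # None when the run never recovered


def summarize(runs: list) -> dict:
    """Aggregate latency metrics (in seconds) across a non-empty list of runs."""
    if not runs:
        raise ValueError("at least one run is required")
    recovered = [r for r in runs if r.healthy_s is not None]
    return {
        # MTTR only counts runs that actually reached a healthy deploy.
        "mttr_s": mean(r.healthy_s - r.fault_emitted_s for r in recovered) if recovered else None,
        "detection_latency_s": mean(r.router_triggered_s - r.fault_emitted_s for r in runs),
        "strategy_latency_s": mean(r.plan_ready_s - r.router_triggered_s for r in runs),
        "recovery_rate": len(recovered) / len(runs),
    }


# Illustrative timestamps, not real measurements.
runs = [
    RecoveryRun(0.0, 2.0, 10.0, 100.0),
    RecoveryRun(0.0, 4.0, 14.0, None),
]
summary = summarize(runs)
print(summary)
```

Tracking the recovery rate alongside MTTR matters because MTTR alone hides runs that never reached a healthy state.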

Future Improvements

  • Multi-service incident correlation
  • Automated rollback for low-confidence fixes
  • Human approval mode for high-risk patches
  • Expanded fault taxonomy and playbook coverage
  • Cost-aware remediation policy routing

Team

Team Cream&Onion (HACK_NCSTATE 2026)

Steps to replicate

Services

  • ECS Fargate
  • ECR
  • Lambda
  • Load Balancer
  • IAM
  • CloudWatch

Deploy

cd terraform
terraform init
terraform apply

Build & Push Docker Image

cd docker
./build.sh

Destroy

terraform destroy
