
ARAVIND281/Chaos-Engineering-Platform


🔥 Chaos Engineering Platform

Break things on purpose, before they break in production


Production-ready, full-stack Chaos Engineering platform built entirely on AWS

Test your system's resilience by injecting controlled failures into your infrastructure.

🚀 Quick Start • 📖 Documentation • 🏗️ Architecture • 💡 Examples • 🤝 Contributing


Screenshot: Chaos Engineering Dashboard (a modern React dashboard for managing and monitoring chaos experiments)


🌟 Why Chaos Engineering?

In today's complex distributed systems, failure is inevitable. The question isn't if your system will fail, but when. Chaos Engineering helps you:

  • ๐Ÿ›ก๏ธ Build Resilience - Discover weaknesses before customers do
  • ๐ŸŽฏ Validate Assumptions - Test if your failover actually works
  • ๐Ÿ“Š Improve Monitoring - Find blind spots in observability
  • ๐Ÿš€ Increase Confidence - Deploy with certainty your system can handle failures
  • ๐Ÿ’ฐ Reduce Downtime - Prevent costly outages through proactive testing

🎯 From Netflix to Your Infrastructure

Inspired by Netflix's battle-tested Chaos Monkey, this platform brings enterprise-grade chaos engineering to your AWS environment.


✨ Features

🎨 Modern Full-Stack Dashboard

  • Beautiful React UI with shadcn/ui components
  • Real-time experiment monitoring
  • Interactive analytics and metrics
  • Mobile-responsive design

⚡ Serverless Architecture

  • Serverless control plane (Lambda + Step Functions)
  • Auto-scaling and highly available
  • Pay-per-use pricing for the serverless components
  • No platform servers to maintain (only the sample target application runs on EC2)

🔧 Powerful Chaos Experiments

  • Instance Termination
  • CPU Stress Testing
  • Memory Exhaustion
  • Network Latency Injection
  • Disk I/O Saturation
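The five experiment types above can be modelled in TypeScript as a discriminated union, which lets the compiler check that every type is handled. This is a hypothetical sketch: only `INSTANCE_TERMINATION` and `CPU_STRESS` appear in this README's examples, and the other identifiers and all parameter fields (`cpuPercent`, `delayMs`, and so on) are assumptions, not the platform's schema.

```typescript
// Hypothetical model of the five failure types listed above.
// Only INSTANCE_TERMINATION and CPU_STRESS appear in this README's examples;
// the other type names and all parameter fields are illustrative assumptions.
type FailureSpec =
  | { type: "INSTANCE_TERMINATION" }
  | { type: "CPU_STRESS"; cpuPercent: number; durationSeconds: number }
  | { type: "MEMORY_EXHAUSTION"; memoryPercent: number; durationSeconds: number }
  | { type: "NETWORK_LATENCY"; delayMs: number; durationSeconds: number }
  | { type: "DISK_IO_SATURATION"; workers: number; durationSeconds: number };

// Produce a human-readable summary. The switch is exhaustive: adding a new
// variant without a matching case makes the `never` assignment fail to compile.
function describeFailure(spec: FailureSpec): string {
  switch (spec.type) {
    case "INSTANCE_TERMINATION":
      return "Terminate one target instance";
    case "CPU_STRESS":
      return `Load CPU to ${spec.cpuPercent}% for ${spec.durationSeconds}s`;
    case "MEMORY_EXHAUSTION":
      return `Consume ${spec.memoryPercent}% of memory for ${spec.durationSeconds}s`;
    case "NETWORK_LATENCY":
      return `Add ${spec.delayMs}ms latency for ${spec.durationSeconds}s`;
    case "DISK_IO_SATURATION":
      return `Run ${spec.workers} disk I/O workers for ${spec.durationSeconds}s`;
    default: {
      const unreachable: never = spec;
      return unreachable;
    }
  }
}

console.log(describeFailure({ type: "CPU_STRESS", cpuPercent: 80, durationSeconds: 120 }));
// prints "Load CPU to 80% for 120s"
```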

🔒 Safe & Controlled

  • Dry-run mode for testing
  • Automated rollback on failures
  • Pre/post health validation
  • Comprehensive audit logs
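A minimal sketch of how the safety features above could fit together around a single injection: a pre-check, a dry-run short-circuit, rollback on failure, and an audit trail. The `ExperimentHooks` interface, its method names, and the outcome labels are hypothetical; the functions are synchronous for brevity, whereas real calls to AWS would be asynchronous.

```typescript
// Hypothetical guard around one fault injection. Hook names and outcome
// labels are illustrative, not the platform's API. Synchronous for brevity.
interface ExperimentHooks {
  checkHealth: () => boolean;
  inject: () => void;
  rollback: () => void;
  audit: (event: string) => void; // e.g. append to an audit log
}

type Outcome = "ABORTED_UNHEALTHY" | "DRY_RUN" | "PASSED" | "ROLLED_BACK";

function runGuarded(hooks: ExperimentHooks, dryRun: boolean): Outcome {
  // Pre-check: never inject into a system that is already unhealthy.
  if (!hooks.checkHealth()) {
    hooks.audit("pre-check failed; experiment aborted");
    return "ABORTED_UNHEALTHY";
  }
  // Dry run: everything up to injection is exercised, nothing is changed.
  if (dryRun) {
    hooks.audit("dry run; injection skipped");
    return "DRY_RUN";
  }
  try {
    hooks.audit("injecting failure");
    hooks.inject();
  } catch (err) {
    // A failed injection may have been partially applied, so undo it.
    hooks.audit(`injection error: ${String(err)}; rolling back`);
    hooks.rollback();
    return "ROLLED_BACK";
  }
  // Post-check: roll back if the system did not recover on its own.
  if (hooks.checkHealth()) {
    hooks.audit("post-check passed");
    return "PASSED";
  }
  hooks.audit("post-check failed; rolling back");
  hooks.rollback();
  return "ROLLED_BACK";
}
```

Keeping each hook injectable makes the guard testable without touching AWS: a test can pass a `checkHealth` that fails on its second call and confirm that `rollback` runs.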

๐Ÿ—๏ธ Infrastructure as Code

  • 5 CloudFormation Templates - Complete infrastructure automation
  • Multi-AZ VPC - High-availability networking
  • Auto Scaling Groups - Dynamic capacity management
  • Application Load Balancer - Intelligent traffic distribution
  • DynamoDB - Serverless database for experiments and results
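Experiment state stored in DynamoDB benefits from an explicit status lifecycle, so a finished experiment cannot be restarted by accident. The record shape, status names, and transition table below are assumptions for illustration, not the platform's actual DynamoDB schema.

```typescript
// Hypothetical experiment record and status lifecycle; field and status
// names are assumptions, not the platform's DynamoDB schema.
type ExperimentStatus = "PENDING" | "RUNNING" | "COMPLETED" | "FAILED" | "ROLLED_BACK";

interface ExperimentRecord {
  experimentId: string; // assumed partition key
  status: ExperimentStatus;
  updatedAt: string; // ISO-8601 timestamp
}

// Allowed next states; terminal states have none.
const allowedNext: Record<ExperimentStatus, ExperimentStatus[]> = {
  PENDING: ["RUNNING", "FAILED"],
  RUNNING: ["COMPLETED", "FAILED", "ROLLED_BACK"],
  COMPLETED: [],
  FAILED: [],
  ROLLED_BACK: [],
};

// Return an updated copy of the record, or throw on an illegal transition.
function transition(rec: ExperimentRecord, next: ExperimentStatus, now: Date): ExperimentRecord {
  if (allowedNext[rec.status].indexOf(next) === -1) {
    throw new Error(`invalid transition ${rec.status} -> ${next}`);
  }
  return { ...rec, status: next, updatedAt: now.toISOString() };
}
```

In DynamoDB itself, the same rule can be enforced atomically by updating the item with a `ConditionExpression` on the current status, so two concurrent writers cannot both move an experiment out of `RUNNING`.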

📊 Enterprise Features

  • 18-State Step Functions Workflow - Sophisticated orchestration
  • CloudWatch Integration - Detailed metrics and logging
  • Results Analytics - Comprehensive experiment analysis
  • One-Command Deployment - Deploy the entire stack in about 15 minutes

🚀 Quick Start

Prerequisites

✅ AWS account with admin access
✅ AWS CLI configured
✅ Node.js 18+ and npm
✅ Git

Deploy in 3 Steps

# 1. Clone the repository
git clone https://github.com/ARAVIND281/Chaos-Engineering-Platform.git
cd Chaos-Engineering-Platform

# 2. Deploy everything (takes ~15 minutes)
./scripts/deploy-fullstack-complete.sh dev

# 3. Access your dashboard
# URL will be displayed after deployment completes

That's it! 🎉 Your chaos engineering platform is live.

📹 Watch Quick Start Video

Coming soon: Step-by-step video walkthrough


๐Ÿ—๏ธ Architecture

graph TB
    subgraph "User Interface"
        A[React Dashboard<br/>S3 + CloudFront]
    end
    
    subgraph "API Layer"
        B[API Gateway]
        C[Lambda Functions<br/>TypeScript]
        D[DynamoDB<br/>Experiments & Results]
    end
    
    subgraph "Chaos Engine"
        E[Step Functions<br/>18-State Workflow]
        F1[Get Target<br/>Lambda]
        F2[Inject Failure<br/>Lambda]
        F3[Validate Health<br/>Lambda]
    end
    
    subgraph "Target Infrastructure"
        G[VPC<br/>Multi-AZ]
        H[Auto Scaling Group]
        I[Load Balancer]
        J[EC2 Instances]
    end
    
    A --> B
    B --> C
    C --> D
    C --> E
    E --> F1
    E --> F2
    E --> F3
    F1 --> H
    F2 --> J
    F3 --> I
    G --> H
    H --> J
    I --> J
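The Chaos Engine path in the diagram (Get Target, Inject Failure, Validate Health) can be sketched as plain functions over an in-memory Auto Scaling group. This is a simplified, hypothetical model: the real workflow is an 18-state Step Functions machine invoking Lambda functions, the wait for recovery is collapsed into one synchronous call, and `reconcile` merely stands in for Auto Scaling replacing a terminated instance.

```typescript
// Simplified, hypothetical model of the Chaos Engine path in the diagram.
interface Instance { id: string; state: "InService" | "Terminated"; }
interface Asg { desired: number; instances: Instance[]; }

const healthyCount = (asg: Asg): number =>
  asg.instances.filter((i) => i.state === "InService").length;

// Get Target: choose one InService instance. `pick` defaults to a random
// index but can be replaced for deterministic tests.
function getTarget(
  asg: Asg,
  pick: (n: number) => number = (n) => Math.floor(Math.random() * n)
): Instance {
  const candidates = asg.instances.filter((i) => i.state === "InService");
  if (candidates.length === 0) throw new Error("no InService instances to target");
  return candidates[pick(candidates.length)];
}

// Stand-in for Auto Scaling restoring the group to its desired capacity.
function reconcile(asg: Asg, nextId: () => string): void {
  while (healthyCount(asg) < asg.desired) {
    asg.instances.push({ id: nextId(), state: "InService" });
  }
}

// Validate Health -> Get Target -> Inject Failure -> recover -> Validate Health.
function runExperiment(asg: Asg, expectedHealthy: number, nextId: () => string): string[] {
  const log: string[] = [];
  if (healthyCount(asg) < expectedHealthy) {
    log.push("pre-check failed: aborting");
    return log;
  }
  log.push(`pre-check: ${healthyCount(asg)}/${asg.desired} healthy`);

  const target = getTarget(asg);
  target.state = "Terminated"; // Inject Failure (instance termination)
  log.push(`terminated ${target.id}`);

  reconcile(asg, nextId);
  const recovered = healthyCount(asg) >= expectedHealthy;
  log.push(`post-check: ${healthyCount(asg)}/${asg.desired} healthy (${recovered ? "recovered" : "degraded"})`);
  return log;
}
```

For a two-instance group with `expectedHealthy = 2`, the log shows a 2/2 pre-check, one termination, and a 2/2 post-check once `reconcile` has added a replacement.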

Component Architecture

Component | Technology | Purpose
🎨 Frontend | React 18 + TypeScript + Vite | Modern dashboard for experiment management
🔧 Backend API | Lambda + API Gateway | RESTful API for CRUD operations
🗄️ Database | DynamoDB | Serverless data persistence
⚙️ Orchestration | Step Functions | 18-state chaos workflow
🔨 Chaos Functions | Python Lambda | Failure injection logic
🌐 Networking | VPC + ALB | Multi-AZ infrastructure
🎯 Target App | Auto Scaling Group | Sample application for testing
📊 Monitoring | CloudWatch | Metrics and logging

💡 Examples

Creating Your First Experiment

1️⃣ Through the Dashboard (Recommended)
  1. Open your dashboard at the URL displayed after deployment
  2. Log in as admin@chaos-platform.com with any password
  3. Click "New Experiment"
  4. Configure the experiment:
    Target: Auto Scaling Group
    Failure Type: Instance Termination
    Dry Run: ✅ Enabled (for your first test)

  5. Click "Start Experiment"
  6. Monitor in real time as the platform:
    • ✅ Validates system health
    • 🔥 Injects a controlled failure
    • 📊 Monitors system response
    • ✅ Validates recovery
    • 📈 Generates a detailed report
2️⃣ Through AWS CLI
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:ACCOUNT:stateMachine:chaos-platform-chaos-experiment \
  --input '{
    "experimentId": "exp-cli-001",
    "targetType": "ASG",
    "targetId": "chaos-platform-asg",
    "failureType": "INSTANCE_TERMINATION",
    "dryRun": false,
    "configuration": {
      "expectedHealthyInstances": 2
    }
  }'
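The CLI example needs the full state machine ARN, which follows the standard `arn:<partition>:states:<region>:<account-id>:stateMachine:<name>` format. This helper is a sketch: the default name is taken from the example above and may differ in your deployment, and `123456789012` is a placeholder account ID.

```typescript
// Build a Step Functions state machine ARN.
// Format: arn:<partition>:states:<region>:<account-id>:stateMachine:<name>
// The default name matches the CLI example; your deployed name may differ.
function stateMachineArn(
  region: string,
  accountId: string,
  name: string = "chaos-platform-chaos-experiment",
  partition: string = "aws"
): string {
  // AWS account IDs are exactly 12 digits; catch placeholders like "ACCOUNT".
  if (!/^\d{12}$/.test(accountId)) {
    throw new Error(`AWS account IDs are 12 digits, got "${accountId}"`);
  }
  return `arn:${partition}:states:${region}:${accountId}:stateMachine:${name}`;
}

console.log(stateMachineArn("us-east-1", "123456789012"));
// prints "arn:aws:states:us-east-1:123456789012:stateMachine:chaos-platform-chaos-experiment"
```

Your own account ID can be looked up with `aws sts get-caller-identity --query Account --output text`.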
3๏ธโƒฃ Programmatically (TypeScript)
import { StepFunctions } from '@aws-sdk/client-sfn';

const stepfunctions = new StepFunctions({ region: 'us-east-1' });

await stepfunctions.startExecution({
  stateMachineArn: 'arn:aws:states:...:stateMachine:chaos-platform-chaos-experiment',
  input: JSON.stringify({
    experimentId: 'exp-programmatic-001',
    targetType: 'ASG',
    failureType: 'CPU_STRESS',
    dryRun: false
  })
});
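Because the execution input is free-form JSON, validating it before calling StartExecution catches typos early. The interface below reuses the field names from the CLI example; the `validateInput` helper and its rules (non-empty strings, boolean `dryRun`, integer `expectedHealthyInstances` of at least 1) are assumptions, not rules the platform is known to enforce.

```typescript
// Hypothetical type for the execution input. Field names come from the
// CLI example; the validation rules are assumptions, not the platform's schema.
interface ExperimentInput {
  experimentId: string;
  targetType: string;  // "ASG" in the examples
  targetId: string;
  failureType: string; // e.g. "INSTANCE_TERMINATION", "CPU_STRESS"
  dryRun: boolean;
  configuration?: { expectedHealthyInstances?: number };
}

// Collect every problem instead of stopping at the first, so a caller can
// report all of them before starting an execution.
function validateInput(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) return ["input must be a JSON object"];
  const o = raw as Record<string, unknown>;
  const errors: string[] = [];
  for (const key of ["experimentId", "targetType", "targetId", "failureType"]) {
    const v = o[key];
    if (typeof v !== "string" || v.length === 0) errors.push(`${key} must be a non-empty string`);
  }
  if (typeof o["dryRun"] !== "boolean") errors.push("dryRun must be a boolean");
  const cfg = o["configuration"];
  const expected =
    typeof cfg === "object" && cfg !== null
      ? (cfg as Record<string, unknown>)["expectedHealthyInstances"]
      : undefined;
  if (expected !== undefined && (typeof expected !== "number" || Math.floor(expected) !== expected || expected < 1)) {
    errors.push("configuration.expectedHealthyInstances must be an integer >= 1");
  }
  return errors;
}

const input: ExperimentInput = {
  experimentId: "exp-cli-001",
  targetType: "ASG",
  targetId: "chaos-platform-asg",
  failureType: "INSTANCE_TERMINATION",
  dryRun: false,
  configuration: { expectedHealthyInstances: 2 },
};
console.log(validateInput(input)); // prints []
```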

Sample Experiment Results

📊 Experiment: exp-2025-01-27-abc123
🎯 Target: chaos-platform-asg (2 instances)
🔥 Failure: Instance Termination
⏱️ Duration: 5m 32s

Results:
✅ Pre-check: System healthy (2/2 instances)
🔥 Chaos: Terminated i-0abc123
⏳ Recovery: Auto Scaling launched replacement
✅ Post-check: System recovered (2/2 instances)
📈 Availability: 99.8% maintained during test

Learnings:
• Auto Scaling Group successfully replaced failed instance
• Load Balancer detected unhealthy instance in 30s
• Application remained available throughout experiment
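A report figure such as availability during the test could be derived from periodic probes against the load balancer. The sketch below assumes such probes exist and that each records a timestamp and a success flag; the `Probe` type and both helpers are hypothetical, and the sample data is invented for illustration.

```typescript
// Hypothetical probe record: `at` is seconds since the experiment started,
// `ok` is whether the health request succeeded. Probes are assumed sorted by time.
interface Probe { at: number; ok: boolean; }

// Share of successful probes, as a percentage rounded to one decimal place.
function availabilityPercent(probes: Probe[]): number {
  if (probes.length === 0) throw new Error("no probes recorded");
  const ok = probes.filter((p) => p.ok).length;
  return Math.round((ok / probes.length) * 1000) / 10;
}

// Longest span from a first failed probe to the next successful one.
// An outage still open at the end is measured up to the last probe.
function longestOutageSeconds(probes: Probe[]): number {
  let start: number | null = null;
  let longest = 0;
  for (const p of probes) {
    if (!p.ok && start === null) start = p.at;
    if (p.ok && start !== null) {
      longest = Math.max(longest, p.at - start);
      start = null;
    }
  }
  if (start !== null && probes.length > 0) {
    longest = Math.max(longest, probes[probes.length - 1].at - start);
  }
  return longest;
}

// Invented sample: probes every 5 s, two consecutive failures.
const sample: Probe[] = [
  { at: 0, ok: true },
  { at: 5, ok: false },
  { at: 10, ok: false },
  { at: 15, ok: true },
];
console.log(availabilityPercent(sample));  // 50
console.log(longestOutageSeconds(sample)); // 10
```

The longest-outage figure complements the percentage: two runs with the same availability can differ greatly in how long users saw errors without interruption.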

📖 Documentation

📚 Essential Reading

Document | Description
🚀 Quick Deployment | Get started in 5 minutes
📖 Step-by-Step Guide | Detailed walkthrough
🔐 AWS IAM Setup | Required permissions
🏗️ Architecture Design | System architecture
📊 Project Summary | Component overview

🎓 Learning Resources


💻 Project Structure

Chaos-Engineering-Platform/
โ”œโ”€โ”€ ๐Ÿ“ infrastructure/          # CloudFormation templates
โ”‚   โ”œโ”€โ”€ vpc-infrastructure.yaml
โ”‚   โ”œโ”€โ”€ target-application.yaml
โ”‚   โ”œโ”€โ”€ chaos-lambda-functions.yaml
โ”‚   โ”œโ”€โ”€ chaos-step-functions.yaml
โ”‚   โ””โ”€โ”€ fullstack-database.yaml
โ”‚
โ”œโ”€โ”€ ๐Ÿ“ lambda-functions/        # Chaos injection logic
โ”‚   โ”œโ”€โ”€ get-target-instance/    # Instance selection
โ”‚   โ”œโ”€โ”€ inject-failure/         # Failure injection
โ”‚   โ””โ”€โ”€ validate-system-health/ # Health validation
โ”‚
โ”œโ”€โ”€ ๐Ÿ“ backend/                 # TypeScript API
โ”‚   โ”œโ”€โ”€ src/handlers/           # API endpoints
โ”‚   โ”œโ”€โ”€ src/services/           # Business logic
โ”‚   โ””โ”€โ”€ src/types/              # TypeScript types
โ”‚
โ”œโ”€โ”€ ๐Ÿ“ frontend/                # React Dashboard
โ”‚   โ”œโ”€โ”€ src/components/         # UI components
โ”‚   โ”œโ”€โ”€ src/pages/              # Application pages
โ”‚   โ””โ”€โ”€ src/lib/                # Utilities
โ”‚
โ”œโ”€โ”€ ๐Ÿ“ scripts/                 # Automation scripts
โ”‚   โ”œโ”€โ”€ deploy-fullstack-complete.sh
โ”‚   โ””โ”€โ”€ cleanup.sh
โ”‚
โ””โ”€โ”€ ๐Ÿ“ docs/                    # Documentation
    โ”œโ”€โ”€ deployment/
    โ”œโ”€โ”€ fullstack/
    โ””โ”€โ”€ archive/

💰 Cost Breakdown

Monthly Costs (24/7 Operation)

Service | Configuration | Cost/Month
EC2 | 2x t3.micro | ~$12
ALB | Application LB | ~$16
NAT Gateway | 2x (Multi-AZ) | ~$64
Lambda | Low traffic | ~$5
DynamoDB | On-demand | ~$5
S3 | Static hosting | ~$3
CloudWatch | Logs + Metrics | ~$5
Total | | ~$110/month

Cost Optimization Tips 💡

  • Tear down when not in use: run the cleanup script (scripts/cleanup.sh) → $0/month
  • Use Spot Instances: replace on-demand EC2 → save ~70% on EC2
  • Single NAT Gateway (dev/test only) → save $32/month
  • Lambda-only testing: skip the EC2 target and its load balancer → save $28/month
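The table's total and two of the tips can be checked with a few lines of arithmetic. The figures are the README's approximations, not AWS price quotes, and the Lambda-only scenario assumes it removes both the EC2 and ALB lines, which is what the $28/month figure implies.

```typescript
// Monthly estimates from the cost table (approximate USD).
const monthly = {
  EC2: 12, ALB: 16, NATGateway: 64, Lambda: 5, DynamoDB: 5, S3: 3, CloudWatch: 5,
};

// Sum every line item in a cost breakdown.
const total = (costs: Record<string, number>): number =>
  Object.keys(costs).reduce((sum, k) => sum + costs[k], 0);

// Single NAT Gateway (dev/test): halve the NAT line.
const singleNat = { ...monthly, NATGateway: monthly.NATGateway / 2 };
// Lambda-only testing (assumed): drop the EC2 target and its load balancer.
const lambdaOnly = { ...monthly, EC2: 0, ALB: 0 };

console.log(total(monthly));                     // 110
console.log(total(monthly) - total(singleNat));  // 32
console.log(total(monthly) - total(lambdaOnly)); // 28
```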

Free Tier Eligible ✨

With the AWS Free Tier, some of this usage is covered:

  • Lambda (1M requests/month)
  • DynamoDB (25 GB storage)
  • S3 (5 GB storage, first 12 months)

🛠️ Tech Stack

Frontend

React • TypeScript • Vite • TailwindCSS

Backend

AWS Lambda • TypeScript • Node.js • Python

Infrastructure

AWS CloudFormation • DynamoDB • Step Functions


🧪 Testing & CI/CD

Running Tests
# Backend tests
cd backend
npm test
cd ..

# Frontend tests
cd frontend
npm test
cd ..

# End-to-end tests (run from the repository root)
./scripts/test-end-to-end.sh

GitHub Actions (Coming Soon)
  • ✅ Automated testing on PR
  • ✅ Infrastructure validation
  • ✅ Security scanning
  • ✅ Deployment automation

๐Ÿค Contributing

We love contributions! ๐Ÿ’–

๐Ÿ› Found a Bug?

Report it โ†’

๐Ÿ’ก Have an Idea?

Suggest a feature โ†’

๐Ÿ”ง Want to Contribute?

See CONTRIBUTING.md โ†’

๐Ÿ’ฌ Questions?

Join Discussions โ†’

Quick Contribution Guide

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

See our Contribution Guidelines for more details.


๐Ÿ—บ๏ธ Roadmap

โœ… Completed (v1.0)

  • Full-stack React dashboard
  • Step Functions orchestration
  • Multi-AZ VPC infrastructure
  • Auto Scaling target application
  • Chaos Lambda functions
  • DynamoDB persistence
  • One-command deployment

🚧 In Progress (v1.1)

  • Advanced analytics dashboard
  • Experiment scheduling
  • Slack/Teams notifications
  • API authentication (JWT)

🔮 Future (v2.0+)

  • Kubernetes chaos experiments
  • Multi-region testing
  • Custom failure plugins
  • Team collaboration features
  • Experiment templates library
  • Cost optimization recommendations

📜 License

This project is licensed under the MIT License; see the LICENSE file for details.

The MIT License lets you use, modify, and distribute this code freely, provided the copyright and permission notice is included in all copies.

๐Ÿ™ Acknowledgments

Inspired By
Netflix Chaos Monkey
Built With
shadcn/ui
Powered By
AWS
Icons By
Lucide


🔗 Links


💪 Built with determination | 🧠 Designed with intelligence | ❤️ Made with love

If this project helped you, please give it a ⭐!



Made by ARAVIND281 | Licensed under MIT | Contributions welcome!
