A comprehensive, production-ready DevOps platform demonstrating Infrastructure as Code, GitOps, and modern cloud-native practices.
Last Updated: December 2024
.
├── terraform/                      # Terraform modules
│   ├── modules/
│   │   ├── vpc/                    # VPC, subnets, NAT, IGW
│   │   ├── eks/                    # EKS cluster + node groups
│   │   ├── eks-addons/             # EKS add-ons (CSI, CNI, etc.)
│   │   └── irsa/                   # IAM Roles for Service Accounts
│   └── providers.tf
│
├── terragrunt/                     # Terragrunt environment configs
│   ├── terragrunt.hcl              # Root configuration
│   ├── dev/
│   │   ├── env.hcl
│   │   ├── vpc/
│   │   ├── eks/
│   │   └── eks-addons/
│   ├── staging/
│   └── prod/
│
├── kubernetes/                     # Kubernetes manifests
│   ├── flux-system/                # Flux bootstrap configuration
│   ├── infrastructure/             # Cluster-wide infrastructure
│   │   ├── sources/                # Helm repositories
│   │   ├── monitoring/             # Prometheus, Grafana
│   │   ├── nvidia/                 # NVIDIA device plugin
│   │   └── ingress/                # Ingress controller
│   └── apps/                       # Application deployments
│       └── sample-gpu-app/
│
├── .github/
│   └── workflows/
│       ├── terraform-ci.yaml       # TF validate, plan, apply
│       ├── container-build.yaml    # Build & push containers
│       └── flux-diff.yaml          # Preview Flux changes
│
├── docker/                         # Dockerfiles
│   └── sample-gpu-app/
│
└── docs/                           # Additional documentation
    ├── SETUP.md
    ├── GPU-WORKLOADS.md
    └── TROUBLESHOOTING.md
- Terraform Modules: Reusable, versioned modules for VPC and EKS
- Terragrunt: DRY configuration management across environments
- State Management: Remote state with S3 + DynamoDB locking
- GPU Support: Pre-configured node groups for NVIDIA GPU instances
- Automated Deployments: Git as single source of truth
- Helm Controller: Declarative Helm release management
- Kustomize Integration: Environment-specific overlays
- Image Automation: Automatic image updates (optional)
- Prometheus: Metrics collection with GPU metrics support
- Grafana: Pre-configured dashboards for K8s and GPU monitoring
- Alertmanager: Alert routing and notification
- Infrastructure Pipeline: Validate → Plan → Apply workflow
- Container Pipeline: Build, scan, and push to ECR
- Security Scanning: Trivy for container vulnerability scanning
- Cost Estimation: Infracost integration for PR cost preview
- AWS CLI v2 configured with appropriate credentials
- Terraform >= 1.5.0
- Terragrunt >= 0.50.0
- kubectl >= 1.28
- Flux CLI >= 2.0
- Docker (for building containers)
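
Before starting, it can help to confirm that the tools listed above are installed. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper name, and it only checks that each command is on `PATH`. It does not verify the minimum versions listed above.

```shell
# Hypothetical helper: print each required CLI that is not found on PATH.
# Presence check only; it does not compare against the minimum versions above.
missing_tools() {
  missing_count=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing_count=$((missing_count + 1))
    fi
  done
  # Exit status is non-zero when at least one tool is missing.
  [ "$missing_count" -eq 0 ]
}
```

A possible invocation, using the command names that correspond to the prerequisites: `missing_tools aws terraform terragrunt kubectl flux docker || echo "Install the tools printed above first."`
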
git clone https://github.com/mateenali66/devops-experiment.git
cd devops-experiment
# Set your AWS profile
export AWS_PROFILE=personal

cd terragrunt/dev
terragrunt run-all init
# Review the plan
terragrunt run-all plan
# Apply infrastructure
terragrunt run-all apply

# Configure kubectl
aws eks update-kubeconfig --name eks-dev-cluster --region us-west-2
# Bootstrap Flux
flux bootstrap github \
--owner=mateenali66 \
--repository=devops-experiment \
--branch=main \
--path=kubernetes/clusters/dev \
--personal

# Port forward Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80
# Default credentials: admin / prom-operator

This platform supports NVIDIA GPU workloads out of the box:
# Example GPU pod request
resources:
  limits:
    nvidia.com/gpu: 1

See docs/GPU-WORKLOADS.md for detailed GPU configuration.
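
To show where the GPU limit sits in a complete manifest, here is a minimal sketch of a smoke-test pod, wrapped in a shell function so it can be piped to `kubectl`. The function name `gpu_test_pod_manifest`, the pod name `gpu-smoke-test`, and the `nvidia/cuda` image tag are illustrative assumptions and are not taken from this repository. It also assumes the NVIDIA device plugin is running and a GPU node is available.

```shell
# Hypothetical helper: print a minimal Pod manifest that requests one GPU.
# Pod name and image tag are assumptions chosen for illustration only.
gpu_test_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}
```

One way to use it: `gpu_test_pod_manifest | kubectl apply -f -`, then `kubectl logs gpu-smoke-test` once the pod has completed. If the pod stays in `Pending`, the cluster likely has no schedulable node advertising `nvidia.com/gpu`.
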
Pre-configured Grafana dashboards:
- Kubernetes Cluster Overview
- Node Exporter / Node Metrics
- NVIDIA GPU Metrics (DCGM)
- Flux GitOps Status
- Container Resource Usage
- Private EKS endpoint (configurable)
- IRSA for pod-level AWS permissions
- Network policies for pod isolation
- Secrets management via External Secrets Operator
- Container image scanning in CI/CD
- Spot instances for non-GPU workloads
- Cluster autoscaler for dynamic scaling
- Karpenter support (optional)
- Right-sizing recommendations via Grafana
- Fork the repository
- Create a feature branch
- Submit a pull request
MIT License - see LICENSE


