Production-grade Kubernetes infrastructure on AWS with zero-trust security model and automated CI/CD deployment via GitHub Actions.
- Architecture Overview
- Zero-Trust Security Model
- Infrastructure Components
- Prerequisites
- Quick Start
- CI/CD Workflows
- Contributing
- Cost Breakdown
- Management Operations
High-Level Design:
- Network: 3-AZ High Availability VPC (3 public + 3 private subnets)
- Compute:
- 1x Bastion Host (t3.micro) - kubectl operations only
- 1x Dev Server (t3.medium) - SSH gateway, Ansible, Terraform
- 1x Control Plane (t3.medium) - Kubernetes master in ap-south-1a
- 2x Worker Nodes (t3.medium) - Static instances across ap-south-1b and ap-south-1c
- Container Registry: 5x ECR repositories for microservices (lifecycle policy: keep last 5 images)
- Load Balancer: Internal NLB for Kubernetes API (port 6443)
- NAT Gateway: Single NAT in ap-south-1a for cost optimization
- CI/CD: GitHub Actions with OIDC for automated infrastructure deployment and container image pushes
Key Design Decisions:
- ✅ No Auto Scaling Groups - Static worker nodes managed via Ansible/Kubespray
- ✅ Fixed Private IPs - Predictable addressing for Ansible inventory
- ✅ Single Control Plane - Cost-optimized setup (can scale to 3 for full HA)
- ✅ Dev Server - Stop when not in use to save ~$28/month
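The "no ASG, fixed private IPs" decisions above can be expressed directly in Terraform. A minimal sketch (resource names, the AMI variable, and subnet references are illustrative assumptions, not the repo's actual code):

```hcl
# Hypothetical sketch: a static worker node with a fixed private IP,
# so the Ansible/Kubespray inventory never changes between applies.
resource "aws_instance" "worker_1" {
  ami           = var.ubuntu_ami_id       # assumed variable
  instance_type = "t3.medium"
  subnet_id     = aws_subnet.private_b.id # 10.0.11.0/24 (ap-south-1b)
  private_ip    = "10.0.11.30"            # fixed -> stable Ansible inventory
  key_name      = aws_key_pair.kubestock.key_name

  tags = {
    Name = "kubestock-worker-1"
  }
}
```

Because the instance is a plain `aws_instance` rather than part of an Auto Scaling Group, the node's identity and address survive restarts, which is what makes a static Ansible inventory workable.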
| Resource | SSH Access | Kubectl Access | ECR Access |
|---|---|---|---|
| Bastion | ✅ From anywhere (0.0.0.0/0) | ✅ Via NLB | ❌ |
| Dev Server | ✅ From MY_IP only | ✅ Direct to control plane | ❌ |
| Control Plane | ✅ From dev server | N/A | ✅ Pull only |
| Workers | ✅ From dev server | N/A | ✅ Pull only |
| GitHub Actions | N/A | N/A | ✅ Push only (OIDC) |
```
Internet   → Bastion        (0.0.0.0/0:22)
MY_IP      → Dev Server     (MY_IP:22)
Dev Server → All K8s Nodes  (SSH:22)
Bastion    → NLB → Control Plane (K8s API:6443)
K8s Nodes  ↔ K8s Nodes      (all ports via k8s_common SG)
```
Critical Rules:
- ❌ Bastion cannot SSH to K8s nodes (use dev server)
- ✅ Dev server can SSH to all nodes for configuration management
- ✅ Inter-node communication via shared `k8s_common` security group
- ✅ All external access requires authentication (SSH keys)
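These rules map naturally onto security group resources. A hedged sketch (the security group names and references are assumptions about the repo's layout):

```hcl
# Assumed sketch: SSH into K8s nodes is allowed only from the
# dev server's security group, never from the bastion.
resource "aws_security_group_rule" "ssh_from_dev_server" {
  type                     = "ingress"
  from_port                = 22
  to_port                  = 22
  protocol                 = "tcp"
  security_group_id        = aws_security_group.k8s_common.id
  source_security_group_id = aws_security_group.dev_server.id
}

# Self-referencing rule: any node carrying k8s_common can reach any
# other node carrying k8s_common on all ports (inter-node traffic).
resource "aws_security_group_rule" "k8s_inter_node" {
  type              = "ingress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  security_group_id = aws_security_group.k8s_common.id
  self              = true
}
```

The self-referencing rule is what lets new nodes join cluster traffic simply by being attached to `k8s_common`, with no per-node rule changes.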
- VPC: `10.0.0.0/16`
- Public Subnets:
  - `10.0.1.0/24` (ap-south-1a) - Bastion, Dev Server
  - `10.0.2.0/24` (ap-south-1b)
  - `10.0.3.0/24` (ap-south-1c)
- Private Subnets:
  - `10.0.10.0/24` (ap-south-1a) - Control Plane
  - `10.0.11.0/24` (ap-south-1b) - Worker 1
  - `10.0.12.0/24` (ap-south-1c) - Worker 2
- NAT Gateway: Single NAT in ap-south-1a (saves ~$64/month vs 3 NATs)
- Internet Gateway: Single IGW for public subnet internet access
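A condensed Terraform sketch of this layout, assuming illustrative resource names (only one subnet and the NAT are shown; the other subnets follow the same pattern):

```hcl
# Illustrative sketch of the network layout above (names are assumptions).
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "private_a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.10.0/24"
  availability_zone = "ap-south-1a"
}

# Single NAT gateway in ap-south-1a; all three private route tables
# point at it, trading AZ resilience for ~$64/month in savings.
resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id          # assumed EIP resource
  subnet_id     = aws_subnet.public_a.id  # 10.0.1.0/24
}
```

The trade-off to note: if ap-south-1a fails, private subnets in the other two AZs lose outbound internet access until the NAT is recreated elsewhere.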
| Instance | Type | Subnet | IP | Purpose |
|---|---|---|---|---|
| Bastion | t3.micro | Public (ap-south-1a) | Elastic IP | kubectl via NLB |
| Dev Server | t3.medium | Public (ap-south-1a) | Dynamic (free) | SSH gateway, Ansible, Terraform |
| Control Plane | t3.medium | Private (ap-south-1a) | 10.0.10.21 | Kubernetes master |
| Worker 1 | t3.medium | Private (ap-south-1b) | 10.0.11.30 | Kubernetes worker |
| Worker 2 | t3.medium | Private (ap-south-1c) | 10.0.12.30 | Kubernetes worker |
Note: Dev server has no Elastic IP - public IP changes on start/stop but costs $0 when stopped.
- Repositories: 5 microservices
  - `ms-product-catalog`
  - `ms-inventory`
  - `ms-supplier`
  - `ms-order-management`
  - `ms-test`
- Lifecycle Policy: Keep last 5 images per repository
- Features: Image scanning on push, AES256 encryption
- Access: GitHub Actions (push via OIDC), K8s nodes (pull via IAM instance profile)
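The "keep last 5 images" policy and the scanning/encryption settings can be sketched in Terraform as follows (resource names are assumptions; one repository shown, the other four follow the same pattern):

```hcl
# Hedged sketch: one ECR repo with scan-on-push, AES256 encryption,
# and a lifecycle policy that expires all but the 5 newest images.
resource "aws_ecr_repository" "ms_inventory" {
  name = "ms-inventory"

  image_scanning_configuration {
    scan_on_push = true
  }

  encryption_configuration {
    encryption_type = "AES256"
  }
}

resource "aws_ecr_lifecycle_policy" "keep_last_5" {
  repository = aws_ecr_repository.ms_inventory.name
  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Keep last 5 images"
      selection = {
        tagStatus   = "any"
        countType   = "imageCountMoreThan"
        countNumber = 5
      }
      action = { type = "expire" }
    }]
  })
}
```

The lifecycle policy is what keeps ECR storage costs at the "<$1/month" level shown in the cost table, even with frequent CI pushes.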
- Type: Network Load Balancer (NLB)
- Scheme: Internal (not internet-facing)
- Subnets: All 3 private subnets
- Port: 6443 → Control Plane
- Health Checks: TCP on port 6443 every 30s
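A hedged Terraform sketch of this NLB setup (names and subnet references are assumptions):

```hcl
# Assumed sketch of the internal NLB fronting the Kubernetes API.
resource "aws_lb" "k8s_api" {
  name               = "kubestock-k8s-api"
  internal           = true
  load_balancer_type = "network"
  subnets = [
    aws_subnet.private_a.id,
    aws_subnet.private_b.id,
    aws_subnet.private_c.id,
  ]
}

resource "aws_lb_target_group" "k8s_api" {
  name     = "kubestock-k8s-api"
  port     = 6443
  protocol = "TCP"
  vpc_id   = aws_vpc.main.id

  health_check {
    protocol = "TCP"
    port     = "6443"
    interval = 30
  }
}

resource "aws_lb_listener" "k8s_api" {
  load_balancer_arn = aws_lb.k8s_api.arn
  port              = 6443
  protocol          = "TCP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.k8s_api.arn
  }
}
```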
- `kubestock-node-role`: Instance profile for control plane and workers
  - SSM Session Manager access
  - Kubernetes AWS controllers (Cluster Autoscaler, EBS CSI, AWS LB Controller)
  - ECR pull access
- `KubeStock-github-actions-ecr-role`: OIDC role for GitHub Actions
  - ECR push permissions (all microservice repositories)
  - Trust policy: restricted to specific repos and production environment
  - Used by CI/CD pipelines to push container images
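The restricted trust policy can be sketched in Terraform roughly as below. The OIDC provider resource name and the exact `sub` claim string are assumptions; a real policy must match the repo and environment names precisely:

```hcl
# Hedged sketch: trust policy limiting the role to one repository
# and the "production" environment via the GitHub OIDC provider.
resource "aws_iam_role" "github_actions_ecr" {
  name = "KubeStock-github-actions-ecr-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
          # Assumed claim format; adjust repo/environment to taste.
          "token.actions.githubusercontent.com:sub" = "repo:KubeStock-DevOps-project/kubestock-infrastructure:environment:production"
        }
      }
    }]
  })
}
```

Scoping the `sub` condition this tightly is what makes the role "push only (OIDC)" in practice: a workflow in any other repo or environment cannot assume it.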
- AWS Account with admin permissions
- Terraform >= 1.13.5
- AWS CLI configured with credentials
- SSH Key Pair: `ssh-keygen -t rsa -b 4096 -f ~/.ssh/kubestock-key`
- S3 Backend (optional): for remote state storage
  - Create S3 bucket: `kubestock-terraform-state`
  - Configure the backend in `terraform/base/main.tf`
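The optional backend configuration would look something like this (the state key path and region here are assumptions):

```hcl
# Optional remote state, configured in terraform/base/main.tf.
terraform {
  backend "s3" {
    bucket = "kubestock-terraform-state"
    key    = "prod/terraform.tfstate" # assumed key path
    region = "ap-south-1"
  }
}
```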
```bash
# 1. Clone repository
git clone https://github.com/KubeStock-DevOps-project/kubestock-infrastructure.git
cd kubestock-infrastructure/terraform/prod

# 2. Create terraform.tfvars
cat > terraform.tfvars <<EOF
my_ip = "$(curl -4 ifconfig.me)/32"
ssh_public_key_content = "$(cat ~/.ssh/kubestock-key.pub)"
EOF

# 3. Initialize and deploy
terraform init
terraform plan
terraform apply
```

```bash
# Connection details
terraform output bastion_public_ip
terraform output dev_server_public_ip
terraform output control_plane_private_ip
terraform output worker_private_ips
terraform output nlb_dns_name

# ECR repositories
terraform output ecr_repository_urls
terraform output github_actions_role_arn
```
- `terraform-pr-checks.yml` - runs on Pull Requests
  - Terraform format check
  - Terraform validate (no backend required)
  - Ensures code quality before merge
- `terraform-prod-apply.yml` - runs on push to `main`
  - Generates `terraform.tfvars` from GitHub secrets
  - Terraform format check
  - Terraform init (with AWS backend)
  - Terraform validate
  - Terraform plan
  - Terraform apply (auto-approved)
Set in: Repository Settings → Secrets and variables → Actions
- `AWS_REGION` - Deployment region (e.g., `ap-south-1`)
- `MY_IP` - Your public IP with CIDR (e.g., `1.2.3.4/32`)
- `SSH_PUBLIC_KEY_CONTENT` - SSH public key (full content)
Note: GitHub Actions uses OIDC for AWS authentication. Configure the following IAM roles:
- `KubeStock-Terraform-Plan-Role` - for PR checks (read-only)
- `KubeStock-Terraform-Apply-Role` - for deployments (read-write)
- `KubeStock-github-actions-ecr-role` - for ECR push operations
Documentation: See `.github/GITHUB_SECRETS_SETUP.md` for detailed setup instructions.
Feature Branch → PR → PR Checks Pass → Merge to Main → Terraform Apply
Safety Features:
- Format and validation on every PR
- Concurrency control prevents overlapping applies
- Terraform plan review before apply
- All secrets encrypted and hidden in logs
All infrastructure modifications must follow the zero-trust security model and require approval from the infrastructure team.
1. **Create Feature Branch**
   ```bash
   git checkout -b feature/add-monitoring
   ```
2. **Make Changes** (follow zero-trust principles)
   - Update Terraform files in `terraform/prod/`
   - Add security group rules with least privilege
   - Document changes in commit messages
3. **Open Pull Request**
   - PR triggers automated checks (fmt + validate)
   - Request review from @KubeStock-DevOps-project/infrastructure
   - Address feedback
4. **Wait for Approval**
   - Infrastructure team reviews security implications
   - Minimum 1 approval required
5. **Merge to Main**
   - Automated deployment triggers
   - Monitor GitHub Actions for apply status
When proposing infrastructure changes:
✅ DO:
- Use security groups with specific source/destination
- Limit SSH access to dev server only
- Add egress rules only when necessary
- Document security group changes clearly
- Test in a separate environment first
❌ DON'T:
- Open ports to `0.0.0.0/0` (except bastion SSH)
- Allow unrestricted egress (`0.0.0.0/0` on all ports)
- Bypass the PR review process
- Commit `terraform.tfvars` with secrets
- Create IAM roles with `*` permissions
Monthly Estimate: ~$157/month (with dev server stopped)
| Resource | Spec | Monthly Cost |
|---|---|---|
| NAT Gateway | 1x in ap-south-1a | $32 |
| Bastion | t3.micro (24/7) | $7 |
| Dev Server | t3.medium (stopped) | $2 (storage only) |
| Control Plane | t3.medium (24/7) | $30 |
| Worker 1 | t3.medium (24/7) | $30 |
| Worker 2 | t3.medium (24/7) | $30 |
| NLB | Internal NLB | $16 |
| ECR | 5 repositories | <$1 (storage only) |
| Storage/Transfer | EBS + Data | ~$10 |
Cost Optimization Tips:
- Stop dev server when not in use: saves $28/month
- Use Spot Instances for workers: saves ~40% (requires setup)
- ECR lifecycle policies: automatically remove old images to minimize storage costs
- Monitor NAT Gateway data transfer in ap-south-1a: largest variable cost
- Review ECR image scanning costs if pushing frequently
Stop (save money):

```bash
aws ec2 stop-instances --instance-ids $(terraform output -raw dev_server_instance_id)
```

Start (when needed):

```bash
aws ec2 start-instances --instance-ids $(terraform output -raw dev_server_instance_id)

# Get new public IP after start
terraform refresh
terraform output dev_server_public_ip
```

Connect to bastion:

```bash
ssh -i ~/.ssh/kubestock-key ubuntu@<bastion-eip>
```

Connect to dev server:

```bash
ssh -i ~/.ssh/kubestock-key ubuntu@<dev-server-ip>
```

SSH to K8s nodes (via dev server jump host):

```bash
# Add key to SSH agent
ssh-add ~/.ssh/kubestock-key

# Jump through dev server
ssh -J ubuntu@<dev-server-ip> ubuntu@10.0.10.21  # Control plane
ssh -J ubuntu@<dev-server-ip> ubuntu@10.0.11.30  # Worker 1
ssh -J ubuntu@<dev-server-ip> ubuntu@10.0.12.30  # Worker 2
```

From bastion (via NLB):

```bash
ssh ubuntu@<bastion-eip>
kubectl --kubeconfig ~/.kube/config get nodes
```

From dev server (direct):

```bash
ssh ubuntu@<dev-server-ip>
kubectl --server=https://10.0.10.21:6443 --kubeconfig ~/.kube/config get nodes
```

Login to ECR (from any K8s node):

```bash
aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin <account-id>.dkr.ecr.ap-south-1.amazonaws.com
```

GitHub Actions (automatic via OIDC):

```yaml
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: ${{ secrets.AWS_ROLE_ARN }} # KubeStock-github-actions-ecr-role
    aws-region: ap-south-1

- name: Login to Amazon ECR
  uses: aws-actions/amazon-ecr-login@v2
```

- Cost Analysis: `terraform/prod/docs/COST_ANALYSIS.md`
- Deployment Checklist: `terraform/prod/docs/DEPLOYMENT_CHECKLIST.md`
- Quick Reference: `terraform/prod/docs/QUICK_REFERENCE.md`
- GitHub Secrets Setup: `.github/GITHUB_SECRETS_SETUP.md`
```bash
cd terraform/prod
terraform destroy
```

This destroys:
- All EC2 instances
- ECR repositories and container images
- VPC and networking components
- Load balancer and NAT gateway
- IAM roles and policies
This project is maintained by the KubeStock Infrastructure Team.
Bastion cannot SSH to K8s nodes: expected behavior - use the dev server as the SSH gateway instead.
Check NLB target health:

```bash
aws elbv2 describe-target-health --target-group-arn $(terraform output -raw k8s_api_target_group_arn)
```

Verify that both the instance-specific SG and the `k8s_common` SG are attached:

```bash
aws ec2 describe-instances --instance-ids <instance-id> --query 'Reservations[0].Instances[0].SecurityGroups'
```

If the CI/CD pipeline fails:
- Verify all secrets/variables are set in GitHub
- Check AWS credentials are valid
- Review GitHub Actions logs for specific error
- Ensure no manual changes conflict with Terraform state
Project: KubeStock | Environment: Production | Managed By: Terraform + GitHub Actions