This project demonstrates how I design, operate, validate, and safely tear down a production-style Kubernetes platform on AWS. It includes a full operator runbook covering rebuild, validation, and teardown.
In a live demo, I:
- Provision AWS infrastructure using Terraform (VPC + EKS)
- Deploy a CPU-bound application and validate Horizontal Pod Autoscaling under real load
- Expose the application securely using NGINX Ingress behind an AWS NLB with ACM TLS
- Prove HTTPS-only external access with direct command-line validation
- Tear down all infrastructure cleanly to prevent unnecessary cloud cost
The emphasis is not just deployment, but operability, validation, and disciplined teardown: the core responsibilities of a senior Cloud / DevOps engineer.
A production-grade Cloud & DevOps reference platform on AWS, built to demonstrate how I design, operate, validate, and safely decommission Kubernetes-based infrastructure in real-world conditions.
This repository demonstrates real-world infrastructure automation, Kubernetes operations, and validated scaling behavior under load — with full teardown and cost-control discipline.
- Cloud: AWS (VPC, EKS, NLB, ACM)
- IaC: Terraform (remote state, locking, modular)
- Kubernetes: EKS, HPA, Metrics Server, NGINX Ingress
- CI/CD: GitHub Actions with OIDC (no secrets)
- Security: HTTPS-only ingress, IAM role assumption
- Cost Control: Manual demos + enforced teardown
This project demonstrates the ability to:
- Design and provision AWS infrastructure using modular Terraform
- Operate a production-style Amazon EKS cluster
- Expose applications securely using NGINX Ingress + AWS NLB
- Validate Horizontal Pod Autoscaler (HPA) behavior under real load
- Reason about failure scenarios, observability, and cost control
- Rebuild and tear down environments safely and repeatably
This mirrors how Cloud / DevOps engineers work in real production environments.
The platform consists of modular AWS infrastructure provisioned with Terraform and a Kubernetes workload deployed on Amazon EKS.
This project intentionally uses an AWS Network Load Balancer (L4) instead of an ALB to demonstrate:
- Lower latency and higher throughput at scale
- Simpler traffic model with TLS terminated at the edge (ACM)
- Full control over HTTPS-only enforcement without redirects
- Compatibility with restricted ingress environments (no NGINX server snippets)
This mirrors environments where performance, cost, or security constraints make ALB unsuitable.
This project includes a production-style operator runbook documenting how the platform is:
- Rebuilt from scratch
- Validated with evidence-oriented checks
- Safely torn down to prevent cloud cost leakage
The runbook reflects real CloudOps / SRE operational workflows rather than one-off deployment steps.
👉 Runbook: docs/runbook.md
This project supports two ingress designs:
- Primary (current): NLB + ACM TLS termination (L4)
- Future variant: ALB + L7 routing (documented separately)
This repository intentionally demonstrates the NLB-first model, commonly used for performance, cost, and simplicity in production.
To demo this project live, I rebuild the environment with Terraform (VPC + EKS) and deploy the Kubernetes workloads (HPA demo + ingress-nginx), then prove behavior with real evidence. For autoscaling, I start a CPU load generator and watch `kubectl get hpa,pods -w` scale from 1 → 5, then back down after the load stops. For external access, I verify DNS → NLB (ACM TLS termination) → NGINX Ingress → Service → Pods by curling https://app.utieyincloud.com and confirming a 200 OK. After the demo, I tear everything down (Ingress deleted, ingress-nginx uninstalled, EKS cluster and node group removed) to prevent AWS charges.
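A condensed, illustrative version of that flow is sketched below; the directory layout matches this repo, while the cluster name and region in step 2 are placeholders:

```bash
# 1. Rebuild infrastructure (assumes remote state is already configured)
cd terraform/environments/dev
terraform init
terraform apply

# 2. Point kubectl at the new cluster (cluster name/region are illustrative)
aws eks update-kubeconfig --name dev-eks --region us-east-1

# 3. Deploy the workload and watch the HPA react to load
kubectl apply -f k8s/
kubectl get hpa,pods -w

# 4. Prove HTTPS-only external access end to end
curl -I https://app.utieyincloud.com   # expect a 200 response
```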
This repository includes production-style CI/CD pipelines:
- CI – Terraform Plan (OIDC): automatically validates Terraform formatting, configuration, and execution plans without creating infrastructure.
- Demo – Rebuild, Validate, Teardown (OIDC): manually triggered end-to-end demo that provisions infrastructure, validates behavior, and always tears down resources to control cost.
All AWS authentication uses GitHub Actions OpenID Connect (OIDC). No long-lived AWS access keys or GitHub secrets are stored in this repository.
- Ingress (NLB + ACM TLS → NGINX Ingress → Service → Pods): docs/architecture
- Design decisions and trade-offs: docs/design-rationale.md
- Kubernetes HPA autoscaling proof (screenshots): docs/images/hpa
- Ingress (NGINX) + TLS termination at NLB (ACM): k8s/ingress
- Terraform environment wiring (dev): terraform/environments/dev
- Cloud: AWS
- Infrastructure as Code: Terraform (modular VPC + EKS)
- Container Orchestration: Amazon EKS
- Ingress & Traffic Management: NGINX Ingress Controller
- Autoscaling: Horizontal Pod Autoscaler (HPA)
- Observability: Metrics Server
```
Client
  ↓
Route 53 (app.utieyincloud.com)
  ↓
AWS Network Load Balancer (TLS 443)
  ↓  (TLS terminates at NLB via ACM)
NGINX Ingress Controller (EKS)
  ↓
Kubernetes Service
  ↓
Application Pod
```
Ingress is documented in detail here:
k8s/ingress/README.md
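As a quick check of the HTTPS-only model, the following sketch shows what the validation looks like; it assumes the NLB exposes only a TLS 443 listener, so plain HTTP should fail rather than redirect:

```bash
# HTTPS succeeds end to end (NLB TLS termination → NGINX Ingress → Service → Pods)
curl -I https://app.utieyincloud.com        # expect a 200 response

# Plain HTTP should not be served at all (no :80 listener on the NLB),
# so this times out or is refused instead of redirecting
curl -I --connect-timeout 5 http://app.utieyincloud.com
```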
This section demonstrates real, observed Horizontal Pod Autoscaling behavior on a live Amazon EKS cluster.
Autoscaling behavior was intentionally triggered, observed, and validated:
- Deployed a CPU-bound application (`hpa-demo`)
- Configured CPU requests and limits correctly
- Installed and patched Metrics Server for EKS compatibility
- Created an HPA with:
  - Minimum replicas: 1
  - Maximum replicas: 5
  - Target CPU utilization: 50%
- Generated sustained CPU load using a BusyBox-based load generator
- Observed scale-up, stabilization, and scale-down in real time
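A minimal sketch of those steps with standard kubectl follows; the deployment name `hpa-demo` comes from this repo, while the service URL inside the load loop is an assumption:

```bash
# Create the HPA: 1–5 replicas, 50% target CPU utilization
kubectl autoscale deployment hpa-demo --cpu-percent=50 --min=1 --max=5

# BusyBox-based load generator hammering the demo service
kubectl run load-generator --image=busybox:1.36 --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://hpa-demo; done"

# Watch scale-up (1 → 5) and, after deleting the load generator, scale-down
kubectl get hpa,pods -w
kubectl delete pod load-generator
```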
- CPU utilization exceeded the 50% threshold
- HPA increased replicas from 1 → 5
- New pods were scheduled and reached the `Running` state
Picture Evidence:
- docs/images/hpa/01-hpa-scale-up.png
- docs/images/hpa/02-hpa-metrics.png
- CPU remained consistently high
- Replica count stabilized at the maximum configured value
Picture Evidence:
docs/images/hpa/03-hpa-stabilization.png
- Load generator was stopped
- CPU utilization dropped below target
- HPA gradually reduced replicas from 5 → 1
- Pods were terminated gracefully, without disruption
Picture Evidence:
- docs/images/hpa/04-hpa-scale-down-start.png
- docs/images/hpa/05-hpa-scale-down-cpu.png
- docs/images/hpa/08-hpa-scale-down-live.png
- docs/images/hpa/10-hpa-scale-down-complete.png
Ingress-NGINX exposes Prometheus metrics, which are scraped by kube-prometheus-stack via a ServiceMonitor. Traffic is generated against the NLB (with the Host header set) and visualized in Grafana (Explore) using:
`sum(rate(nginx_ingress_controller_requests[1m]))`
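Illustratively, traffic against the NLB with the expected Host header can be generated like this; the Service name assumes the ingress-nginx Helm chart defaults:

```bash
# Look up the NLB hostname from the ingress-nginx controller Service
NLB=$(kubectl get svc ingress-nginx-controller -n ingress-nginx \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

# Steady request stream so the rate() query has data to show
# (-k because the ACM cert matches the app domain, not the NLB hostname)
while true; do
  curl -sk -H "Host: app.utieyincloud.com" "https://${NLB}/" -o /dev/null
done
```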
Ingress traffic is monitored using Prometheus and visualized in Grafana.
Evidence:
- HTTPS-only ingress enforced via NLB + ACM
- Prometheus scraping ingress-nginx metrics
- Grafana dashboards showing real-time request rate (RPS)
Screenshots:
- docs/images/observability/grafana-ingress-dashboard-overview.png
- docs/images/observability/ingress-https-only-proof.png
- Cluster returned to minimum replica count
- No pod crashes or instability observed
- HPA events confirm correct autoscaling decisions
Picture Evidence:
docs/images/hpa/11-hpa-final-event-trail.png
This implementation validates how Kubernetes autoscaling behaves in production, not just in theory.
It demonstrates:
- Correct sizing of CPU requests and limits
- Proper Metrics Server configuration on Amazon EKS
- Autoscaling decisions made dynamically by Kubernetes
- Safe and predictable scale-down behavior without service disruption
This pattern closely mirrors how autoscaling is tested, verified, and trusted in real production Kubernetes platforms.
All AWS infrastructure is provisioned using Terraform with a clear separation of concerns:
- Reusable modules: terraform/modules/
  - VPC
  - EKS
- Environment wiring: terraform/environments/dev/
Terraform environment documentation:
terraform/environments/dev/README.md
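The day-to-day Terraform workflow against the dev environment looks roughly like this; the backend/remote state configuration is assumed to already exist in the environment directory:

```bash
cd terraform/environments/dev

terraform init              # configures the remote state backend and locking
terraform plan -out=tfplan  # review the execution plan before applying
terraform apply tfplan      # provision the VPC and EKS modules
```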
This project was designed with cost awareness in mind.
Before ending any session:
- Ingress resources are deleted
- ingress-nginx is uninstalled (triggering NLB deletion)
- EKS node groups and clusters are scaled down or destroyed
- Terraform state is cleanly destroyed when finished
This reflects real operational hygiene expected in production cloud environments.
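A sketch of that teardown order follows; the manifest path and Helm release name are assumptions based on this repo's layout and ingress-nginx defaults:

```bash
# 1. Remove Ingress resources first so nothing references the controller
kubectl delete -f k8s/ingress/

# 2. Uninstall ingress-nginx; deleting its Service triggers NLB deletion
helm uninstall ingress-nginx -n ingress-nginx

# 3. Destroy the EKS cluster, node groups, and VPC via Terraform
cd terraform/environments/dev
terraform destroy
```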
This environment can be fully rebuilt and demonstrated live:
- Provision infrastructure with Terraform
- Deploy ingress controller and application
- Validate external access (HTTP / HTTPS)
- Trigger HPA scale-up and scale-down
- Tear everything down safely
This ensures the project is reproducible, not a one-off setup.
This project was designed with failure modes in mind, including:
- Load balancer recreation after ingress-nginx reinstall
- DNS propagation delays after NLB replacement
- HPA scale-down behavior under fluctuating CPU load
- Metrics Server compatibility issues on EKS
- Cost leakage from orphaned load balancers or node groups (see the sweep sketch at the end of this section)
Each failure mode was either:
- Prevented through design, or
- Observed, validated, and corrected during live testing
This reflects how production systems are operated, not just deployed.
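To guard against cost leakage from orphaned resources, a post-teardown sweep along these lines can confirm nothing billable was left behind:

```bash
# Any network load balancers still running?
aws elbv2 describe-load-balancers \
  --query 'LoadBalancers[?Type==`network`].[LoadBalancerName,State.Code]' \
  --output table

# Any EKS clusters (and therefore node groups) still present?
aws eks list-clusters --output table
```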
- Architecture validated
- Autoscaling behavior proven
- External access secured via Ingress
- Cost-control teardown verified
This repository represents a production-style Cloud & DevOps platform.