Cluster orchestration for the ETHRC organisation. OpenTofu IaC that provisions an IPv6-primary EKS cluster on AWS with GPU autoscaling and GitOps, purpose-built for ML training workloads.
- VPC: dual-stack, IPv6-primary across 6 AZs (AWS-provided or BYOIP). The EKS control plane uses 5 AZs (excludes us-east-1e).
- EKS: Kubernetes cluster with Karpenter node autoscaling
- Nodes: CPU (`cpu`), entry-level GPU (`gpus`), mid GPU (`gpum`), large GPU (`gpul`), or H100 (`h100`) tiers provisioned on demand by Karpenter
- Add-ons: Karpenter, CoreDNS, VPC CNI, EBS CSI, NVIDIA GPU Operator (GPU tiers), S3 CSI
- ArgoCD: GitOps-driven workload delivery with per-project repo whitelisting
- S3: KMS-encrypted bucket for ML data, checkpoints, and model artefacts
- Cost controls: GPU node TTL (`gpu_node_max_lifetime`) and a cost killswitch script

| Tier | Node pool | Instance | GPU | VRAM | Primary workload |
|---|---|---|---|---|---|
| S | `gpus` | `g6.xlarge` | 1× L4 | 24 GB | Code validation, EDA, script debugging |
| M | `gpum` | `g6e.xlarge` | 1× L40S | 48 GB | Core prototyping, LoRA fine-tuning (up to 14B params), heavy inference |
| L | `gpul` | `g6e.12xlarge` | 4× L40S | 192 GB | Distributed training (DDP/FSDP), continuous pre-training, large batch sizes |

`a100-40` and `a100-80` are also available for A100 GPUs with 40 GB and 80 GB of memory respectively, though capacity is limited.
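To show how a workload might land on one of the tiers above, here is a minimal sketch that prints a smoke-test Pod manifest pinned to a node pool. It assumes the node pool names in the table are also the Karpenter NodePool names (selected through the `karpenter.sh/nodepool` label) and that the GPU Operator exposes GPUs as `nvidia.com/gpu`. The function name, the Pod name, and the container image are illustrative, and any taints or tolerations configured on the GPU pools are not covered.

```shell
#!/bin/sh
# Hypothetical helper, not part of this repository.
# Prints a Pod manifest that requests GPUs on a given node pool.
# Usage: gpu_pod_manifest <node-pool> [gpu-count]
gpu_pod_manifest() {
  pool="$1"
  gpus="${2:-1}"
  # IMAGE is an example default; substitute the image you actually use.
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    karpenter.sh/nodepool: ${pool}
  containers:
    - name: cuda
      image: ${IMAGE:-nvidia/cuda:12.4.1-base-ubuntu22.04}
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: ${gpus}
EOF
}
```

For example, `gpu_pod_manifest gpum | kubectl apply -f -` would request one L40S, and `gpu_pod_manifest gpul 4` would ask for all four GPUs on a `g6e.12xlarge`.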

    cp terraform.tfvars.example terraform.tfvars
    # fill in cluster_name, ml_data_bucket_name, and cluster_access at minimum
    tofu init
    tofu plan

    # Fresh cluster: two-phase deploy required (see deploy.sh for details)
    ./deploy.sh

    # Subsequent applies (updates, drift fixes)
    tofu apply

    $(tofu output -raw configure_kubectl)
    kubectl get nodes -o wide

Fresh deploys require `./deploy.sh` instead of a plain `tofu apply`. The Kubernetes and Helm providers authenticate before any resources exist, so a single apply will always fail for add-on resources. `deploy.sh` runs Phase 1 (VPC, EKS, S3) first and Phase 2 (Helm charts, ArgoCD, Karpenter) once the API server is ready.
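The two-phase flow described above could look roughly like the sketch below. This is not the contents of `deploy.sh`: the module addresses (`module.vpc`, `module.eks`, `module.s3`) are guesses based on the directory names, `my-cluster` is a placeholder, and `aws eks wait cluster-active` is used here only as a stand-in for "the API server is ready". The sketch defaults to a dry run that prints each command instead of executing it.

```shell
#!/bin/sh
# Hypothetical sketch of a two-phase deploy. This is NOT the contents of deploy.sh.
set -eu

# Assumed module addresses; check main.tf for the real ones.
PHASE1_TARGETS="module.vpc module.eks module.s3"
# Placeholder cluster name; use the cluster_name from terraform.tfvars.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Phase 1: apply only the infrastructure modules via -target.
phase1() {
  set -- tofu apply
  for target in $PHASE1_TARGETS; do
    set -- "$@" "-target=$target"
  done
  run "$@"
}

# Block until EKS reports the cluster as ACTIVE.
wait_for_cluster() {
  run aws eks wait cluster-active --name "$CLUSTER_NAME"
}

# Phase 2: a full apply, which now includes the Helm and Kubernetes resources.
phase2() {
  run tofu apply
}

phase1
wait_for_cluster
phase2
```

Running it as-is only prints the three commands in order. Setting `DRY_RUN=0` would execute them, which is worth doing only after confirming the module addresses against `main.tf`.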

| Document | Contents |
|---|---|
| QUICKSTART | Prerequisites, step-by-step deploy, teardown |
| VARIABLES | All input variables with types and defaults |
| BYOIP | Bring-your-own IPv6 prefix setup |


    main.tf                    # root module wiring
    variables.tf               # input variables
    outputs.tf                 # outputs
    backend.tf                 # S3 remote state
    terraform.tfvars.example   # copy → terraform.tfvars
    deploy.sh                  # two-phase deploy for fresh clusters
    modules/aws/
      vpc/                     # dual-stack VPC
      eks/                     # EKS cluster + IAM
      eks-addons/              # Karpenter, GPU Operator, ArgoCD, S3 CSI
      s3/                      # ML data bucket
    scripts/
      cost-killswitch.py       # emergency spend brake

Team access is managed through the `cluster_access` variable, a map of IAM principals to EKS access policies (`ClusterAdmin`, `Admin`, `Edit`, `View`). See VARIABLES for details.

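For orientation, the four policy names line up with the AWS-managed EKS access policies. The helper below is a sketch that resolves a short name to the corresponding policy ARN. Whether the module performs exactly this mapping internally is an assumption, and the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of this repository.
# Resolves a short policy name to the ARN of the AWS-managed EKS access policy.
access_policy_arn() {
  case "$1" in
    ClusterAdmin) suffix="AmazonEKSClusterAdminPolicy" ;;
    Admin)        suffix="AmazonEKSAdminPolicy" ;;
    Edit)         suffix="AmazonEKSEditPolicy" ;;
    View)         suffix="AmazonEKSViewPolicy" ;;
    *)
      # Anything else is rejected rather than guessed.
      echo "unknown policy: $1" >&2
      return 1
      ;;
  esac
  echo "arn:aws:eks::aws:cluster-access-policy/${suffix}"
}
```

To see which policies a principal actually holds on a live cluster, `aws eks list-associated-access-policies --cluster-name <name> --principal-arn <arn>` returns the attached policy ARNs, which can be compared against the output of `access_policy_arn`.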
Remote state is configured in `backend.tf` (S3 with a DynamoDB lock, KMS-encrypted). Local state is gitignored, so without the backend every apply runs from scratch.

`terraform.tfvars` and `terraform.tfstate` are gitignored. Never commit them.
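Because `terraform.tfvars` is never committed, each checkout has to recreate it from the example file. A small preflight check along these lines could confirm that the three minimum variables (`cluster_name`, `ml_data_bucket_name`, `cluster_access`) are assigned before running `tofu plan`. The function is hypothetical and only looks for a top-level assignment; it does not validate the values.

```shell
#!/bin/sh
# Hypothetical preflight check, not part of this repository.
# Prints each required variable that has no assignment in the given tfvars file.
missing_vars() {
  file="$1"
  for name in cluster_name ml_data_bucket_name cluster_access; do
    # Match lines such as `cluster_name = "..."`; commented-out lines do not match.
    if ! grep -Eq "^[[:space:]]*${name}[[:space:]]*=" "$file"; then
      echo "$name"
    fi
  done
}
```

Calling `missing_vars terraform.tfvars` prints nothing when all three variables are present and one name per line otherwise.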
