Hercules

Cluster orchestration for the ETHRC organisation. OpenTofu IaC that provisions an IPv6-primary EKS cluster on AWS with GPU autoscaling and GitOps, purpose-built for ML training workloads.

What it deploys

VPC — dual-stack, IPv6-primary across 6 AZs (AWS-provided or BYOIP). EKS control plane uses 5 AZs (excludes us-east-1e).
EKS — Kubernetes cluster with Karpenter node autoscaling
Nodes — CPU (cpu), entry-level GPU (gpus), mid GPU (gpum), large GPU (gpul), or H100 (h100) tiers provisioned on demand by Karpenter
Add-ons — Karpenter, CoreDNS, VPC CNI, EBS CSI, NVIDIA GPU Operator (GPU tiers), S3 CSI
ArgoCD — GitOps-driven workload delivery with per-project repo whitelisting
S3 — KMS-encrypted bucket for ML data, checkpoints, and model artefacts
Cost controls — GPU node TTL (gpu_node_max_lifetime) and a cost killswitch script
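
The list above does not say how gpu_node_max_lifetime is formatted or enforced. As a rough illustration only, the sketch below assumes it is a duration string such as "24h" (similar in spirit to Karpenter's expireAfter setting) and shows how a node's age could be compared against it. The function names parse_duration and is_expired are hypothetical and do not come from this repository.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the lifetime is written as number+unit tokens, e.g. "24h" or "1h30m".
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([hms])")


def parse_duration(text: str) -> timedelta:
    """Parse a duration such as '24h' or '1h30m' into a timedelta."""
    tokens = _TOKEN.findall(text)
    # Reject input that is not made up entirely of number+unit tokens.
    if not tokens or "".join(n + u for n, u in tokens) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=sum(int(n) * _UNIT_SECONDS[u] for n, u in tokens))


def is_expired(created_at: datetime, max_lifetime: str, now: datetime) -> bool:
    """Return True once a node has been alive for at least max_lifetime."""
    return now - created_at >= parse_duration(max_lifetime)


if __name__ == "__main__":
    created = datetime(2024, 1, 1, tzinfo=timezone.utc)
    now = created + timedelta(hours=25)
    print(is_expired(created, "24h", now))
```

In the real cluster the TTL is presumably enforced by Karpenter itself rather than by a script; the sketch only makes the semantics of a maximum node lifetime concrete.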

GPU tiers

| Tier | Node pool | Instance | GPU | Total VRAM | Primary workload |
|------|-----------|----------|-----|------------|------------------|
| S | `gpus` | g6.xlarge | 1× L4 | 24 GB | Code validation, EDA, script debugging |
| M | `gpum` | g6e.xlarge | 1× L40S | 48 GB | Core prototyping, LoRA fine-tuning (up to 14B params), heavy inference |
| L | `gpul` | g6e.12xlarge | 4× L40S | 192 GB | Distributed training (DDP/FSDP), continuous pre-training, large batch sizes |

Two additional tiers, a100-40 and a100-80, provide A100 GPUs with 40 GB and 80 GB of VRAM respectively. Capacity for both is limited.
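
To make the tier table easier to apply, here is a minimal sketch of picking the smallest listed tier that meets a VRAM and GPU-count requirement. The figures are copied from the table; the GpuTier class and smallest_tier function are hypothetical helpers, not part of this repository. The A100 tiers are left out because their instance types are not listed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuTier:
    node_pool: str      # node pool name from the table
    instance: str       # EC2 instance type
    gpus: int           # GPUs per node
    total_vram_gb: int  # VRAM summed over all GPUs on the node


# Ordered from smallest to largest, values taken from the tier table.
TIERS = (
    GpuTier("gpus", "g6.xlarge", 1, 24),
    GpuTier("gpum", "g6e.xlarge", 1, 48),
    GpuTier("gpul", "g6e.12xlarge", 4, 192),
)


def smallest_tier(required_vram_gb: float, min_gpus: int = 1) -> Optional[GpuTier]:
    """Return the first tier that satisfies both constraints, or None."""
    for tier in TIERS:
        if tier.total_vram_gb >= required_vram_gb and tier.gpus >= min_gpus:
            return tier
    return None


if __name__ == "__main__":
    for need in (16, 40, 100, 400):
        tier = smallest_tier(need)
        print(need, "GB ->", tier.node_pool if tier else "no listed tier fits")
```

Note that the 192 GB on the L tier is spread over four 48 GB GPUs, so a job that needs more than 48 GB must shard across devices (for example with FSDP) rather than treat it as one pool of memory.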

Quick start

```shell
cp terraform.tfvars.example terraform.tfvars
# fill in cluster_name, ml_data_bucket_name, and cluster_access at minimum

tofu init
tofu plan

# Fresh cluster: two-phase deploy required (see deploy.sh for details)
./deploy.sh

# Subsequent applies (updates, drift fixes)
tofu apply

# Run the kubectl setup command emitted by the configure_kubectl output
$(tofu output -raw configure_kubectl)
kubectl get nodes -o wide
```

Fresh deploys require ./deploy.sh instead of a plain tofu apply. The Kubernetes and Helm providers try to authenticate against the cluster before it exists, so a single apply from an empty state always fails on the add-on resources. deploy.sh runs Phase 1 (VPC, EKS, S3) first, then Phase 2 (Helm charts, ArgoCD, Karpenter) once the API server is ready.
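
The two-phase idea can be sketched as a targeted apply followed by a full apply. This is not the contents of deploy.sh, which are not reproduced here. The module addresses module.vpc, module.eks and module.s3 are guesses based on the modules/aws/ directory names, and the sketch does not wait for the API server between phases. It defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List, Sequence

# Hypothetical module addresses, guessed from the modules/aws/ directory names.
PHASE1_TARGETS = ("module.vpc", "module.eks", "module.s3")


def phase_commands(targets: Sequence[str] = PHASE1_TARGETS) -> List[List[str]]:
    """Build the two tofu invocations: a targeted apply, then a full apply."""
    phase1 = ["tofu", "apply", "-auto-approve"]
    phase1 += [f"-target={t}" for t in targets]
    phase2 = ["tofu", "apply", "-auto-approve"]
    return [phase1, phase2]


def deploy(dry_run: bool = True) -> None:
    """Print both phases in order; run them only when dry_run is False."""
    for cmd in phase_commands():
        print("+", " ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing phase.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    deploy()  # dry run: prints the commands without executing them
```

Phase 1 limits the apply to the infrastructure the providers depend on; Phase 2 is an ordinary apply that picks up everything else once the cluster is reachable.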

Docs


| Document | Contents |
|----------|----------|
| QUICKSTART | Prerequisites, step-by-step deploy, teardown |
| VARIABLES | All input variables with types and defaults |
| BYOIP | Bring-your-own IPv6 prefix setup |

Structure

```
main.tf                   # root module wiring
variables.tf              # input variables
outputs.tf                # outputs
backend.tf                # S3 remote state
terraform.tfvars.example  # copy → terraform.tfvars
deploy.sh                 # two-phase deploy for fresh clusters

modules/aws/
  vpc/                    # dual-stack VPC
  eks/                    # EKS cluster + IAM
  eks-addons/             # Karpenter, GPU Operator, ArgoCD, S3 CSI
  s3/                     # ML data bucket

scripts/
  cost-killswitch.py      # emergency spend brake
```

Access

Team access is managed through the cluster_access variable — a map of IAM principals to EKS access policies (ClusterAdmin, Admin, Edit, View). See VARIABLES for details.
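
As an illustration of the principal-to-policy mapping, the sketch below validates a cluster_access-style map and resolves each level to the corresponding AWS-managed EKS access policy ARN. It assumes a flat map from IAM user or role ARN to one of the four level names; the actual variable schema is documented in VARIABLES and may differ. The resolve_access function and the example account ID are hypothetical.

```python
import re
from typing import Dict

# AWS-managed EKS access policies matching the four levels named above.
_PREFIX = "arn:aws:eks::aws:cluster-access-policy/"
POLICY_ARNS = {
    "ClusterAdmin": _PREFIX + "AmazonEKSClusterAdminPolicy",
    "Admin": _PREFIX + "AmazonEKSAdminPolicy",
    "Edit": _PREFIX + "AmazonEKSEditPolicy",
    "View": _PREFIX + "AmazonEKSViewPolicy",
}

# Assumption: principals are IAM user or role ARNs in the standard partition.
IAM_PRINCIPAL = re.compile(r"^arn:aws:iam::\d{12}:(user|role)/.+$")


def resolve_access(cluster_access: Dict[str, str]) -> Dict[str, str]:
    """Map each principal to a policy ARN, raising ValueError on bad entries."""
    errors = []
    resolved = {}
    for principal, level in cluster_access.items():
        if not IAM_PRINCIPAL.match(principal):
            errors.append(f"not an IAM user/role ARN: {principal}")
        elif level not in POLICY_ARNS:
            errors.append(f"unknown access level {level!r} for {principal}")
        else:
            resolved[principal] = POLICY_ARNS[level]
    if errors:
        raise ValueError("; ".join(errors))
    return resolved


if __name__ == "__main__":
    example = {"arn:aws:iam::111122223333:role/ml-engineers": "Edit"}
    print(resolve_access(example))
```

Checking the map before running tofu plan catches typos in level names early, which is cheaper than discovering them during an apply.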

State

Remote state is configured in backend.tf (S3 + DynamoDB lock, KMS-encrypted). Local state is gitignored, so without the backend a fresh checkout has no record of existing resources and every apply starts from scratch.

terraform.tfvars and terraform.tfstate are gitignored. Never commit them.
