Deploying

Nico Suter edited this page May 4, 2026 · 25 revisions

Context

Hercules is our compute cluster. You do not need to clone this repo or work with OpenTofu (tofu); the infra team handles that. End users should follow the instructions below and then interact with the cluster using kubectl and k9s.

First Steps

  • Install the AWS CLI, then run aws login
  • Get your k8s credentials via aws eks update-kubeconfig --name ethrc-prod-1 --region us-east-1

Note to admin users: You MUST NOT make any manual changes to IaC (Hercules) resources via the AWS dashboard under any circumstances. These changes will be overwritten during the next deployment and break the cluster in the meantime.

NOTE: Refer to the Hydra repository for example files for the next two steps.

Preparing Setup Script

You should write a setup script that installs any required dependencies and then executes your desired workload/command.

You can deploy this script to a GitHub repository or similar platform and then download it in the container entrypoint.
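
A minimal sketch of such a setup script, assuming a Python workload. requirements.txt, train.py, and the DRY_RUN switch are hypothetical names chosen for illustration; they are not part of Hercules or the Hydra examples.

```shell
#!/bin/sh
# Sketch of a setup script; requirements.txt and train.py are hypothetical names.
set -e

# Print each command; skip execution when DRY_RUN=1 (handy for local checks).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

main() {
  # 1. Install dependencies (assumes a Python workload with a requirements file).
  run python -m pip install --no-cache-dir -r requirements.txt
  # 2. Run the workload; extra arguments are passed through to it.
  #    With set -e, a failing step makes the script exit non-zero.
  run python train.py "$@"
}
```

The sketch only defines its steps. End the script with main "$@" so it runs when the container entrypoint invokes it, and try it locally first with DRY_RUN=1 to print the commands without executing them.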

Deploying Workloads

  • Write a deployment/job definition file. For the entrypoint, download the setup script created above (from GitHub, S3, or wherever you hosted it) and run it.

  • Create a volumeclaims.yaml that defines your storage volumes and reference those volumes in the deployment/job definition file. Apply the claims with kubectl apply -f volumeclaims.yaml

  • Deploy the workload: kubectl apply -f FILE

  • Check running workloads: kubectl get pods -n robot-learning

  • Monitor with k9s -n NAMESPACE (for example robot-learning; list all namespaces with kubectl get ns). Make sure your workload exits correctly!

  • To cancel/delete, run kubectl delete -f FILE
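
The steps above could look roughly like the sketch below, which writes a volume claim and a Job manifest to disk. Every name in it (example-data, example-train, the image, the script URL, the 50Gi size, the /data mount path) is a placeholder assumption, and the Hydra repository's example files remain the reference. How to target a specific node type (gpus, gpum, gpul) is cluster-specific and not shown.

```shell
# Write example manifests into a directory (default: current directory).
# All names, sizes, the image, and the URL below are placeholders.
write_manifests() {
  dir="${1:-.}"

  cat > "$dir/volumeclaims.yaml" <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data            # hypothetical claim name
  namespace: robot-learning
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi             # hypothetical size
EOF

  cat > "$dir/job.yaml" <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: example-train           # hypothetical job name
  namespace: robot-learning
spec:
  backoffLimit: 0               # do not retry a failed run
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: registry.example.com/trainer:latest   # placeholder; must provide sh and curl
          env:
            - name: SETUP_SCRIPT_URL
              value: https://example.com/setup.sh       # placeholder URL
          # Entrypoint: download the setup script, then run it.
          command: ["/bin/sh", "-c"]
          args:
            - curl -fsSL "$SETUP_SCRIPT_URL" -o /tmp/setup.sh && sh /tmp/setup.sh
          resources:
            limits:
              nvidia.com/gpu: 1   # assumes the NVIDIA device plugin resource name
          volumeMounts:
            - name: data
              mountPath: /data    # hypothetical mount path
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: example-data
EOF
}
```

After calling write_manifests, apply the claim first with kubectl apply -f volumeclaims.yaml, then the job with kubectl apply -f job.yaml.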

Pricing & Specs

Storage and network transfer costs are billed in addition to the instance prices below.

| Given Name | Actual AWS Instance Name | GPU | vGPU Count | VRAM | vCPU | RAM | us-east-1 Price (USD/hour) |
|---|---|---|---|---|---|---|---|
| gpus | g6.xlarge | NVIDIA L4 | 1 | 24 GB | 4 | 16 GiB | 0.8048 |
| gpum | g6e.xlarge | NVIDIA L40S | 1 | 48 GB | 4 | 32 GiB | 1.8610 |
| gpum | g6e.2xlarge | NVIDIA L40S | 1 | 48 GB | 8 | 64 GiB | 2.24208 |
| gpul | g6e.12xlarge | NVIDIA L40S | 4 | 192 GB total (4×48 GB) | 48 | 384 GiB | 10.49264 |
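
As a rough worked example of the arithmetic behind the table, the sketch below multiplies an hourly price by runtime and node count. estimate_cost is a hypothetical helper, not an existing tool; it covers instance hours only and ignores storage and transfer costs.

```shell
# Estimate instance cost: hourly price (USD) x hours x number of nodes.
estimate_cost() {
  price="$1"; hours="$2"; nodes="${3:-1}"
  awk -v p="$price" -v h="$hours" -v n="$nodes" 'BEGIN { printf "%.2f\n", p * h * n }'
}

estimate_cost 0.8048 24       # one gpus node for a day -> 19.32
estimate_cost 10.49264 8 2    # two gpul nodes for 8 hours -> 167.88
```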

H100 and A100 Availability

Note that H100 and A100 80GB capacity is very limited. A100 40GB may be available intermittently.

Useful Commands

  • View failed run: kubectl -n robot-learning describe trainjobs.trainer.kubeflow.org [YOUR_TRAINJOB] and kubectl -n robot-learning logs trainjobs.trainer.kubeflow.org [YOUR_TRAINJOB]
  • Add secret: kubectl -n robot-learning create secret generic wandb-secret --from-literal=WANDB_API_KEY='wandb_v1_Vaj...'
  • List S3 buckets: aws s3 ls
  • List files in S3 bucket: aws s3 ls s3://<bucket>/<folder>/, e.g. aws s3 ls s3://ethrc-ml-data-916780037007/ETHRC/towel_base_with_rewards
  • Copy file to S3 bucket: e.g. aws s3 cp test_dl.txt s3://ethrc-ml-data-916780037007/ETHRC/
  • View pods in a namespace: e.g. k9s -n robot-learning