Deploying

Context

Hercules is our compute cluster. You do not need to clone this repo or work with tofu as this is handled by the infra team. End users should follow the instructions below and then interface with the cluster using kubectl and k9s.

First Steps

Install the AWS CLI and run aws login
Get your k8s credentials by running aws eks update-kubeconfig --name ethrc-prod-1 --region us-east-1
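
The two steps above can be sketched as a small helper that also checks the result. This is only a sketch: the function name connect_hercules is hypothetical, and it assumes you have already logged in with the AWS CLI. The cluster name and region are the ones given above.

```shell
#!/usr/bin/env bash
# Cluster name and region are taken from the step above.
CLUSTER_NAME="ethrc-prod-1"
REGION="us-east-1"

# Hypothetical helper: verifies tooling and credentials, then fetches the kubeconfig.
connect_hercules() {
  # Fail early if a required tool is not installed.
  for tool in aws kubectl; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing tool: $tool" >&2; return 1; }
  done
  # Confirm the AWS credentials are valid before touching the kubeconfig.
  aws sts get-caller-identity >/dev/null || return 1
  # Add the cluster entry to your kubeconfig.
  aws eks update-kubeconfig --name "$CLUSTER_NAME" --region "$REGION" || return 1
  # Sanity check: list the namespaces you are allowed to see.
  kubectl get ns
}

# Run it with: connect_hercules
```

If the final kubectl get ns call lists robot-learning (or your team's namespace), the credentials are working.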

Note to admin users: You MUST NOT make any manual changes to IaC (Hercules) resources via the AWS dashboard under any circumstances. These changes will be overwritten during the next deployment and break the cluster in the meantime.

NOTE: Refer to the Hydra repository for example files for the next two steps.

Preparing Setup Script

You should write a setup script that installs any required dependencies and then executes your desired workload/command.

You can deploy this script to a GitHub repository or similar platform and then download it in the container entrypoint.
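
As an illustration, a setup script might look like the sketch below. The file names requirements.txt and train.py are placeholders and not part of Hercules; the Hydra repository remains the reference for real examples. The snippet writes the script to setup.sh and only syntax-checks it.

```shell
# Write a hypothetical setup script to setup.sh (nothing is installed or run here).
cat > setup.sh <<'EOF'
#!/usr/bin/env bash
# Hypothetical setup script: install dependencies, then run the workload.
set -euo pipefail

# Placeholder dependency install; replace with whatever your workload needs.
python3 -m pip install --no-cache-dir -r requirements.txt

# Placeholder workload command; exec makes its exit code the container's exit code.
exec python3 train.py "$@"
EOF

chmod +x setup.sh
bash -n setup.sh   # syntax check only; does not execute the script
```

Using set -euo pipefail and exec means a failed install or a crashed workload surfaces as a non-zero exit code instead of a job that appears to succeed.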

Deploying Workloads

Write a deployment/job definition file. For the entrypoint, download the setup script created above (from wherever you hosted it, e.g. S3 or GitHub) and run it.
Create a volumeclaims.yaml that declares your storage volumes, reference those volumes in the deployment/job definition file, and apply the claims with kubectl apply -f volumeclaims.yaml
Run kubectl apply -f FILE to deploy the workload
Check running workloads: kubectl get pods -n robot-learning
Monitor using k9s -n NAMESPACE, where NAMESPACE is e.g. robot-learning (list the available namespaces with kubectl get ns). Make sure your workload exits correctly!
To cancel/delete, run kubectl delete -f <FILE>
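
The steps above could look roughly like the following sketch, which writes a minimal volumeclaims.yaml and job.yaml. Every name in it is an assumption for illustration: example-data, example-job, the 50Gi size, the image your-registry/your-image:tag (assumed to contain bash and curl) and the URL https://example.com/setup.sh. GPU requests and node selection are left out because they depend on the cluster configuration; see the Hydra repository for real example files.

```shell
# Hypothetical storage claim; name, size and access mode are placeholders.
cat > volumeclaims.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data
  namespace: robot-learning
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi
EOF

# Hypothetical job; the entrypoint downloads the setup script and runs it.
cat > job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
  namespace: robot-learning
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: your-registry/your-image:tag
          command: ["/bin/bash", "-c"]
          args:
            - |
              set -euo pipefail
              curl -fsSL "https://example.com/setup.sh" -o /tmp/setup.sh
              bash /tmp/setup.sh
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: example-data
EOF

# Then, as described in the steps above:
#   kubectl apply -f volumeclaims.yaml
#   kubectl apply -f job.yaml
#   kubectl get pods -n robot-learning
#   kubectl delete -f job.yaml
```

With restartPolicy: Never and backoffLimit: 0, a failing setup script is not retried, so the job ends once instead of continuing to consume GPU hours.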

Pricing & Specs

The prices below cover compute only. Please keep in mind that storage costs and network transfer costs come on top.

Given Name	Actual AWS Instance Name	GPU	vGPU Count	VRAM	vCPU	RAM	us-east-1 Price (USD/hour)
gpus	g6.xlarge	NVIDIA L4	1	24 GB	4	16 GiB	0.8048
gpum	g6e.xlarge	NVIDIA L40S	1	48 GB	4	32 GiB	1.8610
gpum	g6e.2xlarge	NVIDIA L40S	1	48 GB	8	64 GiB	2.24208
gpul	g6e.12xlarge	NVIDIA L40S	4	192 GB total (4×48 GB)	48	384 GiB	10.49264
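
As a rough worked example of the hourly prices in the table, the sketch below multiplies price, hours and instance count. The helper name estimate_cost is hypothetical, and the result covers compute only (no storage or network transfer).

```shell
# Hypothetical helper: compute-only cost estimate in USD.
# $1: hourly price in USD, $2: hours, $3: number of instances (default 1)
estimate_cost() {
  awk -v p="$1" -v h="$2" -v n="${3:-1}" 'BEGIN { printf "%.2f\n", p * h * n }'
}

estimate_cost 10.49264 24    # one gpul (g6e.12xlarge) for 24 hours, prints 251.82
estimate_cost 0.8048 8 2     # two gpus (g6.xlarge) for 8 hours each, prints 12.88
```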

H100 and A100 Availability

Note that H100 and A100 80GB capacity is very limited. A100 40GB may be available intermittently.

Useful Commands

View failed run: kubectl -n robot-learning describe trainjobs.trainer.kubeflow.org [YOUR_TRAINJOB] and kubectl -n robot-learning logs trainjobs.trainer.kubeflow.org [YOUR_TRAINJOB]
Add secret: kubectl -n robot-learning create secret generic wandb-secret --from-literal=WANDB_API_KEY='wandb_v1_Vaj...'
List S3 buckets: aws s3 ls
List files in S3 bucket: aws s3 ls s3://<bucket>/<folder>/ e.g. aws s3 ls s3://ethrc-ml-data-916780037007/ETHRC/towel_base_with_rewards
Copy file to S3 bucket: aws s3 cp <file> s3://<bucket>/<folder>/ e.g. aws s3 cp test_dl.txt s3://ethrc-ml-data-916780037007/ETHRC/
View pods in a namespace: e.g. k9s -n robot-learning
