Deploying
Hercules is our compute cluster. End users do not need to clone this repo or work with tofu; the infra team handles that. Instead, follow the instructions below and then interact with the cluster using kubectl and k9s.
- Download the AWS CLI, then run `aws login`.
- Get your k8s credentials via `aws eks update-kubeconfig --name ethrc-prod-1 --region us-east-1`.
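Before running the two commands above, it can help to confirm that the required CLIs are installed. The following is a minimal sketch, not an official onboarding script: the helper names `missing_tools` and `configure_cluster_access` are hypothetical, and the cluster name and region are the ones given above.

```shell
#!/usr/bin/env bash

# Print the name of every listed tool that is NOT on PATH (one per line).
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Wrap the two commands from the steps above. Defined here but not called,
# so that sourcing this file has no side effects.
configure_cluster_access() {
  aws login &&
    aws eks update-kubeconfig --name ethrc-prod-1 --region us-east-1
}

# Report which of the three CLIs used in this README are missing.
missing="$(missing_tools aws kubectl k9s)"
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing
else
  echo "All required tools found. Run configure_cluster_access next."
fi
```

Once all tools are reported as found, calling `configure_cluster_access` runs the login and kubeconfig steps in order and stops if the login fails.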
Note to admin users: You MUST NOT make any manual changes to IaC (Hercules) resources via the AWS dashboard under any circumstances. These changes will be overwritten during the next deployment and break the cluster in the meantime.
NOTE: Refer to the Hydra repository for example files for the next two steps.
- Write a setup script that installs any required dependencies and then executes your desired workload/command. You can host this script in a GitHub repository or on a similar platform and then download it in the container entrypoint.
- Write a deployment/job definition file. For the entrypoint, download the setup script created above (for example, from S3 or GitHub) and run it.
- Create a `volumeclaims.yaml` with your storage volumes, use those volumes in the deployment/job definition files, then apply it with `kubectl apply -f volumeclaims.yaml`.
- Run `kubectl apply -f <FILE>` to deploy the workload.
- Check running workloads: `kubectl get pods -n robot-learning`
- Monitor using `k9s -n <NAMESPACE>`. Make sure your workload exits correctly! (`<NAMESPACE>` is e.g. `robot-learning`; list the available namespaces with `kubectl get ns`.)
- To cancel/delete, run `kubectl delete -f <FILE>`.
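The steps above can be sketched as three small files. This is only an illustration under stated assumptions, not a copy of the Hydra examples: the file name `job.yaml`, the resource names `example-data` and `example-job`, the container image, the volume size, and the script URL are all hypothetical placeholders. GPU node selection is omitted because it depends on cluster configuration; refer to the Hydra repository for that.

```shell
# 1. Setup script: install dependencies, then run the workload.
#    Both commands below are placeholders for your own.
cat > setup.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
apt-get update && apt-get install -y python3
python3 -c 'print("workload goes here")'
EOF

# 2. Storage: one PersistentVolumeClaim (name and size are assumptions;
#    no storageClassName is set, so the cluster default would be used).
cat > volumeclaims.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data
  namespace: robot-learning
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi
EOF

# 3. Job: the entrypoint downloads the setup script and runs it.
#    SETUP_SCRIPT_URL is a hypothetical location; point it at wherever
#    you actually host setup.sh.
cat > job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
  namespace: robot-learning
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: ubuntu:24.04
          command: ["/bin/bash", "-c"]
          args:
            - |
              set -euo pipefail
              apt-get update && apt-get install -y curl
              curl -fsSL "$SETUP_SCRIPT_URL" -o /tmp/setup.sh
              bash /tmp/setup.sh
          env:
            - name: SETUP_SCRIPT_URL
              value: "https://example.com/setup.sh"
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: example-data
EOF

# Then apply both files, as described in the steps above:
#   kubectl apply -f volumeclaims.yaml
#   kubectl apply -f job.yaml
```

`restartPolicy: Never` with `backoffLimit: 0` is chosen here so that a failing setup script ends the Job instead of restarting it repeatedly, which keeps a broken workload from holding a node.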
Please keep in mind that storage and network transfer incur costs in addition to the hourly instance prices below.
| Given Name | Actual AWS Instance Name | GPU | vGPU Count | VRAM | vCPU | RAM | us-east-1 Price (USD/hour) |
|---|---|---|---|---|---|---|---|
| gpus | g6.xlarge | NVIDIA L4 | 1 | 24 GB | 4 | 16 GiB | 0.8048 |
| gpum | g6e.xlarge | NVIDIA L40S | 1 | 48 GB | 4 | 32 GiB | 1.8610 |
| gpum | g6e.2xlarge | NVIDIA L40S | 1 | 48 GB | 8 | 64 GiB | 2.24208 |
| gpul | g6e.12xlarge | NVIDIA L40S | 4 | 192 GB total (4×48 GB) | 48 | 384 GiB | 10.49264 |
Note that H100 and A100 80GB capacity is very limited. A100 40GB may be available intermittently.
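As a rough worked example of the compute cost implied by the table, the sketch below multiplies the hourly price by the node count and the runtime. The helper name `estimate_cost` is hypothetical, and the result covers compute only; storage and network transfer are not included.

```shell
# Estimate compute cost in USD: hourly price x number of nodes x hours.
estimate_cost() {
  awk -v price="$1" -v nodes="$2" -v hours="$3" \
    'BEGIN { printf "%.2f\n", price * nodes * hours }'
}

# One gpus node (g6.xlarge, 0.8048 USD/hour) for 24 hours:
estimate_cost 0.8048 1 24     # prints 19.32

# One gpul node (g6e.12xlarge, 10.49264 USD/hour) for 8 hours:
estimate_cost 10.49264 1 8    # prints 83.94
```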
- View failed run: `kubectl -n robot-learning describe trainjobs.trainer.kubeflow.org [YOUR_TRAINJOB]` and `kubectl -n robot-learning logs trainjobs.trainer.kubeflow.org [YOUR_TRAINJOB]`
- Add secret: `kubectl -n robot-learning create secret generic wandb-secret --from-literal=WANDB_API_KEY='wandb_v1_Vaj...'`
- List S3 buckets: `aws s3 ls`
- List files in S3 bucket: `aws s3 ls s3://<bucket>/<folder>/`, e.g. `aws s3 ls s3://ethrc-ml-data-916780037007/ETHRC/towel_base_with_rewards`
- Copy file to S3 bucket: e.g. `aws s3 cp test_dl.txt s3://ethrc-ml-data-916780037007/ETHRC/` (swap the two arguments to download instead)
- View pods in namespace: e.g. `k9s -n robot-learning`
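To spot failed runs quickly without opening k9s, the output of `kubectl get pods` can be filtered by status. The following is a small sketch under assumptions: the helper name `unhealthy_pods` is hypothetical, and the two sample lines are made-up data in the column layout that `kubectl get pods --no-headers` prints (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the names
# of pods whose STATUS column is neither Running nor Completed.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Demonstration with made-up sample lines (not real cluster output):
printf '%s\n' \
  'train-abc   1/1   Running            0   5m' \
  'train-def   0/1   CrashLoopBackOff   4   9m' | unhealthy_pods
# prints train-def

# Usage against the cluster:
#   kubectl get pods -n robot-learning --no-headers | unhealthy_pods
```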