⚠️ **Important Note**: This project creates custom development environments in Kubernetes pods using standard Kubernetes resources. It is not related to or affiliated with the DevPod project by Loft Labs. We use "DevPod" here in the generic sense of "a development pod": a Kubernetes pod configured for development purposes.
This project demonstrates how to use development pods inside a Kubernetes cluster for PyTorch machine learning development, providing a persistent development environment that doesn't require rebuilding containers for code changes.
- Persistent Development: Your code lives in persistent storage, not in the container
- Remote Development: SSH access for tools like Zed, VS Code Remote, or terminal
- GPU Access: Full access to node GPUs for both interactive development and training jobs
- Multi-Architecture Support: Works on both AMD64 and ARM64 GPU nodes (Grace Hopper)
- No Container Rebuilds: Iterate on code without rebuilding/pushing containers
- Shared Resources: Same data and output volumes across dev environment and training jobs
```
┌───────────────────────────────────────────┐
│       CoreWeave Kubernetes Cluster        │
├───────────────────────────────────────────┤
│                                           │
│  ┌─────────────────┐   ┌───────────────┐  │
│  │     Dev Pod     │   │ Training Jobs │  │
│  │  - SSH Server   │   │ - 1, 4, 8 GPU │  │
│  │  - PyTorch+CUDA │   │ - Batch Jobs  │  │
│  │  - 1 GPU        │   │ - Same PVCs   │  │
│  │  - ARM64/AMD64  │   │ - Multi-arch  │  │
│  └─────────────────┘   └───────────────┘  │
│           │                    │          │
│  ┌─────────────────────────────────────┐  │
│  │         Persistent Volumes          │  │
│  │  - /workspace (code)                │  │
│  │  - /data (datasets)                 │  │
│  │  - /outputs (models/logs)           │  │
│  │  - /cache (pip/huggingface)         │  │
│  └─────────────────────────────────────┘  │
└───────────────────────────────────────────┘
                     │
            ┌─────────────────┐
            │    macOS/Zed    │
            │   SSH Client    │
            └─────────────────┘
```
- Kubernetes cluster with GPU nodes (tested on CoreWeave)
- `kubectl` configured with cluster access
- Docker for building images
- SSH key pair (`~/.ssh/id_ed25519`)
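If you don't yet have an ed25519 key pair at that path, generating one is a single standard command (`-N ""` sets an empty passphrase; drop it to be prompted instead):

```shell
# Create ~/.ssh if needed, then generate an ed25519 key pair at the
# default path this setup expects; skip if a key already exists.
mkdir -p ~/.ssh
[ -f ~/.ssh/id_ed25519 ] || ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" -C "ml-dev"
```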
```bash
git clone <this-repo>
cd devpod-demo
```

Edit `config.env` and set your container registry:
```bash
# Container Registry Settings
REGISTRY="ghcr.io"
ORG="your-github-org"    # <-- Change this to your GitHub org
IMAGE_NAME="devpod-demo"
IMAGE_TAG="latest"
```

```bash
chmod +x setup.sh generate-manifests.sh
./setup.sh
```

The setup script will:
- Load your configuration from `config.env`
- Generate manifests with your registry settings
- Verify prerequisites
- Create SSH key secret in Kubernetes
- Build the multi-arch PyTorch+SSH container image
- Deploy storage and development pod
- Provide connection details
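As a rough illustration of the configuration step: `config.env` is just shell-sourceable variables, from which the full image reference is composed. This is a simplified sketch, not the real script (the `/tmp` path is only so the example runs anywhere):

```shell
# Write a sample config.env (normally you edit the one in the repo)
cat > /tmp/config.env <<'EOF'
REGISTRY="ghcr.io"
ORG="your-github-org"
IMAGE_NAME="devpod-demo"
IMAGE_TAG="latest"
EOF

# Source it and compose the full image reference, as setup.sh would
. /tmp/config.env
IMAGE="$REGISTRY/$ORG/$IMAGE_NAME:$IMAGE_TAG"
echo "$IMAGE"   # ghcr.io/your-github-org/devpod-demo:latest
```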
Start port-forwarding:
```bash
./port-forward.sh start
```

Add to your `~/.ssh/config`:
```
Host ml-dev
  HostName localhost
  Port 2222
  User dev
  IdentityFile ~/.ssh/id_ed25519
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
  LogLevel ERROR
```
Connect:
```bash
ssh ml-dev
```

First ensure port-forwarding is running, then in Zed: File → Open… → Remote via SSH → `dev@ml-dev:/workspace`
Your persistent workspace is at /workspace. Code changes survive pod restarts.
```bash
# In the SSH session - test GPU functionality
cd /workspace
python hello_gpu.py
```

```
devpod-demo/
├── README.md                  # This file
├── config.env                 # Single configuration file (edit this!)
├── setup.sh                   # Interactive setup script
├── generate-manifests.sh      # Generate manifests from config
├── port-forward.sh            # Port-forward helper for SSH access
├── quick-start.sh             # Quick deployment script
├── run-job.sh                 # Training job submission helper
├── docker/
│   ├── Dockerfile             # Minimal SSH + dev user setup (uses CoreWeave PyTorch base)
│   ├── start-dev.sh           # Container startup script
│   └── .dockerignore          # Keep builds clean
├── k8s/                       # Generated manifests (don't edit directly!)
│   ├── 01-storage.yaml        # PVCs for workspace/data/outputs/cache
│   ├── 02-dev-statefulset.yaml  # Development StatefulSet with SSH + 1 GPU
│   └── 03-training-job.yaml   # Training job templates (1/8 GPU, CPU)
└── examples/
    ├── hello_gpu.py           # Simple GPU hello world test
    └── test_multigpu.py       # Multi-GPU DDP test
```
Edit config.env with your registry/org settings:
```bash
# Change these values
REGISTRY="ghcr.io"
ORG="your-github-org"
IMAGE_NAME="devpod-demo"
```

```bash
./generate-manifests.sh --all
```

This creates all K8s manifests with your configuration and proper GPU node scheduling.
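Conceptually, the generation step is simple template substitution, something along these lines (a hypothetical sketch, not the actual script; the `__IMAGE__` placeholder and `/tmp` paths are invented for illustration):

```shell
ORG="your-github-org"
IMAGE="ghcr.io/$ORG/devpod-demo:latest"

# A manifest stub with a placeholder for the configured image
cat > /tmp/dev-pod.yaml.tpl <<'EOF'
spec:
  nodeSelector:
    node.coreweave.cloud/class: gpu
  containers:
  - name: dev
    image: __IMAGE__
EOF

# Substitute config values into the template
sed "s|__IMAGE__|$IMAGE|" /tmp/dev-pod.yaml.tpl > /tmp/dev-pod.yaml
grep "image:" /tmp/dev-pod.yaml   # shows the substituted image reference
```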
```bash
kubectl create namespace ml   # or whatever you set in config.env
kubectl create secret generic ml-dev-ssh-keys \
  --from-file=authorized_keys="$HOME/.ssh/id_ed25519.pub" \
  -n ml
```

The container uses CoreWeave's PyTorch image as its base, providing PyTorch 2.8.0 with CUDA 12.9. The setup script handles multi-platform builds automatically. For manual builds:
```bash
cd docker/
# Multi-platform build for both ARM64 and AMD64
docker buildx build --platform linux/amd64,linux/arm64 --push -t ghcr.io/your-org/devpod-demo:latest .
```

```bash
kubectl apply -f k8s/01-storage.yaml
kubectl apply -f k8s/02-dev-statefulset.yaml
```

Start port-forwarding and connect:
```bash
# Start port-forwarding (runs in background)
./port-forward.sh start

# Connect via SSH
ssh dev@localhost -p 2222
```

```bash
# Test GPU functionality in dev pod
ssh ml-dev
cd /workspace
python hello_gpu.py

# Or submit training jobs using run-job.sh
./run-job.sh --help
```

```bash
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: custom-training
  namespace: ml
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        node.coreweave.cloud/class: gpu
      containers:
      - name: trainer
        image: ghcr.io/your-org/devpod-demo:latest
        command: ["python", "/workspace/my_training_script.py"]
        args: ["--epochs", "10", "--batch-size", "32"]
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - {name: workspace, mountPath: /workspace}
        - {name: datasets, mountPath: /data}
        - {name: outputs, mountPath: /outputs}
      volumes:
      - {name: workspace, persistentVolumeClaim: {claimName: ml-workspace}}
      - {name: datasets, persistentVolumeClaim: {claimName: ml-datasets}}
      - {name: outputs, persistentVolumeClaim: {claimName: ml-outputs}}
EOF
```

```bash
# Edit and apply the 8-GPU job template
kubectl apply -f k8s/03-training-job.yaml
```

```bash
# List jobs
kubectl get jobs -n ml

# Watch job logs
kubectl logs -f job/pytorch-train-8gpu -n ml

# Get pod details
kubectl get pods -n ml -l job-name=pytorch-train-8gpu

# Or use the job runner
./run-job.sh logs pytorch-train-8gpu
```

This project supports both AMD64 and ARM64 architectures:
- AMD64 (x86_64): Traditional GPU servers
- ARM64 (aarch64): Grace Hopper and other ARM-based GPU nodes
- Automatic Architecture Detection: The setup script detects your host architecture
- Multi-Platform Base Image: Uses CoreWeave's multi-arch PyTorch containers
- Smart Scheduling: Automatically schedules on GPU nodes using `node.coreweave.cloud/class: gpu`
- Cross-Architecture Development: Develop on Apple Silicon, deploy to any GPU architecture
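The "automatic architecture detection" amounts to mapping the host's `uname -m` output to a Docker platform string, roughly like this (a sketch of the idea, not the script's exact code):

```shell
# Map the host architecture to the corresponding Docker platform name
case "$(uname -m)" in
  x86_64)        PLATFORM="linux/amd64" ;;
  arm64|aarch64) PLATFORM="linux/arm64" ;;
  *)             echo "unsupported architecture: $(uname -m)" >&2; exit 1 ;;
esac
echo "host platform: $PLATFORM"
```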
We use `ghcr.io/coreweave/ml-containers/torch:es-ubuntu-24-dev-2dd65d0-base-cuda12.9.1-ubuntu24.04-torch2.8.0-vision0.23.0-audio2.8.0-abi1`, which provides:
- ✅ Multi-architecture support (AMD64 + ARM64)
- ✅ CUDA 12.9 support
- ✅ PyTorch 2.8.0 with GPU acceleration
- ✅ Pre-optimized for NVIDIA GPUs
- ✅ Ubuntu 24.04 base with development tools
```bash
# Check node architecture and GPU availability
kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture,GPU:.status.capacity.nvidia\.com/gpu

# Verify pod is on correct architecture
kubectl exec -n ml ml-dev-0 -- uname -m

# Check if container matches node architecture
kubectl describe pod ml-dev-0 -n ml | grep "Node:\|Image:"
```

The SSH config above includes options to handle frequently changing host keys (common when pods are recreated):
- `StrictHostKeyChecking no` - Automatically accepts new host keys
- `UserKnownHostsFile /dev/null` - Doesn't store host keys
- `LogLevel ERROR` - Suppresses host key warnings
```bash
# Check port-forward status
./port-forward.sh status

# Restart port-forward
./port-forward.sh restart

# Check dev pod status
kubectl get pods -n ml -l app=ml-dev

# Get pod logs
kubectl logs -n ml ml-dev-0

# Manual port-forward if needed
kubectl port-forward -n ml svc/ml-dev 2222:22 8888:8888 6006:6006

# If you still get host key errors, you can also clear the known_hosts entry:
ssh-keygen -R "[localhost]:2222"
```

```bash
# Check GPU visibility in dev pod
kubectl exec -n ml ml-dev-0 -- nvidia-smi

# Check CUDA availability in Python
kubectl exec -n ml ml-dev-0 -- python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Devices: {torch.cuda.device_count()}')"

# Check GPU allocation on nodes
kubectl describe nodes | grep -A 5 -B 5 nvidia.com/gpu
```

```bash
# Check for exec format errors
kubectl logs -n ml ml-dev-0

# Verify image architecture
docker buildx imagetools inspect ghcr.io/your-org/devpod-demo:latest

# Pull and test image locally
docker run --rm ghcr.io/your-org/devpod-demo:latest uname -m
```

```bash
# Check PVC status
kubectl get pvc -n ml

# Check available storage
kubectl exec -n ml ml-dev-0 -- df -h
```

The project automatically schedules pods on GPU nodes using:
```yaml
nodeSelector:
  node.coreweave.cloud/class: gpu
```

For other clusters, edit `generate-manifests.sh` to change the nodeSelector.
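For example, on clusters that run NVIDIA GPU Feature Discovery, a selector like the following might work instead (an assumption about your cluster's labels — check `kubectl get nodes --show-labels` to see what actually applies):

```yaml
# Hypothetical alternative: select nodes by the GPU Feature Discovery label
nodeSelector:
  nvidia.com/gpu.present: "true"
```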
Edit config.env:
```bash
DEFAULT_DEV_GPU_LIMIT="2"   # Change from 1 to 2 GPUs
```

Then regenerate manifests:

```bash
./generate-manifests.sh --all
kubectl apply -f k8s/02-dev-statefulset.yaml
```

The CoreWeave PyTorch base image includes most ML packages you'll need (PyTorch, transformers, etc.).
For additional packages, add them to the Dockerfile and rebuild the image:
```dockerfile
# Add to docker/Dockerfile after the base image
RUN pip install package-name another-package
```

Then rebuild and deploy:
```bash
# Use setup.sh for interactive rebuild
./setup.sh   # Select option 3: Build and update image only

# Restart pod with new image
kubectl rollout restart statefulset/ml-dev -n ml
```

Note: Avoid installing packages directly in the running pod, as they will be lost when the pod restarts.
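If you do want to experiment without a rebuild, one common workaround (our suggestion, not something this repo configures) is a virtualenv on the persistent volume, so installs survive restarts:

```shell
# In the dev pod, /workspace is the persistent mount; the fallback path
# is used here only so the sketch runs outside the pod too.
WORKSPACE="${WORKSPACE:-/tmp/workspace}"
mkdir -p "$WORKSPACE"

# Create the venv once on persistent storage...
python3 -m venv "$WORKSPACE/venv"

# ...then activate it in each session and pip install into it
. "$WORKSPACE/venv/bin/activate"
python -c "import sys; print(sys.prefix)"
```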
Edit config.env:
```bash
WORKSPACE_SIZE="100Gi"   # Increase workspace
DATASETS_SIZE="1Ti"      # Increase dataset storage
```

Regenerate and apply:

```bash
./generate-manifests.sh --all
kubectl apply -f k8s/01-storage.yaml
```

- Run `./setup.sh` with your registry settings
- SSH into the dev pod and test the GPU: `ssh ml-dev`, then `python /workspace/hello_gpu.py`
- Start developing: your code in `/workspace` persists across pod restarts
- Create your own training scripts in `/workspace`
Happy coding! 🎉 This setup gives you a robust, persistent, multi-architecture ML development environment on Kubernetes that scales from experimentation to production training.