Skip to content

pamelia/devpod-demo

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

12 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

DevPod on Kubernetes for ML Development

⚠️ Important Note: This project creates custom development environments in Kubernetes pods using standard Kubernetes resources. It is not related to or affiliated with the DevPod project by Loft Labs. We use "DevPod" here in the generic sense of "a development pod" - a Kubernetes pod configured for development purposes.

This project demonstrates how to use development pods inside a Kubernetes cluster for PyTorch machine learning development, providing a persistent development environment that doesn't require rebuilding containers for code changes.

🎯 What This Solves

  • Persistent Development: Your code lives in persistent storage, not in the container
  • Remote Development: SSH access for tools like Zed, VS Code Remote, or terminal
  • GPU Access: Full access to node GPUs for both interactive development and training jobs
  • Multi-Architecture Support: Works on both AMD64 and ARM64 GPU nodes (Grace Hopper)
  • No Container Rebuilds: Iterate on code without rebuilding/pushing containers
  • Shared Resources: Same data and output volumes across dev environment and training jobs

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ CoreWeave Kubernetes Cluster            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚   Dev Pod       β”‚  β”‚ Training Jobs β”‚ β”‚
β”‚  β”‚ - SSH Server    β”‚  β”‚ - 1, 4, 8 GPU β”‚ β”‚
β”‚  β”‚ - PyTorch+CUDA  β”‚  β”‚ - Batch Jobs  β”‚ β”‚
β”‚  β”‚ - 1 GPU         β”‚  β”‚ - Same PVCs   β”‚ β”‚
β”‚  β”‚ - ARM64/AMD64   β”‚  β”‚ - Multi-arch  β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚           β”‚                     β”‚       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚        Persistent Volumes          β”‚ β”‚
β”‚  β”‚ - /workspace (code)                β”‚ β”‚
β”‚  β”‚ - /data (datasets)                 β”‚ β”‚
β”‚  β”‚ - /outputs (models/logs)           β”‚ β”‚
β”‚  β”‚ - /cache (pip/huggingface)         β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚   macOS/Zed     β”‚
          β”‚  SSH Client     β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸš€ Quick Start

Prerequisites

  • Kubernetes cluster with GPU nodes (tested on CoreWeave)
  • kubectl configured with cluster access
  • Docker for building images
  • SSH key pair (~/.ssh/id_ed25519)

1. Clone and Configure

git clone <this-repo>
cd devpod-demo

2. Edit Configuration

Edit config.env and set your container registry:

# Container Registry Settings
REGISTRY="ghcr.io"
ORG="your-github-org"  # <-- Change this to your GitHub org
IMAGE_NAME="devpod-demo"
IMAGE_TAG="latest"

3. Run Setup

chmod +x setup.sh generate-manifests.sh
./setup.sh

The setup script will:

  • Load your configuration from config.env
  • Generate manifests with your registry settings
  • Verify prerequisites
  • Create SSH key secret in Kubernetes
  • Build the multi-arch PyTorch+SSH container image
  • Deploy storage and development pod
  • Provide connection details

4. Setup Port-Forwarding and SSH

Start port-forwarding:

./port-forward.sh start

Add to your ~/.ssh/config:

Host ml-dev
  HostName localhost
  Port 2222
  User dev
  IdentityFile ~/.ssh/id_ed25519
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
  LogLevel ERROR

Connect:

ssh ml-dev

5. Connect with Zed (macOS)

First ensure port-forwarding is running, then in Zed: File β†’ Open… β†’ Remote via SSH β†’ dev@ml-dev:/workspace

6. Start Developing

Your persistent workspace is at /workspace. Code changes survive pod restarts.

# In the SSH session - test GPU functionality
cd /workspace
python hello_gpu.py

πŸ“ Project Structure

devpod-demo/
β”œβ”€β”€ README.md                    # This file
β”œβ”€β”€ config.env                   # Single configuration file (edit this!)
β”œβ”€β”€ setup.sh                     # Interactive setup script
β”œβ”€β”€ generate-manifests.sh        # Generate manifests from config
β”œβ”€β”€ port-forward.sh              # Port-forward helper for SSH access
β”œβ”€β”€ quick-start.sh               # Quick deployment script
β”œβ”€β”€ run-job.sh                   # Training job submission helper
β”œβ”€β”€ docker/
β”‚   β”œβ”€β”€ Dockerfile              # Minimal SSH + dev user setup (uses CoreWeave PyTorch base)
β”‚   β”œβ”€β”€ start-dev.sh            # Container startup script
β”‚   └── .dockerignore           # Keep builds clean
β”œβ”€β”€ k8s/                        # Generated manifests (don't edit directly!)
β”‚   β”œβ”€β”€ 01-storage.yaml         # PVCs for workspace/data/outputs/cache
β”‚   β”œβ”€β”€ 02-dev-statefulset.yaml # Development StatefulSet with SSH + 1 GPU
β”‚   └── 03-training-job.yaml    # Training job templates (1/8 GPU, CPU)
└── examples/
    β”œβ”€β”€ hello_gpu.py            # Simple GPU hello world test
    └── test_multigpu.py        # Multi-GPU DDP test

πŸ”§ Manual Deployment (Alternative to setup.sh)

1. Configure Your Setup

Edit config.env with your registry/org settings:

# Change these values
REGISTRY="ghcr.io"
ORG="your-github-org"
IMAGE_NAME="devpod-demo"

2. Generate Manifests

./generate-manifests.sh --all

This creates all K8s manifests with your configuration and proper GPU node scheduling.

3. Create SSH Key Secret

kubectl create namespace ml  # or whatever you set in config.env
kubectl create secret generic ml-dev-ssh-keys \
  --from-file=authorized_keys=~/.ssh/id_ed25519.pub \
  -n ml

4. Build Container Image

The container uses CoreWeave's PyTorch image as base with PyTorch 2.8.0 + CUDA 12.9 support. The setup script handles multi-platform builds automatically. For manual builds:

cd docker/
# Multi-platform build for both ARM64 and AMD64
docker buildx build --platform linux/amd64,linux/arm64 --push -t ghcr.io/your-org/devpod-demo:latest .

5. Deploy Resources

kubectl apply -f k8s/01-storage.yaml
kubectl apply -f k8s/02-dev-statefulset.yaml

6. Connect via Port-Forward

Start port-forwarding and connect:

# Start port-forwarding (runs in background)
./port-forward.sh start

# Connect via SSH
ssh dev@localhost -p 2222

πŸƒβ€β™‚οΈ Testing GPU Functionality

Quick GPU Tests

# Test GPU functionality in dev pod
ssh ml-dev
cd /workspace
python hello_gpu.py

View Test Results

# Or submit training jobs using run-job.sh
./run-job.sh --help

πŸƒβ€β™‚οΈ Running Training Jobs

Submit Single GPU Job

kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: custom-training
  namespace: ml
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        node.coreweave.cloud/class: gpu
      containers:
      - name: trainer
        image: ghcr.io/your-org/devpod-demo:latest
        command: ["python", "/workspace/my_training_script.py"]
        args: ["--epochs", "10", "--batch-size", "32"]
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - {name: workspace, mountPath: /workspace}
        - {name: datasets, mountPath: /data}
        - {name: outputs, mountPath: /outputs}
      volumes:
      - {name: workspace, persistentVolumeClaim: {claimName: ml-workspace}}
      - {name: datasets, persistentVolumeClaim: {claimName: ml-datasets}}
      - {name: outputs, persistentVolumeClaim: {claimName: ml-outputs}}
EOF

Submit 8-GPU Distributed Job

# Edit and apply the 8-GPU job template
kubectl apply -f k8s/03-training-job.yaml

Monitor Jobs

# List jobs
kubectl get jobs -n ml

# Watch job logs
kubectl logs -f job/pytorch-train-8gpu -n ml

# Get pod details
kubectl get pods -n ml -l job-name=pytorch-train-8gpu

# Or use the job runner
./run-job.sh logs pytorch-train-8gpu

πŸ”§ Multi-Architecture Support

This project supports both AMD64 and ARM64 architectures:

Supported Platforms

  • AMD64 (x86_64): Traditional GPU servers
  • ARM64 (aarch64): Grace Hopper and other ARM-based GPU nodes

Key Features

  • Automatic Architecture Detection: The setup script detects your host architecture
  • Multi-Platform Base Image: Uses NVIDIA's multi-arch PyTorch containers
  • Smart Scheduling: Automatically schedules on GPU nodes using node.coreweave.cloud/class: gpu
  • Cross-Architecture Development: Develop on Apple Silicon, deploy to any GPU architecture

Base Image

We use ghcr.io/coreweave/ml-containers/torch:es-ubuntu-24-dev-2dd65d0-base-cuda12.9.1-ubuntu24.04-torch2.8.0-vision0.23.0-audio2.8.0-abi1 which provides:

  • βœ… Multi-architecture support (AMD64 + ARM64)
  • βœ… CUDA 12.9 support
  • βœ… PyTorch 2.8.0 with GPU acceleration
  • βœ… Pre-optimized for NVIDIA GPUs
  • βœ… Ubuntu 24.04 base with development tools

πŸ› Troubleshooting

Architecture Issues

# Check node architecture and GPU availability
kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture,GPU:.status.capacity.nvidia\.com/gpu

# Verify pod is on correct architecture
kubectl exec -n ml ml-dev-0 -- uname -m

# Check if container matches node architecture
kubectl describe pod ml-dev-0 -n ml | grep "Node:\|Image:"

SSH Connection Issues

The SSH config above includes options to handle frequently changing host keys (common when pods are recreated):

  • StrictHostKeyChecking no - Automatically accepts new host keys
  • UserKnownHostsFile /dev/null - Doesn't store host keys
  • LogLevel ERROR - Suppresses host key warnings
# Check port-forward status
./port-forward.sh status

# Restart port-forward
./port-forward.sh restart

# Check dev pod status
kubectl get pods -n ml -l app=ml-dev

# Get pod logs
kubectl logs -n ml ml-dev-0

# Manual port-forward if needed
kubectl port-forward -n ml svc/ml-dev 2222:22 8888:8888 6006:6006

# If you still get host key errors, you can also clear the known_hosts entry:
ssh-keygen -R "[localhost]:2222"

GPU Issues

# Check GPU visibility in dev pod
kubectl exec -n ml ml-dev-0 -- nvidia-smi

# Check CUDA availability in Python
kubectl exec -n ml ml-dev-0 -- python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Devices: {torch.cuda.device_count()}')"

# Check GPU allocation on nodes
kubectl describe nodes | grep -A 5 -B 5 nvidia.com/gpu

Container Issues

# Check for exec format errors
kubectl logs -n ml ml-dev-0

# Verify image architecture
docker buildx imagetools inspect ghcr.io/your-org/devpod-demo:latest

# Pull and test image locally
docker run --rm ghcr.io/your-org/devpod-demo:latest uname -m

Storage Issues

# Check PVC status
kubectl get pvc -n ml

# Check available storage
kubectl exec -n ml ml-dev-0 -- df -h

πŸŽ›οΈ Customization

GPU Node Scheduling

The project automatically schedules pods on GPU nodes using:

nodeSelector:
  node.coreweave.cloud/class: gpu

For other clusters, edit generate-manifests.sh to change the nodeSelector.

Adjust GPU Allocation

Edit config.env:

DEFAULT_DEV_GPU_LIMIT="2"  # Change from 1 to 2 GPUs

Then regenerate manifests:

./generate-manifests.sh --all
kubectl apply -f k8s/02-dev-statefulset.yaml

Add Python Packages

The CoreWeave PyTorch base image includes most ML packages you'll need (PyTorch, transformers, etc.).

For additional packages, add them to the Dockerfile and rebuild the image:

# Add to docker/Dockerfile after the base image
RUN pip install package-name another-package

Then rebuild and deploy:

# Use setup.sh for interactive rebuild
./setup.sh  # Select option 3: Build and update image only

# Restart pod with new image
kubectl rollout restart statefulset/ml-dev -n ml

Note: Avoid installing packages directly in the running pod as they will be lost when the pod restarts.

Change Storage Sizes

Edit config.env:

WORKSPACE_SIZE="100Gi"   # Increase workspace
DATASETS_SIZE="1Ti"      # Increase dataset storage

Regenerate and apply:

./generate-manifests.sh --all
kubectl apply -f k8s/01-storage.yaml

πŸ“š Next Steps

Quick Start Checklist

  • Run ./setup.sh with your registry settings
  • SSH into dev pod and test GPU: ssh ml-dev, then python /workspace/hello_gpu.py
  • Start developing: your code in /workspace persists across pod restarts
  • Create your own training scripts in /workspace

Happy coding! πŸš€ This setup gives you a robust, persistent, multi-architecture ML development environment on Kubernetes that scales from experimentation to production training.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors