Ollama GPU Setup Guide - Qwen2.5-Coder:3B

This guide helps you deploy the LlamaLearn agent with Ollama using the qwen2.5-coder:3b model with GPU offloading on your Kubernetes cluster.

Hardware Requirements (Your Setup)

  • RAM: 8GB total
  • CPU: 4 cores
  • GPU: NVIDIA with 5GB VRAM (operator installed)
  • Model: Qwen2.5-Coder:3B (~2GB VRAM usage)

Model Information

Qwen2.5-Coder:3B is an excellent choice for your hardware:

  • ✅ Small size: ~2GB in VRAM (fits easily in your 5GB)
  • ✅ Code-focused: Specialized for programming tasks
  • ✅ Fast inference: 3B parameters = quick responses
  • ✅ Good quality: Strong performance for its size
  • ✅ Multilingual: Supports many programming languages

Option 1: Ollama Deployment in Kubernetes (Recommended)

Step 1: Create Ollama Deployment with GPU

Create ollama-gpu-deployment.yaml:

---
apiVersion: v1
kind: Namespace
metadata:
  name: llamalearn

---
# Ollama Deployment with GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: llamalearn
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
          name: http
          protocol: TCP
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
            nvidia.com/gpu: 1
          limits:
            memory: "3Gi"
            cpu: "2000m"
            nvidia.com/gpu: 1
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc

---
# PersistentVolumeClaim for Ollama
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: llamalearn
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

---
# Ollama Service
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: llamalearn
  labels:
    app: ollama
spec:
  type: ClusterIP
  ports:
  - port: 11434
    targetPort: 11434
    protocol: TCP
    name: http
  selector:
    app: ollama

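Before applying, it's worth confirming that a node actually advertises the GPU resource the Deployment requests (standard kubectl; the backslashes escape the dots in the nvidia.com/gpu resource name):

# Each GPU node should report a non-empty GPU count
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
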
Step 2: Deploy Ollama

# Save the above as ollama-gpu-deployment.yaml
kubectl apply -f ollama-gpu-deployment.yaml

# Wait for pod to be ready
kubectl wait --for=condition=ready pod -l app=ollama -n llamalearn --timeout=300s

# Check GPU is available
kubectl logs -n llamalearn -l app=ollama

Step 3: Pull the Model

# Get the Ollama pod name
POD=$(kubectl get pods -n llamalearn -l app=ollama -o jsonpath='{.items[0].metadata.name}')

# Pull qwen2.5-coder:3b model
kubectl exec -it -n llamalearn $POD -- ollama pull qwen2.5-coder:3b

# Verify the model is downloaded
kubectl exec -it -n llamalearn $POD -- ollama list
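
If you prefer not to exec into the pod, the same pull works over Ollama's standard HTTP API, here via a temporary port-forward to the Service defined above:

# Pull through the REST API instead of kubectl exec
kubectl port-forward -n llamalearn svc/ollama-service 11434:11434 &
curl http://localhost:11434/api/pull -d '{"name": "qwen2.5-coder:3b"}'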

Step 4: Test Ollama with GPU

# Test the model
kubectl exec -it -n llamalearn $POD -- ollama run qwen2.5-coder:3b "Write a hello world in Python"

# Check GPU usage (if nvidia-smi is available)
kubectl exec -it -n llamalearn $POD -- nvidia-smi
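
You can also ask Ollama itself where the loaded model is running; ollama ps reports this per model:

# After a run, the PROCESSOR column should read "100% GPU", not "CPU"
kubectl exec -it -n llamalearn $POD -- ollama ps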

Step 5: Deploy the Agent

# Build the agent image
docker build -t llamalearn-agent:latest .

# Load to your K8s (adjust for your setup)
# For minikube:
minikube image load llamalearn-agent:latest
# For kind:
# kind load docker-image llamalearn-agent:latest

# Deploy the agent (it will connect to ollama-service)
kubectl apply -f k8s-minimal.yaml

# Wait for agent to be ready
kubectl wait --for=condition=ready pod -l app=llamalearn-agent -n llamalearn --timeout=120s
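
If the agent pod gets stuck in ErrImagePull or ImagePullBackOff, Kubernetes is likely trying to pull the locally loaded image from a registry; a :latest tag defaults to imagePullPolicy: Always. A quick check (assuming the Deployment in k8s-minimal.yaml is named llamalearn-agent; adjust if yours differs):

# Should print IfNotPresent or Never for locally loaded images
kubectl get deploy -n llamalearn llamalearn-agent \
  -o jsonpath='{.spec.template.spec.containers[0].imagePullPolicy}'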

Step 6: Test the Setup

# Port forward the agent service
kubectl port-forward -n llamalearn svc/llamalearn-service 8000:8000

# In another terminal, test
curl http://localhost:8000/health

# Test with the Python client
python test_client.py
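
Beyond the health check, you can exercise the chat endpoint directly (the same /chat endpoint used in the GPU-monitoring section below):

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Write a hello world in Python"}'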

Option 2: External Ollama (Already Running)

If you already have Ollama running elsewhere with GPU:

  1. Pull the model (on your Ollama host):

    ollama pull qwen2.5-coder:3b
    ollama list  # verify
  2. Update k8s-minimal.yaml to point to your Ollama (or keep the ollama-service name via the ExternalName sketch after this list):

    data:
      LLM_BACKEND: "ollama"
      OLLAMA_BASE_URL: "http://your-ollama-host:11434"  # or IP address
      OLLAMA_MODEL: "qwen2.5-coder:3b"
  3. Deploy just the agent:

    kubectl apply -f k8s-minimal.yaml
    kubectl port-forward svc/llamalearn-service 8000:8000
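
If you'd rather keep the in-cluster ollama-service name (so OLLAMA_BASE_URL never changes between options), an ExternalName Service can alias it to your external host. A minimal sketch; ollama.example.com is a placeholder for your real hostname, and ExternalName only works with DNS names, not bare IPs:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: llamalearn
spec:
  type: ExternalName
  # Placeholder: replace with the DNS name of your Ollama host
  externalName: ollama.example.com
EOF

With that in place, OLLAMA_BASE_URL can stay http://ollama-service:11434 exactly as in Option 1.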

Option 3: Local Development (Non-K8s)

For local testing without Kubernetes:

# Install and run Ollama locally
# Download from: https://ollama.ai

# Pull the model
ollama pull qwen2.5-coder:3b

# Verify GPU is being used
ollama run qwen2.5-coder:3b "test"
# Watch GPU usage in another terminal:
watch -n 1 nvidia-smi

# Setup the agent
./setup.sh
source venv/bin/activate

# Make sure .env has:
# LLM_BACKEND=ollama
# OLLAMA_BASE_URL=http://localhost:11434
# OLLAMA_MODEL=qwen2.5-coder:3b

# Run the agent
python main.py --mode api

# Test
python test_client.py
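
Before involving the agent, a direct smoke test of the local Ollama API confirms the model answers on its own; /api/generate is part of Ollama's standard HTTP API, and "stream": false returns a single JSON object instead of a stream:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:3b",
  "prompt": "Write a hello world in Python",
  "stream": false
}'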

Verifying GPU Usage

Check GPU is Available in Ollama Pod

POD=$(kubectl get pods -n llamalearn -l app=ollama -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n llamalearn $POD -- nvidia-smi

Monitor GPU Usage During Inference

# In one terminal, watch GPU usage
kubectl exec -it -n llamalearn $POD -- sh -c "watch -n 1 nvidia-smi"

# In another terminal, send requests
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Write a Python function to calculate fibonacci numbers"}'

Expected GPU Memory Usage

  • Idle (model loaded): ~2GB VRAM
  • During inference: ~2.5-3GB VRAM
  • Plenty of headroom in your 5GB VRAM!
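
To compare these estimates against your actual GPU, nvidia-smi's query flags print just the memory counters ($POD as set in the previous step):

kubectl exec -n llamalearn $POD -- nvidia-smi --query-gpu=memory.used,memory.total --format=csv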

Troubleshooting

GPU Not Detected in Pod

Check NVIDIA operator:

kubectl get pods -n gpu-operator-resources
# (newer GPU Operator releases install into the gpu-operator namespace)

Verify GPU resource:

kubectl describe node | grep nvidia.com/gpu

Check pod has GPU:

kubectl describe pod -n llamalearn -l app=ollama | grep -A 5 Limits

Model Not Loading

Check available space:

kubectl exec -n llamalearn $POD -- df -h

Check Ollama logs:

kubectl logs -n llamalearn -l app=ollama -f

Manually pull model:

kubectl exec -it -n llamalearn $POD -- ollama pull qwen2.5-coder:3b

Out of Memory

If you run out of memory, reduce resources:

  1. Limit how much Ollama keeps resident at once, using its documented environment variables (add these to the Deployment's env):

    env:
    - name: OLLAMA_MAX_LOADED_MODELS
      value: "1"
    - name: OLLAMA_NUM_PARALLEL
      value: "1"
  2. Use a smaller model (swap steps after this list):

    • qwen2.5-coder:1.5b (~1GB VRAM)
    • tinyllama:1.1b (~700MB VRAM)
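
Switching is one pull plus a config change; a sketch, assuming your agent reads the model name from the OLLAMA_MODEL value in k8s-minimal.yaml as shown in Option 2:

# Pull the smaller model alongside the current one
kubectl exec -it -n llamalearn $POD -- ollama pull qwen2.5-coder:1.5b

# Set OLLAMA_MODEL: "qwen2.5-coder:1.5b" in k8s-minimal.yaml, then re-apply
kubectl apply -f k8s-minimal.yaml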

Agent Can't Connect to Ollama

Test connectivity from agent pod:

AGENT_POD=$(kubectl get pods -n llamalearn -l app=llamalearn-agent -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n llamalearn $AGENT_POD -- curl http://ollama-service:11434/
# A healthy Ollama answers "Ollama is running"

Check service:

kubectl get svc -n llamalearn ollama-service
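
A Service with no endpoints is the most common culprit here; the ENDPOINTS column should show the Ollama pod's IP, and an empty value means the selector matches no ready pod:

kubectl get endpoints -n llamalearn ollama-service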

Performance Tips

  1. Keep model loaded: Ollama holds the model in VRAM after first use, so only the first request pays the load cost
  2. Adjust context length: a smaller context window means less VRAM use and faster inference (see the example after this list)
  3. Temperature: lower temperature (0.1-0.3) gives more deterministic output
  4. Batch requests: send multiple requests while the model is loaded to amortize the load cost
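
Tips 2 and 3 correspond to per-request options in Ollama's generate API; an example against a local or port-forwarded Ollama (num_ctx and temperature are standard Ollama options, though what your agent actually forwards may differ):

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:3b",
  "prompt": "Write a Python function to reverse a string",
  "stream": false,
  "options": {
    "num_ctx": 2048,
    "temperature": 0.2
  }
}'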

Resource Monitoring

# Watch all resources
watch -n 2 'kubectl top nodes && echo && kubectl top pods -n llamalearn'

# Check GPU usage (kubectl exec needs a pod name, not a label selector)
POD=$(kubectl get pods -n llamalearn -l app=ollama -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n llamalearn $POD -- nvidia-smi

# Check agent logs
kubectl logs -n llamalearn -l app=llamalearn-agent -f

Example Queries for Code Model

Since you're using a code-focused model, try these:

python test_client.py

Then try:

  • "Write a Python function to reverse a string"
  • "Explain what a binary search algorithm does"
  • "Create a REST API endpoint using FastAPI"
  • "Write a SQL query to find duplicate records"
  • "Debug this code: [paste code]"

Complete Deployment Script

Save this as deploy-ollama-gpu.sh:

#!/bin/bash
set -e

echo "Deploying Ollama with GPU and LlamaLearn Agent..."

# Create namespace
kubectl create namespace llamalearn --dry-run=client -o yaml | kubectl apply -f -

# Deploy Ollama with GPU
kubectl apply -f ollama-gpu-deployment.yaml

# Wait for Ollama
echo "Waiting for Ollama pod..."
kubectl wait --for=condition=ready pod -l app=ollama -n llamalearn --timeout=300s

# Pull model
echo "Pulling qwen2.5-coder:3b model..."
POD=$(kubectl get pods -n llamalearn -l app=ollama -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n llamalearn $POD -- ollama pull qwen2.5-coder:3b

# Build and load agent image
echo "Building agent image..."
docker build -t llamalearn-agent:latest .
minikube image load llamalearn-agent:latest || kind load docker-image llamalearn-agent:latest || true

# Deploy agent
echo "Deploying agent..."
kubectl apply -f k8s-minimal.yaml

# Wait for agent
kubectl wait --for=condition=ready pod -l app=llamalearn-agent -n llamalearn --timeout=120s

echo ""
echo "✅ Deployment complete!"
echo ""
echo "Port forward to access:"
echo "  kubectl port-forward -n llamalearn svc/llamalearn-service 8000:8000"
echo ""
echo "Test:"
echo "  curl http://localhost:8000/health"
echo "  python test_client.py"

Make it executable:

chmod +x deploy-ollama-gpu.sh
./deploy-ollama-gpu.sh

Happy coding with your GPU-accelerated agent! 🚀