This guide walks you through deploying the LlamaLearn agent on your Kubernetes cluster, with Ollama serving the qwen2.5-coder:3b model using GPU offloading.
Hardware this guide assumes:
- RAM: 8GB total
- CPU: 4 cores
- GPU: NVIDIA with 5GB VRAM (operator installed)
- Model: Qwen2.5-Coder:3B (~2GB VRAM usage)
Qwen2.5-Coder:3B is an excellent choice for your hardware:
- ✅ Small size: ~2GB in VRAM (fits easily in your 5GB)
- ✅ Code-focused: Specialized for programming tasks
- ✅ Fast inference: 3B parameters = quick responses
- ✅ Good quality: Strong performance for its size
- ✅ Multilingual: Supports many programming languages
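Before deploying anything, it's worth confirming that the cluster actually advertises the GPU resource (the same check appears again in the troubleshooting section below):

```bash
# The node should list nvidia.com/gpu under its allocatable resources
kubectl describe node | grep nvidia.com/gpu
```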
Create `ollama-gpu-deployment.yaml`:

```yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: llamalearn
---
# Ollama Deployment with GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: llamalearn
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
              name: http
              protocol: TCP
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
              nvidia.com/gpu: 1
            limits:
              memory: "3Gi"
              cpu: "2000m"
              nvidia.com/gpu: 1
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc
---
# PersistentVolumeClaim for Ollama
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: llamalearn
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
# Ollama Service
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: llamalearn
  labels:
    app: ollama
spec:
  type: ClusterIP
  ports:
    - port: 11434
      targetPort: 11434
      protocol: TCP
      name: http
  selector:
    app: ollama
```

Apply the manifest and wait for the pod:

```bash
# Save the above as ollama-gpu-deployment.yaml
kubectl apply -f ollama-gpu-deployment.yaml
# Wait for pod to be ready
kubectl wait --for=condition=ready pod -l app=ollama -n llamalearn --timeout=300s
# Check GPU is available
kubectl logs -n llamalearn -l app=ollama
```

Next, pull the model inside the Ollama pod:

```bash
# Get the Ollama pod name
POD=$(kubectl get pods -n llamalearn -l app=ollama -o jsonpath='{.items[0].metadata.name}')
# Pull qwen2.5-coder:3b model
kubectl exec -it -n llamalearn $POD -- ollama pull qwen2.5-coder:3b
# Verify the model is downloaded
kubectl exec -it -n llamalearn $POD -- ollama list

# Test the model
kubectl exec -it -n llamalearn $POD -- ollama run qwen2.5-coder:3b "Write a hello world in Python"
# Check GPU usage (if nvidia-smi is available)
kubectl exec -it -n llamalearn $POD -- nvidia-smi
```

Build the agent image and deploy it:

```bash
# Build the agent image
docker build -t llamalearn-agent:latest .
# Load to your K8s (adjust for your setup)
# For minikube:
minikube image load llamalearn-agent:latest
# For kind:
# kind load docker-image llamalearn-agent:latest
# Deploy the agent (it will connect to ollama-service)
kubectl apply -f k8s-minimal.yaml
# Wait for agent to be ready
kubectl wait --for=condition=ready pod -l app=llamalearn-agent -n llamalearn --timeout=120s
```

Access and test the agent:

```bash
# Port forward the agent service
kubectl port-forward -n llamalearn svc/llamalearn-service 8000:8000
# In another terminal, test
curl http://localhost:8000/health
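
# Optional: check the status code explicitly (assumption: a healthy agent returns HTTP 200)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health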
# Test with the Python client
python test_client.py
```

If you already have Ollama running elsewhere with a GPU:
1. Pull the model (on your Ollama host):

   ```bash
   ollama pull qwen2.5-coder:3b
   ollama list  # verify
   ```

2. Update `k8s-minimal.yaml` to point to your Ollama:

   ```yaml
   data:
     LLM_BACKEND: "ollama"
     OLLAMA_BASE_URL: "http://your-ollama-host:11434"  # or IP address
     OLLAMA_MODEL: "qwen2.5-coder:3b"
   ```

3. Deploy just the agent:

   ```bash
   kubectl apply -f k8s-minimal.yaml
   kubectl port-forward svc/llamalearn-service 8000:8000
   ```
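If the agent later can't reach that external Ollama, a throwaway curl pod is a quick reachability test from inside the cluster (the `curlimages/curl` image and the `/api/tags` endpoint are just one convenient choice; substitute your actual host):

```bash
# One-off pod that curls the external Ollama from inside the cluster, then removes itself
kubectl run ollama-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://your-ollama-host:11434/api/tags
```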
For local testing without Kubernetes:

```bash
# Install and run Ollama locally
# Download from: https://ollama.ai
# Pull the model
ollama pull qwen2.5-coder:3b
# Verify GPU is being used
ollama run qwen2.5-coder:3b "test"
# Watch GPU usage in another terminal:
watch -n 1 nvidia-smi
# Setup the agent
./setup.sh
source venv/bin/activate
# Make sure .env has:
# LLM_BACKEND=ollama
# OLLAMA_BASE_URL=http://localhost:11434
# OLLAMA_MODEL=qwen2.5-coder:3b
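
# Optional: confirm the local Ollama server is reachable and the model is listed
# (uses Ollama's standard HTTP API on port 11434)
curl http://localhost:11434/api/tags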
# Run the agent
python main.py --mode api
# Test
python test_client.py
```

To verify GPU usage in the cluster:

```bash
POD=$(kubectl get pods -n llamalearn -l app=ollama -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n llamalearn $POD -- nvidia-smi
```

To watch GPU usage while sending requests:

```bash
# In one terminal, watch GPU usage
kubectl exec -it -n llamalearn $POD -- sh -c "watch -n 1 nvidia-smi"
# In another terminal, send requests
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"message": "Write a Python function to calculate fibonacci numbers"}'- Idle (model loaded): ~2GB VRAM
- During inference: ~2.5-3GB VRAM
- Plenty of headroom in your 5GB VRAM!
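To spot-check these numbers against what the GPU actually reports (assuming nvidia-smi is available in the Ollama container, as above):

```bash
# Used vs. total VRAM on the GPU assigned to the Ollama pod
kubectl exec -n llamalearn $POD -- nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```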
If the GPU isn't being used or the model won't download, work through these checks.

Check the NVIDIA operator:

```bash
kubectl get pods -n gpu-operator-resources
```

Verify the GPU resource is advertised:

```bash
kubectl describe node | grep nvidia.com/gpu
```

Check that the pod was granted a GPU:

```bash
kubectl describe pod -n llamalearn -l app=ollama | grep -A 5 Limits
```

Check available space:

```bash
kubectl exec -n llamalearn $POD -- df -h
```

Check Ollama logs:

```bash
kubectl logs -n llamalearn -l app=ollama -f
```

Manually pull the model:

```bash
kubectl exec -it -n llamalearn $POD -- ollama pull qwen2.5-coder:3b
```

If you run out of memory, reduce resources:
- Lower GPU memory usage in Ollama (requires custom environment variables):

  ```yaml
  env:
    - name: OLLAMA_NUM_GPU
      value: "1"
    - name: OLLAMA_GPU_OVERHEAD
      value: "0"
  ```

- Use a smaller model:
  - `qwen2.5-coder:1.5b` (~1GB VRAM)
  - `tinyllama:1.1b` (~700MB VRAM)
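For example, switching to the 1.5B model might look like this (assuming your `k8s-minimal.yaml` sets the model via the `OLLAMA_MODEL` key shown in the remote-Ollama snippet above; the agent deployment name is a guess):

```bash
# Pull the smaller model into the running Ollama pod
kubectl exec -it -n llamalearn $POD -- ollama pull qwen2.5-coder:1.5b

# Point the agent at it: set OLLAMA_MODEL: "qwen2.5-coder:1.5b" in k8s-minimal.yaml, then re-apply
kubectl apply -f k8s-minimal.yaml

# Restart the agent so it picks up the new config (deployment name is an assumption)
kubectl rollout restart deployment/llamalearn-agent -n llamalearn
```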
Test connectivity from the agent pod:

```bash
AGENT_POD=$(kubectl get pods -n llamalearn -l app=llamalearn-agent -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n llamalearn $AGENT_POD -- curl http://ollama-service:11434/
```

Check the service:

```bash
kubectl get svc -n llamalearn ollama-service
```

Performance tips:

- Keep model loaded: Ollama keeps the model in VRAM after first use
- Adjust context length: Smaller context = faster inference
- Temperature: Lower temperature (0.1-0.3) = more deterministic, focused responses
- Batch requests: Send multiple requests to amortize loading costs
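How these settings reach the model depends on the agent's configuration; a quick way to experiment with them directly is Ollama's own generate API, where `num_ctx` and `temperature` are standard request options. This sketch assumes you can reach the service, e.g. via `kubectl port-forward -n llamalearn svc/ollama-service 11434:11434`:

```bash
# Ask Ollama directly with a smaller context window and a low temperature
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:3b",
  "prompt": "Write a hello world in Python",
  "stream": false,
  "options": { "num_ctx": 2048, "temperature": 0.2 }
}'
```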
Monitor the deployment:

```bash
# Watch all resources
watch -n 2 'kubectl top nodes && echo && kubectl top pods -n llamalearn'

# Check GPU usage ($POD was set earlier; kubectl exec needs a pod name, not a label selector)
kubectl exec -n llamalearn $POD -- nvidia-smi

# Check agent logs
kubectl logs -n llamalearn -l app=llamalearn-agent -f
```

Since you're using a code-focused model, try these prompts:

```bash
python test_client.py
```

Then try:
- "Write a Python function to reverse a string"
- "Explain what a binary search algorithm does"
- "Create a REST API endpoint using FastAPI"
- "Write a SQL query to find duplicate records"
- "Debug this code: [paste code]"
Save this as `deploy-ollama-gpu.sh`:

```bash
#!/bin/bash
set -e
echo "Deploying Ollama with GPU and LlamaLearn Agent..."
# Create namespace
kubectl create namespace llamalearn --dry-run=client -o yaml | kubectl apply -f -
# Deploy Ollama with GPU
kubectl apply -f ollama-gpu-deployment.yaml
# Wait for Ollama
echo "Waiting for Ollama pod..."
kubectl wait --for=condition=ready pod -l app=ollama -n llamalearn --timeout=300s
# Pull model
echo "Pulling qwen2.5-coder:3b model..."
POD=$(kubectl get pods -n llamalearn -l app=ollama -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n llamalearn $POD -- ollama pull qwen2.5-coder:3b
# Build and load agent image
echo "Building agent image..."
docker build -t llamalearn-agent:latest .
minikube image load llamalearn-agent:latest || kind load docker-image llamalearn-agent:latest || true
# Deploy agent
echo "Deploying agent..."
kubectl apply -f k8s-minimal.yaml
# Wait for agent
kubectl wait --for=condition=ready pod -l app=llamalearn-agent -n llamalearn --timeout=120s
echo ""
echo "✅ Deployment complete!"
echo ""
echo "Port forward to access:"
echo " kubectl port-forward -n llamalearn svc/llamalearn-service 8000:8000"
echo ""
echo "Test:"
echo " curl http://localhost:8000/health"
echo " python test_client.py"Make it executable:
chmod +x deploy-ollama-gpu.sh
./deploy-ollama-gpu.sh
```

Happy coding with your GPU-accelerated agent! 🚀