This guide walks you through deploying the LlamaLearn agent on your Kubernetes cluster, with Ollama serving the qwen2.5-coder:3b model using GPU offloading.
Hardware this guide assumes:
- RAM: 8GB total
- CPU: 4 cores
- GPU: NVIDIA with 5GB VRAM (operator installed)
- Model: Qwen2.5-Coder:3B (~2GB VRAM usage)
Qwen2.5-Coder:3B is an excellent choice for your hardware:
- ✅ Small size: ~2GB in VRAM (fits easily in your 5GB)
- ✅ Code-focused: Specialized for programming tasks
- ✅ Fast inference: 3B parameters = quick responses
- ✅ Good quality: Strong performance for its size
- ✅ Multilingual: Supports many programming languages
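Before deploying anything, it's worth confirming that the cluster actually advertises the GPU resource (the same check appears again in the troubleshooting section below):

```bash
# The node should list nvidia.com/gpu under its allocatable resources
kubectl describe node | grep nvidia.com/gpu
```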
Create `ollama-gpu-deployment.yaml`:

```yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: llamalearn
---
# Ollama Deployment with GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: llamalearn
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
              name: http
              protocol: TCP
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
              nvidia.com/gpu: 1
            limits:
              memory: "3Gi"
              cpu: "2000m"
              nvidia.com/gpu: 1
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc
---
# PersistentVolumeClaim for Ollama
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: llamalearn
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
# Ollama Service
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: llamalearn
  labels:
    app: ollama
spec:
  type: ClusterIP
  ports:
    - port: 11434
      targetPort: 11434
      protocol: TCP
      name: http
  selector:
    app: ollama
```

Apply the manifest and wait for the pod:

```bash
# Save the above as ollama-gpu-deployment.yaml
kubectl apply -f ollama-gpu-deployment.yaml
# Wait for pod to be ready
kubectl wait --for=condition=ready pod -l app=ollama -n llamalearn --timeout=300s
# Check GPU is available
kubectl logs -n llamalearn -l app=ollama
```

Next, pull the model inside the Ollama pod:

```bash
# Get the Ollama pod name
POD=$(kubectl get pods -n llamalearn -l app=ollama -o jsonpath='{.items[0].metadata.name}')
# Pull qwen2.5-coder:3b model
kubectl exec -it -n llamalearn $POD -- ollama pull qwen2.5-coder:3b
# Verify the model is downloaded
kubectl exec -it -n llamalearn $POD -- ollama list

# Test the model
kubectl exec -it -n llamalearn $POD -- ollama run qwen2.5-coder:3b "Write a hello world in Python"
# Check GPU usage (if nvidia-smi is available)
kubectl exec -it -n llamalearn $POD -- nvidia-smi
```

Build the agent image and deploy it:

```bash
# Build the agent image
docker build -t llamalearn-agent:latest .
# Load to your K8s (adjust for your setup)
# For minikube:
minikube image load llamalearn-agent:latest
# For kind:
# kind load docker-image llamalearn-agent:latest
# Deploy the agent (it will connect to ollama-service)
kubectl apply -f k8s-minimal.yaml
# Wait for agent to be ready
kubectl wait --for=condition=ready pod -l app=llamalearn-agent -n llamalearn --timeout=120s
```

Access and test the agent:

```bash
# Port forward the agent service
kubectl port-forward -n llamalearn svc/llamalearn-service 8000:8000
# In another terminal, test
curl http://localhost:8000/health
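
# Optional: check the status code explicitly (assumption: a healthy agent returns HTTP 200)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health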
# Test with the Python client
python test_client.py
```

If you already have Ollama running elsewhere with a GPU:
1. Pull the model (on your Ollama host):

   ```bash
   ollama pull qwen2.5-coder:3b
   ollama list  # verify
   ```

2. Update `k8s-minimal.yaml` to point to your Ollama:

   ```yaml
   data:
     LLM_BACKEND: "ollama"
     OLLAMA_BASE_URL: "http://your-ollama-host:11434"  # or IP address
     OLLAMA_MODEL: "qwen2.5-coder:3b"
   ```

3. Deploy just the agent:

   ```bash
   kubectl apply -f k8s-minimal.yaml
   kubectl port-forward svc/llamalearn-service 8000:8000
   ```
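If the agent later can't reach that external Ollama, a throwaway curl pod is a quick reachability test from inside the cluster (the `curlimages/curl` image and the `/api/tags` endpoint are just one convenient choice; substitute your actual host):

```bash
# One-off pod that curls the external Ollama from inside the cluster, then removes itself
kubectl run ollama-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://your-ollama-host:11434/api/tags
```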
For local testing without Kubernetes:

```bash
# Install and run Ollama locally
# Download from: https://ollama.ai
# Pull the model
ollama pull qwen2.5-coder:3b
# Verify GPU is being used
ollama run qwen2.5-coder:3b "test"
# Watch GPU usage in another terminal:
watch -n 1 nvidia-smi
# Setup the agent
./setup.sh
source venv/bin/activate
# Make sure .env has:
# LLM_BACKEND=ollama
# OLLAMA_BASE_URL=http://localhost:11434
# OLLAMA_MODEL=qwen2.5-coder:3b
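
# Optional: confirm the local Ollama server is reachable and the model is listed
# (uses Ollama's standard HTTP API on port 11434)
curl http://localhost:11434/api/tags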
# Run the agent
python main.py --mode api
# Test
python test_client.py
```

To verify GPU usage in the cluster:

```bash
POD=$(kubectl get pods -n llamalearn -l app=ollama -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n llamalearn $POD -- nvidia-smi
```

To watch GPU usage while sending requests:

```bash
# In one terminal, watch GPU usage
kubectl exec -it -n llamalearn $POD -- sh -c "watch -n 1 nvidia-smi"
# In another terminal, send requests
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"message": "Write a Python function to calculate fibonacci numbers"}'- Idle (model loaded): ~2GB VRAM
- During inference: ~2.5-3GB VRAM
- Plenty of headroom in your 5GB VRAM!
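To spot-check these numbers against what the GPU actually reports (assuming nvidia-smi is available in the Ollama container, as above):

```bash
# Used vs. total VRAM on the GPU assigned to the Ollama pod
kubectl exec -n llamalearn $POD -- nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```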
If the GPU isn't being used or the model won't download, work through these checks.

Check the NVIDIA operator:

```bash
kubectl get pods -n gpu-operator-resources
```

Verify the GPU resource is advertised:

```bash
kubectl describe node | grep nvidia.com/gpu
```

Check that the pod was granted a GPU:

```bash
kubectl describe pod -n llamalearn -l app=ollama | grep -A 5 Limits
```

Check available space:

```bash
kubectl exec -n llamalearn $POD -- df -h
```

Check Ollama logs:

```bash
kubectl logs -n llamalearn -l app=ollama -f
```

Manually pull the model:

```bash
kubectl exec -it -n llamalearn $POD -- ollama pull qwen2.5-coder:3b
```

If you run out of memory, reduce resources:
- Lower GPU memory usage in Ollama (requires custom environment variables):

  ```yaml
  env:
    - name: OLLAMA_NUM_GPU
      value: "1"
    - name: OLLAMA_GPU_OVERHEAD
      value: "0"
  ```

- Use a smaller model:
  - `qwen2.5-coder:1.5b` (~1GB VRAM)
  - `tinyllama:1.1b` (~700MB VRAM)
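For example, switching to the 1.5B model might look like this (assuming your `k8s-minimal.yaml` sets the model via the `OLLAMA_MODEL` key shown in the remote-Ollama snippet above; the agent deployment name is a guess):

```bash
# Pull the smaller model into the running Ollama pod
kubectl exec -it -n llamalearn $POD -- ollama pull qwen2.5-coder:1.5b

# Point the agent at it: set OLLAMA_MODEL: "qwen2.5-coder:1.5b" in k8s-minimal.yaml, then re-apply
kubectl apply -f k8s-minimal.yaml

# Restart the agent so it picks up the new config (deployment name is an assumption)
kubectl rollout restart deployment/llamalearn-agent -n llamalearn
```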
Test connectivity from the agent pod:

```bash
AGENT_POD=$(kubectl get pods -n llamalearn -l app=llamalearn-agent -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n llamalearn $AGENT_POD -- curl http://ollama-service:11434/
```

Check the service:

```bash
kubectl get svc -n llamalearn ollama-service
```

Performance tips:

- Keep model loaded: Ollama keeps the model in VRAM after first use
- Adjust context length: Smaller context = faster inference
- Temperature: Lower temperature (0.1-0.3) = more deterministic, focused responses
- Batch requests: Send multiple requests to amortize loading costs
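How these settings reach the model depends on the agent's configuration; a quick way to experiment with them directly is Ollama's own generate API, where `num_ctx` and `temperature` are standard request options. This sketch assumes you can reach the service, e.g. via `kubectl port-forward -n llamalearn svc/ollama-service 11434:11434`:

```bash
# Ask Ollama directly with a smaller context window and a low temperature
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:3b",
  "prompt": "Write a hello world in Python",
  "stream": false,
  "options": { "num_ctx": 2048, "temperature": 0.2 }
}'
```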
Monitor the deployment:

```bash
# Watch all resources
watch -n 2 'kubectl top nodes && echo && kubectl top pods -n llamalearn'

# Check GPU usage ($POD was set earlier; kubectl exec needs a pod name, not a label selector)
kubectl exec -n llamalearn $POD -- nvidia-smi

# Check agent logs
kubectl logs -n llamalearn -l app=llamalearn-agent -f
```

Since you're using a code-focused model, try these prompts:

```bash
python test_client.py
```

Then try:
- "Write a Python function to reverse a string"
- "Explain what a binary search algorithm does"
- "Create a REST API endpoint using FastAPI"
- "Write a SQL query to find duplicate records"
- "Debug this code: [paste code]"
Save this as `deploy-ollama-gpu.sh`:

```bash
#!/bin/bash
set -e
echo "Deploying Ollama with GPU and LlamaLearn Agent..."
# Create namespace
kubectl create namespace llamalearn --dry-run=client -o yaml | kubectl apply -f -
# Deploy Ollama with GPU
kubectl apply -f ollama-gpu-deployment.yaml
# Wait for Ollama
echo "Waiting for Ollama pod..."
kubectl wait --for=condition=ready pod -l app=ollama -n llamalearn --timeout=300s
# Pull model
echo "Pulling qwen2.5-coder:3b model..."
POD=$(kubectl get pods -n llamalearn -l app=ollama -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n llamalearn $POD -- ollama pull qwen2.5-coder:3b
# Build and load agent image
echo "Building agent image..."
docker build -t llamalearn-agent:latest .
minikube image load llamalearn-agent:latest || kind load docker-image llamalearn-agent:latest || true
# Deploy agent
echo "Deploying agent..."
kubectl apply -f k8s-minimal.yaml
# Wait for agent
kubectl wait --for=condition=ready pod -l app=llamalearn-agent -n llamalearn --timeout=120s
echo ""
echo "✅ Deployment complete!"
echo ""
echo "Port forward to access:"
echo " kubectl port-forward -n llamalearn svc/llamalearn-service 8000:8000"
echo ""
echo "Test:"
echo " curl http://localhost:8000/health"
echo " python test_client.py"Make it executable:
chmod +x deploy-ollama-gpu.sh
./deploy-ollama-gpu.sh
```

Happy coding with your GPU-accelerated agent! 🚀