This guide covers different deployment scenarios for CodePilot, from local development to production deployment.
- Local Development Setup
- Local Inference Server
- AWS Deployment
- Docker Deployment
- VSCode Extension Installation
- Production Considerations
## Local Development Setup

### Prerequisites

- Ubuntu 20.04+ (or similar Linux distribution)
- Python 3.8+
- CUDA 11.8+ (for GPU support)
- 16GB+ RAM (32GB recommended for training)
- Node.js 16+ (for VSCode extension)
### Automated Setup

```bash
# Clone repository
git clone https://github.com/sreekar-gajula/code-pilot.git
cd code-pilot
# Run automated setup
chmod +x scripts/setup.sh
./scripts/setup.sh
# Activate environment
source venv/bin/activate
```

### Manual Setup

If the automated setup fails:

```bash
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Create directories
mkdir -p data/{raw,processed} model/checkpoints logs evaluation/results
```

## Local Inference Server

```bash
# Activate environment
source venv/bin/activate
# Set model path (will auto-download from HuggingFace)
export MODEL_PATH="codellama/CodeLlama-13b-Instruct-hf"
# Start server
cd inference
python server.py --model-path $MODEL_PATH --port 8000
```

After training your own model:

```bash
# Set path to your fine-tuned model
export MODEL_PATH="./model/checkpoints/codepilot-13b"
# Start server
cd inference
python server.py --model-path $MODEL_PATH --port 8000
```

Test the server:

```bash
# Health check
curl http://localhost:8000/health
# Test completion
curl -X POST http://localhost:8000/complete \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "void can_transmit_message(can_msg_t* msg) {\n",
    "max_tokens": 150,
    "temperature": 0.2
  }'
```
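The same request can be scripted. Here is a minimal Python client, assuming only the `requests` package and the `/complete` endpoint shown above:

```python
import requests

# Completion endpoint of the locally running inference server
API_URL = "http://localhost:8000/complete"

payload = {
    "prompt": "void can_transmit_message(can_msg_t* msg) {\n",
    "max_tokens": 150,
    "temperature": 0.2,
}

# POST the request and print the raw JSON response
response = requests.post(API_URL, json=payload, timeout=30)
response.raise_for_status()
print(response.json())
```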
### Configuration

Edit `inference/config.yaml`:

```yaml
model:
  path: "./model/codepilot-13b"
  quantization: "4bit"

server:
  host: "0.0.0.0"
  port: 8000
  workers: 4

vllm:
  max_model_len: 4096
  tensor_parallel_size: 1
  dtype: "float16"
```
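How `server.py` consumes this file is not shown here; as a rough sketch, loading it with PyYAML would look like this (the key names match the config above, but the actual loading code may differ):

```python
import yaml

# Parse the YAML configuration used by the inference server
with open("inference/config.yaml") as f:
    config = yaml.safe_load(f)

model_path = config["model"]["path"]  # "./model/codepilot-13b"
port = config["server"]["port"]       # 8000
print(f"Serving {model_path} on port {port}")
```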
## AWS Deployment

1. **Launch EC2 Instance**
   - Instance Type: `g5.2xlarge` (NVIDIA A10G GPU)
   - AMI: Deep Learning AMI (Ubuntu)
   - Storage: 100GB EBS
   - Security Group: Open port 8000
2. **Connect and Setup**

```bash
# SSH into instance
ssh -i your-key.pem ubuntu@<instance-ip>
# Clone repository
git clone https://github.com/sreekar-gajula/code-pilot.git
cd code-pilot
# Run setup
./scripts/setup.sh
# Install NVIDIA drivers (if not using DL AMI)
sudo apt-get install -y nvidia-driver-525
```

3. **Start Service**

```bash
# Run as background service with systemd
sudo cp scripts/codepilot.service /etc/systemd/system/
sudo systemctl enable codepilot
sudo systemctl start codepilot
# Check status
sudo systemctl status codepilot
```

Create `/etc/systemd/system/codepilot.service`:

```ini
[Unit]
Description=CodePilot Inference Server
After=network.target
[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu/code-pilot
Environment="PATH=/home/ubuntu/code-pilot/venv/bin"
ExecStart=/home/ubuntu/code-pilot/venv/bin/python inference/server.py
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
```

For high availability, put the server behind a load balancer such as AWS ELB, or use nginx as a reverse proxy:

```bash
# Install nginx
sudo apt-get install -y nginx
# Configure reverse proxy
sudo nano /etc/nginx/sites-available/codepilot
```

Nginx config:

```nginx
server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```

## Docker Deployment

```bash
# Build production image
docker build -t codepilot:latest .
# Or build development image
docker build --target development -t codepilot:dev .
```

```bash
# Run with GPU support
docker run --gpus all -p 8000:8000 \
  -v $(pwd)/model:/app/model \
  codepilot:latest
# Run with custom configuration
docker run --gpus all -p 8000:8000 \
  -e MODEL_PATH=/app/model/codepilot-13b \
  -e API_PORT=8000 \
  codepilot:latest
```

Create `docker-compose.yml`:

```yaml
version: '3.8'

services:
  codepilot:
    image: codepilot:latest
    ports:
      - "8000:8000"
    volumes:
      - ./model:/app/model
      - ./logs:/app/logs
    environment:
      - MODEL_PATH=/app/model/codepilot-13b
      - CUDA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

Run with:

```bash
docker-compose up -d
```

## VSCode Extension Installation

### Development Mode

```bash
cd vscode-extension
# Install dependencies
npm install
# Compile TypeScript
npm run compile
# Open in VSCode and press F5 to launch the extension development host
```

### Package and Install

```bash
cd vscode-extension
# Install VSCE
npm install -g vsce
# Package extension
vsce package
# Install in VSCode
code --install-extension codepilot-automotive-1.0.0.vsix
```

Once published:
- Open VSCode
- Go to Extensions (Ctrl+Shift+X)
- Search "CodePilot Automotive"
- Click Install
After installation, configure in VSCode settings:

```json
{
  "codepilot.apiEndpoint": "http://localhost:8000",
  "codepilot.enableAutoComplete": true,
  "codepilot.maxTokens": 150,
  "codepilot.temperature": 0.2
}
```

## Production Considerations

### Performance Optimization
1. **Model Quantization**
   - Use 4-bit quantization for memory efficiency (see the sketch after this list)
   - Trade-off: roughly 5% accuracy loss for about 4x less memory
2. **Batch Processing**
   - vLLM enables efficient batching
   - Configure `max_num_seqs` based on GPU memory
3. **Caching**
   - Enable KV cache for repeated prefixes
   - Can reduce latency by 30-50%
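To illustrate item 1, a 4-bit model load via Hugging Face Transformers and bitsandbytes might look like the sketch below. This is illustrative, not the project's actual loading code, and uses the base model path from earlier in this guide:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = "codellama/CodeLlama-13b-Instruct-hf"

# NF4 4-bit quantization: roughly 4x lower weight memory for a small accuracy cost
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)
```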
### Monitoring

1. **Prometheus Metrics**

Add to `inference/server.py`:

```python
from prometheus_client import Counter, Histogram
request_count = Counter('codepilot_requests_total', 'Total requests')
latency = Histogram('codepilot_latency_seconds', 'Request latency')
```
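Defining the metrics is not enough on its own: they must be updated per request and exposed for scraping. One possible wiring, building on the snippet above (the handler name is illustrative):

```python
from prometheus_client import start_http_server

# Expose /metrics on port 9090 for Prometheus to scrape
start_http_server(9090)

@latency.time()  # observe the duration of every decorated call
def handle_completion(prompt: str) -> str:
    request_count.inc()  # count each completion request
    return prompt        # stand-in for the actual model call
```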
2. **Logging**

Configure structured logging:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('logs/codepilot.log'),
        logging.StreamHandler()
    ]
)
```
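With this configuration in place, any module can log through the standard interface, and the lines land in both `logs/codepilot.log` and the console:

```python
import logging

logger = logging.getLogger("codepilot.inference")  # illustrative logger name
logger.info("Model loaded")
```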
### Security

1. **API Authentication**
   - Add API key validation (see the sketch after this list)
   - Use JWT tokens for user auth
2. **Rate Limiting**
   - Implement per-user rate limits
   - Use Redis for distributed rate limiting
3. **HTTPS**
   - Use Let's Encrypt for SSL certificates
   - Configure nginx with HTTPS
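As a sketch of item 1, an API-key check could be added as a FastAPI dependency. This assumes the server is built on FastAPI, which this guide does not confirm; the header and environment variable names are illustrative:

```python
import os

from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()
API_KEY = os.environ.get("CODEPILOT_API_KEY", "")  # illustrative env var

def require_api_key(x_api_key: str = Header(default="")) -> None:
    # FastAPI binds the X-Api-Key request header to this parameter
    if not API_KEY or x_api_key != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.get("/health")
def health(_: None = Depends(require_api_key)) -> dict:
    # Reuses the existing health endpoint as a protected example
    return {"status": "ok"}
```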
### Scaling

1. **Horizontal Scaling**
   - Deploy multiple instances
   - Use a load balancer (AWS ELB, nginx)

2. **Model Replication**
   - Tensor parallelism for large models
   - Pipeline parallelism across nodes

3. **Auto-scaling**
   - Configure AWS Auto Scaling groups
   - Scale based on GPU utilization (see the sketch below)
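For item 3, AWS Auto Scaling cannot see GPU utilization by default, so each instance has to publish it as a custom CloudWatch metric that a scaling policy can target. A hedged sketch using `nvidia-smi` and boto3 (the namespace and metric names are illustrative):

```python
import subprocess

import boto3

cloudwatch = boto3.client("cloudwatch")

# Read current GPU utilization (percent) from nvidia-smi
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=utilization.gpu",
     "--format=csv,noheader,nounits"]
)
gpu_util = float(out.decode().splitlines()[0])

# Publish as a custom metric; an Auto Scaling policy can track it
cloudwatch.put_metric_data(
    Namespace="CodePilot",  # illustrative namespace
    MetricData=[{
        "MetricName": "GPUUtilization",
        "Value": gpu_util,
        "Unit": "Percent",
    }],
)
```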
## Troubleshooting

**Issue**: CUDA out of memory

```bash
# Solution: Reduce batch size or use smaller model
export CUDA_VISIBLE_DEVICES=0
python inference/server.py --max-batch-size 1
```

**Issue**: Port already in use

```bash
# Solution: Change port or kill existing process
lsof -ti:8000 | xargs kill -9
python inference/server.py --port 8001
```

**Issue**: Model loading slow

```bash
# Solution: Use local model cache
export TRANSFORMERS_CACHE=/path/to/cache
```

## Support

- GitHub Issues: Report bugs and feature requests
- Discussions: Ask questions and share ideas
- Email: sreekar.gajula@example.com
## Next Steps

- Training Guide - Train your own model
- API Reference - Detailed API documentation
- Examples - Usage examples