- Overview
- Architecture
- Features
- Prerequisites
- Quick Start
- Usage
- Development
- Security Considerations
- Performance Optimization
- Troubleshooting
- License
A GPU-accelerated inference platform for deploying multiple AI models as microservices. Built with Docker, FastAPI, and Prometheus/Grafana monitoring.
This project provides a containerized infrastructure for running multiple AI/ML models with GPU acceleration. Each model runs as an independent microservice behind an NGINX reverse proxy, with built-in monitoring via Prometheus and Grafana.
| Model | Capability | Framework |
|---|---|---|
| Whisper | Automatic speech recognition & transcription | OpenAI |
| Stable Diffusion v1.4 | Text-to-image generation | Hugging Face |
| LLaMA 2 7B (Q4_K_M) | Text generation & chat | Meta AI |
| Gemma 2 2B | Text generation & instruction following | Google |
| Silero TTS v3 | Russian text-to-speech synthesis | Silero |
┌─────────────┐
│    NGINX    │ ← Reverse Proxy (Port 80)
└──────┬──────┘
       │
       ├─────► Whisper API (Port 8035)
       ├─────► Stable Diff API (Port 8015)
       ├─────► LLaMA API (Port 8012)
       ├─────► Gemma API (Port 8005)
       └─────► TTS API (Port 8025)
┌──────────────────────────────────┐
│         Monitoring Stack         │
│  • Prometheus (Port 8090)        │
│  • Grafana (Port 8034)           │
└──────────────────────────────────┘
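The routing in the diagram can be sketched as a prefix lookup. This is only an illustration: the path prefixes come from the usage examples in this README and the ports from the diagram, but pairing them this way is an assumption, and the actual rules live in the NGINX configuration. The name `resolve_upstream` is hypothetical.

```python
# Assumed mapping of URL prefixes to (service, port). The service names follow
# the directories under services/; the real NGINX config may differ.
ROUTES = {
    "/api/v1/whisper/": ("whisper-api", 8035),
    "/api/v1/sd/": ("stable-diffusion-api", 8015),
    "/api/v1/llama/": ("llama-api", 8012),
    "/api/v1/gemma/": ("gemma-api", 8005),
    "/api/v1/tts/": ("tts-api", 8025),
}


def resolve_upstream(path):
    """Return (service, port) for the longest matching prefix, or None."""
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    if not matches:
        return None
    return ROUTES[max(matches, key=len)]
```

For example, `resolve_upstream("/api/v1/llama/generate")` selects the LLaMA service, while a path with no matching prefix yields `None`.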
Key Design Decisions:
- Microservices Architecture: Each model runs in isolation for independent updates and deployment
- Shared Base Image: Common dependencies cached in Dockerfile.base to reduce build time and storage
- GPU Sharing: All services share a single GPU through NVIDIA Container Toolkit
- Non-root Containers: Services run as unprivileged users for enhanced security
- Makefile Orchestration: Simplified commands for common operations
- GPU Acceleration: CUDA-optimized inference for all models
- Containerized Deployment: Docker Compose orchestration with a shared base image
- Monitoring: Prometheus metrics + Grafana dashboards
- Security Hardening: Rootless Docker support, non-root containers
- Reverse Proxy: NGINX routing and request proxying
- Modular Design: Easy to add/remove models independently
- Optimized Builds: Shared base image with dependency caching
- GPU: NVIDIA GPU with CUDA support (recommended: ≥16GB VRAM for all models)
- RAM: Minimum 16GB system memory
- Storage: ~50GB free disk space for models and containers
- Docker Engine (≥20.10)
- NVIDIA Container Toolkit
  - Required for GPU access from containers
  - See the NVIDIA Container Toolkit installation guide
- CUDA Drivers (≥11.8)
  - Verify with: nvidia-smi
- Python 3.12+ (for setup scripts)
Check that the NVIDIA driver can see your GPU:
nvidia-smi
Alternatively, run python check_gpu.py after installing torch into your environment.
Detailed GPU setup: See docs/GPU_SETUP.md
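To compare the installed GPU against the 16GB VRAM recommendation, the total memory can be read from `nvidia-smi` and checked in a few lines. This is a minimal sketch, separate from the repository's own check_gpu.py; the function names are hypothetical, and it assumes `nvidia-smi` is on the PATH. Cards marketed as 16GB may report slightly less than 16384 MiB, so treat the threshold as approximate.

```python
import subprocess


def parse_total_memory_mib(nvidia_smi_output):
    """Parse one integer (MiB) per GPU from headerless, unitless CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_total_memory_mib():
    """Ask nvidia-smi for the total memory of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_total_memory_mib(result.stdout)


def meets_recommendation(total_mib, recommended_gib=16.0):
    """True if the GPU has at least `recommended_gib` GiB of memory."""
    return total_mib / 1024 >= recommended_gib
```

On a GPU host, `[meets_recommendation(m) for m in query_total_memory_mib()]` gives one boolean per device.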
git clone https://github.com/spolivin/llm-inference-api.git
cd llm-inference-api
# Configure environment variables (Grafana credentials)
cp .env.example .env
Option A: Using Python venv
sudo apt-get update
sudo apt-get install python3.12-venv build-essential
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements-all.txt
Option B: Using Conda
source setup_env.sh
Authenticate with Hugging Face:
sh hf_login.sh
Download all models:
# LLaMA 2 7B (quantized)
cd services/llama-api && sh download_llama_weights.sh && cd ../..
# Gemma 2 2B
cd services/gemma-api && sh download_gemma_model.sh && cd ../..
# Stable Diffusion v1.4
cd services/stable-diffusion-api && python download_sd_model.py && cd ../..
# Silero TTS v3 (Russian)
cd services/tts-api && python download_tts_model.py && cd ../..
# Whisper (base model)
cd services/whisper-api && python download_whisper_model.py && cd ../..
Log out of Hugging Face (security best practice):
sh hf_logout.sh
# Build shared base image (one-time operation)
make build-base
# Build all services
make build-services
# Start all services
make up-services
Check that all services are running:
make services
Expected output: 8 containers running (5 models + NGINX + Prometheus + Grafana)
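The container count can also be checked programmatically. The sketch below parses the output of `docker compose ps --format json`, which, depending on the Compose version, is either a JSON array or one JSON object per line. The `Service` and `State` field names are what Compose v2 emits as far as I know; verify them against your installed version. The function names are hypothetical.

```python
import json


def parse_compose_ps(output):
    """Accept either a JSON array or newline-delimited JSON objects."""
    text = output.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def running_services(output):
    """Return the set of service names whose container state is 'running'."""
    return {
        container["Service"]
        for container in parse_compose_ps(output)
        if container.get("State") == "running"
    }
```

With all services up, `len(running_services(output))` should equal 8.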
Once deployed, access services through NGINX on localhost or your VM's public IP:
# Text Generation (LLaMA 2)
curl -X POST http://localhost/api/v1/llama/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain quantum computing", "max_new_tokens": 100}'
# Text Generation (Gemma 2)
curl -X POST http://localhost/api/v1/gemma/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "Write a Python function", "max_new_tokens": 150}'
# Image Generation (Stable Diffusion)
curl -X POST http://localhost/api/v1/sd/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "a serene mountain landscape", "steps": 50}'
# Speech Recognition (Whisper)
curl -X POST http://localhost/api/v1/whisper/generate \
-F "file=@audio.wav"
# Text-to-Speech (Silero - Russian)
curl -X POST http://localhost/api/v1/tts/generate \
-H "Content-Type: application/json" \
-d '{"text": "Привет, как дела?", "speaker": "kseniya"}'
Note: The image generation and speech synthesis requests above write their output files to a temporary directory. See the API scripts directory for how to download the generated files locally.
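The JSON endpoints above can be called from Python with the standard library alone. This is a minimal sketch mirroring the curl examples: the helper names `build_generate_request` and `send` are hypothetical, and because the response schema is not documented here, the body is returned as raw bytes rather than parsed.

```python
import json
import urllib.request


def build_generate_request(model, payload, base_url="http://localhost"):
    """Build a POST request for /api/v1/<model>/generate with a JSON body."""
    url = f"{base_url.rstrip('/')}/api/v1/{model}/generate"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120.0):
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

A call such as `send(build_generate_request("llama", {"prompt": "Explain quantum computing", "max_new_tokens": 100}))` corresponds to the first curl example. The Whisper endpoint takes a multipart file upload and is not covered by this helper.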
- Prometheus: http://localhost:8090
- Grafana: http://localhost:8034
  - Default credentials: See the .env file
  - Pre-configured dashboards available
Monitoring Guide: See docs/MONITORING.md
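Prometheus also exposes an HTTP query API, which is handy for scripted health checks. The sketch below builds an instant query for the built-in `up` metric and lists scrape targets that are not reporting 1. It assumes Prometheus is reachable on port 8090 as listed above; the function names are hypothetical, and the instance labels depend on the scrape configuration.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(expr, base_url="http://localhost:8090"):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def down_targets(response_body):
    """Return instance labels whose `up` sample is not 1."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for sample in payload["data"]["result"]:
        _timestamp, value = sample["value"]  # value arrives as a string
        if float(value) != 1.0:
            down.append(sample["metric"].get("instance", "<unknown>"))
    return down


def fetch(url, timeout=10.0):
    """Fetch a URL and decode the body as UTF-8 text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Running `down_targets(fetch(build_query_url("up")))` against a healthy stack should return an empty list.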
# View running containers
make containers
# View service logs
docker compose logs -f
# Check open ports
make open-ports
# Rebuild a specific service
docker compose build <service-name>
# Stop all services
make down-services
# Clean up everything (containers, volumes, images)
docker compose down -v --rmi all
llm-inference-api/
├── services/              # Individual model services
│   ├── whisper-api/
│   ├── stable-diffusion-api/
│   ├── llama-api/
│   ├── gemma-api/
│   └── tts-api/
├── nginx/                 # Reverse proxy configuration
├── prometheus/            # Metrics collection config
├── docs/                  # Additional documentation
├── Dockerfile.base        # Shared base image
├── docker-compose.yml     # Service orchestration
├── Makefile               # Build and deployment commands
└── requirements-*.txt     # Python dependencies
This project implements several security best practices:
- Non-root Containers: All services run as unprivileged users (UID 1000)
- Rootless Docker Support: Compatible with rootless Docker installations
- Token Management: HF tokens cleared after model download
- Network Isolation: Services communicate through internal Docker network
- Environment Variables: Sensitive configs managed via the .env file
Security Details:
For systems with <16GB VRAM, consider:
- Running a subset of models
- Using smaller model variants
- Implementing model swapping (load/unload on demand)
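Model swapping can be sketched as a small least-recently-used cache that loads a model on first use and unloads the oldest one when a limit is reached. This is an illustrative sketch, not how the services in this project are implemented: `ModelCache`, the loader callables, and the `unload` hook are all hypothetical names.

```python
from collections import OrderedDict


class ModelCache:
    """Keep at most `max_loaded` models resident, evicting the least recently used."""

    def __init__(self, loaders, max_loaded=1, unload=None):
        if max_loaded < 1:
            raise ValueError("max_loaded must be at least 1")
        self._loaders = loaders          # name -> zero-argument callable returning a model
        self._max_loaded = max_loaded
        # Hypothetical hook: a real one might drop references to the model and
        # then call torch.cuda.empty_cache() to release cached GPU memory.
        self._unload = unload or (lambda name, model: None)
        self._loaded = OrderedDict()     # oldest entry first

    def get(self, name):
        """Return the model, loading it (and evicting others) if necessary."""
        if name in self._loaded:
            self._loaded.move_to_end(name)  # mark as most recently used
            return self._loaded[name]
        if name not in self._loaders:
            raise KeyError(f"no loader registered for {name!r}")
        while len(self._loaded) >= self._max_loaded:
            old_name, old_model = self._loaded.popitem(last=False)
            self._unload(old_name, old_model)
        model = self._loaders[name]()
        self._loaded[name] = model
        return model

    def loaded(self):
        """Names of resident models, least recently used first."""
        return list(self._loaded)
```

The trade-off is latency: the first request after a swap pays the full load time, so this suits workloads where the models are rarely used at the same moment.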
The shared base image (Dockerfile.base) caches common dependencies:
- PyTorch + CUDA libraries
- FastAPI + Uvicorn
- HuggingFace Transformers
Benefits:
- Reduction in total build time
- Reduction in total image size
- Faster iterative development
# Check NVIDIA drivers
nvidia-smi
# Verify Docker GPU access
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
If a service runs out of GPU memory:
- Lower max_new_tokens in generation requests
- Run fewer models simultaneously
# Check which ports are in use
make open-ports
Modify the ports in docker-compose.yml if needed.
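Before starting the stack, it can help to check whether something is already listening on the ports this project uses. The sketch below is independent of the `make open-ports` target; it simply attempts a TCP connection to each port on the local machine. The port numbers are taken from the architecture diagram, and the function names are hypothetical.

```python
import socket

# Ports listed in the architecture diagram.
PROJECT_PORTS = {
    "nginx": 80,
    "whisper-api": 8035,
    "stable-diffusion-api": 8015,
    "llama-api": 8012,
    "gemma-api": 8005,
    "tts-api": 8025,
    "prometheus": 8090,
    "grafana": 8034,
}


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """True if a TCP connection to host:port succeeds, i.e. something is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def find_conflicts(ports=None):
    """Return the subset of name -> port entries that are already taken."""
    ports = PROJECT_PORTS if ports is None else ports
    return {name: port for name, port in ports.items() if port_in_use(port)}
```

Run `find_conflicts()` while the stack is stopped; any entry it returns points at a port to remap in docker-compose.yml.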
This project is licensed under the MIT License - see the LICENSE.txt file for details.