🚀 LLM Inference API

Overview
Architecture
Features
Prerequisites
Quick Start
Usage
Development
Security Considerations
Performance Optimization
Troubleshooting
License

A GPU-accelerated inference platform for deploying multiple AI models as microservices. Built with Docker, FastAPI, and Prometheus/Grafana monitoring.

Overview

This project provides a containerized infrastructure for running multiple AI/ML models with GPU acceleration. Each model runs as an independent microservice behind an NGINX reverse proxy, with built-in monitoring via Prometheus and Grafana.

| Model | Capability | Framework |
|---|---|---|
| Whisper | Automatic speech recognition & transcription | OpenAI |
| Stable Diffusion v1.4 | Text-to-image generation | Hugging Face |
| LLaMA 2 7B (Q4_K_M) | Text generation & chat | Meta AI |
| Gemma 2 2B | Text generation & instruction following | Google |
| Silero TTS v3 | Russian text-to-speech synthesis | Silero |

Architecture

┌─────────────┐
│   NGINX     │  ← Reverse Proxy (Port 80)
└──────┬──────┘
       │
       ├─────► Whisper API      (Port 8035)
       ├─────► Stable Diff API  (Port 8015)
       ├─────► LLaMA API        (Port 8012)
       ├─────► Gemma API        (Port 8005)
       └─────► TTS API          (Port 8025)

┌──────────────────────────────────┐
│   Monitoring Stack               │
│   • Prometheus (Port 8090)       │
│   • Grafana (Port 8034)          │
└──────────────────────────────────┘

Key Design Decisions:

Microservices Architecture: Each model runs in isolation for independent updates and deployment
Shared Base Image: Common dependencies cached in Dockerfile.base to reduce build time and storage
GPU Sharing: All services share a single GPU through NVIDIA Container Toolkit
Non-root Containers: Services run as unprivileged users for enhanced security
Makefile Orchestration: Simplified commands for common operations
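
The routing in the diagram can be pictured as a prefix-to-port lookup. The sketch below is illustrative only: the ports come from the diagram and the `/api/v1/<name>/` prefixes from the Usage examples, but the actual NGINX configuration in `nginx/` may map them differently. `ROUTES` and `resolve_upstream` are hypothetical names, not part of the project.

```python
from typing import Optional

# Illustrative only: ports from the architecture diagram, prefixes from the
# Usage examples. The real mapping lives in the NGINX config and may differ.
ROUTES = {
    "/api/v1/whisper/": 8035,
    "/api/v1/sd/": 8015,
    "/api/v1/llama/": 8012,
    "/api/v1/gemma/": 8005,
    "/api/v1/tts/": 8025,
}


def resolve_upstream(path: str) -> Optional[int]:
    """Return the backend port for a request path, or None if no prefix matches."""
    for prefix, port in ROUTES.items():
        if path.startswith(prefix):
            return port
    return None
```

Under these assumptions, `resolve_upstream("/api/v1/llama/generate")` picks the LLaMA service port, and any path outside the known prefixes resolves to `None`.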

Features

✅ GPU Acceleration: CUDA-optimized inference for all models

✅ Containerized Deployment: Docker Compose orchestration with a shared base image

✅ Monitoring: Prometheus metrics + Grafana dashboards

✅ Security Hardening: Rootless Docker support, non-root containers

✅ Reverse Proxy: NGINX routing and request proxying

✅ Modular Design: Easy to add/remove models independently

✅ Optimized Builds: Shared base image with dependency caching

Prerequisites

Hardware Requirements

GPU: NVIDIA GPU with CUDA support (recommended: ≥16GB VRAM for all models)
RAM: Minimum 16GB system memory
Storage: ~50GB free disk space for models and containers

Software Requirements

Docker Engine (≥20.10)
- Installation Guide for Ubuntu
NVIDIA Container Toolkit
- Required for GPU access from containers
- Installation Guide
NVIDIA driver supporting CUDA ≥11.8
- Verify with: nvidia-smi
Python 3.12+ (for setup scripts)

Verify GPU Access

Check that the NVIDIA driver detects your GPU (to test GPU access from inside a container, see GPU Not Detected under Troubleshooting):

nvidia-smi

Alternatively, run python check_gpu.py after installing torch into your environment.
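
The contents of check_gpu.py are not shown here, so the following is not that script. It is a minimal sketch of a comparable check that avoids the torch dependency by calling `nvidia-smi` with its CSV query flags. The function names `parse_gpu_csv` and `list_gpus` are made up for illustration.

```python
import shutil
import subprocess
from typing import List, Tuple


def parse_gpu_csv(output: str) -> List[Tuple[str, int]]:
    """Parse 'name, memory.total' CSV rows (no header, no units) into (name, MiB) pairs."""
    gpus = []
    for line in output.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so GPU names containing commas stay intact.
        name, memory_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(memory_mib)))
    return gpus


def list_gpus() -> List[Tuple[str, int]]:
    """Return detected GPUs, or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []
```

Comparing the reported memory (in MiB) against the 16GB recommendation above gives a quick indication of whether all models are likely to fit on one GPU.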

📚 Detailed GPU setup: See docs/GPU_SETUP.md

Quick Start

1. Clone and Configure

git clone https://github.com/spolivin/llm-inference-api.git
cd llm-inference-api

# Configure environment variables (Grafana credentials)
cp .env.example .env

2. Set Up Python Environment

Option A: Using Python venv

sudo apt-get update
sudo apt-get install python3.12-venv build-essential

python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements-all.txt

Option B: Using Conda

source setup_env.sh

3. Download Model Weights

Authenticate with Hugging Face:

sh hf_login.sh

Download all models:

# LLaMA 2 7B (quantized)
cd services/llama-api && sh download_llama_weights.sh && cd ../..

# Gemma 2 2B
cd services/gemma-api && sh download_gemma_model.sh && cd ../..

# Stable Diffusion v1.4
cd services/stable-diffusion-api && python download_sd_model.py && cd ../..

# Silero TTS v3 (Russian)
cd services/tts-api && python download_tts_model.py && cd ../..

# Whisper (base model)
cd services/whisper-api && python download_whisper_model.py && cd ../..

Log out of Hugging Face (security best practice):

sh hf_logout.sh

4. Build and Launch

# Build shared base image (one-time operation)
make build-base

# Build all services
make build-services

# Start all services
make up-services

5. Verify Deployment

Check that all services are running:

make services

Expected output: 8 containers running (5 models + NGINX + Prometheus + Grafana)

Usage

Once deployed, access services through NGINX on localhost or your VM's public IP:

API Endpoints

# Text Generation (LLaMA 2)
curl -X POST http://localhost/api/v1/llama/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain quantum computing", "max_new_tokens": 100}'

# Text Generation (Gemma 2)
curl -X POST http://localhost/api/v1/gemma/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a Python function", "max_new_tokens": 150}'

# Image Generation (Stable Diffusion)
curl -X POST http://localhost/api/v1/sd/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a serene mountain landscape", "steps": 50}'

# Speech Recognition (Whisper)
curl -X POST http://localhost/api/v1/whisper/generate \
  -F "file=@audio.wav"

# Text-to-Speech (Silero - Russian)
curl -X POST http://localhost/api/v1/tts/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Привет, как дела?", "speaker": "kseniya"}'

Note: The image generation and speech synthesis requests shown above write their output files to a temporary directory. See the api_scripts directory for how to download the generated files locally.
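
The JSON endpoints above can also be called from Python. The sketch below uses only the standard library and mirrors the curl examples; the response format is not documented in this README, so `generate` returns the raw bytes rather than guessing at a schema. Both function names are hypothetical, and the Whisper endpoint is left out because it expects a multipart file upload.

```python
import json
import urllib.request

BASE_URL = "http://localhost"  # NGINX on port 80, as in the curl examples


def build_generate_request(
    service: str, payload: dict, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST to /api/v1/<service>/generate."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/{service}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(service: str, payload: dict, timeout: float = 120.0) -> bytes:
    """Send the request and return the raw response body (schema not assumed)."""
    request = build_generate_request(service, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

For example, `generate("llama", {"prompt": "Explain quantum computing", "max_new_tokens": 100})` sends the same request as the first curl command. The 120-second timeout is an arbitrary choice for slow generations, not a project default.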

Monitoring

Prometheus: http://localhost:8090
Grafana: http://localhost:8034
- Default credentials: See .env file
- Pre-configured dashboards available

📊 Monitoring Guide: See docs/MONITORING.md
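
Prometheus can also be queried programmatically through its HTTP API. The sketch below runs an instant query for the built-in `up` metric on the port listed above and reports which scrape targets are reachable. Which targets appear depends on the configuration in `prometheus/`, which is not shown here, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict

PROMETHEUS_URL = "http://localhost:8090"  # port from the Monitoring section


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"


def parse_up_result(response: dict) -> Dict[str, bool]:
    """Map each target's 'instance' label to True (up) or False (down)."""
    return {
        sample["metric"].get("instance", "unknown"): sample["value"][1] == "1"
        for sample in response["data"]["result"]
    }


def scrape_target_health() -> Dict[str, bool]:
    """Query `up` and report which scrape targets Prometheus can reach."""
    with urllib.request.urlopen(build_query_url("up"), timeout=10) as response:
        return parse_up_result(json.load(response))
```

Calling `scrape_target_health()` once the stack is running returns one entry per configured scrape target, which is a quick way to spot a service that Prometheus cannot reach.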

Development

Useful Commands

# View running containers
make containers

# View service logs
docker compose logs -f

# Check open ports
make open-ports

# Rebuild a specific service
docker compose build <service-name>

# Stop all services
make down-services

# Clean up everything (containers, volumes, images)
docker compose down -v --rmi all

Project Structure

llm-inference-api/
├── services/               # Individual model services
│   ├── whisper-api/
│   ├── stable-diffusion-api/
│   ├── llama-api/
│   ├── gemma-api/
│   └── tts-api/
├── nginx/                  # Reverse proxy configuration
├── prometheus/             # Metrics collection config
├── docs/                   # Additional documentation
├── Dockerfile.base         # Shared base image
├── docker-compose.yml      # Service orchestration
├── Makefile                # Build and deployment commands
└── requirements-*.txt      # Python dependencies

Security Considerations

This project implements several security best practices:

Non-root Containers: All services run as unprivileged users (UID 1000)
Rootless Docker Support: Compatible with rootless Docker installations
Token Management: HF tokens cleared after model download
Network Isolation: Services communicate through internal Docker network
Environment Variables: Sensitive configs managed via .env file

🔒 Security Details:

Performance Optimization

Memory Management

For systems with <16GB VRAM, consider:

Running a subset of models
Using smaller model variants
Implementing model swapping (load/unload on demand)
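
Model swapping is not implemented in this project. As a rough sketch of the idea, the hypothetical `ModelSwapper` below keeps at most `capacity` models loaded and evicts the least recently used one when another is requested. The loaders are plain callables, so the same pattern could wrap in-process model loading or an orchestrator that starts and stops containers.

```python
from collections import OrderedDict
from typing import Any, Callable, Dict, List


class ModelSwapper:
    """Keep at most `capacity` models loaded, evicting the least recently used."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]], capacity: int = 1):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._loaders = loaders
        self._capacity = capacity
        self._loaded: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str) -> Any:
        """Return the named model, loading it (and evicting others) if needed."""
        if name in self._loaded:
            self._loaded.move_to_end(name)  # mark as most recently used
            return self._loaded[name]
        while len(self._loaded) >= self._capacity:
            self._loaded.popitem(last=False)  # drop the least recently used model
        self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def loaded(self) -> List[str]:
        """Names of currently loaded models, least recently used first."""
        return list(self._loaded)
```

Dropping the Python reference does not by itself return GPU memory to the driver. With PyTorch, an eviction hook would typically also need to release cached memory, for example via `torch.cuda.empty_cache()`.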

Build Optimization

The shared base image (Dockerfile.base) caches common dependencies:

PyTorch + CUDA libraries
FastAPI + Uvicorn
HuggingFace Transformers

Benefits:

Shorter total build time, since common dependencies are installed once
Lower total disk usage, since services share the base image layers
Faster iterative development

Troubleshooting

GPU Not Detected

# Check NVIDIA drivers
nvidia-smi

# Verify Docker GPU access
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

Out of Memory Errors

Lower max_new_tokens in generation requests
Run fewer models simultaneously
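
On the client side, lowering `max_new_tokens` can be automated with a retry loop. The sketch below halves the value after each failed attempt. It assumes a caller-supplied `send` function that raises `RuntimeError` when a request fails. How an out-of-memory condition actually surfaces through the API is not documented here, so adapt the exception type to whatever your client raises.

```python
from typing import Callable


def generate_with_backoff(
    send: Callable[[dict], str], payload: dict, min_tokens: int = 16
) -> str:
    """Call `send`, halving max_new_tokens after each failure down to min_tokens."""
    tokens = payload["max_new_tokens"]
    while True:
        try:
            # Copy the payload so the caller's dict is never mutated.
            return send({**payload, "max_new_tokens": tokens})
        except RuntimeError:
            if tokens <= min_tokens:
                raise  # already at the floor; give up and surface the error
            tokens = max(min_tokens, tokens // 2)
```

The floor of 16 tokens is an arbitrary illustrative default.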

Port Conflicts

# Check which ports are in use
make open-ports

Modify ports in docker-compose.yml if needed
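
If `make open-ports` is unavailable, a short script can check for conflicts before launch. The sketch below assumes the ports shown in the Architecture section are the ones published on the host; check docker-compose.yml for the actual mappings. `SERVICE_PORTS`, `port_in_use`, and `find_conflicts` are hypothetical names.

```python
import socket
from typing import Dict, List

# Ports as listed in the Architecture section (assumed to be host ports).
SERVICE_PORTS: Dict[str, int] = {
    "nginx": 80,
    "whisper": 8035,
    "stable-diffusion": 8015,
    "llama": 8012,
    "gemma": 8005,
    "tts": 8025,
    "prometheus": 8090,
    "grafana": 8034,
}


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def find_conflicts(ports: Dict[str, int] = SERVICE_PORTS) -> List[str]:
    """List the services whose port is already taken."""
    return [name for name, port in ports.items() if port_in_use(port)]
```

Run `find_conflicts()` before `make up-services`: any name it returns points to a port that another process already holds.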

License

This project is licensed under the MIT License - see the LICENSE.txt file for details.
