spolivin/llm-inference-api

🚀 LLM Inference API

  1. Overview
  2. Architecture
  3. Features
  4. Prerequisites
  5. Quick Start
  6. Usage
  7. Development
  8. Security Considerations
  9. Performance Optimization
  10. Troubleshooting
  11. License

A GPU-accelerated inference platform for deploying multiple AI models as microservices. Built with Docker, FastAPI, and Prometheus/Grafana monitoring.

License: MIT

Overview

This project provides a containerized infrastructure for running multiple AI/ML models with GPU acceleration. Each model runs as an independent microservice behind an NGINX reverse proxy, with built-in monitoring via Prometheus and Grafana.

| Model | Capability | Framework |
|-------|------------|-----------|
| Whisper | Automatic speech recognition & transcription | OpenAI |
| Stable Diffusion v1.4 | Text-to-image generation | Hugging Face |
| LLaMA 2 7B (Q4_K_M) | Text generation & chat | Meta AI |
| Gemma 2 2B | Text generation & instruction following | Google |
| Silero TTS v3 | Russian text-to-speech synthesis | Silero |

Architecture

┌─────────────┐
│   NGINX     │  ← Reverse Proxy (Port 80)
└──────┬──────┘
       │
       ├─────► Whisper API      (Port 8035)
       ├─────► Stable Diff API  (Port 8015)
       ├─────► LLaMA API        (Port 8012)
       ├─────► Gemma API        (Port 8005)
       └─────► TTS API          (Port 8025)

┌──────────────────────────────────┐
│   Monitoring Stack               │
│   • Prometheus (Port 8090)       │
│   • Grafana (Port 8034)          │
└──────────────────────────────────┘
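
The routing in the diagram can be read as a simple lookup table. The sketch below is illustrative only: it pairs the ports from the diagram with the URL prefixes used in the Usage section, on the assumption that NGINX forwards each prefix to the matching service. The names `UPSTREAM_PORTS` and `resolve_upstream` are hypothetical; the actual rules live in the nginx/ directory.

```python
from typing import Optional

# Ports come from the architecture diagram. The prefix-to-service pairing is
# an assumption based on the endpoint paths shown in the Usage section.
UPSTREAM_PORTS = {
    "/api/v1/whisper/": 8035,
    "/api/v1/sd/": 8015,
    "/api/v1/llama/": 8012,
    "/api/v1/gemma/": 8005,
    "/api/v1/tts/": 8025,
}


def resolve_upstream(path: str) -> Optional[int]:
    """Return the backend port for a request path, or None if no prefix matches."""
    for prefix, port in UPSTREAM_PORTS.items():
        if path.startswith(prefix):
            return port
    return None
```

For example, `resolve_upstream("/api/v1/llama/generate")` picks the LLaMA port, while an unknown path yields `None`.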

Key Design Decisions:

  • Microservices Architecture: Each model runs in isolation for independent updates and deployment
  • Shared Base Image: Common dependencies cached in Dockerfile.base to reduce build time and storage
  • GPU Sharing: All services share a single GPU through NVIDIA Container Toolkit
  • Non-root Containers: Services run as unprivileged users for enhanced security
  • Makefile Orchestration: Simplified commands for common operations

Features

✅ GPU Acceleration: CUDA-optimized inference for all models

✅ Containerized Deployment: Docker Compose orchestration with a shared base image

✅ Monitoring: Prometheus metrics + Grafana dashboards

✅ Security Hardening: Rootless Docker support, non-root containers

✅ Reverse Proxy: NGINX routing and request proxying

✅ Modular Design: Easy to add/remove models independently

✅ Optimized Builds: Shared base image with dependency caching

Prerequisites

Hardware Requirements

  • GPU: NVIDIA GPU with CUDA support (recommended: ≥16GB VRAM for all models)
  • RAM: Minimum 16GB system memory
  • Storage: ~50GB free disk space for models and containers

Software Requirements

  1. Docker Engine (≥20.10)

  2. NVIDIA Container Toolkit

  3. CUDA Drivers (≥11.8)

    • Verify with: nvidia-smi
  4. Python 3.12+ (for setup scripts)

Verify GPU Access

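Check that the NVIDIA driver can see your GPU on the host (the Troubleshooting section shows the equivalent check from inside a Docker container):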

nvidia-smi

Alternatively, run python check_gpu.py after installing torch into your environment.
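
For orientation, a PyTorch-based GPU check could look roughly like the sketch below. This is not the repository's check_gpu.py, whose contents are not shown here; the function name `describe_gpus` is hypothetical, and only standard `torch.cuda` calls are used.

```python
def describe_gpus() -> list:
    """Return one summary line per CUDA device visible to PyTorch.

    Returns an empty list if torch is not installed or no device is visible.
    """
    try:
        import torch  # imported lazily so the script still runs without torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    lines = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        vram_gib = props.total_memory / 1024**3
        lines.append(f"cuda:{index} {props.name} ({vram_gib:.1f} GiB)")
    return lines


if __name__ == "__main__":
    gpus = describe_gpus()
    print("\n".join(gpus) if gpus else "No CUDA device visible to PyTorch")
```

The reported VRAM can be compared against the 16GB recommendation in the hardware requirements.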

📚 Detailed GPU setup: See docs/GPU_SETUP.md

Quick Start

1. Clone and Configure

git clone https://github.com/spolivin/llm-inference-api.git
cd llm-inference-api

# Configure environment variables (Grafana credentials)
cp .env.example .env

2. Set Up Python Environment

Option A: Using Python venv

sudo apt-get update
sudo apt-get install python3.12-venv build-essential

python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements-all.txt

Option B: Using Conda

source setup_env.sh

3. Download Model Weights

Authenticate with Hugging Face:

sh hf_login.sh

Download all models:

# LLaMA 2 7B (quantized)
cd services/llama-api && sh download_llama_weights.sh && cd ../..

# Gemma 2 2B
cd services/gemma-api && sh download_gemma_model.sh && cd ../..

# Stable Diffusion v1.4
cd services/stable-diffusion-api && python download_sd_model.py && cd ../..

# Silero TTS v3 (Russian)
cd services/tts-api && python download_tts_model.py && cd ../..

# Whisper (base model)
cd services/whisper-api && python download_whisper_model.py && cd ../..

Log out of Hugging Face (security best practice):

sh hf_logout.sh

4. Build and Launch

# Build shared base image (one-time operation)
make build-base

# Build all services
make build-services

# Start all services
make up-services

5. Verify Deployment

Check that all services are running:

make services

Expected output: 8 containers running (5 models + NGINX + Prometheus + Grafana)
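
Beyond counting containers, it can help to confirm that the published ports actually accept connections. The sketch below is a minimal TCP probe using only the standard library. The port numbers come from the architecture diagram; whether each one is published on the host depends on docker-compose.yml, so the `EXPECTED_PORTS` mapping is an assumption.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed host-side ports, taken from the architecture diagram.
EXPECTED_PORTS = {"nginx": 80, "prometheus": 8090, "grafana": 8034}

if __name__ == "__main__":
    for name, port in EXPECTED_PORTS.items():
        state = "open" if port_is_open("localhost", port) else "closed"
        print(f"{name:<12} {port:>5} {state}")
```

An open port only shows that something is listening; it does not prove the model behind it has finished loading.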

Usage

Once deployed, access services through NGINX on localhost or your VM's public IP:

API Endpoints

# Text Generation (LLaMA 2)
curl -X POST http://localhost/api/v1/llama/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain quantum computing", "max_new_tokens": 100}'

# Text Generation (Gemma 2)
curl -X POST http://localhost/api/v1/gemma/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a Python function", "max_new_tokens": 150}'

# Image Generation (Stable Diffusion)
curl -X POST http://localhost/api/v1/sd/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a serene mountain landscape", "steps": 50}'

# Speech Recognition (Whisper)
curl -X POST http://localhost/api/v1/whisper/generate \
  -F "file=@audio.wav"

# Text-to-Speech (Silero - Russian)
curl -X POST http://localhost/api/v1/tts/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Привет, как дела?", "speaker": "kseniya"}'

Note: The image generation and speech synthesis requests shown above write their output files to a temporary directory on the server. See the API scripts directory for how to download the generated files locally.
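
The JSON endpoints above can also be called from Python. The following is a minimal sketch using only the standard library; the helper names `build_generate_request` and `generate` are hypothetical, and because the response schema is not documented here, the body is returned as raw text rather than parsed.

```python
import json
import urllib.request


def build_generate_request(base_url: str, service: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for <base_url>/api/v1/<service>/generate."""
    url = f"{base_url.rstrip('/')}/api/v1/{service}/generate"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(service: str, payload: dict, base_url: str = "http://localhost",
             timeout: float = 120.0) -> str:
    """Send the request and return the raw response body as text."""
    request = build_generate_request(base_url, service, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the stack running, `generate("llama", {"prompt": "Explain quantum computing", "max_new_tokens": 100})` mirrors the first curl command. The Whisper endpoint takes a multipart file upload instead of JSON and is not covered by this helper.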

Monitoring

📊 Monitoring Guide: See docs/MONITORING.md

Development

Useful Commands

# View running containers
make containers

# View service logs
docker compose logs -f

# Check open ports
make open-ports

# Rebuild a specific service
docker compose build <service-name>

# Stop all services
make down-services

# Clean up everything (containers, volumes, images)
docker compose down -v --rmi all

Project Structure

llm-inference-api/
├── services/               # Individual model services
│   ├── whisper-api/
│   ├── stable-diffusion-api/
│   ├── llama-api/
│   ├── gemma-api/
│   └── tts-api/
├── nginx/                  # Reverse proxy configuration
├── prometheus/             # Metrics collection config
├── docs/                   # Additional documentation
├── Dockerfile.base         # Shared base image
├── docker-compose.yml      # Service orchestration
├── Makefile                # Build and deployment commands
└── requirements-*.txt      # Python dependencies

Security Considerations

This project implements several security best practices:

  1. Non-root Containers: All services run as unprivileged users (UID 1000)
  2. Rootless Docker Support: Compatible with rootless Docker installations
  3. Token Management: HF tokens cleared after model download
  4. Network Isolation: Services communicate through internal Docker network
  5. Environment Variables: Sensitive configs managed via .env file
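
Since credentials are supplied through the .env file, a quick pre-flight check can catch values that were left blank after copying .env.example. The sketch below is a hypothetical helper, not part of the repository; it handles only plain KEY=VALUE lines, and the key names in the usage note are placeholders rather than the project's actual variables.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def empty_keys(values: dict) -> list:
    """Return the sorted keys whose value is still empty."""
    return sorted(key for key, value in values.items() if not value)
```

Given the text `EXAMPLE_USER=admin` followed by `EXAMPLE_PASSWORD=`, `empty_keys(parse_env(text))` reports only `EXAMPLE_PASSWORD`.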

🔒 Security Details:

Performance Optimization

Memory Management

For systems with <16GB VRAM, consider:

  • Running a subset of models
  • Using smaller model variants
  • Implementing model swapping (load/unload on demand)
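
The model-swapping idea in the last bullet can be sketched as a small least-recently-used cache. This is not something the project ships; `ModelSwapper` and its loader functions are hypothetical, and the eviction step only drops the Python reference.

```python
from collections import OrderedDict
from typing import Any


class ModelSwapper:
    """Keep at most `capacity` models loaded, evicting the least recently used."""

    def __init__(self, loaders: dict, capacity: int = 1) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._loaders = loaders        # name -> zero-argument loader function
        self._capacity = capacity
        self._loaded = OrderedDict()   # name -> model, oldest first

    def get(self, name: str) -> Any:
        """Return the named model, loading it (and evicting others) if needed."""
        if name in self._loaded:
            self._loaded.move_to_end(name)    # mark as most recently used
            return self._loaded[name]
        while len(self._loaded) >= self._capacity:
            self._loaded.popitem(last=False)  # drop the least recently used
        model = self._loaders[name]()
        self._loaded[name] = model
        return model

    def loaded_names(self) -> list:
        """Return the names of loaded models, oldest first."""
        return list(self._loaded)
```

With PyTorch models, actually releasing VRAM after an eviction would typically also require that no other references to the model remain and a call to `torch.cuda.empty_cache()`. Reloading adds latency to the first request after a swap, which is the trade-off for fitting into less memory.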

Build Optimization

The shared base image (Dockerfile.base) caches common dependencies:

  • PyTorch + CUDA libraries
  • FastAPI + Uvicorn
  • HuggingFace Transformers

Benefits:

  • Reduction in total build time
  • Reduction in total image size
  • Faster iterative development

Troubleshooting

GPU Not Detected

# Check NVIDIA drivers
nvidia-smi

# Verify Docker GPU access
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

Out of Memory Errors

  • Lower max_new_tokens in generation requests
  • Run fewer models simultaneously

Port Conflicts

# Check which ports are in use
make open-ports

If a required port is already in use, change its mapping in docker-compose.yml.

License

This project is licensed under the MIT License - see the LICENSE.txt file for details.
