LibreArbitre/llm-optimizer

🎯 LLM Optimizer

Find the optimal LLM configuration for your GPU's VRAM

A lightweight web application that helps you determine the best Large Language Model, quantization level, and context size based on your available GPU memory.

License: MIT

🌟 Features

🎮 GPU Presets

Quick selection for popular GPUs:

  • Consumer NVIDIA RTX 50 series: RTX 5060 (8 GB), RTX 5060 Ti (16 GB), RTX 5070 (12 GB), RTX 5070 Ti (16 GB), RTX 5080 (16 GB), RTX 5090 (32 GB)
  • Consumer NVIDIA RTX 40 series: RTX 4080 (16 GB), RTX 4090 (24 GB)
  • Data Center NVIDIA: A100 40/80 GB, L40S 48 GB, H100 80 GB, H100 NVL 94 GB, H200 141 GB, B100 192 GB, B200 192 GB, B300 Ultra 288 GB
  • Data Center AMD: MI300X 192 GB, MI325X 256 GB, MI355X 288 GB

🤖 Supported Models (April 2026)

| Category | Models |
| --- | --- |
| Tiny (< 5B) | Qwen3.5 0.8B/2B/4B, Llama 3.2 1B/3B, Gemma 3 1B/4B, Phi-4 Mini 3.8B |
| Small (5–15B) | Qwen3.5 9B, Gemma 3 12B, Mistral Nemo 12B, Ministral 8B, Phi-4 14B |
| Medium (15–50B) | Qwen3.6 27B, Qwen3.5 27B, Qwen3.6 35B-A3B (MoE), Qwen3.5 35B-A3B (MoE), Gemma 3 27B, Mistral Small 4 (MoE) |
| Coding | Qwen3-Coder 30B, Devstral 2 (123B dense) |
| Vision | Qwen3-VL 32B/235B, Llama 3.2 11B Vision, Pixtral 12B |
| Large (50–150B) | Qwen3.5 122B-A10B (MoE), GLM-4.5 Air (MoE 12B active), Llama 4 Scout (MoE 17B active), Mistral Large 2 (123B) |
| Huge (150B+) | MiniMax M2.7 (MoE 10B active), DeepSeek-V4-Flash (MoE 13B active), Qwen3.5 397B-A17B (MoE), GLM-5.1 (MoE 40B active), Kimi K2.6 1T (MoE 32B active), DeepSeek-V4-Pro 1.6T (MoE 49B active) |

⚙️ Quantization Support

  • FP16: Maximum precision (2 bytes/param)
  • FP8: Good quality/performance balance (1 byte/param)
  • FP4: Maximum VRAM savings (0.5 bytes/param)
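
As a rough illustration of the bytes-per-parameter figures above, the weight footprint of a model can be estimated by multiplying its parameter count (in billions) by the factor for each format. This is a minimal Python sketch, not the application's PHP code; `BYTES_PER_PARAM` and `weight_gb` are names chosen here for illustration, and 1 GB is taken as 10^9 bytes.

```python
# Bytes per parameter for each format listed above.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}


def weight_gb(params_b: float, quant: str) -> float:
    """Approximate weight memory in GB for a model with params_b billion parameters."""
    return params_b * BYTES_PER_PARAM[quant]


if __name__ == "__main__":
    # Compare the three formats for a 27B-parameter model.
    for quant in BYTES_PER_PARAM:
        print(f"27B @ {quant}: {weight_gb(27, quant):.1f} GB")
```

This counts weights only; the KV cache for the context window comes on top of it.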

🎯 Optimization Modes

  • Balanced: Best overall compromise
  • Largest Model: Prioritizes model parameter count
  • Maximum Context: Optimizes for longest context window
  • Best Quality: Minimizes quantization

🌐 Multi-language

  • English (default)
  • French
  • Language preference saved in cookies

🚀 Quick Start

Using Docker Compose (Recommended)

git clone https://github.com/LibreArbitre/llm-optimizer.git
cd llm-optimizer
docker-compose up -d

Access at: http://localhost:8080

Using Docker

docker build -t llm-optimizer .
docker run -d -p 8080:80 llm-optimizer

Without Docker

Requirements: PHP 7.4+

php -S 0.0.0.0:8080

📐 How It Works

Calculation Formula

Total VRAM (GB) = (Parameters in billions × Precision Factor) + (Context Size in tokens × KV_per_token in GB)

Precision Factors:

  • FP16: 2
  • FP8: 1
  • FP4: 0.5

KV Cache per token (scales with model size via GQA):

kv_per_token = max(0.08, 0.04 × √params_B)  MB/token

This square-root scaling reflects that modern models use Grouped-Query Attention (GQA), where the number of KV heads grows much more slowly than the total parameter count. Calibrated values:

| Model size | KV cache/token |
| --- | --- |
| 8B | ~0.11 MB |
| 14B | ~0.15 MB |
| 32B | ~0.23 MB |
| 70B | ~0.33 MB |
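
The calibrated values can be reproduced directly from the heuristic. The following is a small Python sketch under that assumption; the function name `kv_per_token_mb` is hypothetical and is not taken from index.php.

```python
import math


def kv_per_token_mb(params_b: float) -> float:
    """KV-cache cost per token in MB: max(0.08, 0.04 * sqrt(params in billions))."""
    return max(0.08, 0.04 * math.sqrt(params_b))


if __name__ == "__main__":
    # Sizes from the calibration table.
    for size in (8, 14, 32, 70):
        print(f"{size}B: ~{kv_per_token_mb(size):.2f} MB/token")
```

The 0.08 MB floor applies to models below 4B parameters, since 0.04 × √4 = 0.08.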

Example Calculations

Qwen3.6 27B in FP4 with 32K context on 16 GB GPU:

  • Model: 27B × 0.5 = 13.5 GB
  • KV cache: 32,768 tokens × 0.00021 GB/token ≈ 6.9 GB, for a total of about 20.4 GB, which does not fit in 16 GB; at FP4 the model fits only with roughly 8K context
  • For 32K context, drop to a 14B-class model (Phi-4 14B FP4 ≈ 11.9 GB total)

GLM-5.1 (754B MoE) in FP4 on 8× H200 (1128 GB total):

  • Model: 754B × 0.5 = 377 GB (full weights resident across GPUs)
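  • KV cache at 256K context: 262,144 tokens × 0.0011 GB/token ≈ 288 GB, for a total of about 665 GB, which leaves plenty of headroom for 256K+ context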
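
The arithmetic in these examples can be checked with a short script. This is a sketch in Python that combines the formula and the KV heuristic from this section; `total_vram_gb` is an illustrative name, and the unrounded KV cost is used, so results can differ slightly from the rounded figures above.

```python
import math


def total_vram_gb(params_b: float, bytes_per_param: float, context_tokens: int) -> float:
    """Weights plus KV cache in GB, following the README formula."""
    model_gb = params_b * bytes_per_param
    # KV cost in GB/token (the MB/token heuristic divided by 1,000).
    kv_gb_per_token = max(0.00008, 0.00004 * math.sqrt(params_b))
    return model_gb + context_tokens * kv_gb_per_token


if __name__ == "__main__":
    print(f"27B FP4 @ 32K:  {total_vram_gb(27, 0.5, 32_768):.1f} GB")
    print(f"14B FP4 @ 32K:  {total_vram_gb(14, 0.5, 32_768):.1f} GB")
    print(f"754B FP4 @ 256K: {total_vram_gb(754, 0.5, 262_144):.0f} GB")
```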

Algorithm

  1. For each model and quantization level:

    • Calculate model memory: params × precision_factor
    • Calculate available context memory: (vram × 0.95) - model_memory
    • Compute KV cost: max(0.00008, 0.00004 × √params) GB/token
    • Find maximum context: context_memory / kv_per_token
    • Validate against minimum context constraint
  2. Score configurations based on priority:

    • Balanced: (params × 100) + (context / 100) - (precision × 50)
    • Model: (params × 1000) - (precision × 100) + (context / 1000)
    • Context: (context × 1000) + params
    • Quality: ((3 - precision) × 10000) + (params × 100) + (context / 1000)
  3. Return top 3 diversified recommendations + additional viable configurations
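
The steps above can be sketched in Python as follows. This is not the application's PHP implementation, and several details are assumptions made for illustration: the README does not say how `precision` is encoded in the score formulas, so it is treated here as a quantization rank (FP16 = 1, FP8 = 2, FP4 = 3); the default `min_context` of 4096 is hypothetical; and the diversification of the top 3 is omitted.

```python
import math
from dataclasses import dataclass

# name -> (bytes per parameter, rank). The rank encoding is an assumption.
QUANTS = {"FP16": (2.0, 1), "FP8": (1.0, 2), "FP4": (0.5, 3)}


@dataclass
class Config:
    model: str
    quant: str
    params_b: float
    rank: int
    context: int


def max_context(vram_gb: float, params_b: float, bytes_per_param: float) -> int:
    """Largest context (tokens) that fits in 95% of VRAM after loading the weights."""
    context_gb = vram_gb * 0.95 - params_b * bytes_per_param
    if context_gb <= 0:
        return 0
    kv_gb = max(0.00008, 0.00004 * math.sqrt(params_b))
    return int(context_gb / kv_gb)


def score(cfg: Config, mode: str) -> float:
    """Score a configuration using the formulas listed in step 2."""
    p, c, q = cfg.params_b, cfg.context, cfg.rank
    if mode == "balanced":
        return p * 100 + c / 100 - q * 50
    if mode == "model":
        return p * 1000 - q * 100 + c / 1000
    if mode == "context":
        return c * 1000 + p
    if mode == "quality":
        return (3 - q) * 10000 + p * 100 + c / 1000
    raise ValueError(f"unknown mode: {mode}")


def recommend(models: dict, vram_gb: float, mode: str = "balanced",
              min_context: int = 4096, top: int = 3) -> list:
    """Return the highest-scoring viable configurations (no diversification)."""
    viable = []
    for name, params_b in models.items():
        for quant, (bytes_per_param, rank) in QUANTS.items():
            ctx = max_context(vram_gb, params_b, bytes_per_param)
            if ctx >= min_context:
                viable.append(Config(name, quant, params_b, rank, ctx))
    return sorted(viable, key=lambda cfg: score(cfg, mode), reverse=True)[:top]


if __name__ == "__main__":
    models = {"Phi-4 14B": 14, "Qwen3.5 9B": 9, "Qwen3.6 27B": 27}
    for cfg in recommend(models, vram_gb=16, mode="balanced"):
        print(cfg.model, cfg.quant, cfg.context)
```

Under the rank assumption, a lower rank (less quantization) raises the Quality score and lowers the penalty in the Balanced and Model scores, which matches the mode descriptions in the Features section.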

🏗️ Architecture

  • Backend: PHP 8.2-FPM
  • Web Server: Nginx (Alpine)
  • Base Image: Alpine Linux
  • Image Size: ~50 MB
  • Memory Usage: ~10 MB RAM
  • Response Time: <100 ms

Project Structure

llm-optimizer/
├── index.php              # Main application
├── Dockerfile             # Container definition
├── docker-compose.yml     # Local development
├── nginx.conf             # Web server config
├── start.sh               # Startup script
└── README.md

🧪 Testing

# Build and test
docker build -t llm-optimizer:test .
docker run -d -p 8888:80 --name test llm-optimizer:test

# Health check
curl http://localhost:8888

# Cleanup
docker stop test && docker rm test

🌐 Deployment

General Requirements

  • Docker support
  • Port 80 available (or custom port mapping)
  • Minimal resources: 128 MB RAM, 0.1 CPU

Platform-Specific Guides

Docker Swarm

docker service create \
  --name llm-optimizer \
  --publish 80:80 \
  --replicas 2 \
  llm-optimizer:latest

Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-optimizer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-optimizer
  template:
    metadata:
      labels:
        app: llm-optimizer
    spec:
      containers:
      - name: llm-optimizer
        image: llm-optimizer:latest
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: llm-optimizer
spec:
  selector:
    app: llm-optimizer
  ports:
  - port: 80
    targetPort: 80
  type: LoadBalancer

Docker Compose Production

version: '3.8'
services:
  app:
    image: llm-optimizer:latest
    restart: always
    ports:
      - "80:80"
    deploy:
      resources:
        limits:
          memory: 256M
          cpus: '0.5'

Reverse Proxy

The application works behind any reverse proxy (Traefik, Nginx, Caddy). It listens on port 80 and supports health checks at /.

🔧 Configuration

Environment Variables

None required. The application is stateless and requires no configuration.

Custom Models

To add your own models, edit index.php:

$models = [
    // 'params' is the parameter count in billions
    ['name' => 'Your Model', 'params' => 13, 'tags' => ['code']],
    // available tags: 'code', 'vision', 'reasoning', 'multilingual'
];

Custom Port

In docker-compose.yml:

ports:
  - "YOUR_PORT:80"

📊 Use Cases

Example 1: RTX 5070 Ti Owner (16 GB)

Question: "What can I run with decent context?"

Results:

  • ⭐ Phi-4 14B (FP4) → 32K context
  • ✓ Gemma 3 12B (FP8) → 16K context
  • ✓ Qwen3.5 9B (FP4) → 64K context

Example 2: Data Center Deployment (H200 141 GB)

Question: "Largest model with 32K+ context?"

Results:

  • ⭐ Qwen3.5 122B-A10B (FP4) → 32K context
  • ✓ Mistral Large 2 (FP4) → 64K context

Example 3: Maximum Context Priority

Question: "Longest possible context window on 16 GB?"

Results:

  • ⭐ Qwen3.5 2B (FP4) → 128K context
  • ✓ Llama 3.2 3B (FP4) → 128K context

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Adding New Models

  1. Edit the $models array in index.php
  2. Test locally
  3. Submit PR with model name, parameter count, and tags

Adding Languages

  1. Add translation array in index.php
  2. Add language selector button
  3. Test all pages

📝 License

MIT License - feel free to use this project for any purpose.

📧 Support


Made with ❤️ for the LLM community
