Find the optimal LLM configuration for your GPU's VRAM
A lightweight web application that helps you determine the best Large Language Model, quantization level, and context size based on your available GPU memory.
Quick selection for popular GPUs:
- Consumer NVIDIA RTX 50 series: RTX 5060 (8 GB), RTX 5060 Ti (16 GB), RTX 5070 (12 GB), RTX 5070 Ti (16 GB), RTX 5080 (16 GB), RTX 5090 (32 GB)
- Consumer NVIDIA RTX 40 series: RTX 4080 (16 GB), RTX 4090 (24 GB)
- Data Center NVIDIA: A100 40/80 GB, L40S 48 GB, H100 80 GB, H100 NVL 94 GB, H200 141 GB, B100 192 GB, B200 192 GB, B300 Ultra 288 GB
- Data Center AMD: MI300X 192 GB, MI325X 256 GB, MI355X 288 GB
| Category | Models |
|---|---|
| Tiny (< 5B) | Qwen3.5 0.8B/2B/4B, Llama 3.2 1B/3B, Gemma 3 1B/4B, Phi-4 Mini 3.8B |
| Small (5–15B) | Qwen3.5 9B, Gemma 3 12B, Mistral Nemo 12B, Ministral 8B, Phi-4 14B |
| Medium (15–50B) | Qwen3.6 27B, Qwen3.5 27B, Qwen3.6 35B-A3B (MoE), Qwen3.5 35B-A3B (MoE), Gemma 3 27B, Mistral Small 4 (MoE) |
| Coding | Qwen3-Coder 30B, Devstral 2 (123B dense) |
| Vision | Qwen3-VL 32B/235B, Llama 3.2 11B Vision, Pixtral 12B |
| Large (50–150B) | Qwen3.5 122B-A10B (MoE), GLM-4.5 Air (MoE 12B active), Llama 4 Scout (MoE 17B active), Mistral Large 2 (123B) |
| Huge (150B+) | MiniMax M2.7 (MoE 10B active), DeepSeek-V4-Flash (MoE 13B active), Qwen3.5 397B-A17B (MoE), GLM-5.1 (MoE 40B active), Kimi K2.6 1T (MoE 32B active), DeepSeek-V4-Pro 1.6T (MoE 49B active) |
- FP16: Maximum precision (2 bytes/param)
- FP8: Good quality/performance balance (1 byte/param)
- FP4: Maximum VRAM savings (0.5 bytes/param)
- Balanced: Best overall compromise
- Largest Model: Prioritizes model parameter count
- Maximum Context: Optimizes for longest context window
- Best Quality: Minimizes quantization
- English (default)
- French
- Language preference saved in cookies
git clone https://github.com/YOUR_USERNAME/llm-optimizer.git
cd llm-optimizer
docker-compose up -d
Access at: http://localhost:8080
docker build -t llm-optimizer .
docker run -d -p 8080:80 llm-optimizer
Requirements: PHP 7.4+
php -S 0.0.0.0:8080
Total VRAM (GB) = (Parameters in billions × Precision Factor) + (Context Size in tokens × KV_per_token in GB)
Precision Factors (bytes per parameter):
- FP16: 2
- FP8: 1
- FP4: 0.5
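
As a rough illustration of the first term of the formula, the sketch below computes weight memory in GB from a parameter count in billions and the precision factors listed above. The names `PRECISION_FACTORS` and `weights_gb` are hypothetical and are not taken from index.php.

```python
# Bytes per parameter for each quantization level, as listed above.
PRECISION_FACTORS = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}


def weights_gb(params_b: float, quant: str) -> float:
    """Weight memory in GB: billions of parameters x bytes per parameter."""
    return params_b * PRECISION_FACTORS[quant]


if __name__ == "__main__":
    # A 27B model at each quantization level.
    for quant in PRECISION_FACTORS:
        print(f"27B @ {quant}: {weights_gb(27, quant):.1f} GB")
```

Halving the bytes per parameter halves the weight footprint, which is why FP4 is often the only way to fit a 27B-class model on a 16 GB card.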
KV Cache per token (scales with model size via GQA):
kv_per_token = max(0.08, 0.04 × √params_B) MB/token
This square-root scaling reflects that modern models use Grouped-Query Attention (GQA), where the number of KV heads grows much more slowly than the total parameter count. Here params_B is the parameter count in billions. Calibrated values:
| Model size | KV cache/token |
|---|---|
| 8B | ~0.11 MB |
| 14B | ~0.15 MB |
| 32B | ~0.23 MB |
| 70B | ~0.33 MB |
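
The calibrated values can be reproduced with a few lines of Python. This is a minimal sketch of the formula as stated in this README, not the application's actual code; the function name `kv_mb_per_token` is hypothetical.

```python
import math


def kv_mb_per_token(params_b: float) -> float:
    """KV-cache cost in MB per token, with a 0.08 MB floor for small models."""
    return max(0.08, 0.04 * math.sqrt(params_b))


if __name__ == "__main__":
    # Same model sizes as the table above.
    for size in (8, 14, 32, 70):
        print(f"{size}B: ~{kv_mb_per_token(size):.2f} MB/token")
```

Because 0.04 × √4 = 0.08, the floor applies to every model of 4B parameters or fewer.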
Qwen3.6 27B in FP4 with 32K context on a 16 GB GPU:
- Model: 27B × 0.5 = 13.5 GB
- KV cache: 32,768 × 0.00021 GB/token ≈ 6.9 GB → about 20.4 GB total, which does not fit; FP4 with roughly 8K context just fits (13.5 + 1.7 ≈ 15.2 GB, the 95% usable limit)
- For 32K context, drop to a 14B-class model (Phi-4 14B FP4 ≈ 11.9 GB total)
GLM-5.1 (754B MoE) in FP4 on 8× H200 (1128 GB total):
- Model: 754B × 0.5 = 377 GB (full weights resident across GPUs)
- Plenty of headroom for 256K+ context
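
The worked examples above can be checked with a short script. This is a sketch under the README's own formula and its 95% usable-VRAM rule; `estimate_vram_gb` and `fits` are hypothetical helper names, and parameter counts are given in billions.

```python
import math

PRECISION_FACTORS = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}


def estimate_vram_gb(params_b: float, quant: str, context_tokens: int) -> float:
    """Weights plus KV cache, in GB."""
    weights = params_b * PRECISION_FACTORS[quant]
    kv_gb_per_token = max(0.00008, 0.00004 * math.sqrt(params_b))
    return weights + context_tokens * kv_gb_per_token


def fits(params_b: float, quant: str, context_tokens: int,
         vram_gb: float, usable: float = 0.95) -> bool:
    """True if the estimate stays within the usable share of VRAM."""
    return estimate_vram_gb(params_b, quant, context_tokens) <= vram_gb * usable


if __name__ == "__main__":
    cases = [
        ("27B FP4, 32K, 16 GB", 27, "FP4", 32_768, 16),
        ("14B FP4, 32K, 16 GB", 14, "FP4", 32_768, 16),
        ("754B FP4, 256K, 8x H200", 754, "FP4", 262_144, 1128),
    ]
    for label, params_b, quant, ctx, vram in cases:
        total = estimate_vram_gb(params_b, quant, ctx)
        print(f"{label}: {total:.1f} GB, fits={fits(params_b, quant, ctx, vram)}")
```

The script uses the unrounded per-token cost, so its KV figure for the 27B case comes out slightly below the 6.9 GB obtained with the rounded 0.00021 value.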
- For each model and quantization level:
  - Calculate model memory: params × precision_factor
  - Calculate available context memory: (vram × 0.95) - model_memory
  - Compute KV cost: max(0.00008, 0.00004 × √params) GB/token
  - Find maximum context: context_memory / kv_per_token
  - Validate against minimum context constraint
- Score configurations based on priority:
  - Balanced: (params × 100) + (context / 100) - (precision × 50)
  - Model: (params × 1000) - (precision × 100) + (context / 1000)
  - Context: (context × 1000) + params
  - Quality: ((3 - precision) × 10000) + (params × 100) + (context / 1000)
- Return top 3 diversified recommendations + additional viable configurations
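
The steps above can be sketched in Python as follows. Several details are assumptions rather than facts about index.php: the README does not define the scale of `precision` in the scoring formulas, so the sketch assumes a quantization rank (FP16 = 0, FP8 = 1, FP4 = 2) that makes less-quantized configurations score higher under "Quality"; the four-model catalog, the 4,096-token minimum context, and all function names are hypothetical; and the diversification of the top 3 is not modeled.

```python
import math

# Bytes per parameter, plus an ASSUMED quantization rank used as "precision"
# in the scores (0 = least quantized). The README does not define that scale.
QUANTS = {"FP16": (2.0, 0), "FP8": (1.0, 1), "FP4": (0.5, 2)}

# Hypothetical catalog: (name, parameters in billions).
MODELS = [("Qwen3.5 9B", 9), ("Gemma 3 12B", 12),
          ("Phi-4 14B", 14), ("Qwen3.6 27B", 27)]


def max_context(params_b, quant, vram_gb):
    """Largest context in tokens that fits in 95% of VRAM (0 if weights don't fit)."""
    model_memory = params_b * QUANTS[quant][0]
    context_memory = vram_gb * 0.95 - model_memory
    if context_memory <= 0:
        return 0
    kv_gb_per_token = max(0.00008, 0.00004 * math.sqrt(params_b))
    return int(context_memory / kv_gb_per_token)


def score(params_b, precision, context, priority):
    """Scoring formulas from the list above; precision is the assumed rank."""
    formulas = {
        "balanced": params_b * 100 + context / 100 - precision * 50,
        "model": params_b * 1000 - precision * 100 + context / 1000,
        "context": context * 1000 + params_b,
        "quality": (3 - precision) * 10000 + params_b * 100 + context / 1000,
    }
    return formulas[priority]


def recommend(vram_gb, priority, min_context=4096, top_n=3):
    """Return the top_n (name, quant, context, score) tuples, best first."""
    candidates = []
    for name, params_b in MODELS:
        for quant, (_, rank) in QUANTS.items():
            context = max_context(params_b, quant, vram_gb)
            if context >= min_context:  # minimum context constraint
                candidates.append(
                    (name, quant, context, score(params_b, rank, context, priority)))
    candidates.sort(key=lambda c: c[3], reverse=True)
    return candidates[:top_n]


if __name__ == "__main__":
    for name, quant, context, _ in recommend(16, "balanced"):
        print(f"{name} ({quant}): up to {context:,} tokens")
```

With this catalog on a 16 GB card, the "model" priority puts the 27B FP4 configuration first, while the "context" priority favors the smallest model at FP4.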
- Backend: PHP 8.2-FPM
- Web Server: Nginx (Alpine)
- Base Image: Alpine Linux
- Image Size: ~50 MB
- Memory Usage: ~10 MB RAM
- Response Time: <100 ms
llm-optimizer/
├── index.php # Main application
├── Dockerfile # Container definition
├── docker-compose.yml # Local development
├── nginx.conf # Web server config
├── start.sh # Startup script
└── README.md
# Build and test
docker build -t llm-optimizer:test .
docker run -d -p 8888:80 --name test llm-optimizer:test
# Health check
curl http://localhost:8888
# Cleanup
docker stop test && docker rm test
- Docker support
- Port 80 available (or custom port mapping)
- Minimal resources: 128 MB RAM, 0.1 CPU
Docker Swarm
docker service create \
--name llm-optimizer \
--publish 80:80 \
--replicas 2 \
llm-optimizer:latest
Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-optimizer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-optimizer
  template:
    metadata:
      labels:
        app: llm-optimizer
    spec:
      containers:
        - name: llm-optimizer
          image: llm-optimizer:latest
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: llm-optimizer
spec:
  selector:
    app: llm-optimizer
  ports:
    - port: 80
      targetPort: 80
  type: LoadBalancer
Docker Compose Production
version: '3.8'
services:
  app:
    image: llm-optimizer:latest
    restart: always
    ports:
      - "80:80"
    deploy:
      resources:
        limits:
          memory: 256M
          cpus: '0.5'
The application works behind any reverse proxy (Traefik, Nginx, Caddy). It listens on port 80 and supports health checks at /.
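
A health probe against / can be as small as the sketch below. It treats any 2xx response as healthy, which is an assumption, since the README does not say what the endpoint returns. The function name `is_healthy` is hypothetical, and the URL in the example is the quick-start address from this README.

```python
import http.client
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on base_url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error or garbage reply.
        return False


if __name__ == "__main__":
    print(is_healthy("http://localhost:8080/"))
```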
None required. The application is stateless and requires no configuration.
To add your own models, edit index.php:
$models = [
    ['name' => 'Your Model', 'params' => 13, 'tags' => ['code']],
    // tags: 'code', 'vision', 'reasoning', 'multilingual'
];
To use a different host port, edit the mapping in docker-compose.yml:
ports:
  - "YOUR_PORT:80"
Question: "What can I run with decent context?"
Results:
- ⭐ Phi-4 14B (FP4) → 32K context
- ✓ Gemma 3 12B (FP8) → 64K context
- ✓ Qwen3.5 9B (FP4) → 64K context
Question: "Largest model with 32K+ context?"
Results:
- ⭐ Qwen3.5 122B-A10B (FP4) → 32K context
- ✓ Mistral Large 2 (FP4) → 64K context
Question: "Longest possible context window on 16 GB?"
Results:
- ⭐ Qwen3.5 2B (FP4) → 512K context
- ✓ Llama 3.2 3B (FP4) → 512K context
Contributions are welcome! Please feel free to submit a Pull Request.
To add a model:
- Edit the $models array in index.php
- Test locally
- Submit a PR with the model name, parameter count, and tags
To add a language:
- Add a translation array in index.php
- Add a language selector button
- Test all pages
MIT License - feel free to use this project for any purpose.
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
Made with ❤️ for the LLM community