Get up and running with TurboQuant+ KV cache compression in under 5 minutes.
- A GGUF model (download from HuggingFace)
- CMake 3.14+
- One of: Apple Silicon Mac, NVIDIA GPU (CUDA), AMD GPU (ROCm/HIP)
git clone https://github.com/TheTom/llama-cpp-turboquant
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
# Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# NVIDIA (CUDA)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
# AMD (ROCm/HIP)
cmake -B build -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
# Windows (CUDA, use Developer Command Prompt or WSL2)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -jRun a model with TurboQuant+ KV cache compression:
# Interactive chat
./build/bin/llama-cli -m model.gguf -ctk q8_0 -ctv turbo3 -fa on -ngl 99 -c 8192
# Server mode
./build/bin/llama-server -m model.gguf -ctk q8_0 -ctv turbo3 -fa on -ngl 99 -c 8192-ctk q8_0 -ctv turbo4 -fa onAsymmetric: full precision K, compressed V. Safe on all tested models including sensitive ones like Qwen2.5.
Known limitation: Models with
head_dim=64(e.g., GPT-OSS-120B) may crash or produce degraded output with turbo V compression. The WHT kernel requires sufficient dimensionality for CLT convergence, and d=64 is at the lower boundary. If you hit issues on a head_dim=64 model, fall back to-ctk q8_0 -ctv q8_0(no turbo compression). The paper validated at head_dim=128 and 256.
-ctk q8_0 -ctv turbo3 -fa on5.12x V compression with minimal quality loss (+1-2% PPL).
-ctk turbo3 -ctv turbo3 -fa onSymmetric. Works great on Llama 70B, Command-R+ 104B, Mistral 24B, Qwen3.5 MoE. Do NOT use on Qwen2.5 with Q4_K_M weights (catastrophic PPL).
| Your model | Recommended config |
|---|---|
| Q8_0 weights, any size | -ctk turbo4 -ctv turbo4 or -ctk turbo3 -ctv turbo3 |
| Q4_K_M, unknown model | -ctk q8_0 -ctv turbo4 (safe default) |
| Q4_K_M, Qwen2.5 | -ctk q8_0 -ctv turbo3 (must be asymmetric) |
| Q4_K_M, Llama/Mistral/Cohere 24B+ | -ctk turbo3 -ctv turbo3 (symmetric works) |
For full details see Configuration Recommendations.
These features are enabled by default on recent builds. No extra flags needed.
Skips V dequantization for positions where the softmax attention weight is negligible (< 1e-6). At long context, 90%+ of positions have negligible weight, so this eliminates roughly half the total dequant cost.
- +22.8% decode throughput at 32K context on MoE models
- Zero perplexity impact (validated at 32K, 50 chunks, wikitext-103)
- Works with any KV cache type (turbo3, q8_0, q4_0)
- Opt-out:
TURBO_SPARSE_V=0
See Sparse V paper.
Protects the first 2 + last 2 transformer layers with q8_0-V while compressing all remaining layers with turbo2-V. Recovers 37-91% of the quality gap between turbo2 and turbo3, with no speed penalty.
- Auto-enabled when you use
-ctv turbo2 - Opt-out:
TURBO_LAYER_ADAPTIVE=0 - Manually enable on older builds:
TURBO_LAYER_ADAPTIVE=7
See Boundary V paper.
# turbo2-V with Boundary V (auto-enabled)
# Best for extreme memory pressure. +5-9.5% PPL.
./build/bin/llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa on -ngl 99Help the project by sharing your numbers. Here's how to generate comparable results.
Important: Pick the right config before benchmarking. Symmetric turbo (e.g.,
-ctk turbo3 -ctv turbo3) is catastrophic on some model families (Qwen2.5, Qwen3 MoE) with Q4 weight quantization. If you benchmark with the wrong config, your PPL numbers will be misleading -- you'll measure the damage from a bad config, not the actual compression quality. Start with asymmetric (-ctk q8_0 -ctv turbo4) unless you've validated symmetric on your specific model. See Configuration Recommendations for which configs are safe on which models.
| Config | K bpv | V bpv | Avg bpv | KV compression vs fp16 |
|---|---|---|---|---|
| fp16/fp16 | 16.0 | 16.0 | 16.0 | 1.0x (baseline) |
| q8_0/q8_0 | 8.5 | 8.5 | 8.5 | 1.88x |
| q8_0/turbo4 | 8.5 | 4.25 | 6.375 | 2.51x |
| q8_0/turbo3 | 8.5 | 3.125 | 5.8125 | 2.75x |
| q8_0/turbo2 | 8.5 | 2.125 | 5.3125 | 3.01x |
| turbo4/turbo4 | 4.25 | 4.25 | 4.25 | 3.76x |
| turbo3/turbo3 | 3.125 | 3.125 | 3.125 | 5.12x |
Formula: compression = 16 / avg_bpv. For asymmetric: avg_bpv = (k_bpv + v_bpv) / 2.
llama-server logs the KV cache allocation at startup. Look for lines like:
llama_kv_cache_init: Metal KV buffer size = 1234.56 MiB
Compare this across configs to see real memory savings. You can also use llama-bench output which reports KV cache size.
# Wikitext-2 raw text for perplexity testing
# Option 1: Direct download (no HF account needed)
wget https://huggingface.co/nisten/llama3-8b-instruct-32k-gguf/raw/main/wiki.test.raw
# Option 2: Via Hugging Face CLI (requires HF token for some models)
# pip install huggingface-hub
# huggingface-cli download Salesforce/wikitext wikitext-2-raw-v1/test-00000-of-00001.parquet./build/bin/llama-perplexity \
-m model.gguf \
-ctk q8_0 -ctv q8_0 \
-fa on -ngl 99 \
-f wiki.test.raw \
-c 512 --chunks 20# Asymmetric (safe for any model)
./build/bin/llama-perplexity \
-m model.gguf \
-ctk q8_0 -ctv turbo3 \
-fa on -ngl 99 \
-f wiki.test.raw \
-c 512 --chunks 20
# Symmetric (if your model supports it)
./build/bin/llama-perplexity \
-m model.gguf \
-ctk turbo3 -ctv turbo3 \
-fa on -ngl 99 \
-f wiki.test.raw \
-c 512 --chunks 20# Short context
./build/bin/llama-bench -m model.gguf -ctk turbo3 -ctv turbo3 -fa 1
# Long context (compare turbo3 vs q8_0 at each length)
./build/bin/llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 -fa 1 -p 8192 -r 1
./build/bin/llama-bench -m model.gguf -ctk turbo3 -ctv turbo3 -fa 1 -p 8192 -r 1
./build/bin/llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 -fa 1 -p 32768 -r 1
./build/bin/llama-bench -m model.gguf -ctk turbo3 -ctv turbo3 -fa 1 -p 32768 -r 1Post your numbers in the GitHub discussion thread or open an issue on the repo.
Include: model name, weight quantization, GPU, VRAM, turbo config, PPL, and speed numbers.
If you're running 70B+ models at long context on Apple Silicon, macOS caps GPU memory at ~75% of RAM by default. This causes Metal to hang at ~49K context. Fix:
# Recommended: 90% of physical RAM (safe for sustained inference)
# 128GB Mac
sudo sysctl iogpu.wired_limit_mb=117964
# 96GB Mac
sudo sysctl iogpu.wired_limit_mb=88474
# 64GB Mac
sudo sysctl iogpu.wired_limit_mb=58982No reboot required. Resets on reboot.
- Configuration Recommendations (full config guide with all tested models)
- M5 Max Stress Test (70B and 104B results)
- Sparse V Dequantization
- Boundary V
- Block Size Optimization
- TurboQuant paper (Google Research)