Get up and running with TurboQuant+ KV cache compression in under 5 minutes.
- A GGUF model (download from HuggingFace)
- CMake 3.14+
- One of: Apple Silicon Mac, NVIDIA GPU (CUDA), AMD GPU (ROCm/HIP)
git clone https://github.com/TheTom/llama-cpp-turboquant
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
# Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# NVIDIA (CUDA)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
# AMD (ROCm/HIP)
cmake -B build -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
# AMD (ROCm/HIP) on Windows — clang requires explicit optimization flags
cmake -S . -B build -G Ninja \
-DGPU_TARGETS=gfx1201 \
-DGGML_HIP=ON \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_CXX_COMPILER=clang++ \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_FLAGS_RELEASE="-O3 -DNDEBUG" \
-DCMAKE_CXX_FLAGS_RELEASE="-O3 -DNDEBUG"
cmake --build build --config Release
# Windows (CUDA, use Developer Command Prompt or WSL2)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

Run a model with TurboQuant+ KV cache compression:
# Interactive chat
./build/bin/llama-cli -m model.gguf -ctk q8_0 -ctv turbo3 -fa on -ngl 99 -c 8192
# Server mode
./build/bin/llama-server -m model.gguf -ctk q8_0 -ctv turbo3 -fa on -ngl 99 -c 8192

-ctk q8_0 -ctv turbo4 -fa on

Asymmetric: high-precision (q8_0) K, compressed V. Safe on all tested models, including sensitive ones like Qwen2.5.

Known limitation: Models with head_dim=64 (e.g., GPT-OSS-120B) may crash or produce degraded output with turbo V compression. The WHT kernel requires sufficient dimensionality for CLT convergence, and d=64 is at the lower boundary. If you hit issues on a head_dim=64 model, fall back to -ctk q8_0 -ctv q8_0 (no turbo compression). The paper validated at head_dim=128 and 256.
-ctk q8_0 -ctv turbo3 -fa on

5.12x V compression with minimal quality loss (+1-2% PPL).

-ctk turbo3 -ctv turbo3 -fa on

Symmetric. Works great on Llama 70B, Command-R+ 104B, Mistral 24B, Qwen3.5 MoE. Do NOT use on Qwen2.5 with Q4_K_M weights (catastrophic PPL).
| Your model | Recommended config |
|---|---|
| Q8_0 weights, any size | -ctk turbo4 -ctv turbo4 or -ctk turbo3 -ctv turbo3 |
| Q4_K_M, unknown model | -ctk q8_0 -ctv turbo4 (safe default) |
| Q4_K_M, Qwen2.5 | -ctk q8_0 -ctv turbo3 (must be asymmetric) |
| Q4_K_M, Llama/Mistral/Cohere 24B+ | -ctk turbo3 -ctv turbo3 (symmetric works) |
For full details see Configuration Recommendations.
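The table above can be read as a small lookup. The sketch below is only an illustration of that reading: the function name `recommend_kv_config`, its parameters, and the family labels are hypothetical, and anything not covered by a table row falls back to the safe default as an assumption.

```python
def recommend_kv_config(weight_quant, family=None, params_b=None):
    """Return KV cache flags following the recommendation table (illustrative only).

    weight_quant: weight quantization of the GGUF, e.g. "Q8_0" or "Q4_K_M".
    family:       hypothetical family label, e.g. "qwen2.5", "llama", "mistral", "cohere".
    params_b:     parameter count in billions, if known.
    """
    wq = weight_quant.upper()
    fam = (family or "").lower()

    if wq == "Q8_0":
        # The table also lists "-ctk turbo3 -ctv turbo3" as an option here.
        return "-ctk turbo4 -ctv turbo4"
    if wq == "Q4_K_M":
        if fam == "qwen2.5":
            return "-ctk q8_0 -ctv turbo3"  # must be asymmetric
        if fam in {"llama", "mistral", "cohere"} and params_b is not None and params_b >= 24:
            return "-ctk turbo3 -ctv turbo3"  # symmetric works
    # Assumption: anything else gets the table's "safe default" row.
    return "-ctk q8_0 -ctv turbo4"


print(recommend_kv_config("Q4_K_M", "qwen2.5", 7))
print(recommend_kv_config("Q4_K_M", "llama", 70))
print(recommend_kv_config("Q4_K_M"))
```

Remember to add `-fa on` to whichever flags you pick, as in the commands above.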
These features are enabled by default on recent builds. No extra flags needed.
Skips V dequantization for positions where the softmax attention weight is negligible (< 1e-6). At long context, 90%+ of positions have negligible weight, so this eliminates roughly half the total dequant cost.
- +22.8% decode throughput at 32K context on MoE models
- Zero perplexity impact (validated at 32K, 50 chunks, wikitext-103)
- Works with any KV cache type (turbo3, q8_0, q4_0)
- Opt-out: TURBO_SPARSE_V=0
See Sparse V paper.
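To make the idea concrete, here is a minimal scalar sketch of skipping V dequantization below a weight threshold. It is not the project's kernel: the `(scale, ints)` row format, `dequantize_row`, and `sparse_v_attention` are hypothetical names chosen for illustration, and only the 1e-6 threshold comes from the description above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dequantize_row(row):
    """Hypothetical quantized row: (scale, list of ints) -> list of floats."""
    scale, quants = row
    return [scale * q for q in quants]


def sparse_v_attention(scores, v_rows, threshold=1e-6):
    """Weighted sum of V rows, skipping dequantization where weight < threshold."""
    weights = softmax(scores)
    out = None
    skipped = 0
    for w, row in zip(weights, v_rows):
        if w < threshold:
            skipped += 1  # negligible contribution: never dequantize this row
            continue
        vec = dequantize_row(row)
        if out is None:
            out = [0.0] * len(vec)
        for i, x in enumerate(vec):
            out[i] += w * x
    return out, skipped


scores = [10.0, 9.0, -20.0, -25.0]
v_rows = [(0.5, [2, 4]), (0.25, [8, -4]), (1.0, [100, 100]), (1.0, [-50, 50])]
out, skipped = sparse_v_attention(scores, v_rows)
print(f"skipped {skipped} of {len(v_rows)} rows, output = {out}")
```

The two low-scoring rows never reach `dequantize_row`, and the result differs from the dense sum only by the skipped weights, which are below the threshold.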
Protects the first 2 + last 2 transformer layers with q8_0-V while compressing all remaining layers with turbo2-V. Recovers 37-91% of the quality gap between turbo2 and turbo3, with no speed penalty.
- Auto-enabled when you use -ctv turbo2
- Opt-out: TURBO_LAYER_ADAPTIVE=0
- Manually enable on older builds: TURBO_LAYER_ADAPTIVE=7
See Boundary V paper.
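The layer assignment described above (first 2 and last 2 layers at q8_0-V, everything else at turbo2-V) can be sketched as follows. The function name `boundary_v_types` and the 32-layer example are assumptions for illustration, and the average-bpv figure simply applies the per-type bpv values from the compression table to that assignment.

```python
def boundary_v_types(n_layers, boundary=2, edge_type="q8_0", inner_type="turbo2"):
    """Per-layer V cache type: protect `boundary` layers at each end of the stack."""
    return [
        edge_type if i < boundary or i >= n_layers - boundary else inner_type
        for i in range(n_layers)
    ]


# Bits per value, taken from the compression table in this guide.
V_BPV = {"q8_0": 8.5, "turbo2": 2.125}

types = boundary_v_types(32)  # hypothetical 32-layer model
avg_v_bpv = sum(V_BPV[t] for t in types) / len(types)

print(types[:3], "...", types[-3:])
print(f"average V bpv: {avg_v_bpv:.3f}")
```

For deeper models the four protected layers are a smaller share of the stack, so the average V bpv moves closer to plain turbo2.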
# turbo2-V with Boundary V (auto-enabled)
# Best for extreme memory pressure. +5-9.5% PPL.
./build/bin/llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa on -ngl 99

Help the project by sharing your numbers. Here's how to generate comparable results.

Important: Pick the right config before benchmarking. Symmetric turbo (e.g., -ctk turbo3 -ctv turbo3) is catastrophic on some model families (Qwen2.5, Qwen3 MoE) with Q4 weight quantization. If you benchmark with the wrong config, your PPL numbers will be misleading: you'll measure the damage from a bad config, not the actual compression quality. Start with asymmetric (-ctk q8_0 -ctv turbo4) unless you've validated symmetric on your specific model. See Configuration Recommendations for which configs are safe on which models.
| Config | K bpv | V bpv | Avg bpv | KV compression vs fp16 |
|---|---|---|---|---|
| fp16/fp16 | 16.0 | 16.0 | 16.0 | 1.0x (baseline) |
| q8_0/q8_0 | 8.5 | 8.5 | 8.5 | 1.88x |
| q8_0/turbo4 | 8.5 | 4.25 | 6.375 | 2.51x |
| q8_0/turbo3 | 8.5 | 3.125 | 5.8125 | 2.75x |
| q8_0/turbo2 | 8.5 | 2.125 | 5.3125 | 3.01x |
| turbo4/turbo4 | 4.25 | 4.25 | 4.25 | 3.76x |
| turbo3/turbo3 | 3.125 | 3.125 | 3.125 | 5.12x |
Formula: compression = 16 / avg_bpv. For asymmetric: avg_bpv = (k_bpv + v_bpv) / 2.
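As a quick check of the formula, the snippet below recomputes the last column of the table from the per-type bpv values. The helper name `kv_compression` is just for illustration.

```python
# Bits per value (bpv) for each cache type, copied from the table above.
BPV = {"fp16": 16.0, "q8_0": 8.5, "turbo4": 4.25, "turbo3": 3.125, "turbo2": 2.125}


def kv_compression(k_type, v_type):
    """KV compression vs fp16 for a K/V cache type pair: 16 / avg_bpv."""
    avg_bpv = (BPV[k_type] + BPV[v_type]) / 2
    return 16.0 / avg_bpv


for k, v in [("q8_0", "q8_0"), ("q8_0", "turbo4"), ("q8_0", "turbo3"),
             ("q8_0", "turbo2"), ("turbo4", "turbo4"), ("turbo3", "turbo3")]:
    print(f"{k}/{v}: {kv_compression(k, v):.2f}x")
```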
llama-server logs the KV cache allocation at startup. Look for lines like:
llama_kv_cache_init: Metal KV buffer size = 1234.56 MiB
Compare this across configs to see real memory savings. You can also use llama-bench output which reports KV cache size.
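If you want a rough number to compare the logged buffer size against, a back-of-the-envelope estimate multiplies the number of cached values by the bpv figures from the table. This is a sketch under stated assumptions: the model shape below (32 layers, 8 KV heads, head_dim 128) is hypothetical, and the real allocation can differ because of padding, per-layer type overrides such as Boundary V, or backend details.

```python
def kv_cache_mib(n_layers, n_kv_heads, head_dim, n_ctx, k_bpv, v_bpv):
    """Estimate KV cache size in MiB from model shape and bits per value."""
    values_per_tensor = n_layers * n_kv_heads * head_dim * n_ctx  # K and V each hold this many
    total_bits = values_per_tensor * (k_bpv + v_bpv)
    return total_bits / 8 / (1024 ** 2)


# Hypothetical model shape at 8192 context.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)

fp16 = kv_cache_mib(**shape, k_bpv=16.0, v_bpv=16.0)
asym = kv_cache_mib(**shape, k_bpv=8.5, v_bpv=3.125)  # q8_0 K, turbo3 V

print(f"fp16/fp16:   {fp16:.1f} MiB")
print(f"q8_0/turbo3: {asym:.1f} MiB ({fp16 / asym:.2f}x smaller)")
```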
# Wikitext-2 raw text for perplexity testing
# Option 1: Direct download (no HF account needed)
wget https://huggingface.co/nisten/llama3-8b-instruct-32k-gguf/raw/main/wiki.test.raw
# Option 2: Via Hugging Face CLI (requires HF token for some models)
# pip install huggingface-hub
# huggingface-cli download Salesforce/wikitext wikitext-2-raw-v1/test-00000-of-00001.parquet

./build/bin/llama-perplexity \
-m model.gguf \
-ctk q8_0 -ctv q8_0 \
-fa on -ngl 99 \
-f wiki.test.raw \
-c 512 --chunks 20

# Asymmetric (safe for any model)
./build/bin/llama-perplexity \
-m model.gguf \
-ctk q8_0 -ctv turbo3 \
-fa on -ngl 99 \
-f wiki.test.raw \
-c 512 --chunks 20
# Symmetric (if your model supports it)
./build/bin/llama-perplexity \
-m model.gguf \
-ctk turbo3 -ctv turbo3 \
-fa on -ngl 99 \
-f wiki.test.raw \
-c 512 --chunks 20

# Short context
./build/bin/llama-bench -m model.gguf -ctk turbo3 -ctv turbo3 -fa 1
# Long context (compare turbo3 vs q8_0 at each length)
./build/bin/llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 -fa 1 -p 8192 -r 1
./build/bin/llama-bench -m model.gguf -ctk turbo3 -ctv turbo3 -fa 1 -p 8192 -r 1
./build/bin/llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 -fa 1 -p 32768 -r 1
./build/bin/llama-bench -m model.gguf -ctk turbo3 -ctv turbo3 -fa 1 -p 32768 -r 1

Post your numbers on PR #45. This is where we are tracking all test results before merge.
Include: model name, weight quantization, GPU, VRAM, turbo config, PPL, and speed numbers.
Backend support: Metal (Apple Silicon), CUDA (NVIDIA), and AMD HIP/ROCm (tested on RX 9070 XT, RDNA 4). The quantization step (llama-quantize) works on any platform. Windows builds supported for both MSVC (CUDA) and clang (HIP).
TQ4_1S applies WHT rotation + Lloyd-Max polar quantization to model weights (not just KV cache). This is post-training quantization -- no retraining or calibration required. Apply directly to Q8_0 GGUF models.
Code: Weight compression is now on main. No separate branch needed.
See the weight compression paper for full methodology and results.
Clone, build, compress a known-good model, and benchmark. Paste the llama-bench output back in the PR #45 comments.
# 1. Clone and build from PR #45
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout pr/tq4-weight-compression
# Pick your backend:
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release # Apple Silicon
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release # NVIDIA
cmake --build build -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)
# 2. Download a known-good test model (Qwen3.5-27B Q8_0, 26.6 GB)
# Fits on 32GB+ VRAM. For 24GB cards, use Qwen2.5-7B-Instruct Q8_0 instead.
huggingface-cli download Qwen/Qwen3.5-27B-GGUF qwen3.5-27b-q8_0.gguf --local-dir models/
# 3. Compress with Config I
python3 -c "
n_layers = 64 # 64 for Qwen3.5-27B, 28 for Qwen2.5-7B
boundary = 2
for i in range(boundary, n_layers - boundary):
    for t in ['attn_q', 'attn_k', 'attn_v', 'attn_output', 'ffn_gate', 'ffn_up']:
        print(f'blk.{i}.{t}.weight=tq4_1s')
    print(f'blk.{i}.ffn_down.weight=q4_k')
" > config_i.txt
./build/bin/llama-quantize --allow-requantize --tensor-type-file config_i.txt \
models/qwen3.5-27b-q8_0.gguf models/qwen3.5-27b-config-i.gguf Q8_0
# Show size before/after
python3 -c "
import os
src_path = 'models/qwen3.5-27b-q8_0.gguf'
dst_path = 'models/qwen3.5-27b-config-i.gguf'
src = os.path.getsize(src_path) / (1024**3)
dst = os.path.getsize(dst_path) / (1024**3)
print(f'{src_path}: {src:.2f} GB -> {dst_path}: {dst:.2f} GB ({(1-dst/src)*100:.0f}% smaller)')
"
# 4. Benchmark — run all 3 and paste the output in the PR
./build/bin/llama-bench -m models/qwen3.5-27b-q8_0.gguf -fa 1 -ngl 99 -p 512 -n 128
./build/bin/llama-bench -m models/qwen3.5-27b-config-i.gguf -fa 1 -ngl 99 -p 512 -n 128
./build/bin/llama-bench -m models/qwen3.5-27b-config-i.gguf -ctk q8_0 -ctv turbo4 -fa 1 -ngl 99 -p 512 -n 128

Expected results (Qwen3.5-27B):
- Q8_0 source: 26.6 GB
- Config I output: ~19.1 GB (28% smaller)
- Quality: +1.3% PPL
- Speed (Metal): 94-102% of Q8_0. Speed (CUDA): varies by GPU, collecting data
What to report: paste all 3 llama-bench outputs in PR #45 along with your GPU model. Crashes, errors, or unexpected output are equally valuable.
The Quick Test above uses Config I, which works for Qwen, Phi, and most non-Llama models. Adjust n_layers for your model (28 for 1.5B, 64 for 27B, 80 for 70B, etc.). Source must be Q8_0 GGUF.
For Llama-family models, use Hybrid (Q4_K for ALL FFN) or Premium (Q5_K/Q6_K FFN):
# Llama Hybrid: TQ4_1S attention, Q4_K all FFN
python3 -c "
n_layers = 80 # Llama 3.1 70B
for i in range(2, n_layers - 2):
    for t in ['attn_q', 'attn_k', 'attn_v', 'attn_output']:
        print(f'blk.{i}.{t}.weight=tq4_1s')
    for t in ['ffn_gate', 'ffn_up', 'ffn_down']:
        print(f'blk.{i}.{t}.weight=q4_k')
" > llama_hybrid.txt
# Llama Premium: TQ4_1S attention, Q5_K/Q6_K FFN (better quality, +8G)
python3 -c "
n_layers = 80
for i in range(4, n_layers - 4):
    for t in ['attn_q', 'attn_k', 'attn_v', 'attn_output']:
        print(f'blk.{i}.{t}.weight=tq4_1s')
    for t in ['ffn_gate', 'ffn_up']:
        print(f'blk.{i}.{t}.weight=q5_k')
    print(f'blk.{i}.ffn_down.weight=q6_k')
" > llama_premium.txt
./build/bin/llama-quantize \
--allow-requantize \
--tensor-type-file llama_hybrid.txt \
model-Q8_0.gguf model-hybrid.gguf Q8_0

Source requirement: Q8_0 GGUF. Models already at Q4_K_M have minimal compression headroom.
For full community benchmark data across 12+ GPUs and 11+ models, see Community Results.
Based on architecture analysis and tested results. Models marked with * are predictions based on code analysis of convert_hf_to_gguf.py and need community validation.
| Model | Size (Q8_0) | Size (Compressed) | Config | PPL Delta | Decode | NIAH |
|---|---|---|---|---|---|---|
| Qwen2.5-1.5B | 1.76G | 1.28G | Config I | +1.9% | 96% | 6/6 |
| Qwen3.5-27B | 26.6G | 19.1G | Config I | +1.3% | 99% | 3/3 |
| Qwen3.5-35B-A3B MoE | 34.4G | 21.6G | Config I | +1.4% | 102% | — |
| Qwen2.5-72B | 72.0G | 45.8G | Config I | +3.9% | 95% | 3/3 |
| Phi-4 14B | 14.5G | 9.3G | Config I | +1.0% | 254% | 3/3 |
| Llama 3.1 70B | 69.8G | 49.8G | Premium | +5.8% | fast | 3/3 |
| Llama 3.1 70B | 69.8G | 40.2G | Hybrid | +16% | 133% | 3/3 |
Models marked with * need community validation. Predictions based on convert_hf_to_gguf.py analysis.
| Model Family | Config | Expected PPL Delta | Size Reduction | Notes |
|---|---|---|---|---|
| Phi-3* | Config I | ~+1% | ~36% | Same family as tested Phi-4 |
| Llama 2/3/3.2* | Hybrid/Premium | ~+6-20% | ~29-42% | Same family as tested Llama 3.1 |
| CodeLlama* | Hybrid/Premium | ~+6-20% | ~29-42% | Extends LlamaModel |
| Mistral* | Hybrid/Premium | ~+6-20% | ~29-42% | Extends LlamaModel, has permutation |
| Mixtral* | Hybrid/Premium | ~+6-20% | ~29-42% | MoE, extends LlamaModel |
| Granite (IBM)* | Hybrid/Premium | ~+6-20% | ~29-42% | Extends LlamaModel |
| Llama 4* | Config I | ~+1-4% | ~30-38% | undo_permute=False, no permutation |
| Gemma 2/3* | Config I | ~+1-4% | ~30-38% | No permutation |
| Command-R/R+* | Config I | ~+1-4% | ~30-38% | No permutation |
| DeepSeek V2/R1* | Config I | unknown | ~30% | MLA attention, less certain |
| Falcon* | Config I | ~+1-4% | ~30-38% | No permutation |
| InternLM* | Config I | ~+1-4% | ~30-38% | No permutation |
| OLMo* | Config I | ~+1-4% | ~30-38% | No permutation |
| Yi* | Config I | ~+1-4% | ~30-38% | No permutation |
- Config I: attn+gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2. For Qwen, Phi, and non-Llama models.
- Hybrid: attn=TQ4_1S, ALL FFN=Q4_K, boundary 2+2. For Llama-family. Max compression, +16% PPL.
- Premium: attn=TQ4_1S, ffn_gate/up=Q5_K, ffn_down=Q6_K, boundary 4+4. For Llama-family. Best quality, +5.8% PPL.
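The three presets differ only in the per-tensor types and the boundary width, so one generator can cover all of them. The sketch below mirrors the `python3 -c` snippets earlier in this guide. The `PRESETS` table and the `tensor_type_lines` helper are hypothetical names, and the line format `blk.{i}.{tensor}.weight={type}` is copied from those snippets.

```python
ATTN = ["attn_q", "attn_k", "attn_v", "attn_output"]

# Per-preset boundary width and tensor -> quant type mapping, from the definitions above.
PRESETS = {
    "config_i": {
        "boundary": 2,
        "types": {**{t: "tq4_1s" for t in ATTN + ["ffn_gate", "ffn_up"]}, "ffn_down": "q4_k"},
    },
    "hybrid": {
        "boundary": 2,
        "types": {**{t: "tq4_1s" for t in ATTN},
                  **{t: "q4_k" for t in ["ffn_gate", "ffn_up", "ffn_down"]}},
    },
    "premium": {
        "boundary": 4,
        "types": {**{t: "tq4_1s" for t in ATTN},
                  "ffn_gate": "q5_k", "ffn_up": "q5_k", "ffn_down": "q6_k"},
    },
}


def tensor_type_lines(preset, n_layers):
    """Build tensor-type-file lines, leaving the boundary layers at the source type."""
    cfg = PRESETS[preset]
    b = cfg["boundary"]
    return [
        f"blk.{i}.{tensor}.weight={qtype}"
        for i in range(b, n_layers - b)
        for tensor, qtype in cfg["types"].items()
    ]


lines = tensor_type_lines("premium", 80)  # e.g. Llama 3.1 70B
print("\n".join(lines[:7]))
print(f"... {len(lines)} lines total")
```

Write the lines to a file and pass it to llama-quantize with --tensor-type-file, as in the commands above.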
The key discriminator is whether the model's GGUF conversion permutes Q/K weights for RoPE (undo_permute in convert_hf_to_gguf.py). Models without permutation tend to work well with Config I. However, permutation alone does not fully explain the sensitivity difference (see paper Section 5.7). Treat predictions as directional. Community validation on untested models is very welcome -- post results on PR #45.
# PPL baseline (Q8_0)
./build/bin/llama-perplexity -m model-Q8_0.gguf -ngl 99 -fa 1 \
-f wiki.test.raw -c 512 --chunks 20
# PPL compressed
./build/bin/llama-perplexity -m model-config-i.gguf -ngl 99 -fa 1 \
-f wiki.test.raw -c 512 --chunks 20
# Speed
./build/bin/llama-bench -m model-config-i.gguf -fa 1
# Combined with TurboQuant KV compression
./build/bin/llama-server -m model-config-i.gguf -ngl 99 -fa 1 \
-ctk q8_0 -ctv turbo3

If you're running 70B+ models at long context on Apple Silicon, macOS caps GPU memory at ~75% of RAM by default. This causes Metal to hang at ~49K context. Fix:
# Recommended: 90% of physical RAM (safe for sustained inference)
# 128GB Mac
sudo sysctl iogpu.wired_limit_mb=117964
# 96GB Mac
sudo sysctl iogpu.wired_limit_mb=88474
# 64GB Mac
sudo sysctl iogpu.wired_limit_mb=58982

No reboot required. The setting resets on reboot.
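For other RAM sizes, the values above are 90% of physical RAM expressed in MB (GB x 1024 x 0.9). A small helper, with the hypothetical name `wired_limit_mb`, reproduces them. It rounds down, so it can land 1 MB below a listed value (for example 88473 instead of 88474 on a 96GB machine), which makes no practical difference.

```python
def wired_limit_mb(ram_gb, percent=90):
    """Return `percent` of physical RAM in MB, rounded down."""
    return ram_gb * 1024 * percent // 100


for ram_gb in (64, 96, 128, 192):
    print(f"# {ram_gb}GB Mac")
    print(f"sudo sysctl iogpu.wired_limit_mb={wired_limit_mb(ram_gb)}")
```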
- Configuration Recommendations (full config guide with all tested models)
- M5 Max Stress Test (70B and 104B results)
- Sparse V Dequantization
- Boundary V
- Block Size Optimization
- TurboQuant paper (Google Research)