Getting Started with TurboQuant+

Get up and running with TurboQuant+ KV cache compression in under 5 minutes.

Prerequisites

A GGUF model (download from HuggingFace)
CMake 3.14+
One of: Apple Silicon Mac, NVIDIA GPU (CUDA), AMD GPU (ROCm/HIP)

Build

git clone https://github.com/TheTom/llama-cpp-turboquant
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache

# Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# NVIDIA (CUDA)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

# AMD (ROCm/HIP)
cmake -B build -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

# Windows (CUDA, use Developer Command Prompt or WSL2)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

Quick Start

Run a model with TurboQuant+ KV cache compression:

# Interactive chat
./build/bin/llama-cli -m model.gguf -ctk q8_0 -ctv turbo3 -fa on -ngl 99 -c 8192

# Server mode
./build/bin/llama-server -m model.gguf -ctk q8_0 -ctv turbo3 -fa on -ngl 99 -c 8192

Choosing Your Config

Safe default (works on any model)

-ctk q8_0 -ctv turbo4 -fa on

Asymmetric: full precision K, compressed V. Safe on all tested models including sensitive ones like Qwen2.5.

Known limitation: Models with head_dim=64 (e.g., GPT-OSS-120B) may crash or produce degraded output with turbo V compression. The WHT kernel requires sufficient dimensionality for CLT convergence, and d=64 is at the lower boundary. If you hit issues on a head_dim=64 model, fall back to -ctk q8_0 -ctv q8_0 (no turbo compression). The paper validated at head_dim=128 and 256.

More compression (works on most models)

-ctk q8_0 -ctv turbo3 -fa on

5.12x V compression with minimal quality loss (+1-2% PPL).

Maximum compression (validated large models only)

-ctk turbo3 -ctv turbo3 -fa on

Symmetric. Works great on Llama 70B, Command-R+ 104B, Mistral 24B, Qwen3.5 MoE. Do NOT use on Qwen2.5 with Q4_K_M weights (catastrophic PPL).

When to use what

Your model	Recommended config
Q8_0 weights, any size	`-ctk turbo4 -ctv turbo4` or `-ctk turbo3 -ctv turbo3`
Q4_K_M, unknown model	`-ctk q8_0 -ctv turbo4` (safe default)
Q4_K_M, Qwen2.5	`-ctk q8_0 -ctv turbo3` (must be asymmetric)
Q4_K_M, Llama/Mistral/Cohere 24B+	`-ctk turbo3 -ctv turbo3` (symmetric works)

For full details see Configuration Recommendations.

Built-in Optimizations

These features are enabled by default on recent builds. No extra flags needed.

Sparse V Dequant (Metal only)

Skips V dequantization for positions where the softmax attention weight is negligible (< 1e-6). At long context, 90%+ of positions have negligible weight, so this eliminates roughly half the total dequant cost.

+22.8% decode throughput at 32K context on MoE models
Zero perplexity impact (validated at 32K, 50 chunks, wikitext-103)
Works with any KV cache type (turbo3, q8_0, q4_0)
Opt-out: TURBO_SPARSE_V=0

See Sparse V paper.

Boundary V (auto-enabled with turbo2-V)

Protects the first 2 + last 2 transformer layers with q8_0-V while compressing all remaining layers with turbo2-V. Recovers 37-91% of the quality gap between turbo2 and turbo3, with no speed penalty.

Auto-enabled when you use -ctv turbo2
Opt-out: TURBO_LAYER_ADAPTIVE=0
Manually enable on older builds: TURBO_LAYER_ADAPTIVE=7

See Boundary V paper.

Maximum V Compression

# turbo2-V with Boundary V (auto-enabled)
# Best for extreme memory pressure. +5-9.5% PPL.
./build/bin/llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa on -ngl 99

Benchmarking

Help the project by sharing your numbers. Here's how to generate comparable results.

Important: Pick the right config before benchmarking. Symmetric turbo (e.g., -ctk turbo3 -ctv turbo3) is catastrophic on some model families (Qwen2.5, Qwen3 MoE) with Q4 weight quantization. If you benchmark with the wrong config, your PPL numbers will be misleading -- you'll measure the damage from a bad config, not the actual compression quality. Start with asymmetric (-ctk q8_0 -ctv turbo4) unless you've validated symmetric on your specific model. See Configuration Recommendations for which configs are safe on which models.

Compression reference

Config	K bpv	V bpv	Avg bpv	KV compression vs fp16
fp16/fp16	16.0	16.0	16.0	1.0x (baseline)
q8_0/q8_0	8.5	8.5	8.5	1.88x
q8_0/turbo4	8.5	4.25	6.375	2.51x
q8_0/turbo3	8.5	3.125	5.8125	2.75x
q8_0/turbo2	8.5	2.125	5.3125	3.01x
turbo4/turbo4	4.25	4.25	4.25	3.76x
turbo3/turbo3	3.125	3.125	3.125	5.12x

Formula: compression = 16 / avg_bpv. For asymmetric: avg_bpv = (k_bpv + v_bpv) / 2.

How to measure actual KV cache size

llama-server logs the KV cache allocation at startup. Look for lines like:

llama_kv_cache_init: Metal KV buffer size = 1234.56 MiB

Compare this across configs to see real memory savings. You can also use llama-bench output which reports KV cache size.

Step 1: Download test data

# Wikitext-2 raw text for perplexity testing
# Option 1: Direct download (no HF account needed)
wget https://huggingface.co/nisten/llama3-8b-instruct-32k-gguf/raw/main/wiki.test.raw

# Option 2: Via Hugging Face CLI (requires HF token for some models)
# pip install huggingface-hub
# huggingface-cli download Salesforce/wikitext wikitext-2-raw-v1/test-00000-of-00001.parquet

Step 2: Run baseline (always do this first)

./build/bin/llama-perplexity \
  -m model.gguf \
  -ctk q8_0 -ctv q8_0 \
  -fa on -ngl 99 \
  -f wiki.test.raw \
  -c 512 --chunks 20

Step 3: Run turbo configs

# Asymmetric (safe for any model)
./build/bin/llama-perplexity \
  -m model.gguf \
  -ctk q8_0 -ctv turbo3 \
  -fa on -ngl 99 \
  -f wiki.test.raw \
  -c 512 --chunks 20

# Symmetric (if your model supports it)
./build/bin/llama-perplexity \
  -m model.gguf \
  -ctk turbo3 -ctv turbo3 \
  -fa on -ngl 99 \
  -f wiki.test.raw \
  -c 512 --chunks 20

Step 4: Speed benchmarks

# Short context
./build/bin/llama-bench -m model.gguf -ctk turbo3 -ctv turbo3 -fa 1

# Long context (compare turbo3 vs q8_0 at each length)
./build/bin/llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 -fa 1 -p 8192 -r 1
./build/bin/llama-bench -m model.gguf -ctk turbo3 -ctv turbo3 -fa 1 -p 8192 -r 1
./build/bin/llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 -fa 1 -p 32768 -r 1
./build/bin/llama-bench -m model.gguf -ctk turbo3 -ctv turbo3 -fa 1 -p 32768 -r 1

Step 5: Share results

Post your numbers in the GitHub discussion thread or open an issue on the repo.

Include: model name, weight quantization, GPU, VRAM, turbo config, PPL, and speed numbers.

Apple Silicon: Large Models at Long Context

If you're running 70B+ models at long context on Apple Silicon, macOS caps GPU memory at ~75% of RAM by default. This causes Metal to hang at ~49K context. Fix:

# Recommended: 90% of physical RAM (safe for sustained inference)
# 128GB Mac
sudo sysctl iogpu.wired_limit_mb=117964

# 96GB Mac
sudo sysctl iogpu.wired_limit_mb=88474

# 64GB Mac
sudo sysctl iogpu.wired_limit_mb=58982

No reboot required. Resets on reboot.

Resources

Configuration Recommendations (full config guide with all tested models)
M5 Max Stress Test (70B and 104B results)
Sparse V Dequantization
Boundary V
Block Size Optimization
TurboQuant paper (Google Research)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Getting Started with TurboQuant+

Prerequisites

Build

Quick Start

Choosing Your Config

Safe default (works on any model)

More compression (works on most models)

Maximum compression (validated large models only)

When to use what

Built-in Optimizations

Sparse V Dequant (Metal only)

Boundary V (auto-enabled with turbo2-V)

Maximum V Compression

Benchmarking

Compression reference

How to measure actual KV cache size

Step 1: Download test data

Step 2: Run baseline (always do this first)

Step 3: Run turbo configs

Step 4: Speed benchmarks

Step 5: Share results

Apple Silicon: Large Models at Long Context

Resources

FilesExpand file tree

getting-started.md

Latest commit

History

getting-started.md

File metadata and controls

Getting Started with TurboQuant+

Prerequisites

Build

Quick Start

Choosing Your Config

Safe default (works on any model)

More compression (works on most models)

Maximum compression (validated large models only)

When to use what

Built-in Optimizations

Sparse V Dequant (Metal only)

Boundary V (auto-enabled with turbo2-V)

Maximum V Compression

Benchmarking

Compression reference

How to measure actual KV cache size

Step 1: Download test data

Step 2: Run baseline (always do this first)

Step 3: Run turbo configs

Step 4: Speed benchmarks

Step 5: Share results

Apple Silicon: Large Models at Long Context

Resources