TurboQuant Configuration Recommendations

Practical guidance for choosing TurboQuant settings based on your model, weight quantization, and hardware. Based on validated testing across Metal (M1 through M5), CUDA (RTX 3080 Ti through Blackwell), and HIP (AMD RDNA 4).

Multi-backend validated: Metal (Apple Silicon), CUDA (NVIDIA), and HIP (AMD) all produce consistent quality results. Speed characteristics vary by backend. On CUDA, turbo decode is faster than q8_0 on some models (dusterbloom's fused MMA FA). On Metal, turbo3 prefill beats q8_0 at 32K+ context on 70B.

Sparse V: Enabled by default on all Metal builds. Skips V dequant for negligible attention weights. No PPL impact, +22.8% decode on MoE at 32K. Opt-out: TURBO_SPARSE_V=0.
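As a rough illustration of why skipping negligible attention weights can leave the output essentially unchanged, the sketch below compares a full weighted sum over V with one that skips rows whose softmax weight falls below a cutoff. The cutoff `1e-6`, the function names, and the toy data are assumptions made for illustration; the actual kernel threshold and implementation are not described in this document.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(weights, values, min_weight=0.0):
    """Weighted sum of value rows; rows with weight < min_weight are skipped."""
    dim = len(values[0])
    out = [0.0] * dim
    used = 0
    for w, row in zip(weights, values):
        if w < min_weight:
            continue  # a real kernel would skip dequantizing this V row
        used += 1
        for i in range(dim):
            out[i] += w * row[i]
    return out, used

random.seed(0)
n_tokens, dim = 4096, 8
scores = [random.gauss(0.0, 4.0) for _ in range(n_tokens)]
values = [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
weights = softmax(scores)

dense_out, dense_used = attend(weights, values)
sparse_out, sparse_used = attend(weights, values, min_weight=1e-6)  # hypothetical cutoff
max_abs_diff = max(abs(a - b) for a, b in zip(dense_out, sparse_out))
print(f"rows used: {sparse_used}/{dense_used}, max abs diff: {max_abs_diff:.2e}")
```

Because every skipped weight is below the cutoff and the toy values are bounded by 1, the output difference is bounded by `n_tokens * cutoff`, which is why a small cutoff has no measurable quality cost in this toy setting.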

Block size 128: Now the default. turbo3 achieves 5.12x compression (was 4.57x at block_size=32) with zero quality cost. Validated on Metal and CUDA (PR #32 fix).

Validated Good

These configurations produce healthy PPL in current testing:

| Model class | Config | Evidence |
|---|---|---|
| Q8_0 weights, any size | -ctk turbo4 -ctv turbo4 | phi-4 +1.7%, 35B MoE +0.2%, 27B Dense healthy |
| Q8_0 weights, any size | -ctk turbo3 -ctv turbo3 | phi-4 +4.2%, 35B MoE +1.1% |
| Q8_0 weights, any size | -ctk q8_0 -ctv turbo4 | phi-4 +0.3% |
| Q8_0 weights, any size | -ctk q8_0 -ctv turbo3 | phi-4 +1.1% |
| Q4_K_M, larger models (24B+) | -ctk turbo3 -ctv turbo3 | Mistral-24B PPL 4.99, Llama-70B +11.4%, Command-R+ 104B +3.6% |
| Q4_K_M, larger models (70B+) | -ctk turbo4 -ctv turbo4 | Llama-70B +6.3%, Command-R+ 104B +1.9% |
| Q4_K_M, tested sensitive models | -ctk q8_0 -ctv turbo4 | Qwen2.5-7B +1.0% (Metal, CUDA, AMD all confirmed) |
| Q4_K_M, tested sensitive models | -ctk q8_0 -ctv turbo3 | Qwen2.5-7B +2.0% |
| Q4_K_M, 70B+ | -ctk q8_0 -ctv turbo2 | Llama-70B +9.5%, Command-R+ 104B +7.9% (asymmetric rescue) |

Validated Risky

These configurations produce catastrophic PPL in at least one tested model:

| Model class | Config | Evidence |
|---|---|---|
| Q4_K_M, Qwen2.5-7B | -ctk turbo4 -ctv turbo4 | PPL 218 (vs 6.58 baseline) |
| Q4_K_M, Qwen2.5-7B | -ctk turbo3 -ctv turbo3 | PPL 3556 |
| Q4_K_M, Qwen2.5-1.5B | -ctk turbo3 -ctv turbo3 | PPL 8641+ on both M5 and M2 |

Note: symmetric turbo on Q4_K_M is not universally broken. Mistral-24B, Llama-70B, and Command-R+ 104B Q4_K_M all handle it fine. Model family and size both matter. Qwen2.5 is consistently sensitive; Llama, Mistral, and Cohere are tolerant. Bigger models absorb quantization stacking better (104B turbo3 = +3.6% vs 70B turbo3 = +11.4%).

Experimental

These configurations showed promising results but have less validation depth:

| Model class | Config | Evidence |
|---|---|---|
| Q4_K_M, tested sensitive models | -ctk q8_0 -ctv turbo2 | Qwen2.5-7B +5.1% |
| Q8_0 weights | -ctk q8_0 -ctv turbo2 | phi-4 +3.1% |
| Q4_K_M, Qwen2.5-7B (AMD) | -ctk q8_0 -ctv turbo3 | NaN on HIP (Metal gets +2.0%). HIP-specific, under investigation |

⚠️ WARNING: q8_0/turbo3 produces NaN on HIP/AMD with models that have large K norms (e.g. Qwen2.5-7B where K norms can reach 274). This is under active investigation. Safe AMD alternative: q8_0/turbo4.

Boundary V (auto-enabled for turbo2-V)

A layer-aware V compression strategy that protects the first 2 + last 2 layers with q8_0-V while compressing all remaining layers with turbo2-V. Auto-enabled when -ctv turbo2 is set on recent builds. Opt-out: TURBO_LAYER_ADAPTIVE=0. On older builds, activate with TURBO_LAYER_ADAPTIVE=7.
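The layer policy described above can be sketched as a per-layer type assignment. This is only an illustration of the rule "first 2 + last 2 layers stay at q8_0-V"; the function name and parameters are hypothetical, and the real selection happens inside the build rather than in user code.

```python
def boundary_v_types(n_layers, n_protect=2, protected="q8_0", compressed="turbo2"):
    """Return the V cache type per layer, protecting the first and last n_protect layers."""
    types = []
    for layer in range(n_layers):
        is_boundary = layer < n_protect or layer >= n_layers - n_protect
        types.append(protected if is_boundary else compressed)
    return types

layout = boundary_v_types(28)  # e.g. a 28-layer model
print(layout[:3], layout[-3:])
print(layout.count("q8_0"), "protected,", layout.count("turbo2"), "compressed")
```

On a 28-layer model this leaves 4 layers at q8_0-V and 24 at turbo2-V, so the share of protected layers shrinks as depth grows.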

Validated across 4 models on Metal. Consistently recovers 37-91% of the turbo2-to-turbo3 quality gap. Benefit scales with model depth.

| Model | Layers | turbo2 PPL | Boundary V PPL | turbo3 PPL | Quality recovered |
|---|---|---|---|---|---|
| phi-4-Q8_0 | 40 | 4.835 | 4.784 | 4.742 | 55% |
| Qwen2.5-7B Q4_K_M | 28 | 6.911 | 6.835 | 6.707 | 37% |
| Qwen3.5-35B MoE | 64 | 5.257 | 5.148 | 5.137 | 91% |
| Qwen3.5-27B Dense | 36 | 6.534 | 6.423 | 6.273 | 42% |
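The "Quality recovered" column appears to be the share of the turbo2-to-turbo3 PPL gap that Boundary V closes. The sketch below assumes that definition (it is inferred from the numbers in the table, not stated by the source) and recomputes it from the rounded PPLs, so small differences from the table are expected.

```python
def quality_recovered(turbo2_ppl, boundary_ppl, turbo3_ppl):
    """Fraction of the turbo2 -> turbo3 PPL gap closed by Boundary V (assumed definition)."""
    gap = turbo2_ppl - turbo3_ppl
    return (turbo2_ppl - boundary_ppl) / gap

# (turbo2 PPL, Boundary V PPL, turbo3 PPL) as listed in the table above
rows = {
    "phi-4-Q8_0": (4.835, 4.784, 4.742),
    "Qwen2.5-7B Q4_K_M": (6.911, 6.835, 6.707),
    "Qwen3.5-35B MoE": (5.257, 5.148, 5.137),
    "Qwen3.5-27B Dense": (6.534, 6.423, 6.273),
}

recovered = {name: quality_recovered(*ppls) for name, ppls in rows.items()}
for name, frac in recovered.items():
    print(f"{name}: {frac:.1%}")
```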

Validated at 512 and 8K context. NIAH retrieval passed. No speed penalty. Independently validated by @Corianas_ on NanoGPT.

Exploratory Findings (2026-03-31): Extended validation on Qwen3.5-35B MoE Q8_0 with q8_0-K + turbo2-V (Boundary V auto-enabled) at 512c, 8K, and 32K context. PPL matched or beat q8_0/turbo3 at every tested length. At 32K decode on M5 Max, turbo2+BV was ~2-3% faster than turbo3-V (n=2 runs). This makes q8_0/turbo2 the best tested MoE long-context V config on this model. See MoE V-compression frontier.

See Layer-Aware V Compression for the original Boundary V writeup.

Recommended Starting Points

| Your situation | Start with | Why |
|---|---|---|
| Q8_0+ weights | -ctk turbo4 -ctv turbo4 | Best quality/compression balance |
| Q8_0+ weights, need more compression | -ctk turbo3 -ctv turbo3 | +4% PPL, 5.12x compression |
| Q4_K_M, unknown model | -ctk q8_0 -ctv turbo4 | Safe default, V still compressed |
| Q4_K_M, validated large model (24B+) | -ctk turbo3 -ctv turbo3 | If you've confirmed PPL is healthy |
| Q4_K_M, 70B+ | -ctk turbo4 -ctv turbo4 | +6.3% on 70B, +1.9% on 104B. Symmetric works on large Llama/Cohere |
| Maximum V compression | -ctk q8_0 -ctv turbo2 | +5-9.5% PPL, Boundary V auto-enabled |
| MoE long-context V compression | -ctk q8_0 -ctv turbo2 | Tested on Qwen3.5-35B MoE: 7.53x V, PPL within 1%, 32K decode ~2-3% faster than turbo3-V |

Important framing: Asymmetric q8_0-K + turbo-V is a quality/robustness rescue, not a speed optimization. You trade some decode throughput (K stays at q8_0 instead of turbo) for quality safety on sensitive models. If your model works fine with symmetric turbo, use symmetric.

MoE exception: On the tested Qwen3.5-35B MoE Q8_0 setup, q8_0/turbo2 with Boundary V (auto-enabled) is the best tested long-context V config: 7.53x V compression, PPL within 1% of q8_0 at 512c/8K/32K, quality matching or exceeding q8_0/turbo3, and 32K decode ~2-3% faster than turbo3-V (n=2 runs, M5 Max). This is a scoped result for this tested setup. See MoE V-compression frontier.

Why K Precision Matters More Than V

The attention mechanism computes softmax(Q * K^T) * V. K determines which tokens receive attention weight via softmax. Softmax amplifies small errors exponentially: a small shift in Q*K scores can flip which tokens dominate the output. V errors, by contrast, scale linearly through the weighted sum.

In current testing:

  • q8_0-K + turbo3-V on Qwen2.5-7B Q4_K_M gives PPL 6.71 (+2.0% vs baseline)
  • turbo3-K + q8_0-V on the same model gives PPL 3556 (catastrophic)

Same total bits, opposite directions, and a roughly 500x difference in PPL (3556 vs 6.71). K precision is the dominant lever.

This is why asymmetric -ctk q8_0 -ctv turbo3 can rescue models where symmetric -ctk turbo3 -ctv turbo3 fails. You still get V cache compression while maintaining the attention routing accuracy that K requires.
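The softmax argument can be checked on a toy example. The sketch below applies the same 3% relative error once to the Q*K scores and once to the V entries, then compares how far the attention output moves. The scores, values, and error pattern are made-up numbers chosen for illustration; they are not measurements and do not model the actual turbo quantization error.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_output(scores, values):
    """Scalar attention output: softmax(scores) weighted sum of values."""
    return sum(w * v for w, v in zip(softmax(scores), values))

scores = [30.0, 29.0, 20.0]   # toy Q*K scores for three cached tokens
values = [1.0, -1.0, 0.5]     # toy scalar V entries
rel_err = 0.03                # same 3% relative error applied to either side
signs = [-1.0, 1.0, -1.0]     # direction of the error per token

noisy_scores = [s * (1.0 + sign * rel_err) for s, sign in zip(scores, signs)]
noisy_values = [v * (1.0 + sign * rel_err) for v, sign in zip(values, signs)]

base_out = attention_output(scores, values)
k_err_out = attention_output(noisy_scores, values)   # error on the K side
v_err_out = attention_output(scores, noisy_values)   # error on the V side

print(f"baseline output:      {base_out:+.4f}")
print(f"with 3% score error:  {k_err_out:+.4f}")
print(f"with 3% value error:  {v_err_out:+.4f}")
```

With large score magnitudes, a 3% error is enough to swap which token wins the softmax, so the output changes sign. The same 3% on V moves the output by about 0.03. This is consistent with the observation that models with large K norms are the sensitive ones.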

Tested Configurations

Metal (Apple Silicon)

All results from Metal flash attention. PPL measured on wikitext-2-raw (512 context, 4 chunks) unless noted.

phi-4-14B (Q8_0 weights) — healthy across all configs

| K | V | M5 PPL | M2 PPL | vs q8_0 | Status |
|---|---|---|---|---|---|
| q8_0 | q8_0 | 4.690 | 4.691 | baseline | healthy |
| turbo4 | turbo4 | 4.770 | 4.787 | +1.7% / +2.0% | healthy |
| turbo3 | turbo3 | 4.886 | 4.956 | +4.2% / +5.7% | healthy |
| q8_0 | turbo4 | 4.702 | 4.693 | +0.3% | healthy |
| q8_0 | turbo3 | 4.742 | | +1.1% | healthy |
| q8_0 | turbo2 | 4.835 | | +3.1% | healthy |

Cross-hardware matched: M2 Pro and M5 Max produce equivalent results.

Qwen2.5-7B-Instruct (Q4_K_M weights) — sensitive to symmetric turbo, rescued by asymmetric K/V

| K | V | M5 PPL | M2 PPL | vs q8_0 | Status |
|---|---|---|---|---|---|
| q8_0 | q8_0 | 6.577 | 6.579 | baseline | healthy |
| q8_0 | turbo4 | 6.644 | 6.603 | +1.0% | rescued |
| q8_0 | turbo3 | 6.707 | 6.715 | +2.0% | rescued |
| q8_0 | turbo2 | 6.911 | | +5.1% | rescued |
| turbo4 | turbo4 | 217.7 | 227.5 | catastrophic | avoid |
| turbo3 | turbo3 | 3556 | 3778 | catastrophic | avoid |

Cross-hardware matched: both machines show identical quality patterns.

Llama-3.1-70B-Instruct (Q4_K_M weights, M5 Max 128GB) — tolerates symmetric turbo

| K | V | PPL | vs q8_0 | Status |
|---|---|---|---|---|
| q8_0 | q8_0 | 3.257 | baseline | healthy |
| q8_0 | turbo4 | 3.301 | +1.3% | healthy |
| q8_0 | turbo3 | 3.325 | +2.1% | healthy |
| q8_0 | turbo2 | 3.568 | +9.5% | healthy |
| turbo4 | turbo4 | 3.461 | +6.3% | healthy |
| turbo3 | turbo3 | 3.629 | +11.4% | usable |
| turbo2 | turbo2 | 5.161 | +58.5% | degraded |

Asymmetric rescue works: q8_0/turbo2 = +9.5% vs symmetric turbo2/turbo2 = +58.5%, roughly 6x less PPL degradation.
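For reference, the "vs q8_0" column is a relative PPL change against the q8_0/q8_0 baseline. A minimal sketch of that arithmetic, using the Llama-70B numbers from the table above (the helper name is hypothetical):

```python
def ppl_delta_pct(ppl, baseline_ppl):
    """Relative PPL change versus the baseline, in percent."""
    return (ppl / baseline_ppl - 1.0) * 100.0

baseline = 3.257                                  # q8_0 / q8_0
asymmetric = ppl_delta_pct(3.568, baseline)       # q8_0 / turbo2
symmetric = ppl_delta_pct(5.161, baseline)        # turbo2 / turbo2

print(f"q8_0/turbo2:   {asymmetric:+.1f}%")
print(f"turbo2/turbo2: {symmetric:+.1f}%")
print(f"degradation ratio: {symmetric / asymmetric:.1f}x")
```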

Long context PPL (wikitext-2-raw):

| K | V | Context | PPL |
|---|---|---|---|
| q8_0 | q8_0 | 8K | 3.617 |
| q8_0 | turbo4 | 8K | 3.639 |
| q8_0 | turbo3 | 8K | 3.653 |
| turbo3 | turbo3 | 32K | 4.839 |
| q8_0 | q8_0 | 48K | 3.575 |
| turbo3 | turbo3 | 48K | 4.019 |

NIAH: 30/30 perfect (turbo3 = q8_0, 5 depths x 3 context lengths). See 70B stress test.

Command-R+ 104B (Q4_K_M weights, M5 Max 128GB) — tolerates symmetric turbo, 128K validated

| K | V | PPL | vs q8_0 | Status |
|---|---|---|---|---|
| q8_0 | q8_0 | 6.192 | baseline | healthy |
| q8_0 | turbo4 | 6.211 | +0.3% | healthy |
| q8_0 | turbo3 | 6.296 | +1.7% | healthy |
| q8_0 | turbo2 | 6.678 | +7.9% | healthy |
| turbo4 | turbo4 | 6.312 | +1.9% | healthy |
| turbo3 | turbo3 | 6.415 | +3.6% | healthy |
| turbo2 | turbo2 | 7.049 | +13.8% | usable |

Largest model tested. 104B tolerates symmetric turbo better than 70B (bigger models = more headroom). turbo3 prefill faster than q8_0 at 32K (64.5 vs 62.3 t/s).

Long context PPL (turbo3/turbo3, wikitext-2-raw):

| Context | PPL | Pass time |
|---|---|---|
| 48K | 3.672 | 931s |
| 64K | 4.321 | 1481s |
| 96K | 4.170 | 2966s |
| 128K | 4.024 | 4996s |

128K full native context achieved by raising macOS GPU memory cap from default ~75% of physical RAM. Recommended setting is 90% to avoid kernel panics under sustained load. Peak memory 74 GB of 128 GB.

# Recommended: 90% of physical RAM (safe for sustained inference)
# 128GB Mac
sudo sysctl iogpu.wired_limit_mb=117964
# 96GB Mac
sudo sysctl iogpu.wired_limit_mb=88474
# 64GB Mac
sudo sysctl iogpu.wired_limit_mb=58982
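For RAM sizes not listed above, the 90% value can be derived the same way. This is a small sketch assuming the setting is expressed in MiB of physical RAM; the helper name is hypothetical. It uses integer floor division, so the result can differ by 1 MiB from a rounded figure (it gives 88473 for 96GB, where the list above has 88474).

```python
def wired_limit_mb(ram_gb, fraction_pct=90):
    """MiB value for iogpu.wired_limit_mb at the given percentage of physical RAM."""
    total_mib = ram_gb * 1024
    return total_mib * fraction_pct // 100

for ram_gb in (64, 96, 128, 192):
    print(f"# {ram_gb}GB Mac")
    print(f"sudo sysctl iogpu.wired_limit_mb={wired_limit_mb(ram_gb)}")
```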

NIAH: 10/10 perfect at 4K and 8K (turbo3). 16K timed out due to slow decode on 104B, not retrieval failure.

See M5 Max stress test.

Qwen3.5-35B-A3B MoE (Q8_0 weights) — healthy

| K | V | M5 PPL | Status |
|---|---|---|---|
| turbo3 | turbo3 | 5.130 | healthy |
| turbo4 | turbo4 | 5.078 | healthy |

Qwen3.5-27B Dense (Q8_0 weights) — healthy

| K | V | M5 PPL | Status |
|---|---|---|---|
| turbo3 | turbo3 | 6.339 | healthy |

Mistral-Small-24B-Instruct (Q4_K_M weights) — healthy at this size

| K | V | M5 PPL | Status |
|---|---|---|---|
| turbo3 | turbo3 | 4.987 | healthy |

This shows Q4_K_M is not universally incompatible with symmetric turbo. The 24B model has enough capacity to absorb the quantization stacking that breaks the 7B model.

CUDA (NVIDIA)

Community-validated on RTX 3080 Ti, RTX 3090, RTX 4090, RTX 5090, and DGX Spark (Blackwell sm_121).

seanrasch — Qwen3.5-4B Q4_K_M (RTX 3090, block_size=128)

| K | V | PPL | vs fp16 | Status |
|---|---|---|---|---|
| fp16 | fp16 | 10.037 | baseline | |
| q8_0 | turbo4 | 10.124 | +0.9% | healthy |
| q8_0 | turbo3 | 10.163 | +1.3% | healthy |
| turbo3 | turbo3 | 10.247 | +2.1% | healthy |
| q8_0 | turbo2 | 10.568 | +5.3% | healthy |

Qwen3.5 hybrid architecture handles symmetric turbo without the blowup seen on Qwen2.x.

dusterbloom — Decode Speed (RTX 3090, block_size=128, fused MMA FA)

| Model | Decode vs q8_0 |
|---|---|
| Gemma-3-12B | +7.3% faster |
| Qwen3.5-35B MoE | +4.2% faster |
| Nemotron-9B | +3.4% faster |
| Qwen3.5-9B | +0.8% faster |

Prefill near parity (-0.3% to +2.5% at pp8192).

Char__Bob — Mistral-Small-24B (RTX 3090)

q8_0 OOMs at 128K on 3090. turbo3 enables 128K (18,430 MiB fits). Decode flat ~43 t/s.
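To see why a lower-bit KV cache can make 128K fit where q8_0 does not, here is a rough size estimate. The bits-per-value figures are 16 for f16, 8.5 for q8_0 (32 int8 values plus one fp16 scale per block), and 16/5.12 for turbo3 (from the 5.12x compression figure, assuming it is measured against f16). The layer count, KV-head count, and head dimension are illustrative assumptions, not values taken from this document; read the real ones from your model's metadata. The estimate covers the KV cache only and ignores weights and compute buffers.

```python
def kv_cache_mib(n_layers, n_kv_heads, head_dim, n_ctx, bits_k, bits_v):
    """Approximate KV cache size in MiB for the given per-value bit widths."""
    elems_per_side = n_layers * n_kv_heads * head_dim * n_ctx  # K and V each
    total_bits = elems_per_side * (bits_k + bits_v)
    return total_bits / 8 / (1024 * 1024)

BITS_PER_VALUE = {
    "f16": 16.0,
    "q8_0": 8.5,              # 34 bytes per 32-value block
    "turbo3": 16.0 / 5.12,    # derived from the 5.12x figure, assuming an f16 reference
}

# Assumed model shape for illustration only.
dims = dict(n_layers=40, n_kv_heads=8, head_dim=128, n_ctx=131072)

sizes = {
    name: kv_cache_mib(bits_k=bits, bits_v=bits, **dims)
    for name, bits in BITS_PER_VALUE.items()
}
for name, mib in sizes.items():
    print(f"{name:>6}/{name:<6} at 128K: {mib:,.0f} MiB")
```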

seanrasch — NIAH (RTX 3080 Ti)

Qwen3.5-9B turbo3/turbo3: 33/33 (100%), matches f16.

HIP (AMD)

First AMD validation. RX 9070 XT (RDNA 4, gfx1201), Windows 11, HIP SDK 7.1. First attempt, no tuning.

Qwen2.5-7B-Instruct (Q4_K_M weights)

| K | V | PPL | vs q8_0 | Status |
|---|---|---|---|---|
| q8_0 | q8_0 | 7.794 | baseline | OK |
| q8_0 | turbo4 | 7.876 | +1.0% | recommended |
| q8_0 | turbo3 | NaN | | HIP-specific issue |
| turbo4 | turbo4 | 401.4 | catastrophic | Q4_K_M sensitivity |
| turbo3 | turbo3 | 81,277 | catastrophic | Q4_K_M sensitivity |

Asymmetric q8_0/turbo4 confirmed on AMD. The symmetric turbo failure on Qwen2.5-7B Q4_K_M is consistent across all three GPU vendors. The q8_0/turbo3 NaN is HIP-specific (Metal gets +2.0%). No speed penalty on working configs.

See Windows RDNA 4 Setup Guide for build instructions and 9 gotchas.

Practical Guidance

Strong base weights (Q8_0, Q6_K, or higher)

Symmetric turbo works well. Start with turbo4 for best quality, turbo3 for more compression:

# Best quality with compression
llama-server -m model-Q8_0.gguf -ctk turbo4 -ctv turbo4 -fa 1

# More compression, still healthy
llama-server -m model-Q8_0.gguf -ctk turbo3 -ctv turbo3 -fa 1

Low-bit base weights (Q4_K_M) on sensitive models

Qwen2.5-7B Q4_K_M fails catastrophically with symmetric turbo but is rescued by asymmetric K/V. Not all Q4_K_M models are sensitive. If unsure, start asymmetric:

# Recommended: near-baseline quality with V compression
llama-server -m model-Q4_K_M.gguf -ctk q8_0 -ctv turbo4 -fa 1

# More V compression, still good (+2% PPL)
llama-server -m model-Q4_K_M.gguf -ctk q8_0 -ctv turbo3 -fa 1

# Maximum V compression (+5-9.5% PPL, Boundary V auto-enabled)
llama-server -m model-Q4_K_M.gguf -ctk q8_0 -ctv turbo2 -fa 1

Low-bit base weights (Q4_K_M) on larger or less sensitive models

Symmetric turbo3 works on Mistral-24B Q4_K_M (PPL 4.99), Llama-70B Q4_K_M (PPL 3.629), and Command-R+ 104B Q4_K_M (PPL 6.415). Validate on your specific model:

# Try symmetric first on large Q4_K_M models
llama-server -m large-model-Q4_K_M.gguf -ctk turbo3 -ctv turbo3 -fa 1

# Fall back to asymmetric if quality is poor
llama-server -m large-model-Q4_K_M.gguf -ctk q8_0 -ctv turbo3 -fa 1

Unknown models

Start conservative and validate:

# Safe starting point for any model
llama-server -m model.gguf -ctk q8_0 -ctv turbo4 -fa 1
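The "try the aggressive config first, fall back if quality is poor" workflow can be written down as a small selection helper. This is a sketch, not part of TurboQuant: the function name, the 5% tolerance, and the preference order are assumptions you would choose yourself, and the PPL values must come from your own measurements. The sample numbers below are the Qwen2.5-7B Q4_K_M results from the Metal and HIP tables.

```python
import math

def pick_config(baseline_ppl, measured, preference, max_delta_pct=5.0):
    """Return the first (K, V) config in preference order whose PPL is within tolerance."""
    for config in preference:
        ppl = measured.get(config)
        if ppl is None or math.isnan(ppl):
            continue  # not measured, or the run produced NaN
        delta_pct = (ppl / baseline_ppl - 1.0) * 100.0
        if delta_pct <= max_delta_pct:
            return config, delta_pct
    return ("q8_0", "q8_0"), 0.0  # nothing passed: stay on the baseline cache type

# Most aggressive first; this ordering is the caller's choice.
preference = [
    ("turbo3", "turbo3"),
    ("turbo4", "turbo4"),
    ("q8_0", "turbo3"),
    ("q8_0", "turbo4"),
]

metal = {
    ("turbo3", "turbo3"): 3556.0,
    ("turbo4", "turbo4"): 217.7,
    ("q8_0", "turbo3"): 6.707,
    ("q8_0", "turbo4"): 6.644,
}
hip = {
    ("turbo3", "turbo3"): 81277.0,
    ("turbo4", "turbo4"): 401.4,
    ("q8_0", "turbo3"): float("nan"),
    ("q8_0", "turbo4"): 7.876,
}

metal_choice, metal_delta = pick_config(6.577, metal, preference)
hip_choice, hip_delta = pick_config(7.794, hip, preference)
print("Metal:", "-ctk", metal_choice[0], "-ctv", metal_choice[1], f"({metal_delta:+.1f}%)")
print("HIP:  ", "-ctk", hip_choice[0], "-ctv", hip_choice[1], f"({hip_delta:+.1f}%)")
```

With these inputs the helper skips both symmetric configs on either backend, lands on q8_0/turbo3 for Metal, and moves on to q8_0/turbo4 for HIP because the q8_0/turbo3 run returned NaN.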

Block Size

The default storage block size is 128 elements (changed from 32). turbo3 achieves 5.12x compression (vs 4.57x at block_size=32) with zero quality cost.
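The two compression figures are consistent with a simple model: each value costs a fixed number of payload bits, and each block carries a fixed overhead that is amortized over the block. The sketch below assumes a 16-bit per-block overhead and an f16 reference; both are inferred from the reported ratios rather than taken from the block layout, which may differ in detail.

```python
def compression_vs_f16(bits_per_value, block_size, overhead_bits_per_block=16):
    """Compression ratio relative to f16 under the assumed payload + per-block overhead model."""
    effective_bits = bits_per_value + overhead_bits_per_block / block_size
    return 16.0 / effective_bits

turbo3_bs32 = compression_vs_f16(3, 32)
turbo3_bs128 = compression_vs_f16(3, 128)
turbo2_bs128 = compression_vs_f16(2, 128)

print(f"turbo3, block_size=32:  {turbo3_bs32:.2f}x")
print(f"turbo3, block_size=128: {turbo3_bs128:.2f}x")
print(f"turbo2, block_size=128: {turbo2_bs128:.2f}x")
```

Under this model the larger block only spreads the same overhead over four times as many values, which fits the observation that quality is unchanged.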

Validated on:

  • Metal: 3 architectures (dense, dense Qwen, hybrid MoE), 3 context lengths (512, 8K, 32K), 2 Apple Silicon platforms (M5 Max, M2 Pro). Both symmetric and asymmetric paths.
  • CUDA: Validated via PR #32 fix (warp-to-block mapping for block_size=128). PPL identical on RTX 3090 sm_86.
  • NIAH: 3/3 pass at block_size=128 (phi-4, symmetric turbo3, 4K context).

On M2 Pro (Qwen2.5-1.5B, q8_0-K + turbo3-V), block_size=128 improved decode speed by 3-7%. This gain was not observed on M5 Max and may be specific to bandwidth-constrained hardware.

See block size study for the full data.

Notes and Caveats

  • Multi-backend: Metal, CUDA, and HIP all validated. Asymmetric q8_0-K + turbo4-V confirmed across all three GPU vendors on Qwen2.5-7B Q4_K_M.
  • Model sensitivity varies: Qwen2.5 is consistently sensitive to symmetric turbo on Q4_K_M. Llama, Mistral, and Qwen3.5 tolerate it. Test before deploying on new model families.
  • turbo2 as V cache: turbo2-V with Boundary V auto-enabled gives +5-9.5% PPL on the Q4_K_M models tested (+3.1% on phi-4 Q8_0). Boundary V recovers 37-91% of the turbo2-to-turbo3 quality gap.
  • PPL is measured at 512 context with 4 chunks unless noted. Long-context PPL validated up to 128K on Command-R+ 104B.
  • 104B at 128K context: Confirmed on M5 Max 128GB with Command-R+ Q4_K_M. turbo3/turbo3 PPL 4.024. Requires raising macOS GPU memory cap: sudo sysctl iogpu.wired_limit_mb=117964 (90% of 128GB, safe for sustained inference). Without this, Metal hangs at ~49K context on 70B+ models. Setting above 90% risks kernel panics under sustained load (community reported). GGML_METAL_NO_RESIDENCY=1 is not needed (isolation testing confirmed). See M5 Max stress test.
  • turbo3 prefill faster than q8_0 at 32K: Confirmed on both 70B (+7.4%) and 104B (+3.5%). Smaller KV cache reduces memory bandwidth during attention.
  • Community: 30+ testers across M1/M2/M3/M5 Mac, RTX 3080 Ti/3090/4090/5090, DGX Spark Blackwell, AMD RX 9070 XT.