TurboQuant+ Quality Benchmarks

Why These Benchmarks Matter

Speed without quality is useless. We claim 4.6× KV cache compression at 91-97% of q8_0 speed. But we have ZERO quantitative quality data on the actual llama.cpp build. "Coherent text" and Python cosine similarity (0.985) are not sufficient.

The llama.cpp CONTRIBUTING.md requires for new quant types:

Perplexity vs FP16/BF16
KL divergence data
Performance data via llama-bench

Papers (KIVI, QJL, RotateKV) all use wikitext-2 perplexity as the primary metric. Prince Canuma validated with NIAH at 8K/32K/64K context.

What We're Measuring

1. Perplexity (wikitext-2)

What: How well the model predicts next tokens over a standard text corpus
Why: Gold standard for LLM quality. Lower = better. Sensitive to KV cache quality.
Target: turbo3 within 1% of q8_0 perplexity. If >2%, quality problem.
Comparison: f16, q8_0, q4_0, q4_1, q5_0, turbo3

2. KL Divergence vs f16

What: How different the output probability distribution is vs full precision
Why: Measures distributional shift, not just top-token accuracy
Metrics: mean KLD, delta-p RMS, same-top-p percentage
Required by: llama.cpp CONTRIBUTING.md for upstream acceptance

3. Passkey Retrieval (built-in NIAH)

What: Can the model retrieve a specific passkey from a long haystack?
Why: Tests attention pattern preservation over long context
Comparison: f16 vs turbo3 at various context lengths and needle positions

4. Generation Quality (qualitative)

What: Side-by-side text generation comparison
Why: Catches issues that aggregate metrics miss (repetition, coherence)

Test Configuration

Model: Qwen 3.5 35B-A3B MoE Q8_0 (primary), Qwopus v2 27B Q8_0 (secondary)
Hardware: Apple M5 Max 128GB, Metal GPU
Dataset: wikitext-2-raw (downloaded via scripts/get-wikitext-2.sh)
Context: 512 tokens for perplexity (fast), 2048+ for NIAH
Chunks: 8 for initial, 32 for final numbers

Commands

Download dataset

cd ~/local_llms/llama.cpp && bash scripts/get-wikitext-2.sh

Perplexity suite

LLAMA=~/local_llms/llama.cpp/build-turbo/bin
MODEL=~/local_llms/models/Qwen3.5-35B-A3B-Q8_0.gguf
WIKI=~/local_llms/llama.cpp/wikitext-2-raw/wiki.test.raw

for ct in f16 q8_0 q4_0 q4_1 q5_0 turbo3; do
  echo "=== $ct ==="
  $LLAMA/llama-perplexity -m $MODEL -f $WIKI -c 512 \
    -ctk $ct -ctv $ct -fa on --chunks 8 -ngl 99 \
    2>&1 | tee results/ppl_${ct}.log
done

KL Divergence

# Save f16 baseline logits
$LLAMA/llama-perplexity -m $MODEL -f $WIKI -c 512 \
  -ctk f16 -ctv f16 -fa on --chunks 8 -ngl 99 \
  --kl-divergence-base results/f16_logits.kld

# Compute KL divergence for each cache type
for ct in q8_0 q4_0 turbo3; do
  $LLAMA/llama-perplexity -m $MODEL -f $WIKI -c 512 \
    -ctk $ct -ctv $ct -fa on --chunks 8 -ngl 99 \
    --kl-divergence --kl-divergence-base results/f16_logits.kld \
    2>&1 | tee results/kld_${ct}.log
done

Results

To be filled after running benchmarks.

Perplexity (wikitext-2, 512 context)

Cache Type	Bits/val	Perplexity	vs f16	vs q8_0
f16	16	—	baseline	—
q8_0	8	—	—	baseline
q4_0	4	—	—	—
q4_1	4.5	—	—	—
q5_0	5.5	—	—	—
turbo3	3.5	—	—	—

KL Divergence vs f16

MoE (Qwen3.5-35B-A3B Q8_0):

Cache Type	Mean KLD	Delta-p RMS	Same Top-p %
q8_0	0.001549	1.231%	98.43%
q4_0	0.008091	2.753%	95.83%
turbo3	0.016145	4.090%	94.31%

Dense (Qwen3.5-27B Q8_0):

Cache Type	Mean KLD	Delta-p RMS	Same Top-p %
q8_0	0.000018	0.127%	99.90%
q4_0	0.002741	1.437%	97.65%
turbo3	0.009900	2.738%	95.98%

See full analysis.

Passkey Retrieval

Cache Type	1K	2K	4K	8K
f16	—	—	—	—
q8_0	—	—	—	—
turbo3	—	—	—	—

INITIAL RESULTS — QUALITY FAILURE

Perplexity (wikitext-2, 512 context, 8 chunks)

Cache Type	Bits/val	Perplexity	vs f16
f16	16	6.121	baseline
q8_0	8	6.111	-0.16% (better!)
q4_0	4	6.142	+0.34%
turbo3	3.5	165.6	+2607% ❌

turbo3 perplexity is 27× worse than f16. This is catastrophic.

The model generates "coherent-looking" text but the actual predictions are garbage. Speed benchmarks were meaningless — we were measuring how fast the model produces wrong answers.

Root cause investigation needed

The block size 32 change + MSE-only + pre-rotate-queries may have introduced a bug:

The norm handling: storing full 128-element group norm in each 32-element block but dequant treats it as a per-block norm — scale mismatch?
The pre-rotate-queries: rotation matrix might not match the quantize rotation
The 3-bit index split (qs + signs) might have a packing error
The non-vec flash attention instantiation might use wrong nl parameter

MUST FIX before claiming any quality results.

Quality Bisection Log

Goal: find which change broke perplexity (165.6 vs 6.1 baseline).

Changes applied (in order):

Pre-rotate-queries: rotation moved from dequant to Q in attention graph
MSE-only: dropped QJL, 3-bit centroids (2-bit qs + 1-bit signs)
Block size 32: storage blocks shrunk from 128 to 32 elements

Bisection plan:

Test A: revert block 32 → 128, keep MSE-only + pre-rotate
Test B: revert MSE-only → 2bit+QJL, keep block 128 + pre-rotate
Test C: revert pre-rotate → rotation in dequant, keep block 128 + 2bit+QJL
If A/B/C all still bad: check the rotation matrix consistency

Bisection Results

Test A: Disable Q rotation → PPL 165.6 (STILL BAD)

Pre-rotate-queries is NOT the problem. Q rotation has no effect on perplexity.

Root Cause Found: V CACHE IN ROTATED SPACE

The bug: both K and V caches go through the same turbo3 quantize (which rotates) and dequant (which returns rotated values without inverse rotation).

For K cache: this is correct because Q is also rotated → <Rq, Rk> = <q, k> ✓ For V cache: this is WRONG because the attention output = attn_weights @ V needs to be in ORIGINAL space, not rotated space.

The fix: V cache must NOT be rotated during quantize, or the attention output must be inverse-rotated after the V multiplication.

Options:

Skip rotation for V in quantize — only rotate K
Add inverse rotation after attention output: out = R^T @ (attn @ R*V) = attn @ V ✓
Use f16 for V cache (only compress K) — mixed types need support

Option 1 is simplest: in the SET_ROWS kernel, check if this is K or V and skip rotation for V. The dequant already skips rotation. Problem: the SET_ROWS kernel doesn't know if it's writing K or V.

Option 2 is cleanest: add R^T multiplication to attention output in the graph. This is one ggml_mul_mat per layer (same as Q rotation). Cost is negligible.

Root Cause #1: V cache dequant returns rotated-space values

Python verified: cosine(input, dequant_output) = 0.02 (garbage) After inverse rotation: cosine = 0.987 (correct)

Root Cause #2: dynamic_cast to llama_kv_cache_context FAILS

The Qwen 3.5 MoE uses llama_memory_hybrid_context (not llama_kv_cache_context). The dynamic_cast returns null, so the V inverse rotation never executes. This also means the Q pre-rotation never executed — explaining why the chat server seemed to work (no rotation applied = raw quantize/dequant, which happens to produce plausible-looking text at low context but garbage perplexity).

Fix needed

Store turbo rotation tensors in llm_graph_context (not KV cache) OR access them through the hybrid memory interface
Apply V inverse rotation through a non-cast path
Verify Q rotation also executes (it was also using the same cast)

The "coherent text" mystery explained

The model produced grammatical text because:

Without Q rotation, attention scores are computed in wrong space but still produce some attention pattern (just not the right one)
Without V inverse rotation, the output is in rotated space which is orthogonal to the correct space — but each layer compounds the error
Short conversations look plausible but perplexity reveals the content is wrong

Bisect: block 128 → STILL PPL=165.6

Block size is NOT the issue. PPL identical at block 128 and block 32.

Bisect: disable Q rotation → STILL PPL=165.6

Pre-rotate-queries is NOT the issue. PPL identical with/without Q rotation.

Root cause #2 confirmed: dynamic_cast fails for MoE hybrid memory

The Q rotation and V inverse rotation NEVER execute because mctx is llama_memory_hybrid (not llama_kv_cache). The kv_cache IS inside the hybrid object but we can't reach it.

What this means

Without Q rotation and without V inverse rotation:

K cache is quantized in ROTATED space, dequanted in ROTATED space
Q is in ORIGINAL space (no pre-rotation)
<original_Q, rotated_K> = GARBAGE (cosine ≈ 0.02)
V cache similarly in rotated space, output never inverse-rotated

The entire time, K/V values were in the wrong coordinate system. The "coherent text" was an illusion from the language model's resilience.

Fix approach

Need to get rotation tensors through the hybrid memory hierarchy. Virtual methods on llama_memory_context_i are the cleanest path.

FIX: Inverse rotation restored in dequant

Perplexity (wikitext-2, 512 context, 8 chunks)

Cache Type	Perplexity	vs f16	vs q8_0
f16	6.121	baseline	—
q8_0	6.111	-0.16%	baseline
q4_0	6.142	+0.34%	+0.51%
turbo3	6.194	+1.19%	+1.36%

turbo3 within 1.4% of q8_0. Quality target MET.

Root cause of previous failure

Pre-rotate-queries never executed because:

Q tensor ne[0]=256 (GQA concatenated heads) vs rotation ne[0]=128
MoE hybrid memory context fails dynamic_cast to llama_kv_cache_context

Current state

Graph-side WHT rotation: Q rotated via ggml_mul_mat after RoPE, V un-rotated after attention
Block-32 storage: dequant is simple centroid lookup (no WHT), matches q4_0 GPU parallelism
q8_0 speed parity achieved

Top-of-Tree Quality and Speed (2026-03-25)

Model: Qwen3.5-35B-A3B-Q8_0 | Hardware: Apple M5 Max 128GB | Flash Attention ON

Quality (wikitext-2, 512 context)

Cache Type	Bits/val	Compression	PPL (32-chunk)	PPL (8-chunk)	vs q8_0
f16	16	1.0x	—	6.121	-0.16%
q8_0	8	2.0x	5.414 ± 0.140	6.111 ± 0.326	baseline
q4_0	4	4.0x	—	6.142	+0.51%
turbo3	3.5	4.6x	5.460 ± 0.141	6.193 ± 0.332	+0.8-1.3%

Long-Context Quality (wikitext-103, 32K context, 50 chunks)

Config	PPL	± CI	vs q8_0	Sparse V Δ
q8_0 baseline (8-bit KV)	7.0638	0.021	—	—
q4_0 baseline (4-bit KV)	7.0857	0.021	+0.31%	—
turbo3 WITHOUT sparse V (3.5-bit)	7.1796	0.021	+1.64%	—
turbo3 WITH sparse V (3.5-bit)	7.1796	0.021	+1.64%	0.0000

Note: q4_0 is included as a reference baseline. No optimization was applied to q4_0 in this work.

Speed (wikitext-2, 512 context, 32 chunks prefill)

Cache Type	Prefill tok/s	vs q8_0
q8_0	2694	1.00x
turbo3 (block-32 + graph WHT)	2747	1.02x

Speed Optimization Journey (739 → 2747, 3.72x total)

Step	tok/s	vs q8_0
fp32 WHT in dequant	739	0.27x
+ fp16 WHT	1074	0.40x
+ half4 vectorized butterfly	1411	0.52x
+ graph-side WHT rotation	2095	0.78x
+ block-32 storage	2747	1.02x

Summary

4.6x compression, <1.3% quality loss, q8_0 speed parity
Key breakthrough: moving WHT from per-block dequant to graph-level ggml_mul_mat
See Speed Experiments for the full journey

Current State Summary (after quality fix)

What works:

turbo3 KV cache compression: 4.6× compression ratio
Perplexity: 6.194 (1.4% of q8_0) — QUALITY TARGET MET
MSE-only mode: 3-bit PolarQuant, no QJL — better quality than paper's Algorithm 2
WHT rotation in dequant: correct inverse rotation restoring original space
Block size 128: works on Metal for both MoE and Dense models
Works for MoE (hybrid memory), Dense, and ISWA models

Speed:

~10.7 tok/s MoE gen (8× slower than q8_0 85.5)
Bottleneck: inverse WHT rotation in dequant (called per block per attention)

What doesn't work:

Pre-rotate-queries optimization: never executed because
1. Q tensor ne[0]=256 (GQA concatenated heads) vs rotation ne[0]=128
2. Need per-head reshape before rotation
Previous speed numbers (51-77 tok/s) were INVALID — measured garbage output speed

Lessons learned:

ALWAYS run perplexity before claiming quality
"Coherent text" is NOT quality validation
dynamic_cast fails silently for MoE hybrid memory
GQA models concatenate heads in ne[0] — can't apply per-head ops without reshape
#include in ggml-metal.metal causes silent CPU fallback
Block size matters for speed but NOT for quality (both 32 and 128 had same PPL)

FilesExpand file tree

quality-benchmarks.md

Latest commit

History

quality-benchmarks.md

File metadata and controls

TurboQuant+ Quality Benchmarks

Why These Benchmarks Matter

What We're Measuring

1. Perplexity (wikitext-2)

2. KL Divergence vs f16

3. Passkey Retrieval (built-in NIAH)

4. Generation Quality (qualitative)

Test Configuration

Commands

Download dataset

Perplexity suite

KL Divergence

Results

Perplexity (wikitext-2, 512 context)

KL Divergence vs f16

Passkey Retrieval

INITIAL RESULTS — QUALITY FAILURE

Perplexity (wikitext-2, 512 context, 8 chunks)

Root cause investigation needed

Quality Bisection Log

Bisection Results

Test A: Disable Q rotation → PPL 165.6 (STILL BAD)

Root Cause Found: V CACHE IN ROTATED SPACE

Root Cause #1: V cache dequant returns rotated-space values

Root Cause #2: dynamic_cast to llama_kv_cache_context FAILS

Fix needed

The "coherent text" mystery explained

Bisect: block 128 → STILL PPL=165.6

Bisect: disable Q rotation → STILL PPL=165.6

Root cause #2 confirmed: dynamic_cast fails for MoE hybrid memory

What this means

Fix approach

FIX: Inverse rotation restored in dequant

Perplexity (wikitext-2, 512 context, 8 chunks)

Root cause of previous failure

Current state

Top-of-Tree Quality and Speed (2026-03-25)

Quality (wikitext-2, 512 context)

Long-Context Quality (wikitext-103, 32K context, 50 chunks)

Speed (wikitext-2, 512 context, 32 chunks prefill)

Speed Optimization Journey (739 → 2747, 3.72x total)

Summary

Current State Summary (after quality fix)

What works:

Speed:

What doesn't work:

Lessons learned: