TurboQuant Decode Speed: Why Some Hardware Struggles

The Problem

turbo3 decode speed varies dramatically across Apple Silicon generations:

| Hardware | Decode vs q8_0 (short) | Decode vs q8_0 (16K) | Has Tensor API |
|---|---|---|---|
| M5 Max | 0.89x | ~0.82x | yes |
| M2 Pro | 0.73x | 0.67x (with 4-mag fix) | no |
| M1 Max | 0.84x | 0.45x (reported) | no |

M5 Max has near-parity. M2/M1 fall off a cliff at long context. This doc explains why and what we did about it.

Root Cause: Constant Memory LUT Divergence

turbo3 dequant uses an 8-entry centroid lookup table (LUT) in Metal constant memory. Each element's value is determined by a 3-bit index (2-bit qs + 1-bit sign), so each thread in a 32-thread simdgroup may access a different LUT entry.

This "divergent access" pattern is where hardware generations differ:

  • M5 Max (Apple10): Efficient constant cache handles 8-way divergent access with minimal penalty. LUT costs ~14% of decode time.
  • M2 Pro (Apple8): Older constant cache serializes divergent access. LUT costs ~25% of decode time — 2x worse than M5.
  • M1 Max (Apple7): Even worse, likely 30%+ LUT cost based on reported numbers.
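
For concreteness, here is a minimal sketch of what a divergent 8-entry lookup looks like. The centroid values, names (`turbo3_centroids`, `dequant_turbo3_8lut`), and bit layout are illustrative placeholders, not the project's actual kernel code.

```metal
// Illustrative only: 8 centroids in constant memory, indexed by a 3-bit
// code per element (2-bit magnitude + 1-bit sign). Values are placeholders.
constant float turbo3_centroids[8] = {
    -1.75f, -1.25f, -0.75f, -0.25f,   // sign bit = 0 (negative)
     0.25f,  0.75f,  1.25f,  1.75f    // sign bit = 1 (positive)
};

inline float dequant_turbo3_8lut(uint qs2, uint sign1, float norm) {
    // Each of the 32 lanes in a simdgroup can form a different 3-bit index,
    // so this one read may touch up to 8 distinct constant-memory addresses.
    uint idx = (sign1 << 2) | qs2;
    return norm * turbo3_centroids[idx];
}
```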

How We Know: Profiling Isolation

We added TURBO_PROFILE_MODE (0-4) to strip away dequant layers one at a time:

| Mode | What | M5 tok/s (% of ceiling) | M2 tok/s (% of ceiling) |
|---|---|---|---|
| 1 | No dequant at all | 78.9 (100%) | 24.5 (100%) |
| 2 | + read norm only | 75.1 (95%) | 22.1 (90%) |
| 4 | + read all bytes | 75.2 (95%) | 21.9 (89%) |
| 3 | + qs extraction + LUT | 64.9 (82%) | 16.4 (67%) |
| 0 | + signs + full LUT | 59.2 (75%) | 14.0 (57%) |

Key finding: No-dequant turbo3 is 12% FASTER than q8_0 on M2 Pro (24.5 vs 21.9) because the compressed cache moves less data over the memory bus. The compression FORMAT is not the problem. The dequant COMPUTATION is.

The LUT accounts for:

  • M5: 13.7% of ceiling (Mode 4 → Mode 3)
  • M2: 25.1% of ceiling — 2x worse

Reading the bytes (norm + qs + signs) costs ~10% on both chips. That's not the bottleneck.
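
To illustrate how layered profiling like this can be wired in, here is a hedged sketch. The mode numbering follows the table above, but the helper name (`dequant_profiled`), block layout, and bit offsets are assumptions, not the actual kernel code.

```metal
#include <metal_stdlib>
using namespace metal;

// Each TURBO_PROFILE_MODE keeps one more stage of the real dequant, so the
// per-stage cost falls out of the tok/s deltas. Outputs are meaningless in
// modes 1-4; only the cost of the retained work matters.
inline float dequant_profiled(device const uchar *blk, uint i, constant float *lut) {
#if   TURBO_PROFILE_MODE == 1
    return 1.0f;                                        // mode 1: no dequant at all (ceiling)
#elif TURBO_PROFILE_MODE == 2
    return float(((device const half *)blk)[0]);        // mode 2: read the group norm only
#elif TURBO_PROFILE_MODE == 4
    float norm = float(((device const half *)blk)[0]);
    uchar raw  = blk[2 + i / 4];                        // mode 4: also touch the qs/sign bytes
    return norm + float(raw);                           // keep the load alive, skip extraction
#elif TURBO_PROFILE_MODE == 3
    float norm = float(((device const half *)blk)[0]);
    uint  qs   = (blk[2 + i / 4] >> ((i % 4) * 2)) & 0x3;
    return norm * lut[qs];                              // mode 3: qs extraction + LUT read
#else
    // mode 0: full dequant (qs + sign + LUT + norm) goes here
    return 0.0f;                                        // placeholder
#endif
}
```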

What We Tried (7 Approaches)

| # | Approach | M2 8K tok/s | vs Main | Why it works/doesn't |
|---|---|---|---|---|
| 1 | 4-mag LUT + XOR sign | 15.1 | +38% | Halves constant addresses (4 vs 8). Sweet spot on M2. |
| 2 | Batched byte extract | 13.7 | +25% | Better byte reading, still 8 LUT addresses |
| 3 | Deferred norm multiply | 12.9 | +18% | Loses ILP — per-element norm hides LUT latency |
| 4 | 2-pair half2 LUT | 12.0 | +10% | Ternary overhead exceeds savings from 2 addresses |
| 5 | Select chain (zero LUT) | 11.9 | +9% | Too much ALU — branches on M2 GPU |
| 6 | Bit-arithmetic | 11.6 | +6% | Pure ALU, zero memory — but ALU cost too high |
| 7 | Non-vec FA (nl=2) | 10.2 | -7% | Non-vec kernel not designed for single-token decode |

Also tested on M5 Max (no help needed):

  • float cn[8] registers: 75.2 (spills to stack on Metal)
  • half cn[8] registers: 74.4 (also spills)
  • Split 2×4 half LUT: 74.0 (branch overhead)

The Fix: Auto-Detection

The build auto-detects hardware at Metal library compile time:

```
M1/M2/M3/M4 (has_tensor=false) → TURBO_USE_4MAG=1 → 4-entry magnitude LUT
M5+          (has_tensor=true)  → TURBO_USE_4MAG=0 → 8-entry full LUT
```

Each chip gets its optimal dequant path. No user configuration needed.

How 4-mag LUT works

The 8 centroids have structure: 4 magnitudes × 2 signs. We split the 3-bit index:

  • 2-bit qs → selects magnitude from 4-entry LUT (4 possible constant addresses)
  • 1-bit sign → reverses magnitude index via XOR, then negates via ALU multiply
  • Norm → multiplied per-element (provides ALU work that hides constant memory latency)

The XOR trick: for negative values (sign=0), the magnitude index is reversed (qs=0 → highest magnitude). qs ^ 0x3 flips the 2-bit index without a branch.
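
A minimal sketch of this path, using placeholder magnitude values and illustrative names (`turbo3_mags`, `dequant_turbo3_4mag`); the real codebook and bit layout live in the Metal source, and the TURBO_USE_4MAG build flag described above selects between this and the 8-entry path.

```metal
// Illustrative 4-magnitude variant: at most 4 constant addresses can
// diverge across a simdgroup instead of 8, with no branches added.
constant float turbo3_mags[4] = { 0.25f, 0.75f, 1.25f, 1.75f };   // placeholder magnitudes

inline float dequant_turbo3_4mag(uint qs2, uint sign1, float norm) {
    uint  mi  = qs2 ^ (0x3u * (1u - sign1));   // sign1 == 0 (negative): reverse order via qs ^ 0x3
    float sgn = 2.0f * float(sign1) - 1.0f;    // -1 or +1 from the sign bit, no ternary
    // Per-element norm multiply on purpose: the ALU work overlaps the next
    // divergent constant read (ILP), which a deferred norm multiply loses.
    return sgn * norm * turbo3_mags[mi];
}
```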

Critical finding: On M2, branches cost MORE than divergent constant reads

We tested 9 approaches total. The pattern is clear:

| Approach | M2 8K tok/s | Constant reads | Branches | Result |
|---|---|---|---|---|
| 4-mag LUT | 15.1 | 4 divergent | 0 | BEST |
| Batched extract (8-LUT) | 13.7 | 8 divergent | 0 | Good |
| Deferred norm (4-mag) | 12.9 | 4 divergent | 0 | Lose ILP |
| 2-pair half2 + ternary | 12.0 | 2 divergent | 4 | Branches hurt |
| Select chain (zero LUT) | 11.9 | 0 | 8 | Branches kill |
| Bit-arithmetic | 11.6 | 0 | 4+ | ALU too heavy |
| Named-reg + ternary select | 10.3 | 4 uniform | 8 | Worst — ternary kills |
| Main (8-entry LUT) | 10.95 | 8 divergent | 0 | Baseline |
| Inline block (FA inner loop) | 13.5 | 4 divergent | 0 | I-cache pressure |
| Non-vec FA (nl=2) | 10.2 | 8 divergent | 0 | Wrong kernel for decode |

Any approach that replaces constant reads with branches loses on M2. The Apple8 GPU's branch predictor/execution is worse than its constant cache. The 4-mag LUT succeeds because it reduces constant addresses (4 vs 8) WITHOUT adding branches.

Per-element norm multiply is faster than deferred

Deferring float4 * norm at the end (12.9 tok/s) is slower than per-element v * norm (15.1 tok/s). The per-element multiply provides ALU work that hides constant memory latency via instruction-level parallelism. The GPU can overlap the next constant read while the current multiply executes.

Why CUDA Doesn't Have This Problem

@spiritbuun's CUDA fork achieves 97.5% of q8_0 decode. The key difference:

  • CUDA has 255 registers per thread — cn[8] fits in registers without spilling
  • Metal has a smaller register file — cn[8] spills to thread stack memory
  • CUDA warp semantics — better divergent access handling across 32 threads
  • Metal simdgroup semantics — constant cache serializes more on older chips

The register LUT approach works perfectly on CUDA but is fundamentally incompatible with Metal's register file on current hardware.

Remaining Gap

Even with 4-mag LUT, M2 Pro decode at 8K is 15.1 vs 24.5 ceiling (62%). The remaining 38% gap needs kernel-level changes:

  1. SMEM pre-dequant: Pre-dequant entire KV blocks into threadgroup memory before dot products. Would eliminate constant memory from the inner loop entirely. Requires modifying the FA vec kernel template.

  2. Per-group device-memory cached centroid×norm: Store 8 pre-computed centroid×norm values per 128-element group alongside the block data. Dequant reads from device memory (sequential) instead of constant memory (divergent). Changes the block format.

  3. Fused Q·centroid attention: Compute attention scores directly from quantized indices without full dequantization. Precompute Q·centroid table (8 values per block), then each K element is a table lookup. Complex (custom FA kernel).

These are tracked in GitHub issue #39.

Summary

| What | Result |
|---|---|
| Root cause | Constant memory LUT divergence, 2x worse on M2 vs M5 |
| Best fix found | 4-mag LUT (+38-45% on M2, auto-detected) |
| M5 impact | Zero regression (uses separate code path) |
| Remaining gap | 38% to ceiling, needs kernel-level surgery |
| CUDA comparison | buun gets 97.5% via register LUT (not portable to Metal) |

Complete Experiment Log (12 approaches, 2026-03-26/27)

M2 Pro (Apple8, has_tensor=false) — 8K context decode

| # | Approach | tok/s | vs q8_0 | vs Ceiling | Key finding |
|---|---|---|---|---|---|
| — | No-op ceiling | 24.5 | 1.12x | 100% | turbo3 format FASTER than q8_0 (less bandwidth) |
| 1 | 4-mag LUT + XOR sign | 15.1 | 0.69x | 62% | Sweet spot: 4 constant addrs, 0 branches |
| 2 | Batched byte extract (8-LUT) | 13.7 | 0.63x | 56% | Better byte reading, still 8 addresses |
| 3 | Inline block in FA loop | 13.5 | 0.62x | 55% | I-cache pressure from expanded inline |
| 4 | Deferred norm (float4 * norm) | 12.9 | 0.59x | 53% | Loses ILP — norm multiply hides LUT latency |
| 5 | 2-pair half2 + ternary | 12.0 | 0.55x | 49% | Ternary overhead exceeds 2-addr savings |
| 6 | Select chain (zero LUT) | 11.9 | 0.54x | 49% | 8 ternaries = 8 branches on Apple8 |
| 7 | Bit-arithmetic (mul+add) | 11.6 | 0.53x | 47% | 7 ALU > 4 constant reads |
| 8 | FMA branchless (zero ternary) | 11.4 | 0.52x | 47% | fma doesn't help — same ALU count |
| — | Main (8-entry constant LUT) | 10.95 | 0.50x | 45% | Baseline — 8-way divergent |
| 9 | Named-reg ternary select | 10.3 | 0.47x | 42% | 4 uniform reads + 8 branches = worst |
| 10 | Non-vec FA forced (nl=2) | 10.2 | 0.47x | 42% | Non-vec kernel wrong for single-token decode |

M5 Max (Apple10, has_tensor=true) — short context decode

All approaches: 75-77 tok/s (M5 uses 8-LUT path, unaffected by TURBO_USE_4MAG). No regression from any experiment. q8_0 baseline: 85 tok/s.

Additional checks

  • Qwen2 attention bias: NOT present in Qwen2.5-7B-Instruct model (339 tensors, no attn_q/k/v.bias). The GÖKYILDIZ bug does not apply.
  • Model-independence: Profiling pattern (LUT = 25% cost on M2) is consistent across Qwen2.5-7B (M2 Pro) and Qwen3.5-35B-A3B (M5 Max). Architecture-independent.
  • Non-vec FA (nl=2): Faster on M5 (+1.7%) but much worse on M2 (-7%). The non-vec kernel processes batch=1 inefficiently at long context.

Hardware truth (Apple8/M2 Pro)

  1. 1 divergent constant read < 7 ALU ops — even with fma()
  2. Metal compiles ternaries to branches — not predicated moves like CUDA
  3. Branches cost MORE than divergent constant reads — the opposite of CUDA
  4. Array indexing ALWAYS spills — Metal's register file is too small for cn[4+]
  5. 4 constant addresses is the sweet spot — fewer adds branches, more adds thrashing
  6. Per-element norm multiply provides ILP — hides constant memory latency

Papers reviewed for novel approaches

  • AttentionPack (arxiv 2603.23914): validates kernel fusion for attention-aware decompression
  • GlowQ (arxiv 2603.25385): validates group-shared factor caching (+37.4% throughput)
  • Embedding Compression via Spherical Coordinates (2602.00079): same PolarQuant framework, different encoding
  • MKA: Memory-Keyed Attention (2603.20587): learned memory lookup, hardware-aware
  • Scaling Attention via Feature Sparsity (2603.22300): skip negligible attention weights

Next frontier

The 4-mag LUT is the dequant-level ceiling. The remaining 38% gap requires:

  1. Block format change: embed precomputed centroid×norm in device memory (sequential reads, zero divergence)
  2. Custom FA kernel: fuse dequant into attention with restructured computation
  3. Different quantization scheme: format designed for Metal's constraints from scratch

Extended Experiments (approaches 12-13, 2026-03-27)

| # | Approach | M2 8K tok/s | vs 4-mag | Key finding |
|---|---|---|---|---|
| 12 | FMA branchless (zero ternary, zero memory) | 11.4 | -24% | 7 ALU ops slower than 4 constant reads even with fma() |
| 13 | simd_shuffle cross-lane magnitude select | 14.7 | -2.6% | Shuffle latency ≈ constant read latency on Apple8 |

FMA branchless details

Fully branchless: XOR mask via 3 - 3*sign_bit (not ternary), sign via 2*s - 1 (not ternary), magnitude via 3-chained fma(). Zero branches, zero memory. Still slower because 7 ALU cycles > 1 divergent constant read cycle on Apple8.
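
A sketch of what that looks like, with placeholder polynomial coefficients standing in for a fit through the real 4-magnitude codebook (c0..c3 and the helper name are assumptions):

```metal
#include <metal_stdlib>
using namespace metal;

// Illustrative fully branchless dequant: zero memory traffic, all ALU.
inline float dequant_turbo3_fma(uint qs2, uint sign1, float norm) {
    float mi  = float(qs2 ^ (0x3u - 0x3u * sign1));     // XOR mask via 3 - 3*sign_bit
    float sgn = 2.0f * float(sign1) - 1.0f;             // sign via 2*s - 1
    // Cubic through the 4 magnitudes; the placeholder coefficients below
    // reduce to the linear map 0.25 + 0.5*mi, a real codebook needs all four.
    const float c0 = 0.25f, c1 = 0.5f, c2 = 0.0f, c3 = 0.0f;
    float mag = fma(fma(fma(c3, mi, c2), mi, c1), mi, c0);   // 3 chained fma
    return sgn * norm * mag;   // ~7 ALU ops, still slower than 4 constant reads on Apple8
}
```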

simd_shuffle details

Threads 0-3 within each 8-thread block compute mag[i]×norm. All threads use simd_shuffle(value, block_base + mi) to read the correct mag×norm. This IS branchless and memory-free — the value moves via cross-lane register transfer. But shuffle latency on Apple8 is comparable to constant cache access, negating the benefit.
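
A hedged sketch of the idea follows; the lane bookkeeping and names are assumptions, and the real kernel's lane-to-block mapping may differ.

```metal
#include <metal_stdlib>
using namespace metal;

// Illustrative cross-lane magnitude broadcast. Assumes 8 consecutive lanes
// of the simdgroup share one quantization block (and therefore one norm).
inline float dequant_turbo3_shuffle(uint qs2, uint sign1, float norm,
                                    ushort lane_id, constant float *mags) {
    ushort block_base = (lane_id / 8) * 8;            // first lane of this 8-lane group
    ushort slot       = lane_id - block_base;         // 0..7 within the group
    // Lanes 0-3 stage the four mag*norm products; the others stage a dummy.
    float staged = (slot < 4) ? mags[slot] * norm : 0.0f;
    // Magnitude index for THIS lane's element, then fetch the matching staged
    // product from the lane that holds it: register traffic instead of a
    // per-element divergent constant read.
    uint  mi       = qs2 ^ (0x3u * (1u - sign1));
    float mag_norm = simd_shuffle(staged, ushort(block_base + mi));
    float sgn      = 2.0f * float(sign1) - 1.0f;
    return sgn * mag_norm;
}
```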

Total: 13 approaches tested

The 4-mag constant LUT at 15.1 tok/s (0.69× q8_0) is the definitive dequant-level ceiling on Apple8 hardware. Every alternative — fewer constant addresses, zero constant addresses, branchless ALU, cross-lane shuffle, inline FA blocks — is equal or worse.

Extended Experiments (approaches 14, 2026-03-27)

| # | Approach | M2 8K tok/s | vs 4-mag | Key finding |
|---|---|---|---|---|
| 14 | Fused block dot (per-centroid Q accum) | 8.1 | -46% | 64 float comparisons per block — worst result of all |

Fused block dot details

Flipped computation: instead of per-element centroid lookup, iterate over 4 centroid values and accumulate matching Q elements. Uses float(mi == c) as a branchless mask. But 4 centroids × 4 elements × 4 comparisons = 64 float comparison operations per dequant call. Each comparison likely compiles to a branch on Apple8, making this the most expensive approach tested.
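
A sketch of the flipped loop over one 4-element group (names and granularity are illustrative):

```metal
// Illustrative fused block dot: loop over the 4 magnitudes and accumulate the
// Q elements whose index matches, using float(mi == c) as a branchless mask.
// mi[]/sgn[] hold the already-decoded indices and signs for the group.
inline float fused_block_dot(thread const float *q, thread const uint *mi,
                             thread const float *sgn, float norm,
                             constant float *mags) {
    float acc = 0.0f;
    for (uint c = 0; c < 4; ++c) {
        float masked = 0.0f;
        for (uint e = 0; e < 4; ++e) {
            masked += float(mi[e] == c) * sgn[e] * q[e];   // one comparison per (c, e) pair
        }
        acc += mags[c] * masked;
    }
    return norm * acc;
}
```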

Final tally: 14 approaches tested

| Rank | Approach | M2 8K tok/s | vs Ceiling | Category |
|---|---|---|---|---|
| 1 | 4-mag LUT | 15.1 | 62% | 4 constant reads |
| 2 | simd_shuffle | 14.7 | 60% | Cross-lane transfer |
| 3 | Batched extract (8-LUT) | 13.7 | 56% | 8 constant reads |
| 4 | Inline FA block | 13.5 | 55% | Inlined dequant |
| 5 | Deferred norm | 12.9 | 53% | Loses ILP |
| 6 | 2-pair half2 | 12.0 | 49% | 2 reads + ternary |
| 7 | Select chain | 11.9 | 49% | Pure branches |
| 8 | Bit-arithmetic | 11.6 | 47% | Pure ALU |
| 9 | FMA branchless | 11.4 | 47% | fma() chain |
| 10 | Main (8-LUT) | 10.95 | 45% | Baseline |
| 11 | Named-reg ternary | 10.3 | 42% | Registers + branches |
| 12 | Non-vec FA | 10.2 | 42% | Wrong kernel |
| 13 | Fused block dot | 8.1 | 33% | 64 comparisons |
| — | Ceiling (no dequant) | 24.5 | 100% | Zero overhead |

Extended Experiments (approach 15, 2026-03-28)

| # | Approach | M2 8K tok/s | vs 4-mag | Key finding |
|---|---|---|---|---|
| 15 | SMEM pre-dequant (threadgroup memory tile) | 10.17 | -33% | Threadgroup store/load overhead > constant cache savings |

SMEM pre-dequant details

Pre-dequantize entire K/V tiles (C=32 × DK=128 = 8KB half) into threadgroup memory before dot products. All 32 threads cooperatively dequant C cache positions, barrier, then compute from SMEM.
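
A structural sketch of the cooperative tile; the buffer layout, threadgroup size, and placeholder decode are assumptions, and the real change sits inside the FA vec kernel template rather than a standalone kernel.

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch only: cooperative pre-dequant of one K tile into threadgroup memory,
// then a barrier before the dot products.
kernel void smem_predequant_sketch(device const uchar *k_blocks [[buffer(0)]],
                                   device float       *out      [[buffer(1)]],
                                   uint tid [[thread_index_in_threadgroup]]) {
    const uint C = 32, DK = 128;          // tile: 32 cache positions x 128 dims (8 KB of half)
    const uint tg_size = 32;              // assumed threadgroup size for the sketch
    threadgroup half k_tile[C * DK];

    // Every thread dequantizes a strided slice of the tile ...
    for (uint idx = tid; idx < C * DK; idx += tg_size) {
        k_tile[idx] = half(k_blocks[idx]) * 0.01h;   // placeholder for the real turbo3 decode
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // ... but in the FA vec kernel each dequanted value is consumed by the
    // same thread that produced it, so the store + barrier + load round trip
    // buys nothing and breaks the read/ALU overlap (ILP).
    out[tid] = float(k_tile[tid * DK]);
}
```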

Why it failed: The FA vec kernel distributes work so each thread operates on unique data (DK4/NL=1 dequant per thread per cache position). Each dequanted value is used exactly once by its producer thread. Caching in SMEM adds 64 threadgroup memory ops (32 stores + 32 loads) per thread per outer iteration + barrier, for zero reduction in constant LUT reads. Additionally, separating dequant from compute destroys the ILP that makes the 4-mag LUT work (constant reads overlapped with ALU). At short context, the overhead is negligible (+1.8%), but at 8K it's catastrophic (-51.5%).

Lesson: SMEM only helps when data is shared between threads. Don't separate what the hardware pipelines together.

Updated tally: 16 approaches tested

| Rank | Approach | M2 8K tok/s | vs Ceiling | Category |
|---|---|---|---|---|
| 1 | 4-mag LUT | 15.1 | 62% | 4 constant reads |
| 2 | simd_shuffle | 14.7 | 60% | Cross-lane transfer |
| 3 | Batched extract (8-LUT) | 13.7 | 56% | 8 constant reads |
| 4 | Inline FA block | 13.5 | 55% | Inlined dequant |
| 5 | Deferred norm | 12.9 | 53% | Loses ILP |
| 6 | 2-pair half2 | 12.0 | 49% | 2 reads + ternary |
| 7 | Select chain | 11.9 | 49% | Pure branches |
| 8 | Bit-arithmetic | 11.6 | 47% | Pure ALU |
| 9 | FMA branchless | 11.4 | 47% | fma() chain |
| 10 | Main (8-LUT) | 10.95 | 45% | Baseline |
| 11 | Named-reg ternary | 10.3 | 42% | Registers + branches |
| 12 | Non-vec FA | 10.2 | 42% | Wrong kernel |
| 13 | SMEM pre-dequant | 10.17 | 41% | Threadgroup cache (ILP loss) |
| 14 | Q·centroid precompute | 10.10 | 41% | select() register LUT |
| 15 | Fused block dot | 8.1 | 33% | 64 comparisons |
| — | Ceiling (no dequant) | 24.5 | 100% | Zero overhead |

Critical ILP insight (2026-03-28)

The 2x slowdown of SMEM pre-dequant revealed WHY the 4-mag LUT works: interleaved constant reads and ALU provide instruction-level parallelism. The GPU overlaps the next constant read while the current dot product + norm multiply executes. Any approach that separates memory reads from compute phases loses this overlap and runs at 2x the latency.

This explains the ranking pattern: approaches that maintain per-element read → ALU interleaving (4-mag, simd_shuffle, batched 8-LUT) outperform approaches that batch reads (SMEM, deferred norm, QC precompute) or eliminate reads but add more ALU (select chain, FMA, bit-arithmetic).

The remaining 38% gap cannot be closed by rearranging the same reads and ALU. The only path forward is reducing the total constant read count per element below 4, which requires changing the block format or computation structure.

Extended Experiments (approaches 16h + 19, 2026-03-28 evening)

Clean benchmarks (no DINOv2 contention). Note: absolute tok/s numbers differ from earlier runs due to DINOv2 GPU contention during those measurements. Relative rankings are consistent.

Qwen2.5-7B-Instruct-Q4_K_M (8K context, p=8192 tg128)

| # | Approach | Env Var | turbo3 t/s | turbo4 t/s | q8_0 t/s | Key finding |
|---|---|---|---|---|---|---|
| — | Baseline (4-mag LUT) | — | 25.88 | 27.59 | 33.69 | turbo4 > turbo3 by 6.6% |
| 16h | Half register LUT (half cn[8]) | TURBO_HALF_REG_LUT=1 | 25.55 (-1.3%) | 27.26 (-1.2%) | | Still spills on Apple8. Noise. |
| 19 | Threadgroup centroid cache | TURBO_TG_CENTROID=1 | 25.96 (+0.3%) | 27.42 (-0.6%) | | Flat. Threadgroup read ≈ constant read. |

Qwen2.5-1.5B-Instruct-Q4_K_M (8K context, p=8192 tg128)

| # | Approach | Env Var | turbo3 t/s | turbo4 t/s | q8_0 t/s | Key finding |
|---|---|---|---|---|---|---|
| — | Baseline (4-mag LUT) | — | 51.84 | 58.74 | 113.02 | Gap worse on 1.5B (attention dominates) |
| 16h | Half register LUT | TURBO_HALF_REG_LUT=1 | 50.67 (-2.3%) | 57.76 (-1.7%) | | Consistent regression across models |
| 19 | Threadgroup centroid cache | TURBO_TG_CENTROID=1 | 52.19 (+0.7%) | 57.98 (-1.3%) | | Flat |

Prompt processing (pp8192, for reference)

| Model | turbo3 t/s | turbo4 t/s | q8_0 t/s |
|---|---|---|---|
| 7B | 241.26 | 240.00 | 254.74 |
| 1.5B | 816.13 | 807.03 | 896.42 |

Key takeaways from approaches 16h + 19

  1. Both approaches are noise on M2 Pro. Half Reg LUT is consistently -1 to -2% (regression). TG Centroid is flat (within variance).

  2. turbo4 > turbo3 on M2 Pro by 6-13% on decode. Holds across both model sizes. 16 centroids (turbo4) is faster than 8 (turbo3) — possibly better ILP with 4-bit indices.

  3. The gap to q8_0 grows with smaller models: 30% on 7B, 118% on 1.5B. Smaller models make attention a larger fraction of decode, making the dequant overhead more visible.

  4. The centroid LUT is NOT the sole bottleneck. Moving centroids to half-precision registers or threadgroup memory doesn't help. The bottleneck is broader — likely the full WHT transform + extraction pipeline, not just the final centroid lookup.

Revised theory (2026-03-28)

The original theory was "constant memory LUT divergence is the bottleneck." After testing 4 approaches targeting the centroid LUT specifically (#15 SMEM pre-dequant, #16q Q·centroid precompute, #16h half reg LUT, #19 TG centroid), NONE improved decode speed. The 4-mag LUT improvement from earlier was real (+38%), but it was optimizing the full dequant pipeline (fewer reads + XOR sign trick), not just the centroid lookup.

The remaining gap is structural: turbo3/4 dequant requires WHT extraction + centroid lookup + norm multiply per element, while q8_0 is a simple int8 * scale. No rearrangement of the same operations will close this gap.

Updated tally: 18 approaches tested (16h and 19 added)

| Rank | Approach | M2 8K tok/s | vs Ceiling | Category |
|---|---|---|---|---|
| 1 | 4-mag LUT | 15.1 | 62% | 4 constant reads |
| 2 | simd_shuffle | 14.7 | 60% | Cross-lane transfer |
| 3 | Batched extract (8-LUT) | 13.7 | 56% | 8 constant reads |
| 4 | Inline FA block | 13.5 | 55% | Inlined dequant |
| 5 | Deferred norm | 12.9 | 53% | Loses ILP |
| 6 | 2-pair half2 | 12.0 | 49% | 2 reads + ternary |
| 7 | Select chain | 11.9 | 49% | Pure branches |
| 8 | Bit-arithmetic | 11.6 | 47% | Pure ALU |
| 9 | FMA branchless | 11.4 | 47% | fma() chain |
| 10 | Main (8-LUT) | 10.95 | 45% | Baseline |
| 11 | Named-reg ternary | 10.3 | 42% | Registers + branches |
| 12 | Non-vec FA | 10.2 | 42% | Wrong kernel |
| 13 | SMEM pre-dequant | 10.17 | 41% | Threadgroup cache (ILP loss) |
| 14 | Q·centroid precompute | 10.10 | 41% | select() register LUT |
| 15 | Fused block dot | 8.1 | 33% | 64 comparisons |
| 16h | Half register LUT | ~noise | ~62% | half cn[8] — still spills |
| 19 | TG centroid cache | ~noise | ~62% | Threadgroup centroid table |
| — | Ceiling (no dequant) | 24.5 | 100% | Zero overhead |

Extended Experiments (approaches 20 + 22, 2026-03-28 night)

Qwen2.5-1.5B-Instruct-Q4_K_M (8K context, p=8192 tg128)

| # | Approach | Env Var | turbo3 t/s | turbo4 t/s | Key finding |
|---|---|---|---|---|---|
| — | Baseline (4-mag LUT) | — | 51.80 | 58.60 | |
| 20 | 2-bit direct encode (pure ALU, no LUT) | TURBO_DIRECT_ENCODE=1 | 52.16 (+0.7%) | 57.78 (-1.4%) | Same speed with zero constant reads |
| 22 | Async prefetch (batch device reads) | TURBO_ASYNC_PREFETCH=1 | 50.73 (-2.1%) | 57.02 (-2.7%) | GPU already interleaves reads optimally |

Definitive proof: constant memory LUT is FREE on M2 Pro

Approach #20 replaced the entire centroid LUT with pure ALU (norm * (0.25 + 0.5*idx)) — zero constant memory reads. The result is identical speed. This means the 4 constant reads per element are hitting L1 cache and costing essentially nothing.
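
For reference, the pure-ALU mapping from #20 looks roughly like this (helper name is illustrative; whether the index flip happens before or after the linear map is an assumption):

```metal
// Illustrative "direct encode" dequant: no codebook at all, the 2-bit index
// maps straight to a magnitude with pure ALU and zero constant-memory reads.
inline float dequant_direct_encode(uint qs2, uint sign1, float norm) {
    uint  mi  = qs2 ^ (0x3u * (1u - sign1));        // same branchless index flip as 4-mag
    float sgn = 2.0f * float(sign1) - 1.0f;
    return sgn * norm * (0.25f + 0.5f * float(mi)); // norm * (0.25 + 0.5*idx)
}
```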

The bottleneck is device memory bandwidth: streaming 14 bytes per 32 elements (turbo3) from DRAM, strided across cache positions. q8_0 streams 34 bytes per 32 elements but with a simpler access pattern and trivial dequant (int8 * scale).

Why async prefetch didn't help (#22)

Staging device memory reads (norm, qs, signs) into registers before constant reads forced a specific execution order. The GPU's out-of-order execution already interleaves these optimally. Forcing order adds register pressure without hiding latency.

Updated tally: 20 approaches tested

| Rank | Approach | M2 8K tok/s | vs Ceiling | Category |
|---|---|---|---|---|
| 1 | 4-mag LUT | 15.1 | 62% | 4 constant reads |
| 2 | simd_shuffle | 14.7 | 60% | Cross-lane transfer |
| 3 | Batched extract (8-LUT) | 13.7 | 56% | 8 constant reads |
| 4 | Inline FA block | 13.5 | 55% | Inlined dequant |
| 5 | Deferred norm | 12.9 | 53% | Loses ILP |
| 6 | 2-pair half2 | 12.0 | 49% | 2 reads + ternary |
| 7 | Select chain | 11.9 | 49% | Pure branches |
| 8 | Bit-arithmetic | 11.6 | 47% | Pure ALU |
| 9 | FMA branchless | 11.4 | 47% | fma() chain |
| 10 | Main (8-LUT) | 10.95 | 45% | Baseline |
| 11 | Named-reg ternary | 10.3 | 42% | Registers + branches |
| 12 | Non-vec FA | 10.2 | 42% | Wrong kernel |
| 13 | SMEM pre-dequant | 10.17 | 41% | Threadgroup cache (ILP loss) |
| 14 | Q·centroid precompute | 10.10 | 41% | select() register LUT |
| 15 | Fused block dot | 8.1 | 33% | 64 comparisons |
| 16h | Half register LUT | ~noise | ~62% | half cn[8] — still spills |
| 19 | TG centroid cache | ~noise | ~62% | Threadgroup centroid table |
| 20 | Direct encode (no LUT) | ~noise | ~62% | Pure ALU, zero constant reads |
| 22 | Async prefetch | -2% | ~61% | Forced read ordering |
| — | Ceiling (no dequant) | 24.5 | 100% | Zero overhead |

SMEM Pre-Dequant Re-Run (2026-03-28 night, corrected)

The original -51% SMEM result was measured during DINOv2 GPU contention. Clean re-run shows it's actually a small positive:

Qwen2.5-1.5B Q4_K_M (short and 8K context)

| Config | SMEM t/s | Baseline t/s | Delta |
|---|---|---|---|
| turbo3 short | 53.96 | 51.84 | +4.1% |
| turbo3 8K | 54.13 | 51.84 | +4.4% |
| turbo4 short | 60.20 | 58.74 | +2.5% |
| turbo4 8K | 60.37 | 58.74 | +2.8% |

Qwen2.5-7B Q4_K_M (8K context)

| Config | SMEM t/s | Baseline t/s | Delta |
|---|---|---|---|
| turbo3 8K | 26.17 | 25.88 | +1.1% |
| turbo4 8K | 26.31 | 27.59 | -4.6% |

Analysis: SMEM helps small models, hurts large turbo4

The pattern is clear: SMEM pre-dequant trades constant memory pressure for threadgroup memory + barriers. This helps when:

  • Model is small (attention is larger fraction of decode → dequant matters more)
  • KV format is turbo3 (8 centroids = more constant reads per element to avoid)

It hurts when:

  • Model is large (register pressure from SMEM allocation → spills)
  • KV format is turbo4 on 7B (16 centroids = larger pre-dequant tile → more SMEM overhead)

This suggests a model-size-adaptive dispatch: use SMEM pre-dequant for small models (≤3B), 4-mag LUT for large models (≥7B). The crossover is somewhere in between.

SMEM Pre-Dequant on M5 Max — Small Regression (2026-03-28 night, corrected)

Tested SMEM on M5 Max (Apple10) with Qwen3.5-35B-A3B MoE. Initial results showed -77% but that was an env var bug (TURBO_SMEM_DEQUANT not set at runtime). Corrected results with env var properly set:

Qwen3.5-35B-A3B Q8_0, M5 Max, decode (tg128 t/s)

| Context | turbo3 baseline | turbo3 SMEM | turbo4 baseline | turbo4 SMEM |
|---|---|---|---|---|
| short | 78.47 | 80.32 (+2.4%) | 80.40 | 77.64 (-3.4%) |
| 8K | 78.90 | 76.99 (-2.4%) | 79.84 | 78.78 (-1.3%) |
| 16K | 78.64 | 77.98 (-0.8%) | | |
| 32K | 78.17 | 76.18 (-2.5%) | | |

Note: high variance on some runs (±2.57 t/s). The one positive (turbo3 short +2.4%) is noise.

Why SMEM doesn't help on M5

  • M5's constant cache (Apple10) is already fast — handles 8-way divergent reads with minimal penalty
  • SMEM adds threadgroup store/barrier/load overhead for data the constant cache was already serving efficiently
  • Unlike M2 where SMEM helped small models +4%, M5 sees no benefit at any context depth

Verdict: SMEM is dead

Do NOT ship SMEM pre-dequant. Small regression on M5, marginal win on M2 small models only. Not worth the complexity.

Final Conclusion: M2 Decode Ceiling (2026-03-28)

20 approaches tested + 1 corrected re-run. The 4-mag LUT is the M2 ceiling for large models. SMEM pre-dequant adds ~2-4% for small models.

The bottleneck is NOT:

  • Constant memory LUT divergence (proven by #20: zero LUT = same speed)
  • Centroid lookup specifically (proven by #16h, #19: different storage = same speed)
  • Read ordering (proven by #22: forced ordering = worse)
  • Branch overhead (proven by #6-9: branchless = worse)

The bottleneck IS:

  • Device memory bandwidth for streaming quantized KV blocks from DRAM
  • Dequant ALU complexity (WHT extraction + centroid + norm vs q8_0's simple int8 * scale)
  • These two costs are inherent to the turbo format and cannot be optimized away without changing the format itself
  • For small models where attention dominates, SMEM can squeeze out ~4% by trading constant reads for threadgroup reads

Remaining dequant-level approaches (low priority, likely moot)

| # | Approach | Category | Expected impact | Status |
|---|---|---|---|---|
| 17 | Device-memory centroid×norm | Block format change | Moot — centroid reads are free | NOT TESTED |
| 18 | Byte-indexed 256-entry LUT | LUT restructure | Moot — same reason | NOT TESTED |
| 21 | Hybrid 4-mag + simd_shuffle | Combined | Low — simd_shuffle was only -2.6% standalone | NOT TESTED |

Phase 2: Kernel/Format/Pipeline-Level Optimization (2026-03-28 night)

Phase 1 (approaches 1-22) exhausted dequant-level optimization: rearranging reads, ALU, barriers, and storage within the existing FA vec kernel template. The 4-mag LUT is the dequant ceiling.

Phase 2 attacks at a higher level: different kernel shapes, different data layouts, and runtime dispatch. These are more invasive but target the actual structural gap.

Experiment #23: Turbo-Only Non-Vec Decode Kernel

Category: Kernel path

Hypothesis: The current FA vec kernel (kernel_flash_attn_ext_vec) is designed for general-purpose quantized attention. A turbo-specific decode kernel built from scratch for single-token long-context generation could avoid overhead from the generic template.

Why this might work:

  • The vec kernel's template machinery (generic dequant function pointers, parameterized NL/NSG/NE) adds register pressure and instruction cache footprint. A hand-specialized turbo3 dk128/dv128 kernel can hardcode everything.
  • Earlier test of non-vec FA (approach #10, 10.2 t/s) used the EXISTING generic non-vec kernel, which is optimized for multi-token prefill, not single-token decode. A purpose-built decode kernel is a different thing entirely.
  • Older Apple GPUs (Apple7/8) reward brutally specific kernels over generic templates.

Implementation plan:

  1. Create a new kernel function kernel_flash_attn_ext_turbo3_decode_dk128 in ggml-metal.metal
  2. Hardcode: dk=128, dv=128, single-token (ne01=1), turbo3 block format
  3. Inline the 4-mag dequant directly (no function pointer indirection)
  4. Optimize the Q*K^T loop for the exact turbo3 block layout: read 14 bytes (2 bytes norm + 8 bytes qs + 4 bytes signs), extract inline, dot product inline
  5. Optimize the V accumulation separately (can use different strategy than K)
  6. Wire through ggml-metal-ops.cpp dispatch: use this kernel when type_k == TURBO3_0 && ne01 == 1 && dk == 128
  7. Gate behind TURBO_SPECIALIZED_DECODE=1 env var

Key code touchpoints:

  • ggml/src/ggml-metal/ggml-metal.metal — new kernel function + instantiation
  • ggml/src/ggml-metal/ggml-metal-ops.cpp — dispatch logic to select this kernel
  • ggml/src/ggml-metal/ggml-metal-device.m — env var wiring

Success criteria:

  • Any improvement over 25.88 t/s (turbo3 7B 8K baseline) is a win
  • 30+ t/s would close the gap to q8_0 (33.69) significantly
  • Regression at short context is acceptable if long context improves

Risk: High effort, uncertain payoff. The template overhead might not be the bottleneck.

Result: NEGATIVE. Metal shader compiler already fully specializes templates — function pointers resolved, dimensions constant-folded, dead code eliminated. No runtime overhead to remove.

| Model | Baseline t/s | Specialized t/s | Delta |
|---|---|---|---|
| 1.5B short | 53.34 | 52.77 | -1.1% |
| 1.5B 8K | 51.73 | 51.44 | -0.6% |
| 7B 8K | 25.89 | 25.83 | -0.2% (noise) |

The slight regression is likely from increased instruction footprint — duplicated inline dequant in K and V loops vs the compiler sharing code between template-generated paths. The generic template IS the optimized kernel.


Experiment #24: Apple7/8-Specific Alternate Block Format

Category: Data layout

Hypothesis: The turbo3 block format was designed for the algorithm, not for Apple8 GPU memory access patterns. A second on-device KV format optimized for how the M2 FA vec kernel actually consumes data could reduce memory stalls.

Why this might work:

  • Current turbo3 block layout: [norm(2B)] [qs(8B)] [signs(4B)] = 14 bytes per 32 elements (see the struct sketch after this list). The kernel reads these in 3 separate device memory accesses at different offsets within the block.
  • If we interleave the data by 4-element consumption order (matching the vec kernel's DK4 stride), each fetch brings exactly what the next compute step needs — no wasted bandwidth, better prefetch prediction.
  • The profiling data shows "reading bytes" costs ~10% of ceiling on both M2 and M5. Reducing this by even half could be meaningful when combined with other improvements.
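
A hedged struct sketch of the two layouts under discussion; field names and packing are inferred from the 14-byte figure above and are not the actual ggml-common.h definitions.

```cpp
#include <cstdint>

// Current turbo3 block (as described above): 14 bytes covering 32 elements,
// read as three separate spans at different offsets.
struct block_turbo3_sketch {
    uint16_t norm;       // half-precision group norm (2 B)
    uint8_t  qs[8];      // 32 x 2-bit magnitude indices (8 B)
    uint8_t  signs[4];   // 32 x 1-bit signs (4 B)
};

// Hypothetical Apple8-oriented repacking: the same 14 bytes, regrouped into
// per-8-element granules so each fetch matches the vec kernel's consumption order.
struct block_turbo3_m2_sketch {
    uint16_t norm;       // shared norm up front (2 B)
    struct granule {
        uint8_t qs[2];   // 8 x 2-bit magnitude indices
        uint8_t signs;   // 8 x 1-bit signs
    } g[4];              // 4 granules x 3 B = 12 B
};
```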

Implementation plan:

  1. Design an alternate block layout block_turbo3_m2 where data is arranged by the vec kernel's consumption order:
    • For each 4-element group: [norm_chunk] [qs_chunk] [sign_bits] contiguous
    • Or: per-8-element granule matching SIMD consumption width
  2. Add a SET_ROWS variant that packs into the new layout during KV cache write
  3. Add a dequantize_turbo3_m2_t4 that reads the new layout
  4. Wire as a new kernel instantiation
  5. Gate behind TURBO_M2_FORMAT=1 env var

Key code touchpoints:

  • ggml/src/ggml-common.h — new block struct definition
  • ggml/src/ggml-metal/ggml-metal.metal — new dequant function + kernel instantiation
  • ggml/src/ggml-metal/ggml-metal-ops.cpp — SET_ROWS + dispatch for new type
  • ggml/src/ggml-metal/ggml-metal-device.m — env var

Success criteria:

  • Measurable improvement in the "read bytes" profiling mode (mode 4 → mode 3 gap)
  • Any decode speed improvement over 4-mag baseline
  • Must not regress PPL (same data, different packing)

Risk: Medium effort, medium payoff. The 10% "read bytes" cost is real but may already be well-hidden by the dequant ALU.


Experiment #25: Dual Runtime Dispatch by Chip Family + Context Depth

Category: Pipeline

Hypothesis: No single kernel configuration is optimal across all context depths. A runtime dispatch that selects different strategies based on context depth could win overall.

Why this might work:

  • Already proven: 4-mag helps at 16K on M5 (+2.4%) but hurts at 32K (-7.3%). Crossover at ~20K.
  • turbo3 no-dequant is 12% faster than q8_0 at short context (bandwidth advantage) but 38% slower at 8K (dequant overhead dominates).
  • The vec kernel has NL parameter controlling simdgroup work distribution. Different NL values may be optimal at different context depths.

Implementation plan:

  1. Compile both 4-mag and 8-LUT FA kernel instantiations for turbo3/turbo4 on pre-M5
  2. In ggml-metal-ops.cpp, add context-depth dispatch:
    • Pre-M5 + context < threshold: one kernel config
    • Pre-M5 + context ≥ threshold: different kernel config
  3. Profile to find the optimal crossover point
  4. Also test different NL values (currently NL=1 for dk128) at different context depths
  5. Gate behind TURBO_CONTEXT_DISPATCH=1 env var

Key code touchpoints:

  • ggml/src/ggml-metal/ggml-metal-ops.cpp — dispatch logic (primary)
  • ggml/src/ggml-metal/ggml-metal.metal — may need additional kernel instantiations with different NL
  • ggml/src/ggml-metal/ggml-metal-device.m — env var

Success criteria:

  • Better decode across the full context range (short through 32K) than any single configuration
  • Minimal code complexity (dispatch is a few lines in ops.cpp)

Risk: Low effort, medium payoff. The gains from 4-mag vs 8-LUT switching were small on M5 (+2.4% at one depth). On M2 where the gap is larger, the payoff might be bigger.

Result: NEGATIVE on 1.5B. turbo3/turbo4 decode is completely flat across all context depths (~52/~59 t/s). q8_0 gets faster at long context (48 → 109 t/s) because attention becomes a larger fraction of decode. No crossover point to dispatch on.

| Config | short | 2K | 4K | 8K | 16K |
|---|---|---|---|---|---|
| f16 (ceiling) | 119.60 | 116.22 | 116.14 | | |
| q8_0 | ~48 | 64 | ~82 | 107 | 109 |
| turbo4 | 59.29 | 59.02 | 59.00 | 60.65 | 59.03 |
| turbo3 (4-mag) | 52.20 | 51.94 | 52.36 | 51.79 | 51.86 |
| turbo3 (zero dequant) | 55.68 | 55.87 | 55.83 | 55.42 | 55.69 |

Additional findings:

  • Zero-dequant gap is only 7% on 1.5B — FFN matmuls dominate, not dequant
  • Asymmetric K/V (mixed types) not supported — no mixed-type kernel templates
  • turbo2 not registered in this build — needs update
  • Needs retesting on larger model (7B+) where attention is a bigger fraction of decode

Priority order for Phase 2

  1. #23 Turbo-only non-vec decode kernel — highest potential, directly attacks the structural gap. Start here.
  2. #25 Dual runtime dispatch — low effort, can run in parallel with #23 as a quick profiling exercise.
  3. #24 Alternate block format — medium effort, save for after #23 results inform whether the bottleneck is compute or memory layout.

Additional ideas from full brainstorm (backlog, not prioritized)

These are logged for future reference. Each could become a numbered experiment if the top 3 don't pan out.

Kernel-path:

  • Lane-specialized dequant microkernels (hardcode per dk/dv pair)
  • Split K/V strategies (different kernel for score vs accumulation)
  • Metadata-first K path (stage norm+bits, defer centroid materialization)
  • Two-token decode kernel (microbatch=2 for better utilization)
  • Persistent-thread decode kernel (keep threadgroups resident across tiles)

Computation:

  • Norm hoisting at higher granularity (fused partial accumulation)
  • Affine centroid approximation (approximate arithmetic + correction term)
  • Top-2 magnitude encode (cheap primary path + rare correction)
  • Softmax-side pruning before full V work (conservative early exit)
  • Mixed-precision accumulation schedule (audit half/float widen points)

Instrumentation:

  • Per-section timing inside Metal kernel (finer than current profiling modes)
  • Dimension-specific perf matrix (128 vs 192 vs 256 on Apple8)
  • Instruction-cache pressure audit (code-size-reduced kernel variants)

M5 Max Long-Context Discovery (2026-03-27)

The constant cache bottleneck hits M5 Max too at long context

| Depth | Ceiling | Reads only | Full dequant | q8_0 | LUT cost | Ceiling vs q8_0 |
|---|---|---|---|---|---|---|
| 8K | 78.9 | 75.2 | 59.2 | 78.8 | 20% | 1.00x |
| 16K | 75.9 | 74.7 | 58.7 | 72.0 | 21% | 1.05x |
| 32K | 78.3 | 71.9 | 47.1 | 61.0 | 34% | 1.28x |

turbo3 with zero dequant is 28% FASTER than q8_0 at 32K on M5 Max. The compressed cache bandwidth advantage grows with context. The LUT cost explodes from 20% to 34% as context grows.

4-mag vs 8-LUT on M5 Max across context depths

| Depth | q8_0 | 8-LUT | 4-mag | 4-mag vs 8-LUT |
|---|---|---|---|---|
| short | 85.0 | 76.7 | 76.2 | -0.7% |
| 16K | 72.0 | 58.9 | 60.3 | +2.4% |
| 32K | 61.0 | 47.6 | 44.1 | -7.3% |

4-mag helps at 16K (+2.4%) but hurts at 32K (-7.3%) on M5. Crossover around 20K.

Context-adaptive dispatch (planned)

Compile both 4-mag and 8-LUT FA kernel instantiations. At dispatch time, select based on KV cache size (a dispatch sketch follows the list):

  • Pre-M5 (no tensor API): always 4-mag
  • M5+ with context < 8K: 8-LUT (minimal cache pressure)
  • M5+ with 8K-20K context: 4-mag (moderate pressure, 4-mag helps)
  • M5+ with context > 20K: 8-LUT (fully thrashed, ALU overhead dominates)
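
A minimal host-side sketch of that dispatch, assuming hypothetical names (`turbo_fa_variant`, `select_turbo3_fa_pipeline`); the real hook would live where ggml-metal-ops.cpp picks the FA pipeline.

```cpp
#include <cstdint>

enum class turbo_fa_variant { LUT8, MAG4 };

// Pick the dequant flavor per launch from chip family and current KV depth.
turbo_fa_variant select_turbo3_fa_pipeline(bool has_tensor_api, int64_t kv_tokens) {
    if (!has_tensor_api)        return turbo_fa_variant::MAG4;  // pre-M5: always 4-mag
    if (kv_tokens <  8 * 1024)  return turbo_fa_variant::LUT8;  // short: constant cache is cheap
    if (kv_tokens < 20 * 1024)  return turbo_fa_variant::MAG4;  // 8K-20K: 4-mag wins (+2.4% at 16K)
    return turbo_fa_variant::LUT8;   // >20K: cache fully thrashed, extra ALU dominates
}
```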

The real prize

If we can reduce dequant cost from 34% to ~10% at 32K, turbo3 decode would be FASTER than q8_0 at long context. The bandwidth advantage (28% at 32K) far exceeds the dequant overhead — we just need to close the gap. This flips the narrative from "turbo3 is slower but smaller" to "turbo3 is faster AND smaller."