TurboQuant Decode Speed: Why Some Hardware Struggles

The Problem

turbo3 decode speed varies dramatically across Apple Silicon generations:

| Hardware | Decode vs q8_0 (short) | Decode vs q8_0 (16K) | Has Tensor API |
|---|---|---|---|
| M5 Max | 0.89x | ~0.82x | yes |
| M2 Pro | 0.73x | 0.67x (with 4-mag fix) | no |
| M1 Max | 0.84x | 0.45x (reported) | no |

M5 Max has near-parity. M2/M1 fall off a cliff at long context. This doc explains why and what we did about it.

Root Cause: Constant Memory LUT Divergence

turbo3 dequant uses an 8-entry centroid lookup table (LUT) in Metal constant memory. Each element's value is determined by a 3-bit index (2-bit qs + 1-bit sign), so each thread in a 32-thread simdgroup may access a different LUT entry.

This "divergent access" pattern is where hardware generations differ:

  • M5 Max (Apple10): Efficient constant cache handles 8-way divergent access with minimal penalty. LUT costs ~14% of decode time.
  • M2 Pro (Apple8): Older constant cache serializes divergent access. LUT costs ~25% of decode time — 2x worse than M5.
  • M1 Max (Apple7): Even worse, likely 30%+ LUT cost based on reported numbers.
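
For concreteness, here is a minimal sketch of what a divergent 8-entry lookup looks like. The centroid values, names (`turbo3_centroids`, `dequant_turbo3_8lut`), and bit layout are illustrative placeholders, not the project's actual kernel code.

```metal
// Illustrative only: 8 centroids in constant memory, indexed by a 3-bit
// code per element (2-bit magnitude + 1-bit sign). Values are placeholders.
constant float turbo3_centroids[8] = {
    -1.75f, -1.25f, -0.75f, -0.25f,   // sign bit = 0 (negative)
     0.25f,  0.75f,  1.25f,  1.75f    // sign bit = 1 (positive)
};

inline float dequant_turbo3_8lut(uint qs2, uint sign1, float norm) {
    // Each of the 32 lanes in a simdgroup can form a different 3-bit index,
    // so this one read may touch up to 8 distinct constant-memory addresses.
    uint idx = (sign1 << 2) | qs2;
    return norm * turbo3_centroids[idx];
}
```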

How We Know: Profiling Isolation

We added TURBO_PROFILE_MODE (0-4) to strip away dequant layers one at a time:

| Mode | What | M5 tok/s (% of ceiling) | M2 tok/s (% of ceiling) |
|---|---|---|---|
| 1 | No dequant at all | 78.9 (100%) | 24.5 (100%) |
| 2 | + read norm only | 75.1 (95%) | 22.1 (90%) |
| 4 | + read all bytes | 75.2 (95%) | 21.9 (89%) |
| 3 | + qs extraction + LUT | 64.9 (82%) | 16.4 (67%) |
| 0 | + signs + full LUT | 59.2 (75%) | 14.0 (57%) |

Key finding: No-dequant turbo3 is 12% FASTER than q8_0 on M2 Pro (24.5 vs 21.9) because the compressed cache moves less data over the memory bus. The compression FORMAT is not the problem. The dequant COMPUTATION is.

The LUT accounts for:

  • M5: 13.7% of ceiling (Mode 4 → Mode 3)
  • M2: 25.1% of ceiling — 2x worse

Reading the bytes (norm + qs + signs) costs ~10% on both chips. That's not the bottleneck.
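
To illustrate how layered profiling like this can be wired in, here is a hedged sketch. The mode numbering follows the table above, but the helper name (`dequant_profiled`), block layout, and bit offsets are assumptions, not the actual kernel code.

```metal
#include <metal_stdlib>
using namespace metal;

// Each TURBO_PROFILE_MODE keeps one more stage of the real dequant, so the
// per-stage cost falls out of the tok/s deltas. Outputs are meaningless in
// modes 1-4; only the cost of the retained work matters.
inline float dequant_profiled(device const uchar *blk, uint i, constant float *lut) {
#if   TURBO_PROFILE_MODE == 1
    return 1.0f;                                        // mode 1: no dequant at all (ceiling)
#elif TURBO_PROFILE_MODE == 2
    return float(((device const half *)blk)[0]);        // mode 2: read the group norm only
#elif TURBO_PROFILE_MODE == 4
    float norm = float(((device const half *)blk)[0]);
    uchar raw  = blk[2 + i / 4];                        // mode 4: also touch the qs/sign bytes
    return norm + float(raw);                           // keep the load alive, skip extraction
#elif TURBO_PROFILE_MODE == 3
    float norm = float(((device const half *)blk)[0]);
    uint  qs   = (blk[2 + i / 4] >> ((i % 4) * 2)) & 0x3;
    return norm * lut[qs];                              // mode 3: qs extraction + LUT read
#else
    // mode 0: full dequant (qs + sign + LUT + norm) goes here
    return 0.0f;                                        // placeholder
#endif
}
```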

What We Tried (7 Approaches)

| # | Approach | M2 8K tok/s | vs Main | Why it works/doesn't |
|---|---|---|---|---|
| 1 | 4-mag LUT + XOR sign | 15.1 | +38% | Halves constant addresses (4 vs 8). Sweet spot on M2. |
| 2 | Batched byte extract | 13.7 | +25% | Better byte reading, still 8 LUT addresses |
| 3 | Deferred norm multiply | 12.9 | +18% | Loses ILP — per-element norm hides LUT latency |
| 4 | 2-pair half2 LUT | 12.0 | +10% | Ternary overhead exceeds savings from 2 addresses |
| 5 | Select chain (zero LUT) | 11.9 | +9% | Too much ALU — branches on M2 GPU |
| 6 | Bit-arithmetic | 11.6 | +6% | Pure ALU, zero memory — but ALU cost too high |
| 7 | Non-vec FA (nl=2) | 10.2 | -7% | Non-vec kernel not designed for single-token decode |

Also tested on M5 Max (no help needed):

  • float cn[8] registers: 75.2 (spills to stack on Metal)
  • half cn[8] registers: 74.4 (also spills)
  • Split 2×4 half LUT: 74.0 (branch overhead)

The Fix: Auto-Detection

The build auto-detects hardware at Metal library compile time:

```
M1/M2/M3/M4 (has_tensor=false) → TURBO_USE_4MAG=1 → 4-entry magnitude LUT
M5+          (has_tensor=true)  → TURBO_USE_4MAG=0 → 8-entry full LUT
```

Each chip gets its optimal dequant path. No user configuration needed.

How 4-mag LUT works

The 8 centroids have structure: 4 magnitudes × 2 signs. We split the 3-bit index:

  • 2-bit qs → selects magnitude from 4-entry LUT (4 possible constant addresses)
  • 1-bit sign → reverses magnitude index via XOR, then negates via ALU multiply
  • Norm → multiplied per-element (provides ALU work that hides constant memory latency)

The XOR trick: for negative values (sign=0), the magnitude index is reversed (qs=0 → highest magnitude). qs ^ 0x3 flips the 2-bit index without a branch.
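
A minimal sketch of this path, using placeholder magnitude values and illustrative names (`turbo3_mags`, `dequant_turbo3_4mag`); the real codebook and bit layout live in the Metal source, and the TURBO_USE_4MAG build flag described above selects between this and the 8-entry path.

```metal
// Illustrative 4-magnitude variant: at most 4 constant addresses can
// diverge across a simdgroup instead of 8, with no branches added.
constant float turbo3_mags[4] = { 0.25f, 0.75f, 1.25f, 1.75f };   // placeholder magnitudes

inline float dequant_turbo3_4mag(uint qs2, uint sign1, float norm) {
    uint  mi  = qs2 ^ (0x3u * (1u - sign1));   // sign1 == 0 (negative): reverse order via qs ^ 0x3
    float sgn = 2.0f * float(sign1) - 1.0f;    // -1 or +1 from the sign bit, no ternary
    // Per-element norm multiply on purpose: the ALU work overlaps the next
    // divergent constant read (ILP), which a deferred norm multiply loses.
    return sgn * norm * turbo3_mags[mi];
}
```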

Critical finding: On M2, branches cost MORE than divergent constant reads

We tested 9 approaches total. The pattern is clear:

| Approach | M2 8K tok/s | Constant reads | Branches | Result |
|---|---|---|---|---|
| 4-mag LUT | 15.1 | 4 divergent | 0 | BEST |
| Batched extract (8-LUT) | 13.7 | 8 divergent | 0 | Good |
| Deferred norm (4-mag) | 12.9 | 4 divergent | 0 | Lose ILP |
| 2-pair half2 + ternary | 12.0 | 2 divergent | 4 | Branches hurt |
| Select chain (zero LUT) | 11.9 | 0 | 8 | Branches kill |
| Bit-arithmetic | 11.6 | 0 | 4+ | ALU too heavy |
| Named-reg + ternary select | 10.3 | 4 uniform | 8 | Worst — ternary kills |
| Main (8-entry LUT) | 10.95 | 8 divergent | 0 | Baseline |
| Inline block (FA inner loop) | 13.5 | 4 divergent | 0 | I-cache pressure |
| Non-vec FA (nl=2) | 10.2 | 8 divergent | 0 | Wrong kernel for decode |

Any approach that replaces constant reads with branches loses on M2. The Apple8 GPU's branch predictor/execution is worse than its constant cache. The 4-mag LUT succeeds because it reduces constant addresses (4 vs 8) WITHOUT adding branches.

Per-element norm multiply is faster than deferred

Deferring float4 * norm at the end (12.9 tok/s) is slower than per-element v * norm (15.1 tok/s). The per-element multiply provides ALU work that hides constant memory latency via instruction-level parallelism. The GPU can overlap the next constant read while the current multiply executes.

Why CUDA Doesn't Have This Problem

@spiritbuun's CUDA fork achieves 97.5% of q8_0 decode. The key difference:

  • CUDA has 255 registers per thread — cn[8] fits in registers without spilling
  • Metal has a smaller register file — cn[8] spills to thread stack memory
  • CUDA warp semantics — better divergent access handling across 32 threads
  • Metal simdgroup semantics — constant cache serializes more on older chips

The register LUT approach works perfectly on CUDA but is fundamentally incompatible with Metal's register file on current hardware.

Remaining Gap

Even with 4-mag LUT, M2 Pro decode at 8K is 15.1 vs 24.5 ceiling (62%). The remaining 38% gap needs kernel-level changes:

  1. SMEM pre-dequant: Pre-dequant entire KV blocks into threadgroup memory before dot products. Would eliminate constant memory from the inner loop entirely. Requires modifying the FA vec kernel template.

  2. Per-group device-memory cached centroid×norm: Store 8 pre-computed centroid×norm values per 128-element group alongside the block data. Dequant reads from device memory (sequential) instead of constant memory (divergent). Changes the block format.

  3. Fused Q·centroid attention: Compute attention scores directly from quantized indices without full dequantization. Precompute Q·centroid table (8 values per block), then each K element is a table lookup. Complex (custom FA kernel).

These are tracked in GitHub issue #39.

Summary

| What | Result |
|---|---|
| Root cause | Constant memory LUT divergence, 2x worse on M2 vs M5 |
| Best fix found | 4-mag LUT (+38-45% on M2, auto-detected) |
| M5 impact | Zero regression (uses separate code path) |
| Remaining gap | 38% to ceiling, needs kernel-level surgery |
| CUDA comparison | buun gets 97.5% via register LUT (not portable to Metal) |

Complete Experiment Log (12 approaches, 2026-03-26/27)

M2 Pro (Apple8, has_tensor=false) — 8K context decode

| # | Approach | tok/s | vs q8_0 | vs Ceiling | Key finding |
|---|---|---|---|---|---|
| — | No-op ceiling | 24.5 | 1.12x | 100% | turbo3 format FASTER than q8_0 (less bandwidth) |
| 1 | 4-mag LUT + XOR sign | 15.1 | 0.69x | 62% | Sweet spot: 4 constant addrs, 0 branches |
| 2 | Batched byte extract (8-LUT) | 13.7 | 0.63x | 56% | Better byte reading, still 8 addresses |
| 3 | Inline block in FA loop | 13.5 | 0.62x | 55% | I-cache pressure from expanded inline |
| 4 | Deferred norm (float4 * norm) | 12.9 | 0.59x | 53% | Loses ILP — norm multiply hides LUT latency |
| 5 | 2-pair half2 + ternary | 12.0 | 0.55x | 49% | Ternary overhead exceeds 2-addr savings |
| 6 | Select chain (zero LUT) | 11.9 | 0.54x | 49% | 8 ternaries = 8 branches on Apple8 |
| 7 | Bit-arithmetic (mul+add) | 11.6 | 0.53x | 47% | 7 ALU > 4 constant reads |
| 8 | FMA branchless (zero ternary) | 11.4 | 0.52x | 47% | fma doesn't help — same ALU count |
| — | Main (8-entry constant LUT) | 10.95 | 0.50x | 45% | Baseline — 8-way divergent |
| 9 | Named-reg ternary select | 10.3 | 0.47x | 42% | 4 uniform reads + 8 branches = worst |
| 10 | Non-vec FA forced (nl=2) | 10.2 | 0.47x | 42% | Non-vec kernel wrong for single-token decode |

M5 Max (Apple10, has_tensor=true) — short context decode

All approaches: 75-77 tok/s (M5 uses 8-LUT path, unaffected by TURBO_USE_4MAG). No regression from any experiment. q8_0 baseline: 85 tok/s.

Additional checks

  • Qwen2 attention bias: NOT present in Qwen2.5-7B-Instruct model (339 tensors, no attn_q/k/v.bias). The GÖKYILDIZ bug does not apply.
  • Model-independence: Profiling pattern (LUT = 25% cost on M2) is consistent across Qwen2.5-7B (M2 Pro) and Qwen3.5-35B-A3B (M5 Max). Architecture-independent.
  • Non-vec FA (nl=2): Faster on M5 (+1.7%) but much worse on M2 (-7%). The non-vec kernel processes batch=1 inefficiently at long context.

Hardware truth (Apple8/M2 Pro)

  1. 1 divergent constant read < 7 ALU ops — even with fma()
  2. Metal compiles ternaries to branches — not predicated moves like CUDA
  3. Branches cost MORE than divergent constant reads — the opposite of CUDA
  4. Array indexing ALWAYS spills — Metal's register file is too small for cn[4+]
  5. 4 constant addresses is the sweet spot — fewer adds branches, more adds thrashing
  6. Per-element norm multiply provides ILP — hides constant memory latency

Papers reviewed for novel approaches

  • AttentionPack (arxiv 2603.23914): validates kernel fusion for attention-aware decompression
  • GlowQ (arxiv 2603.25385): validates group-shared factor caching (+37.4% throughput)
  • Embedding Compression via Spherical Coordinates (2602.00079): same PolarQuant framework, different encoding
  • MKA: Memory-Keyed Attention (2603.20587): learned memory lookup, hardware-aware
  • Scaling Attention via Feature Sparsity (2603.22300): skip negligible attention weights

Next frontier

The 4-mag LUT is the dequant-level ceiling. The remaining 38% gap requires:

  1. Block format change: embed precomputed centroid×norm in device memory (sequential reads, zero divergence)
  2. Custom FA kernel: fuse dequant into attention with restructured computation
  3. Different quantization scheme: format designed for Metal's constraints from scratch

Extended Experiments (approaches 12-13, 2026-03-27)

| # | Approach | M2 8K tok/s | vs 4-mag | Key finding |
|---|---|---|---|---|
| 12 | FMA branchless (zero ternary, zero memory) | 11.4 | -24% | 7 ALU ops slower than 4 constant reads even with fma() |
| 13 | simd_shuffle cross-lane magnitude select | 14.7 | -2.6% | Shuffle latency ≈ constant read latency on Apple8 |

FMA branchless details

Fully branchless: XOR mask via 3 - 3*sign_bit (not ternary), sign via 2*s - 1 (not ternary), magnitude via 3-chained fma(). Zero branches, zero memory. Still slower because 7 ALU cycles > 1 divergent constant read cycle on Apple8.
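
A sketch of what that looks like, with placeholder polynomial coefficients standing in for a fit through the real 4-magnitude codebook (c0..c3 and the helper name are assumptions):

```metal
#include <metal_stdlib>
using namespace metal;

// Illustrative fully branchless dequant: zero memory traffic, all ALU.
inline float dequant_turbo3_fma(uint qs2, uint sign1, float norm) {
    float mi  = float(qs2 ^ (0x3u - 0x3u * sign1));     // XOR mask via 3 - 3*sign_bit
    float sgn = 2.0f * float(sign1) - 1.0f;             // sign via 2*s - 1
    // Cubic through the 4 magnitudes; the placeholder coefficients below
    // reduce to the linear map 0.25 + 0.5*mi, a real codebook needs all four.
    const float c0 = 0.25f, c1 = 0.5f, c2 = 0.0f, c3 = 0.0f;
    float mag = fma(fma(fma(c3, mi, c2), mi, c1), mi, c0);   // 3 chained fma
    return sgn * norm * mag;   // ~7 ALU ops, still slower than 4 constant reads on Apple8
}
```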

simd_shuffle details

Threads 0-3 within each 8-thread block compute mag[i]×norm. All threads use simd_shuffle(value, block_base + mi) to read the correct mag×norm. This IS branchless and memory-free — the value moves via cross-lane register transfer. But shuffle latency on Apple8 is comparable to constant cache access, negating the benefit.
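
A hedged sketch of the idea follows; the lane bookkeeping and names are assumptions, and the real kernel's lane-to-block mapping may differ.

```metal
#include <metal_stdlib>
using namespace metal;

// Illustrative cross-lane magnitude broadcast. Assumes 8 consecutive lanes
// of the simdgroup share one quantization block (and therefore one norm).
inline float dequant_turbo3_shuffle(uint qs2, uint sign1, float norm,
                                    ushort lane_id, constant float *mags) {
    ushort block_base = (lane_id / 8) * 8;            // first lane of this 8-lane group
    ushort slot       = lane_id - block_base;         // 0..7 within the group
    // Lanes 0-3 stage the four mag*norm products; the others stage a dummy.
    float staged = (slot < 4) ? mags[slot] * norm : 0.0f;
    // Magnitude index for THIS lane's element, then fetch the matching staged
    // product from the lane that holds it: register traffic instead of a
    // per-element divergent constant read.
    uint  mi       = qs2 ^ (0x3u * (1u - sign1));
    float mag_norm = simd_shuffle(staged, ushort(block_base + mi));
    float sgn      = 2.0f * float(sign1) - 1.0f;
    return sgn * mag_norm;
}
```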

Total: 13 approaches tested

The 4-mag constant LUT at 15.1 tok/s (0.69× q8_0) is the definitive dequant-level ceiling on Apple8 hardware. Every alternative — fewer constant addresses, zero constant addresses, branchless ALU, cross-lane shuffle, inline FA blocks — is equal or worse.

Extended Experiments (approaches 14, 2026-03-27)

| # | Approach | M2 8K tok/s | vs 4-mag | Key finding |
|---|---|---|---|---|
| 14 | Fused block dot (per-centroid Q accum) | 8.1 | -46% | 64 float comparisons per block — worst result of all |

Fused block dot details

Flipped computation: instead of per-element centroid lookup, iterate over 4 centroid values and accumulate matching Q elements. Uses float(mi == c) as a branchless mask. But 4 centroids × 4 elements × 4 comparisons = 64 float comparison operations per dequant call. Each comparison likely compiles to a branch on Apple8, making this the most expensive approach tested.
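
A sketch of the flipped loop over one 4-element group (names and granularity are illustrative):

```metal
// Illustrative fused block dot: loop over the 4 magnitudes and accumulate the
// Q elements whose index matches, using float(mi == c) as a branchless mask.
// mi[]/sgn[] hold the already-decoded indices and signs for the group.
inline float fused_block_dot(thread const float *q, thread const uint *mi,
                             thread const float *sgn, float norm,
                             constant float *mags) {
    float acc = 0.0f;
    for (uint c = 0; c < 4; ++c) {
        float masked = 0.0f;
        for (uint e = 0; e < 4; ++e) {
            masked += float(mi[e] == c) * sgn[e] * q[e];   // one comparison per (c, e) pair
        }
        acc += mags[c] * masked;
    }
    return norm * acc;
}
```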

Final tally: 14 approaches tested

| Rank | Approach | M2 8K tok/s | vs Ceiling | Category |
|---|---|---|---|---|
| 1 | 4-mag LUT | 15.1 | 62% | 4 constant reads |
| 2 | simd_shuffle | 14.7 | 60% | Cross-lane transfer |
| 3 | Batched extract (8-LUT) | 13.7 | 56% | 8 constant reads |
| 4 | Inline FA block | 13.5 | 55% | Inlined dequant |
| 5 | Deferred norm | 12.9 | 53% | Loses ILP |
| 6 | 2-pair half2 | 12.0 | 49% | 2 reads + ternary |
| 7 | Select chain | 11.9 | 49% | Pure branches |
| 8 | Bit-arithmetic | 11.6 | 47% | Pure ALU |
| 9 | FMA branchless | 11.4 | 47% | fma() chain |
| 10 | Main (8-LUT) | 10.95 | 45% | Baseline |
| 11 | Named-reg ternary | 10.3 | 42% | Registers + branches |
| 12 | Non-vec FA | 10.2 | 42% | Wrong kernel |
| 13 | Fused block dot | 8.1 | 33% | 64 comparisons |
| — | Ceiling (no dequant) | 24.5 | 100% | Zero overhead |

Extended Experiments (approach 15, 2026-03-28)

| # | Approach | M2 8K tok/s | vs 4-mag | Key finding |
|---|---|---|---|---|
| 15 | SMEM pre-dequant (threadgroup memory tile) | 10.17 | -33% | Threadgroup store/load overhead > constant cache savings |

SMEM pre-dequant details

Pre-dequantize entire K/V tiles (C=32 × DK=128 = 8KB half) into threadgroup memory before dot products. All 32 threads cooperatively dequant C cache positions, barrier, then compute from SMEM.
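
A structural sketch of the cooperative tile; the buffer layout, threadgroup size, and placeholder decode are assumptions, and the real change sits inside the FA vec kernel template rather than a standalone kernel.

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch only: cooperative pre-dequant of one K tile into threadgroup memory,
// then a barrier before the dot products.
kernel void smem_predequant_sketch(device const uchar *k_blocks [[buffer(0)]],
                                   device float       *out      [[buffer(1)]],
                                   uint tid [[thread_index_in_threadgroup]]) {
    const uint C = 32, DK = 128;          // tile: 32 cache positions x 128 dims (8 KB of half)
    const uint tg_size = 32;              // assumed threadgroup size for the sketch
    threadgroup half k_tile[C * DK];

    // Every thread dequantizes a strided slice of the tile ...
    for (uint idx = tid; idx < C * DK; idx += tg_size) {
        k_tile[idx] = half(k_blocks[idx]) * 0.01h;   // placeholder for the real turbo3 decode
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // ... but in the FA vec kernel each dequanted value is consumed by the
    // same thread that produced it, so the store + barrier + load round trip
    // buys nothing and breaks the read/ALU overlap (ILP).
    out[tid] = float(k_tile[tid * DK]);
}
```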

Why it failed: The FA vec kernel distributes work so each thread operates on unique data (DK4/NL=1 dequant per thread per cache position). Each dequanted value is used exactly once by its producer thread. Caching in SMEM adds 64 threadgroup memory ops (32 stores + 32 loads) per thread per outer iteration + barrier, for zero reduction in constant LUT reads. Additionally, separating dequant from compute destroys the ILP that makes the 4-mag LUT work (constant reads overlapped with ALU). At short context, the overhead is negligible (+1.8%), but at 8K it's catastrophic (-51.5%).

Lesson: SMEM only helps when data is shared between threads. Don't separate what the hardware pipelines together.

Updated tally: 16 approaches tested

| Rank | Approach | M2 8K tok/s | vs Ceiling | Category |
|---|---|---|---|---|
| 1 | 4-mag LUT | 15.1 | 62% | 4 constant reads |
| 2 | simd_shuffle | 14.7 | 60% | Cross-lane transfer |
| 3 | Batched extract (8-LUT) | 13.7 | 56% | 8 constant reads |
| 4 | Inline FA block | 13.5 | 55% | Inlined dequant |
| 5 | Deferred norm | 12.9 | 53% | Loses ILP |
| 6 | 2-pair half2 | 12.0 | 49% | 2 reads + ternary |
| 7 | Select chain | 11.9 | 49% | Pure branches |
| 8 | Bit-arithmetic | 11.6 | 47% | Pure ALU |
| 9 | FMA branchless | 11.4 | 47% | fma() chain |
| 10 | Main (8-LUT) | 10.95 | 45% | Baseline |
| 11 | Named-reg ternary | 10.3 | 42% | Registers + branches |
| 12 | Non-vec FA | 10.2 | 42% | Wrong kernel |
| 13 | SMEM pre-dequant | 10.17 | 41% | Threadgroup cache (ILP loss) |
| 14 | Q·centroid precompute | 10.10 | 41% | select() register LUT |
| 15 | Fused block dot | 8.1 | 33% | 64 comparisons |
| — | Ceiling (no dequant) | 24.5 | 100% | Zero overhead |

Critical ILP insight (2026-03-28)

The 2x slowdown of SMEM pre-dequant revealed WHY the 4-mag LUT works: interleaved constant reads and ALU provide instruction-level parallelism. The GPU overlaps the next constant read while the current dot product + norm multiply executes. Any approach that separates memory reads from compute phases loses this overlap and runs at 2x the latency.

This explains the ranking pattern: approaches that maintain per-element read → ALU interleaving (4-mag, simd_shuffle, batched 8-LUT) outperform approaches that batch reads (SMEM, deferred norm, QC precompute) or eliminate reads but add more ALU (select chain, FMA, bit-arithmetic).

The remaining 38% gap cannot be closed by rearranging the same reads and ALU. The only path forward is reducing the total constant read count per element below 4, which requires changing the block format or computation structure.

Extended Experiments (approaches 16h + 19, 2026-03-28 evening)

Clean benchmarks (no DINOv2 contention). Note: absolute tok/s numbers differ from earlier runs due to DINOv2 GPU contention during those measurements. Relative rankings are consistent.

Qwen2.5-7B-Instruct-Q4_K_M (8K context, p=8192 tg128)

| # | Approach | Env Var | turbo3 t/s | turbo4 t/s | q8_0 t/s | Key finding |
|---|---|---|---|---|---|---|
| — | Baseline (4-mag LUT) | — | 25.88 | 27.59 | 33.69 | turbo4 > turbo3 by 6.6% |
| 16h | Half register LUT (half cn[8]) | TURBO_HALF_REG_LUT=1 | 25.55 (-1.3%) | 27.26 (-1.2%) | | Still spills on Apple8. Noise. |
| 19 | Threadgroup centroid cache | TURBO_TG_CENTROID=1 | 25.96 (+0.3%) | 27.42 (-0.6%) | | Flat. Threadgroup read ≈ constant read. |

Qwen2.5-1.5B-Instruct-Q4_K_M (8K context, p=8192 tg128)

| # | Approach | Env Var | turbo3 t/s | turbo4 t/s | q8_0 t/s | Key finding |
|---|---|---|---|---|---|---|
| — | Baseline (4-mag LUT) | — | 51.84 | 58.74 | 113.02 | Gap worse on 1.5B (attention dominates) |
| 16h | Half register LUT | TURBO_HALF_REG_LUT=1 | 50.67 (-2.3%) | 57.76 (-1.7%) | | Consistent regression across models |
| 19 | Threadgroup centroid cache | TURBO_TG_CENTROID=1 | 52.19 (+0.7%) | 57.98 (-1.3%) | | Flat |

Prompt processing (pp8192, for reference)

| Model | turbo3 t/s | turbo4 t/s | q8_0 t/s |
|---|---|---|---|
| 7B | 241.26 | 240.00 | 254.74 |
| 1.5B | 816.13 | 807.03 | 896.42 |

Key takeaways from approaches 16h + 19

  1. Both approaches are noise on M2 Pro. Half Reg LUT is consistently -1 to -2% (regression). TG Centroid is flat (within variance).

  2. turbo4 > turbo3 on M2 Pro by 6-13% on decode. Holds across both model sizes. 16 centroids (turbo4) is faster than 8 (turbo3) — possibly better ILP with 4-bit indices.

  3. The gap to q8_0 grows with smaller models: 30% on 7B, 118% on 1.5B. Smaller models make attention a larger fraction of decode, making the dequant overhead more visible.

  4. The centroid LUT is NOT the sole bottleneck. Moving centroids to half-precision registers or threadgroup memory doesn't help. The bottleneck is broader — likely the full WHT transform + extraction pipeline, not just the final centroid lookup.

Revised theory (2026-03-28)

The original theory was "constant memory LUT divergence is the bottleneck." After testing 4 approaches targeting the centroid LUT specifically (#15 SMEM pre-dequant, #16q Q·centroid precompute, #16h half reg LUT, #19 TG centroid), NONE improved decode speed. The 4-mag LUT improvement from earlier was real (+38%), but it was optimizing the full dequant pipeline (fewer reads + XOR sign trick), not just the centroid lookup.

The remaining gap is structural: turbo3/4 dequant requires WHT extraction + centroid lookup + norm multiply per element, while q8_0 is a simple int8 * scale. No rearrangement of the same operations will close this gap.

Updated tally: 18 approaches tested (16h and 19 added)

| Rank | Approach | M2 8K tok/s | vs Ceiling | Category |
|---|---|---|---|---|
| 1 | 4-mag LUT | 15.1 | 62% | 4 constant reads |
| 2 | simd_shuffle | 14.7 | 60% | Cross-lane transfer |
| 3 | Batched extract (8-LUT) | 13.7 | 56% | 8 constant reads |
| 4 | Inline FA block | 13.5 | 55% | Inlined dequant |
| 5 | Deferred norm | 12.9 | 53% | Loses ILP |
| 6 | 2-pair half2 | 12.0 | 49% | 2 reads + ternary |
| 7 | Select chain | 11.9 | 49% | Pure branches |
| 8 | Bit-arithmetic | 11.6 | 47% | Pure ALU |
| 9 | FMA branchless | 11.4 | 47% | fma() chain |
| 10 | Main (8-LUT) | 10.95 | 45% | Baseline |
| 11 | Named-reg ternary | 10.3 | 42% | Registers + branches |
| 12 | Non-vec FA | 10.2 | 42% | Wrong kernel |
| 13 | SMEM pre-dequant | 10.17 | 41% | Threadgroup cache (ILP loss) |
| 14 | Q·centroid precompute | 10.10 | 41% | select() register LUT |
| 15 | Fused block dot | 8.1 | 33% | 64 comparisons |
| 16h | Half register LUT | ~noise | ~62% | half cn[8] — still spills |
| 19 | TG centroid cache | ~noise | ~62% | Threadgroup centroid table |
| — | Ceiling (no dequant) | 24.5 | 100% | Zero overhead |

Extended Experiments (approaches 20 + 22, 2026-03-28 night)

Qwen2.5-1.5B-Instruct-Q4_K_M (8K context, p=8192 tg128)

| # | Approach | Env Var | turbo3 t/s | turbo4 t/s | Key finding |
|---|---|---|---|---|---|
| — | Baseline (4-mag LUT) | — | 51.80 | 58.60 | |
| 20 | 2-bit direct encode (pure ALU, no LUT) | TURBO_DIRECT_ENCODE=1 | 52.16 (+0.7%) | 57.78 (-1.4%) | Same speed with zero constant reads |
| 22 | Async prefetch (batch device reads) | TURBO_ASYNC_PREFETCH=1 | 50.73 (-2.1%) | 57.02 (-2.7%) | GPU already interleaves reads optimally |

Definitive proof: constant memory LUT is FREE on M2 Pro

Approach #20 replaced the entire centroid LUT with pure ALU (norm * (0.25 + 0.5*idx)) — zero constant memory reads. The result is identical speed. This means the 4 constant reads per element are hitting L1 cache and costing essentially nothing.
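
For reference, the pure-ALU mapping from #20 looks roughly like this (helper name is illustrative; whether the index flip happens before or after the linear map is an assumption):

```metal
// Illustrative "direct encode" dequant: no codebook at all, the 2-bit index
// maps straight to a magnitude with pure ALU and zero constant-memory reads.
inline float dequant_direct_encode(uint qs2, uint sign1, float norm) {
    uint  mi  = qs2 ^ (0x3u * (1u - sign1));        // same branchless index flip as 4-mag
    float sgn = 2.0f * float(sign1) - 1.0f;
    return sgn * norm * (0.25f + 0.5f * float(mi)); // norm * (0.25 + 0.5*idx)
}
```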

The bottleneck is device memory bandwidth: streaming 14 bytes per 32 elements (turbo3) from DRAM, strided across cache positions. q8_0 streams 34 bytes per 32 elements but with a simpler access pattern and trivial dequant (int8 * scale).

Why async prefetch didn't help (#22)

Staging device memory reads (norm, qs, signs) into registers before constant reads forced a specific execution order. The GPU's out-of-order execution already interleaves these optimally. Forcing order adds register pressure without hiding latency.

Updated tally: 20 approaches tested

| Rank | Approach | M2 8K tok/s | vs Ceiling | Category |
|---|---|---|---|---|
| 1 | 4-mag LUT | 15.1 | 62% | 4 constant reads |
| 2 | simd_shuffle | 14.7 | 60% | Cross-lane transfer |
| 3 | Batched extract (8-LUT) | 13.7 | 56% | 8 constant reads |
| 4 | Inline FA block | 13.5 | 55% | Inlined dequant |
| 5 | Deferred norm | 12.9 | 53% | Loses ILP |
| 6 | 2-pair half2 | 12.0 | 49% | 2 reads + ternary |
| 7 | Select chain | 11.9 | 49% | Pure branches |
| 8 | Bit-arithmetic | 11.6 | 47% | Pure ALU |
| 9 | FMA branchless | 11.4 | 47% | fma() chain |
| 10 | Main (8-LUT) | 10.95 | 45% | Baseline |
| 11 | Named-reg ternary | 10.3 | 42% | Registers + branches |
| 12 | Non-vec FA | 10.2 | 42% | Wrong kernel |
| 13 | SMEM pre-dequant | 10.17 | 41% | Threadgroup cache (ILP loss) |
| 14 | Q·centroid precompute | 10.10 | 41% | select() register LUT |
| 15 | Fused block dot | 8.1 | 33% | 64 comparisons |
| 16h | Half register LUT | ~noise | ~62% | half cn[8] — still spills |
| 19 | TG centroid cache | ~noise | ~62% | Threadgroup centroid table |
| 20 | Direct encode (no LUT) | ~noise | ~62% | Pure ALU, zero constant reads |
| 22 | Async prefetch | -2% | ~61% | Forced read ordering |
| — | Ceiling (no dequant) | 24.5 | 100% | Zero overhead |

SMEM Pre-Dequant Re-Run (2026-03-28 night, corrected)

The original -51% SMEM result was measured during DINOv2 GPU contention. Clean re-run shows it's actually a small positive:

Qwen2.5-1.5B Q4_K_M (short and 8K context)

| Config | SMEM t/s | Baseline t/s | Delta |
|---|---|---|---|
| turbo3 short | 53.96 | 51.84 | +4.1% |
| turbo3 8K | 54.13 | 51.84 | +4.4% |
| turbo4 short | 60.20 | 58.74 | +2.5% |
| turbo4 8K | 60.37 | 58.74 | +2.8% |

Qwen2.5-7B Q4_K_M (8K context)

| Config | SMEM t/s | Baseline t/s | Delta |
|---|---|---|---|
| turbo3 8K | 26.17 | 25.88 | +1.1% |
| turbo4 8K | 26.31 | 27.59 | -4.6% |

Analysis: SMEM helps small models, hurts large turbo4

The pattern is clear: SMEM pre-dequant trades constant memory pressure for threadgroup memory + barriers. This helps when:

  • Model is small (attention is larger fraction of decode → dequant matters more)
  • KV format is turbo3 (8 centroids = more constant reads per element to avoid)

It hurts when:

  • Model is large (register pressure from SMEM allocation → spills)
  • KV format is turbo4 on 7B (16 centroids = larger pre-dequant tile → more SMEM overhead)

This suggests a model-size-adaptive dispatch: use SMEM pre-dequant for small models (≤3B), 4-mag LUT for large models (≥7B). The crossover is somewhere in between.

SMEM Pre-Dequant on M5 Max — Small Regression (2026-03-28 night, corrected)

Tested SMEM on M5 Max (Apple10) with Qwen3.5-35B-A3B MoE. Initial results showed -77% but that was an env var bug (TURBO_SMEM_DEQUANT not set at runtime). Corrected results with env var properly set:

Qwen3.5-35B-A3B Q8_0, M5 Max, decode (tg128 t/s)

| Context | turbo3 baseline | turbo3 SMEM | turbo4 baseline | turbo4 SMEM |
|---|---|---|---|---|
| short | 78.47 | 80.32 (+2.4%) | 80.40 | 77.64 (-3.4%) |
| 8K | 78.90 | 76.99 (-2.4%) | 79.84 | 78.78 (-1.3%) |
| 16K | 78.64 | 77.98 (-0.8%) | | |
| 32K | 78.17 | 76.18 (-2.5%) | | |

Note: high variance on some runs (±2.57 t/s). The one positive (turbo3 short +2.4%) is noise.

Why SMEM doesn't help on M5

  • M5's constant cache (Apple10) is already fast — handles 8-way divergent reads with minimal penalty
  • SMEM adds threadgroup store/barrier/load overhead for data the constant cache was already serving efficiently
  • Unlike M2 where SMEM helped small models +4%, M5 sees no benefit at any context depth

Verdict: SMEM is dead

Do NOT ship SMEM pre-dequant. Small regression on M5, marginal win on M2 small models only. Not worth the complexity.

Final Conclusion: M2 Decode Ceiling (2026-03-28)

20 approaches tested + 1 corrected re-run. The 4-mag LUT is the M2 ceiling for large models. SMEM pre-dequant adds ~2-4% for small models.

The bottleneck is NOT:

  • Constant memory LUT divergence (proven by #20: zero LUT = same speed)
  • Centroid lookup specifically (proven by #16h, #19: different storage = same speed)
  • Read ordering (proven by #22: forced ordering = worse)
  • Branch overhead (proven by #6-9: branchless = worse)

The bottleneck IS:

  • Device memory bandwidth for streaming quantized KV blocks from DRAM
  • Dequant ALU complexity (WHT extraction + centroid + norm vs q8_0's simple int8 * scale)
  • These two costs are inherent to the turbo format and cannot be optimized away without changing the format itself
  • For small models where attention dominates, SMEM can squeeze out ~4% by trading constant reads for threadgroup reads

Remaining dequant-level approaches (low priority, likely moot)

| # | Approach | Category | Expected impact | Status |
|---|---|---|---|---|
| 17 | Device-memory centroid×norm | Block format change | Moot — centroid reads are free | NOT TESTED |
| 18 | Byte-indexed 256-entry LUT | LUT restructure | Moot — same reason | NOT TESTED |
| 21 | Hybrid 4-mag + simd_shuffle | Combined | Low — simd_shuffle was only -2.6% standalone | NOT TESTED |

Phase 2: Kernel/Format/Pipeline-Level Optimization (2026-03-28 night)

Phase 1 (approaches 1-22) exhausted dequant-level optimization: rearranging reads, ALU, barriers, and storage within the existing FA vec kernel template. The 4-mag LUT is the dequant ceiling.

Phase 2 attacks at a higher level: different kernel shapes, different data layouts, and runtime dispatch. These are more invasive but target the actual structural gap.

Experiment #23: Turbo-Only Non-Vec Decode Kernel

Category: Kernel path

Hypothesis: The current FA vec kernel (kernel_flash_attn_ext_vec) is designed for general-purpose quantized attention. A turbo-specific decode kernel built from scratch for single-token long-context generation could avoid overhead from the generic template.

Why this might work:

  • The vec kernel's template machinery (generic dequant function pointers, parameterized NL/NSG/NE) adds register pressure and instruction cache footprint. A hand-specialized turbo3 dk128/dv128 kernel can hardcode everything.
  • Earlier test of non-vec FA (approach #10, 10.2 t/s) used the EXISTING generic non-vec kernel, which is optimized for multi-token prefill, not single-token decode. A purpose-built decode kernel is a different thing entirely.
  • Older Apple GPUs (Apple7/8) reward brutally specific kernels over generic templates.

Implementation plan:

  1. Create a new kernel function kernel_flash_attn_ext_turbo3_decode_dk128 in ggml-metal.metal
  2. Hardcode: dk=128, dv=128, single-token (ne01=1), turbo3 block format
  3. Inline the 4-mag dequant directly (no function pointer indirection)
  4. Optimize the Q*K^T loop for the exact turbo3 block layout: read 14 bytes (2 bytes norm + 8 bytes qs + 4 bytes signs), extract inline, dot product inline
  5. Optimize the V accumulation separately (can use different strategy than K)
  6. Wire through ggml-metal-ops.cpp dispatch: use this kernel when type_k == TURBO3_0 && ne01 == 1 && dk == 128
  7. Gate behind TURBO_SPECIALIZED_DECODE=1 env var

Key code touchpoints:

  • ggml/src/ggml-metal/ggml-metal.metal — new kernel function + instantiation
  • ggml/src/ggml-metal/ggml-metal-ops.cpp — dispatch logic to select this kernel
  • ggml/src/ggml-metal/ggml-metal-device.m — env var wiring

Success criteria:

  • Any improvement over 25.88 t/s (turbo3 7B 8K baseline) is a win
  • 30+ t/s would close the gap to q8_0 (33.69) significantly
  • Regression at short context is acceptable if long context improves

Risk: High effort, uncertain payoff. The template overhead might not be the bottleneck.

Result: NEGATIVE. Metal shader compiler already fully specializes templates — function pointers resolved, dimensions constant-folded, dead code eliminated. No runtime overhead to remove.

| Model | Baseline t/s | Specialized t/s | Delta |
|---|---|---|---|
| 1.5B short | 53.34 | 52.77 | -1.1% |
| 1.5B 8K | 51.73 | 51.44 | -0.6% |
| 7B 8K | 25.89 | 25.83 | -0.2% (noise) |

The slight regression is likely from increased instruction footprint — duplicated inline dequant in K and V loops vs the compiler sharing code between template-generated paths. The generic template IS the optimized kernel.


Experiment #24: Apple7/8-Specific Alternate Block Format

Category: Data layout

Hypothesis: The turbo3 block format was designed for the algorithm, not for Apple8 GPU memory access patterns. A second on-device KV format optimized for how the M2 FA vec kernel actually consumes data could reduce memory stalls.

Why this might work:

  • Current turbo3 block layout: [norm(2B)] [qs(8B)] [signs(4B)] = 14 bytes per 32 elements (see the struct sketch after this list). The kernel reads these in 3 separate device memory accesses at different offsets within the block.
  • If we interleave the data by 4-element consumption order (matching the vec kernel's DK4 stride), each fetch brings exactly what the next compute step needs — no wasted bandwidth, better prefetch prediction.
  • The profiling data shows "reading bytes" costs ~10% of ceiling on both M2 and M5. Reducing this by even half could be meaningful when combined with other improvements.
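
A hedged struct sketch of the two layouts under discussion; field names and packing are inferred from the 14-byte figure above and are not the actual ggml-common.h definitions.

```cpp
#include <cstdint>

// Current turbo3 block (as described above): 14 bytes covering 32 elements,
// read as three separate spans at different offsets.
struct block_turbo3_sketch {
    uint16_t norm;       // half-precision group norm (2 B)
    uint8_t  qs[8];      // 32 x 2-bit magnitude indices (8 B)
    uint8_t  signs[4];   // 32 x 1-bit signs (4 B)
};

// Hypothetical Apple8-oriented repacking: the same 14 bytes, regrouped into
// per-8-element granules so each fetch matches the vec kernel's consumption order.
struct block_turbo3_m2_sketch {
    uint16_t norm;       // shared norm up front (2 B)
    struct granule {
        uint8_t qs[2];   // 8 x 2-bit magnitude indices
        uint8_t signs;   // 8 x 1-bit signs
    } g[4];              // 4 granules x 3 B = 12 B
};
```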

Implementation plan:

  1. Design an alternate block layout block_turbo3_m2 where data is arranged by the vec kernel's consumption order:
    • For each 4-element group: [norm_chunk] [qs_chunk] [sign_bits] contiguous
    • Or: per-8-element granule matching SIMD consumption width
  2. Add a SET_ROWS variant that packs into the new layout during KV cache write
  3. Add a dequantize_turbo3_m2_t4 that reads the new layout
  4. Wire as a new kernel instantiation
  5. Gate behind TURBO_M2_FORMAT=1 env var

Key code touchpoints:

  • ggml/src/ggml-common.h — new block struct definition
  • ggml/src/ggml-metal/ggml-metal.metal — new dequant function + kernel instantiation
  • ggml/src/ggml-metal/ggml-metal-ops.cpp — SET_ROWS + dispatch for new type
  • ggml/src/ggml-metal/ggml-metal-device.m — env var

Success criteria:

  • Measurable improvement in the "read bytes" profiling mode (mode 4 → mode 3 gap)
  • Any decode speed improvement over 4-mag baseline
  • Must not regress PPL (same data, different packing)

Risk: Medium effort, medium payoff. The 10% "read bytes" cost is real but may already be well-hidden by the dequant ALU.


Experiment #25: Dual Runtime Dispatch by Chip Family + Context Depth

Category: Pipeline

Hypothesis: No single kernel configuration is optimal across all context depths. A runtime dispatch that selects different strategies based on context depth could win overall.

Why this might work:

  • Already proven: 4-mag helps at 16K on M5 (+2.4%) but hurts at 32K (-7.3%). Crossover at ~20K.
  • turbo3 no-dequant is 12% faster than q8_0 at short context (bandwidth advantage) but 38% slower at 8K (dequant overhead dominates).
  • The vec kernel has NL parameter controlling simdgroup work distribution. Different NL values may be optimal at different context depths.

Implementation plan:

  1. Compile both 4-mag and 8-LUT FA kernel instantiations for turbo3/turbo4 on pre-M5
  2. In ggml-metal-ops.cpp, add context-depth dispatch:
    • Pre-M5 + context < threshold: one kernel config
    • Pre-M5 + context ≥ threshold: different kernel config
  3. Profile to find the optimal crossover point
  4. Also test different NL values (currently NL=1 for dk128) at different context depths
  5. Gate behind TURBO_CONTEXT_DISPATCH=1 env var

Key code touchpoints:

  • ggml/src/ggml-metal/ggml-metal-ops.cpp — dispatch logic (primary)
  • ggml/src/ggml-metal/ggml-metal.metal — may need additional kernel instantiations with different NL
  • ggml/src/ggml-metal/ggml-metal-device.m — env var

Success criteria:

  • Better decode across the full context range (short through 32K) than any single configuration
  • Minimal code complexity (dispatch is a few lines in ops.cpp)

Risk: Low effort, medium payoff. The gains from 4-mag vs 8-LUT switching were small on M5 (+2.4% at one depth). On M2 where the gap is larger, the payoff might be bigger.

Result: NEGATIVE on 1.5B. turbo3/turbo4 decode is completely flat across all context depths (~52/~59 t/s). q8_0 gets faster at long context (48 → 109 t/s) because attention becomes a larger fraction of decode. No crossover point to dispatch on.

| Config | short | 2K | 4K | 8K | 16K |
|---|---|---|---|---|---|
| f16 (ceiling) | 119.60 | 116.22 | 116.14 | | |
| q8_0 | ~48 | 64 | ~82 | 107 | 109 |
| turbo4 | 59.29 | 59.02 | 59.00 | 60.65 | 59.03 |
| turbo3 (4-mag) | 52.20 | 51.94 | 52.36 | 51.79 | 51.86 |
| turbo3 (zero dequant) | 55.68 | 55.87 | 55.83 | 55.42 | 55.69 |

Additional findings:

  • Zero-dequant gap is only 7% on 1.5B — FFN matmuls dominate, not dequant
  • Asymmetric K/V (mixed types) not supported — no mixed-type kernel templates
  • turbo2 not registered in this build — needs update
  • Needs retesting on larger model (7B+) where attention is a bigger fraction of decode

Priority order for Phase 2

  1. #23 Turbo-only non-vec decode kernel — highest potential, directly attacks the structural gap. Start here.
  2. #25 Dual runtime dispatch — low effort, can run in parallel with #23 as a quick profiling exercise.
  3. #24 Alternate block format — medium effort, save for after #23 results inform whether the bottleneck is compute or memory layout.

Additional ideas from full brainstorm (backlog, not prioritized)

These are logged for future reference. Each could become a numbered experiment if the top 3 don't pan out.

Kernel-path:

  • Lane-specialized dequant microkernels (hardcode per dk/dv pair)
  • Split K/V strategies (different kernel for score vs accumulation)
  • Metadata-first K path (stage norm+bits, defer centroid materialization)
  • Two-token decode kernel (microbatch=2 for better utilization)
  • Persistent-thread decode kernel (keep threadgroups resident across tiles)

Computation:

  • Norm hoisting at higher granularity (fused partial accumulation)
  • Affine centroid approximation (approximate arithmetic + correction term)
  • Top-2 magnitude encode (cheap primary path + rare correction)
  • Softmax-side pruning before full V work (conservative early exit)
  • Mixed-precision accumulation schedule (audit half/float widen points)

Instrumentation:

  • Per-section timing inside Metal kernel (finer than current profiling modes)
  • Dimension-specific perf matrix (128 vs 192 vs 256 on Apple8)
  • Instruction-cache pressure audit (code-size-reduced kernel variants)

M5 Max Long-Context Discovery (2026-03-27)

The constant cache bottleneck hits M5 Max too at long context

| Depth | Ceiling | Reads only | Full dequant | q8_0 | LUT cost | Ceiling vs q8_0 |
|---|---|---|---|---|---|---|
| 8K | 78.9 | 75.2 | 59.2 | 78.8 | 20% | 1.00x |
| 16K | 75.9 | 74.7 | 58.7 | 72.0 | 21% | 1.05x |
| 32K | 78.3 | 71.9 | 47.1 | 61.0 | 34% | 1.28x |

turbo3 with zero dequant is 28% FASTER than q8_0 at 32K on M5 Max. The compressed cache bandwidth advantage grows with context. The LUT cost explodes from 20% to 34% as context grows.

4-mag vs 8-LUT on M5 Max across context depths

| Depth | q8_0 | 8-LUT | 4-mag | 4-mag vs 8-LUT |
|---|---|---|---|---|
| short | 85.0 | 76.7 | 76.2 | -0.7% |
| 16K | 72.0 | 58.9 | 60.3 | +2.4% |
| 32K | 61.0 | 47.6 | 44.1 | -7.3% |

4-mag helps at 16K (+2.4%) but hurts at 32K (-7.3%) on M5. Crossover around 20K.

Context-adaptive dispatch (planned)

Compile both 4-mag and 8-LUT FA kernel instantiations. At dispatch time, select based on KV cache size (a dispatch sketch follows the list):

  • Pre-M5 (no tensor API): always 4-mag
  • M5+ with context < 8K: 8-LUT (minimal cache pressure)
  • M5+ with 8K-20K context: 4-mag (moderate pressure, 4-mag helps)
  • M5+ with context > 20K: 8-LUT (fully thrashed, ALU overhead dominates)
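
A minimal host-side sketch of that dispatch, assuming hypothetical names (`turbo_fa_variant`, `select_turbo3_fa_pipeline`); the real hook would live where ggml-metal-ops.cpp picks the FA pipeline.

```cpp
#include <cstdint>

enum class turbo_fa_variant { LUT8, MAG4 };

// Pick the dequant flavor per launch from chip family and current KV depth.
turbo_fa_variant select_turbo3_fa_pipeline(bool has_tensor_api, int64_t kv_tokens) {
    if (!has_tensor_api)        return turbo_fa_variant::MAG4;  // pre-M5: always 4-mag
    if (kv_tokens <  8 * 1024)  return turbo_fa_variant::LUT8;  // short: constant cache is cheap
    if (kv_tokens < 20 * 1024)  return turbo_fa_variant::MAG4;  // 8K-20K: 4-mag wins (+2.4% at 16K)
    return turbo_fa_variant::LUT8;   // >20K: cache fully thrashed, extra ALU dominates
}
```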

The real prize

If we can reduce dequant cost from 34% to ~10% at 32K, turbo3 decode would be FASTER than q8_0 at long context. The bandwidth advantage (28% at 32K) far exceeds the dequant overhead — we just need to close the gap. This flips the narrative from "turbo3 is slower but smaller" to "turbo3 is faster AND smaller."