
QVAC-14555: TurboQuant (Vulkan): KV cache quantization (TBQ3_0 / TBQ4_0 / PQ3_0 / PQ4_0)#115

Open
jesusmb1995 wants to merge 16 commits into tetherto:temp-7248 from jesusmb1995:turboquant

Conversation

@jesusmb1995 jesusmb1995 commented Mar 27, 2026

Summary

Implements TurboQuant KV cache quantization (Zandieh et al., ICLR 2026) for CPU and Vulkan backends with full Flash Attention support. Compresses KV cache to 3.25-4.25 bits per value, enabling ~4-5x larger context windows on the same hardware.

Paper: https://arxiv.org/pdf/2504.19874
Community discussion:

Recommended configurations:

  • High compression + speed: K=pq3_0 V=pq3_0 — codebook-only, no QJL overhead. Minimal PPL/speed loss at 3.25 bpw with a small retrieval quality trade-off on long contexts.
  • High compression + long-context quality: K=tbq3_0 V=pq3_0 — QJL-corrected keys with codebook-only values. Best retrieval accuracy at 3.75 avg bpw, with a moderate speed cost from QJL correction in the FA shader.

Features

  • Full set of TurboQuant types: tbq3_0, tbq4_0, pq3_0, pq4_0 (and _64 variants)
  • Automatic head_dim detection (64 vs 128) — user specifies pq3_0, internal type auto-selects
  • Coopmat1 and Coopmat2 Flash Attention support (noticeable prefill speedup)
  • Pre-compiled fused Flash Attention shaders for mixed K/V types (asymmetric compression)
  • QJL Stage 2 correction in all FA paths (scalar, cm1, cm2)
  • Comprehensive test/benchmark scripts (perplexity, throughput, RULER)
  • Cooperative copy_to_quant Vulkan path for TBQ/PQ (faster KV writes)

How does TurboQuant work?

Random rotations spread values evenly across coordinates, preventing concentration on a few axes where zero-coordinates waste bits. In high dimensions, the marginal distribution of each coordinate of a unit-sphere vector follows a Beta distribution that converges to N(0, 1/d) as d grows. The algorithm exploits this by placing Lloyd-Max codebook centroids at optimal positions for this known distribution, minimizing MSE reconstruction error. Centroids are found by solving a continuous 1-dimensional k-means problem.

An additional QJL correction step (Stage 2) reduces bias in dot-product estimation. It quantizes the residual error from Stage 1 to 1-bit by storing only the signs of the residual vector after applying a random rotation (Hadamard × sign diagonal). Since only signs are stored (no centroid rounding), the paper proves this yields an unbiased dot-product estimator. This step is important for maintaining retrieval quality on long contexts.

Optimization details

  • Hadamard instead of dense rotation: Rotations based on Hadamard use the butterfly pattern in O(d log d) instead of O(d²). The Hadamard transform itself is deterministic, but composing it with a random sign diagonal restores randomness while keeping the transform orthogonal and invertible.

  • Dense rotation for K/V/Q at graph level, FHT in shader for QJL: At block sizes d=64/128, O(d²) is negligible and utilizes better GPU parallelism for the graph-level rotation. The butterfly FHT is used inside the Flash Attention shader for the QJL projection, avoiding the need to copy a dense matrix into the shader (which would add memory pressure). Since there is no Q cache, the QJL projection of Q must be recomputed every step to apply corrections against the 1-bit signs stored in K blocks.

Type     Bits/val   Block size   Compression vs FP16   Description
q4_0     4.50       18 B         3.5x                  Baseline: 16 linear values
pq3_0    3.25       52 B         4.9x                  8 Lloyd-Max centroids
pq4_0    4.25       68 B         3.8x                  16 Lloyd-Max centroids
tbq3_0   4.25       68 B         3.8x                  8 centroids + QJL correction
tbq4_0   5.25       84 B         3.0x                  16 centroids + QJL correction

Implementation overview

  • vulkan-shaders-gen.cpp — orchestrates SPIR-V compilation of all variant combos
  • ggml-vulkan.cpp — host-side: creates pipeline objects, dispatches compute

TurboQuant KV cache shader flow (TBQ/PQ is ONLY a KV cache type, never model weights):

STEP 1: Write to cache (same for all paths)

  • copy_to_quant.comp: float K/V → TBQ/PQ quantized blocks
    • L2 norm, codebook binary search, 3/4-bit index packing
    • TBQ only: also computes QJL residual (qjl[], d_r)
    • PQ only: no QJL, smaller block, faster

STEP 2: Read cache at attention time (paths diverge here)

PATH A: Scalar Flash Attention (broad HW support, baseline)

  • flash_attn.comp
  • Includes: types.glsl, tq_utils.comp (via flash_attn_base.glsl), dequant_funcs.glsl
  • Dequantizes K/V inline, element by element
  • For TBQ/PQ K: uses centroid-gather optimization (reorders Q·K into per-centroid partial sums)
  • For TBQ K only: applies QJL correction to attention scores
  • Full fused kernel: QK^T → softmax → PV → output

PATH B: Cooperative matrix v1 Flash Attention (KHR, cross-vendor)

  • flash_attn_cm1.comp
  • K is fully dequantized into shared memory, then coopMatMulAdd for K·Q^T (subgroup-scope 16×16 tiles)
  • P·V accumulation is still scalar with inline dequant
  • Same QJL correction as scalar (applied to sfsh[] after coopmat store)

PATH C: Cooperative matrix v2 Flash Attention (NV only, most efficient)

  • flash_attn_cm2.comp
  • K and V loaded via coopMatLoadTensorNV with decode callback (dequant-on-load, no shared memory staging)
  • Both K·Q^T and P·V use coopMatMulAdd (workgroup-scope matrices)
  • QJL correction via raw byte reads from data_k[] with hardcoded byte offsets per type

PATH D: No-FA fallback, small N (MUL_MAT with N ≤ 8, e.g. decode)

  • mul_mat_vec_tbq3_0.comp / mul_mat_vec_tbq4_0.comp
  • Fused dequant + dot product, no centroid gather
  • QJL correction applied in the same kernel

PATH E: No-FA fallback, large N (K·Q MUL_MAT with N > 8, e.g. prefill)

  • This is the path exercised by -fa off with a TBQ/PQ K cache. Only the K·Q matmul is affected: V stays f16 under -fa off (upstream guard), so V·A stays on the existing f16 path.
  • Stage 1: mul_mm.comp runs with TBQ/PQ load_a_to_shmem — centroid dequant × d into shared memory, then generic tiled matmul (scalar / cm1 pipelines; cm2 falls through to cm1/scalar since no _mat_f16 cm2 shader exists for TBQ/PQ).
  • Stage 2 (TBQ only): mul_mm_tbq_qjl_correction.comp is dispatched after the main matmul as an additive pass — one workgroup per (row, col, batch), QUANT_K threads running the same Walsh–Hadamard + QJL dot product as the vec shader, accumulating d_r · √(π/2) / QUANT_K · sum_qjl(H(B)) into D.
  • PQ has no Stage 2 (no qjl[] / d_r), so Stage 1 alone is exact.
  • Requires B (src1) as f32; the scheduler is expected to feed f32 on this path. f16 src1 for standalone TBQ MUL_MAT reports not supported and falls back to CPU.
  • Fixes external review Issue 3 on PR #115: before this patch, supports_op claimed TBQ/PQ MUL_MAT support on cm2 devices (RTX 5090) but had no pipeline behind it, so the correctness run segfaulted. tests/test-backend-ops.cpp now covers all 8 TBQ/PQ types × n ∈ {1,8,16,32} as a repro.
  • Non-dim01-contiguous quantized src0 (permuted layouts) is now routed to the matrix path as well, so TBQ/PQ MUL_MAT works regardless of src0 stride pattern.

Example usage

llama-cli -m model.gguf --cache-type-k tbq3_0 --cache-type-v pq3_0
llama-cli -m model.gguf --cache-type-k pq3_0 --cache-type-v pq3_0

Works transparently with both head_dim=128 (Llama-3.1, Qwen, Mistral) and head_dim=64 (Llama-3.2-1B/3B) — the right block size is auto-selected.

Results / testing

Please see Asana for latest available data: https://app.asana.com/1/45238840754660/project/1212638335655939/task/1214143691877486

Will comment here with a public report when results can be shared.

PR for testing integration on LLM Addon: tetherto/qvac#1564

Limitations

  • head_dim must be 64 or 128. Codebooks and Hadamard transform are pre-computed for these dimensions.
  • d=64 quality is poor on small models — expected, as KV cache quantization generally degrades more on small models.
  • Metal shaders and vectorized CPU not yet implemented.
  • Optimized Flash Attention shaders require K to be PQ or TBQ, and V to be PQ, TBQ, Q4, Q8, or F16.
  • Quantized V with -fa off is not supported by this PR. Upstream llama_init_from_model rejects quantized V when flash attention is disabled ("V cache quantization requires flash_attn"), and that guard is intentionally left in place. The -fa off K·Q MUL_MAT fix in this PR would extend cleanly to A·V for a quantized V as well, but the v_trans V-cache layout used under -fa off is populated by ggml_set_rows with row_size=1, which corrupts any blck_size > 1 type at write time (reproducible on CPU as well, independent of backend). Fixing that is a KV-cache refactor out of scope here; the guard will be revisited once that lands.

TBQ / PQ Vulkan support matrix

What runs on GPU vs. what is refused at context init, across FA on/off on dense and MoE models. The MoE-KV-cache rows behave the same as dense because attention itself is plain MUL_MAT / FLASH_ATTN_EXT, not MUL_MAT_ID; MoE routing (MUL_MAT_ID) only applies to the FFN weights, which are never stored as TBQ/PQ.

Scenario | FA | K-type | V-type | Path | Status
Dense / MoE — KV cache | on | tbq3/4_0 or pq3/4_0 | pq/tbq/q4_0/q8_0/f16 | Fused FA (scalar / cm1 / cm2), QJL in kernel | Full GPU
Dense / MoE — KV cache | on | tbq3/4_0 or pq3/4_0 | other quantized (q5_0, q4_1, iq4_nl, k-quants, …) | No matching Vulkan FA pipeline → per-layer backend split | Runs, but attention falls back to CPU
Dense / MoE — KV cache | off | tbq3/4_0 or pq3/4_0 | f16 | K·Q via mul_mm.comp + QJL correction; V·A on the existing f16 path | Full GPU (Path E)
Dense / MoE — KV cache | off | tbq3/4_0 or pq3/4_0 | any quantized type (incl. tbq/pq) | — | Context init refused (upstream FA-off rule: "V cache quantization requires flash_attn")

Notes:

  • Head dimensions of both 128 and 64 are supported; the _64 block variants (tbq*_0_64, pq*_0_64) have their own pipelines, codebooks, and sign tables.
  • MoE FFN weights are not in this table on purpose: TBQ/PQ are KV-cache quantizations only (llama-quantize has no TBQ/PQ target, and no GGUF stores FFN experts in those types), so MUL_MAT_ID never receives TBQ/PQ src0. Attention in MoE models is a plain MUL_MAT / FLASH_ATTN_EXT and therefore falls under the "KV cache" rows above.

Remaining work

  • SIMD optimization — AVX2/NEON for CPU quantize/dequantize
  • Metal shaders — Apple GPU backend support
  • 2-bit variant — even higher compression
  • Direct cosine similarity evaluation

@jesusmb1995 jesusmb1995 self-assigned this Mar 27, 2026
@jesusmb1995 jesusmb1995 changed the title from "Draft: TurboQuant" to "TurboQuant: KV cache quantization with Hadamard transform (TQ3_0 / TQ4_0)" Mar 27, 2026
@jesusmb1995

This comment was marked as outdated.

@jesusmb1995

This comment was marked as outdated.

@jesusmb1995 jesusmb1995 force-pushed the turboquant branch 2 times, most recently from 69522fb to 6497a86 Compare March 31, 2026 16:37
@jesusmb1995 jesusmb1995 changed the title from "TurboQuant: KV cache quantization with Hadamard transform (TQ3_0 / TQ4_0)" to "TurboQuant: KV cache quantization with Hadamard transform (TBQ3_0 / TBQ4_0)" Mar 31, 2026

zoq commented Apr 1, 2026

Are you planning to merge this before the rebase to the latest version of llama.cpp?

@jesusmb1995 jesusmb1995 force-pushed the turboquant branch 2 times, most recently from f7ba069 to 9d2a659 Compare April 7, 2026 18:20

jesusmb1995 commented Apr 7, 2026

Are you planning to merge this before the rebase to the latest version of llama.cpp?

Not particularly. If the rebase to the latest version of llama.cpp happens soon, I will change the target to the correct temp branch. I think it's better if I target the latest llama.cpp version.

Edit: @zoq Since it seems we want this merged in about 1-2 weeks, I will target this version for now. Yes, planning to merge this before the rebase.

@gianni-cor

This comment was marked as resolved.

gianni-cor pushed a commit to gianni-cor/qvac-fabric-llm.cpp that referenced this pull request Apr 18, 2026
The `GG_BUILD_LOW_PERF` → `-DGGML_NATIVE=OFF` append was placed inside
`gg_run_ctest_release`, but the top-level driver runs `gg_run
ctest_debug` first and `gg_run ctest_release` second. As a result the
debug build on the low-perf CI runners (`ggml-ci-x64-cpu-low-perf`
and `ggml-ci-arm64-cpu-low-perf`) was compiled with `-march=native`
against the build host's CPU and then executed on a different,
older-microarch runner in the pool, producing SIGILL during
ctest_debug. Ref PR tetherto#115 CI run 24463228694.

Move the append into the top-level flag-handling block, right after
`GG_BUILD_NO_SVE`, so `CMAKE_EXTRA` gets `-DGGML_NATIVE=OFF` once,
before either ctest function is invoked, and both debug and release
builds pick it up. Remove the duplicate inside `gg_run_ctest_release`.

No workflow change required: `.github/workflows/build.yml` already
exports `GG_BUILD_LOW_PERF=1` for the two low-perf jobs, which was
correct; the bug was purely a scoping error in `ci/run.sh`. The other
`GG_BUILD_LOW_PERF` checks in the script (ctest label filter, and the
top-level branches that skip heavier test functions) are left
untouched — they were already at the correct scope.
gianni-cor added a commit to gianni-cor/qvac-fabric-llm.cpp that referenced this pull request Apr 18, 2026
The per-thread TBQ/PQ quantize shader was single-threaded per block —
one lane normalized 128 values, ran the FHT serially, and packed the
QJL sketch bit-by-bit, with three float[128] private arrays spilling
to GPU private memory. On a 5090 this capped the tbq3_0 / tbq4_0 write
throughput at ~80 GB/s (~4 % of peak).

Switch to a cooperative shader that treats one workgroup (32 lanes ==
one subgroup on NVIDIA) as one block:

  - norm, norm-correction and residual-norm reductions use subgroupAdd
  - the Fast Hadamard Transform runs log2(BK) passes with the BK/2
    butterflies in each pass spread across the 32 threads, separated
    by a single barrier() each
  - the QJL sign sketch is packed with subgroupBallot (32 bits per
    call, written as four bytes directly) instead of 128 serial OR
    into memory
  - scratch moves from private arrays to shared memory (tq3_sh_x,
    tq3_sh_idx, tq3_sh_proj, and tq4_* analogues)

On the host side, the 32 TBQ/PQ cpy_f32_quant pipelines drop their
wg_denoms from {32,1,1} to {1,1,1} so that "one workgroup == one
block", and the shader's CPY main() picks up a matching TQ_COOP branch
that drops the *32 + gl_LocalInvocationID.x offset from the block
index decode.

The GGML_OP_SET_ROWS dispatch path also needs to know about the new
"one workgroup per block" contract: for TBQ/PQ dst types, divide ne
by ggml_blck_size(dst) instead of 32 * ggml_blck_size(dst). Without
this gate the set_rows kernel dispatched only 1/32 of the required
workgroups, silently leaving 31 out of 32 KV-cache blocks uninitialized
and driving perplexity on Mistral-7B-Instruct-v0.3 from ~5.9 to ~1090
with no visible failure from llama-bench or test-backend-ops (the
CPY tests only exercise GGML_OP_CPY, which already had the /blck_size
rule). Unrelated types keep the /32/blck_size rule so q4_0, q8_0 etc.
behave exactly as before.

Measured on 2x RTX 5090, Vulkan 1.4.321, PR tetherto#115 tip b23276f.

test-quantize-perf -b vulkan, 4 MiB input, 500 iters:

  type      baseline avg   optimized avg   avg speedup
  tbq3_0    187.5 us       42.7 us        4.39x
  tbq4_0    192.9 us       44.3 us        4.35x
  pq3_0      68.6 us       44.8 us        1.53x
  pq4_0      84.9 us       44.9 us        1.89x

llama-bench on Mistral-7B-Instruct-v0.3 Q4_K_S, -fa 1, -r 3:

  K / V              pp2048 base -> opt         tg128 base -> opt
  tbq3_0 / pq3_0     9764 -> 9880   +1.2 %      179.1 -> 206.9  +15.5 %
  pq3_0  / pq3_0    15396 -> 15653  +1.7 %      190.5 -> 214.0  +12.3 %
  tbq4_0 / pq4_0     9568 -> 9782   +2.2 %      164.9 -> 205.1  +24.4 %

llama-perplexity on wikitext-2 test, 40 chunks, seed 42,
Mistral-7B-Instruct-v0.3 Q4_K_S, -fa 1:

  K / V              baseline PPL             optimized PPL
  f16    / f16       5.8254 +/- 0.13612       5.8254 +/- 0.13612  (control)
  tbq3_0 / pq3_0     5.9333 +/- 0.13811       5.9129 +/- 0.13747
  pq3_0  / pq3_0     5.9806 +/- 0.13879       5.9894 +/- 0.13918
  tbq4_0 / pq4_0     5.8646 +/- 0.13707       5.8570 +/- 0.13679

All PPL deltas are an order of magnitude smaller than the 95 % CI and
come from FP associativity in the subgroup tree reduction vs the
previous sequential sum. No algorithmic change.

tests/test-turboquant.sh: 112/112 backend-op tests still pass on both
GPUs. test-quantize-fns reproduces the 4 pre-existing _64 roundtrip
failures with bit-identical error magnitudes (0.010805, 0.009384) —
that is the wrong-codebook bug from PR tetherto#115 review, not introduced or
fixed here.

Made-with: Cursor
@gianni-cor

This comment was marked as resolved.

build_attn_inp_kv_impl() and build_attn_inp_kv_iswa() allocated
inp->self_rotk as an nrot x nrot Hadamard where
  nrot = largest power-of-two that divides n_embd_head_k (>= 64)
but allocated inp->self_rotv as a fixed 64x64 tensor, independent of
n_embd_head_v. Because ggml_rotate_hadamard() reshapes its input using
rot->ne[0] as the inner dim, an n_embd_head_v = 128 vector was rotated
as two independent 64-d halves, i.e. block_diag(H64, H64), instead of a
full H128. The d=128 PQ/TBQ codebooks in ggml-quants.c are Lloyd-Max
fitted to the coordinate distribution of a full d=128 random orthogonal
rotation (sigma ~ 1/sqrt(128)), so the narrower-than-expected 64-d
rotation left the codebook ~2x too narrow per coordinate and silently
inflated V reconstruction error. This hits essentially every recent
dense model with head_dim_v = 128 (Llama 3, Mistral, Qwen2.5, ...)
whenever a TBQ/PQ V cache is selected, since resolve_tq_type() keeps
the non-_64 variants for head_dim = 128. Reported by @gianni-cor in
tetherto#115 with a repro showing ~1.40x worse
3-bit and ~1.50x worse 4-bit V reconstruction at alpha = 0.10.

Factor the sizing + allocation out into a file-local helper
build_hadamard_rot(ctx, can_rot, n_embd_head) that applies the same
"largest power-of-two that divides n_embd_head, starting at 64" rule
used for self_rotk, and returns nullptr when can_rot is false. Call it
for both K and V in build_attn_inp_kv_impl() and
build_attn_inp_kv_iswa(), which makes self_rotv correctly 128x128 for
head_dim_v = 128 and keeps self_rotk behavior unchanged. No other
call site allocates self_rot{k,v}, so the rotation is now symmetric
across K and V and across the SWA and non-SWA builders.

Net change is -16 lines: two 17-line if/else blocks per builder
collapse into two one-liners.

diff --git a/ci/run.sh b/ci/run.sh
index 7fa469b..5c2f7e7 100755
--- a/ci/run.sh
+++ b/ci/run.sh
@@ -118,6 +118,15 @@ if [ ! -z ${GG_BUILD_NO_SVE} ]; then
     CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8.5-a+fp16+i8mm"
 fi

+# Disable native CPU optimizations for low-perf builds to ensure binary
+# compatibility with the (often heterogeneous) CI runner pool. Must be applied
+# at the top level so BOTH gg_run_ctest_debug and gg_run_ctest_release pick it
+# up — otherwise the debug build (which runs first) compiles with -march=native
+# and can SIGILL on a runner whose microarch is older than the build host.
+if [ ! -z ${GG_BUILD_LOW_PERF} ]; then
+    CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF"
+fi
+
 if [ -n "${GG_BUILD_KLEIDIAI}" ]; then
     echo ">>===== Enabling KleidiAI support"

@@ -236,11 +245,6 @@ function gg_run_ctest_release {
     # Check cmake, make and ctest are installed
     gg_check_build_requirements

-    # Disable native CPU optimizations for low-perf builds to ensure compatibility
-    if [ ! -z ${GG_BUILD_LOW_PERF} ]; then
-        CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF"
-    fi
-
     (time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
     (time make -j$(nproc)                                    ) 2>&1 | tee -a $OUT/${ci}-make.log

diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 247775b..bcc1759 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -49,6 +49,25 @@ static ggml_tensor * ggml_rotate_hadamard(
     return res;
 }

+// Allocate the Hadamard rotation input used by ggml_rotate_hadamard() for a
+// TurboQuant/PolarQuant K or V stream. Size is the largest power-of-two that
+// divides n_embd_head (>= 64), so the rotation matches the head dim and the
+// PQ/TBQ codebooks see the full d-wide rotated distribution they were fitted
+// to. Returns nullptr when can_rot is false.
+static ggml_tensor * build_hadamard_rot(ggml_context * ctx, bool can_rot, int n_embd_head) {
+    if (!can_rot) {
+        return nullptr;
+    }
+
+    int nrot = 64;
+    do { nrot *= 2; } while (n_embd_head % nrot == 0);
+    nrot /= 2;
+
+    ggml_tensor * rot = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, nrot, nrot);
+    ggml_set_input(rot);
+    return rot;
+}
+
 void llm_graph_input_embd::set_input(const llama_ubatch * ubatch) {
     if (ubatch->token) {
         const int64_t n_tokens = ubatch->n_tokens;
@@ -1626,30 +1645,13 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
             hparams.n_embd_head_k % 64 == 0 &&
             ggml_is_quantized(mctx_cur->type_k());

-        if (can_rotk) {
-            int nrot = 64;
-            do { nrot *= 2; } while (hparams.n_embd_head_k % nrot == 0);
-            nrot /= 2;
-
-            inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotk);
-        } else {
-            inp->self_rotk = nullptr;
-        }
-
         const bool can_rotv =
             !hparams.is_n_embd_v_gqa_variable() &&
             hparams.n_embd_head_v % 64 == 0 &&
             ggml_is_quantized(mctx_cur->type_v());

-        if (can_rotv) {
-            int nrot = 64;
-
-            inp->self_rotv = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotv);
-        } else {
-            inp->self_rotv = nullptr;
-        }
+        inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
+        inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);
     }

     return inp;
@@ -1947,30 +1949,13 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
             hparams.n_embd_head_k % 64 == 0 &&
             ggml_is_quantized(mctx_cur->get_base()->type_k());

-        if (can_rotk) {
-            int nrot = 64;
-            do { nrot *= 2; } while (hparams.n_embd_head_k % nrot == 0);
-            nrot /= 2;
-
-            inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotk);
-        } else {
-            inp->self_rotk = nullptr;
-        }
-
         const bool can_rotv =
             !hparams.is_n_embd_v_gqa_variable() &&
             hparams.n_embd_head_v % 64 == 0 &&
             ggml_is_quantized(mctx_cur->get_base()->type_v());

-        if (can_rotv) {
-            int nrot = 64;
-
-            inp->self_rotv = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotv);
-        } else {
-            inp->self_rotv = nullptr;
-        }
+        inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
+        inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);
     }

     return (llm_graph_input_attn_kv_iswa *) res->add_input(std::move(inp));
@jesusmb1995

This comment was marked as resolved.

Register explicit `MUL_MAT` test cases for the TurboQuant / PolarQuant
types (`tbq3_0`, `tbq4_0`, `pq3_0`, `pq4_0`) with `type_b ∈ {f32, f16}`
and sizes that span both dispatch paths:

  - n=1, n=8   -> mul_mat_vec path (decode-like)
  - n=16, n=32 -> dequant + f16 matmul path (prefill-like)

Motivation: the existing TBQ/PQ coverage in `test-backend-ops` only
registers `FLASH_ATTN_EXT` cases, so the `tests/test-turboquant.sh`
filter (`test-backend-ops test -p "tbq|pq"`) never exercises the
standalone `MUL_MAT` path. That path is the one reported as
`supports_op == yes` on NV `VK_NV_cooperative_matrix2` devices but has
no matching pipeline created in `pipeline_dequant_mul_mat_mat_f16[]`
(see `ggml-vulkan.cpp:3412` — "TBQ/PQ cm2 matmul shaders not yet
generated"). With these cases in place:

  - On NV coopmat2 (RTX 5090): the n>=16 cases segfault, matching the
    external reproduction from the PR ggml-org#115 review and making the bug
    visible to CI.
  - On KHR coopmat1 (AMD gfx1150): the n>=16 cases return numerical
    garbage (err ~ 1.0), exposing the same missing-fallback issue in
    a non-crashing form.
  - The n=1/n=8 cases continue to pass via the existing
    `mul_mat_vec_tbq*_0` / `mul_mat_vec_pq*_0` shaders, so the new
    coverage cleanly isolates which dispatch path is broken.

No source changes to the Vulkan backend; this commit only adds the
test cases needed so the pre-existing bug is caught by
`tests/test-turboquant.sh`.
…d=64 variants

Before this patch, `supports_op` reported TBQ3_0/TBQ4_0/PQ3_0/PQ4_0 (and
their `_64` / head_dim=64 variants) as supported for `GGML_OP_MUL_MAT` on
the Vulkan backend, but there was no working pipeline behind it. This is
the state the external review flags as Issue 3 on PR ggml-org#115: on cm2 (RTX
5090) the support probe claims support, the correctness run then
segfaults. The previous `test-backend-ops` commit adds the exact repro
for this.

Root causes:

  - On cm2 (NV coopmat2) devices the slot in
    `pipeline_dequant_mul_mat_mat_f16[]` was empty - the shader was
    never generated - so dispatches crashed when flash attention was
    not used.

  - On cm1 and scalar devices the slot in
    `pipeline_dequant_mul_mat_mat[]` was wired up, but `mul_mm_funcs.glsl`
    had no `load_a_to_shmem` implementation for TBQ/PQ. The generic
    `mul_mm.comp` ran with uninitialized shared memory and produced
    near-random output.

  - Even once data loading was fixed, TBQ3_0/TBQ4_0 still produced a
    small bias for `n > mul_mat_vec_max_cols` because the QJL Stage 2
    correction that `mul_mat_vec_tbq*_0.comp` applies in the vec path
    has no equivalent in the generic matmul shader.

  - For head_dim=64 models that use the `_64` block variants
    (TBQ*_0_64 / PQ*_0_64) the `_64` mul_mm pipelines, the `_64` QJL
    correction shaders, and the `_64`-sized Lloyd-Max codebook / sign
    arrays were missing, and the vec path is intentionally skipped for
    `_64` (so every `n` needs the full matmul + QJL correction pair).

Scope of this patch is strictly the standalone `MUL_MAT` path with TBQ/PQ
`src0` and f32 `src1` that the Issue 3 repro hits, i.e. the `-fa off` K
matmul. Fused flash attention (scalar, cm1, cm2) already handles QJL
correctly and is unchanged. MoE FFN weights are not affected either:
TBQ/PQ are KV-cache quantizations (there is no `llama-quantize` target
that produces TBQ/PQ model weights), so `MUL_MAT_ID` never sees them as
`src0` - attention in MoE models is a plain `MUL_MAT` /
`FLASH_ATTN_EXT` and reuses the same fix.

The upstream "V cache quantization requires flash_attn" context-level
guard in `src/llama-context.cpp` is intentionally left unchanged: the V
matmul under `-fa off` uses a transposed quantized-V layout populated
by `ggml_set_rows` with row_size=1, which corrupts any `blck_size > 1`
type at write time (reproducible on CPU as well), and that is a
separate KV-cache issue out of scope here.

Changes:

  - Add `load_a_to_shmem` implementations for TBQ3_0, TBQ4_0, PQ3_0,
    PQ4_0 (and their `_64` variants) in `mul_mm_funcs.glsl`, reusing
    `tbq3_dequant_raw` / `tbq4_dequant_raw` from `tq_utils.comp`.
    This makes `mul_mm.comp` correct for the centroid part of
    dequantization (`tbq*_dequant_raw(qs) * d`) on all eight types.

  - `tq_utils.comp`: pick Stage-1 / QJL-Stage-2 sign bitmasks and the
    Lloyd-Max codebook (TBQ3_CB / TBQ4_CB) based on whether any
    `DATA_{A,K,V}_*_0_64` is defined. d=64 blocks use seeds 43/139 and
    a wider codebook (sigma = 1/sqrt(d) is larger at d=64 than at
    d=128); previously the shader hardcoded the d=128 constants, so
    the d=64 variants silently dequantized against the wrong codebook.

  - New shader `mul_mm_tbq_qjl_correction.comp`. It runs after the
    main matmul as an additive pass: one workgroup per
    `(row, col, batch)`, `QUANT_K` threads performing the same
    Walsh-Hadamard butterfly + `qjl[]` dot product as the vec shader,
    and accumulates `d_r * sqrt(pi/2) / QUANT_K * sum_qjl(H(B))` into
    D. Parameterized over `QUANT_K` so the same source emits both
    `_128` and `_64` SPIR-V. Only TBQ3_0 and TBQ4_0 (and `_64`) have
    `d_r`/`qjl`, so only those four get a correction pipeline.

  - `vulkan-shaders-gen.cpp`:
      * Register the eight correction variants
        (`mul_mm_qjl_{tbq3_0,tbq4_0}{,_64}_{f32,f16}`).
      * Emit `matmul_{tbq,pq}{3,4}_0_64_{f32,f16}[_aligned]` for
        `mul_mm.comp`, in a dedicated block outside the main
        `type_names` loop so we don't cascade through FA / MUL_MAT_ID /
        get_rows / ... which either already have dedicated `_64`
        handling (FA) or don't apply to TBQ/PQ at all.

  - `ggml-vulkan.cpp`:
      * Add `pipeline_mul_mm_tbq_qjl[GGML_TYPE_COUNT][2]` on the device
        and create pipelines at init time for all four TBQ types (128
        and 64 block sizes).
      * In `ggml_vk_get_mul_mat_mat_pipeline`, let cm2 fall through to
        the cm1/scalar pipeline when no cm2 `_mat_f16` shader exists
        for a given TBQ/PQ type, so cm2 devices stop segfaulting on
        these types.
      * Register TBQ/PQ `_64` in `supports_op` `MUL_MAT` switch so
        d=64 models are actually routed to the new pipelines instead
        of falling back to CPU.
      * Force `split_k = 1` for TBQ `src0` - the QJL correction pass
        would otherwise be added once per split.
      * Dispatch the QJL correction pass after the main matmul for
        TBQ3_0 / TBQ4_0 (and `_64`). For `_128` it's gated on
        `n > mul_mat_vec_max_cols` (the vec path already corrects for
        smaller n); for `_64` it runs unconditionally because there is
        no vec path on this block size.

Verified on AMD RDNA3.5 (RADV gfx1150, no cm1/cm2) with
`test-backend-ops -o MUL_MAT -b Vulkan0`: all 32 TBQ/PQ x {128, 64} x
n in {1,8,16,32} cases pass against f32 B. f16 B for standalone
MUL_MAT on TBQ/PQ still reports `not supported` on this device (the
scalar/cm1 pipeline consumes f32 src1), which is consistent with the
matmul path shipping f32 src1 on the `-fa off` decode/prefill paths
used by these tests. cm2 verification is expected to run on the
reviewer's RTX 5090 via the Issue 3 repro branch.

debug: sentinel in QJL correction

vulkan: add dequantize-to-f16 cpy shaders so MUL_MAT with non-contiguous quantized src0 can run on GPU

vulkan: add d=64 decision boundaries for TBQ/PQ copy_to_quant

Commit 6e26e8b ("vulkan: fix TBQ/PQ standalone MUL_MAT path, QJL
correction pass, and d=64 variants") added TQ_D64-gated d=64 Lloyd-Max
codebook centroids and random sign diagonals to tq_utils.comp, so that
the Vulkan encoder, FA decoder, and non-FA QJL correction pass all match
the CPU reference's d=64 constants (ggml-quants.c: TQ{3,4}_CODEBOOK_64,
TQ/QJL_SIGN_SEED_64). It missed the corresponding d=64 *decision
boundaries* in copy_to_quant.comp, however.

The boundaries are the midpoints between adjacent codebook centroids and
determine which centroid an input coordinate is quantized to. CPU derives
them at runtime from the selected codebook via tq_compute_boundaries(),
so it automatically used the d=64 midpoints for _64 blocks. The Vulkan
encoder hard-codes them as const float TBQ3_B[7] / TBQ4_B[15], and those
constants remained at the d=128 midpoints even inside the #if block that
accepts both DATA_A_TBQ*_0 and DATA_A_TBQ*_0_64.

Net effect on a head_dim=64 model (Qwen2.5-0.5B):

  - copy_to_quant bucketed each coordinate by the narrower d=128
    boundaries (centroids spaced ~sigma=1/sqrt(128)).
  - The resulting index was then dequantized with the wider d=64
    centroids (spaced ~sigma=1/sqrt(64)), a completely different
    alphabet.
  - Every value near a boundary landed on the wrong centroid.

Before commit 6e26e8b this was silently consistent: the codebook was
also d=128, so encoder and decoder were at least in agreement (just
producing a rescaled quantization). When the codebook was fixed to d=64,
the boundaries had to move with it.

Add a #if defined(TQ_D64) branch for TBQ3_B and TBQ4_B that uses the
d=64 midpoints computed from TQ{3,4}_CODEBOOK_64. Values regenerated
with scripts/compute_tq_codebooks.py, which now also emits the GLSL
boundaries array alongside the C codebook array so future codebook
updates keep the two in sync from one source of truth.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wiki.test offset_64
(--chunks 1), Vulkan AMD RADV gfx1150:

  tbq3_0 / f16  fa=off:   1659 -> 230   (7.2x better)
  tbq4_0 / f16  fa=off:   ~    -> 300
  pq3_0  / f16  fa=off:   ~    -> 230
  pq4_0  / f16  fa=off:   ~    -> 300
  pq3_0  / f16  fa=on:    ~    -> 225  (now matches fa=off)
  pq4_0  / f16  fa=on:    ~    -> 299  (now matches fa=off)

tbq3_0/tbq4_0 fa=on still diverges from fa=off due to a separate issue
in the FA QJL Stage-2 correction on d=64 blocks, addressed in a follow-
up patch.

vulkan: read raw Q from the input SSBO for the FA QJL projection

The flash-attention shaders compute the QJL Stage-2 correction as

    correction = d_r * sqrt(pi/2) / QUANT_K * (2*pos_sum - proj_q_sum)

where (pos_sum, proj_q_sum) are reductions over FHT(D_qjl * Q).  The
scalar (flash_attn.comp) and coopmat1 (flash_attn_cm1.comp) paths used
to derive the FHT input by reading from Qf -- the shared-memory buffer
that already has the attention scale (1/sqrt(head_dim)) multiplied in
for the main Q*K dot -- and then dividing that value by p.scale to
recover the raw Q before multiplying by the QJL sign diagonal.
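Assuming pos_sum is the partial sum of the FHT output over positions whose stored QJL sign bit is set and proj_q_sum is the sum over all positions (an interpretation of the reductions above, not verified against the shader source), the 2*pos_sum - proj_q_sum term is just the signed projection:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Assumed semantics: pos_sum sums FHT output at set-bit positions, proj_q_sum
// sums all positions. Then 2*pos_sum - proj_q_sum == sum_i s_i * h_i with
// s_i = +1 for a set QJL sign bit and -1 otherwise.
static float correction_inner(const std::vector<float> & h, const std::vector<int> & bit) {
    float pos_sum = 0.0f, proj_q_sum = 0.0f;
    for (size_t i = 0; i < h.size(); ++i) {
        proj_q_sum += h[i];
        if (bit[i]) pos_sum += h[i];
    }
    return 2.0f * pos_sum - proj_q_sum;
}
```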

On cm1 Qf is f16, so the scale round-trip is lossy for the large-
magnitude activations seen in e.g. Qwen2.5-0.5B's first-layer massive
activations.  On the scalar path Qf is f32 and p.scale is usually a
power of two (1/sqrt(head_dim)), so x * p.scale / p.scale is bit-exact
in principle -- but empirically the pre-scaled-then-un-scaled read
still produced materially different FHT input than a raw-Q read.
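The f16 round-trip loss is easy to reproduce outside the shader. A sketch that emulates f16 mantissa rounding (the narrower f16 exponent range is ignored, which does not matter here) and uses head_dim=128, where 1/sqrt(head_dim) is not a power of two:

```cpp
#include <cassert>
#include <cmath>

// Round a float to the nearest value with an 11-bit significand, i.e. f16
// mantissa precision.
static float round_f16_mantissa(float x) {
    int e;
    float m = std::frexp(x, &e);               // x = m * 2^e, |m| in [0.5, 1)
    m = std::round(m * 2048.0f) / 2048.0f;     // keep 11 significant bits
    return std::ldexp(m, e);
}
```

Storing x * scale in f16 and dividing by scale afterwards does not recover x, so the FHT in the correction sees a perturbed Q.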

The standalone non-FA QJL shader (mul_mm_tbq_qjl_correction.comp) has
always read Q directly from src1 and gets correct results.  Match that
pattern in the FA path: read Q straight from the data_qv4 SSBO into
Qf_qjl_proj, bypassing Qf entirely.  cm2 already reads raw Q from
data_q directly, so it does not need the change.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wikitext-2 test
(Vulkan AMD RADV gfx1150):

    wiki.test --chunks 4 -n 128, K=tbq3_0/f16:
        fa=off         :   531
        fa=on, before  :  ~2000  (broken)
        fa=on, after   :   154

    wiki.test --chunks 4 -n 128, K=tbq4_0/f16:
        fa=off         :   207
        fa=on, after   :    77

    K=f16/f16 and K=pq*/V=f16 fa=on/off stay within 1% of each other,
    confirming the change is confined to the TBQ QJL path.

vulkan: run non-FA TBQ QJL correction on permuted src0 too

The standalone MUL_MAT QJL (Stage 2) correction pass was gated on
`!x_non_contig`, which silently skipped the correction on the no-FA
attention path because `kq = mul_mat(k, q)` feeds in K after
`ggml_permute(k, 0, 2, 1, 3)` -- a non-dim01-contiguous view of the
KV cache.  With that gate the TBQ attention on the no-FA path was
reduced to PQ Stage 1 (centroid-only), producing bit-identical
output to `pq3_0` / `pq4_0` and regressing quality vs a CPU-reference
TBQ run.  It was masked on the MUL_MAT test-backend-ops coverage
because those tests use contiguous src0.

Two changes:

  * mul_mm_tbq_qjl_correction.comp: index the A matrix by a real
    in-memory block stride instead of the `num_blocks_per_row = K /
    QUANT_K` shortcut.  `p.stride_a` and `p.batch_stride_a` are now
    interpreted as strides in BLOCK units (src0->nb[1] /
    sizeof(block) and src0->nb[2] / sizeof(block) respectively),
    matching the way `ggml_vk_flash_attn` already feeds `k_stride` /
    `k_offset` to the FA shader.  For a contiguous src0 the new
    stride equals the old num_blocks_per_row, so existing tests are
    unaffected.

  * ggml_vk_mul_mat_q_f16: compute `qjl_stride_a` /
    `qjl_stride_batch_a` from src0->nb and drop the
    `!x_non_contig` exclusion from both the descriptor-set request
    and the dispatch site.  The `qx_buf_offset` still points at the
    original (pre-permute) TBQ blocks in d_Qx, which is exactly what
    the correction pass wants -- it just needed the real strides to
    reach the right block for each (row_a, batch_id).
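The contiguous-equals-shortcut claim in the first point can be spelled out numerically (the block byte size below is a stand-in, not the real sizeof of a TBQ block):

```cpp
#include <cassert>
#include <cstddef>

// For dim01-contiguous src0, nb[1] == (K / QUANT_K) * sizeof(block), so the
// block-unit stride nb[1] / sizeof(block) reduces to the old
// num_blocks_per_row = K / QUANT_K shortcut. For a permuted view, nb[1] is
// whatever the view dictates and the shortcut would read the wrong block.
static size_t block_stride(size_t nb1, size_t block_bytes) {
    return nb1 / block_bytes;
}
```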

Both paths have to be gated identically -- if only one of them is
changed, the dispatched pipeline ends up without a descriptor set
and `vkCmdPushConstants` crashes with VK_NULL_HANDLE layout.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wiki.test --chunks 4
-n 128 (Vulkan AMD RADV gfx1150):

    K=tbq3_0 V=f16 fa=off (vs fa=off pq3_0):
        before fix:    561 == 561 (QJL silently skipped, TBQ==PQ)
        after fix :    634 != 561 (QJL running, TBQ distinct from PQ)
        CPU ref   :    546

    K=tbq4_0 V=f16 fa=off (vs fa=off pq4_0):
        before fix:    173 == 173 (QJL silently skipped)
        after fix :    133 != 173 (QJL running, 23% lower PPL)
        CPU ref   :    154

f16/f16 and pq* paths are unchanged: the gate only opens for TBQ.
FA is unaffected since it has its own inlined QJL epilogue.
…gle-GPU

The per-n_ctx wikitext slices generated by tests/test-kv-cache-quantization-perp.sh
were picked via an unseeded $RANDOM, so every run drew a fresh offset and PPL
numbers were not directly comparable across reruns. Additionally, llama-perplexity
was being invoked without --split-mode, so on multi-GPU hosts it defaulted to
splitting decoder layers across all visible devices -- introducing cross-device
numerical differences that could drift the baseline by more than the QJL/FA
signal this sweep is meant to detect.

Three changes to make PPL numbers reproducible across reruns and machines:

  * generate_offset_files(): seed $RANDOM with a fixed SLICE_SEED (default 42)
    before drawing offsets, so a fresh regeneration is byte-for-byte
    reproducible. Export SLICE_SEED=<n> to draw a different but still
    deterministic set of offsets.

  * generate_offset_files(): skip regeneration when wiki.test.offset_<n_ctx>.raw
    already exists and is non-empty; recover the original offset for the log
    line from the suffix size so the output is still informative ("reusing
    offset=... (<N> bytes)"). Pass --regen-slices (or delete the slice files)
    to force regeneration.

  * run_perplexity_once(): pass --split-mode none to llama-perplexity so PPL
    is computed on a single device regardless of how many GPUs are visible.
    Matches the default already used by tests/test-kv-cache-quantization-perf.sh
    (SPLIT_MODE="${SPLIT_MODE:-none}"), so perp and perf now agree on
    single-GPU execution.

Header comment and --help output updated to document the slice knobs. No
change to the perplexity args beyond --split-mode, so existing CSV schemas
and result filenames are unaffected.
jesusmb1995 and others added 3 commits April 21, 2026 18:44
… regression

The Vulkan CI (ubuntu-24-cmake-vulkan) fails on this pre-existing upstream
backend-op case:

  MUL_MAT(type_a=q8_0, type_b=f32, m=16, n=1, k=256, bs=[2,3],
          nr=[1,1], per=[0,2,1,3], ...)

with

  ggml-vulkan.cpp: GGML_ASSERT(ggml_vk_dim01_contiguous(src0)
                            || src0->type == F32/F16/BF16) failed

in ggml_vk_mul_mat_vec_q_f16. The underlying bug is that our change to
ggml_backend_vk_device_supports_op() -- which relaxed the non-dim01-
contiguous constraint on quantized src0 so the TBQ/PQ -fa off K*Q path can
stay on the GPU -- also advertises support for every other quant type
(q4_0, q5_0, q5_1, q8_0, iq4_nl, ...) with non-contig src0, but the small-n
vec dispatcher in ggml_vk_mul_mat still routes those cases to
ggml_vk_mul_mat_vec_q_f16, which does not implement the quant->f16 cpy
fallback and asserts on non-contig quantized src0.

This regression was not caught by tests/test-turboquant.sh because that
script filters test-backend-ops with `-p 'tbq|pq'` and our added TBQ/PQ
MUL_MAT coverage only uses per=[0,1,2,3] (identity). The upstream q8_0
permutation matrix exercises the exact shape that trips the assert.

Add a second test-backend-ops invocation to test-turboquant.sh that targets
the smallest reproducer:

  -p 'type_a=q8_0.*per=\[0,2,1,3\]'

Picking q8_0 (instead of a TBQ/PQ variant) means this check runs on any
Vulkan box without requiring a TBQ model or KV cache, and it directly
reproduces the CI failure.

Verified on AMD gfx1150 (KHR_coopmat) with the top-of-branch Vulkan that
still has the bug: test-turboquant.sh now exits non-zero locally with
"1 check(s) failed.", matching the CI failure. The follow-up Vulkan
patch adds the dispatcher fix that makes this check pass.
…ix path

Commit 6fd388c (vulkan-fix-tbq-pq-standalone) widened
ggml_backend_vk_device_supports_op(MUL_MAT) so that any quantized src0
type with a pipeline_cpy_quant_f16 entry is accepted even when it is not
dim01-contiguous. That is what lets the -fa off attention path keep
kq = mul_mat(K, Q) on the GPU when K is a permuted TBQ/PQ view of the
KV cache.

The matrix path (ggml_vk_mul_mat_q_f16) honours this: it runs the
quant->f16 cpy pipeline to dequantize the non-contig src0 before the
main matmul. But the vec path (ggml_vk_mul_mat_vec_q_f16), which the
dispatcher routes to when dst->ne[1] <= mul_mat_vec_max_cols (decode-like
n), does not: it asserts dim01-contiguous quantized src0 at the top of
the function. So any small-n MUL_MAT with a non-dim01-contiguous
quantized src0 -- e.g. the upstream backend-op coverage of
MUL_MAT(type_a=q8_0, m=16, n=1, k=256, bs=[2,3], per=[0,2,1,3]) --
slips through supports_op, gets routed to the vec path, and aborts on
the assertion. See:

  tetherto#115 (comment)

Fix by adding one clause to the dispatcher in ggml_vk_mul_mat: take the
vec path only when src0 is either non-quantized or dim01-contiguous.
Non-dim01-contiguous quantized src0 falls through to
ggml_vk_mul_mat_q_f16, which already handles it via pipeline_cpy_quant_f16.
This does not change hot paths: contiguous src0 still takes the vec path
as before, which is the overwhelmingly common case for mul_mat in
transformer graphs.
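The added clause reduces to a small gate. A model with illustrative names, not the actual ggml_vk_mul_mat code:

```cpp
#include <cassert>

// Vec path only for small n AND (non-quantized OR dim01-contiguous) src0;
// everything else falls through to the matrix path, which can dequantize a
// non-contiguous quantized src0 via pipeline_cpy_quant_f16.
static bool take_vec_path(bool quantized, bool dim01_contiguous, int n, int mul_mat_vec_max_cols) {
    if (n > mul_mat_vec_max_cols) return false;
    return !quantized || dim01_contiguous;
}
```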

Also annotate the vec path assert so a future caller that tries to send
non-contig quantized src0 there gets a loud error rather than a silent
wrong answer, and so the invariant between the dispatcher gate and the
assert is documented in both places.

Verified on AMD gfx1150 (KHR_coopmat):

  Before: tests/test-turboquant.sh exits 1 with GGML_ASSERT at
          ggml-vulkan.cpp:8105 on the q8_0 per=[0,2,1,3] smoke case
          added in the previous patch.
  After:  tests/test-turboquant.sh passes; the q8_0 per=[0,2,1,3]
          MUL_MAT cases run on the GPU through the matrix path (f32
          variants succeed, f16 variants report "not supported [CPU]"
          as before since backend-ops does not currently wire an f16
          x f16 contiguity check for quantized src0).

This also fixes the Ubuntu Vulkan CI job for the PR.
The per-thread TBQ/PQ quantize shader was single-threaded per block —
one lane normalized 128 values, ran the FHT serially, and packed the
QJL sketch bit-by-bit, with three float[128] private arrays spilling
to GPU private memory. On a 5090 this capped the tbq3_0 / tbq4_0 write
throughput at ~80 GB/s (~4 % of peak).

Switch to a cooperative shader that treats one workgroup (32 lanes ==
one subgroup on NVIDIA) as one block:

  - norm, norm-correction and residual-norm reductions use subgroupAdd
  - the Fast Hadamard Transform runs log2(BK) passes with the BK/2
    butterflies in each pass spread across the 32 threads, separated
    by a single barrier() each
  - the QJL sign sketch is packed with subgroupBallot (32 bits per
    call, written as four bytes directly) instead of 128 serial OR
    into memory
  - scratch moves from private arrays to shared memory (tq3_sh_x,
    tq3_sh_idx, tq3_sh_proj, and tq4_* analogues)
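For reference, the butterfly recurrence those log2(BK) passes implement is the standard in-place unnormalized Fast Hadamard Transform. A sequential sketch (the cooperative shader spreads the BK/2 butterflies of each pass across the 32 lanes, with a barrier() between passes):

```cpp
#include <cstddef>
#include <vector>

// In-place unnormalized FHT: log2(n) passes of paired add/sub butterflies.
static void fht(std::vector<float> & x) {
    const size_t n = x.size();               // must be a power of two
    for (size_t h = 1; h < n; h <<= 1) {
        for (size_t i = 0; i < n; i += 2 * h) {
            for (size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }
    }
}
```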

On the host side, the 32 TBQ/PQ cpy_f32_quant pipelines drop their
wg_denoms from {32,1,1} to {1,1,1} so that "one workgroup == one
block", and the shader's CPY main() picks up a matching TQ_COOP branch
that drops the *32 + gl_LocalInvocationID.x offset from the block
index decode.

The GGML_OP_SET_ROWS dispatch path also needs to know about the new
"one workgroup per block" contract: for TBQ/PQ dst types, divide ne
by ggml_blck_size(dst) instead of 32 * ggml_blck_size(dst). Without
this gate the set_rows kernel dispatched only 1/32 of the required
workgroups, silently leaving 31 out of 32 KV-cache blocks uninitialized
and driving perplexity on Mistral-7B-Instruct-v0.3 from ~5.9 to ~1090
with no visible failure from llama-bench or test-backend-ops (the
CPY tests only exercise GGML_OP_CPY, which already had the /blck_size
rule). Unrelated types keep the /32/blck_size rule so q4_0, q8_0 etc.
behave exactly as before.
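The workgroup-count rule reads as follows (a sketch; the real gate sits in the GGML_OP_SET_ROWS dispatch in ggml-vulkan.cpp):

```cpp
#include <cassert>
#include <cstdint>

// TBQ/PQ: one workgroup per block. Other quantized types keep the old rule of
// one 32-lane workgroup covering 32 blocks.
static int64_t set_rows_workgroups(int64_t ne, int64_t blck_size, bool is_tbq_pq) {
    return is_tbq_pq ? ne / blck_size : ne / (32 * blck_size);
}
```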

Measured on 2x RTX 5090, Vulkan 1.4.321, PR ggml-org#115 tip b23276f.

test-quantize-perf -b vulkan, 4 MiB input, 500 iters:

  type      baseline avg   optimized avg   avg speedup
  tbq3_0    187.5 us       42.7 us        4.39x
  tbq4_0    192.9 us       44.3 us        4.35x
  pq3_0      68.6 us       44.8 us        1.53x
  pq4_0      84.9 us       44.9 us        1.89x

llama-bench on Mistral-7B-Instruct-v0.3 Q4_K_S, -fa 1, -r 3:

  K / V              pp2048 base -> opt         tg128 base -> opt
  tbq3_0 / pq3_0     9764 -> 9880   +1.2 %      179.1 -> 206.9  +15.5 %
  pq3_0  / pq3_0    15396 -> 15653  +1.7 %      190.5 -> 214.0  +12.3 %
  tbq4_0 / pq4_0     9568 -> 9782   +2.2 %      164.9 -> 205.1  +24.4 %

llama-perplexity on wikitext-2 test, 40 chunks, seed 42,
Mistral-7B-Instruct-v0.3 Q4_K_S, -fa 1:

  K / V              baseline PPL             optimized PPL
  f16    / f16       5.8254 +/- 0.13612       5.8254 +/- 0.13612  (control)
  tbq3_0 / pq3_0     5.9333 +/- 0.13811       5.9129 +/- 0.13747
  pq3_0  / pq3_0     5.9806 +/- 0.13879       5.9894 +/- 0.13918
  tbq4_0 / pq4_0     5.8646 +/- 0.13707       5.8570 +/- 0.13679

All PPL deltas are an order of magnitude smaller than the 95 % CI and
come from FP associativity in the subgroup tree reduction vs the
previous sequential sum. No algorithmic change.

tests/test-turboquant.sh: 112/112 backend-op tests still pass on both
GPUs. test-quantize-fns reproduces the 4 pre-existing _64 roundtrip
failures with bit-identical error magnitudes (0.010805, 0.009384) —
that is the wrong-codebook bug from PR ggml-org#115 review, not introduced or
fixed here.

Made-with: Cursor
@jesusmb1995

This comment was marked as resolved.

@jesusmb1995 jesusmb1995 changed the title TurboQuant (Vulkan): KV cache quantization (TBQ3_0 / TBQ4_0 / PQ3_0 / PQ4_0) QVAC-14555: TurboQuant (Vulkan): KV cache quantization (TBQ3_0 / TBQ4_0 / PQ3_0 / PQ4_0) Apr 23, 2026
}

// Pack QJL sign bits with subgroupBallot: each ballot call contributes 32 bits
// covering positions [s*32, (s+1)*32). With WG == subgroup size, bit `lid` of

This cooperative path seems to rely on 32 threads == 1 full subgroup, but I do not see a matching required-subgroup-size request when the cpy_f32_tbq* / cpy_f32_pq* and set_rows_* pipelines are created. On devices with 8- or 16-lane subgroups, subgroupAdd() here only reduces within each subgroup and subgroupBallot() only packs part of the block, so both the norm/correction reductions and the QJL bit packing become partial. Is there a reason this is guaranteed to run only on subgroup-size-32 hardware?

Author

This was introduced by optimization 45d3b80: the shader is only correct when gl_SubgroupSize == gl_WorkGroupSize.x == 32, which is not true for all hardware (e.g. Intel Arc).

Author

Working on a generic [[unroll]] + spec-constant shader that should compile to similar bytecode for a group size of 32 when optimizations are enabled (with minor aesthetic differences).

Author

d994c9b Added a generic shader and tests. A software implementation of Vulkan is used to verify that group-size variants other than 32 or 64 are accurate against the CPU version.

Verified on the 5090 box that neither PPL nor tokens/s is affected by the change. Since the testing script reuses the same texts, PPL is exactly the same; tok/s is within noise or very close.

Before:

[7/45, ETA 9m45s] Running: K=tbq3_0 V=pq3_0 (coopmat1, large) ...
  tg=183.52±2.10 t/s
[4/45, ETA 9m06s] Running: K=pq3_0 V=pq3_0 (coopmat1, large) ...
  tg=215.74±0.43 t/s
[5/45, ETA 9m34s] Running: K=pq4_0 V=pq4_0 (coopmat1, large) ...
  tg=208.72±0.80 t/s
[15/45, ETA 7m45s] Running: K=pq3_0 V=pq3_0 (coopmat2, large) ...
  tg=223.38±0.09 t/s

  K=tbq3_0 V=pq3_0 PPL = 5.8203 (sweep±0.5987, chunk±0.2202)  (1.93±0.12s)
  K=pq3_0 V=pq3_0 PPL = 5.8461 (sweep±0.5701, chunk±0.2201)  (1.84±0.13s)

After:

[7/45, ETA 1m57s] Running: K=tbq3_0 V=pq3_0 (coopmat1, mid) ...
  tg=182.05±3.79 t/s
[4/45, ETA 2m06s] Running: K=pq3_0 V=pq3_0 (coopmat1, mid) ...
  tg=215.16±1.36 t/s
[5/45, ETA 2m03s] Running: K=pq4_0 V=pq4_0 (coopmat1, mid) ...
  tg=209.44±1.81 t/s
[15/45, ETA 1m33s] Running: K=pq3_0 V=pq3_0 (coopmat2, mid) ...
  tg=224.63±0.11 t/s

  K=tbq3_0 V=pq3_0 PPL = 5.8203 (sweep±0.5987, chunk±0.2202)  (2.00±0.12s)
  K=pq3_0 V=pq3_0 PPL = 5.8461 (sweep±0.5701, chunk±0.2201)  (1.88±0.13s)
=== Subgroup coverage summary ===
┌────────────────┬───────────────────────────────────────┬───────────────────────────────┬────────────────────────────────────┐
│ Leg            │ Subgroup size                         │ NSG                           │ Result                             │
├────────────────┼───────────────────────────────────────┼───────────────────────────────┼────────────────────────────────────┤
│ native GPU     │ device default (>=32 on typical GPUs) │ 1 (fast path on typical GPUs) │ PASSED: ran=24 skipped=24 failed=0 │
│ lavapipe W=128 │ 4                                     │ 8 (stitch)                    │ PASSED: ran=16 skipped=32 failed=0 │
│ lavapipe W=256 │ 8                                     │ 4 (stitch)                    │ PASSED: ran=16 skipped=32 failed=0 │
│ lavapipe W=512 │ 16                                    │ 2 (stitch)                    │ PASSED: ran=16 skipped=32 failed=0 │
└────────────────┴───────────────────────────────────────┴───────────────────────────────┴────────────────────────────────────┘

==========================================
 All checks passed.
==========================================

In 5acb3d5 I added additional "masking" software shaders to test the behavior of varying group sizes; since this is just for experimentation (to try different group sizes on hardware that does not natively have them), it will be reverted. All variations perform very similarly in GB/s, and sometimes a smaller WG configuration can surprisingly outperform a larger one (could be noise).

=== pq3_0 huge ===
  wg     | status |  nmse(g v c) |  nmse(g v s) |   ms/iter |      GB/s
  32(prod) | OK     |    2.189e-08 |    3.398e-02 |     0.088 |    761.99
  2      | OK     |    2.189e-08 |    3.398e-02 |     0.090 |    742.52
  4      | OK     |    2.189e-08 |    3.398e-02 |     0.093 |    721.45
  8      | OK     |    2.189e-08 |    3.398e-02 |     0.090 |    746.73
  16     | OK     |    2.189e-08 |    3.398e-02 |     0.092 |    732.50
  cpu    | REF    |            - |            - |   127.336 |      0.53

  sorted by ms/iter (informational; see header):
    wg=32(prod)        0.088 ms    761.99 GB/s  speedup vs CPU = 1447.00x
    wg=2               0.090 ms    742.52 GB/s  speedup vs CPU = 1414.84x
    wg=8               0.090 ms    746.73 GB/s  speedup vs CPU = 1414.84x
    wg=16              0.092 ms    732.50 GB/s  speedup vs CPU = 1384.09x
    wg=4               0.093 ms    721.45 GB/s  speedup vs CPU = 1369.20x
    cpu (ref)        127.336 ms      0.53 GB/s  (baseline)

=== pq3_0_64 huge ===
  wg     | status |  nmse(g v c) |  nmse(g v s) |   ms/iter |      GB/s
  32(prod) | OK     |    4.587e-08 |    3.343e-02 |     0.147 |    455.67
  2      | OK     |    4.587e-08 |    3.343e-02 |     0.147 |    456.39
  4      | OK     |    4.587e-08 |    3.343e-02 |     0.154 |    435.99
  8      | OK     |    4.587e-08 |    3.343e-02 |     0.156 |    430.11
  16     | OK     |    4.587e-08 |    3.343e-02 |     0.153 |    438.12
  cpu    | REF    |            - |            - |   129.420 |      0.52

  sorted by ms/iter (informational; see header):
    wg=32(prod)        0.147 ms    455.67 GB/s  speedup vs CPU = 880.41x
    wg=2               0.147 ms    456.39 GB/s  speedup vs CPU = 880.41x
    wg=16              0.153 ms    438.12 GB/s  speedup vs CPU = 845.88x
    wg=4               0.154 ms    435.99 GB/s  speedup vs CPU = 840.39x
    wg=8               0.156 ms    430.11 GB/s  speedup vs CPU = 829.62x
    cpu (ref)        129.420 ms      0.52 GB/s  (baseline)

if [ ${#KS[@]} -gt 0 ] || [ ${#VS[@]} -gt 0 ]; then
# --ks / --vs override: run the Cartesian product. Missing side defaults to the
# set supplied on the other side (so e.g. --vs f16 on its own sweeps all built-in K:f16 pairs).
if [ ${#KS[@]} -eq 0 ]; then

In --no-fa mode the script still documents the scalar-path sweep as "only test K quantizations with V=f16", but this override branch now auto-fills the missing side with all cache types. For example, --no-fa --ks tbq3_0 will expand to tbq3_0:{f16,q8_0,q4_0,pq3_0,...}, and the first quantized-V row aborts at runtime with V cache quantization requires flash_attn. Because run_perplexity_once() returns non-zero under set -e, that stops the whole sweep instead of running the intended K-only comparison. Should the auto-filled side be clamped back to f16 whenever FA_FLAG=off?

case GGML_TYPE_Q8_0:
case GGML_TYPE_TQ2_0:
case GGML_TYPE_TQ1_0:
case GGML_TYPE_TBQ3_0:

This still looks too broad for GGML_OP_MUL_MAT_ID: the support predicate now whitelists the TBQ/PQ types here, but I do not see matching *_id pipelines being generated for them. ggml_vk_get_dequantize_mul_mat_vec_id() only populates the older quant types, and ggml_vk_get_mul_mat_mat_id_pipeline() still asserts if the selected pipeline_dequant_mul_mat_mat_id[src0_type] entry is empty. Since TBQ/PQ are KV-cache types this may be hard to hit in normal llama inference, but for custom graphs / backend-op surfaces this still looks like we advertise support without a backend implementation behind it.

@gianni-cor
Copy link
Copy Markdown

Issue 1 — Hadamard rotation engages on every quantized KV cache type, not just TBQ/PQ

This is the last behavioural concern from the original review (§2.3) that I haven't seen addressed. The can_rotk / can_rotv gates in build_attn_inp_kv_impl and build_attn_inp_kv_iswa test ggml_is_quantized(mctx_cur->type_k/v()):

  • const bool can_rotk =
    !hparams.is_n_embd_k_gqa_variable() &&
    hparams.n_embd_head_k % 64 == 0 &&
    ggml_is_quantized(mctx_cur->type_k());
    const bool can_rotv =
    !hparams.is_n_embd_v_gqa_variable() &&
    hparams.n_embd_head_v % 64 == 0 &&
    ggml_is_quantized(mctx_cur->type_v());
    inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
    inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);
  • const bool can_rotk =
    !hparams.is_n_embd_k_gqa_variable() &&
    hparams.n_embd_head_k % 64 == 0 &&
    ggml_is_quantized(mctx_cur->get_base()->type_k());
    const bool can_rotv =
    !hparams.is_n_embd_v_gqa_variable() &&
    hparams.n_embd_head_v % 64 == 0 &&
    ggml_is_quantized(mctx_cur->get_base()->type_v());
    inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
    inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);

ggml_is_quantized() returns true for q4_0, q4_1, q5_0, q5_1, q8_0, iq4_nl, all K-quants, all I-quants — so a user running on master with --cache-type-k q4_0 --cache-type-v q4_0 and head_dim%64 == 0 (essentially every modern model) now gets, after this PR lands:

  • three extra head_dim × head_dim dense mat-muls per attention call (Q · R, K · R, output · R);
  • quantization-error distribution that differs from pre-PR master, because the quantiser sees rotated Q/K/V instead of raw.

Attention scores remain mathematically equivalent at infinite precision (R is orthogonal, applied symmetrically to Q and K, and undone on the output for V), so end-to-end PPL should stay within the same CI, but:

  1. it's a silent behavioural and performance change on paths that have nothing to do with TurboQuant/PolarQuant,
  2. bit-identical regression comparisons against master on existing q4_0/q8_0 KV-cache runs are no longer possible,
  3. the rotation would be pure overhead on those paths — the codebooks it pays for (d-specific Lloyd–Max centroids) are TBQ/PQ-only.
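The equivalence claim above is easy to sanity-check on a toy orthogonal R (a 2x2 Givens rotation here, not the actual Hadamard-based rotation):

```cpp
#include <cassert>
#include <cmath>

// dot(qR, kR) == dot(q, k) for orthogonal R: a rotation applied symmetrically
// to Q and K cancels in the attention score (q^T R^T R k == q^T k).
static void rotate2(const float v[2], float c, float s, float out[2]) {
    out[0] = c * v[0] - s * v[1];
    out[1] = s * v[0] + c * v[1];
}

static float dot2(const float a[2], const float b[2]) {
    return a[0] * b[0] + a[1] * b[1];
}
```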

Suggested fix (~10 lines) — narrow both predicates to the 8 TBQ/PQ types:

auto is_tbq_pq = [](ggml_type t) {
    switch (t) {
        case GGML_TYPE_TBQ3_0:    case GGML_TYPE_TBQ4_0:
        case GGML_TYPE_PQ3_0:     case GGML_TYPE_PQ4_0:
        case GGML_TYPE_TBQ3_0_64: case GGML_TYPE_TBQ4_0_64:
        case GGML_TYPE_PQ3_0_64:  case GGML_TYPE_PQ4_0_64:
            return true;
        default:
            return false;
    }
};
const bool can_rotk = !hparams.is_n_embd_k_gqa_variable() &&
                      hparams.n_embd_head_k % 64 == 0 &&
                      is_tbq_pq(mctx_cur->type_k());

Same change in the iSWA builder. Happy to push this as a PR against turboquant if useful.

// expects dim01-contiguous quantized src0 and would assert. supports_op
// advertises these cases as supported via has_quant_f16_cpy, so we must
// keep them on the GPU here rather than fall back to CPU.
} else if ((dst->ne[1] == 1 || (dst->ne[1] <= mul_mat_vec_max_cols && src1->ne[2] * src1->ne[3] == 1)) &&

I can still reproduce a latest-head Vulkan correctness bug here on the NVIDIA coopmat2 box. This branch now sends non-dim01-contiguous quantized src0 to the matrix path when n is small, but the standalone TBQ QJL correction is still gated below to _64 or ne11 > mul_mat_vec_max_cols (ggml-vulkan.cpp around the is_tbq_d128_dispatch / ne11 > mul_mat_vec_max_cols check). That leaves a hole for _128 tbq3_0 / tbq4_0 with small n: they avoid the vec kernel, but also skip the Stage-2 correction pass.

Concrete repro on qvac-dev-linux-x64 (RTX 5090, VK_NV_cooperative_matrix2) against the current PR head: MUL_MAT(type_a=tbq3_0,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3],k_v=0,o=1) is reported as supported on Vulkan0, but the correctness run fails with ERR = 0.058072511 > 0.000500000. The analogous control case with type_a=pq3_0 and the same shape passes.

That strongly suggests the non-contiguous small-n matrix path is still missing the TBQ Stage-2 QJL correction for _128 blocks.

@gianni-cor
Copy link
Copy Markdown

Issue 3 — No _64 coverage in test-backend-ops MUL_MAT / FLASH_ATTN_EXT

The _64 (head_dim=64) variants are only exercised by test-quantize-fns, which confirms the copy_to_quant path post-32ba81912 (RMSE dropped from ~0.010 to ~0.003 for all 4 _64 types on Vulkan — nice). But the standalone MUL_MAT and fused FLASH_ATTN_EXT paths that 32ba81912 + 7933c60ab wire up are not covered by CI:

  • The new MUL_MAT coverage block enumerates only d=128 types:

    // TurboQuant / PolarQuant MUL_MAT coverage.
    // Intentionally exercises the standalone MUL_MAT path (not fused into FLASH_ATTN_EXT),
    // which is the path that reports `supports_op == yes` on NV coopmat2 but has no
    // matching pipeline created in `pipeline_dequant_mul_mat_mat_f16[]` — see
    // ggml-vulkan.cpp:3412 ("TBQ/PQ cm2 matmul shaders not yet generated") and the
    // supports_op switch that still lists TBQ/PQ for MUL_MAT.
    {
        const ggml_type tbq_pq[] = {
            GGML_TYPE_TBQ3_0, GGML_TYPE_TBQ4_0, GGML_TYPE_PQ3_0, GGML_TYPE_PQ4_0,
        };
        for (ggml_type type_a : tbq_pq) {
            for (ggml_type type_b : { GGML_TYPE_F32, GGML_TYPE_F16 }) {
                // mul_mat_vec path (n small, e.g. decode-like)
                test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {1, 1}, {1, 1}));
                test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 8, 256, {1, 1}, {1, 1}));
                // mat-mat path (n > 8, e.g. prefill — routes through dequant + f16 matmul on Vulkan)
                test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {1, 1}, {1, 1}));
                test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 32, 256, {1, 1}, {1, 1}));
            }
        }
    }

  • The mixed-K/V flash-attention block uses hs=64 but passes type_K = TBQ3_0 (not TBQ3_0_64). Inside test_flash_attn_ext::build_graph, hsk_padded = GGML_PAD(hsk, ggml_blck_size(type_K)) rounds 64 up to 128 because ggml_blck_size(TBQ3_0) = 128 — so the _64 flash-attention pipelines are never actually dispatched:

    // Mixed K/V type flash attention (at least one side is TBQ/PQ)
    {
        const ggml_type tbq_pq[] = { GGML_TYPE_TBQ3_0, GGML_TYPE_TBQ4_0, GGML_TYPE_PQ3_0, GGML_TYPE_PQ4_0 };
        const ggml_type mixed[]  = { GGML_TYPE_TBQ3_0, GGML_TYPE_TBQ4_0, GGML_TYPE_PQ3_0, GGML_TYPE_PQ4_0, GGML_TYPE_Q8_0, GGML_TYPE_F16 };
        auto is_tbq_pq = [&](ggml_type t) { return std::find(std::begin(tbq_pq), std::end(tbq_pq), t) != std::end(tbq_pq); };
        for (ggml_type tk : mixed) {
            for (ggml_type tv : mixed) {
                if (tk == tv) continue;
                if (!is_tbq_pq(tk) && !is_tbq_pq(tv)) continue;
                for (int hs : { 64, 128 }) {
                    for (int kv : { 113, 512 }) {
                        for (int nb : { 1, 32 }) {
                            test_cases.emplace_back(new test_flash_attn_ext(
                                hs, hs, 4, {1, 1}, kv, nb, true, false, 0.0f, 0.0f, GGML_PREC_F32, tk, tv));
                        }
                    }
                }
            }
        }
    }

Empirical confirmation on the latest PR tip (45d3b8098, 2× RTX 5090):

$ ./bin/test-backend-ops test -p "tbq3_0_64|tbq4_0_64|pq3_0_64|pq4_0_64" -o MUL_MAT
Testing 3 devices
  0/0 tests passed
  0/0 tests passed

So a regression in:

  • load_a_to_shmem for _64 in mul_mm_funcs.glsl,
  • the _64 matmul pipelines emitted in vulkan-shaders-gen.cpp,
  • the pipeline_mul_mm_tbq_qjl[GGML_TYPE_*_64][0/1] wiring,
  • the qjl_stride_a / qjl_stride_batch_a computation on permuted K views, or
  • the _64 FA TQ_D64 branch in tq_utils.comp,

would slip through the regular test suite. The fix the author ran locally ("test-backend-ops -o MUL_MAT -b Vulkan0 … all 32 TBQ/PQ × {128, 64} × n ∈ {1, 8, 16, 32} cases pass") is exactly what CI should do.

Suggested fix — extend both blocks to include _64 types and, where they share a loop with hs, match the type's block size:

// tests/test-backend-ops.cpp — MUL_MAT coverage
const ggml_type tbq_pq[] = {
    GGML_TYPE_TBQ3_0,    GGML_TYPE_TBQ4_0,    GGML_TYPE_PQ3_0,    GGML_TYPE_PQ4_0,
    GGML_TYPE_TBQ3_0_64, GGML_TYPE_TBQ4_0_64, GGML_TYPE_PQ3_0_64, GGML_TYPE_PQ4_0_64,
};
// and for k use e.g. k = ggml_blck_size(type_a) * 2 so the _64 blocks are actually exercised
// tests/test-backend-ops.cpp — FA coverage
// Separate arm for _64 types so hs stays at 64 (no GGML_PAD up to 128)
const ggml_type tbq_pq_64[] = { GGML_TYPE_TBQ3_0_64, GGML_TYPE_TBQ4_0_64,
                                GGML_TYPE_PQ3_0_64,  GGML_TYPE_PQ4_0_64 };
for (ggml_type tk : tbq_pq_64) {
    for (ggml_type tv : { GGML_TYPE_PQ3_0_64, GGML_TYPE_PQ4_0_64,
                          GGML_TYPE_Q8_0, GGML_TYPE_F16 }) {
        test_cases.emplace_back(new test_flash_attn_ext(
            /*hsk=*/64, /*hsv=*/64, 4, {1, 1}, 512, 32,
            true, false, 0.0f, 0.0f, GGML_PREC_F32, tk, tv));
    }
}

Small and contained, but closes the hole where the most fragile part of this PR (the d=64 Stage-1 + QJL-Stage-2 path) is un-covered by CI.

@gianni-cor

Issue 4 — Thread-0 bottleneck in mul_mm_tbq_qjl_correction.comp

Not a correctness concern (the kernel ships working), just a perf note for the new non-FA QJL epilogue added in 32ba81912. The inner reduction over proj_b_sh[] and qjl[] is thread-0-only:

// Thread 0 reduces the block's QJL dot product.
if (has_qjl && tid == 0u) {
    const float qjl_scale = d_r * sqrt(1.5707963) / float(QUANT_K);
    float pos_sum = 0.0;
    float total_sum = 0.0;
    [[unroll]] for (uint w = 0u; w < QUANT_K / 32u; w++) {
        const uint bb = w * 4u;
        uint bits = uint(data_a[ib].qjl[bb])
                  | (uint(data_a[ib].qjl[bb + 1u]) << 8u)
                  | (uint(data_a[ib].qjl[bb + 2u]) << 16u)
                  | (uint(data_a[ib].qjl[bb + 3u]) << 24u);
        [[unroll]] for (uint qq = 0u; qq < 8u; qq++) {
            const uint base_idx = (w * 8u + qq) * 4u;
            const vec4 pq = vec4(proj_b_sh[base_idx],
                                 proj_b_sh[base_idx + 1u],
                                 proj_b_sh[base_idx + 2u],
                                 proj_b_sh[base_idx + 3u]);
            const vec4 mask_v = vec4(float(bits & 1u),
                                     float((bits >> 1u) & 1u),
                                     float((bits >> 2u) & 1u),
                                     float((bits >> 3u) & 1u));
            pos_sum += dot(mask_v, pq);
            total_sum += pq.x + pq.y + pq.z + pq.w;
            bits >>= 4u;
        }
    }
    accum += qjl_scale * (2.0 * pos_sum - total_sum);
}

QUANT_K threads cooperate on the preceding Walsh–Hadamard butterfly; then the other 127 (or 63 for _64) go idle while thread 0 walks proj_b_sh[] and the qjl[] byte array serially. This is the hot inner loop of the kernel — one pass per block, num_blocks_per_row blocks per (row_a, col_b, batch) triple.

This is the `-fa off` prefill-K path (cold path for decode), so shipping the serial version is reasonable, but it's likely the biggest single perf gain available on non-FA TBQ inference once you get to it. A straightforward fix is a subgroupAdd-based tree reduction over the 32 lanes of the first subgroup:

// every thread contributes its own slice of (sign*proj, proj)
float local_pos = 0.0;
float local_tot = 0.0;
const uint byte_idx = tid >> 3u;
const uint bit_idx  = tid & 7u;
const float sign_j  = float((uint(data_a[ib].qjl[byte_idx]) >> bit_idx) & 1u);
local_pos = sign_j * proj_b_sh[tid];
local_tot = proj_b_sh[tid];

// tree reduction — one subgroupAdd replaces the whole [[unroll]] loop
pos_sum   = subgroupAdd(local_pos);
total_sum = subgroupAdd(local_tot);
if (tid == 0u) {
    accum += qjl_scale * (2.0 * pos_sum - total_sum);
}

Same structure as the cooperative norm reduction I added in 45d3b8098 for copy_to_quant.comp. Would also let you drop proj_b_sh from shared mem (not needed past the butterfly).

Will queue this as a follow-up patch once the three correctness items (Issue 1 above, and the review leftovers: wrong bpw comments in ggml.h, dead Stage-1 sign machinery, and the 0xQJL128ULL gibberish macro in ggml-quants.c) are resolved.

@jesusmb1995
Author

jesusmb1995 commented Apr 24, 2026

sycl-fp16 seems to fail now: https://github.com/tetherto/qvac-fabric-llm.cpp/actions/runs/24902725692/job/72924264252?pr=115

Here it was passing before: https://github.com/tetherto/qvac-fabric-llm.cpp/actions/runs/24771942879/job/72480398192

Likely cause: the last successful sycl-fp16 run on this PR was Apr 22 (run 24771942879, commit 45d3b80). It broke today purely because `apt install intel-oneapi-compiler-dpcpp-cpp` now resolves to 2026.0, which dropped/moved `syclcompat/math.hpp`.

Fix a latent correctness bug in the TurboQuant / PolarQuant copy_to_quant
cooperative shader that silently produces wrong bytes on any device whose
gl_SubgroupSize is less than the 32-thread workgroup (Intel Xe/Arc at 8/16,
ARM Mali 4/8/16, some Adreno configurations). Make the path cover every
supported subgroup size, plumb a runtime knob for testing, and add a
dedicated test suite with both real-hardware and software-Vulkan coverage.

Motivation
----------
The original copy_to_quant.comp TBQ/PQ path uses subgroupAdd() for the
per-block norm reductions and subgroupBallot() for the QJL sign-bit sketch,
assuming gl_SubgroupSize == 32 (= the workgroup size). On devices where the
native subgroup is smaller, those ops reduce only within a subgroup, not the
whole workgroup, so each subgroup sees its own partial sum and the output
bytes become whatever the first-subgroup partial happened to produce. The
SET_ROWS path has the same issue. The bug does not reproduce on most
production GPUs (NVIDIA fixed-32, AMD RDNA 32/64, Apple 32) but bites Intel
and several mobile GPUs.

Shader changes (copy_to_quant.comp)
-----------------------------------
* New specialization constant SG_SIZE at constant_id = 1 (slot 0 is already
  used by generic_binary_head.glsl's `norepeat` in the SET_ROWS path).
  Defaults to 32 so hosts that pass no spec info get the original shader.
* TQ_WG fixed at 32 (the workgroup size); NSG = TQ_WG / SG_SIZE is the
  number of subgroups per workgroup.
* New helper tq_wg_add(x): if NSG == 1 (SG_SIZE >= TQ_WG) returns
  subgroupAdd(x) -- identical to the original fast path and
  dead-code-eliminated by spec-constant folding; if NSG > 1 the per-
  subgroup subgroupAdd results are written to shared memory (tq_sh_red)
  and stitched with an [[unroll]]-ed sum. Replaces every subgroupAdd() in
  the TBQ/PQ/norm-correction paths.
* QJL sign-bit pack: when SG_SIZE >= TQ_WG the original subgroupBallot
  fast path runs; when SG_SIZE < TQ_WG it falls back to atomicOr into a
  shared uint array and a serial write-out. Same fast-path guard lets
  specialization fold the slow branch away when SG_SIZE == 32.
* SG_SIZE > TQ_WG (e.g. AMD wave64 with WG=32) is treated as NSG == 1
  via clamp(SG_SIZE, TQ_WG) in tq_wg_add, so those devices take the fast
  path even though half the wave is masked off.

Host plumbing (ggml-vulkan.cpp)
-------------------------------
* vk_device_struct grows a tbq_copy_sg_size field (0 = no override).
* Device init reads GGML_VK_TBQ_COPY_SG_SIZE from env, validates against
  {4, 8, 16, 32, 64} intersected with the device's
  [subgroup_min_size, subgroup_max_size], and emits a structured
  "tbq_copy_sg_size_status requested=R applied=A reason=X" line so tests
  can tell whether the override was applied or rejected (distinct from
  success/failure of the run itself).
* ggml_vk_load_shaders picks the (SG_SIZE spec const, requiredSubgroupSize)
  pair used for every CPY-to-quant and SET_ROWS-to-quant pipeline:
    - if the env override is set: that value
    - else if the device supports size control: mul_mat_subgroup_size
    - else: 0 (shader default SG_SIZE=32, no required size) -- matches
      pre-patch behaviour on drivers without VK_EXT_subgroup_size_control.
  The two-element spec-const vector is {0, SG_SIZE} for the plain CPY
  path (slot 0 is ignored by generic_unary_head.glsl) and {1, SG_SIZE}
  for SET_ROWS (slot 0 is `norepeat`, always 1).
* Adds a device-selection opt-in GGML_VK_ALLOW_CPU_DEVICES=1 so tests can
  pick up software Vulkan ICDs (lavapipe, SwiftShader) that ggml-vulkan
  normally filters out. Production code never sets this env var and the
  behaviour is unchanged when it isn't set.

New test (tests/test-copy-tbq-subgroups.cpp + CMakeLists)
---------------------------------------------------------
Self-spawning C++ test that for each (SG in {0, 4, 8, 16, 32, 64}, type,
shape) triple runs GPU quantize, compares against a CPU
ggml_quantize_chunk reference, and reports byte-mismatch + dequant NMSE
+ throughput. Key design choices:
  * Self-spawn (popen of --child N with a different
    GGML_VK_TBQ_COPY_SG_SIZE value per child) because the env var is
    consumed once at device init and can only be changed across processes.
  * Parses the structured status line from the backend to distinguish
    "applied" from "rejected" rows. Rejected rows are labelled
    SKIP-<reason> in the per-case table and excluded from the
    NMSE-spread assertion (they are duplicates of sg=0 and don't add
    independent coverage). Prior phrasing that labelled them OK was
    misleading.
  * --types comma-separated filter keeps the default CI run fast by
    iterating only a subset of TBQ/PQ types.
  * Shared pass/fail rule: nmse(gpu vs cpu) <= 1e-6 for every applied
    SG; the per-case table stays OK on the legs that couldn't exercise
    the stitch path on the host GPU.

Cross-subgroup-size coverage via lavapipe (tests/test-turboquant.sh)
--------------------------------------------------------------------
Real desktop GPUs (NVIDIA, AMD RDNA, Apple, most Adreno) have
minSubgroupSize >= 32, so VK_EXT_subgroup_size_control cannot request the
smaller subgroups the stitch path was written for. To actually exercise
NSG > 1 in CI, the script now also runs the test under lavapipe (Mesa's
CPU Vulkan driver) at LP_NATIVE_VECTOR_WIDTH in {128, 256, 512}, which
gives native subgroupSize {4, 8, 16} respectively and therefore covers
every distinct NSG branch the shader supports:

    LP_NATIVE_VECTOR_WIDTH | lavapipe SG | NSG (= TQ_WG / SG)
    -----------------------+-------------+--------------------
         128               |      4      |  8  (8-way stitch)
         256               |      8      |  4  (4-way stitch)
         512               |     16      |  2  (2-way stitch)

Combined with the native-GPU leg (NSG=1, fast path), this gives full
coverage of the helper's {1, 2, 4, 8} NSG branches on any host.

Usage and modes
---------------
  tests/test-turboquant.sh          # short mode (default): CI-friendly
  tests/test-turboquant.sh --full   # all TBQ/PQ types, full matrix

Short mode restricts the SG-coverage legs to tbq3_0 / pq3_0 / *_64 to keep
default CI runtime bounded; full mode covers all 8 TBQ/PQ types. Both
modes render a Unicode-boxed summary table at the end covering every
subgroup-coverage leg that ran.