
QVAC-14555: TurboQuant (Vulkan): KV cache quantization (TBQ3_0 / TBQ4_0 / PQ3_0 / PQ4_0)#115

Open
jesusmb1995 wants to merge 16 commits into tetherto:temp-7248 from jesusmb1995:turboquant

Conversation

@jesusmb1995 jesusmb1995 commented Mar 27, 2026

Summary

Implements TurboQuant KV cache quantization (Zandieh et al., ICLR 2026) for CPU and Vulkan backends with full Flash Attention support. Compresses KV cache to 3.25-4.25 bits per value, enabling ~4-5x larger context windows on the same hardware.

Paper: https://arxiv.org/pdf/2504.19874
Community discussion:

Recommended configurations:

  • High compression + speed: K=pq3_0 V=pq3_0 — codebook-only, no QJL overhead. Minimal PPL/speed loss at 3.25 bpw with a small retrieval quality trade-off on long contexts.
  • High compression + long-context quality: K=tbq3_0 V=pq3_0 — QJL-corrected keys with codebook-only values. Best retrieval accuracy at 3.75 avg bpw, with a moderate speed cost from QJL correction in the FA shader.

Features

  • Full set of TurboQuant types: tbq3_0, tbq4_0, pq3_0, pq4_0 (and _64 variants)
  • Automatic head_dim detection (64 vs 128) — user specifies pq3_0, internal type auto-selects
  • Coopmat1 and Coopmat2 Flash Attention support (noticeable prefill speedup)
  • Pre-compiled fused Flash Attention shaders for mixed K/V types (asymmetric compression)
  • QJL Stage 2 correction in all FA paths (scalar, cm1, cm2)
  • Comprehensive test/benchmark scripts (perplexity, throughput, RULER)
  • Cooperative copy_to_quant Vulkan path for TBQ/PQ (faster KV writes)

How does TurboQuant work?

Random rotations spread values evenly across coordinates, preventing concentration on a few axes where zero-coordinates waste bits. In high dimensions, the marginal distribution of each coordinate of a unit-sphere vector follows a Beta distribution that converges to N(0, 1/d) as d grows. The algorithm exploits this by placing Lloyd-Max codebook centroids at optimal positions for this known distribution, minimizing MSE reconstruction error. Centroids are found by solving a continuous 1-dimensional k-means problem.

An additional QJL correction step (Stage 2) reduces bias in dot-product estimation. It quantizes the residual error from Stage 1 to 1-bit by storing only the signs of the residual vector after applying a random rotation (Hadamard × sign diagonal). Since only signs are stored (no centroid rounding), the paper proves this yields an unbiased dot-product estimator. This step is important for maintaining retrieval quality on long contexts.

Optimization details

  • Hadamard instead of dense rotation: Rotations based on Hadamard use the butterfly pattern in O(d log d) instead of O(d²). The Hadamard transform itself is deterministic, but composing it with a random sign diagonal restores randomness while keeping the transform orthogonal and invertible.

  • Dense rotation for K/V/Q at graph level, FHT in shader for QJL: At block sizes d=64/128, O(d²) is negligible and utilizes better GPU parallelism for the graph-level rotation. The butterfly FHT is used inside the Flash Attention shader for the QJL projection, avoiding the need to copy a dense matrix into the shader (which would add memory pressure). Since there is no Q cache, the QJL projection of Q must be recomputed every step to apply corrections against the 1-bit signs stored in K blocks.

Type     Bits/val   Block size   Compression vs FP16   Description
q4_0     4.50       18 B         3.5x                  Baseline: 16 linear values
pq3_0    3.25       52 B         4.9x                  8 Lloyd-Max centroids
pq4_0    4.25       68 B         3.8x                  16 Lloyd-Max centroids
tbq3_0   4.25       68 B         3.8x                  8 centroids + QJL correction
tbq4_0   5.25       84 B         3.0x                  16 centroids + QJL correction

Implementation overview

  • vulkan-shaders-gen.cpp — orchestrates SPIR-V compilation of all variant combos
  • ggml-vulkan.cpp — host-side: creates pipeline objects, dispatches compute

TurboQuant KV cache shader flow (TBQ/PQ is ONLY a KV cache type, never model weights):

STEP 1: Write to cache (same for all paths)

  • copy_to_quant.comp: float K/V → TBQ/PQ quantized blocks
    • L2 norm, codebook binary search, 3/4-bit index packing
    • TBQ only: also computes QJL residual (qjl[], d_r)
    • PQ only: no QJL, smaller block, faster

STEP 2: Read cache at attention time (paths diverge here)

PATH A: Scalar Flash Attention (broad HW support, baseline)

  • flash_attn.comp
  • Includes: types.glsl, tq_utils.comp (via flash_attn_base.glsl), dequant_funcs.glsl
  • Dequantizes K/V inline, element by element
  • For TBQ/PQ K: uses centroid-gather optimization (reorders Q·K into per-centroid partial sums)
  • For TBQ K only: applies QJL correction to attention scores
  • Full fused kernel: QK^T → softmax → PV → output

PATH B: Cooperative matrix v1 Flash Attention (KHR, cross-vendor)

  • flash_attn_cm1.comp
  • K is fully dequantized into shared memory, then coopMatMulAdd for K·Q^T (subgroup-scope 16×16 tiles)
  • P·V accumulation is still scalar with inline dequant
  • Same QJL correction as scalar (applied to sfsh[] after coopmat store)

PATH C: Cooperative matrix v2 Flash Attention (NV only, most efficient)

  • flash_attn_cm2.comp
  • K and V loaded via coopMatLoadTensorNV with decode callback (dequant-on-load, no shared memory staging)
  • Both K·Q^T and P·V use coopMatMulAdd (workgroup-scope matrices)
  • QJL correction via raw byte reads from data_k[] with hardcoded byte offsets per type

PATH D: No-FA fallback, small N (MUL_MAT with N ≤ 8, e.g. decode)

  • mul_mat_vec_tbq3_0.comp / mul_mat_vec_tbq4_0.comp
  • Fused dequant + dot product, no centroid gather
  • QJL correction applied in the same kernel

PATH E: No-FA fallback, large N (K·Q MUL_MAT with N > 8, e.g. prefill)

  • This is the path exercised by -fa off with a TBQ/PQ K cache. Only the K·Q matmul is affected: V stays f16 under -fa off (upstream guard), so V·A stays on the existing f16 path.
  • Stage 1: mul_mm.comp runs with TBQ/PQ load_a_to_shmem — centroid dequant × d into shared memory, then generic tiled matmul (scalar / cm1 pipelines; cm2 falls through to cm1/scalar since no _mat_f16 cm2 shader exists for TBQ/PQ).
  • Stage 2 (TBQ only): mul_mm_tbq_qjl_correction.comp is dispatched after the main matmul as an additive pass — one workgroup per (row, col, batch), QUANT_K threads running the same Walsh–Hadamard + QJL dot product as the vec shader, accumulating d_r · √(π/2) / QUANT_K · sum_qjl(H(B)) into D.
  • PQ has no Stage 2 (no qjl[] / d_r), so Stage 1 alone is exact.
  • Requires B (src1) as f32; the scheduler is expected to feed f32 on this path. f16 src1 for standalone TBQ MUL_MAT reports not supported and falls back to CPU.
  • Fixes external review Issue 3 on PR #115: before this patch, supports_op claimed TBQ/PQ MUL_MAT support on cm2 devices (RTX 5090) but had no pipeline behind it, so the correctness run segfaulted. tests/test-backend-ops.cpp now covers all 8 TBQ/PQ types × n ∈ {1,8,16,32} as a repro.
  • Non-dim01-contiguous quantized src0 (permuted layouts) is now routed to the matrix path as well, so TBQ/PQ MUL_MAT works regardless of src0 stride pattern.

Example usage

llama-cli -m model.gguf --cache-type-k tbq3_0 --cache-type-v pq3_0
llama-cli -m model.gguf --cache-type-k pq3_0 --cache-type-v pq3_0

Works transparently with both head_dim=128 (Llama-3.1, Qwen, Mistral) and head_dim=64 (Llama-3.2-1B/3B) — the right block size is auto-selected.

Results / testing

Please see Asana for latest available data: https://app.asana.com/1/45238840754660/project/1212638335655939/task/1214143691877486

Will comment here with a public report when results can be shared.

PR for testing integration on LLM Addon: tetherto/qvac#1564

Limitations

  • head_dim must be 64 or 128. Codebooks and Hadamard transform are pre-computed for these dimensions.
  • d=64 quality is poor on small models — expected, as KV cache quantization generally degrades more on small models.
  • Metal shaders and vectorized CPU not yet implemented.
  • Optimized Flash Attention shaders require K to be PQ or TBQ, and V to be PQ, TBQ, Q4, Q8, or F16.
  • Quantized V with -fa off is not supported by this PR. Upstream llama_init_from_model rejects quantized V when flash attention is disabled ("V cache quantization requires flash_attn"), and that guard is intentionally left in place. The -fa off K·Q MUL_MAT fix in this PR would extend cleanly to A·V for a quantized V as well, but the v_trans V-cache layout used under -fa off is populated by ggml_set_rows with row_size=1, which corrupts any blck_size > 1 type at write time (reproducible on CPU as well, independent of backend). Fixing that is a KV-cache refactor out of scope here; the guard will be revisited once that lands.

TBQ / PQ Vulkan support matrix

What runs on GPU vs. what is refused at context init, across FA on/off on dense and MoE models. The MoE-KV-cache rows behave the same as dense because attention itself is plain MUL_MAT / FLASH_ATTN_EXT, not MUL_MAT_ID; MoE routing (MUL_MAT_ID) only applies to the FFN weights, which are never stored as TBQ/PQ.

Scenario | FA | K-type | V-type | Path | Status
Dense / MoE — KV cache | on | tbq3/4_0 or pq3/4_0 | pq/tbq/q4_0/q8_0/f16 | Fused FA (scalar / cm1 / cm2), QJL in kernel | Full GPU
Dense / MoE — KV cache | on | tbq3/4_0 or pq3/4_0 | other quantized (q5_0, q4_1, iq4_nl, k-quants, …) | No matching Vulkan FA pipeline → per-layer backend split | Runs, but attention falls back to CPU
Dense / MoE — KV cache | off | tbq3/4_0 or pq3/4_0 | f16 | K·Q via mul_mm.comp + QJL correction; V·A on the existing f16 path | Full GPU (Path E)
Dense / MoE — KV cache | off | tbq3/4_0 or pq3/4_0 | any quantized type (incl. tbq/pq) | — | Context init refused (upstream FA-off rule: "V cache quantization requires flash_attn")

Notes:

  • Head dimensions of both 128 and 64 are supported; the _64 block variants (tbq*_0_64, pq*_0_64) have their own pipelines, codebooks, and sign tables.
  • MoE FFN weights are not in this table on purpose: TBQ/PQ are KV-cache quantizations only (llama-quantize has no TBQ/PQ target, and no GGUF stores FFN experts in those types), so MUL_MAT_ID never receives TBQ/PQ src0. Attention in MoE models is a plain MUL_MAT / FLASH_ATTN_EXT and therefore falls under the "KV cache" rows above.

Remaining work

  • SIMD optimization — AVX2/NEON for CPU quantize/dequantize
  • Metal shaders — Apple GPU backend support
  • 2-bit variant — even higher compression
  • Direct cosine similarity evaluation

@jesusmb1995 jesusmb1995 self-assigned this Mar 27, 2026
@jesusmb1995 jesusmb1995 changed the title from "Draft: TurboQuant" to "TurboQuant: KV cache quantization with Hadamard transform (TQ3_0 / TQ4_0)" Mar 27, 2026
@jesusmb1995

This comment was marked as outdated.

@jesusmb1995

This comment was marked as outdated.

@jesusmb1995 jesusmb1995 force-pushed the turboquant branch 2 times, most recently from 69522fb to 6497a86 Compare March 31, 2026 16:37
@jesusmb1995 jesusmb1995 changed the title from "TurboQuant: KV cache quantization with Hadamard transform (TQ3_0 / TQ4_0)" to "TurboQuant: KV cache quantization with Hadamard transform (TBQ3_0 / TBQ4_0)" Mar 31, 2026

zoq commented Apr 1, 2026

Are you planning to merge this before the rebase to the latest version of llama.cpp?

@jesusmb1995 jesusmb1995 force-pushed the turboquant branch 2 times, most recently from f7ba069 to 9d2a659 Compare April 7, 2026 18:20

jesusmb1995 commented Apr 7, 2026

Are you planning to merge this before the rebase to the latest version of llama.cpp?

Not particularly. If the rebase to the latest version of llama.cpp happens soon, I will change the target to the correct temp branch. I think it's better if I target the latest llama.cpp version.

Edit: @zoq Since it seems we want this merged in about 1-2 weeks, I will target this version for now. Yes, planning to merge this before the rebase.

@gianni-cor

This comment was marked as resolved.

gianni-cor pushed a commit to gianni-cor/qvac-fabric-llm.cpp that referenced this pull request Apr 18, 2026
The `GG_BUILD_LOW_PERF` → `-DGGML_NATIVE=OFF` append was placed inside
`gg_run_ctest_release`, but the top-level driver runs `gg_run
ctest_debug` first and `gg_run ctest_release` second. As a result the
debug build on the low-perf CI runners (`ggml-ci-x64-cpu-low-perf`
and `ggml-ci-arm64-cpu-low-perf`) was compiled with `-march=native`
against the build host's CPU and then executed on a different,
older-microarch runner in the pool, producing SIGILL during
ctest_debug. Ref PR tetherto#115 CI run 24463228694.

Move the append into the top-level flag-handling block, right after
`GG_BUILD_NO_SVE`, so `CMAKE_EXTRA` gets `-DGGML_NATIVE=OFF` once,
before either ctest function is invoked, and both debug and release
builds pick it up. Remove the duplicate inside `gg_run_ctest_release`.

No workflow change required: `.github/workflows/build.yml` already
exports `GG_BUILD_LOW_PERF=1` for the two low-perf jobs, which was
correct; the bug was purely a scoping error in `ci/run.sh`. The other
`GG_BUILD_LOW_PERF` checks in the script (ctest label filter, and the
top-level branches that skip heavier test functions) are left
untouched — they were already at the correct scope.
gianni-cor added a commit to gianni-cor/qvac-fabric-llm.cpp that referenced this pull request Apr 18, 2026
The per-thread TBQ/PQ quantize shader was single-threaded per block —
one lane normalized 128 values, ran the FHT serially, and packed the
QJL sketch bit-by-bit, with three float[128] private arrays spilling
to GPU private memory. On a 5090 this capped the tbq3_0 / tbq4_0 write
throughput at ~80 GB/s (~4 % of peak).

Switch to a cooperative shader that treats one workgroup (32 lanes ==
one subgroup on NVIDIA) as one block:

  - norm, norm-correction and residual-norm reductions use subgroupAdd
  - the Fast Hadamard Transform runs log2(BK) passes with the BK/2
    butterflies in each pass spread across the 32 threads, separated
    by a single barrier() each
  - the QJL sign sketch is packed with subgroupBallot (32 bits per
    call, written as four bytes directly) instead of 128 serial OR
    into memory
  - scratch moves from private arrays to shared memory (tq3_sh_x,
    tq3_sh_idx, tq3_sh_proj, and tq4_* analogues)

On the host side, the 32 TBQ/PQ cpy_f32_quant pipelines drop their
wg_denoms from {32,1,1} to {1,1,1} so that "one workgroup == one
block", and the shader's CPY main() picks up a matching TQ_COOP branch
that drops the *32 + gl_LocalInvocationID.x offset from the block
index decode.

The GGML_OP_SET_ROWS dispatch path also needs to know about the new
"one workgroup per block" contract: for TBQ/PQ dst types, divide ne
by ggml_blck_size(dst) instead of 32 * ggml_blck_size(dst). Without
this gate the set_rows kernel dispatched only 1/32 of the required
workgroups, silently leaving 31 out of 32 KV-cache blocks uninitialized
and driving perplexity on Mistral-7B-Instruct-v0.3 from ~5.9 to ~1090
with no visible failure from llama-bench or test-backend-ops (the
CPY tests only exercise GGML_OP_CPY, which already had the /blck_size
rule). Unrelated types keep the /32/blck_size rule so q4_0, q8_0 etc.
behave exactly as before.

Measured on 2x RTX 5090, Vulkan 1.4.321, PR tetherto#115 tip b23276f.

test-quantize-perf -b vulkan, 4 MiB input, 500 iters:

  type      baseline avg   optimized avg   avg speedup
  tbq3_0    187.5 us       42.7 us        4.39x
  tbq4_0    192.9 us       44.3 us        4.35x
  pq3_0      68.6 us       44.8 us        1.53x
  pq4_0      84.9 us       44.9 us        1.89x

llama-bench on Mistral-7B-Instruct-v0.3 Q4_K_S, -fa 1, -r 3:

  K / V              pp2048 base -> opt         tg128 base -> opt
  tbq3_0 / pq3_0     9764 -> 9880   +1.2 %      179.1 -> 206.9  +15.5 %
  pq3_0  / pq3_0    15396 -> 15653  +1.7 %      190.5 -> 214.0  +12.3 %
  tbq4_0 / pq4_0     9568 -> 9782   +2.2 %      164.9 -> 205.1  +24.4 %

llama-perplexity on wikitext-2 test, 40 chunks, seed 42,
Mistral-7B-Instruct-v0.3 Q4_K_S, -fa 1:

  K / V              baseline PPL             optimized PPL
  f16    / f16       5.8254 +/- 0.13612       5.8254 +/- 0.13612  (control)
  tbq3_0 / pq3_0     5.9333 +/- 0.13811       5.9129 +/- 0.13747
  pq3_0  / pq3_0     5.9806 +/- 0.13879       5.9894 +/- 0.13918
  tbq4_0 / pq4_0     5.8646 +/- 0.13707       5.8570 +/- 0.13679

All PPL deltas are an order of magnitude smaller than the 95 % CI and
come from FP associativity in the subgroup tree reduction vs the
previous sequential sum. No algorithmic change.

tests/test-turboquant.sh: 112/112 backend-op tests still pass on both
GPUs. test-quantize-fns reproduces the 4 pre-existing _64 roundtrip
failures with bit-identical error magnitudes (0.010805, 0.009384) —
that is the wrong-codebook bug from PR tetherto#115 review, not introduced or
fixed here.

Made-with: Cursor
@gianni-cor

This comment was marked as resolved.

build_attn_inp_kv_impl() and build_attn_inp_kv_iswa() allocated
inp->self_rotk as an nrot x nrot Hadamard where
  nrot = largest power-of-two that divides n_embd_head_k (>= 64)
but allocated inp->self_rotv as a fixed 64x64 tensor, independent of
n_embd_head_v. Because ggml_rotate_hadamard() reshapes its input using
rot->ne[0] as the inner dim, an n_embd_head_v = 128 vector was rotated
as two independent 64-d halves, i.e. block_diag(H64, H64), instead of a
full H128. The d=128 PQ/TBQ codebooks in ggml-quants.c are Lloyd-Max
fitted to the coordinate distribution of a full d=128 random orthogonal
rotation (sigma ~ 1/sqrt(128)), so the narrower-than-expected 64-d
rotation left the codebook ~2x too narrow per coordinate and silently
inflated V reconstruction error. This hits essentially every recent
dense model with head_dim_v = 128 (Llama 3, Mistral, Qwen2.5, ...)
whenever a TBQ/PQ V cache is selected, since resolve_tq_type() keeps
the non-_64 variants for head_dim = 128. Reported by @gianni-cor in
tetherto#115 with a repro showing ~1.40x worse
3-bit and ~1.50x worse 4-bit V reconstruction at alpha = 0.10.

Factor the sizing + allocation out into a file-local helper
build_hadamard_rot(ctx, can_rot, n_embd_head) that applies the same
"largest power-of-two that divides n_embd_head, starting at 64" rule
used for self_rotk, and returns nullptr when can_rot is false. Call it
for both K and V in build_attn_inp_kv_impl() and
build_attn_inp_kv_iswa(), which makes self_rotv correctly 128x128 for
head_dim_v = 128 and keeps self_rotk behavior unchanged. No other
call site allocates self_rot{k,v}, so the rotation is now symmetric
across K and V and across the SWA and non-SWA builders.

Net change is -16 lines: two 17-line if/else blocks per builder
collapse into two one-liners.

diff --git a/ci/run.sh b/ci/run.sh
index 7fa469b..5c2f7e7 100755
--- a/ci/run.sh
+++ b/ci/run.sh
@@ -118,6 +118,15 @@ if [ ! -z ${GG_BUILD_NO_SVE} ]; then
     CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8.5-a+fp16+i8mm"
 fi

+# Disable native CPU optimizations for low-perf builds to ensure binary
+# compatibility with the (often heterogeneous) CI runner pool. Must be applied
+# at the top level so BOTH gg_run_ctest_debug and gg_run_ctest_release pick it
+# up — otherwise the debug build (which runs first) compiles with -march=native
+# and can SIGILL on a runner whose microarch is older than the build host.
+if [ ! -z ${GG_BUILD_LOW_PERF} ]; then
+    CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF"
+fi
+
 if [ -n "${GG_BUILD_KLEIDIAI}" ]; then
     echo ">>===== Enabling KleidiAI support"

@@ -236,11 +245,6 @@ function gg_run_ctest_release {
     # Check cmake, make and ctest are installed
     gg_check_build_requirements

-    # Disable native CPU optimizations for low-perf builds to ensure compatibility
-    if [ ! -z ${GG_BUILD_LOW_PERF} ]; then
-        CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF"
-    fi
-
     (time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
     (time make -j$(nproc)                                    ) 2>&1 | tee -a $OUT/${ci}-make.log

diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 247775b..bcc1759 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -49,6 +49,25 @@ static ggml_tensor * ggml_rotate_hadamard(
     return res;
 }

+// Allocate the Hadamard rotation input used by ggml_rotate_hadamard() for a
+// TurboQuant/PolarQuant K or V stream. Size is the largest power-of-two that
+// divides n_embd_head (>= 64), so the rotation matches the head dim and the
+// PQ/TBQ codebooks see the full d-wide rotated distribution they were fitted
+// to. Returns nullptr when can_rot is false.
+static ggml_tensor * build_hadamard_rot(ggml_context * ctx, bool can_rot, int n_embd_head) {
+    if (!can_rot) {
+        return nullptr;
+    }
+
+    int nrot = 64;
+    do { nrot *= 2; } while (n_embd_head % nrot == 0);
+    nrot /= 2;
+
+    ggml_tensor * rot = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, nrot, nrot);
+    ggml_set_input(rot);
+    return rot;
+}
+
 void llm_graph_input_embd::set_input(const llama_ubatch * ubatch) {
     if (ubatch->token) {
         const int64_t n_tokens = ubatch->n_tokens;
@@ -1626,30 +1645,13 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
             hparams.n_embd_head_k % 64 == 0 &&
             ggml_is_quantized(mctx_cur->type_k());

-        if (can_rotk) {
-            int nrot = 64;
-            do { nrot *= 2; } while (hparams.n_embd_head_k % nrot == 0);
-            nrot /= 2;
-
-            inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotk);
-        } else {
-            inp->self_rotk = nullptr;
-        }
-
         const bool can_rotv =
             !hparams.is_n_embd_v_gqa_variable() &&
             hparams.n_embd_head_v % 64 == 0 &&
             ggml_is_quantized(mctx_cur->type_v());

-        if (can_rotv) {
-            int nrot = 64;
-
-            inp->self_rotv = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotv);
-        } else {
-            inp->self_rotv = nullptr;
-        }
+        inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
+        inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);
     }

     return inp;
@@ -1947,30 +1949,13 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
             hparams.n_embd_head_k % 64 == 0 &&
             ggml_is_quantized(mctx_cur->get_base()->type_k());

-        if (can_rotk) {
-            int nrot = 64;
-            do { nrot *= 2; } while (hparams.n_embd_head_k % nrot == 0);
-            nrot /= 2;
-
-            inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotk);
-        } else {
-            inp->self_rotk = nullptr;
-        }
-
         const bool can_rotv =
             !hparams.is_n_embd_v_gqa_variable() &&
             hparams.n_embd_head_v % 64 == 0 &&
             ggml_is_quantized(mctx_cur->get_base()->type_v());

-        if (can_rotv) {
-            int nrot = 64;
-
-            inp->self_rotv = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotv);
-        } else {
-            inp->self_rotv = nullptr;
-        }
+        inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
+        inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);
     }

     return (llm_graph_input_attn_kv_iswa *) res->add_input(std::move(inp));
@jesusmb1995

This comment was marked as resolved.

Register explicit `MUL_MAT` test cases for the TurboQuant / PolarQuant
types (`tbq3_0`, `tbq4_0`, `pq3_0`, `pq4_0`) with `type_b ∈ {f32, f16}`
and sizes that span both dispatch paths:

  - n=1, n=8   -> mul_mat_vec path (decode-like)
  - n=16, n=32 -> dequant + f16 matmul path (prefill-like)

Motivation: the existing TBQ/PQ coverage in `test-backend-ops` only
registers `FLASH_ATTN_EXT` cases, so the `tests/test-turboquant.sh`
filter (`test-backend-ops test -p "tbq|pq"`) never exercises the
standalone `MUL_MAT` path. That path is the one reported as
`supports_op == yes` on NV `VK_NV_cooperative_matrix2` devices but has
no matching pipeline created in `pipeline_dequant_mul_mat_mat_f16[]`
(see `ggml-vulkan.cpp:3412` — "TBQ/PQ cm2 matmul shaders not yet
generated"). With these cases in place:

  - On NV coopmat2 (RTX 5090): the n>=16 cases segfault, matching the
    external reproduction from the PR ggml-org#115 review and making the bug
    visible to CI.
  - On KHR coopmat1 (AMD gfx1150): the n>=16 cases return numerical
    garbage (err ~ 1.0), exposing the same missing-fallback issue in
    a non-crashing form.
  - The n=1/n=8 cases continue to pass via the existing
    `mul_mat_vec_tbq*_0` / `mul_mat_vec_pq*_0` shaders, so the new
    coverage cleanly isolates which dispatch path is broken.

No source changes to the Vulkan backend; this commit only adds the
test cases needed so the pre-existing bug is caught by
`tests/test-turboquant.sh`.
…d=64 variants

Before this patch, `supports_op` reported TBQ3_0/TBQ4_0/PQ3_0/PQ4_0 (and
their `_64` / head_dim=64 variants) as supported for `GGML_OP_MUL_MAT` on
the Vulkan backend, but there was no working pipeline behind it. This is
the state the external review flags as Issue 3 on PR ggml-org#115: on cm2 (RTX
5090) the support probe claims support, the correctness run then
segfaults. The previous `test-backend-ops` commit adds the exact repro
for this.

Root causes:

  - On cm2 (NV coopmat2) devices the slot in
    `pipeline_dequant_mul_mat_mat_f16[]` was empty - the shader was
    never generated - so dispatches crashed when flash attention was
    not used.

  - On cm1 and scalar devices the slot in
    `pipeline_dequant_mul_mat_mat[]` was wired up, but `mul_mm_funcs.glsl`
    had no `load_a_to_shmem` implementation for TBQ/PQ. The generic
    `mul_mm.comp` ran with uninitialized shared memory and produced
    near-random output.

  - Even once data loading was fixed, TBQ3_0/TBQ4_0 still produced a
    small bias for `n > mul_mat_vec_max_cols` because the QJL Stage 2
    correction that `mul_mat_vec_tbq*_0.comp` applies in the vec path
    has no equivalent in the generic matmul shader.

  - For head_dim=64 models that use the `_64` block variants
    (TBQ*_0_64 / PQ*_0_64) the `_64` mul_mm pipelines, the `_64` QJL
    correction shaders, and the `_64`-sized Lloyd-Max codebook / sign
    arrays were missing, and the vec path is intentionally skipped for
    `_64` (so every `n` needs the full matmul + QJL correction pair).

Scope of this patch is strictly the standalone `MUL_MAT` path with TBQ/PQ
`src0` and f32 `src1` that the Issue 3 repro hits, i.e. the `-fa off` K
matmul. Fused flash attention (scalar, cm1, cm2) already handles QJL
correctly and is unchanged. MoE FFN weights are not affected either:
TBQ/PQ are KV-cache quantizations (there is no `llama-quantize` target
that produces TBQ/PQ model weights), so `MUL_MAT_ID` never sees them as
`src0` - attention in MoE models is a plain `MUL_MAT` /
`FLASH_ATTN_EXT` and reuses the same fix.

The upstream "V cache quantization requires flash_attn" context-level
guard in `src/llama-context.cpp` is intentionally left unchanged: the V
matmul under `-fa off` uses a transposed quantized-V layout populated
by `ggml_set_rows` with row_size=1, which corrupts any `blck_size > 1`
type at write time (reproducible on CPU as well), and that is a
separate KV-cache issue out of scope here.

Changes:

  - Add `load_a_to_shmem` implementations for TBQ3_0, TBQ4_0, PQ3_0,
    PQ4_0 (and their `_64` variants) in `mul_mm_funcs.glsl`, reusing
    `tbq3_dequant_raw` / `tbq4_dequant_raw` from `tq_utils.comp`.
    This makes `mul_mm.comp` correct for the centroid part of
    dequantization (`tbq*_dequant_raw(qs) * d`) on all eight types.

  - `tq_utils.comp`: pick Stage-1 / QJL-Stage-2 sign bitmasks and the
    Lloyd-Max codebook (TBQ3_CB / TBQ4_CB) based on whether any
    `DATA_{A,K,V}_*_0_64` is defined. d=64 blocks use seeds 43/139 and
    a wider codebook (sigma = 1/sqrt(d) is larger at d=64 than at
    d=128); previously the shader hardcoded the d=128 constants, so
    the d=64 variants silently dequantized against the wrong codebook.

  - New shader `mul_mm_tbq_qjl_correction.comp`. It runs after the
    main matmul as an additive pass: one workgroup per
    `(row, col, batch)`, `QUANT_K` threads performing the same
    Walsh-Hadamard butterfly + `qjl[]` dot product as the vec shader,
    and accumulates `d_r * sqrt(pi/2) / QUANT_K * sum_qjl(H(B))` into
    D. Parameterized over `QUANT_K` so the same source emits both
    `_128` and `_64` SPIR-V. Only TBQ3_0 and TBQ4_0 (and `_64`) have
    `d_r`/`qjl`, so only those four get a correction pipeline.

  - `vulkan-shaders-gen.cpp`:
      * Register the eight correction variants
        (`mul_mm_qjl_{tbq3_0,tbq4_0}{,_64}_{f32,f16}`).
      * Emit `matmul_{tbq,pq}{3,4}_0_64_{f32,f16}[_aligned]` for
        `mul_mm.comp`, in a dedicated block outside the main
        `type_names` loop so we don't cascade through FA / MUL_MAT_ID /
        get_rows / ... which either already have dedicated `_64`
        handling (FA) or don't apply to TBQ/PQ at all.

  - `ggml-vulkan.cpp`:
      * Add `pipeline_mul_mm_tbq_qjl[GGML_TYPE_COUNT][2]` on the device
        and create pipelines at init time for all four TBQ types (128
        and 64 block sizes).
      * In `ggml_vk_get_mul_mat_mat_pipeline`, let cm2 fall through to
        the cm1/scalar pipeline when no cm2 `_mat_f16` shader exists
        for a given TBQ/PQ type, so cm2 devices stop segfaulting on
        these types.
      * Register TBQ/PQ `_64` in `supports_op` `MUL_MAT` switch so
        d=64 models are actually routed to the new pipelines instead
        of falling back to CPU.
      * Force `split_k = 1` for TBQ `src0` - the QJL correction pass
        would otherwise be added once per split.
      * Dispatch the QJL correction pass after the main matmul for
        TBQ3_0 / TBQ4_0 (and `_64`). For `_128` it's gated on
        `n > mul_mat_vec_max_cols` (the vec path already corrects for
        smaller n); for `_64` it runs unconditionally because there is
        no vec path on this block size.

Verified on AMD RDNA3.5 (RADV gfx1150, no cm1/cm2) with
`test-backend-ops -o MUL_MAT -b Vulkan0`: all 32 TBQ/PQ x {128, 64} x
n in {1,8,16,32} cases pass against f32 B. f16 B for standalone
MUL_MAT on TBQ/PQ still reports `not supported` on this device (the
scalar/cm1 pipeline consumes f32 src1), which is consistent with the
matmul path shipping f32 src1 on the `-fa off` decode/prefill paths
used by these tests. cm2 verification is expected to run on the
reviewer's RTX 5090 via the Issue 3 repro branch.

debug: sentinel in QJL correction

vulkan: add dequantize-to-f16 cpy shaders so MUL_MAT with non-contiguous quantized src0 can run on GPU

vulkan: add d=64 decision boundaries for TBQ/PQ copy_to_quant

Commit 6e26e8b ("vulkan: fix TBQ/PQ standalone MUL_MAT path, QJL
correction pass, and d=64 variants") added TQ_D64-gated d=64 Lloyd-Max
codebook centroids and random sign diagonals to tq_utils.comp, so that
the Vulkan encoder, FA decoder, and non-FA QJL correction pass all match
the CPU reference's d=64 constants (ggml-quants.c: TQ{3,4}_CODEBOOK_64,
TQ/QJL_SIGN_SEED_64). It missed the corresponding d=64 *decision
boundaries* in copy_to_quant.comp, however.

The boundaries are the midpoints between adjacent codebook centroids and
determine which centroid an input coordinate is quantized to. CPU derives
them at runtime from the selected codebook via tq_compute_boundaries(),
so it automatically used the d=64 midpoints for _64 blocks. The Vulkan
encoder hard-codes them as const float TBQ3_B[7] / TBQ4_B[15], and those
constants remained at the d=128 midpoints even inside the #if block that
accepts both DATA_A_TBQ*_0 and DATA_A_TBQ*_0_64.

Net effect on a head_dim=64 model (Qwen2.5-0.5B):

  - copy_to_quant bucketed each coordinate by the narrower d=128
    boundaries (centroids spaced ~sigma=1/sqrt(128)).
  - The resulting index was then dequantized with the wider d=64
    centroids (spaced ~sigma=1/sqrt(64)), a completely different
    alphabet.
  - Every value near a boundary landed on the wrong centroid.

Before commit 6e26e8b this was silently consistent: the codebook was
also d=128, so encoder and decoder were at least in agreement (just
producing a rescaled quantization). When the codebook was fixed to d=64,
the boundaries had to move with it.

Add a #if defined(TQ_D64) branch for TBQ3_B and TBQ4_B that uses the
d=64 midpoints computed from TQ{3,4}_CODEBOOK_64. Values regenerated
with scripts/compute_tq_codebooks.py, which now also emits the GLSL
boundaries array alongside the C codebook array so future codebook
updates keep the two in sync from one source of truth.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wiki.test offset_64
(--chunks 1), Vulkan AMD RADV gfx1150:

  tbq3_0 / f16  fa=off:   1659 -> 230   (7.2x better)
  tbq4_0 / f16  fa=off:   ~    -> 300
  pq3_0  / f16  fa=off:   ~    -> 230
  pq4_0  / f16  fa=off:   ~    -> 300
  pq3_0  / f16  fa=on:    ~    -> 225  (now matches fa=off)
  pq4_0  / f16  fa=on:    ~    -> 299  (now matches fa=off)

tbq3_0/tbq4_0 fa=on still diverges from fa=off due to a separate issue
in the FA QJL Stage-2 correction on d=64 blocks, addressed in a follow-
up patch.

vulkan: read raw Q from the input SSBO for the FA QJL projection

The flash-attention shaders compute the QJL Stage-2 correction as

    correction = d_r * sqrt(pi/2) / QUANT_K * (2*pos_sum - proj_q_sum)

where (pos_sum, proj_q_sum) are reductions over FHT(D_qjl * Q).  The
scalar (flash_attn.comp) and coopmat1 (flash_attn_cm1.comp) paths used
to derive the FHT input by reading from Qf -- the shared-memory buffer
that already has the attention scale (1/sqrt(head_dim)) multiplied in
for the main Q*K dot -- and then dividing that value by p.scale to
recover the raw Q before multiplying by the QJL sign diagonal.
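Assuming pos_sum is the partial sum of the FHT output over positions whose stored QJL sign bit is set and proj_q_sum is the sum over all positions (an interpretation of the reductions above, not verified against the shader source), the 2*pos_sum - proj_q_sum term is just the signed projection:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Assumed semantics: pos_sum sums FHT output at set-bit positions, proj_q_sum
// sums all positions. Then 2*pos_sum - proj_q_sum == sum_i s_i * h_i with
// s_i = +1 for a set QJL sign bit and -1 otherwise.
static float correction_inner(const std::vector<float> & h, const std::vector<int> & bit) {
    float pos_sum = 0.0f, proj_q_sum = 0.0f;
    for (size_t i = 0; i < h.size(); ++i) {
        proj_q_sum += h[i];
        if (bit[i]) pos_sum += h[i];
    }
    return 2.0f * pos_sum - proj_q_sum;
}
```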

On cm1 Qf is f16, so the scale round-trip is lossy for the large-
magnitude activations seen in e.g. Qwen2.5-0.5B's first-layer massive
activations.  On the scalar path Qf is f32 and p.scale is usually a
power of two (1/sqrt(head_dim)), so x * p.scale / p.scale is bit-exact
in principle -- but empirically the pre-scaled-then-un-scaled read
still produced materially different FHT input than a raw-Q read.
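The f16 round-trip loss is easy to reproduce outside the shader. A sketch that emulates f16 mantissa rounding (the narrower f16 exponent range is ignored, which does not matter here) and uses head_dim=128, where 1/sqrt(head_dim) is not a power of two:

```cpp
#include <cassert>
#include <cmath>

// Round a float to the nearest value with an 11-bit significand, i.e. f16
// mantissa precision.
static float round_f16_mantissa(float x) {
    int e;
    float m = std::frexp(x, &e);               // x = m * 2^e, |m| in [0.5, 1)
    m = std::round(m * 2048.0f) / 2048.0f;     // keep 11 significant bits
    return std::ldexp(m, e);
}
```

Storing x * scale in f16 and dividing by scale afterwards does not recover x, so the FHT in the correction sees a perturbed Q.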

The standalone non-FA QJL shader (mul_mm_tbq_qjl_correction.comp) has
always read Q directly from src1 and gets correct results.  Match that
pattern in the FA path: read Q straight from the data_qv4 SSBO into
Qf_qjl_proj, bypassing Qf entirely.  cm2 already reads raw Q from
data_q directly, so it does not need the change.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wikitext-2 test
(Vulkan AMD RADV gfx1150):

    wiki.test --chunks 4 -n 128, K=tbq3_0/f16:
        fa=off         :   531
        fa=on, before  :  ~2000  (broken)
        fa=on, after   :   154

    wiki.test --chunks 4 -n 128, K=tbq4_0/f16:
        fa=off         :   207
        fa=on, after   :    77

    K=f16/f16 and K=pq*/V=f16 fa=on/off stay within 1% of each other,
    confirming the change is confined to the TBQ QJL path.

vulkan: run non-FA TBQ QJL correction on permuted src0 too

The standalone MUL_MAT QJL (Stage 2) correction pass was gated on
`!x_non_contig`, which silently skipped the correction on the no-FA
attention path because `kq = mul_mat(k, q)` feeds in K after
`ggml_permute(k, 0, 2, 1, 3)` -- a non-dim01-contiguous view of the
KV cache.  With that gate the TBQ attention on the no-FA path was
reduced to PQ Stage 1 (centroid-only), producing bit-identical
output to `pq3_0` / `pq4_0` and regressing quality vs a CPU-reference
TBQ run.  It was masked on the MUL_MAT test-backend-ops coverage
because those tests use contiguous src0.

Two changes:

  * mul_mm_tbq_qjl_correction.comp: index the A matrix by a real
    in-memory block stride instead of the `num_blocks_per_row = K /
    QUANT_K` shortcut.  `p.stride_a` and `p.batch_stride_a` are now
    interpreted as strides in BLOCK units (src0->nb[1] /
    sizeof(block) and src0->nb[2] / sizeof(block) respectively),
    matching the way `ggml_vk_flash_attn` already feeds `k_stride` /
    `k_offset` to the FA shader.  For a contiguous src0 the new
    stride equals the old num_blocks_per_row, so existing tests are
    unaffected.

  * ggml_vk_mul_mat_q_f16: compute `qjl_stride_a` /
    `qjl_stride_batch_a` from src0->nb and drop the
    `!x_non_contig` exclusion from both the descriptor-set request
    and the dispatch site.  The `qx_buf_offset` still points at the
    original (pre-permute) TBQ blocks in d_Qx, which is exactly what
    the correction pass wants -- it just needed the real strides to
    reach the right block for each (row_a, batch_id).
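The contiguous-equals-shortcut claim in the first point can be spelled out numerically (the block byte size below is a stand-in, not the real sizeof of a TBQ block):

```cpp
#include <cassert>
#include <cstddef>

// For dim01-contiguous src0, nb[1] == (K / QUANT_K) * sizeof(block), so the
// block-unit stride nb[1] / sizeof(block) reduces to the old
// num_blocks_per_row = K / QUANT_K shortcut. For a permuted view, nb[1] is
// whatever the view dictates and the shortcut would read the wrong block.
static size_t block_stride(size_t nb1, size_t block_bytes) {
    return nb1 / block_bytes;
}
```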

Both paths have to be gated identically -- if only one of them is
changed, the dispatched pipeline ends up without a descriptor set
and `vkCmdPushConstants` crashes with VK_NULL_HANDLE layout.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wiki.test --chunks 4
-n 128 (Vulkan AMD RADV gfx1150):

    K=tbq3_0 V=f16 fa=off (vs fa=off pq3_0):
        before fix:    561 == 561 (QJL silently skipped, TBQ==PQ)
        after fix :    634 != 561 (QJL running, TBQ distinct from PQ)
        CPU ref   :    546

    K=tbq4_0 V=f16 fa=off (vs fa=off pq4_0):
        before fix:    173 == 173 (QJL silently skipped)
        after fix :    133 != 173 (QJL running, 23% lower PPL)
        CPU ref   :    154

f16/f16 and pq* paths are unchanged: the gate only opens for TBQ.
FA is unaffected since it has its own inlined QJL epilogue.
…gle-GPU

The per-n_ctx wikitext slices generated by tests/test-kv-cache-quantization-perp.sh
were picked via an unseeded $RANDOM, so every run drew a fresh offset and PPL
numbers were not directly comparable across reruns. Additionally, llama-perplexity
was being invoked without --split-mode, so on multi-GPU hosts it defaulted to
splitting decoder layers across all visible devices -- introducing cross-device
numerical differences that could drift the baseline by more than the QJL/FA
signal this sweep is meant to detect.

Three changes to make PPL numbers reproducible across reruns and machines:

  * generate_offset_files(): seed $RANDOM with a fixed SLICE_SEED (default 42)
    before drawing offsets, so a fresh regeneration is byte-for-byte
    reproducible. Export SLICE_SEED=<n> to draw a different but still
    deterministic set of offsets.

  * generate_offset_files(): skip regeneration when wiki.test.offset_<n_ctx>.raw
    already exists and is non-empty; recover the original offset for the log
    line from the suffix size so the output is still informative ("reusing
    offset=... (<N> bytes)"). Pass --regen-slices (or delete the slice files)
    to force regeneration.

  * run_perplexity_once(): pass --split-mode none to llama-perplexity so PPL
    is computed on a single device regardless of how many GPUs are visible.
    Matches the default already used by tests/test-kv-cache-quantization-perf.sh
    (SPLIT_MODE="${SPLIT_MODE:-none}"), so perp and perf now agree on
    single-GPU execution.

Header comment and --help output updated to document the slice knobs. No
change to the perplexity args beyond --split-mode, so existing CSV schemas
and result filenames are unaffected.
jesusmb1995 and others added 3 commits April 21, 2026 18:44
… regression

The Vulkan CI (ubuntu-24-cmake-vulkan) fails on this pre-existing upstream
backend-op case:

  MUL_MAT(type_a=q8_0, type_b=f32, m=16, n=1, k=256, bs=[2,3],
          nr=[1,1], per=[0,2,1,3], ...)

with

  ggml-vulkan.cpp: GGML_ASSERT(ggml_vk_dim01_contiguous(src0)
                            || src0->type == F32/F16/BF16) failed

in ggml_vk_mul_mat_vec_q_f16. The underlying bug is that our change to
ggml_backend_vk_device_supports_op() -- which relaxed the non-dim01-
contiguous constraint on quantized src0 so the TBQ/PQ -fa off K*Q path can
stay on the GPU -- also advertises support for every other quant type
(q4_0, q5_0, q5_1, q8_0, iq4_nl, ...) with non-contig src0, but the small-n
vec dispatcher in ggml_vk_mul_mat still routes those cases to
ggml_vk_mul_mat_vec_q_f16, which does not implement the quant->f16 cpy
fallback and asserts on non-contig quantized src0.

This regression was not caught by tests/test-turboquant.sh because that
script filters test-backend-ops with `-p 'tbq|pq'` and our added TBQ/PQ
MUL_MAT coverage only uses per=[0,1,2,3] (identity). The upstream q8_0
permutation matrix exercises the exact shape that trips the assert.

Add a second test-backend-ops invocation to test-turboquant.sh that targets
the smallest reproducer:

  -p 'type_a=q8_0.*per=\[0,2,1,3\]'

Picking q8_0 (instead of a TBQ/PQ variant) means this check runs on any
Vulkan box without requiring a TBQ model or KV cache, and it directly
reproduces the CI failure.

Verified on AMD gfx1150 (KHR_coopmat) with the top-of-branch Vulkan that
still has the bug: test-turboquant.sh now exits non-zero locally with
"1 check(s) failed.", matching the CI failure. The follow-up Vulkan
patch adds the dispatcher fix that makes this check pass.
…ix path

Commit 6fd388c (vulkan-fix-tbq-pq-standalone) widened
ggml_backend_vk_device_supports_op(MUL_MAT) so that any quantized src0
type with a pipeline_cpy_quant_f16 entry is accepted even when it is not
dim01-contiguous. That is what lets the -fa off attention path keep
kq = mul_mat(K, Q) on the GPU when K is a permuted TBQ/PQ view of the
KV cache.

The matrix path (ggml_vk_mul_mat_q_f16) honours this: it runs the
quant->f16 cpy pipeline to dequantize the non-contig src0 before the
main matmul. But the vec path (ggml_vk_mul_mat_vec_q_f16), which the
dispatcher routes to when dst->ne[1] <= mul_mat_vec_max_cols (decode-like
n), does not: it asserts dim01-contiguous quantized src0 at the top of
the function. So any small-n MUL_MAT with a non-dim01-contiguous
quantized src0 -- e.g. the upstream backend-op coverage of
MUL_MAT(type_a=q8_0, m=16, n=1, k=256, bs=[2,3], per=[0,2,1,3]) --
slips through supports_op, gets routed to the vec path, and aborts on
the assertion. See:

  tetherto#115 (comment)

Fix by adding one clause to the dispatcher in ggml_vk_mul_mat: take the
vec path only when src0 is either non-quantized or dim01-contiguous.
Non-dim01-contiguous quantized src0 falls through to
ggml_vk_mul_mat_q_f16, which already handles it via pipeline_cpy_quant_f16.
This does not change hot paths: contiguous src0 still takes the vec path
as before, which is the overwhelmingly common case for mul_mat in
transformer graphs.
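The added clause reduces to a small gate. A model with illustrative names, not the actual ggml_vk_mul_mat code:

```cpp
#include <cassert>

// Vec path only for small n AND (non-quantized OR dim01-contiguous) src0;
// everything else falls through to the matrix path, which can dequantize a
// non-contiguous quantized src0 via pipeline_cpy_quant_f16.
static bool take_vec_path(bool quantized, bool dim01_contiguous, int n, int mul_mat_vec_max_cols) {
    if (n > mul_mat_vec_max_cols) return false;
    return !quantized || dim01_contiguous;
}
```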

Also annotate the vec path assert so a future caller that tries to send
non-contig quantized src0 there gets a loud error rather than a silent
wrong answer, and so the invariant between the dispatcher gate and the
assert is documented in both places.

Verified on AMD gfx1150 (KHR_coopmat):

  Before: tests/test-turboquant.sh exits 1 with GGML_ASSERT at
          ggml-vulkan.cpp:8105 on the q8_0 per=[0,2,1,3] smoke case
          added in the previous patch.
  After:  tests/test-turboquant.sh passes; the q8_0 per=[0,2,1,3]
          MUL_MAT cases run on the GPU through the matrix path (f32
          variants succeed, f16 variants report "not supported [CPU]"
          as before since backend-ops does not currently wire an f16
          x f16 contiguity check for quantized src0).

This also fixes the Ubuntu Vulkan CI job for the PR.
The per-thread TBQ/PQ quantize shader was single-threaded per block —
one lane normalized 128 values, ran the FHT serially, and packed the
QJL sketch bit-by-bit, with three float[128] private arrays spilling
to GPU private memory. On a 5090 this capped the tbq3_0 / tbq4_0 write
throughput at ~80 GB/s (~4 % of peak).

Switch to a cooperative shader that treats one workgroup (32 lanes ==
one subgroup on NVIDIA) as one block:

  - norm, norm-correction and residual-norm reductions use subgroupAdd
  - the Fast Hadamard Transform runs log2(BK) passes with the BK/2
    butterflies in each pass spread across the 32 threads, separated
    by a single barrier() each
  - the QJL sign sketch is packed with subgroupBallot (32 bits per
    call, written as four bytes directly) instead of 128 serial OR
    into memory
  - scratch moves from private arrays to shared memory (tq3_sh_x,
    tq3_sh_idx, tq3_sh_proj, and tq4_* analogues)
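For reference, the butterfly recurrence those log2(BK) passes implement is the standard in-place unnormalized Fast Hadamard Transform. A sequential sketch (the cooperative shader spreads the BK/2 butterflies of each pass across the 32 lanes, with a barrier() between passes):

```cpp
#include <cstddef>
#include <vector>

// In-place unnormalized FHT: log2(n) passes of paired add/sub butterflies.
static void fht(std::vector<float> & x) {
    const size_t n = x.size();               // must be a power of two
    for (size_t h = 1; h < n; h <<= 1) {
        for (size_t i = 0; i < n; i += 2 * h) {
            for (size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }
    }
}
```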

On the host side, the 32 TBQ/PQ cpy_f32_quant pipelines drop their
wg_denoms from {32,1,1} to {1,1,1} so that "one workgroup == one
block", and the shader's CPY main() picks up a matching TQ_COOP branch
that drops the *32 + gl_LocalInvocationID.x offset from the block
index decode.

The GGML_OP_SET_ROWS dispatch path also needs to know about the new
"one workgroup per block" contract: for TBQ/PQ dst types, divide ne
by ggml_blck_size(dst) instead of 32 * ggml_blck_size(dst). Without
this gate the set_rows kernel dispatched only 1/32 of the required
workgroups, silently leaving 31 out of 32 KV-cache blocks uninitialized
and driving perplexity on Mistral-7B-Instruct-v0.3 from ~5.9 to ~1090
with no visible failure from llama-bench or test-backend-ops (the
CPY tests only exercise GGML_OP_CPY, which already had the /blck_size
rule). Unrelated types keep the /32/blck_size rule so q4_0, q8_0 etc.
behave exactly as before.
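The workgroup-count rule reads as follows (a sketch; the real gate sits in the GGML_OP_SET_ROWS dispatch in ggml-vulkan.cpp):

```cpp
#include <cassert>
#include <cstdint>

// TBQ/PQ: one workgroup per block. Other quantized types keep the old rule of
// one 32-lane workgroup covering 32 blocks.
static int64_t set_rows_workgroups(int64_t ne, int64_t blck_size, bool is_tbq_pq) {
    return is_tbq_pq ? ne / blck_size : ne / (32 * blck_size);
}
```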

Measured on 2x RTX 5090, Vulkan 1.4.321, PR ggml-org#115 tip b23276f.

test-quantize-perf -b vulkan, 4 MiB input, 500 iters:

  type      baseline avg   optimized avg   avg speedup
  tbq3_0    187.5 us       42.7 us        4.39x
  tbq4_0    192.9 us       44.3 us        4.35x
  pq3_0      68.6 us       44.8 us        1.53x
  pq4_0      84.9 us       44.9 us        1.89x

llama-bench on Mistral-7B-Instruct-v0.3 Q4_K_S, -fa 1, -r 3:

  K / V              pp2048 base -> opt         tg128 base -> opt
  tbq3_0 / pq3_0     9764 -> 9880   +1.2 %      179.1 -> 206.9  +15.5 %
  pq3_0  / pq3_0    15396 -> 15653  +1.7 %      190.5 -> 214.0  +12.3 %
  tbq4_0 / pq4_0     9568 -> 9782   +2.2 %      164.9 -> 205.1  +24.4 %

llama-perplexity on wikitext-2 test, 40 chunks, seed 42,
Mistral-7B-Instruct-v0.3 Q4_K_S, -fa 1:

  K / V              baseline PPL             optimized PPL
  f16    / f16       5.8254 +/- 0.13612       5.8254 +/- 0.13612  (control)
  tbq3_0 / pq3_0     5.9333 +/- 0.13811       5.9129 +/- 0.13747
  pq3_0  / pq3_0     5.9806 +/- 0.13879       5.9894 +/- 0.13918
  tbq4_0 / pq4_0     5.8646 +/- 0.13707       5.8570 +/- 0.13679

All PPL deltas are an order of magnitude smaller than the 95 % CI and
come from FP associativity in the subgroup tree reduction vs the
previous sequential sum. No algorithmic change.

tests/test-turboquant.sh: 112/112 backend-op tests still pass on both
GPUs. test-quantize-fns reproduces the 4 pre-existing _64 roundtrip
failures with bit-identical error magnitudes (0.010805, 0.009384) —
that is the wrong-codebook bug from PR ggml-org#115 review, not introduced or
fixed here.

Made-with: Cursor
@jesusmb1995

This comment was marked as resolved.

@jesusmb1995 jesusmb1995 changed the title TurboQuant (Vulkan): KV cache quantization (TBQ3_0 / TBQ4_0 / PQ3_0 / PQ4_0) QVAC-14555: TurboQuant (Vulkan): KV cache quantization (TBQ3_0 / TBQ4_0 / PQ3_0 / PQ4_0) Apr 23, 2026
}

// Pack QJL sign bits with subgroupBallot: each ballot call contributes 32 bits
// covering positions [s*32, (s+1)*32). With WG == subgroup size, bit `lid` of

This cooperative path seems to rely on 32 threads == 1 full subgroup, but I do not see a matching required-subgroup-size request when the cpy_f32_tbq* / cpy_f32_pq* and set_rows_* pipelines are created. On devices with 8- or 16-lane subgroups, subgroupAdd() here only reduces within each subgroup and subgroupBallot() only packs part of the block, so both the norm/correction reductions and the QJL bit packing become partial. Is there a reason this is guaranteed to run only on subgroup-size-32 hardware?

Author

This was introduced by optimization 45d3b80: the shader is only correct when gl_SubgroupSize == gl_WorkGroupSize.x == 32, which is not true for all hardware (e.g. Intel Arc).

Author

Working on a generic [[unroll]] + spec-constant shader that should compile to similar bytecode for a group size of 32 when optimizations are enabled (with minor aesthetic differences).

Author

d994c9b Added a generic shader and tests. A software implementation of Vulkan is used to verify that group-size variants other than 32 or 64 are accurate against the CPU version.

Verified on the 5090 box that neither PPL nor tokens/s is affected by the change. Since the testing script reuses the same texts, PPL is exactly the same; tok/s is within noise or very close.

Before:

[7/45, ETA 9m45s] Running: K=tbq3_0 V=pq3_0 (coopmat1, large) ...
  tg=183.52±2.10 t/s
[4/45, ETA 9m06s] Running: K=pq3_0 V=pq3_0 (coopmat1, large) ...
  tg=215.74±0.43 t/s
[5/45, ETA 9m34s] Running: K=pq4_0 V=pq4_0 (coopmat1, large) ...
  tg=208.72±0.80 t/s
[15/45, ETA 7m45s] Running: K=pq3_0 V=pq3_0 (coopmat2, large) ...
  tg=223.38±0.09 t/s

  K=tbq3_0 V=pq3_0 PPL = 5.8203 (sweep±0.5987, chunk±0.2202)  (1.93±0.12s)
  K=pq3_0 V=pq3_0 PPL = 5.8461 (sweep±0.5701, chunk±0.2201)  (1.84±0.13s)

After:

[7/45, ETA 1m57s] Running: K=tbq3_0 V=pq3_0 (coopmat1, mid) ...
  tg=182.05±3.79 t/s
[4/45, ETA 2m06s] Running: K=pq3_0 V=pq3_0 (coopmat1, mid) ...
  tg=215.16±1.36 t/s
[5/45, ETA 2m03s] Running: K=pq4_0 V=pq4_0 (coopmat1, mid) ...
  tg=209.44±1.81 t/s
[15/45, ETA 1m33s] Running: K=pq3_0 V=pq3_0 (coopmat2, mid) ...
  tg=224.63±0.11 t/s

  K=tbq3_0 V=pq3_0 PPL = 5.8203 (sweep±0.5987, chunk±0.2202)  (2.00±0.12s)
  K=pq3_0 V=pq3_0 PPL = 5.8461 (sweep±0.5701, chunk±0.2201)  (1.88±0.13s)
=== Subgroup coverage summary ===
┌────────────────┬───────────────────────────────────────┬───────────────────────────────┬────────────────────────────────────┐
│ Leg            │ Subgroup size                         │ NSG                           │ Result                             │
├────────────────┼───────────────────────────────────────┼───────────────────────────────┼────────────────────────────────────┤
│ native GPU     │ device default (>=32 on typical GPUs) │ 1 (fast path on typical GPUs) │ PASSED: ran=24 skipped=24 failed=0 │
│ lavapipe W=128 │ 4                                     │ 8 (stitch)                    │ PASSED: ran=16 skipped=32 failed=0 │
│ lavapipe W=256 │ 8                                     │ 4 (stitch)                    │ PASSED: ran=16 skipped=32 failed=0 │
│ lavapipe W=512 │ 16                                    │ 2 (stitch)                    │ PASSED: ran=16 skipped=32 failed=0 │
└────────────────┴───────────────────────────────────────┴───────────────────────────────┴────────────────────────────────────┘

==========================================
 All checks passed.
==========================================

In 5acb3d5 I added additional "masking" software shaders to test the behavior of varying group sizes; since this is just for experimentation (to try different group sizes on hardware that does not natively have them), it will be reverted. All variations perform very similarly in GB/s, and sometimes a smaller WG configuration can surprisingly outperform a larger one (could be noise).

=== pq3_0 huge ===
  wg     | status |  nmse(g v c) |  nmse(g v s) |   ms/iter |      GB/s
  32(prod) | OK     |    2.189e-08 |    3.398e-02 |     0.088 |    761.99
  2      | OK     |    2.189e-08 |    3.398e-02 |     0.090 |    742.52
  4      | OK     |    2.189e-08 |    3.398e-02 |     0.093 |    721.45
  8      | OK     |    2.189e-08 |    3.398e-02 |     0.090 |    746.73
  16     | OK     |    2.189e-08 |    3.398e-02 |     0.092 |    732.50
  cpu    | REF    |            - |            - |   127.336 |      0.53

  sorted by ms/iter (informational; see header):
    wg=32(prod)        0.088 ms    761.99 GB/s  speedup vs CPU = 1447.00x
    wg=2               0.090 ms    742.52 GB/s  speedup vs CPU = 1414.84x
    wg=8               0.090 ms    746.73 GB/s  speedup vs CPU = 1414.84x
    wg=16              0.092 ms    732.50 GB/s  speedup vs CPU = 1384.09x
    wg=4               0.093 ms    721.45 GB/s  speedup vs CPU = 1369.20x
    cpu (ref)        127.336 ms      0.53 GB/s  (baseline)

=== pq3_0_64 huge ===
  wg     | status |  nmse(g v c) |  nmse(g v s) |   ms/iter |      GB/s
  32(prod) | OK     |    4.587e-08 |    3.343e-02 |     0.147 |    455.67
  2      | OK     |    4.587e-08 |    3.343e-02 |     0.147 |    456.39
  4      | OK     |    4.587e-08 |    3.343e-02 |     0.154 |    435.99
  8      | OK     |    4.587e-08 |    3.343e-02 |     0.156 |    430.11
  16     | OK     |    4.587e-08 |    3.343e-02 |     0.153 |    438.12
  cpu    | REF    |            - |            - |   129.420 |      0.52

  sorted by ms/iter (informational; see header):
    wg=32(prod)        0.147 ms    455.67 GB/s  speedup vs CPU = 880.41x
    wg=2               0.147 ms    456.39 GB/s  speedup vs CPU = 880.41x
    wg=16              0.153 ms    438.12 GB/s  speedup vs CPU = 845.88x
    wg=4               0.154 ms    435.99 GB/s  speedup vs CPU = 840.39x
    wg=8               0.156 ms    430.11 GB/s  speedup vs CPU = 829.62x
    cpu (ref)        129.420 ms      0.52 GB/s  (baseline)

if [ ${#KS[@]} -gt 0 ] || [ ${#VS[@]} -gt 0 ]; then
# --ks / --vs override: run the Cartesian product. Missing side defaults to the
# set supplied on the other side (so e.g. --vs f16 on its own sweeps all built-in K:f16 pairs).
if [ ${#KS[@]} -eq 0 ]; then

In --no-fa mode the script still documents the scalar-path sweep as "only test K quantizations with V=f16", but this override branch now auto-fills the missing side with all cache types. For example, --no-fa --ks tbq3_0 will expand to tbq3_0:{f16,q8_0,q4_0,pq3_0,...}, and the first quantized-V row aborts at runtime with V cache quantization requires flash_attn. Because run_perplexity_once() returns non-zero under set -e, that stops the whole sweep instead of running the intended K-only comparison. Should the auto-filled side be clamped back to f16 whenever FA_FLAG=off?

case GGML_TYPE_Q8_0:
case GGML_TYPE_TQ2_0:
case GGML_TYPE_TQ1_0:
case GGML_TYPE_TBQ3_0:

This still looks too broad for GGML_OP_MUL_MAT_ID: the support predicate now whitelists the TBQ/PQ types here, but I do not see matching *_id pipelines being generated for them. ggml_vk_get_dequantize_mul_mat_vec_id() only populates the older quant types, and ggml_vk_get_mul_mat_mat_id_pipeline() still asserts if the selected pipeline_dequant_mul_mat_mat_id[src0_type] entry is empty. Since TBQ/PQ are KV-cache types this may be hard to hit in normal llama inference, but for custom graphs / backend-op surfaces this still looks like we advertise support without a backend implementation behind it.

@gianni-cor
Copy link
Copy Markdown

Issue 1 — Hadamard rotation engages on every quantized KV cache type, not just TBQ/PQ

This is the last behavioural concern from the original review (§2.3) that I haven't seen addressed. The can_rotk / can_rotv gates in build_attn_inp_kv_impl and build_attn_inp_kv_iswa test ggml_is_quantized(mctx_cur->type_k/v()):

  • const bool can_rotk =
    !hparams.is_n_embd_k_gqa_variable() &&
    hparams.n_embd_head_k % 64 == 0 &&
    ggml_is_quantized(mctx_cur->type_k());
    const bool can_rotv =
    !hparams.is_n_embd_v_gqa_variable() &&
    hparams.n_embd_head_v % 64 == 0 &&
    ggml_is_quantized(mctx_cur->type_v());
    inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
    inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);
  • const bool can_rotk =
    !hparams.is_n_embd_k_gqa_variable() &&
    hparams.n_embd_head_k % 64 == 0 &&
    ggml_is_quantized(mctx_cur->get_base()->type_k());
    const bool can_rotv =
    !hparams.is_n_embd_v_gqa_variable() &&
    hparams.n_embd_head_v % 64 == 0 &&
    ggml_is_quantized(mctx_cur->get_base()->type_v());
    inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
    inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);

ggml_is_quantized() returns true for q4_0, q4_1, q5_0, q5_1, q8_0, iq4_nl, all K-quants, all I-quants — so a user running on master with --cache-type-k q4_0 --cache-type-v q4_0 and head_dim%64 == 0 (essentially every modern model) now gets, after this PR lands:

  • three extra head_dim × head_dim dense mat-muls per attention call (Q · R, K · R, output · R);
  • quantization-error distribution that differs from pre-PR master, because the quantiser sees rotated Q/K/V instead of raw.

Attention scores remain mathematically equivalent at infinite precision (R is orthogonal, applied symmetrically to Q and K, and undone on the output for V), so end-to-end PPL should stay within the same CI, but:

  1. it's a silent behavioural and performance change on paths that have nothing to do with TurboQuant/PolarQuant,
  2. bit-identical regression comparisons against master on existing q4_0/q8_0 KV-cache runs are no longer possible,
  3. the rotation would be pure overhead on those paths — the codebooks it pays for (d-specific Lloyd–Max centroids) are TBQ/PQ-only.
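The equivalence claim above is easy to sanity-check on a toy orthogonal R (a 2x2 Givens rotation here, not the actual Hadamard-based rotation):

```cpp
#include <cassert>
#include <cmath>

// dot(qR, kR) == dot(q, k) for orthogonal R: a rotation applied symmetrically
// to Q and K cancels in the attention score (q^T R^T R k == q^T k).
static void rotate2(const float v[2], float c, float s, float out[2]) {
    out[0] = c * v[0] - s * v[1];
    out[1] = s * v[0] + c * v[1];
}

static float dot2(const float a[2], const float b[2]) {
    return a[0] * b[0] + a[1] * b[1];
}
```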

Suggested fix (~10 lines) — narrow both predicates to the 8 TBQ/PQ types:

auto is_tbq_pq = [](ggml_type t) {
    switch (t) {
        case GGML_TYPE_TBQ3_0:    case GGML_TYPE_TBQ4_0:
        case GGML_TYPE_PQ3_0:     case GGML_TYPE_PQ4_0:
        case GGML_TYPE_TBQ3_0_64: case GGML_TYPE_TBQ4_0_64:
        case GGML_TYPE_PQ3_0_64:  case GGML_TYPE_PQ4_0_64:
            return true;
        default:
            return false;
    }
};
const bool can_rotk = !hparams.is_n_embd_k_gqa_variable() &&
                      hparams.n_embd_head_k % 64 == 0 &&
                      is_tbq_pq(mctx_cur->type_k());

Same change in the iSWA builder. Happy to push this as a PR against turboquant if useful.

// expects dim01-contiguous quantized src0 and would assert. supports_op
// advertises these cases as supported via has_quant_f16_cpy, so we must
// keep them on the GPU here rather than fall back to CPU.
} else if ((dst->ne[1] == 1 || (dst->ne[1] <= mul_mat_vec_max_cols && src1->ne[2] * src1->ne[3] == 1)) &&

I can still reproduce a latest-head Vulkan correctness bug here on the NVIDIA coopmat2 box. This branch now sends non-dim01-contiguous quantized src0 to the matrix path when n is small, but the standalone TBQ QJL correction is still gated below to _64 or ne11 > mul_mat_vec_max_cols (ggml-vulkan.cpp around the is_tbq_d128_dispatch / ne11 > mul_mat_vec_max_cols check). That leaves a hole for _128 tbq3_0 / tbq4_0 with small n: they avoid the vec kernel, but also skip the Stage-2 correction pass.

Concrete repro on qvac-dev-linux-x64 (RTX 5090, VK_NV_cooperative_matrix2) against the current PR head: MUL_MAT(type_a=tbq3_0,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3],k_v=0,o=1) is reported as supported on Vulkan0, but the correctness run fails with ERR = 0.058072511 > 0.000500000. The analogous control case with type_a=pq3_0 and the same shape passes.

That strongly suggests the non-contiguous small-n matrix path is still missing the TBQ Stage-2 QJL correction for _128 blocks.

@gianni-cor
Copy link
Copy Markdown

Issue 3 — No _64 coverage in test-backend-ops MUL_MAT / FLASH_ATTN_EXT

The _64 (head_dim=64) variants are only exercised by test-quantize-fns, which confirms the copy_to_quant path post-32ba81912 (RMSE dropped from ~0.010 to ~0.003 for all 4 _64 types on Vulkan — nice). But the standalone MUL_MAT and fused FLASH_ATTN_EXT paths that 32ba81912 + 7933c60ab wire up are not covered by CI:

  • The new MUL_MAT coverage block enumerates only d=128 types:

    // TurboQuant / PolarQuant MUL_MAT coverage.
    // Intentionally exercises the standalone MUL_MAT path (not fused into FLASH_ATTN_EXT),
    // which is the path that reports `supports_op == yes` on NV coopmat2 but has no
    // matching pipeline created in `pipeline_dequant_mul_mat_mat_f16[]` — see
    // ggml-vulkan.cpp:3412 ("TBQ/PQ cm2 matmul shaders not yet generated") and the
    // supports_op switch that still lists TBQ/PQ for MUL_MAT.
    {
        const ggml_type tbq_pq[] = {
            GGML_TYPE_TBQ3_0, GGML_TYPE_TBQ4_0, GGML_TYPE_PQ3_0, GGML_TYPE_PQ4_0,
        };
        for (ggml_type type_a : tbq_pq) {
            for (ggml_type type_b : { GGML_TYPE_F32, GGML_TYPE_F16 }) {
                // mul_mat_vec path (n small, e.g. decode-like)
                test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {1, 1}, {1, 1}));
                test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 8, 256, {1, 1}, {1, 1}));
                // mat-mat path (n > 8, e.g. prefill — routes through dequant + f16 matmul on Vulkan)
                test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {1, 1}, {1, 1}));
                test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 32, 256, {1, 1}, {1, 1}));
            }
        }
    }

  • The mixed-K/V flash-attention block uses hs=64 but passes type_K = TBQ3_0 (not TBQ3_0_64). Inside test_flash_attn_ext::build_graph, hsk_padded = GGML_PAD(hsk, ggml_blck_size(type_K)) rounds 64 up to 128 because ggml_blck_size(TBQ3_0) = 128 — so the _64 flash-attention pipelines are never actually dispatched:

    // Mixed K/V type flash attention (at least one side is TBQ/PQ)
    {
        const ggml_type tbq_pq[] = { GGML_TYPE_TBQ3_0, GGML_TYPE_TBQ4_0, GGML_TYPE_PQ3_0, GGML_TYPE_PQ4_0 };
        const ggml_type mixed[]  = { GGML_TYPE_TBQ3_0, GGML_TYPE_TBQ4_0, GGML_TYPE_PQ3_0, GGML_TYPE_PQ4_0, GGML_TYPE_Q8_0, GGML_TYPE_F16 };
        auto is_tbq_pq = [&](ggml_type t) { return std::find(std::begin(tbq_pq), std::end(tbq_pq), t) != std::end(tbq_pq); };
        for (ggml_type tk : mixed) {
            for (ggml_type tv : mixed) {
                if (tk == tv) continue;
                if (!is_tbq_pq(tk) && !is_tbq_pq(tv)) continue;
                for (int hs : { 64, 128 }) {
                    for (int kv : { 113, 512 }) {
                        for (int nb : { 1, 32 }) {
                            test_cases.emplace_back(new test_flash_attn_ext(
                                hs, hs, 4, {1, 1}, kv, nb, true, false, 0.0f, 0.0f, GGML_PREC_F32, tk, tv));
                        }
                    }
                }
            }
        }
    }

Empirical confirmation on the latest PR tip (45d3b8098, 2× RTX 5090):

$ ./bin/test-backend-ops test -p "tbq3_0_64|tbq4_0_64|pq3_0_64|pq4_0_64" -o MUL_MAT
Testing 3 devices
  0/0 tests passed
  0/0 tests passed

So a regression in:

  • load_a_to_shmem for _64 in mul_mm_funcs.glsl,
  • the _64 matmul pipelines emitted in vulkan-shaders-gen.cpp,
  • the pipeline_mul_mm_tbq_qjl[GGML_TYPE_*_64][0/1] wiring,
  • the qjl_stride_a / qjl_stride_batch_a computation on permuted K views, or
  • the _64 FA TQ_D64 branch in tq_utils.comp,

would slip through the regular test suite. The fix the author ran locally ("test-backend-ops -o MUL_MAT -b Vulkan0 … all 32 TBQ/PQ × {128, 64} × n ∈ {1, 8, 16, 32} cases pass") is exactly what CI should do.

Suggested fix — extend both blocks to include _64 types and, where they share a loop with hs, match the type's block size:

// tests/test-backend-ops.cpp — MUL_MAT coverage
const ggml_type tbq_pq[] = {
    GGML_TYPE_TBQ3_0,    GGML_TYPE_TBQ4_0,    GGML_TYPE_PQ3_0,    GGML_TYPE_PQ4_0,
    GGML_TYPE_TBQ3_0_64, GGML_TYPE_TBQ4_0_64, GGML_TYPE_PQ3_0_64, GGML_TYPE_PQ4_0_64,
};
// and for k use e.g. k = ggml_blck_size(type_a) * 2 so the _64 blocks are actually exercised
// tests/test-backend-ops.cpp — FA coverage
// Separate arm for _64 types so hs stays at 64 (no GGML_PAD up to 128)
const ggml_type tbq_pq_64[] = { GGML_TYPE_TBQ3_0_64, GGML_TYPE_TBQ4_0_64,
                                GGML_TYPE_PQ3_0_64,  GGML_TYPE_PQ4_0_64 };
for (ggml_type tk : tbq_pq_64) {
    for (ggml_type tv : { GGML_TYPE_PQ3_0_64, GGML_TYPE_PQ4_0_64,
                          GGML_TYPE_Q8_0, GGML_TYPE_F16 }) {
        test_cases.emplace_back(new test_flash_attn_ext(
            /*hsk=*/64, /*hsv=*/64, 4, {1, 1}, 512, 32,
            true, false, 0.0f, 0.0f, GGML_PREC_F32, tk, tv));
    }
}

Small and contained, but closes the hole where the most fragile part of this PR (the d=64 Stage-1 + QJL-Stage-2 path) is un-covered by CI.

@gianni-cor

Issue 4 — Thread-0 bottleneck in mul_mm_tbq_qjl_correction.comp

Not a correctness concern (the kernel ships working), just a perf note for the new non-FA QJL epilogue added in 32ba81912. The inner reduction over proj_b_sh[] and qjl[] is thread-0-only:

// Thread 0 reduces the block's QJL dot product.
if (has_qjl && tid == 0u) {
    const float qjl_scale = d_r * sqrt(1.5707963) / float(QUANT_K);
    float pos_sum = 0.0;
    float total_sum = 0.0;
    [[unroll]] for (uint w = 0u; w < QUANT_K / 32u; w++) {
        const uint bb = w * 4u;
        uint bits = uint(data_a[ib].qjl[bb])
                  | (uint(data_a[ib].qjl[bb + 1u]) << 8u)
                  | (uint(data_a[ib].qjl[bb + 2u]) << 16u)
                  | (uint(data_a[ib].qjl[bb + 3u]) << 24u);
        [[unroll]] for (uint qq = 0u; qq < 8u; qq++) {
            const uint base_idx = (w * 8u + qq) * 4u;
            const vec4 pq = vec4(proj_b_sh[base_idx],
                                 proj_b_sh[base_idx + 1u],
                                 proj_b_sh[base_idx + 2u],
                                 proj_b_sh[base_idx + 3u]);
            const vec4 mask_v = vec4(float(bits & 1u),
                                     float((bits >> 1u) & 1u),
                                     float((bits >> 2u) & 1u),
                                     float((bits >> 3u) & 1u));
            pos_sum += dot(mask_v, pq);
            total_sum += pq.x + pq.y + pq.z + pq.w;
            bits >>= 4u;
        }
    }
    accum += qjl_scale * (2.0 * pos_sum - total_sum);
}

QUANT_K threads cooperate on the preceding Walsh–Hadamard butterfly; then the other 127 (or 63 for _64) go idle while thread 0 walks proj_b_sh[] and the qjl[] byte array serially. This is the hot inner loop of the kernel — one pass per block, num_blocks_per_row blocks per (row_a, col_b, batch) triple.

This is the `-fa off` prefill-K path (cold path for decode), so shipping the serial version is reasonable, but it's likely the biggest single perf gain available on non-FA TBQ inference once you get to it. A straightforward fix is a subgroupAdd-based tree reduction over the 32 lanes of the first subgroup:

// every thread contributes its own slice of (sign*proj, proj)
float local_pos = 0.0;
float local_tot = 0.0;
const uint byte_idx = tid >> 3u;
const uint bit_idx  = tid & 7u;
const float sign_j  = float((uint(data_a[ib].qjl[byte_idx]) >> bit_idx) & 1u);
local_pos = sign_j * proj_b_sh[tid];
local_tot = proj_b_sh[tid];

// tree reduction — one subgroupAdd replaces the whole [[unroll]] loop
pos_sum   = subgroupAdd(local_pos);
total_sum = subgroupAdd(local_tot);
if (tid == 0u) {
    accum += qjl_scale * (2.0 * pos_sum - total_sum);
}

Same structure as the cooperative norm reduction I added in 45d3b8098 for copy_to_quant.comp. Would also let you drop proj_b_sh from shared mem (not needed past the butterfly).

Will queue this as a follow-up patch once the three correctness items (Issue 1 above, and the review leftovers: wrong bpw comments in ggml.h, dead Stage-1 sign machinery, and the 0xQJL128ULL gibberish macro in ggml-quants.c) are resolved.

@jesusmb1995
Author

jesusmb1995 commented Apr 24, 2026

sycl-fp16 seems to fail now: https://github.com/tetherto/qvac-fabric-llm.cpp/actions/runs/24902725692/job/72924264252?pr=115

Here it was passing before: https://github.com/tetherto/qvac-fabric-llm.cpp/actions/runs/24771942879/job/72480398192

Likely cause: the last successful sycl-fp16 run on this PR was Apr 22 (run 24771942879, commit 45d3b80). It broke today purely because `apt install intel-oneapi-compiler-dpcpp-cpp` now resolves to 2026.0, which dropped/moved `syclcompat/math.hpp`.

Fix a latent correctness bug in the TurboQuant / PolarQuant copy_to_quant
cooperative shader that silently produces wrong bytes on any device whose
gl_SubgroupSize is less than the 32-thread workgroup (Intel Xe/Arc at 8/16,
ARM Mali 4/8/16, some Adreno configurations). Make the path cover every
supported subgroup size, plumb a runtime knob for testing, and add a
dedicated test suite with both real-hardware and software-Vulkan coverage.

Motivation
----------
The original copy_to_quant.comp TBQ/PQ path uses subgroupAdd() for the
per-block norm reductions and subgroupBallot() for the QJL sign-bit sketch,
assuming gl_SubgroupSize == 32 (= the workgroup size). On devices where the
native subgroup is smaller, those ops reduce only within a subgroup, not the
whole workgroup, so each subgroup sees its own partial sum and the output
bytes become whatever the first-subgroup partial happened to produce. The
SET_ROWS path has the same issue. The bug does not reproduce on most
production GPUs (NVIDIA fixed-32, AMD RDNA 32/64, Apple 32) but bites Intel
and several mobile GPUs.

Shader changes (copy_to_quant.comp)
-----------------------------------
* New specialization constant SG_SIZE at constant_id = 1 (slot 0 is already
  used by generic_binary_head.glsl's `norepeat` in the SET_ROWS path).
  Defaults to 32 so hosts that pass no spec info get the original shader.
* TQ_WG fixed at 32 (the workgroup size); NSG = TQ_WG / SG_SIZE is the
  number of subgroups per workgroup.
* New helper tq_wg_add(x): if NSG == 1 (SG_SIZE >= TQ_WG) returns
  subgroupAdd(x) -- identical to the original fast path and
  dead-code-eliminated by spec-constant folding; if NSG > 1 the per-
  subgroup subgroupAdd results are written to shared memory (tq_sh_red)
  and stitched with an [[unroll]]-ed sum. Replaces every subgroupAdd() in
  the TBQ/PQ/norm-correction paths.
* QJL sign-bit pack: when SG_SIZE >= TQ_WG the original subgroupBallot
  fast path runs; when SG_SIZE < TQ_WG it falls back to atomicOr into a
  shared uint array and a serial write-out. Same fast-path guard lets
  specialization fold the slow branch away when SG_SIZE == 32.
* SG_SIZE > TQ_WG (e.g. AMD wave64 with WG=32) is treated as NSG == 1
  via clamp(SG_SIZE, TQ_WG) in tq_wg_add, so those devices take the fast
  path even though half the wave is masked off.

Host plumbing (ggml-vulkan.cpp)
-------------------------------
* vk_device_struct grows a tbq_copy_sg_size field (0 = no override).
* Device init reads GGML_VK_TBQ_COPY_SG_SIZE from env, validates against
  {4, 8, 16, 32, 64} intersected with the device's
  [subgroup_min_size, subgroup_max_size], and emits a structured
  "tbq_copy_sg_size_status requested=R applied=A reason=X" line so tests
  can tell whether the override was applied or rejected (distinct from
  success/failure of the run itself).
* ggml_vk_load_shaders picks the (SG_SIZE spec const, requiredSubgroupSize)
  pair used for every CPY-to-quant and SET_ROWS-to-quant pipeline:
    - if the env override is set: that value
    - else if the device supports size control: mul_mat_subgroup_size
    - else: 0 (shader default SG_SIZE=32, no required size) -- matches
      pre-patch behaviour on drivers without VK_EXT_subgroup_size_control.
  The two-element spec-const vector is {0, SG_SIZE} for the plain CPY
  path (slot 0 is ignored by generic_unary_head.glsl) and {1, SG_SIZE}
  for SET_ROWS (slot 0 is `norepeat`, always 1).
* Adds a device-selection opt-in GGML_VK_ALLOW_CPU_DEVICES=1 so tests can
  pick up software Vulkan ICDs (lavapipe, SwiftShader) that ggml-vulkan
  normally filters out. Production code never sets this env var and the
  behaviour is unchanged when it isn't set.

New test (tests/test-copy-tbq-subgroups.cpp + CMakeLists)
---------------------------------------------------------
Self-spawning C++ test that for each (SG in {0, 4, 8, 16, 32, 64}, type,
shape) triple runs GPU quantize, compares against a CPU
ggml_quantize_chunk reference, and reports byte-mismatch + dequant NMSE
+ throughput. Key design choices:
  * Self-spawn (popen of --child N with a different
    GGML_VK_TBQ_COPY_SG_SIZE value per child) because the env var is
    consumed once at device init and can only be changed across processes.
  * Parses the structured status line from the backend to distinguish
    "applied" from "rejected" rows. Rejected rows are labelled
    SKIP-<reason> in the per-case table and excluded from the
    NMSE-spread assertion (they are duplicates of sg=0 and don't add
    independent coverage). Prior phrasing that labelled them OK was
    misleading.
  * --types comma-separated filter keeps the default CI run fast by
    iterating only a subset of TBQ/PQ types.
  * Shared pass/fail rule: nmse(gpu vs cpu) <= 1e-6 for every applied
    SG; the per-case table stays OK on the legs that couldn't exercise
    the stitch path on the host GPU.

Cross-subgroup-size coverage via lavapipe (tests/test-turboquant.sh)
--------------------------------------------------------------------
Real desktop GPUs (NVIDIA, AMD RDNA, Apple, most Adreno) have
minSubgroupSize >= 32, so VK_EXT_subgroup_size_control cannot request the
smaller subgroups the stitch path was written for. To actually exercise
NSG > 1 in CI, the script now also runs the test under lavapipe (Mesa's
CPU Vulkan driver) at LP_NATIVE_VECTOR_WIDTH in {128, 256, 512}, which
gives native subgroupSize {4, 8, 16} respectively and therefore covers
every distinct NSG branch the shader supports:

    LP_NATIVE_VECTOR_WIDTH | lavapipe SG | NSG (= TQ_WG / SG)
    -----------------------+-------------+--------------------
         128               |      4      |  8  (8-way stitch)
         256               |      8      |  4  (4-way stitch)
         512               |     16      |  2  (2-way stitch)

Combined with the native-GPU leg (NSG=1, fast path), this gives full
coverage of the helper's {1, 2, 4, 8} NSG branches on any host.

Usage and modes
---------------
  tests/test-turboquant.sh          # short mode (default): CI-friendly
  tests/test-turboquant.sh --full   # all TBQ/PQ types, full matrix

Short mode restricts the SG-coverage legs to tbq3_0 / pq3_0 / *_64 to keep
default CI runtime bounded; full mode covers all 8 TBQ/PQ types. Both
modes render a Unicode-boxed summary table at the end covering every
subgroup-coverage leg that ran.