QVAC-14555: TurboQuant (Vulkan): KV cache quantization (TBQ3_0 / TBQ4_0 / PQ3_0 / PQ4_0)#115
jesusmb1995 wants to merge 16 commits into tetherto:temp-7248
Conversation
Force-pushed 69522fb to 6497a86
Are you planning to merge this before the rebase to the latest version of llama.cpp?
Force-pushed f7ba069 to 9d2a659
Edit: @zoq Since it seems we want this merged in about 1-2 weeks, I would target this version for now. Yes, planning to merge this before the rebase.
Force-pushed 0a7559b to 7ea421d
Force-pushed 7ea421d to 3238522
The per-thread TBQ/PQ quantize shader was single-threaded per block —
one lane normalized 128 values, ran the FHT serially, and packed the
QJL sketch bit-by-bit, with three float[128] private arrays spilling
to GPU private memory. On a 5090 this capped the tbq3_0 / tbq4_0 write
throughput at ~80 GB/s (~4 % of peak).
Switch to a cooperative shader that treats one workgroup (32 lanes ==
one subgroup on NVIDIA) as one block:
- norm, norm-correction and residual-norm reductions use subgroupAdd
- the Fast Hadamard Transform runs log2(BK) passes with the BK/2
butterflies in each pass spread across the 32 threads, separated
by a single barrier() each
- the QJL sign sketch is packed with subgroupBallot (32 bits per
  call, written as four bytes directly) instead of 128 serial ORs
  into memory
- scratch moves from private arrays to shared memory (tq3_sh_x,
tq3_sh_idx, tq3_sh_proj, and tq4_* analogues)
On the host side, the 32 TBQ/PQ cpy_f32_quant pipelines drop their
wg_denoms from {32,1,1} to {1,1,1} so that "one workgroup == one
block", and the shader's CPY main() picks up a matching TQ_COOP branch
that drops the *32 + gl_LocalInvocationID.x offset from the block
index decode.
The GGML_OP_SET_ROWS dispatch path also needs to know about the new
"one workgroup per block" contract: for TBQ/PQ dst types, divide ne
by ggml_blck_size(dst) instead of 32 * ggml_blck_size(dst). Without
this gate the set_rows kernel dispatched only 1/32 of the required
workgroups, silently leaving 31 out of 32 KV-cache blocks uninitialized
and driving perplexity on Mistral-7B-Instruct-v0.3 from ~5.9 to ~1090
with no visible failure from llama-bench or test-backend-ops (the
CPY tests only exercise GGML_OP_CPY, which already had the /blck_size
rule). Unrelated types keep the /32/blck_size rule so q4_0, q8_0 etc.
behave exactly as before.
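The dispatch-count rule described above can be sketched as follows (the function name is hypothetical; the real logic lives in the Vulkan SET_ROWS dispatch path): TBQ/PQ now needs one workgroup per block, every other quant type keeps 32 blocks per workgroup.

```cpp
#include <cassert>
#include <cstdint>

// Workgroup count for the SET_ROWS dispatch (sketch, names hypothetical).
// Cooperative TBQ/PQ shader: one workgroup == one block, so divide by
// blck_size only. Other types keep the old 32-blocks-per-workgroup rule.
static uint32_t set_rows_workgroups(uint32_t ne, uint32_t blck_size, bool is_tbq_pq) {
    if (is_tbq_pq) {
        return ne / blck_size;          // one workgroup per block
    }
    return ne / (32 * blck_size);       // unchanged for q4_0, q8_0, ...
}
```

Without the gate, TBQ/PQ would get 32x fewer workgroups than needed, which is exactly the 31-out-of-32-blocks-uninitialized failure described above.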
Measured on 2x RTX 5090, Vulkan 1.4.321, PR tetherto#115 tip b23276f.
test-quantize-perf -b vulkan, 4 MiB input, 500 iters:
type baseline avg optimized avg avg speedup
tbq3_0 187.5 us 42.7 us 4.39x
tbq4_0 192.9 us 44.3 us 4.35x
pq3_0 68.6 us 44.8 us 1.53x
pq4_0 84.9 us 44.9 us 1.89x
llama-bench on Mistral-7B-Instruct-v0.3 Q4_K_S, -fa 1, -r 3:
K / V pp2048 base -> opt tg128 base -> opt
tbq3_0 / pq3_0 9764 -> 9880 +1.2 % 179.1 -> 206.9 +15.5 %
pq3_0 / pq3_0 15396 -> 15653 +1.7 % 190.5 -> 214.0 +12.3 %
tbq4_0 / pq4_0 9568 -> 9782 +2.2 % 164.9 -> 205.1 +24.4 %
llama-perplexity on wikitext-2 test, 40 chunks, seed 42,
Mistral-7B-Instruct-v0.3 Q4_K_S, -fa 1:
K / V baseline PPL optimized PPL
f16 / f16 5.8254 +/- 0.13612 5.8254 +/- 0.13612 (control)
tbq3_0 / pq3_0 5.9333 +/- 0.13811 5.9129 +/- 0.13747
pq3_0 / pq3_0 5.9806 +/- 0.13879 5.9894 +/- 0.13918
tbq4_0 / pq4_0 5.8646 +/- 0.13707 5.8570 +/- 0.13679
All PPL deltas are an order of magnitude smaller than the 95 % CI and
come from FP associativity in the subgroup tree reduction vs the
previous sequential sum. No algorithmic change.
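The associativity point can be illustrated with a minimal sketch (not the shader code): a left-to-right sequential sum versus a pairwise tree sum of the same floats, both agreeing with a double-precision reference to well within the CI-scale tolerances quoted above.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sequential left-to-right float sum (shape of the old single-lane loop).
static float sum_seq(const std::vector<float> & v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}

// Pairwise tree sum (shape of a subgroupAdd-style reduction).
static float sum_tree(const std::vector<float> & v, size_t lo, size_t hi) {
    if (hi - lo == 1) return v[lo];
    const size_t mid = lo + (hi - lo) / 2;
    return sum_tree(v, lo, mid) + sum_tree(v, mid, hi);
}
```

The two orders can round differently at the float ulp level, but both stay within rounding error of the exact sum; that is the "no algorithmic change" claim in miniature.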
tests/test-turboquant.sh: 112/112 backend-op tests still pass on both
GPUs. test-quantize-fns reproduces the 4 pre-existing _64 roundtrip
failures with bit-identical error magnitudes (0.010805, 0.009384) —
that is the wrong-codebook bug from PR tetherto#115 review, not introduced or
fixed here.
Made-with: Cursor
This comment was marked as resolved.
This comment was marked as resolved.
build_attn_inp_kv_impl() and build_attn_inp_kv_iswa() allocated
inp->self_rotk as an nrot x nrot Hadamard where nrot = largest
power-of-two that divides n_embd_head_k (>= 64), but allocated
inp->self_rotv as a fixed 64x64 tensor, independent of n_embd_head_v.
Because ggml_rotate_hadamard() reshapes its input using rot->ne[0] as
the inner dim, an n_embd_head_v = 128 vector was rotated as two
independent 64-d halves, i.e. block_diag(H64, H64), instead of a full
H128.

The d=128 PQ/TBQ codebooks in ggml-quants.c are Lloyd-Max fitted to the
coordinate distribution of a full d=128 random orthogonal rotation
(sigma ~ 1/sqrt(128)), so the narrower-than-expected 64-d rotation left
the codebook ~2x too narrow per coordinate and silently inflated V
reconstruction error. This hits essentially every recent dense model
with head_dim_v = 128 (Llama 3, Mistral, Qwen2.5, ...) whenever a
TBQ/PQ V cache is selected, since resolve_tq_type() keeps the non-_64
variants for head_dim = 128. Reported by @gianni-cor in tetherto#115
with a repro showing ~1.40x worse 3-bit and ~1.50x worse 4-bit V
reconstruction at alpha = 0.10.

Factor the sizing + allocation out into a file-local helper
build_hadamard_rot(ctx, can_rot, n_embd_head) that applies the same
"largest power-of-two that divides n_embd_head, starting at 64" rule
used for self_rotk, and returns nullptr when can_rot is false. Call it
for both K and V in build_attn_inp_kv_impl() and
build_attn_inp_kv_iswa(), which makes self_rotv correctly 128x128 for
head_dim_v = 128 and keeps self_rotk behavior unchanged.

No other call site allocates self_rot{k,v}, so the rotation is now
symmetric across K and V and across the SWA and non-SWA builders. Net
change is -16 lines: two 17-line if/else blocks per builder collapse
into two one-liners.
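The sizing rule shared by K and V after this fix — largest power of two (>= 64) that divides the head dimension — can be isolated as a tiny sketch (the rule only, not the tensor allocation; callers already guard n_embd_head % 64 == 0):

```cpp
#include <cassert>

// Largest power-of-two (>= 64) dividing n_embd_head, the Hadamard
// rotation size used for both self_rotk and self_rotv. Sketch of the
// rule; assumes n_embd_head % 64 == 0 as the can_rot guards ensure.
static int hadamard_nrot(int n_embd_head) {
    int nrot = 64;
    do { nrot *= 2; } while (n_embd_head % nrot == 0);
    return nrot / 2;
}
```

For head_dim = 128 this yields a full 128x128 rotation, which is exactly what the fixed self_rotv now gets instead of the block-diagonal pair of H64s.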
The `GG_BUILD_LOW_PERF` → `-DGGML_NATIVE=OFF` append was placed inside
`gg_run_ctest_release`, but the top-level driver runs `gg_run
ctest_debug` first and `gg_run ctest_release` second. As a result the
debug build on the low-perf CI runners (`ggml-ci-x64-cpu-low-perf`
and `ggml-ci-arm64-cpu-low-perf`) was compiled with `-march=native`
against the build host's CPU and then executed on a different,
older-microarch runner in the pool, producing SIGILL during
ctest_debug. Ref PR ggml-org#115 CI run 24463228694.
Move the append into the top-level flag-handling block, right after
`GG_BUILD_NO_SVE`, so `CMAKE_EXTRA` gets `-DGGML_NATIVE=OFF` once,
before either ctest function is invoked, and both debug and release
builds pick it up. Remove the duplicate inside `gg_run_ctest_release`.
No workflow change required: `.github/workflows/build.yml` already
exports `GG_BUILD_LOW_PERF=1` for the two low-perf jobs, which was
correct; the bug was purely a scoping error in `ci/run.sh`. The other
`GG_BUILD_LOW_PERF` checks in the script (ctest label filter, and the
top-level branches that skip heavier test functions) are left
untouched — they were already at the correct scope.
diff --git a/ci/run.sh b/ci/run.sh
index 7fa469b..5c2f7e7 100755
--- a/ci/run.sh
+++ b/ci/run.sh
@@ -118,6 +118,15 @@ if [ ! -z ${GG_BUILD_NO_SVE} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8.5-a+fp16+i8mm"
fi
+# Disable native CPU optimizations for low-perf builds to ensure binary
+# compatibility with the (often heterogeneous) CI runner pool. Must be applied
+# at the top level so BOTH gg_run_ctest_debug and gg_run_ctest_release pick it
+# up — otherwise the debug build (which runs first) compiles with -march=native
+# and can SIGILL on a runner whose microarch is older than the build host.
+if [ ! -z ${GG_BUILD_LOW_PERF} ]; then
+ CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF"
+fi
+
if [ -n "${GG_BUILD_KLEIDIAI}" ]; then
echo ">>===== Enabling KleidiAI support"
@@ -236,11 +245,6 @@ function gg_run_ctest_release {
# Check cmake, make and ctest are installed
gg_check_build_requirements
- # Disable native CPU optimizations for low-perf builds to ensure compatibility
- if [ ! -z ${GG_BUILD_LOW_PERF} ]; then
- CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF"
- fi
-
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 247775b..bcc1759 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -49,6 +49,25 @@ static ggml_tensor * ggml_rotate_hadamard(
return res;
}
+// Allocate the Hadamard rotation input used by ggml_rotate_hadamard() for a
+// TurboQuant/PolarQuant K or V stream. Size is the largest power-of-two that
+// divides n_embd_head (>= 64), so the rotation matches the head dim and the
+// PQ/TBQ codebooks see the full d-wide rotated distribution they were fitted
+// to. Returns nullptr when can_rot is false.
+static ggml_tensor * build_hadamard_rot(ggml_context * ctx, bool can_rot, int n_embd_head) {
+ if (!can_rot) {
+ return nullptr;
+ }
+
+ int nrot = 64;
+ do { nrot *= 2; } while (n_embd_head % nrot == 0);
+ nrot /= 2;
+
+ ggml_tensor * rot = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, nrot, nrot);
+ ggml_set_input(rot);
+ return rot;
+}
+
void llm_graph_input_embd::set_input(const llama_ubatch * ubatch) {
if (ubatch->token) {
const int64_t n_tokens = ubatch->n_tokens;
@@ -1626,30 +1645,13 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
hparams.n_embd_head_k % 64 == 0 &&
ggml_is_quantized(mctx_cur->type_k());
- if (can_rotk) {
- int nrot = 64;
- do { nrot *= 2; } while (hparams.n_embd_head_k % nrot == 0);
- nrot /= 2;
-
- inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
- ggml_set_input(inp->self_rotk);
- } else {
- inp->self_rotk = nullptr;
- }
-
const bool can_rotv =
!hparams.is_n_embd_v_gqa_variable() &&
hparams.n_embd_head_v % 64 == 0 &&
ggml_is_quantized(mctx_cur->type_v());
- if (can_rotv) {
- int nrot = 64;
-
- inp->self_rotv = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
- ggml_set_input(inp->self_rotv);
- } else {
- inp->self_rotv = nullptr;
- }
+ inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
+ inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);
}
return inp;
@@ -1947,30 +1949,13 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
hparams.n_embd_head_k % 64 == 0 &&
ggml_is_quantized(mctx_cur->get_base()->type_k());
- if (can_rotk) {
- int nrot = 64;
- do { nrot *= 2; } while (hparams.n_embd_head_k % nrot == 0);
- nrot /= 2;
-
- inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
- ggml_set_input(inp->self_rotk);
- } else {
- inp->self_rotk = nullptr;
- }
-
const bool can_rotv =
!hparams.is_n_embd_v_gqa_variable() &&
hparams.n_embd_head_v % 64 == 0 &&
ggml_is_quantized(mctx_cur->get_base()->type_v());
- if (can_rotv) {
- int nrot = 64;
-
- inp->self_rotv = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
- ggml_set_input(inp->self_rotv);
- } else {
- inp->self_rotv = nullptr;
- }
+ inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
+ inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);
}
return (llm_graph_input_attn_kv_iswa *) res->add_input(std::move(inp));
Register explicit `MUL_MAT` test cases for the TurboQuant / PolarQuant
types (`tbq3_0`, `tbq4_0`, `pq3_0`, `pq4_0`) with `type_b ∈ {f32, f16}`
and sizes that span both dispatch paths:
- n=1, n=8 -> mul_mat_vec path (decode-like)
- n=16, n=32 -> dequant + f16 matmul path (prefill-like)
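The split between the two dispatch paths is a simple threshold on n; a hedged sketch (the helper name is hypothetical, and the threshold value 8 is an assumption for illustration — the backend's actual constant is mul_mat_vec_max_cols):

```cpp
#include <cassert>

// Which dispatch path a MUL_MAT with quantized src0 takes, as a
// function of n (dst columns). Sketch only; 8 is an assumed value of
// the backend's mul_mat_vec_max_cols.
static bool takes_vec_path(int n, int mul_mat_vec_max_cols = 8) {
    return n <= mul_mat_vec_max_cols;   // decode-like n -> mul_mat_vec
}
```

Under this assumption, n=1 and n=8 land on the mul_mat_vec path while n=16 and n=32 fall through to the dequant + f16 matmul path, which is why the four sizes together cover both.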
Motivation: the existing TBQ/PQ coverage in `test-backend-ops` only
registers `FLASH_ATTN_EXT` cases, so the `tests/test-turboquant.sh`
filter (`test-backend-ops test -p "tbq|pq"`) never exercises the
standalone `MUL_MAT` path. That path is the one reported as
`supports_op == yes` on NV `VK_NV_cooperative_matrix2` devices but has
no matching pipeline created in `pipeline_dequant_mul_mat_mat_f16[]`
(see `ggml-vulkan.cpp:3412` — "TBQ/PQ cm2 matmul shaders not yet
generated"). With these cases in place:
- On NV coopmat2 (RTX 5090): the n>=16 cases segfault, matching the
external reproduction from the PR ggml-org#115 review and making the bug
visible to CI.
- On KHR coopmat1 (AMD gfx1150): the n>=16 cases return numerical
garbage (err ~ 1.0), exposing the same missing-fallback issue in
a non-crashing form.
- The n=1/n=8 cases continue to pass via the existing
`mul_mat_vec_tbq*_0` / `mul_mat_vec_pq*_0` shaders, so the new
coverage cleanly isolates which dispatch path is broken.
No source changes to the Vulkan backend; this commit only adds the
test cases needed so the pre-existing bug is caught by
`tests/test-turboquant.sh`.
…d=64 variants

Before this patch, `supports_op` reported TBQ3_0/TBQ4_0/PQ3_0/PQ4_0
(and their `_64` / head_dim=64 variants) as supported for
`GGML_OP_MUL_MAT` on the Vulkan backend, but there was no working
pipeline behind it. This is the state the external review flags as
Issue 3 on PR ggml-org#115: on cm2 (RTX 5090) the support probe claims
support, the correctness run then segfaults. The previous
`test-backend-ops` commit adds the exact repro for this.

Root causes:
- On cm2 (NV coopmat2) devices the slot in
  `pipeline_dequant_mul_mat_mat_f16[]` was empty - the shader was never
  generated - so dispatches crashed when flash attention was not used.
- On cm1 and scalar devices the slot in `pipeline_dequant_mul_mat_mat[]`
  was wired up, but `mul_mm_funcs.glsl` had no `load_a_to_shmem`
  implementation for TBQ/PQ. The generic `mul_mm.comp` ran with
  uninitialized shared memory and produced near-random output.
- Even once data loading was fixed, TBQ3_0/TBQ4_0 still produced a small
  bias for `n > mul_mat_vec_max_cols` because the QJL Stage 2 correction
  that `mul_mat_vec_tbq*_0.comp` applies in the vec path has no
  equivalent in the generic matmul shader.
- For head_dim=64 models that use the `_64` block variants
  (TBQ*_0_64 / PQ*_0_64) the `_64` mul_mm pipelines, the `_64` QJL
  correction shaders, and the `_64`-sized Lloyd-Max codebook / sign
  arrays were missing, and the vec path is intentionally skipped for
  `_64` (so every `n` needs the full matmul + QJL correction pair).

Scope of this patch is strictly the standalone `MUL_MAT` path with
TBQ/PQ `src0` and f32 `src1` that the Issue 3 repro hits, i.e. the
`-fa off` K matmul. Fused flash attention (scalar, cm1, cm2) already
handles QJL correctly and is unchanged.

MoE FFN weights are not affected either: TBQ/PQ are KV-cache
quantizations (there is no `llama-quantize` target that produces TBQ/PQ
model weights), so `MUL_MAT_ID` never sees them as `src0` - attention
in MoE models is a plain `MUL_MAT` / `FLASH_ATTN_EXT` and reuses the
same fix.

The upstream "V cache quantization requires flash_attn" context-level
guard in `src/llama-context.cpp` is intentionally left unchanged: the V
matmul under `-fa off` uses a transposed quantized-V layout populated by
`ggml_set_rows` with row_size=1, which corrupts any `blck_size > 1`
type at write time (reproducible on CPU as well), and that is a
separate KV-cache issue out of scope here.

Changes:
- Add `load_a_to_shmem` implementations for TBQ3_0, TBQ4_0, PQ3_0,
  PQ4_0 (and their `_64` variants) in `mul_mm_funcs.glsl`, reusing
  `tbq3_dequant_raw` / `tbq4_dequant_raw` from `tq_utils.comp`. This
  makes `mul_mm.comp` correct for the centroid part of dequantization
  (`tbq*_dequant_raw(qs) * d`) on all eight types.
- `tq_utils.comp`: pick Stage-1 / QJL-Stage-2 sign bitmasks and the
  Lloyd-Max codebook (TBQ3_CB / TBQ4_CB) based on whether any
  `DATA_{A,K,V}_*_0_64` is defined. d=64 blocks use seeds 43/139 and a
  wider codebook (sigma = 1/sqrt(d) is larger at d=64 than at d=128);
  previously the shader hardcoded the d=128 constants, so the d=64
  variants silently dequantized against the wrong codebook.
- New shader `mul_mm_tbq_qjl_correction.comp`. It runs after the main
  matmul as an additive pass: one workgroup per `(row, col, batch)`,
  `QUANT_K` threads performing the same Walsh-Hadamard butterfly +
  `qjl[]` dot product as the vec shader, and accumulates
  `d_r * sqrt(pi/2) / QUANT_K * sum_qjl(H(B))` into D. Parameterized
  over `QUANT_K` so the same source emits both `_128` and `_64` SPIR-V.
  Only TBQ3_0 and TBQ4_0 (and `_64`) have `d_r`/`qjl`, so only those
  four get a correction pipeline.
- `vulkan-shaders-gen.cpp`:
  * Register the eight correction variants
    (`mul_mm_qjl_{tbq3_0,tbq4_0}{,_64}_{f32,f16}`).
  * Emit `matmul_{tbq,pq}{3,4}_0_64_{f32,f16}[_aligned]` for
    `mul_mm.comp`, in a dedicated block outside the main `type_names`
    loop so we don't cascade through FA / MUL_MAT_ID / get_rows / ...
    which either already have dedicated `_64` handling (FA) or don't
    apply to TBQ/PQ at all.
- `ggml-vulkan.cpp`:
  * Add `pipeline_mul_mm_tbq_qjl[GGML_TYPE_COUNT][2]` on the device and
    create pipelines at init time for all four TBQ types (128 and 64
    block sizes).
  * In `ggml_vk_get_mul_mat_mat_pipeline`, let cm2 fall through to the
    cm1/scalar pipeline when no cm2 `_mat_f16` shader exists for a given
    TBQ/PQ type, so cm2 devices stop segfaulting on these types.
  * Register TBQ/PQ `_64` in the `supports_op` `MUL_MAT` switch so d=64
    models are actually routed to the new pipelines instead of falling
    back to CPU.
  * Force `split_k = 1` for TBQ `src0` - the QJL correction pass would
    otherwise be added once per split.
  * Dispatch the QJL correction pass after the main matmul for TBQ3_0 /
    TBQ4_0 (and `_64`). For `_128` it's gated on
    `n > mul_mat_vec_max_cols` (the vec path already corrects for
    smaller n); for `_64` it runs unconditionally because there is no
    vec path on this block size.

Verified on AMD RDNA3.5 (RADV gfx1150, no cm1/cm2) with
`test-backend-ops -o MUL_MAT -b Vulkan0`: all 32 TBQ/PQ x {128, 64} x
n in {1,8,16,32} cases pass against f32 B. f16 B for standalone MUL_MAT
on TBQ/PQ still reports `not supported` on this device (the scalar/cm1
pipeline consumes f32 src1), which is consistent with the matmul path
shipping f32 src1 on the `-fa off` decode/prefill paths used by these
tests. cm2 verification is expected to run on the reviewer's RTX 5090
via the Issue 3 repro branch.
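The additive correction term the pass accumulates can be sketched on the CPU as a plain dot product (a hedged, standalone sketch with hypothetical names — not the shader; it assumes h already holds the Hadamard transform of the sign-rotated column and qjl_sign holds the block's +/-1 sign bits):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Shape of the QJL Stage-2 correction term described above:
//   d_r * sqrt(pi/2) / QUANT_K * sum_j sign(qjl_j) * h_j
// h: Hadamard-transformed input (length QUANT_K); qjl_sign: +1 / -1.
static float qjl_correction(float d_r, const std::vector<float> & h,
                            const std::vector<int> & qjl_sign) {
    const float kPi = 3.14159265358979f;
    const size_t quant_k = h.size();
    float dot = 0.0f;
    for (size_t j = 0; j < quant_k; ++j) {
        dot += (float) qjl_sign[j] * h[j];  // sign-weighted sum over the block
    }
    return d_r * std::sqrt(kPi / 2.0f) / (float) quant_k * dot;
}
```

Because the term is additive per (row, col, batch), dispatching it once per split under split_k > 1 would double-count it — hence the `split_k = 1` forcing noted above.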
debug: sentinel in QJL correction

vulkan: add dequantize-to-f16 cpy shaders so MUL_MAT with
non-contiguous quantized src0 can run on GPU

vulkan: add d=64 decision boundaries for TBQ/PQ copy_to_quant

Commit 6e26e8b ("vulkan: fix TBQ/PQ standalone MUL_MAT path, QJL
correction pass, and d=64 variants") added TQ_D64-gated d=64 Lloyd-Max
codebook centroids and random sign diagonals to tq_utils.comp, so that
the Vulkan encoder, FA decoder, and non-FA QJL correction pass all
match the CPU reference's d=64 constants (ggml-quants.c:
TQ{3,4}_CODEBOOK_64, TQ/QJL_SIGN_SEED_64).

It missed the corresponding d=64 *decision boundaries* in
copy_to_quant.comp, however. The boundaries are the midpoints between
adjacent codebook centroids and determine which centroid an input
coordinate is quantized to. CPU derives them at runtime from the
selected codebook via tq_compute_boundaries(), so it automatically used
the d=64 midpoints for _64 blocks. The Vulkan encoder hard-codes them
as const float TBQ3_B[7] / TBQ4_B[15], and those constants remained at
the d=128 midpoints even inside the #if block that accepts both
DATA_A_TBQ*_0 and DATA_A_TBQ*_0_64.

Net effect on a head_dim=64 model (Qwen2.5-0.5B):
- copy_to_quant bucketed each coordinate by the narrower d=128
  boundaries (centroids spaced ~sigma=1/sqrt(128)).
- The resulting index was then dequantized with the wider d=64
  centroids (spaced ~sigma=1/sqrt(64)), a completely different
  alphabet.
- Every value near a boundary landed on the wrong centroid.

Before commit 6e26e8b this was silently consistent: the codebook was
also d=128, so encoder and decoder were at least in agreement (just
producing a rescaled quantization). When the codebook was fixed to
d=64, the boundaries had to move with it.

Add a #if defined(TQ_D64) branch for TBQ3_B and TBQ4_B that uses the
d=64 midpoints computed from TQ{3,4}_CODEBOOK_64. Values regenerated
with scripts/compute_tq_codebooks.py, which now also emits the GLSL
boundaries array alongside the C codebook array so future codebook
updates keep the two in sync from one source of truth.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wiki.test offset_64
(--chunks 1), Vulkan AMD RADV gfx1150:

  tbq3_0 / f16 fa=off: 1659 -> 230 (7.2x better)
  tbq4_0 / f16 fa=off:    ~ -> 300
  pq3_0  / f16 fa=off:    ~ -> 230
  pq4_0  / f16 fa=off:    ~ -> 300
  pq3_0  / f16 fa=on :    ~ -> 225 (now matches fa=off)
  pq4_0  / f16 fa=on :    ~ -> 299 (now matches fa=off)

tbq3_0/tbq4_0 fa=on still diverges from fa=off due to a separate issue
in the FA QJL Stage-2 correction on d=64 blocks, addressed in a
follow-up patch.

vulkan: read raw Q from the input SSBO for the FA QJL projection

The flash-attention shaders compute the QJL Stage-2 correction as

  correction = d_r * sqrt(pi/2) / QUANT_K * (2*pos_sum - proj_q_sum)

where (pos_sum, proj_q_sum) are reductions over FHT(D_qjl * Q). The
scalar (flash_attn.comp) and coopmat1 (flash_attn_cm1.comp) paths used
to derive the FHT input by reading from Qf -- the shared-memory buffer
that already has the attention scale (1/sqrt(head_dim)) multiplied in
for the main Q*K dot -- and then dividing that value by p.scale to
recover the raw Q before multiplying by the QJL sign diagonal.

On cm1 Qf is f16, so the scale round-trip is lossy for the
large-magnitude activations seen in e.g. Qwen2.5-0.5B's first-layer
massive activations. On the scalar path Qf is f32 and p.scale is
usually a power of two (1/sqrt(head_dim)), so x * p.scale / p.scale is
bit-exact in principle -- but empirically the pre-scaled-then-un-scaled
read still produced materially different FHT input than a raw-Q read.

The standalone non-FA QJL shader (mul_mm_tbq_qjl_correction.comp) has
always read Q directly from src1 and gets correct results. Match that
pattern in the FA path: read Q straight from the data_qv4 SSBO into
Qf_qjl_proj, bypassing Qf entirely. cm2 already reads raw Q from data_q
directly, so it does not need the change.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wikitext-2 test (Vulkan
AMD RADV gfx1150):

  wiki.test --chunks 4 -n 128, K=tbq3_0/f16:
    fa=off        : 531
    fa=on, before : ~2000 (broken)
    fa=on, after  : 154
  wiki.test --chunks 4 -n 128, K=tbq4_0/f16:
    fa=off        : 207
    fa=on, after  : 77

K=f16/f16 and K=pq*/V=f16 fa=on/off stay within 1% of each other,
confirming the change is confined to the TBQ QJL path.

vulkan: run non-FA TBQ QJL correction on permuted src0 too

The standalone MUL_MAT QJL (Stage 2) correction pass was gated on
`!x_non_contig`, which silently skipped the correction on the no-FA
attention path because `kq = mul_mat(k, q)` feeds in K after
`ggml_permute(k, 0, 2, 1, 3)` -- a non-dim01-contiguous view of the KV
cache. With that gate the TBQ attention on the no-FA path was reduced
to PQ Stage 1 (centroid-only), producing bit-identical output to
`pq3_0` / `pq4_0` and regressing quality vs a CPU-reference TBQ run. It
was masked on the MUL_MAT test-backend-ops coverage because those tests
use contiguous src0.

Two changes:
* mul_mm_tbq_qjl_correction.comp: index the A matrix by a real
  in-memory block stride instead of the
  `num_blocks_per_row = K / QUANT_K` shortcut. `p.stride_a` and
  `p.batch_stride_a` are now interpreted as strides in BLOCK units
  (src0->nb[1] / sizeof(block) and src0->nb[2] / sizeof(block)
  respectively), matching the way `ggml_vk_flash_attn` already feeds
  `k_stride` / `k_offset` to the FA shader. For a contiguous src0 the
  new stride equals the old num_blocks_per_row, so existing tests are
  unaffected.
* ggml_vk_mul_mat_q_f16: compute `qjl_stride_a` / `qjl_stride_batch_a`
  from src0->nb and drop the `!x_non_contig` exclusion from both the
  descriptor-set request and the dispatch site. The `qx_buf_offset`
  still points at the original (pre-permute) TBQ blocks in d_Qx, which
  is exactly what the correction pass wants -- it just needed the real
  strides to reach the right block for each (row_a, batch_id).

Both paths have to be gated identically -- if only one of them is
changed, the dispatched pipeline ends up without a descriptor set and
`vkCmdPushConstants` crashes with VK_NULL_HANDLE layout.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wiki.test --chunks 4
-n 128 (Vulkan AMD RADV gfx1150):

  K=tbq3_0 V=f16 fa=off (vs fa=off pq3_0):
    before fix: 561 == 561 (QJL silently skipped, TBQ==PQ)
    after fix : 634 != 561 (QJL running, TBQ distinct from PQ)
    CPU ref   : 546
  K=tbq4_0 V=f16 fa=off (vs fa=off pq4_0):
    before fix: 173 == 173 (QJL silently skipped)
    after fix : 133 != 173 (QJL running, 23% lower PPL)
    CPU ref   : 154

f16/f16 and pq* paths are unchanged: the gate only opens for TBQ. FA is
unaffected since it has its own inlined QJL epilogue.
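The block addressing with strides in block units can be sketched as follows (hypothetical helper, not the shader code): for a contiguous src0 the stride equals k / QUANT_K and the formula reduces to the old num_blocks_per_row shortcut, while a permuted view just supplies different strides.

```cpp
#include <cassert>
#include <cstdint>

// Block index into the A buffer, with stride_a / batch_stride_a in
// BLOCK units (src0->nb[1] / sizeof(block), src0->nb[2] / sizeof(block)).
// Sketch of the indexing rule described above.
static uint64_t block_index(uint32_t row_a, uint32_t block_in_row,
                            uint32_t batch_id, uint64_t stride_a,
                            uint64_t batch_stride_a) {
    return (uint64_t) batch_id * batch_stride_a
         + (uint64_t) row_a * stride_a
         + block_in_row;
}
```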
…gle-GPU
The per-n_ctx wikitext slices generated by tests/test-kv-cache-quantization-perp.sh
were picked via an unseeded $RANDOM, so every run drew a fresh offset and PPL
numbers were not directly comparable across reruns. Additionally, llama-perplexity
was being invoked without --split-mode, so on multi-GPU hosts it defaulted to
splitting decoder layers across all visible devices -- introducing cross-device
numerical differences that could drift the baseline by more than the QJL/FA
signal this sweep is meant to detect.
Three changes to make PPL numbers reproducible across reruns and machines:
* generate_offset_files(): seed $RANDOM with a fixed SLICE_SEED (default 42)
before drawing offsets, so a fresh regeneration is byte-for-byte
reproducible. Export SLICE_SEED=<n> to draw a different but still
deterministic set of offsets.
* generate_offset_files(): skip regeneration when wiki.test.offset_<n_ctx>.raw
already exists and is non-empty; recover the original offset for the log
line from the suffix size so the output is still informative ("reusing
offset=... (<N> bytes)"). Pass --regen-slices (or delete the slice files)
to force regeneration.
* run_perplexity_once(): pass --split-mode none to llama-perplexity so PPL
is computed on a single device regardless of how many GPUs are visible.
Matches the default already used by tests/test-kv-cache-quantization-perf.sh
(SPLIT_MODE="${SPLIT_MODE:-none}"), so perp and perf now agree on
single-GPU execution.
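The seeded-offset idea can be sketched in C++ (the actual script seeds bash's $RANDOM with SLICE_SEED; the helper below is a hypothetical equivalent using a PRNG with a fixed seed): same seed, same offsets, on every rerun and machine.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Deterministic slice-offset drawing: a fixed seed (default 42 in the
// script) makes regeneration byte-for-byte reproducible; a different
// seed gives a different but still deterministic set of offsets.
static std::vector<uint32_t> draw_offsets(uint32_t seed, size_t n, uint32_t max_offset) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<uint32_t> dist(0, max_offset);
    std::vector<uint32_t> out(n);
    for (auto & o : out) o = dist(rng);
    return out;
}
```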
Header comment and --help output updated to document the slice knobs. No
change to the perplexity args beyond --split-mode, so existing CSV schemas
and result filenames are unaffected.
… regression
The Vulkan CI (ubuntu-24-cmake-vulkan) fails on this pre-existing upstream
backend-op case:
MUL_MAT(type_a=q8_0, type_b=f32, m=16, n=1, k=256, bs=[2,3],
nr=[1,1], per=[0,2,1,3], ...)
with
ggml-vulkan.cpp: GGML_ASSERT(ggml_vk_dim01_contiguous(src0)
|| src0->type == F32/F16/BF16) failed
in ggml_vk_mul_mat_vec_q_f16. The underlying bug is that our change to
ggml_backend_vk_device_supports_op() -- which relaxed the non-dim01-
contiguous constraint on quantized src0 so the TBQ/PQ -fa off K*Q path can
stay on the GPU -- also advertises support for every other quant type
(q4_0, q5_0, q5_1, q8_0, iq4_nl, ...) with non-contig src0, but the small-n
vec dispatcher in ggml_vk_mul_mat still routes those cases to
ggml_vk_mul_mat_vec_q_f16, which does not implement the quant->f16 cpy
fallback and asserts on non-contig quantized src0.
This regression was not caught by tests/test-turboquant.sh because that
script filters test-backend-ops with `-p 'tbq|pq'` and our added TBQ/PQ
MUL_MAT coverage only uses per=[0,1,2,3] (identity). The upstream q8_0
permutation matrix exercises the exact shape that trips the assert.
Add a second test-backend-ops invocation to test-turboquant.sh that targets
the smallest reproducer:
-p 'type_a=q8_0.*per=\[0,2,1,3\]'
Picking q8_0 (instead of a TBQ/PQ variant) means this check runs on any
Vulkan box without requiring a TBQ model or KV cache, and it directly
reproduces the CI failure.
Verified on AMD gfx1150 (KHR_coopmat) with the top-of-branch Vulkan that
still has the bug: test-turboquant.sh now exits non-zero locally with
"1 check(s) failed.", matching the CI failure. The follow-up Vulkan
patch adds the dispatcher fix that makes this check pass.
…ix path

Commit 6fd388c (vulkan-fix-tbq-pq-standalone) widened
ggml_backend_vk_device_supports_op(MUL_MAT) so that any quantized src0
type with a pipeline_cpy_quant_f16 entry is accepted even when it is
not dim01-contiguous. That is what lets the -fa off attention path keep
kq = mul_mat(K, Q) on the GPU when K is a permuted TBQ/PQ view of the
KV cache.

The matrix path (ggml_vk_mul_mat_q_f16) honours this: it runs the
quant->f16 cpy pipeline to dequantize the non-contig src0 before the
main matmul. But the vec path (ggml_vk_mul_mat_vec_q_f16), which the
dispatcher routes to when dst->ne[1] <= mul_mat_vec_max_cols
(decode-like n), does not: it asserts dim01-contiguous quantized src0
at the top of the function. So any small-n MUL_MAT with a
non-dim01-contiguous quantized src0 -- e.g. the upstream backend-op
coverage of MUL_MAT(type_a=q8_0, m=16, n=1, k=256, bs=[2,3],
per=[0,2,1,3]) -- slips through supports_op, gets routed to the vec
path, and aborts on the assertion. See: tetherto#115 (comment)

Fix by adding one clause to the dispatcher in ggml_vk_mul_mat: take the
vec path only when src0 is either non-quantized or dim01-contiguous.
Non-dim01-contiguous quantized src0 falls through to
ggml_vk_mul_mat_q_f16, which already handles it via
pipeline_cpy_quant_f16. This does not change hot paths: contiguous src0
still takes the vec path as before, which is the overwhelmingly common
case for mul_mat in transformer graphs.

Also annotate the vec path assert so a future caller that tries to send
non-contig quantized src0 there gets a loud error rather than a silent
wrong answer, and so the invariant between the dispatcher gate and the
assert is documented in both places.

Verified on AMD gfx1150 (KHR_coopmat):

Before: tests/test-turboquant.sh exits 1 with GGML_ASSERT at
ggml-vulkan.cpp:8105 on the q8_0 per=[0,2,1,3] smoke case added in the
previous patch.

After: tests/test-turboquant.sh passes; the q8_0 per=[0,2,1,3] MUL_MAT
cases run on the GPU through the matrix path (f32 variants succeed, f16
variants report "not supported [CPU]" as before since backend-ops does
not currently wire an f16 x f16 contiguity check for quantized src0).
This also fixes the Ubuntu Vulkan CI job for the PR.
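The dispatcher clause this patch adds can be sketched as one boolean (a hedged sketch mirroring the logic described above, not the real ggml_vk_mul_mat signature): the vec path now also requires src0 to be non-quantized or dim01-contiguous.

```cpp
#include <cassert>

// Vec-vs-matrix path gate after the fix (sketch). Non-dim01-contiguous
// quantized src0 falls through to the matrix path, which dequantizes
// via the quant->f16 cpy pipeline.
static bool use_vec_path(int n, int mul_mat_vec_max_cols,
                         bool src0_quantized, bool src0_dim01_contig) {
    return n <= mul_mat_vec_max_cols &&
           (!src0_quantized || src0_dim01_contig);
}
```

Under this gate, the q8_0 per=[0,2,1,3] n=1 case (small n, quantized, non-contiguous) no longer reaches the vec path's assert, while contiguous decode-time mul_mats keep their old routing.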
The per-thread TBQ/PQ quantize shader was single-threaded per block —
one lane normalized 128 values, ran the FHT serially, and packed the
QJL sketch bit-by-bit, with three float[128] private arrays spilling
to GPU private memory. On a 5090 this capped the tbq3_0 / tbq4_0 write
throughput at ~80 GB/s (~4 % of peak).
Switch to a cooperative shader that treats one workgroup (32 lanes ==
one subgroup on NVIDIA) as one block:
- norm, norm-correction and residual-norm reductions use subgroupAdd
- the Fast Hadamard Transform runs log2(BK) passes with the BK/2
butterflies in each pass spread across the 32 threads, separated
by a single barrier() each
- the QJL sign sketch is packed with subgroupBallot (32 bits per
call, written as four bytes directly) instead of 128 serial OR
into memory
- scratch moves from private arrays to shared memory (tq3_sh_x,
tq3_sh_idx, tq3_sh_proj, and tq4_* analogues)
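The ballot-based sign packing can be emulated on the CPU. A sketch of the scheme described above, assuming a 128-value block packed 32 signs at a time; the function name and the sign convention (`>= 0` sets the bit) are illustrative assumptions, not the shader's exact code:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// CPU emulation of the subgroupBallot packing: 128 residual signs are packed
// 32 at a time (one "ballot" per 32 lanes), and each 32-bit ballot is written
// out as four little-endian bytes, replacing 128 serial ORs into memory.
inline std::array<uint8_t, 16> pack_qjl_signs(const float (&r)[128]) {
    std::array<uint8_t, 16> out{};
    for (int s = 0; s < 4; ++s) {            // four 32-lane "ballots"
        uint32_t ballot = 0;
        for (int lid = 0; lid < 32; ++lid) { // bit `lid` = sign of lane value
            if (r[s * 32 + lid] >= 0.0f) {
                ballot |= 1u << lid;
            }
        }
        std::memcpy(&out[s * 4], &ballot, 4); // 32 bits written as 4 bytes
    }
    return out;
}
```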
On the host side, the 32 TBQ/PQ cpy_f32_quant pipelines drop their
wg_denoms from {32,1,1} to {1,1,1} so that "one workgroup == one
block", and the shader's CPY main() picks up a matching TQ_COOP branch
that drops the *32 + gl_LocalInvocationID.x offset from the block
index decode.
The GGML_OP_SET_ROWS dispatch path also needs to know about the new
"one workgroup per block" contract: for TBQ/PQ dst types, divide ne
by ggml_blck_size(dst) instead of 32 * ggml_blck_size(dst). Without
this gate the set_rows kernel dispatched only 1/32 of the required
workgroups, silently leaving 31 out of 32 KV-cache blocks uninitialized
and driving perplexity on Mistral-7B-Instruct-v0.3 from ~5.9 to ~1090
with no visible failure from llama-bench or test-backend-ops (the
CPY tests only exercise GGML_OP_CPY, which already had the /blck_size
rule). Unrelated types keep the /32/blck_size rule so q4_0, q8_0 etc.
behave exactly as before.
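The SET_ROWS dispatch contract can be stated as a tiny helper. A hedged sketch, where `set_rows_workgroups` and the `is_tbq_pq` flag are illustrative stand-ins for the real type checks in ggml-vulkan.cpp:

```cpp
#include <cassert>
#include <cstdint>

// "One workgroup == one block" for TBQ/PQ types; the legacy rule keeps
// 32 blocks per workgroup for every other quantized type.
inline int64_t set_rows_workgroups(int64_t ne, int64_t blck_size, bool is_tbq_pq) {
    if (is_tbq_pq) {
        return ne / blck_size;    // cooperative shader: 1 workgroup per block
    }
    return ne / (32 * blck_size); // legacy shaders: 32 invocations, 1 block each
}
```

Using the legacy rule for a 128-wide TBQ block is exactly the 1/32 under-dispatch described above.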
Measured on 2x RTX 5090, Vulkan 1.4.321, PR ggml-org#115 tip b23276f.
test-quantize-perf -b vulkan, 4 MiB input, 500 iters:
type baseline avg optimized avg avg speedup
tbq3_0 187.5 us 42.7 us 4.39x
tbq4_0 192.9 us 44.3 us 4.35x
pq3_0 68.6 us 44.8 us 1.53x
pq4_0 84.9 us 44.9 us 1.89x
llama-bench on Mistral-7B-Instruct-v0.3 Q4_K_S, -fa 1, -r 3:
K / V pp2048 base -> opt tg128 base -> opt
tbq3_0 / pq3_0 9764 -> 9880 +1.2 % 179.1 -> 206.9 +15.5 %
pq3_0 / pq3_0 15396 -> 15653 +1.7 % 190.5 -> 214.0 +12.3 %
tbq4_0 / pq4_0 9568 -> 9782 +2.2 % 164.9 -> 205.1 +24.4 %
llama-perplexity on wikitext-2 test, 40 chunks, seed 42,
Mistral-7B-Instruct-v0.3 Q4_K_S, -fa 1:
K / V baseline PPL optimized PPL
f16 / f16 5.8254 +/- 0.13612 5.8254 +/- 0.13612 (control)
tbq3_0 / pq3_0 5.9333 +/- 0.13811 5.9129 +/- 0.13747
pq3_0 / pq3_0 5.9806 +/- 0.13879 5.9894 +/- 0.13918
tbq4_0 / pq4_0 5.8646 +/- 0.13707 5.8570 +/- 0.13679
All PPL deltas are an order of magnitude smaller than the 95 % CI and
come from FP associativity in the subgroup tree reduction vs the
previous sequential sum. No algorithmic change.
tests/test-turboquant.sh: 112/112 backend-op tests still pass on both
GPUs. test-quantize-fns reproduces the 4 pre-existing _64 roundtrip
failures with bit-identical error magnitudes (0.010805, 0.009384) —
that is the wrong-codebook bug from PR ggml-org#115 review, not introduced or
fixed here.
Made-with: Cursor
}

// Pack QJL sign bits with subgroupBallot: each ballot call contributes 32 bits
// covering positions [s*32, (s+1)*32). With WG == subgroup size, bit `lid` of
This cooperative path seems to rely on 32 threads == 1 full subgroup, but I do not see a matching required-subgroup-size request when the cpy_f32_tbq* / cpy_f32_pq* and set_rows_* pipelines are created. On devices with 8- or 16-lane subgroups, subgroupAdd() here only reduces within each subgroup and subgroupBallot() only packs part of the block, so both the norm/correction reductions and the QJL bit packing become partial. Is there a reason this is guaranteed to run only on subgroup-size-32 hardware?
This was introduced by the optimization in 45d3b80: the shader is only correct when gl_SubgroupSize == gl_WorkGroupSize.x == 32, which does not hold on all hardware (e.g. Intel Arc).
Working on a generic [[unroll]] + spec-constant shader that should compile to similar bytecode for the 32-lane group size when optimizations are enabled (with minor cosmetic differences).
d994c9b adds the generic shader and tests. A software Vulkan implementation is used to check that group-size variants other than 32 or 64 are accurate against the CPU version.
Verified on the 5090 box that neither PPL nor tokens/s is affected by the change. Since the testing script reuses the same texts, PPL is exactly the same; tok/s is within noise or very close.
Before:
[7/45, ETA 9m45s] Running: K=tbq3_0 V=pq3_0 (coopmat1, large) ...
tg=183.52±2.10 t/s
[4/45, ETA 9m06s] Running: K=pq3_0 V=pq3_0 (coopmat1, large) ...
tg=215.74±0.43 t/s
[5/45, ETA 9m34s] Running: K=pq4_0 V=pq4_0 (coopmat1, large) ...
tg=208.72±0.80 t/s
[15/45, ETA 7m45s] Running: K=pq3_0 V=pq3_0 (coopmat2, large) ...
tg=223.38±0.09 t/s
K=tbq3_0 V=pq3_0 PPL = 5.8203 (sweep±0.5987, chunk±0.2202) (1.93±0.12s)
K=pq3_0 V=pq3_0 PPL = 5.8461 (sweep±0.5701, chunk±0.2201) (1.84±0.13s)
After:
[7/45, ETA 1m57s] Running: K=tbq3_0 V=pq3_0 (coopmat1, mid) ...
tg=182.05±3.79 t/s
[4/45, ETA 2m06s] Running: K=pq3_0 V=pq3_0 (coopmat1, mid) ...
tg=215.16±1.36 t/s
[5/45, ETA 2m03s] Running: K=pq4_0 V=pq4_0 (coopmat1, mid) ...
tg=209.44±1.81 t/s
[15/45, ETA 1m33s] Running: K=pq3_0 V=pq3_0 (coopmat2, mid) ...
tg=224.63±0.11 t/s
K=tbq3_0 V=pq3_0 PPL = 5.8203 (sweep±0.5987, chunk±0.2202) (2.00±0.12s)
K=pq3_0 V=pq3_0 PPL = 5.8461 (sweep±0.5701, chunk±0.2201) (1.88±0.13s)
=== Subgroup coverage summary ===
┌────────────────┬───────────────────────────────────────┬───────────────────────────────┬────────────────────────────────────┐
│ Leg │ Subgroup size │ NSG │ Result │
├────────────────┼───────────────────────────────────────┼───────────────────────────────┼────────────────────────────────────┤
│ native GPU │ device default (>=32 on typical GPUs) │ 1 (fast path on typical GPUs) │ PASSED: ran=24 skipped=24 failed=0 │
│ lavapipe W=128 │ 4 │ 8 (stitch) │ PASSED: ran=16 skipped=32 failed=0 │
│ lavapipe W=256 │ 8 │ 4 (stitch) │ PASSED: ran=16 skipped=32 failed=0 │
│ lavapipe W=512 │ 16 │ 2 (stitch) │ PASSED: ran=16 skipped=32 failed=0 │
└────────────────┴───────────────────────────────────────┴───────────────────────────────┴────────────────────────────────────┘
==========================================
All checks passed.
==========================================
On 5acb3d5 I added additional "masking" software shaders to test the behaviour of varying group sizes; since this is just for experimentation (to exercise group sizes the hardware does not natively support), it will be reverted. All variants perform very similarly in GB/s, and sometimes a smaller WG configuration surprisingly outperforms larger ones (could be noise).
=== pq3_0 huge ===
wg | status | nmse(g v c) | nmse(g v s) | ms/iter | GB/s
32(prod) | OK | 2.189e-08 | 3.398e-02 | 0.088 | 761.99
2 | OK | 2.189e-08 | 3.398e-02 | 0.090 | 742.52
4 | OK | 2.189e-08 | 3.398e-02 | 0.093 | 721.45
8 | OK | 2.189e-08 | 3.398e-02 | 0.090 | 746.73
16 | OK | 2.189e-08 | 3.398e-02 | 0.092 | 732.50
cpu | REF | - | - | 127.336 | 0.53
sorted by ms/iter (informational; see header):
wg=32(prod) 0.088 ms 761.99 GB/s speedup vs CPU = 1447.00x
wg=2 0.090 ms 742.52 GB/s speedup vs CPU = 1414.84x
wg=8 0.090 ms 746.73 GB/s speedup vs CPU = 1414.84x
wg=16 0.092 ms 732.50 GB/s speedup vs CPU = 1384.09x
wg=4 0.093 ms 721.45 GB/s speedup vs CPU = 1369.20x
cpu (ref) 127.336 ms 0.53 GB/s (baseline)
=== pq3_0_64 huge ===
wg | status | nmse(g v c) | nmse(g v s) | ms/iter | GB/s
32(prod) | OK | 4.587e-08 | 3.343e-02 | 0.147 | 455.67
2 | OK | 4.587e-08 | 3.343e-02 | 0.147 | 456.39
4 | OK | 4.587e-08 | 3.343e-02 | 0.154 | 435.99
8 | OK | 4.587e-08 | 3.343e-02 | 0.156 | 430.11
16 | OK | 4.587e-08 | 3.343e-02 | 0.153 | 438.12
cpu | REF | - | - | 129.420 | 0.52
sorted by ms/iter (informational; see header):
wg=32(prod) 0.147 ms 455.67 GB/s speedup vs CPU = 880.41x
wg=2 0.147 ms 456.39 GB/s speedup vs CPU = 880.41x
wg=16 0.153 ms 438.12 GB/s speedup vs CPU = 845.88x
wg=4 0.154 ms 435.99 GB/s speedup vs CPU = 840.39x
wg=8 0.156 ms 430.11 GB/s speedup vs CPU = 829.62x
cpu (ref) 129.420 ms 0.52 GB/s (baseline)
if [ ${#KS[@]} -gt 0 ] || [ ${#VS[@]} -gt 0 ]; then
    # --ks / --vs override: run the Cartesian product. Missing side defaults to the
    # set supplied on the other side (so e.g. --vs f16 on its own sweeps all built-in K:f16 pairs).
    if [ ${#KS[@]} -eq 0 ]; then
In --no-fa mode the script still documents the scalar-path sweep as "only test K quantizations with V=f16", but this override branch now auto-fills the missing side with all cache types. For example, --no-fa --ks tbq3_0 will expand to tbq3_0:{f16,q8_0,q4_0,pq3_0,...}, and the first quantized-V row aborts at runtime with V cache quantization requires flash_attn. Because run_perplexity_once() returns non-zero under set -e, that stops the whole sweep instead of running the intended K-only comparison. Should the auto-filled side be clamped back to f16 whenever FA_FLAG=off?
case GGML_TYPE_Q8_0:
case GGML_TYPE_TQ2_0:
case GGML_TYPE_TQ1_0:
case GGML_TYPE_TBQ3_0:
This still looks too broad for GGML_OP_MUL_MAT_ID: the support predicate now whitelists the TBQ/PQ types here, but I do not see matching *_id pipelines being generated for them. ggml_vk_get_dequantize_mul_mat_vec_id() only populates the older quant types, and ggml_vk_get_mul_mat_mat_id_pipeline() still asserts if the selected pipeline_dequant_mul_mat_mat_id[src0_type] entry is empty. Since TBQ/PQ are KV-cache types this may be hard to hit in normal llama inference, but for custom graphs / backend-op surfaces this still looks like we advertise support without a backend implementation behind it.
Issue 1 — Hadamard rotation engages on every quantized KV cache type, not just TBQ/PQ

This is the last behavioural concern from the original review (§2.3) that I haven't seen addressed.
Attention scores remain mathematically equivalent at infinite precision (R is orthogonal, applied symmetrically to Q and K, and undone on the output for V), so end-to-end PPL should stay within the same CI, but:
Suggested fix (~10 lines) — narrow both predicates to the 8 TBQ/PQ types:

```cpp
auto is_tbq_pq = [](ggml_type t) {
    switch (t) {
        case GGML_TYPE_TBQ3_0:    case GGML_TYPE_TBQ4_0:
        case GGML_TYPE_PQ3_0:     case GGML_TYPE_PQ4_0:
        case GGML_TYPE_TBQ3_0_64: case GGML_TYPE_TBQ4_0_64:
        case GGML_TYPE_PQ3_0_64:  case GGML_TYPE_PQ4_0_64:
            return true;
        default:
            return false;
    }
};

const bool can_rotk = !hparams.is_n_embd_k_gqa_variable() &&
                      hparams.n_embd_head_k % 64 == 0 &&
                      is_tbq_pq(mctx_cur->type_k());
```

Same change in the iSWA builder. Happy to push this as a PR against
// expects dim01-contiguous quantized src0 and would assert. supports_op
// advertises these cases as supported via has_quant_f16_cpy, so we must
// keep them on the GPU here rather than fall back to CPU.
} else if ((dst->ne[1] == 1 || (dst->ne[1] <= mul_mat_vec_max_cols && src1->ne[2] * src1->ne[3] == 1)) &&
I can still reproduce a latest-head Vulkan correctness bug here on the NVIDIA coopmat2 box. This branch now sends non-dim01-contiguous quantized src0 to the matrix path when n is small, but the standalone TBQ QJL correction is still gated below to _64 or ne11 > mul_mat_vec_max_cols (ggml-vulkan.cpp around the is_tbq_d128_dispatch / ne11 > mul_mat_vec_max_cols check). That leaves a hole for _128 tbq3_0 / tbq4_0 with small n: they avoid the vec kernel, but also skip the Stage-2 correction pass.
Concrete repro on qvac-dev-linux-x64 (RTX 5090, VK_NV_cooperative_matrix2) against the current PR head: MUL_MAT(type_a=tbq3_0,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3],k_v=0,o=1) is reported as supported on Vulkan0, but the correctness run fails with ERR = 0.058072511 > 0.000500000. The analogous control case with type_a=pq3_0 and the same shape passes.
That strongly suggests the non-contiguous small-n matrix path is still missing the TBQ Stage-2 QJL correction for _128 blocks.
Issue 3 — No
Issue 4 — Thread-0 bottleneck in
sycl-fp16 seems to fail now: https://github.com/tetherto/qvac-fabric-llm.cpp/actions/runs/24902725692/job/72924264252?pr=115
Here it was passing before: https://github.com/tetherto/qvac-fabric-llm.cpp/actions/runs/24771942879/job/72480398192
Possibly: the last successful sycl-fp16 run on this PR was Apr 22 (run 24771942879, commit 45d3b80). It broke today purely because
Fix a latent correctness bug in the TurboQuant / PolarQuant copy_to_quant
cooperative shader that silently produces wrong bytes on any device whose
gl_SubgroupSize is less than the 32-thread workgroup (Intel Xe/Arc at 8/16,
ARM Mali 4/8/16, some Adreno configurations). Make the path cover every
supported subgroup size, plumb a runtime knob for testing, and add a
dedicated test suite with both real-hardware and software-Vulkan coverage.
Motivation
----------
The original copy_to_quant.comp TBQ/PQ path uses subgroupAdd() for the
per-block norm reductions and subgroupBallot() for the QJL sign-bit sketch,
assuming gl_SubgroupSize == 32 (= the workgroup size). On devices where the
native subgroup is smaller, those ops reduce only within a subgroup, not the
whole workgroup, so each subgroup sees its own partial sum and the output
bytes become whatever the first-subgroup partial happened to produce. The
SET_ROWS path has the same issue. The bug does not reproduce on most
production GPUs (NVIDIA fixed-32, AMD RDNA 32/64, Apple 32) but bites Intel
and several mobile GPUs.
Shader changes (copy_to_quant.comp)
-----------------------------------
* New specialization constant SG_SIZE at constant_id = 1 (slot 0 is already
used by generic_binary_head.glsl's `norepeat` in the SET_ROWS path).
Defaults to 32 so hosts that pass no spec info get the original shader.
* TQ_WG fixed at 32 (the workgroup size); NSG = TQ_WG / SG_SIZE is the
number of subgroups per workgroup.
* New helper tq_wg_add(x): if NSG == 1 (SG_SIZE >= TQ_WG) returns
subgroupAdd(x) -- identical to the original fast path and
dead-code-eliminated by spec-constant folding; if NSG > 1 the per-
subgroup subgroupAdd results are written to shared memory (tq_sh_red)
and stitched with an [[unroll]]-ed sum. Replaces every subgroupAdd() in
the TBQ/PQ/norm-correction paths.
* QJL sign-bit pack: when SG_SIZE >= TQ_WG the original subgroupBallot
fast path runs; when SG_SIZE < TQ_WG it falls back to atomicOr into a
shared uint array and a serial write-out. Same fast-path guard lets
specialization fold the slow branch away when SG_SIZE == 32.
* SG_SIZE > TQ_WG (e.g. AMD wave64 with WG=32) is treated as NSG == 1
via clamp(SG_SIZE, TQ_WG) in tq_wg_add, so those devices take the fast
path even though half the wave is masked off.
Host plumbing (ggml-vulkan.cpp)
-------------------------------
* vk_device_struct grows a tbq_copy_sg_size field (0 = no override).
* Device init reads GGML_VK_TBQ_COPY_SG_SIZE from env, validates against
{4, 8, 16, 32, 64} intersected with the device's
[subgroup_min_size, subgroup_max_size], and emits a structured
"tbq_copy_sg_size_status requested=R applied=A reason=X" line so tests
can tell whether the override was applied or rejected (distinct from
success/failure of the run itself).
* ggml_vk_load_shaders picks the (SG_SIZE spec const, requiredSubgroupSize)
pair used for every CPY-to-quant and SET_ROWS-to-quant pipeline:
- if the env override is set: that value
- else if the device supports size control: mul_mat_subgroup_size
- else: 0 (shader default SG_SIZE=32, no required size) -- matches
pre-patch behaviour on drivers without VK_EXT_subgroup_size_control.
The two-element spec-const vector is {0, SG_SIZE} for the plain CPY
path (slot 0 is ignored by generic_unary_head.glsl) and {1, SG_SIZE}
for SET_ROWS (slot 0 is `norepeat`, always 1).
* Adds a device-selection opt-in GGML_VK_ALLOW_CPU_DEVICES=1 so tests can
pick up software Vulkan ICDs (lavapipe, SwiftShader) that ggml-vulkan
normally filters out. Production code never sets this env var and the
behaviour is unchanged when it isn't set.
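The env-override validation rule above can be sketched as a small pure function. `validate_tbq_sg_override` is an illustrative name; the real code lives in device init in ggml-vulkan.cpp:

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Accept the requested subgroup size only if it is one of {4, 8, 16, 32, 64}
// AND falls inside the device's [subgroup_min_size, subgroup_max_size] range.
// Returns 0 (no override) when rejected, mirroring the "applied=0" status.
inline uint32_t validate_tbq_sg_override(uint32_t requested,
                                         uint32_t dev_min, uint32_t dev_max) {
    for (uint32_t allowed : {4u, 8u, 16u, 32u, 64u}) {
        if (requested == allowed && requested >= dev_min && requested <= dev_max) {
            return requested; // applied
        }
    }
    return 0; // rejected: fall back to the device default
}
```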
New test (tests/test-copy-tbq-subgroups.cpp + CMakeLists)
---------------------------------------------------------
Self-spawning C++ test that for each (SG in {0, 4, 8, 16, 32, 64}, type,
shape) triple runs GPU quantize, compares against a CPU
ggml_quantize_chunk reference, and reports byte-mismatch + dequant NMSE
+ throughput. Key design choices:
* Self-spawn (popen of --child N with a different
GGML_VK_TBQ_COPY_SG_SIZE value per child) because the env var is
consumed once at device init and can only be changed across processes.
* Parses the structured status line from the backend to distinguish
"applied" from "rejected" rows. Rejected rows are labelled
SKIP-<reason> in the per-case table and excluded from the
NMSE-spread assertion (they are duplicates of sg=0 and don't add
independent coverage). Prior phrasing that labelled them OK was
misleading.
* --types comma-separated filter keeps the default CI run fast by
iterating only a subset of TBQ/PQ types.
* Shared pass/fail rule: nmse(gpu vs cpu) <= 1e-6 for every applied
SG; the per-case table stays OK on the legs that couldn't exercise
the stitch path on the host GPU.
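Parsing the structured status line is straightforward. A sketch assuming the "tbq_copy_sg_size_status requested=R applied=A reason=X" format described above; the example reason token in the test is hypothetical, since the exact reason strings are backend-defined:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Parsed form of the backend's status line; reason is a single token.
struct sg_status {
    int  requested = -1;
    int  applied   = -1;
    char reason[64] = {0};
};

// Returns true only when all three fields were matched.
inline bool parse_sg_status(const std::string &line, sg_status &out) {
    return std::sscanf(line.c_str(),
                       "tbq_copy_sg_size_status requested=%d applied=%d reason=%63s",
                       &out.requested, &out.applied, out.reason) == 3;
}
```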
Cross-subgroup-size coverage via lavapipe (tests/test-turboquant.sh)
--------------------------------------------------------------------
Real desktop GPUs (NVIDIA, AMD RDNA, Apple, most Adreno) have
minSubgroupSize >= 32, so VK_EXT_subgroup_size_control cannot request the
smaller subgroups the stitch path was written for. To actually exercise
NSG > 1 in CI, the script now also runs the test under lavapipe (Mesa's
CPU Vulkan driver) at LP_NATIVE_VECTOR_WIDTH in {128, 256, 512}, which
gives native subgroupSize {4, 8, 16} respectively and therefore covers
every distinct NSG branch the shader supports:
LP_NATIVE_VECTOR_WIDTH | lavapipe SG | NSG (= TQ_WG / SG)
-----------------------+-------------+--------------------
128 | 4 | 8 (8-way stitch)
256 | 8 | 4 (4-way stitch)
512 | 16 | 2 (2-way stitch)
Combined with the native-GPU leg (NSG=1, fast path), this gives full
coverage of the helper's {1, 2, 4, 8} NSG branches on any host.
Usage and modes
---------------
tests/test-turboquant.sh # short mode (default): CI-friendly
tests/test-turboquant.sh --full # all TBQ/PQ types, full matrix
Short mode restricts the SG-coverage legs to tbq3_0 / pq3_0 / *_64 to keep
default CI runtime bounded; full mode covers all 8 TBQ/PQ types. Both
modes render a Unicode-boxed summary table at the end covering every
subgroup-coverage leg that ran.
Summary
Implements TurboQuant KV cache quantization (Zandieh et al., ICLR 2026) for CPU and Vulkan backends with full Flash Attention support. Compresses KV cache to 3.25-4.25 bits per value, enabling ~4-5x larger context windows on the same hardware.
Paper: https://arxiv.org/pdf/2504.19874
Community discussion:
Related upstream PR: llama : rotate activations for better quantization ggml-org/llama.cpp#21038 (graph-level rotation for existing quant types)
Recommended configurations:
- `K=pq3_0 V=pq3_0` — codebook-only, no QJL overhead. Minimal PPL/speed loss at 3.25 bpw with a small retrieval quality trade-off on long contexts.
- `K=tbq3_0 V=pq3_0` — QJL-corrected keys with codebook-only values. Best retrieval accuracy at 3.75 avg bpw, with a moderate speed cost from QJL correction in the FA shader.

Features
- New KV cache types: `tbq3_0`, `tbq4_0`, `pq3_0`, `pq4_0` (and `_64` variants)
- `pq3_0` etc.: internal type auto-selects
- `copy_to_quant` Vulkan path for TBQ/PQ (faster KV writes)

How does TurboQuant work?
Random rotations spread values evenly across coordinates, preventing concentration on a few axes where zero-coordinates waste bits. In high dimensions, the marginal distribution of each coordinate of a unit-sphere vector follows a Beta distribution that converges to N(0, 1/d) as d grows. The algorithm exploits this by placing Lloyd-Max codebook centroids at optimal positions for this known distribution, minimizing MSE reconstruction error. Centroids are found by solving a continuous 1-dimensional k-means problem.
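The Lloyd-Max step can be demonstrated empirically: 1-D k-means on samples from N(0, 1) converges to the MSE-optimal centroids for that distribution (for k = 2 the optimum is ±√(2/π) ≈ 0.7979). This is a sketch of the idea, not the codebook generator used in the PR:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// 1-D k-means (Lloyd's algorithm) on N(0,1) samples: assignment to the
// nearest centroid, then centroid update to the mean of its cluster.
inline std::vector<double> lloyd_max_1d(int k, int n, int iters, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::vector<double> x(n);
    for (double &v : x) v = gauss(rng);
    std::sort(x.begin(), x.end());

    std::vector<double> c(k);
    for (int i = 0; i < k; ++i) c[i] = x[(i + 1) * n / (k + 1)]; // quantile init

    for (int it = 0; it < iters; ++it) {
        std::vector<double> sum(k, 0.0);
        std::vector<int>    cnt(k, 0);
        for (double v : x) {
            int best = 0;
            for (int i = 1; i < k; ++i)
                if (std::abs(v - c[i]) < std::abs(v - c[best])) best = i;
            sum[best] += v;
            cnt[best]++;
        }
        for (int i = 0; i < k; ++i)
            if (cnt[i] > 0) c[i] = sum[i] / cnt[i];
    }
    std::sort(c.begin(), c.end());
    return c;
}
```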
An additional QJL correction step (Stage 2) reduces bias in dot-product estimation. It quantizes the residual error from Stage 1 to 1-bit by storing only the signs of the residual vector after applying a random rotation (Hadamard × sign diagonal). Since only signs are stored (no centroid rounding), the paper proves this yields an unbiased dot-product estimator. This step is important for maintaining retrieval quality on long contexts.
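The unbiasedness of the sign-sketch dot-product estimator can be checked numerically. This sketch swaps the Hadamard-times-sign-diagonal rotation for plain Gaussian random projections (both serve as random rotations for this purpose; it is a demonstration of the estimator, not the shader's code path). For g ~ N(0, I), E[sign(⟨g,k⟩)·⟨g,q⟩] = √(2/π)·⟨q,k⟩/‖k‖, so ‖k‖·√(π/2)·mean_i sign(⟨g_i,k⟩)⟨g_i,q⟩ is an unbiased estimate of ⟨q,k⟩:

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Monte-Carlo estimate of <q,k> from 1-bit sketches of k: only the sign of
// each projection of k is kept, as in the QJL correction described above.
inline double qjl_estimate(const std::vector<double> &q,
                           const std::vector<double> &k,
                           int m, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    const size_t d = q.size();

    double k_norm = 0.0;
    for (double v : k) k_norm += v * v;
    k_norm = std::sqrt(k_norm);

    double acc = 0.0;
    for (int i = 0; i < m; ++i) {
        double gk = 0.0, gq = 0.0;
        for (size_t j = 0; j < d; ++j) {
            const double g = gauss(rng);
            gk += g * k[j];
            gq += g * q[j];
        }
        acc += (gk >= 0.0 ? 1.0 : -1.0) * gq; // only sign(<g,k>) is stored
    }
    const double pi = std::acos(-1.0);
    return k_norm * std::sqrt(pi / 2.0) * acc / m;
}
```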
Optimization details
Hadamard instead of dense rotation: Rotations based on Hadamard use the butterfly pattern in O(d log d) instead of O(d²). Hadamard is deterministic, but applying a random sign diagonal preserves randomness while remaining orthogonal and invertible.
Dense rotation for K/V/Q at graph level, FHT in shader for QJL: At block sizes d=64/128, O(d²) is negligible and utilizes better GPU parallelism for the graph-level rotation. The butterfly FHT is used inside the Flash Attention shader for the QJL projection, avoiding the need to copy a dense matrix into the shader (which would add memory pressure). Since there is no Q cache, the QJL projection of Q must be recomputed every step to apply corrections against the 1-bit signs stored in K blocks.
(chart comparing q4_0, pq3_0, pq4_0, tbq3_0, tbq4_0)

Implementation overview
- `vulkan-shaders-gen.cpp` — orchestrates SPIR-V compilation of all variant combos
- `ggml-vulkan.cpp` — host-side: creates pipeline objects, dispatches compute

TurboQuant KV cache shader flow (TBQ/PQ is ONLY a KV cache type, never model weights):
STEP 1: Write to cache (same for all paths)
- `copy_to_quant.comp`: float K/V → TBQ/PQ quantized blocks (QJL sketch `qjl[]`, `d_r`)

STEP 2: Read cache at attention time (paths diverge here)
PATH A: Scalar Flash Attention (broad HW support, baseline)
- `flash_attn.comp` — `types.glsl`, `tq_utils.comp` (via `flash_attn_base.glsl`), `dequant_funcs.glsl`

PATH B: Cooperative matrix v1 Flash Attention (KHR, cross-vendor)
- `flash_attn_cm1.comp` — `coopMatMulAdd` for K·Q^T (subgroup-scope 16×16 tiles); `sfsh[]` after coopmat store

PATH C: Cooperative matrix v2 Flash Attention (NV only, most efficient)
- `flash_attn_cm2.comp` — `coopMatLoadTensorNV` with decode callback (dequant-on-load, no shared memory staging); `coopMatMulAdd` (workgroup-scope matrices); `data_k[]` with hardcoded byte offsets per type

PATH D: No-FA fallback, small N (MUL_MAT with N ≤ 8, e.g. decode)
- `mul_mat_vec_tbq3_0.comp` / `mul_mat_vec_tbq4_0.comp`

PATH E: No-FA fallback, large N (K·Q `MUL_MAT` with N > 8, e.g. prefill)
- Applies under `-fa off` with a TBQ/PQ K cache. Only the K·Q matmul is affected: V stays f16 under `-fa off` (upstream guard), so V·A stays on the existing f16 path.
- `mul_mm.comp` runs with TBQ/PQ `load_a_to_shmem` — centroid dequant × `d` into shared memory, then generic tiled matmul (scalar / cm1 pipelines; cm2 falls through to cm1/scalar since no `_mat_f16cm2` shader exists for TBQ/PQ).
- `mul_mm_tbq_qjl_correction.comp` is dispatched after the main matmul as an additive pass — one workgroup per `(row, col, batch)`, `QUANT_K` threads running the same Walsh–Hadamard + QJL dot product as the vec shader, accumulating `d_r · √(π/2) / QUANT_K · sum_qjl(H(B))` into `D`.
- PQ types carry no QJL sketch (`qjl[]` / `d_r`), so Stage 1 alone is exact.
- … `not supported` and falls back to CPU.
- `supports_op` claimed TBQ/PQ `MUL_MAT` on cm2 devices (RTX 5090) but had no pipeline behind it, so the correctness run segfaulted. `tests/test-backend-ops.cpp` now covers all 8 TBQ/PQ types × `n ∈ {1,8,16,32}` as a repro.
- Non-contiguous `src0` (permuted layouts) is now routed to the matrix path as well, so TBQ/PQ `MUL_MAT` works regardless of `src0` stride pattern.

Example usage
Works transparently with both head_dim=128 (Llama-3.1, Qwen, Mistral) and head_dim=64 (Llama-3.2-1B/3B) — the right block size is auto-selected.
Results / testing
Please see Asana for latest available data: https://app.asana.com/1/45238840754660/project/1212638335655939/task/1214143691877486
Will comment here with a public report when results can be shared.
PR for testing integration on LLM Addon: tetherto/qvac#1564
Limitations
- Quantized V under `-fa off` is not supported by this PR. Upstream `llama_init_from_model` rejects quantized V when flash attention is disabled ("V cache quantization requires flash_attn"), and that guard is intentionally left in place. The `-fa off` K·Q MUL_MAT fix in this PR would extend cleanly to A·V for a quantized V as well, but the `v_trans` V-cache layout used under `-fa off` is populated by `ggml_set_rows` with row_size=1, which corrupts any `blck_size > 1` type at write time (reproducible on CPU as well, independent of backend). Fixing that is a KV-cache refactor out of scope here; the guard will be revisited once that lands.

TBQ / PQ Vulkan support matrix
What runs on GPU vs. is refused by the context, across FA on/off on dense and MoE models. The MoE-KV-cache rows behave the same as dense because attention itself is plain `MUL_MAT`/`FLASH_ATTN_EXT`, not `MUL_MAT_ID`; MoE routing (`MUL_MAT_ID`) only applies to the FFN weights, which are never stored as TBQ/PQ. (`mul_mm.comp` + QJL correction; V·A on the existing f16 path)

Notes:
- `_64` block variants (`tbq*_0_64`, `pq*_0_64`) have their own pipelines, codebooks, and sign tables.
- `llama-quantize` has no TBQ/PQ target, and no GGUF stores FFN experts in those types, so `MUL_MAT_ID` never receives TBQ/PQ `src0`. Attention in MoE models is a plain `MUL_MAT`/`FLASH_ATTN_EXT` and therefore falls under the "KV cache" rows above.

Remaining work