Improve utilization (and tg t/s) by reducing K_QUANTS_PER_ITERATION to 1 on DMMV path #25063

Open
malsbat wants to merge 4 commits into ggml-org:master from aicss-genai:reduce-k-quants-per-iteration

malsbat commented Jun 26, 2026

Contributor

Overview

The DMMV path recently changed the warp size from 32 to 16. Combined with the reordered layout, reducing K_QUANTS_PER_ITERATION to 1 increases the work-group size from 16 to 32, which reduces stalls and improves utilization.

This gives a significant boost to text-generation throughput (tg t/s), up to 1.53x for Q4_K.
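
The 16 to 32 change described above can be illustrated with a little arithmetic. This is a minimal sketch, not the PR's code: it assumes the k-quant DMMV launcher sizes each work-group as `2 / K_QUANTS_PER_ITERATION` rows times one warp (sub-group) of work-items, and `work_group_size` is a hypothetical helper name introduced only for this example.

```cpp
// Hypothetical helper (not from the PR): work-items per work-group, ASSUMING
// the launcher uses ny = 2 / K_QUANTS_PER_ITERATION rows per work-group and
// one warp (sub-group) of `warp_size` work-items per row.
constexpr int work_group_size(int k_quants_per_iteration, int warp_size) {
    const int ny = 2 / k_quants_per_iteration;  // rows handled per work-group
    return ny * warp_size;
}

// With a warp size of 16, KQPI=2 yields 16 work-items and KQPI=1 yields 32.
static_assert(work_group_size(2, 16) == 16, "KQPI=2 -> 16 work-items");
static_assert(work_group_size(1, 16) == 32, "KQPI=1 -> 32 work-items");
```

Under that assumption, halving K_QUANTS_PER_ITERATION doubles the number of work-items per work-group, which matches the 16 to 32 increase reported in the overview.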

Additional information

Smoke tests on B70 comparing three configurations: the baseline (master branch @ 3fc4e10), KQPI (K_QUANTS_PER_ITERATION) = 1 with the AoS layout, and KQPI=1 with the SoA layout.

```sh
export GGML_SYCL_PRIORITIZE_DMMV=1
for model in Qwen3.5-27B-Q2 Gemma-4-26B-Q3 Qwen3.5-27B-Q4 Qwen3.5-27B-Q5 Qwen3.5-27B-Q6; do
    ./build/bin/llama-bench -p 64 -n 16 -r 1 -ngl 999 -dev SYCL0 -m /models/${model}.gguf
done
```

| model | test | baseline t/s | KQPI=1 t/s | KQPI=1 reordered t/s | reordered / baseline |
| --- | --- | --- | --- | --- | --- |
| gemma4 26B.A4B Q3_K - Medium | pp64 | 294.82 ± 0.00 | 292.14 ± 0.00 | 298.22 ± 0.00 | 1.023 |
| gemma4 26B.A4B Q3_K - Medium | tg16 | 42.09 ± 0.00 | 37.99 ± 0.00 | 40.90 ± 0.00 | 0.972 |
| qwen35 27B Q4_K - Medium | pp64 | 188.58 ± 0.00 | 186.15 ± 0.00 | 189.23 ± 0.00 | 1.003 |
| qwen35 27B Q4_K - Medium | tg16 | 13.27 ± 0.00 | 12.89 ± 0.00 | 20.41 ± 0.00 | 1.538 |
| qwen35 27B Q5_K - Medium | pp64 | 231.61 ± 0.00 | 227.71 ± 0.00 | 231.78 ± 0.00 | 1.001 |
| qwen35 27B Q5_K - Medium | tg16 | 12.16 ± 0.00 | 8.68 ± 0.00 | 22.53 ± 0.00 | 1.853 |
| qwen35 27B Q6_K | pp64 | 216.41 ± 0.00 | 213.56 ± 0.00 | 218.05 ± 0.00 | 1.008 |
| qwen35 27B Q6_K | tg16 | 10.97 ± 0.00 | 5.42 ± 0.00 | 14.84 ± 0.00 | 1.353 |
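
The final column of the table is consistent with the ratio of the KQPI=1 reordered throughput to the baseline throughput. A minimal sketch of that calculation follows; `speedup` and `round3` are hypothetical helpers written for this illustration and are not part of llama-bench.

```cpp
#include <cmath>

// Hypothetical helper: speedup of a candidate configuration over the baseline,
// with both throughputs given in tokens per second.
constexpr double speedup(double baseline_tps, double candidate_tps) {
    return candidate_tps / baseline_tps;
}

// Round to three decimals, the precision used in the table's last column.
inline double round3(double x) {
    return std::round(x * 1000.0) / 1000.0;
}
```

For example, `round3(speedup(13.27, 20.41))` reproduces the 1.538 shown for the Q4_K tg16 row, and `round3(speedup(12.16, 22.53))` reproduces the 1.853 shown for Q5_K tg16.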

More benchmarks with different models.

```sh
for model in Qwen3.5-9B-Q4 Qwen3.5-9B-Q8 Gemma-4-31B-Q6; do
    ./build/bin/llama-bench -p 128 -n 32 -ngl 999 -m /models/${model}.gguf
done
```

| model | test | baseline t/s | KQPI=1 t/s | KQPI=1 reordered t/s | reordered / baseline |
| --- | --- | --- | --- | --- | --- |
| qwen35 9B Q4_K - Medium | pp128 | 1110.94 ± 12.29 | 1076.49 ± 9.31 | 1118.67 ± 8.56 | 1.007 |
| qwen35 9B Q4_K - Medium | tg32 | 41.14 ± 0.17 | 36.73 ± 0.18 | 60.57 ± 1.19 | 1.472 |
| qwen35 9B Q8_0 | pp128 | 1234.18 ± 14.57 | 1234.70 ± 15.98 | 1233.31 ± 17.19 | 0.999 |
| qwen35 9B Q8_0 | tg32 | 38.76 ± 0.04 | 38.76 ± 0.04 | 38.76 ± 0.04 | 1.000 |
| gemma4 31B Q6_K | pp128 | 331.03 ± 2.62 | 325.97 ± 1.89 | 332.20 ± 1.60 | 1.004 |
| gemma4 31B Q6_K | tg32 | 9.26 ± 0.02 | 4.42 ± 0.00 | 12.82 ± 0.07 | 1.384 |
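
For readers unfamiliar with the two layouts compared in these benchmarks, the following is a generic sketch of array-of-structures (AoS) versus structure-of-arrays (SoA, the "reordered" column) storage for quantized blocks. The `Block` type, its field sizes, `BlocksSoA`, and `to_soa` are hypothetical simplifications for illustration only; they are not the ggml block formats or the SYCL backend's reorder code.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical, simplified quantized block: one scale plus packed quants.
// AoS storage is simply a std::vector<Block>, one struct per block.
struct Block {
    float d;                          // per-block scale
    std::array<std::uint8_t, 16> qs;  // packed quantized values
};

// SoA storage: the quants of all blocks are contiguous, and the scales of
// all blocks are contiguous in a separate array.
struct BlocksSoA {
    std::vector<std::uint8_t> qs;  // blocks.size() * 16 bytes
    std::vector<float> d;          // blocks.size() scales
};

// Convert AoS (one struct per block) into SoA (one array per field).
inline BlocksSoA to_soa(const std::vector<Block>& blocks) {
    BlocksSoA out;
    out.qs.reserve(blocks.size() * 16);
    out.d.reserve(blocks.size());
    for (const Block& b : blocks) {
        out.qs.insert(out.qs.end(), b.qs.begin(), b.qs.end());
        out.d.push_back(b.d);
    }
    return out;
}
```

In the SoA form, the quants of block `i` start at byte offset `16 * i` of `qs`, so work-items that process neighboring blocks read from neighboring memory rather than striding over interleaved scales.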

Requirements

malsbat added 4 commits June 24, 2026 23:15

The reordered feature is implemented in ggml_sycl_op_dequantize_mul_mat_vec,
but gated by ggml_sycl_supports_reorder_dmmv. This commit fixes the gate.
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

When combined with opening the reorder gate, this improves GPU
utilization on B70, giving a significant boost to tg t/s.
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

Without this, the extra field is not allocated and the reorder path
will not take effect.
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

malsbat requested a review from a team as a code owner June 26, 2026 18:42
github-actions bot added the ggml and SYCL labels Jun 26, 2026