Improve utilization (and tg t/s) by reducing K_QUANTS_PER_ITERATION to 1 on DMMV path by malsbat · Pull Request #25063 · ggml-org/llama.cpp

malsbat · 2026-06-26T18:42:51Z

Overview

The DMMV path recently changed the warp size from 32 to 16. Combining the reordered layout with K_QUANTS_PER_ITERATION reduced to 1 increases the work group size from 16 to 32, which reduces stalls and improves utilization.

This gives a significant boost to token-generation throughput (tg t/s): about 1.54x for Q4_K, and up to 1.85x for Q5_K in the single-run smoke test below.
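The work-group arithmetic described in the overview can be sketched as follows. This is a minimal illustration, not the llama.cpp source: the helper names are hypothetical, and it assumes the k-quant DMMV launcher gives each work group `2 / K_QUANTS_PER_ITERATION` rows, one sub-group per row.

```cpp
// Hypothetical stand-in for the DMMV sub-group ("warp") size after the
// 32 -> 16 change mentioned in the overview.
constexpr int kSubGroupSize = 16;

// Assumption: each work group handles ny = 2 / K_QUANTS_PER_ITERATION rows.
constexpr int rows_per_work_group(int k_quants_per_iteration) {
    return 2 / k_quants_per_iteration;
}

// Work-group size = sub-group size * rows handled per work group.
constexpr int work_group_size(int k_quants_per_iteration) {
    return kSubGroupSize * rows_per_work_group(k_quants_per_iteration);
}

// Work groups needed to cover nrows output rows (ceiling division).
constexpr int num_work_groups(int nrows, int k_quants_per_iteration) {
    return (nrows + rows_per_work_group(k_quants_per_iteration) - 1) /
           rows_per_work_group(k_quants_per_iteration);
}

// K_QUANTS_PER_ITERATION = 2 -> 16 work-items per group.
static_assert(work_group_size(2) == 16, "KQPI=2 gives a 16-wide work group");
// K_QUANTS_PER_ITERATION = 1 -> 32 work-items per group.
static_assert(work_group_size(1) == 32, "KQPI=1 gives a 32-wide work group");
```

Under these assumptions, halving K_QUANTS_PER_ITERATION doubles the work-items per group and halves the number of groups launched for the same row count.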

Additional information

Smoke tests on B70 comparing three configurations: baseline (master branch @ 3fc4e10), KQPI=1 (K_QUANTS_PER_ITERATION=1) with the AOS layout, and KQPI=1 with the SOA (reordered) layout.

export GGML_SYCL_PRIORITIZE_DMMV=1
for model in Qwen3.5-27B-Q2 Gemma-4-26B-Q3 Qwen3.5-27B-Q4 Qwen3.5-27B-Q5 Qwen3.5-27B-Q6; do ./build/bin/llama-bench -p 64 -n 16 -r 1 -ngl 999 -dev SYCL0 -m /models/${model}.gguf; done

model	test	baseline t/s	KQPI=1 t/s	KQPI=1 reordered t/s	speedup (reordered / baseline)
gemma4 26B.A4B Q3_K - Medium	pp64	294.82 ± 0.00	292.14 ± 0.00	298.22 ± 0.00	1.023
gemma4 26B.A4B Q3_K - Medium	tg16	42.09 ± 0.00	37.99 ± 0.00	40.90 ± 0.00	0.972
qwen35 27B Q4_K - Medium	pp64	188.58 ± 0.00	186.15 ± 0.00	189.23 ± 0.00	1.003
qwen35 27B Q4_K - Medium	tg16	13.27 ± 0.00	12.89 ± 0.00	20.41 ± 0.00	1.538
qwen35 27B Q5_K - Medium	pp64	231.61 ± 0.00	227.71 ± 0.00	231.78 ± 0.00	1.001
qwen35 27B Q5_K - Medium	tg16	12.16 ± 0.00	8.68 ± 0.00	22.53 ± 0.00	1.853
qwen35 27B Q6_K	pp64	216.41 ± 0.00	213.56 ± 0.00	218.05 ± 0.00	1.008
qwen35 27B Q6_K	tg16	10.97 ± 0.00	5.42 ± 0.00	14.84 ± 0.00	1.353

More benchmarks with different models.

for model in Qwen3.5-9B-Q4 Qwen3.5-9B-Q8 Gemma-4-31B-Q6; do ./build/bin/llama-bench -p 128 -n 32 -ngl 999 -m /models/${model}.gguf; done

model	test	baseline t/s	KQPI=1 t/s	KQPI=1 reordered t/s	speedup (reordered / baseline)
qwen35 9B Q4_K - Medium	pp128	1110.94 ± 12.29	1076.49 ± 9.31	1118.67 ± 8.56	1.007
qwen35 9B Q4_K - Medium	tg32	41.14 ± 0.17	36.73 ± 0.18	60.57 ± 1.19	1.472
qwen35 9B Q8_0	pp128	1234.18 ± 14.57	1234.70 ± 15.98	1233.31 ± 17.19	0.999
qwen35 9B Q8_0	tg32	38.76 ± 0.04	38.76 ± 0.04	38.76 ± 0.04	1.000
gemma4 31B Q6_K	pp128	331.03 ± 2.62	325.97 ± 1.89	332.20 ± 1.60	1.004
gemma4 31B Q6_K	tg32	9.26 ± 0.02	4.42 ± 0.00	12.82 ± 0.07	1.384
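The unlabeled last column in the tables above appears to be the reordered KQPI=1 throughput divided by the baseline throughput. A minimal sketch of that arithmetic, with hypothetical helper names, reports the ratio in thousandths to avoid comparing floating-point values directly:

```cpp
// Ratio of candidate throughput to baseline throughput (both in t/s).
constexpr double speedup(double baseline_tps, double candidate_tps) {
    return candidate_tps / baseline_tps;
}

// The same ratio, rounded to the nearest thousandth and returned as an
// integer number of thousandths (1538 means 1.538x).
constexpr long speedup_milli(double baseline_tps, double candidate_tps) {
    return static_cast<long>(speedup(baseline_tps, candidate_tps) * 1000.0 + 0.5);
}

// qwen35 27B Q4_K tg16: 13.27 t/s baseline, 20.41 t/s reordered.
static_assert(speedup_milli(13.27, 20.41) == 1538, "Q4_K tg16 ratio");
// gemma4 31B Q6_K tg32: 9.26 t/s baseline, 12.82 t/s reordered.
static_assert(speedup_milli(9.26, 12.82) == 1384, "Q6_K tg32 ratio");
```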

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, to navigate code

malsbat added 4 commits June 24, 2026 23:15

sycl: add supported types to ggml_sycl_supports_reorder_dmmv

0c4df1c

The reordered feature is implemented in ggml_sycl_op_dequantize_mul_mat_vec, but gated by ggml_sycl_supports_reorder_dmmv. This commit fixes the gate.

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

sycl: set K_QUANTS_PER_ITERATION=1 to improve utilization

0fab1f3

When combined with opening the reorder gate, this improves GPU utilization on B70, giving a significant boost to tg t/s.

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

sycl: replace QK_WARP_SIZE with WARP_SIZE for QK_5

af50310

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

sycl: add missing types to ggml_backend_sycl_buffer_init_tensor

47a9754

Without this, the extra field is not allocated and the reorder path will not take effect.

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
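The first and last commits describe two conditions that must both hold before the reorder path is used: the type gate must accept the tensor type, and the extra field must have been allocated at buffer init. A minimal sketch of that interaction follows. The enum, the function names, and the particular types accepted are illustrative assumptions, not the llama.cpp definitions or the set of types this PR enables.

```cpp
// Hypothetical stand-in for ggml_type with a few illustrative members.
enum class QuantType { Q4_0, Q4_K, Q8_0 };

// Sketch of a type gate in the spirit of ggml_sycl_supports_reorder_dmmv.
// Which types return true here is an assumption made for illustration only.
constexpr bool supports_reorder_dmmv(QuantType type) {
    return type == QuantType::Q4_0 || type == QuantType::Q4_K;
}

// The reorder path is taken only when the gate accepts the type AND the
// per-tensor extra data was allocated when the buffer was initialized.
constexpr bool use_reorder_path(QuantType type, bool extra_allocated) {
    return supports_reorder_dmmv(type) && extra_allocated;
}

// Gate open and extra allocated: reorder path is used.
static_assert(use_reorder_path(QuantType::Q4_K, true), "gate and extra both set");
// Gate open but extra missing: falls back to the non-reordered path.
static_assert(!use_reorder_path(QuantType::Q4_K, false), "extra not allocated");
```

In this sketch, opening the gate for a type has no effect unless the init step also covers that type, which is the situation the last commit message describes.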

malsbat requested a review from a team as a code owner June 26, 2026 18:42

github-actions bot added the ggml and SYCL labels Jun 26, 2026

malsbat mentioned this pull request Jun 26, 2026

sycl: add Q2_K to DMMV reorder path #25064

Draft

malsbat wants to merge 4 commits into ggml-org:master from aicss-genai:reduce-k-quants-per-iteration
