SYCL: add oneMKL GEMM flash attention for XMX-accelerated prompt proc… by johnkarlhill · Pull Request #25025 · ggml-org/llama.cpp

johnkarlhill · 2026-06-26T01:24:12Z

Adds a flash attention path that routes Q·K^T and S·V matrix multiplies
through oneMKL GEMM, enabling XMX hardware acceleration on Intel GPUs.

Motivation

The existing SYCL flash attention kernels (VEC, TILE) run entirely in
SYCL subgroup operations. On Intel Arc GPUs with XMX matrix engines
(Battlemage and later), oneMKL GEMM can process the large matmuls in
attention significantly faster — particularly at high context lengths
where the KV cache is quantized.

When it activates

KV cache is quantized (q8_0, q4_0, q4_1, q5_0, q5_1, or any K-quant)
K sequence length ≥ 1024 tokens (covers the full --batch-size)
Q sequence length ≥ 128

These thresholds route prompt processing through MKL while leaving
single-token decode to the existing TG-optimized kernels. The path is
never activated for f16/bf16 KV cache — those already perform well
with the TILE kernel and graph capture.
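
As a rough illustration of the three conditions above, here is a minimal sketch of the dispatch predicate. The names `KvType`, `is_quantized`, and `use_mkl_fattn` are hypothetical and not taken from the PR; the real dispatcher inspects ggml tensor types and may apply further checks (for example the ALiBi and sinks limitations listed below).

```cpp
#include <cstdint>

// Hypothetical type tags for illustration; the real code would inspect ggml_type.
enum class KvType { F16, BF16, Q8_0, Q4_0, Q4_1, Q5_0, Q5_1, K_QUANT };

constexpr std::int64_t kMinKvLen = 1024; // K sequence length threshold from the description
constexpr std::int64_t kMinQLen  = 128;  // Q sequence length threshold from the description

// Everything except f16/bf16 counts as a quantized KV cache in this sketch.
constexpr bool is_quantized(KvType t) {
    return t != KvType::F16 && t != KvType::BF16;
}

// True when all three stated activation conditions hold.
constexpr bool use_mkl_fattn(KvType kv_type, std::int64_t n_kv, std::int64_t n_q) {
    return is_quantized(kv_type) && n_kv >= kMinKvLen && n_q >= kMinQLen;
}
```

With these thresholds a single-token decode step (`n_q == 1`) never qualifies, which matches the intent of leaving TG on the existing kernels.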

Implementation

All logic is in one new file, fattn-mkl.cpp (567 lines). The pipeline:

Dequantize K/V to fp16
For each KV head: pack all GQA query heads into a single fp16 buffer
Chunked KV loop (8192-token chunks):
- MKL GEMM: KQ = Q_batched × K_chunk^T
- Online softmax SYCL kernel (row-wise, with running max/sum)
- MKL GEMM: VKQ_chunk = S × V_chunk
- Accumulate: VKQ_accum += VKQ_chunk
Normalize each GQA head by KQ_sum and scatter to output
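
The chunked loop with a running max/sum can be sketched on the CPU as follows. This is a plain C++ reference under stated assumptions, not the code in fattn-mkl.cpp: the two inner loops stand in for the MKL GEMM calls, there is no 1/sqrt(d) scale or mask, everything is fp32, and the function name `chunked_attention` is made up for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// CPU reference sketch. Row-major flat buffers:
//   Q: n_q x d, K: n_kv x d, V: n_kv x dv, result: n_q x dv.
// `chunk` must be nonzero. No scale factor and no mask, to keep the example short.
std::vector<float> chunked_attention(const std::vector<float>& Q,
                                     const std::vector<float>& K,
                                     const std::vector<float>& V,
                                     std::size_t n_q, std::size_t n_kv,
                                     std::size_t d, std::size_t dv,
                                     std::size_t chunk) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<float> acc(n_q * dv, 0.0f);   // unnormalized output (VKQ_accum)
    std::vector<float> row_max(n_q, neg_inf); // running max per query row
    std::vector<float> row_sum(n_q, 0.0f);    // running sum per query row (KQ_sum)
    std::vector<float> s(chunk);              // scores of one row against one chunk

    for (std::size_t k0 = 0; k0 < n_kv; k0 += chunk) {
        const std::size_t len = std::min(chunk, n_kv - k0);
        for (std::size_t i = 0; i < n_q; ++i) {
            // Stands in for GEMM 1: s = Q[i] . K_chunk^T
            float chunk_max = neg_inf;
            for (std::size_t j = 0; j < len; ++j) {
                float dot = 0.0f;
                for (std::size_t c = 0; c < d; ++c) {
                    dot += Q[i * d + c] * K[(k0 + j) * d + c];
                }
                s[j] = dot;
                chunk_max = std::max(chunk_max, dot);
            }
            // Online softmax: rescale earlier partial results to the new max.
            const float new_max = std::max(row_max[i], chunk_max);
            const float rescale = std::exp(row_max[i] - new_max); // 0 on the first chunk
            row_sum[i] *= rescale;
            for (std::size_t c = 0; c < dv; ++c) {
                acc[i * dv + c] *= rescale;
            }
            // Stands in for GEMM 2 plus accumulate: acc += exp(s - max) * V_chunk
            for (std::size_t j = 0; j < len; ++j) {
                const float p = std::exp(s[j] - new_max);
                row_sum[i] += p;
                for (std::size_t c = 0; c < dv; ++c) {
                    acc[i * dv + c] += p * V[(k0 + j) * dv + c];
                }
            }
            row_max[i] = new_max;
        }
    }
    // Final normalization by the running sum.
    for (std::size_t i = 0; i < n_q; ++i) {
        for (std::size_t c = 0; c < dv; ++c) {
            acc[i * dv + c] /= row_sum[i];
        }
    }
    return acc;
}
```

Because earlier partial sums are rescaled whenever the running max grows, the result should not depend on the chunk size beyond floating-point rounding, which is what makes a fixed chunk such as 8192 tokens safe to use at any context length.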

GQA groups sharing a KV head are batched into single GEMM calls —
6 query heads × 1020 tokens = 6120 rows in one MKL call, amortizing
launch overhead.
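
To make the row arithmetic concrete, here is a small sketch of packing the query heads of one GQA group into a single GEMM operand. It assumes a contiguous `[head][token][d]` query layout and that the query heads of a group are consecutive (`head = kv_head * gqa + g`); both are assumptions for illustration, and `pack_gqa_queries` and `packed_rows` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Number of rows in the packed operand: one row per (query head, token) pair.
constexpr std::size_t packed_rows(std::size_t gqa, std::size_t n_tokens) {
    return gqa * n_tokens;
}

// Assumed source layout: q[head][token][d], contiguous floats.
// Returns a (gqa * n_tokens) x d row-major matrix holding the `gqa`
// query heads that share KV head `kv_head`.
std::vector<float> pack_gqa_queries(const std::vector<float>& q,
                                    std::size_t n_tokens, std::size_t d,
                                    std::size_t gqa, std::size_t kv_head) {
    std::vector<float> packed;
    packed.reserve(packed_rows(gqa, n_tokens) * d);
    for (std::size_t g = 0; g < gqa; ++g) {
        const std::size_t head = kv_head * gqa + g; // assumed grouping of heads
        const float* src = q.data() + head * n_tokens * d;
        packed.insert(packed.end(), src, src + n_tokens * d);
    }
    return packed;
}
```

For the case quoted above, `packed_rows(6, 1020)` gives 6120, so one GEMM against the shared K chunk replaces six smaller ones.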

Performance (Arc Pro B70 / BMG-G21, 32 GB)

Context	KV Cache	PP t/s	TG t/s
8K	q8_0	~812	~23
110K	q8_0	~335	~17

For comparison, stock bf16 KV cache + FA off on the same GPU achieves
~822 t/s PP at 8K. The MKL path with q8_0 (~812 t/s) is roughly 1%
slower while using quantized memory.

Testing

test-backend-ops: 3605/3605 FLASH_ATTN_EXT tests pass (all
quant types, head sizes 64–512, causal/non-causal masks, sinks,
max_bias, GQA ratios, multi-batch)
Multi-batch: parallel-2 at 32K context, stable throughput,
no coherence errors
Build isolation: all code is SYCL-only, gated behind
BEST_FATTN_KERNEL_MKL enum; other backends and non-quantized
paths are completely unaffected

Known limitations

No graph capture: MKL GEMM's internal queue management is
incompatible with SYCL command graph replay. The existing
GGML_SYCL_DISABLE_GRAPH default (1) handles this.
No ALiBi: max_bias == 0.0f asserted; models needing ALiBi
will fall through to the TILE kernel.
No sinks tensor: dst->src[4] is not yet supported.

Debug output

Timing instrumentation is gated behind MKL_FA_DEBUG=1. In normal
operation the MKL path produces no output.
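
One common way to gate output on an environment variable like this is shown below. It is only a sketch: `env_flag_enabled` is a hypothetical helper, and the PR may parse the variable differently (for example accepting any non-empty value).

```cpp
#include <cstdlib>
#include <cstring>

// True only when the variable is set to exactly "1".
inline bool env_flag_enabled(const char* name) {
    const char* v = std::getenv(name);
    return v != nullptr && std::strcmp(v, "1") == 0;
}
```

A caller would typically read the flag once, e.g. `static const bool debug = env_flag_enabled("MKL_FA_DEBUG");`, so the hot path does not call `getenv` on every attention op.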

AI disclosure

Claude Code was used for SYCL boilerplate (ND-range kernel launches,
ggml_sycl_pool_alloc patterns) and initial drafting of the chunked KV
loop. All algorithmic decisions — oneMKL GEMM integration, online
softmax with GQA batching, activation thresholds, chunk sizing — were
human-directed. Comprehensive testing (3605 test-backend-ops,
multi-quant and multi-batch coherence validation, performance
benchmarking at contexts up to 110K) was performed manually.

🤖 Generated with Claude Code using DeepSeek-V4-Pro


ggml-gh-bot · 2026-06-26T01:28:39Z

Hi @johnkarlhill, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

arthw

@johnkarlhill

It's good to see this PR to enable XMX in FA.

Could you share which LLMs show a good performance increase with this PR?
I use Qwen3.6 and can't trigger the oneMKL path in FA.

Thank you!

johnkarlhill · 2026-06-26T03:19:22Z

Adding before and after... both compiled with arch flags to show side-by-side. Compiling without arch flags will degrade performance from these numbers but should still be better than stock.
PR25025 - Qwen3.6-27B-MTP-UD-Q5_K_XL on B70.txt
b9752 - Qwen3.6-27B-MTP-UD-Q5_K_XL on B70.txt

I'll add more models if needed.

arthw · 2026-06-26T07:36:25Z

@johnkarlhill
1. Could you provide a smaller LLM case to show the perf increase for this PR, including the whole command?
2. For user, how to trigger the new code in usage?

Thank you!

maxious · 2026-06-26T09:05:24Z

Tested on dual Intel Arc Pro B60 (Battlemage, 24GB each), oneAPI 2026.0, MKL 2026.0, targeting bmg-g31 AOT.

Short context (pp512 — MKL path NOT active)

No regression vs master:

Model	Size	Master pp512	PR pp512
gpt-oss 20B Q8_0	11.3G	854	851

Long context (pp2048 — MKL path active)

Model	Size	Master pp2048	PR pp2048	Delta
llama-2 7B Q2_K	2.6G	950	1,102	+16%
Llama-3 8B Q4_0	4.3G	888	968	+9%
gpt-oss 20B Q8_0	11.3G	503	504	0%
Qwen3.6-35B-A3B MoE Q3_K	12.8G	575	575	0%

Decode speeds (tg128) unchanged across all models (±2%).

Full commands

# Master
./build-master/bin/llama-bench -m model.gguf -p 512,2048 -n 128 -ngl -1 -fa 1

# PR #25025
./build-pr/bin/llama-bench -m model.gguf -p 512,2048 -n 128 -ngl -1 -fa 1

johnkarlhill · 2026-06-26T11:45:15Z

Updated arch table:

GPU	-DGGML_SYCL_DEVICE_ARCH
Arc A770 / A750	acm_g10
Arc A580	acm_g12
Arc A380 / A310	acm_g11
Arc B580 / B570 / Pro B70	bmg_g21
Flex / Data Center Max	pvc
Integrated (Meteor Lake)	mtl_u
Integrated (Lunar Lake)	lnl_m

All of these use underscores (_), never hyphens (-).

I incorrectly listed Arc A770 / A750 as acm_g12.

"For user, how to trigger the new code in usage?"
Use "--batch-size N" where N is a value >= 1024.

jlionhan · 2026-06-26T16:35:39Z

I hope this helps.

255H, Arc 140T, 32GB RAM, llama-cli

Build options:

cmake --fresh -B build -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release -DGGML_SYCL=1 -DBUILD_SHARED_LIBS=0 -DGGML_SYCL_F16=1

after PR:

model	size	params	backend	ngl	threads	type_k	type_v	fa	test	t/s
gemma4 26B.A4B Q5_K - Medium	17.80 GiB	25.23 B	SYCL	99	6	q8_0	q8_0	1	pp512	176.84 ± 3.13
gemma4 26B.A4B Q5_K - Medium	17.80 GiB	25.23 B	SYCL	99	6	q8_0	q8_0	1	pp1024	202.22 ± 2.48
gemma4 26B.A4B Q5_K - Medium	17.80 GiB	25.23 B	SYCL	99	6	q8_0	q8_0	1	pp2048	212.19 ± 1.95
gemma4 26B.A4B Q5_K - Medium	17.80 GiB	25.23 B	SYCL	99	6	q8_0	q8_0	1	tg128	12.13 ± 0.22

before PR:

model	size	params	backend	ngl	threads	type_k	type_v	fa	test	t/s
gemma4 26B.A4B Q5_K - Medium	17.80 GiB	25.23 B	SYCL	99	6	q8_0	q8_0	1	pp512	178.96 ± 6.26
gemma4 26B.A4B Q5_K - Medium	17.80 GiB	25.23 B	SYCL	99	6	q8_0	q8_0	1	pp1024	165.24 ± 3.03
gemma4 26B.A4B Q5_K - Medium	17.80 GiB	25.23 B	SYCL	99	6	q8_0	q8_0	1	pp2048	139.39 ± 1.68
gemma4 26B.A4B Q5_K - Medium	17.80 GiB	25.23 B	SYCL	99	6	q8_0	q8_0	1	tg128	12.02 ± 0.38

build: f728ada (9793)

Thank you.

However, I am observing intermittent behavior where the most recently entered prompt is not being processed, and the model instead generates a response to the previous prompt. I believe further testing is needed to confirm whether this issue is reproducible and to identify the underlying cause. This behavior may be unrelated to this PR.

johnkarlhill · 2026-06-26T22:26:46Z

"However, I am observing intermittent behavior where the most recently entered prompt is not being processed, and the model instead generates a response to the previous prompt. I believe further testing is needed to confirm whether this issue is reproducible and to identify the underlying cause. This behavior may be unrelated to this PR."

I can reproduce the behavior on Gemma4 models and am working on a fix. It does not occur on Qwen models: multiple people have tested a few different Qwen models and could not reproduce it, so it seems specific to Gemma4.

And a huge THANK YOU for testing this. It is very much appreciated!!!

arthw · 2026-06-27T08:08:11Z

@johnkarlhill
There are several places in the code that call wait().
Are they all necessary to get the correct result?
Reducing or removing them would make it faster.

Thank you!

…env vars

- Fix mkl_fa_normalize_head: use interleaved dst layout ((query * n_q_heads + head) * DV) matching TILE's flash_attn_combine_results. Previously used dense head-major layout which wrote head outputs to wrong addresses, corrupting attention for all models except Qwen3.6-27B (where GQA=6 heads were sparse enough to avoid visible overlap).
- Remove unused dst_row_stride parameter from normalize_head.
- Clean up diagnostic clutter (DIAG_F32, FA-CK, softmax row-sum).
- Add MKL_FA_DISABLE=1 env var to fattn.cpp dispatcher for A/B testing against TILE path.
- Add FA-DISP watchdog (MKL_FA_DEBUG=1) to log n_kv deltas per FA call, and FA-DIAG output fingerprint (MKL_FA_DIAG=1) to dump first 64 output floats for cross-kernel comparison.

Tested: Gemma-4-26B, Gemma-4-31B, Qwen3.6-27B, Qwen3.6-35B-A3B

Perf (B70/Battlemage, 32K, q8_0 KV):
Gemma-4-26B: 1473 t/s MKL vs 746 TILE (1.97x)
Qwen3.6-27B: 606 t/s MKL vs 330 TILE (1.84x)

Co-Authored-By: Claude Code on DeepSeek-v4-Pro

johnkarlhill · 2026-06-28T07:29:54Z

Bug fix

The MKL normalize kernel was writing output using a dense head-major layout (head * n_queries * DV), but llama.cpp's flash attention output uses an interleaved layout (query * n_heads + head per row, matching TILE's flash_attn_combine_results). Head 0 row 0 happened to alias at offset 0 in both layouts, so the first layer's first 64 output floats matched TILE. Everything else landed at wrong addresses. One-line fix in mkl_fa_normalize_head.
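
The difference between the two layouts can be written out as offset functions. This is a sketch of the indexing only, with hypothetical helper names; the interleaved formula follows the `(query * n_q_heads + head) * DV` expression from the commit message, and the dense formula adds the row term to the `head * n_queries * DV` base described above.

```cpp
#include <cstddef>

// Dense head-major: all query rows of head 0, then all rows of head 1, ...
constexpr std::size_t dense_offset(std::size_t head, std::size_t query,
                                   std::size_t n_queries, std::size_t dv) {
    return (head * n_queries + query) * dv;
}

// Interleaved: for each query, the outputs of all heads sit next to each other.
constexpr std::size_t interleaved_offset(std::size_t head, std::size_t query,
                                         std::size_t n_heads, std::size_t dv) {
    return (query * n_heads + head) * dv;
}
```

Both functions return 0 for head 0, query 0, which is consistent with the first output floats matching TILE while other rows were misplaced. With, say, 4 queries, 8 heads, and DV = 64, head 1 query 0 lands at offset 256 in the dense layout but 64 in the interleaved one.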

Tested models (all pass multi-turn coherence)

Gemma-4-26B-A4B-it (Q5_K_M, gqa=2)
Gemma-4-31B-it-qat (Q4_K_XL, dense)
Qwen3.6-27B (Q5_K_XL, gqa=6)
Qwen3.6-35B-A3B (Q4_K_XL, gqa=2)

Performance (Intel Arc Pro B70, Battlemage BMG-G21, 32K context, q8_0 KV cache)

Model	MKL PP (t/s)	TILE PP (t/s)	Speedup
Gemma-4-26B	1473	746	1.97×
Qwen3.6-27B	606	330	1.84×

Token generation unaffected (±1 t/s, within noise) — MKL only activates for prompt processing (n_kv ≥ 1024 with quantized KV).

Updated PR25025 - Qwen3.6-27B-MTP-UD-Q5_K_XL on B70.txt
Updated PR25025 - Gemma-4-26B-A4B-it-UD-Q5_K_M on B70.txt

How to test

cmake --preset x64-windows-sycl-release -DGGML_SYCL_F16=ON -DGGML_SYCL_GRAPH=ON -DGGML_SYCL_DNN=ON -DGGML_SYCL_DEVICE_ARCH=bmg_g21
cmake --build build-x64-windows-sycl-release --config Release -j 16

# Run (MKL activates automatically with flash-attn + quantized KV + n_kv ≥ 1024)
llama-server --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --batch-size 1024 ...

# Disable for A/B comparison
set MKL_FA_DISABLE=1

SYCL: add oneMKL GEMM flash attention for XMX-accelerated prompt proc…

86b44b0

…essing

johnkarlhill requested a review from a team as a code owner June 26, 2026 01:24

github-actions Bot added the ggml and SYCL labels Jun 26, 2026

arthw reviewed Jun 26, 2026

johnkarlhill wants to merge 2 commits into ggml-org:master from johnkarlhill:sycl-mkl-flash-attn
