SYCL: add oneMKL GEMM flash attention for XMX-accelerated prompt proc…#25025
johnkarlhill wants to merge 2 commits into
|
Hi @johnkarlhill, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
|
Adding before and after... both compiled with arch flags to show side-by-side. Compiling without arch flags will degrade performance from these numbers but should still be better than stock. I'll add more models if needed. |
|
@johnkarlhill For user, how to trigger the new code in usage? Thank you! |
|
Tested on dual Intel Arc Pro B60 (Battlemage, 24GB each), oneAPI 2026.0, MKL 2026.0, targeting
Short context (pp512, MKL path NOT active)
No regression vs master:
Long context (pp2048, MKL path active)
Decode speeds (tg128) unchanged across all models (±2%).
Full commands
# Master
./build-master/bin/llama-bench -m model.gguf -p 512,2048 -n 128 -ngl -1 -fa 1
# PR #25025
./build-pr/bin/llama-bench -m model.gguf -p 512,2048 -n 128 -ngl -1 -fa 1
|
Updated arch table:
All of these use underscores (
I incorrectly listed Arc A770 / A750 as acm_g12.
"For user, how to trigger the new code in usage?" |
|
I hope this helps.
255H, Arc 140T, 32GB RAM, llama-cli
Build options:
after PR:
before PR:
build: f728ada (9793)
Thank you. However, I am observing intermittent behavior where the most recently entered prompt is not being processed, and the model instead generates a response to the previous prompt. I believe further testing is needed to confirm whether this issue is reproducible and to identify the underlying cause. This behavior may be unrelated to this PR. |
I can reproduce the behavior on Gemma4 models and am working on a fix. It does not occur on Qwen models: multiple folks have tested with a few different Qwen models and can't reproduce it, so it seems specific to Gemma4. And a huge THANK YOU for testing this. It is very much appreciated!!! |
|
@johnkarlhill Thank you! |
…env vars
- Fix mkl_fa_normalize_head: use interleaved dst layout ((query * n_q_heads + head) * DV) matching TILE's flash_attn_combine_results. Previously used dense head-major layout which wrote head outputs to wrong addresses, corrupting attention for all models except Qwen3.6-27B (where GQA=6 heads were sparse enough to avoid visible overlap).
- Remove unused dst_row_stride parameter from normalize_head.
- Clean up diagnostic clutter (DIAG_F32, FA-CK, softmax row-sum).
- Add MKL_FA_DISABLE=1 env var to fattn.cpp dispatcher for A/B testing against TILE path.
- Add FA-DISP watchdog (MKL_FA_DEBUG=1) to log n_kv deltas per FA call, and FA-DIAG output fingerprint (MKL_FA_DIAG=1) to dump first 64 output floats for cross-kernel comparison.
Tested: Gemma-4-26B, Gemma-4-31B, Qwen3.6-27B, Qwen3.6-35B-A3B
Perf (B70/Battlemage, 32K, q8_0 KV):
Gemma-4-26B: 1473 t/s MKL vs 746 TILE (1.97x)
Qwen3.6-27B: 606 t/s MKL vs 330 TILE (1.84x)
Co-Authored-By: Claude Code on DeepSeek-v4-Pro
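To make the layout mismatch in this commit concrete, here is a minimal sketch of the two offset formulas. It is not the code from fattn-mkl.cpp: the function names are hypothetical, and the per-row form of the dense layout, (head * n_queries + query) * DV, is an assumption derived from the head-major description. Only the interleaved formula is quoted from the commit message.

```cpp
#include <cassert>
#include <cstddef>

// Dense head-major: all rows of head 0, then all rows of head 1, and so on.
// Assumption: row `query` of `head` starts at (head * n_queries + query) * DV.
inline std::size_t dense_offset(std::size_t head, std::size_t query,
                                std::size_t n_queries, std::size_t DV) {
    return (head * n_queries + query) * DV;
}

// Interleaved: for each query, the outputs of all heads sit next to each other.
// This is the formula quoted in the commit message.
inline std::size_t interleaved_offset(std::size_t head, std::size_t query,
                                      std::size_t n_q_heads, std::size_t DV) {
    return (query * n_q_heads + head) * DV;
}

// Counts the (head, query) rows that land at the same address in both layouts.
inline std::size_t count_matching_rows(std::size_t n_q_heads, std::size_t n_queries,
                                       std::size_t DV) {
    assert(DV > 0);
    std::size_t same = 0;
    for (std::size_t h = 0; h < n_q_heads; ++h) {
        for (std::size_t q = 0; q < n_queries; ++q) {
            if (dense_offset(h, q, n_queries, DV) ==
                interleaved_offset(h, q, n_q_heads, DV)) {
                ++same;
            }
        }
    }
    return same;
}
```

With 4 heads, 3 queries and DV = 64, only 2 of the 12 rows coincide (the first and the last), which is consistent with head 0 row 0 matching TILE while other rows went to the wrong place.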
|
Bug fix
The MKL normalize kernel was writing output using a dense head-major layout (head * n_queries * DV), but llama.cpp's flash attention output uses an interleaved layout (row index query * n_heads + head, matching TILE's flash_attn_combine_results). Head 0 row 0 happened to alias at offset 0 in both layouts, so the first layer's first 64 output floats matched TILE. Everything else landed at wrong addresses. One-line fix in mkl_fa_normalize_head.
Tested models (all pass multi-turn coherence)
Performance (Intel Arc Pro B70, Battlemage BMG-G21, 32K context, q8_0 KV cache)
Token generation unaffected (±1 t/s, within noise). MKL only activates for prompt processing (n_kv ≥ 1024 with quantized KV).
Updated PR25025 - Qwen3.6-27B-MTP-UD-Q5_K_XL on B70.txt
How to test |
Adds a flash attention path that routes Q·K^T and S·V matrix multiplies
through oneMKL GEMM, enabling XMX hardware acceleration on Intel GPUs.
Motivation
The existing SYCL flash attention kernels (VEC, TILE) run entirely in
SYCL subgroup operations. On Intel Arc GPUs with XMX matrix engines
(Battlemage and later), oneMKL GEMM can process the large matmuls in
attention significantly faster — particularly at high context lengths
where the KV cache is quantized.
When it activates
These thresholds route prompt processing through MKL while leaving
single-token decode to the existing TG-optimized kernels. The path is
never activated for f16/bf16 KV cache — those already perform well
with the TILE kernel and graph capture.
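The routing described here could look roughly like the predicate below. This is a hedged sketch, not the dispatcher in fattn.cpp: the types, enum names and `choose_kernel` are hypothetical, the n_kv ≥ 1024 threshold is taken from the discussion thread, and the "more than one query token" test for prompt processing is an assumption, since the exact thresholds are not reproduced in this text.

```cpp
#include <cassert>

// Hypothetical subset of KV cache types, for illustration only.
enum class kv_type { f16, bf16, q8_0, q4_0 };

// Hypothetical kernel choice: the existing kernels vs. the MKL path.
enum class fattn_kernel { vec_or_tile, mkl };

struct fa_params {
    kv_type type_k;   // K cache element type
    kv_type type_v;   // V cache element type
    long    n_kv;     // number of KV positions attended to
    long    n_q;      // number of query tokens in this call
    float   max_bias; // ALiBi slope parameter, 0.0f when unused
};

inline bool is_quantized(kv_type t) {
    return t != kv_type::f16 && t != kv_type::bf16;
}

inline fattn_kernel choose_kernel(const fa_params & p, bool mkl_disabled) {
    const long min_n_kv = 1024; // from the thread: n_kv >= 1024
    const long min_n_q  = 2;    // assumption: more than one query token = prompt processing

    if (mkl_disabled)                                       return fattn_kernel::vec_or_tile;
    if (!is_quantized(p.type_k) || !is_quantized(p.type_v)) return fattn_kernel::vec_or_tile;
    if (p.max_bias != 0.0f)                                 return fattn_kernel::vec_or_tile;
    if (p.n_kv < min_n_kv || p.n_q < min_n_q)               return fattn_kernel::vec_or_tile;
    return fattn_kernel::mkl;
}
```

In this sketch `mkl_disabled` would be fed from the MKL_FA_DISABLE=1 environment variable mentioned in the follow-up commit, so the same build can be A/B tested against the TILE path.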
Implementation
All logic is in one new file, fattn-mkl.cpp (567 lines). The pipeline:
GQA groups sharing a KV head are batched into single GEMM calls:
6 query heads × 1020 tokens = 6120 rows in one MKL call, amortizing
launch overhead.
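The pipeline steps are not reproduced in this text, so the following is only a CPU reference sketch of the general technique named elsewhere in the PR (chunked KV loop, online softmax, two matrix multiplies, final normalize). Plain loops stand in for the oneMKL GEMM calls, masking and dequantization are omitted, and `attention_online` with all its parameters is a hypothetical name, not a function from fattn-mkl.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Q: [n_q x D], K: [n_kv x D], V: [n_kv x DV], all row-major. Returns O: [n_q x DV].
std::vector<float> attention_online(const std::vector<float> & Q,
                                    const std::vector<float> & K,
                                    const std::vector<float> & V,
                                    std::size_t n_q, std::size_t n_kv,
                                    std::size_t D, std::size_t DV,
                                    std::size_t chunk, float scale) {
    assert(chunk > 0);
    std::vector<float> O(n_q * DV, 0.0f);
    std::vector<float> m(n_q, -std::numeric_limits<float>::infinity()); // running row max
    std::vector<float> l(n_q, 0.0f);                                    // running row sum
    std::vector<float> S;                                               // scores for one chunk

    for (std::size_t k0 = 0; k0 < n_kv; k0 += chunk) {
        const std::size_t kc = std::min(chunk, n_kv - k0);
        S.assign(n_q * kc, 0.0f);

        // "GEMM 1": S = scale * Q * K_chunk^T
        for (std::size_t i = 0; i < n_q; ++i)
            for (std::size_t j = 0; j < kc; ++j) {
                float acc = 0.0f;
                for (std::size_t d = 0; d < D; ++d)
                    acc += Q[i * D + d] * K[(k0 + j) * D + d];
                S[i * kc + j] = scale * acc;
            }

        // Online softmax: rescale what was accumulated so far to the new row max.
        for (std::size_t i = 0; i < n_q; ++i) {
            float m_new = m[i];
            for (std::size_t j = 0; j < kc; ++j) m_new = std::max(m_new, S[i * kc + j]);
            const float corr = std::exp(m[i] - m_new); // 0 on the first chunk
            l[i] *= corr;
            for (std::size_t d = 0; d < DV; ++d) O[i * DV + d] *= corr;
            for (std::size_t j = 0; j < kc; ++j) {
                const float p = std::exp(S[i * kc + j] - m_new);
                S[i * kc + j] = p; // S now holds unnormalized probabilities
                l[i] += p;
            }
            m[i] = m_new;
        }

        // "GEMM 2": O += P * V_chunk
        for (std::size_t i = 0; i < n_q; ++i)
            for (std::size_t j = 0; j < kc; ++j)
                for (std::size_t d = 0; d < DV; ++d)
                    O[i * DV + d] += S[i * kc + j] * V[(k0 + j) * DV + d];
    }

    // Final normalize by the accumulated softmax denominator.
    for (std::size_t i = 0; i < n_q; ++i)
        for (std::size_t d = 0; d < DV; ++d) O[i * DV + d] /= l[i];
    return O;
}
```

Because every query row is processed independently against the same K and V, the rows of all query heads that share one KV head can be stacked into a single Q matrix. In the example above that gives n_q = 6 × 1020 = 6120 rows per call.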
Performance (Arc Pro B70 / BMG-G21, 32 GB)
For comparison, stock bf16 KV cache + FA off on the same GPU achieves
~822 t/s PP at 8K — the MKL path with q8_0 is within 1% while using
quantized memory.
Testing
quant types, head sizes 64–512, causal/non-causal masks, sinks,
max_bias, GQA ratios, multi-batch)
no coherence errors
BEST_FATTN_KERNEL_MKL enum; other backends and non-quantized paths are completely unaffected
Known limitations
incompatible with SYCL command graph replay. The existing GGML_SYCL_DISABLE_GRAPH default (1) handles this.
max_bias == 0.0f asserted; models needing ALiBi will fall through to the TILE kernel.
dst->src[4] is not yet supported.
Debug output
Timing instrumentation is gated behind MKL_FA_DEBUG=1. In normal operation the MKL path produces no output.
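One common way to gate timing output on an environment variable is sketched below. Only the variable name MKL_FA_DEBUG comes from the PR. The helper `mkl_fa_debug_enabled`, the `scoped_timer` class and the log format are hypothetical and may not match what fattn-mkl.cpp does.

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// True only when MKL_FA_DEBUG is set to exactly "1".
// Real code would likely cache this result; it is re-read here to keep the sketch simple.
inline bool mkl_fa_debug_enabled() {
    const char * v = std::getenv("MKL_FA_DEBUG");
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// RAII timer: prints the elapsed time to stderr on destruction, but only when enabled.
class scoped_timer {
  public:
    scoped_timer(const char * label, bool enabled)
        : label_(label), enabled_(enabled), start_(std::chrono::steady_clock::now()) {}

    ~scoped_timer() {
        if (!enabled_) {
            return; // normal operation: no output at all
        }
        const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                            std::chrono::steady_clock::now() - start_).count();
        std::fprintf(stderr, "[%s] %lld us\n", label_, static_cast<long long>(us));
    }

  private:
    const char *                          label_;
    bool                                  enabled_;
    std::chrono::steady_clock::time_point start_;
};
```

A call site would wrap a stage in a block such as `{ scoped_timer t("qk_gemm", mkl_fa_debug_enabled()); /* work */ }`, where the label is again only illustrative.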
AI disclosure
Claude Code was used for SYCL boilerplate (ND-range kernel launches,
ggml_sycl_pool_alloc patterns) and initial drafting of the chunked KV
loop. All algorithmic decisions — oneMKL GEMM integration, online
softmax with GQA batching, activation thresholds, chunk sizing — were
human-directed. Comprehensive testing (3605 test-backend-ops,
multi-quant and multi-batch coherence validation, performance
benchmarking at contexts up to 110K) was performed manually.
🤖 Generated with Claude Code using DeepSeek-V4-Pro