sycl: fix check_graph_compatibility() to allow graphs for MoE decode (CONCAT dim!=3, MUL_MAT_ID fused path) by Captain-Tripps · Pull Request #25089 · ggml-org/llama.cpp

Captain-Tripps · 2026-06-28T04:19:31Z

Summary

check_graph_compatibility() was unconditionally rejecting GGML_OP_CONCAT and GGML_OP_MUL_MAT_ID, blocking SYCL command graph capture for any model with MoE routing or SSM-style concatenation — even when those ops are fully async.

GGML_OP_CONCAT

Only the dim==3 contiguous path does stream->memcpy(...).wait(). All other dims use async GPU kernels (concat_T_sycl) and are graph-compatible. Models such as qwen3.6-35B use dim=0 for SSM conv state concatenation.
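As a rough illustration of the CONCAT rule described above, the per-node decision can be sketched as a predicate on the concatenation axis. `MockConcatNode` and `concat_is_graph_compatible` are hypothetical names invented for this sketch, not the backend's types or functions. The sketch conservatively rejects every dim==3 node rather than also testing contiguity.

```cpp
#include <cstdint>

// Hypothetical stand-in for a CONCAT graph node; not the real ggml tensor type.
struct MockConcatNode {
    int32_t dim;  // concatenation axis, 0..3
};

// Returns true when the node may be recorded into a SYCL command graph.
// Assumption taken from the PR description: only the dim == 3 path can reach
// the blocking stream->memcpy(...).wait(), so that axis is rejected outright
// and every other axis is treated as an async kernel launch.
static bool concat_is_graph_compatible(const MockConcatNode & node) {
    return node.dim != 3;
}
```

Under these assumptions, a dim=0 node (the SSM conv state case mentioned above) passes the check, while a dim=3 node still disables graph capture.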

GGML_OP_MUL_MAT_ID

The non-fused prefill path (ne12 > 1) copies expert IDs to the host with stream->wait(), so it is still rejected. The fused single-token decode path in ggml_sycl_mul_mat_id_mmvq_fused() (ne12==1, FP32 src1) runs ggml_sycl_mul_mat_vec_q_id() entirely on the GPU with no host wait, so it is graph-compatible.
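The MUL_MAT_ID rule can be sketched the same way, using only the two conditions named in the description (ne12==1 and FP32 src1). The struct, enum, and function names below are hypothetical. The real fused path may have further preconditions that this sketch does not model.

```cpp
#include <cstdint>

// Hypothetical element-type tag for src1; not the real ggml type enum.
enum class MockSrc1Type { F32, F16 };

// Hypothetical stand-in for a MUL_MAT_ID graph node.
struct MockMulMatIdNode {
    int64_t      ne12;       // src1 extent along dim 2; 1 during single-token decode
    MockSrc1Type src1_type;  // element type of src1
};

// Returns true only for the case the description calls the fused decode path:
// one token and FP32 activations. Anything else is assumed to take the
// prefill path that copies expert IDs to the host and waits on the stream.
static bool mul_mat_id_is_graph_compatible(const MockMulMatIdNode & node) {
    return node.ne12 == 1 && node.src1_type == MockSrc1Type::F32;
}
```

With this predicate, a prefill batch (for example ne12 = 8) is still rejected, which matches the fallback behaviour listed in the test plan.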

Pool address stability: ggml_sycl_pool_vmm uses a fixed base address with LIFO linear allocation. src1_q8_alloc in the fused path always gets pool_addr+0, so addresses are stable across graph replays when g_ggml_sycl_use_async_mem_op is set.
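The address-stability argument can be illustrated with a toy LIFO bump allocator over a fixed base address. `MockLifoPool` and `replay_step` are hypothetical and model only the allocation order, not ggml_sycl_pool_vmm itself. Alignment and virtual-memory mapping are omitted. The point of the sketch is that a temporary allocated while the pool is empty, and released before the next step, lands on the same address every time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a linear pool with a fixed base and LIFO release order.
class MockLifoPool {
public:
    MockLifoPool(std::uintptr_t base, std::size_t capacity)
        : base_(base), capacity_(capacity) {}

    // Bump allocation: the returned address is always base + bytes in use.
    std::uintptr_t alloc(std::size_t size) {
        assert(used_ + size <= capacity_);
        const std::uintptr_t addr = base_ + used_;
        used_ += size;
        return addr;
    }

    // LIFO release: only the most recent allocation may be returned.
    void release(std::uintptr_t addr, std::size_t size) {
        assert(size <= used_ && addr == base_ + used_ - size);
        used_ -= size;
    }

    std::size_t used() const { return used_; }

private:
    std::uintptr_t base_;
    std::size_t    capacity_;
    std::size_t    used_ = 0;
};

// Simulates one decode step whose only pool user is a single temporary buffer.
static std::uintptr_t replay_step(MockLifoPool & pool, std::size_t tmp_size) {
    const std::uintptr_t addr = pool.alloc(tmp_size);
    pool.release(addr, tmp_size);
    return addr;
}
```

In this model the guarantee holds only while the temporary is the first live allocation in the pool. An earlier allocation that is still live, or one whose size varies between steps, would shift the address.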

Test plan

Intel Arc Pro B70 (Xe2/Battlemage), qwen3.6-35B-A3B Q4_K_M, GGML_SYCL_DISABLE_GRAPH=0, GGML_SYCL_DISABLE_OPT=1
Graph capture succeeded (no "disabling SYCL graphs" log messages)
Decode output correct on small test requests
3678-token sustained decode ran cleanly (207s)
Prefill unaffected — falls back gracefully for ne12>1

Related: #24810

🤖 Generated with Claude Code

The compatibility check was unconditionally rejecting GGML_OP_CONCAT and GGML_OP_MUL_MAT_ID, but only specific sub-cases actually block graph capture:

GGML_OP_CONCAT: only the dim==3 contiguous path uses a blocking stream->memcpy(...).wait(). All other dims use async GPU kernels (concat_T_sycl) and are fully graph-compatible. Models such as qwen3.6-35B use dim=0 for SSM conv state concatenation.

GGML_OP_MUL_MAT_ID: the non-fused prefill path (ne12 > 1) copies expert IDs to host with stream->wait() and cannot be captured. But the fused single-token decode path in ggml_sycl_mul_mat_id_mmvq_fused() (ne12==1, FP32 src1) runs ggml_sycl_mul_mat_vec_q_id() entirely on GPU with no host wait, and is graph-compatible.

Pool address stability: ggml_sycl_pool_vmm uses a fixed base address with LIFO linear allocation. The src1_q8_alloc temporary in the fused MUL_MAT_ID path always gets pool_addr+0, making addresses stable across graph replays when g_ggml_sycl_use_async_mem_op is set.

Verified on Intel Arc Pro B70 (Xe2/Battlemage) with qwen3.6-35B-A3B Q4_K_M: graph capture succeeded, decode output correct, 3678-token sustained decode ran cleanly.

ggml-gh-bot · 2026-06-28T04:23:48Z

Hi @Captain-Tripps, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

Captain-Tripps · 2026-06-28T04:31:50Z

Yes - Claude Code has been helping me track down the issue with Intel Battlemage cards wedging. Claude helped me submit this PR.

arthw · 2026-06-28T09:24:20Z

@Captain-Tripps
Does this PR fix issue #24810?

That issue was run with GGML_SYCL_DISABLE_GRAPH=0, GGML_SYCL_DISABLE_OPT=1.
Is that right?

What is the benefit of these settings, and which cases benefit from this PR?
What is the measured performance increase?

Thank you!

Captain-Tripps requested a review from a team as a code owner June 28, 2026 04:19

github-actions Bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels Jun 28, 2026

Captain-Tripps wants to merge 1 commit into ggml-org:master from Captain-Tripps:sycl-graph-moe-compat-fix
