sycl: fix check_graph_compatibility() to allow graphs for MoE decode (CONCAT dim!=3, MUL_MAT_ID fused path)#25089

Open
Captain-Tripps wants to merge 1 commit into ggml-org:master from Captain-Tripps:sycl-graph-moe-compat-fix

@Captain-Tripps

Summary

check_graph_compatibility() was unconditionally rejecting GGML_OP_CONCAT and GGML_OP_MUL_MAT_ID. This blocked SYCL command graph capture for any model with MoE routing or SSM-style concatenation, even when those ops run fully async.

GGML_OP_CONCAT

Only the dim==3 contiguous path does stream->memcpy(...).wait(). All other dims use async GPU kernels (concat_T_sycl) and are graph-compatible. Models such as qwen3.6-35B use dim=0 for SSM conv state concatenation.
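The CONCAT rule above can be sketched as a small predicate. This is a minimal illustration, not the code in this PR: `toy_op`, `toy_node`, and `toy_concat_graph_ok` are hypothetical stand-ins for the ggml types, and the sketch conservatively rejects every dim==3 concat rather than testing contiguity as well.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the ggml op enum and tensor node.
enum toy_op { TOY_OP_CONCAT, TOY_OP_MUL_MAT_ID, TOY_OP_OTHER };

struct toy_node {
    toy_op  op;
    int32_t concat_dim;  // concat dimension; only meaningful for TOY_OP_CONCAT
};

// Returns true when the node may be recorded into a command graph.
// Only dim == 3 is rejected, because that is the path described as
// using a blocking stream->memcpy(...).wait().
bool toy_concat_graph_ok(const toy_node & n) {
    if (n.op != TOY_OP_CONCAT) {
        return true;  // other ops are not judged by this predicate
    }
    return n.concat_dim != 3;
}
```

With this rule, a dim=0 concat (the SSM conv state case mentioned above) passes the check, while a dim=3 concat still disables graph capture.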

GGML_OP_MUL_MAT_ID

The non-fused prefill path (ne12 > 1) copies expert IDs to the host with stream->wait(), so it is still correctly rejected. The fused single-token decode path in ggml_sycl_mul_mat_id_mmvq_fused() (ne12==1, FP32 src1) runs ggml_sycl_mul_mat_vec_q_id() entirely on the GPU with no host wait, so it is graph-compatible.
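A rough sketch of the MUL_MAT_ID distinction, assuming only the two conditions named above (ne12==1 and FP32 src1). The struct and function names are hypothetical, and the real fused path may require further conditions (for example on the src0 type) that are not modelled here.

```cpp
#include <cstdint>

// Hypothetical, simplified view of a MUL_MAT_ID node.
enum toy_type { TOY_TYPE_F32, TOY_TYPE_F16, TOY_TYPE_OTHER };

struct toy_mul_mat_id {
    int64_t  ne12;       // src1->ne[2]; 1 during single-token decode
    toy_type src1_type;  // activation type
};

// True only for the case described as the fused decode path.
// Anything else is treated as the prefill path, which syncs with
// the host and therefore must not be captured.
bool toy_mul_mat_id_graph_ok(const toy_mul_mat_id & n) {
    return n.ne12 == 1 && n.src1_type == TOY_TYPE_F32;
}
```

A prefill batch with ne12 > 1 fails the predicate, so graph capture is skipped for that graph and execution falls back to the regular path.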

Pool address stability: ggml_sycl_pool_vmm uses a fixed base address with LIFO linear allocation. src1_q8_alloc in the fused path always gets pool_addr+0, so addresses are stable across graph replays when g_ggml_sycl_use_async_mem_op is set.
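To illustrate why LIFO linear allocation over a fixed base gives stable addresses, here is a toy bump allocator. It is not ggml_sycl_pool_vmm: `toy_lifo_pool` and `toy_decode_step` are hypothetical, and alignment and virtual memory mapping are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy LIFO bump allocator over a fixed base address.
struct toy_lifo_pool {
    uintptr_t base;
    size_t    used = 0;

    explicit toy_lifo_pool(uintptr_t base_addr) : base(base_addr) {}

    // Hand out the next free address and advance the watermark.
    uintptr_t alloc(size_t size) {
        uintptr_t p = base + used;
        used += size;
        return p;
    }

    // Releases must happen in reverse allocation order (LIFO).
    void release(uintptr_t p, size_t size) {
        assert(p == base + used - size);
        used -= size;
    }
};

// Simulates one decode step that needs a single temporary buffer,
// analogous to src1_q8_alloc. Returns the address the buffer received.
uintptr_t toy_decode_step(toy_lifo_pool & pool, size_t q8_bytes) {
    uintptr_t tmp = pool.alloc(q8_bytes);
    pool.release(tmp, q8_bytes);
    return tmp;
}
```

Because the pool is empty at the start of each step, the temporary always lands at the base address, so a recorded graph that refers to that address remains valid on replay. The argument only holds if nothing else is live in the pool when the temporary is allocated.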

Test plan

  • Intel Arc Pro B70 (Xe2/Battlemage), qwen3.6-35B-A3B Q4_K_M, GGML_SYCL_DISABLE_GRAPH=0, GGML_SYCL_DISABLE_OPT=1
  • Graph capture succeeded (no "disabling SYCL graphs" log messages)
  • Decode output correct on small test requests
  • 3678-token sustained decode ran cleanly (207s)
  • Prefill unaffected — falls back gracefully for ne12>1

Related: #24810

🤖 Generated with Claude Code

The compatibility check was unconditionally rejecting GGML_OP_CONCAT and
GGML_OP_MUL_MAT_ID, but only specific sub-cases actually block graph
capture:

GGML_OP_CONCAT: only the dim==3 contiguous path uses a blocking
stream->memcpy(...).wait(). All other dims use async GPU kernels
(concat_T_sycl) and are fully graph-compatible. Models such as
qwen3.6-35B use dim=0 for SSM conv state concatenation.

GGML_OP_MUL_MAT_ID: the non-fused prefill path (ne12 > 1) copies expert
IDs to host with stream->wait() and cannot be captured. But the fused
single-token decode path in ggml_sycl_mul_mat_id_mmvq_fused() (ne12==1,
FP32 src1) runs ggml_sycl_mul_mat_vec_q_id() entirely on GPU with no
host wait, and is graph-compatible.

Pool address stability: ggml_sycl_pool_vmm uses a fixed base address
with LIFO linear allocation. The src1_q8_alloc temporary in the fused
MUL_MAT_ID path always gets pool_addr+0, making addresses stable across
graph replays when g_ggml_sycl_use_async_mem_op is set.

Verified on Intel Arc Pro B70 (Xe2/Battlemage) with qwen3.6-35B-A3B
Q4_K_M: graph capture succeeded, decode output correct, 3678-token
sustained decode ran cleanly.
@Captain-Tripps Captain-Tripps requested a review from a team as a code owner June 28, 2026 04:19
@github-actions github-actions Bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels Jun 28, 2026
@ggml-gh-bot

ggml-gh-bot Bot commented Jun 28, 2026

Hi @Captain-Tripps, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@Captain-Tripps

Author

Yes - Claude Code has been helping me track down the issue with Intel Battlemage cards wedging. Claude helped me submit this PR.

@arthw

arthw commented Jun 28, 2026

Contributor

@Captain-Tripps
Does this PR fix issue #24810?

That issue was run with GGML_SYCL_DISABLE_GRAPH=0 and GGML_SYCL_DISABLE_OPT=1.
Is that right?

What is the benefit of those settings, and which cases benefit from this PR?
Do you have detailed numbers for the performance increase?

Thank you!
