sycl: fix check_graph_compatibility() to allow graphs for MoE decode (CONCAT dim!=3, MUL_MAT_ID fused path)#25089
Open
Captain-Tripps wants to merge 1 commit into
The compatibility check was unconditionally rejecting GGML_OP_CONCAT and GGML_OP_MUL_MAT_ID, but only specific sub-cases actually block graph capture:

GGML_OP_CONCAT: only the dim==3 contiguous path uses a blocking stream->memcpy(...).wait(). All other dims use async GPU kernels (concat_T_sycl) and are fully graph-compatible. Models such as qwen3.6-35B use dim=0 for SSM conv state concatenation.

GGML_OP_MUL_MAT_ID: the non-fused prefill path (ne12 > 1) copies expert IDs to host with stream->wait() and cannot be captured. But the fused single-token decode path in ggml_sycl_mul_mat_id_mmvq_fused() (ne12==1, FP32 src1) runs ggml_sycl_mul_mat_vec_q_id() entirely on GPU with no host wait, and is graph-compatible.

Pool address stability: ggml_sycl_pool_vmm uses a fixed base address with LIFO linear allocation. The src1_q8_alloc temporary in the fused MUL_MAT_ID path always gets pool_addr+0, making addresses stable across graph replays when g_ggml_sycl_use_async_mem_op is set.

Verified on Intel Arc Pro B70 (Xe2/Battlemage) with qwen3.6-35B-A3B Q4_K_M: graph capture succeeded, decode output correct, 3678-token sustained decode ran cleanly.
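The per-op rules described above can be sketched as a small predicate. This is a minimal illustration, not the code in the patch: `Op`, `DType`, `Node`, `node_is_graph_compatible`, and `graph_is_compatible` are hypothetical stand-ins for the real ggml types and for check_graph_compatibility(). The sketch also assumes that rejecting every dim==3 CONCAT (contiguous or not) is an acceptable conservative choice.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the ggml op and type enums (illustration only).
enum class Op { Concat, MulMatId, Other };
enum class DType { F32, F16 };

// Hypothetical, reduced view of a graph node: only the fields the rules need.
struct Node {
    Op           op;
    int          concat_dim;  // concatenation dimension (CONCAT only)
    std::int64_t ne12;        // ne12 as used in the PR text (MUL_MAT_ID only)
    DType        src1_type;   // element type of src1 (MUL_MAT_ID only)
};

// Returns true when the node matches a sub-case the PR describes as free of
// blocking host synchronization.
bool node_is_graph_compatible(const Node & n) {
    switch (n.op) {
        case Op::Concat:
            // Assumption: treat every dim==3 concat as blocking, since the
            // dim==3 contiguous path does memcpy(...).wait().
            return n.concat_dim != 3;
        case Op::MulMatId:
            // Only the fused single-token decode path (ne12==1, FP32 src1)
            // stays on the GPU; ne12 > 1 copies expert IDs to the host.
            return n.ne12 == 1 && n.src1_type == DType::F32;
        default:
            // This sketch models only the two ops discussed in the PR.
            return true;
    }
}

// A graph can be captured only if every node is compatible.
bool graph_is_compatible(const std::vector<Node> & nodes) {
    return std::all_of(nodes.begin(), nodes.end(), node_is_graph_compatible);
}
```

Under these assumptions, a decode graph containing a dim=0 CONCAT and a MUL_MAT_ID with ne12==1 passes, while the same graph with ne12 > 1 (prefill) is still rejected.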
Hi @Captain-Tripps, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Author
Yes - Claude Code has been helping me track down the issue with Intel Battlemage cards wedging. Claude helped me submit this PR.
Contributor
@Captain-Tripps The issue run with What is the benefit of the above setting? Which cases would benefit from this PR? Thank you!
Summary

check_graph_compatibility() was unconditionally rejecting GGML_OP_CONCAT and GGML_OP_MUL_MAT_ID, blocking SYCL command graph capture for any model with MoE routing or SSM-style concatenation, even when those ops are fully async.

GGML_OP_CONCAT

Only the dim==3 contiguous path does stream->memcpy(...).wait(). All other dims use async GPU kernels (concat_T_sycl) and are graph-compatible. Models such as qwen3.6-35B use dim=0 for SSM conv state concatenation.

GGML_OP_MUL_MAT_ID

The non-fused prefill path (ne12 > 1) copies expert IDs to host with stream->wait() and is correctly rejected. The fused single-token decode path in ggml_sycl_mul_mat_id_mmvq_fused() (ne12==1, FP32 src1) runs ggml_sycl_mul_mat_vec_q_id() entirely on GPU with no host wait and is graph-compatible.

Pool address stability: ggml_sycl_pool_vmm uses a fixed base address with LIFO linear allocation. src1_q8_alloc in the fused path always gets pool_addr+0, so addresses are stable across graph replays when g_ggml_sycl_use_async_mem_op is set.

Test plan

GGML_SYCL_DISABLE_GRAPH=0, GGML_SYCL_DISABLE_OPT=1
ne12>1

Related: #24810
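The pool address stability argument in the summary can be illustrated with a toy model of a fixed-base, LIFO linear allocator. This is a sketch under stated assumptions, not the ggml_sycl_pool_vmm implementation: `LinearPool`, `alloc`, `release`, and `simulate_fused_step` are hypothetical names, alignment and virtual memory mapping are ignored, and the pool is assumed to be empty when the fused path allocates its temporary.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdexcept>
#include <vector>

// Toy model (hypothetical): a pool with a fixed base address that hands out
// memory linearly and only accepts frees in reverse allocation order.
class LinearPool {
public:
    LinearPool(std::uintptr_t base, std::size_t capacity)
        : base_(base), capacity_(capacity), used_(0) {}

    // Returns the next free address: base + bytes currently in use.
    std::uintptr_t alloc(std::size_t size) {
        if (used_ + size > capacity_) {
            throw std::bad_alloc();
        }
        const std::uintptr_t addr = base_ + used_;
        used_ += size;
        sizes_.push_back(size);
        return addr;
    }

    // LIFO rule: only the most recent live allocation may be released.
    void release(std::uintptr_t addr) {
        if (sizes_.empty() || addr != base_ + used_ - sizes_.back()) {
            throw std::logic_error("release out of LIFO order");
        }
        used_ -= sizes_.back();
        sizes_.pop_back();
    }

private:
    std::uintptr_t           base_;
    std::size_t              capacity_;
    std::size_t              used_;
    std::vector<std::size_t> sizes_;  // sizes of live allocations, oldest first
};

// Models one decode step: allocate the temporary, use it, release it.
// Returns the address the temporary received.
std::uintptr_t simulate_fused_step(LinearPool & pool, std::size_t q8_bytes) {
    const std::uintptr_t addr = pool.alloc(q8_bytes);
    pool.release(addr);
    return addr;
}
```

In this model every step that starts with an empty pool receives the base address, which is the property the PR relies on for graph replay. If another allocation is still live when the step runs, the temporary lands at a different offset, so the stability argument depends on the allocation order staying the same between capture and replay.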
🤖 Generated with Claude Code