Feat/moe nsp blocking all models#1016
Add expert-blocked NSP-parallel prefill forward to QEffPrefillChunkedQwen3MoeSparseMoeBlock and QEffPrefillOnlyChunkedGptOssMLP. Controlled via EXPERT_BLOCKING_NUM_NSP env var. Fix CtxScatterFunc3D/CtxGatherFunc3D eager forward for INT32_MAX sentinel handling. Add disagg-mode tests for both models with tiny configs. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
…prefill Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
- Root cause: CtxGather3D ONNX symbolic expanded ctx_indices to Shape(data)[:2] ([batch, seq_len]), which is wrong for packed dispatch.
- In expert-blocked MoE prefill, ctx_indices is intentionally [batch, packed_chunk_size] (e.g. [16, 256]) while data stays [batch, seq_len, ...] (e.g. [16, 512, ...]).
- This caused invalid Expand attempts ([16,256] -> [16,512]) and QAIC compile/runtime failure on /model/layers.0/mlp/CtxGather3D/....
Fix:
- Update CtxGather3D expand target to:
- batch dim from data
- index-seq dim from ctx_indices
- New expand shape is [batch_size(data), idx_seq_len(ctx_indices)], preserving packed chunk length.
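The shape mismatch described above can be reproduced with plain shape arithmetic. This is a minimal sketch, not the actual ONNX symbolic: the helper names are hypothetical, the hidden size of 64 is an assumption, and `can_expand` is a simplified one-directional broadcast check standing in for ONNX Expand.

```python
def expand_target_old(data_shape, idx_shape):
    # Old behaviour described in the commit: both dims come from data.
    return tuple(data_shape[:2])


def expand_target_fixed(data_shape, idx_shape):
    # Fixed behaviour: batch dim from data, index-seq dim from ctx_indices.
    return (data_shape[0], idx_shape[1])


def can_expand(src_shape, dst_shape):
    # Simplified check: each trailing source dim must equal the target dim or be 1.
    if len(src_shape) > len(dst_shape):
        return False
    pairs = zip(reversed(src_shape), reversed(dst_shape))
    return all(s == d or s == 1 for s, d in pairs)


data_shape = (16, 512, 64)  # [batch, seq_len, hidden]; hidden=64 is hypothetical
idx_shape = (16, 256)       # [batch, packed_chunk_size]

old_target = expand_target_old(data_shape, idx_shape)    # (16, 512)
new_target = expand_target_fixed(data_shape, idx_shape)  # (16, 256)

print(can_expand(idx_shape, old_target))  # False: 256 cannot broadcast to 512
print(can_expand(idx_shape, new_target))  # True: packed chunk length preserved
```

A ctx_indices tensor with batch dim 1 would still broadcast under the fixed target, since only the batch dim is taken from data.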
Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com>
…port Add missing CustomOpTransform mappings for CtxScatterFunc3DInt and generalized 3D scatter/gather ops, plus a prefill-only subfunction export regression test to verify the ONNX graph includes the required CtxScatter3DInt/CtxScatter3D/CtxGather3D ops. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
…on export Replace MoE prefill sum reductions with equivalent einsum forms and rewrite int32 clamp bounds using where to avoid QAIC subfunction compile failures for GPT-OSS and Qwen3-MoE. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Trace chunked prefill exports with the requested prefill_seq_len so packed MoE dispatch unrolls all packed chunks, restore torch.full_like index init, and add ONNX coverage for the second packed chunk slice. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
- gpt_oss/modeling_gpt_oss.py: add num_expert_chunks dynamic-loop mechanism to _cumsum_scatter_gather_update_gptoss_expert_blocked and _forward_expert_blocked; read _num_expert_chunks from module in forward
- qwen3_moe/modeling_qwen3_moe.py: same num_expert_chunks dynamic-loop mechanism; replace fallback inline loop with orig_forward call
- granitemoe/modeling_granitemoe.py: add QEffPrefillChunkedGraniteMoeAttention, QEffPrefillChunkedGraniteMoeMoE with full ONNX-friendly cumsum-scatter-gather dispatch; update get_submodules_for_export to return set() when chunked prefill MoE is active
- pytorch_transforms.py: register GraniteMoe forward/reverse mappings in PrefillOnlyChunkedTransform and RevertPrefillKeepAttentionTransform
- modeling_auto.py: compute num_expert_chunks from prefill_seq_len and EXPERT_BLOCKING_PACKED_CHUNK_SIZE; setattr _num_expert_chunks on every MoE layer when enable_chunking=True
torch.clamp on int32 tensors exports to ONNX Clip op which QAIC compiler does not support (Unhandled ElemKind in Clip operation). Replace with torch.where in all three models: gpt_oss, qwen3_moe, granitemoe.
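The clamp-to-where rewrite can be illustrated as follows. This is a sketch under stated assumptions: NumPy stands in for PyTorch so the snippet runs without the framework (`torch.where` takes the same condition/value arguments), and the function name and bounds are hypothetical rather than taken from the PR.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max


def clamp_via_where(x, lo, hi):
    # Two element-wise selects replace a single clamp/Clip on an int32 tensor.
    lo = np.int32(lo)
    hi = np.int32(hi)
    x = np.where(x < lo, lo, x)
    x = np.where(x > hi, hi, x)
    return x.astype(np.int32)


# Includes an INT32_MAX entry, the sentinel value the PR mentions for scatter/gather.
idx = np.array([-5, 0, 3, 300, INT32_MAX], dtype=np.int32)
clamped = clamp_via_where(idx, 0, 255)
print(clamped)
```

The result matches `np.clip(idx, 0, 255)` element for element; the only difference is which operators end up in the exported graph.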
…raniteMoE
- modeling_qwen3_vl_moe.py: Add QEffPrefillChunkedQwen3VLMoeTextSparseMoeBlock with NSP-blocked cumsum-scatter-gather dispatch; supports_moe_prefill_blocking=True; use expert_blocking_num_nsp/packed_chunk_size/num_packed_chunks instance attrs
- modeling_qwen3_moe.py, modeling_gpt_oss.py, modeling_granitemoe.py: Replace EXPERT_BLOCKING_NUM_NSP/EXPERT_BLOCKING_PACKED_CHUNK_SIZE env vars with API-driven instance attributes set via compile() params
- pytorch_transforms.py: Register QEffPrefillChunkedQwen3VLMoeTextSparseMoeBlock in RevertPrefillKeepAttentionTransform
- modeling_auto.py: QEFFAutoModelForCausalLM.get_seq_len_and_handle_specialized_prefill_model iterates modules with supports_moe_prefill_blocking=True and sets instance attrs; QEffCausalLMForTextImageToTextModel.export() uses same API-driven pattern for VLM; QEFFAutoModelForImageTextToText.compile() accepts moe_prefill_packed_chunk_size param
- modeling_qeff.py: Uncomment self.onnx_path fallback in _compile so pre-exported ONNX is reused without hitting get_onnx_path; pass moe_prefill_packed_chunk_size through get_onnx_path and _compile
- constants.py: Add MOE_PREFILL_PACKED_CHUNK_SIZE = 256
…ms to QEFFAutoModelForCausalLM.export()
compiler_options is only available in compile(), not export(). Add num_cores and moe_prefill_packed_chunk_size as explicit named params to export() so they are directly accessible, matching the pattern in vbaddi/feat/prefill_moe.
- modeling_qeff.py: save moe_prefill_num_nsp from compiler_options in _compile and pass through get_onnx_path to export()
- modeling_utils.py: add granitemoe to SPECIALIZED_DISAGG_SERVING_MODEL_ARCH
- modeling_granitemoe.py: fix supports_moe_prefill_blocking moved out of docstring into class body; fix reshape order to match Qwen3-MoE/GPT-OSS
- modeling_auto.py: add moe_prefill_num_nsp param to compile()/export()/get_seq_len; pass moe_prefill_num_nsp to lang_model.export() in VLM path; fix sliding_window AttributeError for models without sliding_window attr
@vbaddi can you review this PR?
There is an ongoing effort to put all the changes in the 1.22_tmp branch, which should soon land in mainline QEff. Let's rebase once that is done to streamline this work item. Thanks @divytrip3005
mla_absorption: Optional[Dict[str, bool]] = None,
qaic_config: Optional[dict] = None,
moe_prefill_packed_chunk_size: Optional[int] = None,
moe_prefill_num_nsp: Optional[int] = None,
nit: no need of this imo, this would be same as num_cores already.
return g.onnxscript_op(CtxGather3D, data, ctx_indices).setTypeAs(data)


class CtxGatherFunc3DGeneralized(torch.autograd.Function):
nit: Let's rebase to 1.22_tmp, these changes should catch up in there.
return q_embed.to(q.dtype), k_embed.to(k.dtype)


class QEffPrefillChunkedGraniteMoeAttention(GraniteMoeAttention):
nit: what's the purpose of this? why do we need different chunked attention? does this solve anything unique?
    packed_chunk_size = seq_len // num_expert_chunks
else:
    packed_chunk_size = max(1, min(packed_chunk_size, seq_len))
    num_expert_chunks = seq_len // packed_chunk_size
nit: let's rebase to 1.22_tmp branch, this logic should be aligned w/gptoss and qwen3-moe.
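For reference, the chunk-size resolution in the snippet under review can be sketched as a standalone function. The function name is hypothetical, and the `if` condition is an assumption, since the diff context does not show it. The default of 256 mirrors the MOE_PREFILL_PACKED_CHUNK_SIZE constant added in this PR.

```python
def resolve_packed_chunking(seq_len, packed_chunk_size=256, num_expert_chunks=None):
    """Return (packed_chunk_size, num_expert_chunks) for a given prefill length."""
    if num_expert_chunks is not None:
        # Assumed branch condition: an explicit chunk count fixes the chunk size.
        packed_chunk_size = seq_len // num_expert_chunks
    else:
        # Clamp the requested chunk size into [1, seq_len], then derive the count.
        packed_chunk_size = max(1, min(packed_chunk_size, seq_len))
        num_expert_chunks = seq_len // packed_chunk_size
    return packed_chunk_size, num_expert_chunks


print(resolve_packed_chunking(512))                       # chunk size 256, two chunks
print(resolve_packed_chunking(128))                       # chunk size clamped to seq_len
print(resolve_packed_chunking(512, num_expert_chunks=4))  # chunk size derived from count
```

Because both branches use floor division, a seq_len that is not a multiple of the chunk size leaves a remainder uncovered in this sketch.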
# -----------------------------------------------------------------------------
#
# Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.
# SPDX-License-Identifier: BSD-3-Clause
nit: what changed here? why so much diff? was this a formatting issue earlier?
def test_qwen3moe_prefill_chunked_export(tmp_path):
    config = AutoConfig.for_model("qwen3_moe", **QWEN3_MOE_CFG)
    model = AutoModelForCausalLM.from_config(config, **MODEL_KWARGS)
    qeff = QEFFAutoModelForCausalLM(model, continuous_batching=False)
nit: how are we verifying the chunked export here? should be either via customops (CtxGeneralized*) in onnx or checking the module presence in pytorch no?
NSP-Blocked MoE Prefill Dispatch
What Is This?
Large MoE (Mixture-of-Experts) models like Qwen3-MoE, GPT-OSS, GraniteMoE, and Qwen3-VL-MoE route each input token to a small subset of expert FFNs (e.g. 8 out of 128 experts). During prefill, all tokens are processed simultaneously, so the workload is compute-bound and benefits from parallelism.
A Qualcomm AI100 chip has 16 NSPs that can execute in parallel. The standard per-expert sequential loop underutilizes this parallelism. NSP blocking restructures the expert dispatch so that each NSP handles a dedicated group of experts simultaneously, which improves prefill throughput.
How It Works
Standard MoE Dispatch (Sequential)
Every expert is processed one at a time.
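A minimal NumPy sketch of such a sequential loop is given below. It is not the code from the PR: the SiLU-gated FFN form, the tensor layouts, and all names are assumptions chosen for illustration.

```python
import numpy as np


def silu(z):
    return z / (1.0 + np.exp(-z))


def moe_sequential(x, gate_w, up_w, down_w, topk_idx, topk_w):
    """x: [T, H]; gate_w/up_w: [E, H, I]; down_w: [E, I, H]; topk_idx/topk_w: [T, K]."""
    out = np.zeros_like(x)
    for e in range(gate_w.shape[0]):          # one loop iteration per expert
        hit = topk_idx == e                   # [T, K] routing hits for expert e
        tok = np.nonzero(hit.any(axis=1))[0]  # tokens routed to expert e
        if tok.size == 0:
            continue
        w_e = (topk_w * hit).sum(axis=1)[tok]  # routing weight per selected token
        h = silu(x[tok] @ gate_w[e]) * (x[tok] @ up_w[e])
        out[tok] += w_e[:, None] * (h @ down_w[e])
    return out


rng = np.random.default_rng(0)
T, H, I, E, K = 6, 4, 5, 8, 2                 # tiny hypothetical sizes
x = rng.standard_normal((T, H))
gate_w = rng.standard_normal((E, H, I))
up_w = rng.standard_normal((E, H, I))
down_w = rng.standard_normal((E, I, H))
topk_idx = np.argsort(rng.random((T, E)), axis=1)[:, :K]  # K distinct experts per token
topk_w = rng.random((T, K))

out_seq = moe_sequential(x, gate_w, up_w, down_w, topk_idx, topk_w)
print(out_seq.shape)
```

The loop length equals the number of experts E, which is the cost that NSP blocking reduces.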
NSP-Blocked Dispatch (Parallel)
The inner loop runs local_experts = E / num_nsp times (e.g. 8 instead of 128), with all 16 NSPs working in parallel on their expert group.
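The loop structure can be sketched in NumPy with the NSP axis as a batched einsum dimension. This is an illustrative sketch only. It uses dense routing weights (zero where a token is not routed) instead of the packed cumsum-scatter-gather dispatch, and the contiguous expert-to-NSP grouping and all names are assumptions.

```python
import numpy as np


def silu(z):
    return z / (1.0 + np.exp(-z))


def moe_nsp_blocked(x, gate_w, up_w, down_w, routing, num_nsp):
    """x: [T, H]; gate_w/up_w: [E, H, I]; down_w: [E, I, H]; routing: [T, E]."""
    T = x.shape[0]
    local_experts = gate_w.shape[0] // num_nsp
    # Assumed grouping: expert e lives on NSP e // local_experts, slot e % local_experts.
    g = gate_w.reshape(num_nsp, local_experts, *gate_w.shape[1:])
    u = up_w.reshape(num_nsp, local_experts, *up_w.shape[1:])
    d = down_w.reshape(num_nsp, local_experts, *down_w.shape[1:])
    r = routing.reshape(T, num_nsp, local_experts)
    out = np.zeros_like(x)
    iterations = 0
    for slot in range(local_experts):  # E / num_nsp iterations instead of E
        # The leading axis n covers all NSPs, so num_nsp experts run per iteration.
        gate = np.einsum("th,nhi->nti", x, g[:, slot])
        up = np.einsum("th,nhi->nti", x, u[:, slot])
        y = np.einsum("nti,nih->nth", silu(gate) * up, d[:, slot])
        out += np.einsum("nth,tn->th", y, r[:, :, slot])
        iterations += 1
    return out, iterations


rng = np.random.default_rng(0)
T, H, I, E, K, NUM_NSP = 5, 4, 3, 128, 8, 16
x = rng.standard_normal((T, H))
gate_w = rng.standard_normal((E, H, I))
up_w = rng.standard_normal((E, H, I))
down_w = rng.standard_normal((E, I, H))
topk_idx = np.argsort(rng.random((T, E)), axis=1)[:, :K]
routing = np.zeros((T, E))
np.put_along_axis(routing, topk_idx, rng.random((T, K)), axis=1)

out_blocked, iterations = moe_nsp_blocked(x, gate_w, up_w, down_w, routing, NUM_NSP)
print(iterations)  # 128 experts / 16 NSPs
```

With E = 128 and num_nsp = 16 the Python loop runs 8 times, and the output matches a plain per-expert loop over all 128 experts.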
Implementation
Key Components
1. Weight Splitting (__qeff_init__)
The fused gate_up_proj [E, H, 2I] is split into separate gate and up projections and reshaped for NSP-parallel access:
2. NSP-Grouped Reshape
Experts are grouped across NSPs using a double-transpose pattern that correctly interleaves experts:
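The split and grouping steps can be sketched as below. Several things here are assumptions: the fused tensor is split into contiguous halves (some checkpoints interleave gate and up columns instead), the grouping uses a single reshape plus axis swap, and the expert-to-NSP mapping is hypothetical. The exact double-transpose pattern used in the PR is not shown in this page.

```python
import numpy as np

E, H, I, NUM_NSP = 8, 4, 3, 4  # tiny hypothetical sizes
LOCAL = E // NUM_NSP           # experts handled per NSP

gate_up = np.arange(E * H * 2 * I, dtype=np.float32).reshape(E, H, 2 * I)

# Assumption: gate occupies the first I columns, up the last I columns.
gate = gate_up[..., :I]
up = gate_up[..., I:]


def group_for_nsp(w, num_nsp):
    """[E, ...] -> [local_experts, num_nsp, ...] so one slot index spans all NSPs."""
    local = w.shape[0] // num_nsp
    grouped = w.reshape(num_nsp, local, *w.shape[1:])
    return np.swapaxes(grouped, 0, 1)


def expert_id(slot, nsp, local):
    # Hypothetical mapping: NSP n owns the contiguous experts [n * local, (n + 1) * local).
    return nsp * local + slot


gate_g = group_for_nsp(gate, NUM_NSP)
up_g = group_for_nsp(up, NUM_NSP)
print(gate_g.shape)  # (LOCAL, NUM_NSP, H, I)
```

Indexing `gate_g[slot]` then yields one expert per NSP, which is the unit of work for a single iteration of the blocked loop.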
3. Cumsum-Scatter-Gather Kernel
For each slot, only tokens routed to that expert group are processed:
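The packing idea can be sketched in NumPy for a single expert. This is an assumption-laden illustration, not the kernel from the PR: the helper names are hypothetical, the expert is replaced by a stand-in that doubles its input, and INT32_MAX marks unrouted tokens in the spirit of the sentinel handling mentioned in the first commit.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max


def pack_positions(mask):
    # A running count over routed tokens gives each one a dense slot; others get the sentinel.
    pos = np.cumsum(mask.astype(np.int32)) - 1
    return np.where(mask, pos, INT32_MAX).astype(np.int32)


def scatter_rows(x, pos, packed_len):
    # Write routed rows into a dense buffer; sentinel and overflow positions are dropped.
    packed = np.zeros((packed_len, x.shape[1]), dtype=x.dtype)
    valid = pos < packed_len
    packed[pos[valid]] = x[valid]
    return packed


def gather_rows(packed, pos):
    # Read results back to token order; unrouted tokens receive zeros.
    valid = pos < packed.shape[0]
    safe = np.where(valid, pos, 0)
    return np.where(valid[:, None], packed[safe], 0.0)


x = np.arange(12, dtype=np.float32).reshape(6, 2)       # 6 tokens, hidden size 2
mask = np.array([True, False, True, True, False, True])  # tokens routed to this expert

pos = pack_positions(mask)
packed = scatter_rows(x, pos, packed_len=4)
expert_out = packed * 2.0                                # stand-in for the expert FFN
out = gather_rows(expert_out, pos)
print(pos)
```

Only the packed rows pass through the expert, so the expert's work scales with the packed chunk size, not the full sequence length. In this sketch tokens beyond packed_len are simply dropped; handling them would need additional packed chunks.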
4. Forward Dispatch
The class attribute supports_moe_prefill_blocking = True signals that this module supports API-driven config injection. The forward method dispatches to the NSP path only when expert_blocking_num_nsp is set:
Supported Models
- Qwen3-MoE: QEffPrefillChunkedQwen3MoeSparseMoeBlock
- GPT-OSS: QEffPrefillOnlyChunkedGptOssMLP
- GraniteMoE: QEffPrefillChunkedGraniteMoeMoE
- Qwen3-VL-MoE: QEffPrefillChunkedQwen3VLMoeTextSparseMoeBlock
How to Enable
CausalLM Models (Qwen3-MoE, GPT-OSS, GraniteMoE)
VLM Models (Qwen3-VL-MoE)
Disabling NSP Blocking (Baseline)
To compile without NSP blocking for comparison:
Key Parameters
- moe_prefill_num_nsp (Optional[int], default None): set to num_cores (typically 16) to enable blocking. None disables blocking.
- moe_prefill_packed_chunk_size (int, default 256)
Notes
- Different values of moe_prefill_num_nsp produce different ONNX/QPC hashes, so switching between blocking configs does not cause cache conflicts.
- … index_select-based dispatch, which is already efficient for single-token generation.