Feat/moe nsp blocking all models by divytrip3005 · Pull Request #1016 · quic/efficient-transformers

divytrip3005 · 2026-06-01T05:46:03Z

NSP-Blocked MoE Prefill Dispatch

What Is This?

Large MoE (Mixture-of-Experts) models like Qwen3-MoE, GPT-OSS, GraniteMoE, and Qwen3-VL-MoE route each input token to a small subset of expert FFNs (e.g. 8 out of 128 experts). During prefill, all prompt tokens are processed together, so this phase is compute-bound and benefits from parallelism.

A Qualcomm AI100 chip has 16 NSPs that can execute in parallel. The standard per-expert sequential loop underutilizes this parallelism. NSP blocking restructures the expert dispatch so that each NSP handles a dedicated group of experts and all NSPs work at the same time, with the aim of improving prefill throughput.

How It Works

Standard MoE Dispatch (Sequential)

Tokens [T, H]
    │
    ▼
Router → selects top-k experts per token
    │
    ▼
for expert_e in range(num_experts):          # sequential loop over all E experts
    mask = tokens routed to expert_e
    out += expert_e(tokens[mask]) * weight
    │
    ▼
Output [T, H]

Every expert is processed one at a time.
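As a point of reference, the sequential loop above can be sketched in plain Python. This is a toy model, not the QEfficient implementation: tokens are scalars, each "expert" is a hypothetical callable, and `sequential_moe` is an illustrative name.

```python
from typing import Callable, List, Sequence


def sequential_moe(
    tokens: Sequence[float],
    routing_weights: Sequence[Sequence[float]],  # [T][E], zero where not routed
    experts: Sequence[Callable[[float], float]],
) -> List[float]:
    """Reference dispatch: visit every expert in turn, one at a time."""
    out = [0.0] * len(tokens)
    for e, expert in enumerate(experts):      # E iterations, no parallelism
        for t, x in enumerate(tokens):
            w = routing_weights[t][e]
            if w > 0:                         # token t was routed to expert e
                out[t] += w * expert(x)
    return out


# Toy usage: 3 tokens, 4 "experts" that scale their input by 1..4.
experts = [lambda x, k=k: x * (k + 1) for k in range(4)]
tokens = [1.0, 2.0, 3.0]
routing = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.25, 0.75, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
result = sequential_moe(tokens, routing, experts)
print(result)
```

The outer loop always runs once per expert, even when an expert receives no tokens, which is the cost the blocked dispatch tries to reduce.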

NSP-Blocked Dispatch (Parallel)

Tokens [T, H]
    │
    ▼
Router → routing_weights [T, E]
    │
    ▼
Reshape experts into NSP groups:
  E experts → num_nsp groups × local_experts each
  e.g. 128 experts, NSP=16 → 16 groups × 8 experts

    NSP-0              NSP-1              ...  NSP-15
   [E0,E16,..,E112]   [E1,E17,..,E113]         [E15,E31,..,E127]
          │                  │                        │
          ▼                  ▼                        ▼
   (parallel execution across all 16 NSPs)

for slot in range(local_experts):            # loop over 8 slots (not 128!)
    T2Ei = tokens routed to this slot's expert in each NSP group
    │
    ├─ cumsum → compute packed indices of active tokens
    ├─ gather → pack only active tokens [T_active, H]
    ├─ matmul → run expert FFN on packed tokens
    └─ scatter → write results back to output buffer
    │
    ▼
Output [T, H]  (sum across NSP dimension)

The inner loop runs local_experts = E / num_nsp times (e.g. 8 instead of 128), with all 16 NSPs working in parallel on their expert group.
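The loop-count arithmetic can be checked with a few lines of Python. The helper name `nsp_loop_counts` is hypothetical, and the divisibility check is an assumption (the description only shows cases where the expert count is a multiple of `num_nsp`).

```python
def nsp_loop_counts(num_experts: int, num_nsp: int) -> dict:
    """Compare loop trip counts for sequential and NSP-blocked dispatch."""
    if num_experts % num_nsp != 0:
        # Assumption: experts must divide evenly across NSP groups.
        raise ValueError("num_experts must be a multiple of num_nsp")
    local_experts = num_experts // num_nsp
    return {
        "local_experts": local_experts,
        "sequential_iters": num_experts,   # one iteration per expert
        "blocked_iters": local_experts,    # one iteration per slot
    }


qwen3 = nsp_loop_counts(128, 16)
granite = nsp_loop_counts(32, 16)
print(qwen3, granite)
```

For the 128-expert models this gives 8 slot iterations, and for a 32-expert model it gives 2.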

Implementation

Key Components

1. Weight Splitting (__qeff_init__)

The fused gate_up_proj [E, H, 2I] is split into separate gate and up projections and reshaped for NSP-parallel access:

def __qeff_init__(self):
    W_gate_up = self.experts.gate_up_proj  # [E, H, 2I]
    I = W_gate_up.shape[2] // 2
    self._W_g = nn.Parameter(W_gate_up[:, :, :I].contiguous())  # [E, H, I]
    self._W_u = nn.Parameter(W_gate_up[:, :, I:].contiguous())  # [E, H, I]
    self._W_d = nn.Parameter(self.experts.down_proj.contiguous())  # [E, I, H]
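
The slicing convention (first `I` columns are the gate projection, the remaining `I` columns are the up projection) can be illustrated without tensors. This is a sketch on nested lists; `split_gate_up` is a hypothetical helper, not part of the PR.

```python
def split_gate_up(fused):
    """Split nested lists shaped [E][H][2*I] into (gate, up), each [E][H][I]."""
    two_i = len(fused[0][0])
    if two_i % 2 != 0:
        raise ValueError("last dimension must be even (gate and up halves)")
    i = two_i // 2
    gate = [[row[:i] for row in expert] for expert in fused]  # columns 0..I-1
    up = [[row[i:] for row in expert] for expert in fused]    # columns I..2I-1
    return gate, up


# E=1 expert, H=2 rows, 2I=4 columns.
fused = [[[1, 2, 3, 4], [5, 6, 7, 8]]]
gate, up = split_gate_up(fused)
print(gate, up)
```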

2. NSP-Grouped Reshape

Experts are grouped across NSPs using a double-transpose pattern that correctly interleaves experts:

local_experts = num_experts // num_nsp  # e.g. 128 // 16 = 8

# routing weights: [T, E] → [num_nsp, local_experts, T]
rw = (
    routing_weights.transpose(0, 1).contiguous()
    .view(local_experts, num_nsp, T).transpose(0, 1).contiguous()
)

# weights: [E, H, I] → [num_nsp, local_experts, H, I]
W_g = self._W_g.view(local_experts, num_nsp, H, I).transpose(0,1).contiguous()
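
To see which expert ends up in which NSP group, the `view` followed by `transpose(0, 1)` can be emulated with index arithmetic. The sketch below assumes row-major layout, as `contiguous()` guarantees; `nsp_expert_layout` is an illustrative name.

```python
def nsp_expert_layout(num_experts: int, num_nsp: int):
    """Emulate view(local_experts, num_nsp).transpose(0, 1) on expert ids."""
    local_experts = num_experts // num_nsp
    # view(local_experts, num_nsp): expert e sits at [e // num_nsp][e % num_nsp]
    viewed = [[s * num_nsp + n for n in range(num_nsp)] for s in range(local_experts)]
    # transpose(0, 1): layout[n][s] = viewed[s][n]
    return [[viewed[s][n] for s in range(local_experts)] for n in range(num_nsp)]


layout = nsp_expert_layout(128, 16)
print(layout[0])   # experts held by NSP 0, one per slot
print(layout[15])  # experts held by NSP 15
```

Under this reshape, NSP `n` at slot `s` holds expert `s * num_nsp + n`, so each NSP receives experts strided by `num_nsp` rather than a contiguous block.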

3. Cumsum-Scatter-Gather Kernel

For each slot, only tokens routed to that expert group are processed:

for slot in range(local_experts):
    routing_weight = rw[:, slot, :]      # [num_nsp, T]
    T2Ei = routing_weight > 0            # boolean mask of active tokens

    # Pack active tokens using cumsum-based indexing
    matched_idx = _build_matched_idx_from_cumsum(T2Ei)

    # Gather: pick only active tokens → [num_nsp, T_active, H]
    x_chunk = CtxGatherFunc3DGeneralized.apply(x_expanded, matched_idx)

    # Expert FFN on packed tokens
    gate_prime = x_chunk @ W_g[:, slot]
    up_prime   = x_chunk @ W_u[:, slot]
    down_chunk = (up_prime * act_fn(gate_prime)) @ W_d[:, slot]

    # Scatter: write results back
    expert_out = CtxScatterFunc3DGeneralized.apply(expert_out, matched_idx, down_chunk * rw_chunk)
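
The packing idea can be sketched for a single NSP row in plain Python. This is not the body of `_build_matched_idx_from_cumsum` or the custom ops, whose exact semantics are not shown here. The helper names, the use of `INT32_MAX` as a padding sentinel, and the accumulate-on-scatter behaviour are all assumptions made for illustration.

```python
from itertools import accumulate
from typing import List, Sequence

INT32_MAX = 2**31 - 1  # sentinel for unused packed rows (assumed convention)


def build_packed_indices(mask: Sequence[bool], chunk_size: int) -> List[int]:
    """Return, for each packed row, the index of the token it holds."""
    ranks = list(accumulate(int(m) for m in mask))  # running count of active tokens
    packed = [INT32_MAX] * chunk_size
    for t, (active, rank) in enumerate(zip(mask, ranks)):
        if active and rank <= chunk_size:
            packed[rank - 1] = t  # the rank-th active token goes to row rank-1
    return packed


def gather(values: Sequence[float], packed: Sequence[int]) -> List[float]:
    """Pick packed rows; sentinel rows read as zero."""
    return [0.0 if i == INT32_MAX else values[i] for i in packed]


def scatter_add(out: Sequence[float], packed: Sequence[int], updates: Sequence[float]) -> List[float]:
    """Write packed results back to their token positions; skip sentinel rows."""
    result = list(out)
    for i, u in zip(packed, updates):
        if i != INT32_MAX:
            result[i] += u
    return result


tokens = [10.0, 11.0, 12.0, 13.0, 14.0]
mask = [False, True, False, True, True]        # tokens routed to this slot's expert
packed = build_packed_indices(mask, chunk_size=4)
chunk = gather(tokens, packed)
expert_out = [2.0 * x for x in chunk]          # stand-in for the expert FFN
output = scatter_add([0.0] * len(tokens), packed, expert_out)
print(packed, output)
```

In this sketch, active tokens beyond `chunk_size` are simply not packed; a chunked implementation would pick them up in a later chunk.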

4. Forward Dispatch

The class attribute supports_moe_prefill_blocking = True signals that this module supports API-driven config injection. The forward method dispatches to the NSP path only when expert_blocking_num_nsp is set:

class QEffPrefillChunkedQwen3VLMoeTextSparseMoeBlock(Qwen3VLMoeTextSparseMoeBlock):
    supports_moe_prefill_blocking = True

    def forward(self, hidden_states):
        ...
        if hasattr(self, "expert_blocking_num_nsp"):
            expert_out = self._forward_expert_blocked(x, routing_weights)
            return expert_out.view(B, S, H), router_logits
        return self.orig_forward(hidden_states)  # fallback
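
The attribute-driven pattern can be sketched with toy classes. The attribute names `supports_moe_prefill_blocking`, `expert_blocking_num_nsp`, and `packed_chunk_size` come from this PR; the classes and the `inject_blocking_config` helper are hypothetical stand-ins for the real modules and for the injection done at compile/export time.

```python
class ToyMoEBlock:
    supports_moe_prefill_blocking = True  # opt-in marker checked by the injector

    def forward(self, x):
        # Take the blocked path only when a non-None NSP count was injected.
        if getattr(self, "expert_blocking_num_nsp", None) is not None:
            return ("blocked", self.expert_blocking_num_nsp)
        return ("fallback", None)


class ToyDenseBlock:
    def forward(self, x):
        return ("dense", None)


def inject_blocking_config(modules, num_nsp, packed_chunk_size):
    """Set blocking attributes on every module that opts in; return the count."""
    configured = 0
    for m in modules:
        if getattr(m, "supports_moe_prefill_blocking", False):
            m.expert_blocking_num_nsp = num_nsp
            m.packed_chunk_size = packed_chunk_size
            configured += 1
    return configured


modules = [ToyMoEBlock(), ToyDenseBlock()]
before = modules[0].forward(None)
count = inject_blocking_config(modules, num_nsp=16, packed_chunk_size=256)
after = modules[0].forward(None)
print(before, count, after)
```

The sketch checks `getattr(..., None) is not None` instead of `hasattr`, so that an attribute explicitly set to `None` still falls back to the standard path. Whether the real modules need that distinction depends on how the attribute is set when blocking is disabled.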

Supported Models

| Model | Class | Experts | NSP Groups (NSP=16) |
|---|---|---|---|
| Qwen3-MoE | `QEffPrefillChunkedQwen3MoeSparseMoeBlock` | 128 | 16 × 8 |
| GPT-OSS | `QEffPrefillOnlyChunkedGptOssMLP` | 128 | 16 × 8 |
| GraniteMoE | `QEffPrefillChunkedGraniteMoeMoE` | 32 | 16 × 2 |
| Qwen3-VL-MoE | `QEffPrefillChunkedQwen3VLMoeTextSparseMoeBlock` | 128 | 16 × 8 |

How to Enable

CausalLM Models (Qwen3-MoE, GPT-OSS, GraniteMoE)

from QEfficient import QEFFAutoModelForCausalLM

model = QEFFAutoModelForCausalLM.from_pretrained("Qwen/Qwen3-30B-A3B-Instruct-2507")

# Compile prefill with NSP blocking enabled
prefill_qpc = model.compile(
    batch_size=1,
    prefill_seq_len=512,
    ctx_len=4096,
    num_cores=16,                        # number of NSPs on the chip
    num_devices=1,
    mxfp6_matmul=True,
    mxint8_kv_cache=True,
    prefill_only=True,
    enable_chunking=True,
    moe_prefill_num_nsp=16,              # ← enable NSP blocking (set to num_cores)
    moe_prefill_packed_chunk_size=256,   # ← token chunk size per iteration
)

# Compile decode (no blocking needed for decode)
decode_qpc = model.compile(
    batch_size=1,
    prefill_seq_len=1,
    ctx_len=4096,
    num_cores=16,
    num_devices=1,
    mxfp6_matmul=True,
    mxint8_kv_cache=True,
)

VLM Models (Qwen3-VL-MoE)

from QEfficient import QEFFAutoModelForImageTextToText

model = QEFFAutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-30B-A3B-Instruct",
    attn_implementation="eager",
    kv_offload=True,
)

# Compile prefill with NSP blocking
prefill_qpc = model.compile(
    batch_size=1,
    prefill_seq_len=512,
    ctx_len=4096,
    height=354,
    width=536,
    num_cores=16,
    num_devices=1,
    mxfp6_matmul=True,
    mxint8_kv_cache=True,
    prefill_only=True,
    enable_chunking=True,
    skip_vision=True,                    # compile lang model only
    moe_prefill_num_nsp=16,              # ← enable NSP blocking
    moe_prefill_packed_chunk_size=256,
)

# Compile decode
decode_qpc = model.compile(
    batch_size=1,
    prefill_seq_len=1,
    ctx_len=4096,
    height=354,
    width=536,
    num_cores=16,
    num_devices=1,
    mxfp6_matmul=True,
    mxint8_kv_cache=True,
    prefill_only=False,
    skip_vision=True,
)

Disabling NSP Blocking (Baseline)

To compile without NSP blocking for comparison:

prefill_qpc = model.compile(
    ...,
    prefill_only=True,
    enable_chunking=True,
    moe_prefill_num_nsp=None,   # ← None disables NSP blocking
)

Key Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `moe_prefill_num_nsp` | `Optional[int]` | `None` | Number of NSP groups to split experts across. Set to `num_cores` (typically 16) to enable blocking. `None` disables blocking. |
| `moe_prefill_packed_chunk_size` | `int` | `256` | Number of token rows per packed chunk in the scatter-gather loop. Larger values mean fewer chunk iterations per expert slot but larger packed tensors per iteration. |
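
How the chunk size translates into loop iterations can be sketched from the clamping logic visible in the review diff (`max(1, min(packed_chunk_size, seq_len))` followed by an integer division). The function name `packed_chunk_plan` is hypothetical, and the sketch does not handle sequence lengths that are not a multiple of the chunk size.

```python
def packed_chunk_plan(seq_len: int, packed_chunk_size: int):
    """Return (effective_chunk_size, num_chunks) for one prefill window."""
    # Clamp the chunk size into [1, seq_len], as in the diff under review.
    effective = max(1, min(packed_chunk_size, seq_len))
    # Floor division: any remainder rows are not covered by this sketch.
    num_chunks = seq_len // effective
    return effective, num_chunks


default_plan = packed_chunk_plan(seq_len=512, packed_chunk_size=256)
short_plan = packed_chunk_plan(seq_len=128, packed_chunk_size=256)
print(default_plan, short_plan)
```

With `prefill_seq_len=512` and the default chunk size of 256, each slot processes two packed chunks.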

Notes

- Token correctness: NSP blocking produces numerically identical outputs to the standard path for shallow models (2 layers). For full-depth models, minor FP16 rounding differences may accumulate across layers because floating-point addition is not associative. This is expected and does not affect model quality in practice.
- Cache isolation: Different values of moe_prefill_num_nsp produce different ONNX/QPC hashes, so switching between blocking configs does not cause cache conflicts.
- Decode path: NSP blocking applies only to the prefill phase. The decode path uses a standard index_select-based dispatch, which is already efficient for single-token generation.
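
The rounding effect mentioned under token correctness can be reproduced with the standard library alone. The sketch rounds to IEEE 754 half precision after every addition via `struct`'s `"e"` format; it illustrates order-dependent FP16 summation in general and says nothing about how the hardware actually accumulates.

```python
import struct


def fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_sum(values):
    """Sum left to right, rounding to FP16 after every addition."""
    acc = 0.0
    for v in values:
        acc = fp16(acc + fp16(v))
    return acc


# Same four numbers, two summation orders.
big_first = fp16_sum([2048.0, 0.75, 0.75, 0.75])
small_first = fp16_sum([0.75, 0.75, 0.75, 2048.0])
print(big_first, small_first)
```

Near 2048 the spacing between FP16 values is 2, so adding 0.75 three times to 2048 is lost each time, while summing the small terms first yields 2.25 and moves the total to 2050. Reordering expert contributions across NSPs can shift results in the same way.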

Add expert-blocked NSP-parallel prefill forward to QEffPrefillChunkedQwen3MoeSparseMoeBlock and QEffPrefillOnlyChunkedGptOssMLP. Controlled via EXPERT_BLOCKING_NUM_NSP env var. Fix CtxScatterFunc3D/CtxGatherFunc3D eager forward for INT32_MAX sentinel handling. Add disagg-mode tests for both models with tiny configs. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

…prefill Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

- Root cause: CtxGather3D ONNX symbolic expanded ctx_indices to Shape(data)[:2] ([batch, seq_len]), which is wrong for packed dispatch.
- In expert-blocked MoE prefill, ctx_indices is intentionally [batch, packed_chunk_size] (e.g. [16, 256]) while data stays [batch, seq_len, ...] (e.g. [16, 512, ...]).
- This caused invalid Expand attempts ([16,256] -> [16,512]) and QAIC compile/runtime failure on /model/layers.0/mlp/CtxGather3D/....

Fix:
- Update CtxGather3D expand target to:
  - batch dim from data
  - index-seq dim from ctx_indices
- New expand shape is [batch_size(data), idx_seq_len(ctx_indices)], preserving packed chunk length.

Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com>
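
The shape reasoning in this commit message can be checked with tuples. The functions below are illustrative only: they model the old and new expand targets and a simplified same-rank broadcast check, not the actual ONNX symbolic.

```python
def expand_target_old(data_shape, idx_shape):
    """Old behaviour: take both dims from data -> [batch, seq_len]."""
    return tuple(data_shape[:2])


def expand_target_new(data_shape, idx_shape):
    """Fixed behaviour: batch from data, index length from ctx_indices."""
    return (data_shape[0], idx_shape[1])


def broadcast_compatible(src, dst):
    """Same-rank broadcast check: each dim pair must match or contain a 1."""
    return len(src) == len(dst) and all(
        s == d or s == 1 or d == 1 for s, d in zip(src, dst)
    )


data_shape = (16, 512, 64)   # [batch, seq_len, hidden]; 64 is an arbitrary example
idx_shape = (16, 256)        # [batch, packed_chunk_size]

old_ok = broadcast_compatible(idx_shape, expand_target_old(data_shape, idx_shape))
new_ok = broadcast_compatible(idx_shape, expand_target_new(data_shape, idx_shape))
print(old_ok, new_ok)
```

The old target asks a length-256 axis to become length 512, which no broadcast rule allows; the new target keeps the packed length and is trivially compatible.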

…port Add missing CustomOpTransform mappings for CtxScatterFunc3DInt and generalized 3D scatter/gather ops, plus a prefill-only subfunction export regression test to verify the ONNX graph includes the required CtxScatter3DInt/CtxScatter3D/CtxGather3D ops. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

…on export Replace MoE prefill sum reductions with equivalent einsum forms and rewrite int32 clamp bounds using where to avoid QAIC subfunction compile failures for GPT-OSS and Qwen3-MoE. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

Trace chunked prefill exports with the requested prefill_seq_len so packed MoE dispatch unrolls all packed chunks, restore torch.full_like index init, and add ONNX coverage for the second packed chunk slice. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

- gpt_oss/modeling_gpt_oss.py: add num_expert_chunks dynamic-loop mechanism to _cumsum_scatter_gather_update_gptoss_expert_blocked and _forward_expert_blocked; read _num_expert_chunks from module in forward
- qwen3_moe/modeling_qwen3_moe.py: same num_expert_chunks dynamic-loop mechanism; replace fallback inline loop with orig_forward call
- granitemoe/modeling_granitemoe.py: add QEffPrefillChunkedGraniteMoeAttention, QEffPrefillChunkedGraniteMoeMoE with full ONNX-friendly cumsum-scatter-gather dispatch; update get_submodules_for_export to return set() when chunked prefill MoE is active
- pytorch_transforms.py: register GraniteMoe forward/reverse mappings in PrefillOnlyChunkedTransform and RevertPrefillKeepAttentionTransform
- modeling_auto.py: compute num_expert_chunks from prefill_seq_len and EXPERT_BLOCKING_PACKED_CHUNK_SIZE; setattr _num_expert_chunks on every MoE layer when enable_chunking=True

torch.clamp on int32 tensors exports to ONNX Clip op which QAIC compiler does not support (Unhandled ElemKind in Clip operation). Replace with torch.where in all three models: gpt_oss, qwen3_moe, granitemoe.
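
The rewrite relies on the fact that a clamp is two nested selections. The sketch below shows the equivalence on Python lists; `where` here is a small stand-in that mimics the element-wise select of `torch.where(cond, a, b)`, and the bounds are example values.

```python
def where(cond, a, b):
    """Element-wise select: take a[i] where cond[i] is true, else b[i]."""
    return [x if c else y for c, x, y in zip(cond, a, b)]


def clamp_via_where(values, lo, hi):
    """Clamp integers into [lo, hi] using only comparisons and selects."""
    n = len(values)
    lower_fixed = where([v < lo for v in values], [lo] * n, values)
    return where([v > hi for v in lower_fixed], [hi] * n, lower_fixed)


rows = [-3, 0, 5, 300]
clamped = clamp_via_where(rows, 0, 256)
reference = [min(max(v, 0), 256) for v in rows]
print(clamped, reference)
```

Expressed this way, the exported graph needs comparison and select ops on int32 instead of a Clip op.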

…raniteMoE
- modeling_qwen3_vl_moe.py: Add QEffPrefillChunkedQwen3VLMoeTextSparseMoeBlock with NSP-blocked cumsum-scatter-gather dispatch; supports_moe_prefill_blocking=True; use expert_blocking_num_nsp/packed_chunk_size/num_packed_chunks instance attrs
- modeling_qwen3_moe.py, modeling_gpt_oss.py, modeling_granitemoe.py: Replace EXPERT_BLOCKING_NUM_NSP/EXPERT_BLOCKING_PACKED_CHUNK_SIZE env vars with API-driven instance attributes set via compile() params
- pytorch_transforms.py: Register QEffPrefillChunkedQwen3VLMoeTextSparseMoeBlock in RevertPrefillKeepAttentionTransform
- modeling_auto.py: QEFFAutoModelForCausalLM.get_seq_len_and_handle_specialized_prefill_model iterates modules with supports_moe_prefill_blocking=True and sets instance attrs; QEffCausalLMForTextImageToTextModel.export() uses same API-driven pattern for VLM; QEFFAutoModelForImageTextToText.compile() accepts moe_prefill_packed_chunk_size param
- modeling_qeff.py: Uncomment self.onnx_path fallback in _compile so pre-exported ONNX is reused without hitting get_onnx_path; pass moe_prefill_packed_chunk_size through get_onnx_path and _compile
- constants.py: Add MOE_PREFILL_PACKED_CHUNK_SIZE = 256

…ms to QEFFAutoModelForCausalLM.export()

compiler_options is only available in compile(), not export(). Add num_cores and moe_prefill_packed_chunk_size as explicit named params to export() so they are directly accessible, matching the pattern in vbaddi/feat/prefill_moe.

- modeling_qeff.py: save moe_prefill_num_nsp from compiler_options in _compile and pass through get_onnx_path to export()
- modeling_utils.py: add granitemoe to SPECIALIZED_DISAGG_SERVING_MODEL_ARCH
- modeling_granitemoe.py: fix supports_moe_prefill_blocking moved out of docstring into class body; fix reshape order to match Qwen3-MoE/GPT-OSS
- modeling_auto.py: add moe_prefill_num_nsp param to compile()/export()/get_seq_len; pass moe_prefill_num_nsp to lang_model.export() in VLM path; fix sliding_window AttributeError for models without sliding_window attr

divytrip3005 · 2026-06-01T05:52:06Z

@vbaddi can you review this PR?

vbaddi

There is an ongoing effort to land all of these changes in the 1.22_tmp branch, which should reach mainline QEff soon. Let's rebase once that is done to streamline this work item. Thanks @divytrip3005

cc: @quic-rishinr @anujgupt-github

vbaddi · 2026-06-04T03:11:32Z

+        mla_absorption: Optional[Dict[str, bool]] = None,
        qaic_config: Optional[dict] = None,
+        moe_prefill_packed_chunk_size: Optional[int] = None,
+        moe_prefill_num_nsp: Optional[int] = None,


nit: no need for this imo; it would already be the same as num_cores.

vbaddi · 2026-06-04T03:12:41Z

        return g.onnxscript_op(CtxGather3D, data, ctx_indices).setTypeAs(data)


+class CtxGatherFunc3DGeneralized(torch.autograd.Function):


nit: Let's rebase to 1.22_tmp, these changes should catch up in there.

vbaddi · 2026-06-04T03:13:33Z

    return q_embed.to(q.dtype), k_embed.to(k.dtype)


+class QEffPrefillChunkedGraniteMoeAttention(GraniteMoeAttention):


nit: what's the purpose of this? why do we need different chunked attention? does this solve anything unique?

vbaddi · 2026-06-04T03:14:34Z

+        packed_chunk_size = seq_len // num_expert_chunks
+    else:
+        packed_chunk_size = max(1, min(packed_chunk_size, seq_len))
+        num_expert_chunks = seq_len // packed_chunk_size


nit: let's rebase to 1.22_tmp branch, this logic should be aligned w/gptoss and qwen3-moe.

vbaddi · 2026-06-04T03:15:47Z

-# -----------------------------------------------------------------------------
-#
-# Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.
-# SPDX-License-Identifier: BSD-3-Clause


nit: what changed here? why so much diff? was this a formatting issue earlier?

vbaddi · 2026-06-04T03:17:03Z

+def test_qwen3moe_prefill_chunked_export(tmp_path):
+    config = AutoConfig.for_model("qwen3_moe", **QWEN3_MOE_CFG)
+    model = AutoModelForCausalLM.from_config(config, **MODEL_KWARGS)
+    qeff = QEFFAutoModelForCausalLM(model, continuous_batching=False)


nit: how are we verifying the chunked export here? should be either via customops (CtxGeneralized*) in onnx or checking the module presence in pytorch no?

vbaddi and others added 16 commits April 30, 2026 07:17

nit: weights re-route fixes

a5bd93a

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

nit: weights re-route fixes v1

c4ef4c8

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

nit(0423): gpt oss moe fixed and nit

290839e

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

nit(0424): ctx batch idx cast to int32

2804851

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

nit(0429): qwen3_moe, gpt_oss: port cumsum scatter-gather-update MoE …

6b049bc

…prefill Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

nit(0429): update modeling files

1ae7b23

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

fix: replace torch.clamp with torch.where for int32 chunk_valid_rows

3a1873b


vbaddi requested changes Jun 4, 2026

divytrip3005 wants to merge 16 commits into quic:main from divytrip3005:feat/moe-nsp-blocking-all-models

divytrip3005 commented Jun 1, 2026 • edited by vbaddi


3 participants
