Feat/moe nsp blocking all models #1016

Open
divytrip3005 wants to merge 16 commits into quic:main from divytrip3005:feat/moe-nsp-blocking-all-models

@divytrip3005 divytrip3005 commented Jun 1, 2026

NSP-Blocked MoE Prefill Dispatch

What Is This?

Large MoE (Mixture-of-Experts) models such as Qwen3-MoE, GPT-OSS, GraniteMoE, and Qwen3-VL-MoE route each input token to a small subset of expert FFN networks (e.g. 8 out of 128 experts). During prefill, all tokens are processed simultaneously; this phase is compute-bound and benefits from parallelism.

A Qualcomm AI100 chip has 16 NSPs that can execute in parallel. The standard per-expert sequential loop underutilizes them. NSP blocking restructures the expert dispatch so that each NSP handles a dedicated group of experts concurrently, which improves prefill throughput.


How It Works

Standard MoE Dispatch (Sequential)

Tokens [T, H]
    │
    ▼
Router → selects top-k experts per token
    │
    ▼
for expert_e in range(num_experts):          # sequential loop over all E experts
    mask = tokens routed to expert_e
    out += expert_e(tokens[mask]) * weight
    │
    ▼
Output [T, H]

Every expert is processed one at a time.

NSP-Blocked Dispatch (Parallel)

Tokens [T, H]
    │
    ▼
Router → routing_weights [T, E]
    │
    ▼
Reshape experts into NSP groups:
  E experts → num_nsp groups × local_experts each
  e.g. 128 experts, NSP=16 → 16 groups × 8 experts

    NSP-0    NSP-1    NSP-2  ...  NSP-15
   [E0..E7] [E8..E15] [E16..E23]  [E120..E127]
      │         │         │              │
      ▼         ▼         ▼              ▼
   (parallel execution across all 16 NSPs)

for slot in range(local_experts):            # loop over 8 slots (not 128!)
    T2Ei = tokens routed to this slot's expert in each NSP group
    │
    ├─ cumsum → compute packed indices of active tokens
    ├─ gather → pack only active tokens [T_active, H]
    ├─ matmul → run expert FFN on packed tokens
    └─ scatter → write results back to output buffer
    │
    ▼
Output [T, H]  (sum across NSP dimension)

The inner loop runs local_experts = E / num_nsp times (e.g. 8 instead of 128), with all 16 NSPs working in parallel on their expert group.
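To make the loop-count claim concrete, here is a minimal pure-Python sketch (not the PR's implementation) that compares a sequential per-expert loop with a blocked loop. It assumes each expert is a scalar gain standing in for the real FFN; the function names and toy data are hypothetical. The NSP iteration is written as an ordinary inner loop, whereas on hardware those iterations are meant to run in parallel.

```python
import math

def sequential_dispatch(tokens, routing, gains):
    """Reference loop: visit every expert in turn (E outer iterations)."""
    out = [0.0] * len(tokens)
    for e, gain in enumerate(gains):
        for t, x in enumerate(tokens):
            w = routing[t][e]
            if w > 0:                       # token t is routed to expert e
                out[t] += w * gain * x
    return out

def blocked_dispatch(tokens, routing, gains, num_nsp):
    """Blocked loop: local_experts outer iterations, one expert per NSP each."""
    local_experts = len(gains) // num_nsp
    # One partial output row per NSP, reduced across the NSP dimension at the end.
    partial = [[0.0] * len(tokens) for _ in range(num_nsp)]
    for slot in range(local_experts):
        for nsp in range(num_nsp):          # parallel on hardware, serial here
            e = nsp * local_experts + slot  # contiguous grouping, as in the diagram
            for t, x in enumerate(tokens):
                w = routing[t][e]
                if w > 0:
                    partial[nsp][t] += w * gains[e] * x
    return [sum(partial[n][t] for n in range(num_nsp)) for t in range(len(tokens))]

# Toy data: 4 tokens, 8 scalar "experts", top-2 routing per token.
tokens = [1.0, -2.0, 0.5, 3.0]
gains = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
routing = [[0.0] * 8 for _ in tokens]
routing[0][1], routing[0][6] = 0.75, 0.25
routing[1][0], routing[1][3] = 0.5, 0.5
routing[2][7], routing[2][2] = 0.9, 0.1
routing[3][4], routing[3][5] = 0.6, 0.4

dense = sequential_dispatch(tokens, routing, gains)
blocked = blocked_dispatch(tokens, routing, gains, num_nsp=4)  # 2 slots instead of 8
```

Because every (token, expert) pair is visited exactly once in both versions, the results agree up to floating-point rounding regardless of how experts are assigned to NSPs.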


Implementation

Key Components

1. Weight Splitting (__qeff_init__)

The fused gate_up_proj [E, H, 2I] is split into separate gate and up projections and reshaped for NSP-parallel access:

def __qeff_init__(self):
    W_gate_up = self.experts.gate_up_proj  # [E, H, 2I]
    I = W_gate_up.shape[2] // 2
    self._W_g = nn.Parameter(W_gate_up[:, :, :I].contiguous())  # [E, H, I]
    self._W_u = nn.Parameter(W_gate_up[:, :, I:].contiguous())  # [E, H, I]
    self._W_d = nn.Parameter(self.experts.down_proj.contiguous())  # [E, I, H]

2. NSP-Grouped Reshape

Experts are grouped across NSPs using a double-transpose pattern that correctly interleaves experts:

local_experts = num_experts // num_nsp  # e.g. 128 // 16 = 8

# routing weights: [T, E] → [num_nsp, local_experts, T]
rw = (
    routing_weights.transpose(0, 1).contiguous()
    .view(local_experts, num_nsp, T).transpose(0, 1).contiguous()
)

# weights: [E, H, I] → [num_nsp, local_experts, H, I]
W_g = self._W_g.view(local_experts, num_nsp, H, I).transpose(0,1).contiguous()
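The following sketch emulates the view-then-transpose pattern on plain expert indices to show which experts end up on which NSP. It is illustrative only: `nsp_slot_to_expert` is a hypothetical helper, and it models the reshape exactly as written in the snippet above. Under that reading the grouping is strided (NSP n, slot l holds expert l * num_nsp + n), which differs from the contiguous [E0..E7] labels in the earlier diagram; which layout the final code uses is not confirmed here.

```python
def nsp_slot_to_expert(num_experts, num_nsp):
    """Emulate view(local_experts, num_nsp).transpose(0, 1) on expert indices."""
    local_experts = num_experts // num_nsp
    flat = list(range(num_experts))
    # view(local_experts, num_nsp) is row-major: viewed[l][n] == l * num_nsp + n
    viewed = [flat[l * num_nsp:(l + 1) * num_nsp] for l in range(local_experts)]
    # transpose(0, 1): grouped[n][l] == viewed[l][n]
    return [[viewed[l][n] for l in range(local_experts)] for n in range(num_nsp)]

groups = nsp_slot_to_expert(num_experts=128, num_nsp=16)
print(groups[0])   # [0, 16, 32, 48, 64, 80, 96, 112]
print(groups[15])  # [15, 31, 47, 63, 79, 95, 111, 127]
```

Either layout covers every expert exactly once, so the summed output is unaffected; the layout only has to be applied consistently to the routing weights and to all three weight tensors.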

3. Cumsum-Scatter-Gather Kernel

For each slot, only tokens routed to that expert group are processed:

for slot in range(local_experts):
    routing_weight = rw[:, slot, :]      # [num_nsp, T]
    T2Ei = routing_weight > 0            # boolean mask of active tokens

    # Pack active tokens using cumsum-based indexing
    matched_idx = _build_matched_idx_from_cumsum(T2Ei)

    # Gather: pick only active tokens → [num_nsp, T_active, H]
    x_chunk = CtxGatherFunc3DGeneralized.apply(x_expanded, matched_idx)

    # Expert FFN on packed tokens
    gate_prime = x_chunk @ W_g[:, slot]
    up_prime   = x_chunk @ W_u[:, slot]
    down_chunk = (up_prime * act_fn(gate_prime)) @ W_d[:, slot]

    # Scatter: write results back
    expert_out = CtxScatterFunc3DGeneralized.apply(expert_out, matched_idx, down_chunk * rw_chunk)
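The body of `_build_matched_idx_from_cumsum` is not shown in this description, so the following is only a pure-Python sketch of the general cumsum-packing idea for a single NSP row. The helper names, the fixed `chunk_size`, the use of INT32_MAX as the padding sentinel (the commit log mentions INT32_MAX sentinel handling), and the accumulate-on-scatter behaviour are all assumptions.

```python
from itertools import accumulate

INT32_MAX = 2**31 - 1  # assumed sentinel for unused packed slots

def build_packed_indices(mask, chunk_size):
    """Map packed position -> token index, padded with the sentinel."""
    positions = list(accumulate(int(m) for m in mask))  # inclusive cumsum
    packed = [INT32_MAX] * chunk_size
    for token_idx, (active, pos) in enumerate(zip(mask, positions)):
        if active and pos - 1 < chunk_size:
            packed[pos - 1] = token_idx   # k-th active token lands in slot k-1
    return packed

def gather(rows, packed):
    """Pick the rows named by packed; sentinel slots yield zero rows."""
    width = len(rows[0])
    return [list(rows[i]) if i != INT32_MAX else [0.0] * width for i in packed]

def scatter(out, packed, updates):
    """Add updates back at the original token positions; skip sentinel slots."""
    for i, row in zip(packed, updates):
        if i != INT32_MAX:
            out[i] = [a + b for a, b in zip(out[i], row)]
    return out

mask = [False, True, True, False, True, False]     # tokens 1, 2, 4 are active
tokens = [[float(t), 10.0 * t] for t in range(6)]  # [T, H] with H = 2
packed = build_packed_indices(mask, chunk_size=4)
x_chunk = gather(tokens, packed)                   # packed rows plus one padding row
y_chunk = [[2.0 * v for v in row] for row in x_chunk]  # stand-in for the expert FFN
out = scatter([[0.0, 0.0] for _ in tokens], packed, y_chunk)
```

The point of packing is that the expert matmul runs over `chunk_size` rows rather than all T tokens, while the sentinel keeps the tensor shape static for export.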

4. Forward Dispatch

The class attribute supports_moe_prefill_blocking = True signals that this module supports API-driven config injection. The forward method dispatches to the NSP path only when expert_blocking_num_nsp is set:

class QEffPrefillChunkedQwen3VLMoeTextSparseMoeBlock(Qwen3VLMoeTextSparseMoeBlock):
    supports_moe_prefill_blocking = True

    def forward(self, hidden_states):
        ...
        if hasattr(self, "expert_blocking_num_nsp"):
            expert_out = self._forward_expert_blocked(x, routing_weights)
            return expert_out.view(B, S, H), router_logits
        return self.orig_forward(hidden_states)  # fallback
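A rough sketch of the opt-in pattern described above, using stand-in classes rather than real torch modules. The attribute names `supports_moe_prefill_blocking`, `expert_blocking_num_nsp` and `packed_chunk_size` come from this PR; `configure_moe_blocking` and the two classes are hypothetical and only illustrate how config injection and the `hasattr` check could interact.

```python
class PlainBlock:
    """Stand-in for a module that does not opt in to blocking."""
    supports_moe_prefill_blocking = False

class MoeBlock:
    """Stand-in for an MoE block that opts in via the class attribute."""
    supports_moe_prefill_blocking = True

    def forward(self):
        # Same hasattr test as the forward method above.
        if hasattr(self, "expert_blocking_num_nsp"):
            return "nsp_blocked"
        return "fallback"

def configure_moe_blocking(modules, num_nsp, packed_chunk_size):
    """Hypothetical helper: set instance attrs on opted-in modules only."""
    if num_nsp is None:   # None leaves every module on the fallback path
        return 0
    configured = 0
    for module in modules:
        if getattr(module, "supports_moe_prefill_blocking", False):
            module.expert_blocking_num_nsp = num_nsp
            module.packed_chunk_size = packed_chunk_size
            configured += 1
    return configured

layers = [PlainBlock(), MoeBlock()]
before = layers[1].forward()                       # "fallback"
count = configure_moe_blocking(layers, num_nsp=16, packed_chunk_size=256)
after = layers[1].forward()                        # "nsp_blocked"
```

One consequence of a `hasattr` check is that the attribute must be left unset, not set to `None`, for the fallback path to be taken; the sketch handles that by returning early when `num_nsp` is `None`.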

Supported Models

| Model | Class | Experts | NSP Groups (NSP=16) |
|---|---|---|---|
| Qwen3-MoE | QEffPrefillChunkedQwen3MoeSparseMoeBlock | 128 | 16 × 8 |
| GPT-OSS | QEffPrefillOnlyChunkedGptOssMLP | 128 | 16 × 8 |
| GraniteMoE | QEffPrefillChunkedGraniteMoeMoE | 32 | 16 × 2 |
| Qwen3-VL-MoE | QEffPrefillChunkedQwen3VLMoeTextSparseMoeBlock | 128 | 16 × 8 |

How to Enable

CausalLM Models (Qwen3-MoE, GPT-OSS, GraniteMoE)

from QEfficient import QEFFAutoModelForCausalLM

model = QEFFAutoModelForCausalLM.from_pretrained("Qwen/Qwen3-30B-A3B-Instruct-2507")

# Compile prefill with NSP blocking enabled
prefill_qpc = model.compile(
    batch_size=1,
    prefill_seq_len=512,
    ctx_len=4096,
    num_cores=16,                        # number of NSPs on the chip
    num_devices=1,
    mxfp6_matmul=True,
    mxint8_kv_cache=True,
    prefill_only=True,
    enable_chunking=True,
    moe_prefill_num_nsp=16,              # ← enable NSP blocking (set to num_cores)
    moe_prefill_packed_chunk_size=256,   # ← token chunk size per iteration
)

# Compile decode (no blocking needed for decode)
decode_qpc = model.compile(
    batch_size=1,
    prefill_seq_len=1,
    ctx_len=4096,
    num_cores=16,
    num_devices=1,
    mxfp6_matmul=True,
    mxint8_kv_cache=True,
)

VLM Models (Qwen3-VL-MoE)

from QEfficient import QEFFAutoModelForImageTextToText

model = QEFFAutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-30B-A3B-Instruct",
    attn_implementation="eager",
    kv_offload=True,
)

# Compile prefill with NSP blocking
prefill_qpc = model.compile(
    batch_size=1,
    prefill_seq_len=512,
    ctx_len=4096,
    height=354,
    width=536,
    num_cores=16,
    num_devices=1,
    mxfp6_matmul=True,
    mxint8_kv_cache=True,
    prefill_only=True,
    enable_chunking=True,
    skip_vision=True,                    # compile lang model only
    moe_prefill_num_nsp=16,              # ← enable NSP blocking
    moe_prefill_packed_chunk_size=256,
)

# Compile decode
decode_qpc = model.compile(
    batch_size=1,
    prefill_seq_len=1,
    ctx_len=4096,
    height=354,
    width=536,
    num_cores=16,
    num_devices=1,
    mxfp6_matmul=True,
    mxint8_kv_cache=True,
    prefill_only=False,
    skip_vision=True,
)

Disabling NSP Blocking (Baseline)

To compile without NSP blocking for comparison:

prefill_qpc = model.compile(
    ...,
    prefill_only=True,
    enable_chunking=True,
    moe_prefill_num_nsp=None,   # ← None disables NSP blocking
)

Key Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| moe_prefill_num_nsp | Optional[int] | None | Number of NSP groups to split experts across. Set to num_cores (typically 16) to enable blocking. None disables blocking. |
| moe_prefill_packed_chunk_size | int | 256 | Number of token rows per packed chunk in the scatter-gather loop. Controls the tradeoff between loop iterations and chunk size. |
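As a worked example of how the chunk size relates to the number of loop iterations, the sketch below mirrors the clamp-and-divide arithmetic that appears in the review excerpt further down (`max(1, min(packed_chunk_size, seq_len))` followed by a floor division). The function name `resolve_packed_chunks` is hypothetical, and the sketch covers only that one branch.

```python
def resolve_packed_chunks(prefill_seq_len, packed_chunk_size=256):
    """Clamp the chunk size to [1, seq_len] and count whole chunks."""
    chunk = max(1, min(packed_chunk_size, prefill_seq_len))
    num_chunks = prefill_seq_len // chunk   # floor division: whole chunks only
    return chunk, num_chunks

print(resolve_packed_chunks(512, 256))  # (256, 2)
print(resolve_packed_chunks(128, 256))  # (128, 1)
```

With the settings used in the compile examples (prefill_seq_len=512, chunk size 256), this gives two packed chunks per slot. If the sequence length is not a multiple of the chunk size, the floor division leaves a remainder; how that case is handled is not stated in this description.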

Notes

  • Token correctness: NSP blocking produces numerically identical outputs to the standard path for shallow models (2 layers). For full-depth models, minor FP16 rounding differences may accumulate across layers due to non-associativity of floating point addition — this is expected and does not affect model quality in practice.
  • Cache isolation: Different values of moe_prefill_num_nsp produce different ONNX/QPC hashes, so switching between blocking configs does not cause cache conflicts.
  • Decode path: NSP blocking applies only to the prefill phase. The decode path uses a standard index_select-based dispatch which is already efficient for single-token generation.
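The rounding note above can be illustrated with a small standalone example. This sketch emulates FP16 accumulation with the standard library's half-precision `struct` format; it is not taken from the PR and only shows that the same contributions summed in a different order can round differently.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp16_sum(values):
    """Accumulate left to right, rounding to FP16 after every addition."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total

contributions = [4096.0, 1.0, 1.0, 1.0, 1.0]
large_first = fp16_sum(contributions)             # 4096.0: each +1 is rounded away
small_first = fp16_sum(reversed(contributions))   # 4100.0: the 1s add up to 4 first
```

Near 4096 the spacing between adjacent FP16 values is 4, so adding 1 at a time never moves the running total, while adding the accumulated 4 does. A blocked dispatch sums expert contributions in a different order than the sequential loop, which is the kind of reordering that can produce such differences.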

vbaddi and others added 16 commits April 30, 2026 07:17
Add expert-blocked NSP-parallel prefill forward to QEffPrefillChunkedQwen3MoeSparseMoeBlock
and QEffPrefillOnlyChunkedGptOssMLP. Controlled via EXPERT_BLOCKING_NUM_NSP env var.
Fix CtxScatterFunc3D/CtxGatherFunc3D eager forward for INT32_MAX sentinel handling.
Add disagg-mode tests for both models with tiny configs.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
…prefill

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
  - Root cause: CtxGather3D ONNX symbolic expanded ctx_indices to Shape(data)[:2] ([batch, seq_len]), which is wrong for packed dispatch.
  - In expert-blocked MoE prefill, ctx_indices is intentionally [batch, packed_chunk_size] (e.g. [16, 256]) while data stays [batch, seq_len, ...] (e.g. [16, 512, ...]).
  - This caused invalid Expand attempts ([16,256] -> [16,512]) and QAIC compile/runtime failure on /model/layers.0/mlp/CtxGather3D/....

  Fix:

  - Update CtxGather3D expand target to:
      - batch dim from data
      - index-seq dim from ctx_indices
  - New expand shape is [batch_size(data), idx_seq_len(ctx_indices)], preserving packed chunk length.

Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com>
…port

Add missing CustomOpTransform mappings for CtxScatterFunc3DInt and generalized 3D scatter/gather ops,
plus a prefill-only subfunction export regression test to verify the ONNX graph includes the required CtxScatter3DInt/CtxScatter3D/CtxGather3D ops.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
…on export

Replace MoE prefill sum reductions with equivalent einsum forms and rewrite int32 clamp bounds using where to avoid QAIC subfunction compile
failures for GPT-OSS and Qwen3-MoE.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Trace chunked prefill exports with the requested prefill_seq_len so packed MoE dispatch unrolls all packed chunks, restore torch.full_like
index init, and add ONNX coverage for the second packed chunk slice.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
- gpt_oss/modeling_gpt_oss.py: add num_expert_chunks dynamic-loop mechanism
  to _cumsum_scatter_gather_update_gptoss_expert_blocked and
  _forward_expert_blocked; read _num_expert_chunks from module in forward

- qwen3_moe/modeling_qwen3_moe.py: same num_expert_chunks dynamic-loop
  mechanism; replace fallback inline loop with orig_forward call

- granitemoe/modeling_granitemoe.py: add QEffPrefillChunkedGraniteMoeAttention,
  QEffPrefillChunkedGraniteMoeMoE with full ONNX-friendly cumsum-scatter-gather
  dispatch; update get_submodules_for_export to return set() when chunked
  prefill MoE is active

- pytorch_transforms.py: register GraniteMoe forward/reverse mappings in
  PrefillOnlyChunkedTransform and RevertPrefillKeepAttentionTransform

- modeling_auto.py: compute num_expert_chunks from prefill_seq_len and
  EXPERT_BLOCKING_PACKED_CHUNK_SIZE; setattr _num_expert_chunks on every
  MoE layer when enable_chunking=True
torch.clamp on int32 tensors exports to ONNX Clip op which QAIC compiler
does not support (Unhandled ElemKind in Clip operation). Replace with
torch.where in all three models: gpt_oss, qwen3_moe, granitemoe.
…raniteMoE

- modeling_qwen3_vl_moe.py: Add QEffPrefillChunkedQwen3VLMoeTextSparseMoeBlock with
  NSP-blocked cumsum-scatter-gather dispatch; supports_moe_prefill_blocking=True;
  use expert_blocking_num_nsp/packed_chunk_size/num_packed_chunks instance attrs

- modeling_qwen3_moe.py, modeling_gpt_oss.py, modeling_granitemoe.py: Replace
  EXPERT_BLOCKING_NUM_NSP/EXPERT_BLOCKING_PACKED_CHUNK_SIZE env vars with
  API-driven instance attributes set via compile() params

- pytorch_transforms.py: Register QEffPrefillChunkedQwen3VLMoeTextSparseMoeBlock
  in RevertPrefillKeepAttentionTransform

- modeling_auto.py: QEFFAutoModelForCausalLM.get_seq_len_and_handle_specialized_prefill_model
  iterates modules with supports_moe_prefill_blocking=True and sets instance attrs;
  QEffCausalLMForTextImageToTextModel.export() uses same API-driven pattern for VLM;
  QEFFAutoModelForImageTextToText.compile() accepts moe_prefill_packed_chunk_size param

- modeling_qeff.py: Uncomment self.onnx_path fallback in _compile so pre-exported
  ONNX is reused without hitting get_onnx_path; pass moe_prefill_packed_chunk_size
  through get_onnx_path and _compile

- constants.py: Add MOE_PREFILL_PACKED_CHUNK_SIZE = 256
…ms to QEFFAutoModelForCausalLM.export()

compiler_options is only available in compile(), not export(). Add num_cores
and moe_prefill_packed_chunk_size as explicit named params to export() so they
are directly accessible, matching the pattern in vbaddi/feat/prefill_moe.
- modeling_qeff.py: save moe_prefill_num_nsp from compiler_options in
  _compile and pass through get_onnx_path to export()

- modeling_utils.py: add granitemoe to SPECIALIZED_DISAGG_SERVING_MODEL_ARCH

- modeling_granitemoe.py: fix supports_moe_prefill_blocking moved out of
  docstring into class body; fix reshape order to match Qwen3-MoE/GPT-OSS

- modeling_auto.py: add moe_prefill_num_nsp param to compile()/export()/
  get_seq_len; pass moe_prefill_num_nsp to lang_model.export() in VLM path;
  fix sliding_window AttributeError for models without sliding_window attr
@divytrip3005
Author

@vbaddi can you review this PR?

Contributor

@vbaddi vbaddi left a comment

There is ongoing effort to put all the changes in 1.22_tmp branch which should soon come up in mainline QEff. Let's rebase once that is done to streamline this work item. thanks @divytrip3005

cc: @quic-rishinr @anujgupt-github

mla_absorption: Optional[Dict[str, bool]] = None,
qaic_config: Optional[dict] = None,
moe_prefill_packed_chunk_size: Optional[int] = None,
moe_prefill_num_nsp: Optional[int] = None,
Contributor

nit: no need of this imo, this would be same as num_cores already.

return g.onnxscript_op(CtxGather3D, data, ctx_indices).setTypeAs(data)


class CtxGatherFunc3DGeneralized(torch.autograd.Function):
Contributor

nit: Let's rebase to 1.22_tmp, these changes should catch up in there.

return q_embed.to(q.dtype), k_embed.to(k.dtype)


class QEffPrefillChunkedGraniteMoeAttention(GraniteMoeAttention):
Contributor

nit: what's the purpose of this? why do we need different chunked attention? does this solve anything unique?

    packed_chunk_size = seq_len // num_expert_chunks
else:
    packed_chunk_size = max(1, min(packed_chunk_size, seq_len))
    num_expert_chunks = seq_len // packed_chunk_size
Contributor

nit: let's rebase to 1.22_tmp branch, this logic should be aligned w/gptoss and qwen3-moe.

# -----------------------------------------------------------------------------
#
# Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.
# SPDX-License-Identifier: BSD-3-Clause
Contributor

nit: what changed here? why so much diff? was this a formatting issue earlier?

def test_qwen3moe_prefill_chunked_export(tmp_path):
    config = AutoConfig.for_model("qwen3_moe", **QWEN3_MOE_CFG)
    model = AutoModelForCausalLM.from_config(config, **MODEL_KWARGS)
    qeff = QEFFAutoModelForCausalLM(model, continuous_batching=False)
Contributor

nit: how are we verifying the chunked export here? should be either via customops (CtxGeneralized*) in onnx or checking the module presence in pytorch no?
