Added MDP generation to QEff Compile #930

Open
quic-mohmeh wants to merge 10 commits into quic:release/v1.22.0_tmp from quic-mohmeh:mdp

Conversation

@quic-mohmeh

This PR adds the MDP generation required for disaggregated serving of Prefill. It supports Pipeline Prefill + Tensor Slicing, and it also supports passing custom cores to the MDP generator.
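
To make the pipeline part concrete, here is a minimal sketch of how a generator of this kind might assign contiguous layers to partitions and attach a core count to each one. The function name `split_layers_into_partitions`, the dictionary keys, the even-split policy, and the default of 16 cores are all assumptions for illustration. None of them is taken from `QEfficient/compile/mdp_generator.py`, and the real MDP JSON schema is defined by the compiler.

```python
from typing import Dict, List, Optional


def split_layers_into_partitions(
    num_layers: int,
    num_partitions: int,
    cores_per_partition: Optional[List[int]] = None,
    default_cores: int = 16,  # assumption: used only when no custom cores are given
) -> List[Dict[str, object]]:
    """Assign contiguous layer ranges to pipeline partitions (illustrative only)."""
    if num_partitions < 1 or num_layers < num_partitions:
        raise ValueError("need 1 <= num_partitions <= num_layers")
    if cores_per_partition is None:
        cores_per_partition = [default_cores] * num_partitions
    if len(cores_per_partition) != num_partitions:
        raise ValueError("one core count is required per partition")

    base, extra = divmod(num_layers, num_partitions)
    partitions: List[Dict[str, object]] = []
    start = 0
    for idx in range(num_partitions):
        # The first `extra` partitions take one additional layer each.
        size = base + (1 if idx < extra else 0)
        partitions.append(
            {
                "partition": idx,  # hypothetical key
                "layers": list(range(start, start + size)),  # hypothetical key
                "num_cores": cores_per_partition[idx],  # hypothetical key
            }
        )
        start += size
    return partitions
```

For example, `split_layers_into_partitions(10, 4, [16, 16, 8, 8])` gives the four partitions 3, 3, 2, and 2 layers respectively, with the custom core counts attached in order.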

@quic-mohmeh quic-mohmeh force-pushed the mdp branch 3 times, most recently from e20d868 to f393d6e Compare April 22, 2026 08:42
@quic-mohmeh
Author

Tested and working on the following model classes

  • CodeLlama-7b-Instruct
  • falcon-7b-instruct
  • gemma-2-9b-it
  • gpt-oss-20b
  • granite-3.1-8b-instruct
  • Llama-3.2-1B-Instruct
  • Llama-3.2-3B
  • Phi-3-mini-4k-instruct

@quic-rishinr
Contributor

@mamtsing @ochougul please review the PR

@quic-mohmeh
Author

@quic-rishinr @mamtsing @ochougul A gentle reminder for review

Contributor

Add a warning that mdp_ts_num_partitions is ignored whenever seq_len==1
Also, add a warning that ignores when ts_num_devices>1 and seq_len> and mdp_ts_num_partitions>1
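
A minimal sketch of the first warning, assuming that "ignored" means falling back to a single partition and that the warning is only worth raising when more than one partition was requested. The helper name `effective_mdp_ts_num_partitions` is hypothetical and does not come from the PR. The second condition is left out because the comment does not state the threshold on seq_len.

```python
import warnings


def effective_mdp_ts_num_partitions(seq_len: int, mdp_ts_num_partitions: int) -> int:
    """Hypothetical helper: return the partition count that will actually be used."""
    if seq_len == 1 and mdp_ts_num_partitions > 1:
        warnings.warn(
            f"mdp_ts_num_partitions={mdp_ts_num_partitions} is ignored because seq_len == 1.",
            UserWarning,
            stacklevel=2,
        )
        # Assumption: "ignored" means falling back to a single partition.
        return 1
    return mdp_ts_num_partitions
```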

@quic-mohmeh
Author

To be tested on:

  • Qwen3 Dense

  • Qwen3 MOE

  • Test with subfunctions

@quic-hemagnih
Contributor

@quic-mohmeh Please let us know once your testing is complete, and please incorporate the review comments. @quic-mamta, can you please review it as well?

@quic-mohmeh
Author

Works on Qwen3-4B(Dense) - without subfunctions

@quic-mohmeh
Author

Works on Qwen3-30B-A3B (MOE) model as well - without subfunctions

Comment thread QEfficient/compile/mdp_generator.py
@quic-mohmeh
Author

Update example scripts for Qwen3(with and without VL) and GPTOSS

@quic-rishinr quic-rishinr added the 1.22 Release 1.22 candidate label May 22, 2026
@quic-mohmeh quic-mohmeh force-pushed the mdp branch 2 times, most recently from 36cd668 to d34dbda Compare May 25, 2026 08:49
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
@quic-rishinr quic-rishinr changed the base branch from main to release/v1.22.0_tmp May 25, 2026 16:37
@quic-rishinr
Contributor

@quic-mohmeh please rebase it on top of release/v1.22.0_tmp branch

@quic-mohmeh
Author

quic-mohmeh commented May 27, 2026

Tested with subfunctions on the following models:

  • CodeLlama-7b-Instruct-hf
  • falcon-7b-instruct
  • gemma-2-9b-it
  • granite-3.1-8b-instruct
  • Llama-3.1-70B-Instruct
  • Llama-3.2-1B-Instruct
  • Llama-3.2-3B
  • Phi-3-mini-4k-instruct
  • Qwen3-30B-A3B
  • GPTOSS-20B

@quic-mohmeh
Author

quic-mohmeh commented May 27, 2026

Also verified by the QEff team on the following models with subfunctions:

  • Qwen/Qwen3-235B-A22B (Karthikeya)
  • moonshotai/Kimi-K2-Instruct (Mamta)

Comment thread QEfficient/base/modeling_qeff.py Outdated
Comment thread QEfficient/compile/mdp_generator.py Outdated
Comment thread QEfficient/base/modeling_qeff.py Outdated
Comment thread QEfficient/base/modeling_qeff.py Outdated
Comment thread QEfficient/base/modeling_qeff.py Outdated
@@ -0,0 +1,263 @@
# -----------------------------------------------------------------------------
Contributor

Please add unit tests and tests for MDP generation. Please make sure the tests are small and execute in under 5 to 10 seconds.

Author

@quic-rishinr As we discussed, there is no valid way to compare the compiler MDP dump with the QEff MDP dump directly. All we can ensure is that every node present in the compiler MDP dump is also present in the QEff MDP dump, in the correct order. For verification and testing, we might need to compile QPCs both without PP and with PP and compare their outputs. Let me know which direction you want to take for testing.
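
The ordering check described above amounts to an in-order subsequence test, which is cheap enough to fit the test time budget requested earlier. The sketch below assumes both dumps have already been reduced to flat lists of node names; how to extract those lists from the actual dump files is not covered here, and the function name `nodes_not_in_order` is hypothetical.

```python
from typing import List, Sequence


def nodes_not_in_order(compiler_nodes: Sequence[str], qeff_nodes: Sequence[str]) -> List[str]:
    """Return the compiler nodes that are missing from, or out of order in, qeff_nodes."""
    qeff_list = list(qeff_nodes)
    unmatched: List[str] = []
    pos = 0  # search only to the right of the last matched node
    for node in compiler_nodes:
        try:
            pos = qeff_list.index(node, pos) + 1
        except ValueError:
            # Either absent entirely or present only before the last match.
            unmatched.append(node)
    return unmatched
```

An empty result means every compiler node appears in the QEff list in the same relative order. A unit test can assert on that without invoking the compiler.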

Comment thread QEfficient/compile/mdp_generator.py
Contributor

@quic-akuruvil quic-akuruvil left a comment

@quic-mohmeh Have you verified the PP+TS combination with this, or does only PP work with this? When you say you have verified the above models, have you verified the output sanity too?

@quic-mohmeh
Author

@quic-mohmeh Have you verified the PP+TS combination with this, or does only PP work with this? When you say you have verified the above models, have you verified the output sanity too?

vLLM currently doesn't support PP+TS, so I am unable to verify PP+TS. As for output validation, I have only skimmed the output. The main purpose of these tests is to check whether the model compiles with the generated MDP.
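
If a stricter output check than skimming is wanted later, one option is to compare the generated token IDs from a run without PP against a run with PP. The sketch below assumes greedy decoding, so that both runs are expected to be deterministic, and assumes the token IDs are already available as plain integer sequences. The function name `first_divergence` is hypothetical.

```python
from typing import Optional, Sequence


def first_divergence(reference: Sequence[int], candidate: Sequence[int]) -> Optional[int]:
    """Return the first index where the two token sequences differ, or None if identical."""
    for i, (ref_tok, cand_tok) in enumerate(zip(reference, candidate)):
        if ref_tok != cand_tok:
            return i
    if len(reference) != len(candidate):
        # One sequence is a strict prefix of the other.
        return min(len(reference), len(candidate))
    return None
```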

Author

@quic-mohmeh quic-mohmeh left a comment

Addressed the comments @quic-rishinr

Comment thread QEfficient/base/modeling_qeff.py Outdated
Comment thread QEfficient/base/modeling_qeff.py Outdated
Comment thread QEfficient/base/modeling_qeff.py Outdated
Comment thread QEfficient/base/modeling_qeff.py Outdated
Comment thread QEfficient/compile/mdp_generator.py
Comment thread QEfficient/compile/mdp_generator.py Outdated
Addressed Rishin Comments

Signed-off-by: Mohit Mehta <mohmeh@qrc706r8-292-05.qualcomm.com>
@quic-mohmeh
Author

@quic-rishinr @quic-hemagnih
Needs to be tested with VLMs as well, please do not merge yet.

Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
@quic-mohmeh
Author

quic-mohmeh commented Jun 2, 2026

VLMs currently do not work with this MDP partition, either with or without subfunctions (E-P-D). I have to look into the compiler codebase.

@quic-akuruvil
Contributor

quic-akuruvil commented Jun 4, 2026

@quic-mohmeh Can you rebase your PR with latest

@quic-mohmeh
Author

quic-mohmeh commented Jun 4, 2026

Working on VLMs - tested on Qwen2.5VL 3B with and without subfunctions(PP4)

Mohit Mehta added 2 commits June 4, 2026 23:41
Resolved conflict in QEfficient/base/modeling_qeff.py:
- Kept mdp_num_partitions parameter (MDP disagg prefill feature)
- Updated num_speculative_tokens to Union[int, List[int]] from release branch

Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
@quic-akuruvil
Contributor

Working on VLMs - tested on Qwen2.5VL 3B with and without subfunctions(PP4)

Hi @quic-mohmeh When you say tested on VLMS, have you enabled PP on prefill part of language model? Or have you checked with PP on vision part as well?

@quic-mohmeh
Author

Working on VLMs - tested on Qwen2.5VL 3B with and without subfunctions(PP4)

Hi @quic-mohmeh When you say tested on VLMS, have you enabled PP on prefill part of language model? Or have you checked with PP on vision part as well?

Yes, PP with Prefill, TS on Encode and Decode

@quic-vargupt
Contributor

Gemma4-26B-A4B Prefill compiled with subfunctions

Labels

1.22 Release 1.22 candidate

7 participants