Added MDP generation to QEff Compile #930
e20d868 to
f393d6e
|
Tested and working on the following model classes
|
|
@quic-rishinr @mamtsing @ochougul A gentle reminder for review |
Add a warning that mdp_ts_num_partitions is ignored whenever seq_len==1.
Also, add a warning that ignores when ts_num_devices>1 and seq_len> and mdp_ts_num_partitions>1
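A minimal sketch of the first requested check, assuming a hypothetical helper named `resolve_mdp_partitions` and assuming that "ignored" means falling back to a single partition; neither is taken from the PR. The second condition is left out because its `seq_len` comparison is cut off in the comment above.

```python
import warnings


def resolve_mdp_partitions(seq_len: int, mdp_ts_num_partitions: int) -> int:
    """Return the partition count to use (hypothetical helper, not the PR's code)."""
    if seq_len == 1 and mdp_ts_num_partitions > 1:
        # Assumption: ignoring the setting means using a single partition.
        warnings.warn(
            f"mdp_ts_num_partitions={mdp_ts_num_partitions} is ignored because seq_len == 1",
            UserWarning,
            stacklevel=2,
        )
        return 1
    return mdp_ts_num_partitions
```

The warning is emitted only when the user actually asked for more than one partition, so the default path stays silent.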
|
To be tested on:
|
|
@quic-mohmeh Please let us know once your testing is complete, and please incorporate the review comments. @quic-mamta Can you also please review it? |
|
Works on Qwen3-4B(Dense) - without subfunctions |
|
Works on Qwen3-30B-A3B (MOE) model as well - without subfunctions |
|
Update example scripts for Qwen3(with and without VL) and GPTOSS |
36cd668 to
d34dbda
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
|
@quic-mohmeh please rebase it on top of release/v1.22.0_tmp branch |
|
Tested with subfunctions on the following model:
|
|
Also verified by Qeff team on following models with subfunctions:
|
@@ -0,0 +1,263 @@
# -----------------------------------------------------------------------------
Please add unit tests and a test for MDP generation. Please make sure the tests are small and execute in under 5 to 10 seconds.
@quic-rishinr As we discussed, there is no valid approach to compare the compiler MDP dump with the QEff MDP dump; all we can ensure is that every node present in the compiler MDP dump is also present in the QEff MDP dump, in the correct order. For verification and testing, we might need to compile QPCs without PP and with PP and compare their outputs. Let me know which direction you want to take for testing.
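The ordering check described above can be sketched as a subsequence test. This is only an illustration: `is_ordered_subsequence` is a hypothetical name, the node names are assumed to be plain strings, and how they are read out of either MDP dump is not covered.

```python
from typing import Iterable, Sequence


def is_ordered_subsequence(expected: Sequence[str], actual: Iterable[str]) -> bool:
    """True if every name in `expected` occurs in `actual` in the same relative order."""
    remaining = iter(actual)
    # `name in remaining` consumes the iterator up to the first match,
    # so later names can only match later positions.
    return all(name in remaining for name in expected)
```

Here `expected` would hold the node names from the compiler dump and `actual` those from the QEff dump; extra nodes in `actual` are allowed, but missing or reordered nodes make the check fail.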
@quic-mohmeh Have you verified the PP+TS combination with this, or does only PP work with it? When you say you have verified the above models, have you verified the output sanity too?
Since vLLM currently doesn't support PP+TS, I am unable to verify PP+TS. As for output validation, I have only skimmed the output; the main purpose of these tests is to check whether the model compiles with the generated MDP. |
quic-mohmeh
left a comment
Addressed the comments @quic-rishinr
Addressed Rishin Comments Signed-off-by: Mohit Mehta <mohmeh@qrc706r8-292-05.qualcomm.com>
|
@quic-rishinr @quic-hemagnih |
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
|
VLMs currently do not work with this MDP partition, either with or without subfunctions (E-P-D). I have to look into the compiler codebase. |
|
@quic-mohmeh Can you rebase your PR with latest |
|
Working on VLMs - tested on Qwen2.5VL 3B with and without subfunctions(PP4) |
Resolved conflict in QEfficient/base/modeling_qeff.py:
- Kept mdp_num_partitions parameter (MDP disagg prefill feature)
- Updated num_speculative_tokens to Union[int, List[int]] from release branch

Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
Hi @quic-mohmeh When you say tested on VLMs, have you enabled PP on the prefill part of the language model? Or have you checked with PP on the vision part as well? |
Yes, PP with Prefill, TS on Encode and Decode |
|
Gemma4-26B-A4B Prefill compiled with subfunctions |
This PR adds the MDP generation required for Prefill in disaggregated serving. It supports Pipeline Prefill + Tensor Slicing, and also supports passing custom cores to the MDP generator.
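As a rough illustration of what a pipeline partitioning step can look like, the sketch below splits layer indices into contiguous groups of near-equal size. It assumes that partitions are contiguous groups of decoder layers, which the description above does not state, and `split_layers` is a hypothetical helper, not a QEff or compiler API and not the MDP file format.

```python
from typing import List


def split_layers(num_layers: int, num_partitions: int) -> List[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal groups."""
    if not 1 <= num_partitions <= num_layers:
        raise ValueError("num_partitions must be between 1 and num_layers")
    base, extra = divmod(num_layers, num_partitions)
    parts: List[range] = []
    start = 0
    for index in range(num_partitions):
        # The first `extra` partitions take one additional layer each.
        size = base + (1 if index < extra else 0)
        parts.append(range(start, start + size))
        start += size
    return parts
```

Putting the remainder into the earliest partitions keeps group sizes within one layer of each other; a real generator would also have to place non-layer nodes and account for per-device core counts.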