Release/v1.22.0 tmp#1029
Open
quic-rishinr wants to merge 21 commits into
…changes to v5.5.4 and restore PyTorch/ORT parity (#876)

- Rebased downstream wrapper stack to transformers v5.3.0 and aligned coupled deps (huggingface-hub, peft, diffusers) in project config.
- Updated model wrapper compatibility paths across causal/VLM/audio/export flows to match upstream v5 APIs while preserving downstream public behavior.
- Hardened cache compatibility layer and runtime glue for mixed legacy/new cache semantics used by downstream generation/export paths.
- Fixed attention/mask/rotary call-path mismatches introduced by upstream API changes (including model-specific signature updates).
- Updated AWQ/quantizer and export compatibility paths to remain ONNX-safe.
- Validation evidence:
  ```
  python -m pytest -q tests/test_model_quickcheck.py -n 16
  Result: 26 passed.
  ```
- [x] QAic Verification Pending
- [x] E2E CI read out

cc: @quic-rishinr @quic-hemagnih @asmigosw @anujgupt-github

---------

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com>
Co-authored-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Co-authored-by: Hem Agnihotri <hemagnih@qti.qualcomm.com>
…o_empty() (#952)

fix: improve weight offloading to handle plain tensor attrs and use to_empty()

Replace manual storage resizing with `to_empty(device="meta")` for parameters/buffers and explicitly handle plain tensor attributes (e.g. stacked expert weights in MoE models) that are not registered as parameters or buffers. This ensures all tensors are properly moved to the meta device, reducing memory usage after ONNX export.

Add unit tests for plain tensor attribute clearing.

---------

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
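The offloading idea above can be sketched roughly as follows. This is not the code from #952: `ToyMoE`, `stacked_experts`, and `offload_to_meta` are hypothetical names, and the sketch assumes plain tensor attributes live directly in each module's `__dict__`.

```python
import torch
from torch import nn


class ToyMoE(nn.Module):
    """Hypothetical module with a parameter, a buffer, and a plain tensor attribute."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)
        self.register_buffer("scale", torch.ones(4))
        # Plain tensor attribute: neither a Parameter nor a registered buffer,
        # so nn.Module.to_empty() does not touch it.
        self.stacked_experts = torch.randn(2, 4, 4)


def offload_to_meta(model: nn.Module) -> nn.Module:
    # Parameters and buffers: re-create them on the meta device without storage.
    model.to_empty(device="meta")
    # Plain tensor attributes: replace each one with a shape-only meta tensor.
    for module in model.modules():
        for name, value in list(vars(module).items()):
            if isinstance(value, torch.Tensor):
                setattr(module, name, torch.empty_like(value, device="meta"))
    return model


model = offload_to_meta(ToyMoE())
```

After the call, shapes and dtypes are still available for inspection, but no tensor in the module holds real storage.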
After inspecting the attributes of QAicProgramProperties:

- SDK 1.22.0.120 -> ['dataPathTimeoutMs', 'devMapping', 'selectMask', 'submitNumRetries']
- SDK 1.22.0.119 -> ['SubmitNumRetries', 'SubmitRetryTimeoutMs', 'dataPathTimeoutMs', 'devMapping', 'selectMask', 'submitNumRetries']

SubmitRetryTimeoutMs is missing in 1.22.0.120 but present in 1.22.0.119.

Root cause: the attribute was removed by the LRT changes in SDK 1.22.0.120.

Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>
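One way to tolerate an attribute that exists in one SDK build but not another is to probe for it before setting it. The sketch below is illustrative only: `FakeProgramProperties` is a hypothetical stand-in that mirrors the 1.22.0.120 attribute list above, `set_if_supported` is an invented helper, and the commit's actual fix may differ.

```python
class FakeProgramProperties:
    """Hypothetical stand-in mirroring the SDK 1.22.0.120 attribute list."""

    def __init__(self):
        self.dataPathTimeoutMs = 0
        self.devMapping = None
        self.selectMask = 0
        self.submitNumRetries = 0


def set_if_supported(props, name, value):
    """Set `name` on `props` only if this SDK build exposes it."""
    if hasattr(props, name):
        setattr(props, name, value)
        return True
    return False


props = FakeProgramProperties()
# Removed in 1.22.0.120, so this is skipped instead of silently creating a new attribute.
timeout_applied = set_if_supported(props, "SubmitRetryTimeoutMs", 5000)
# Present in both builds, so this is applied.
retries_applied = set_if_supported(props, "submitNumRetries", 3)
```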
…1003) This PR changes the precision of the CCL input from int8 to int64 to align with the compiler for release 1.22. It addresses the JIRA tickets raised when working with CCL enabled in new SDK versions. Signed-off-by: Vahid Janfaza <vjanfaza@qti.qualcomm.com> Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
…1013) Without this fix, the QPCs are loaded on QID 0 by default. Signed-off-by: sanising <sanising@qti.qualcomm.com>
## Summary

Adds GLM4-MOE support for disaggregated serving with chunked prefill.

## Supported

- GLM4-MOE decode path
- Chunked prefill MoE path with packed expert dispatch
- KV-blocked attention path
- Disaggregated prefill/decode serving example
- ONNX subfunction export for decode and prefill

## Tested

- Added GLM4-MOE prefill/blocked export tests
- Verified packed MoE custom-op counts for `prefill_seq_len=512`, packed chunk size `256`
- Ran GLM4-MOE disaggregated example end-to-end w/tiny config.

```bash
pytest -q tests/transformers/models/test_moe_prefill_blocked.py
python examples/disagg_serving/glm4_moe_disagg_mode_with_chunking.py
```

---------

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
…s list (#984)

## Description

Speculative decoding on QAIC requires statically compiled shapes. Previously, a TLM could only be compiled for a single proposal length K, forcing every decode step to run the full `seq_len=K+1` kernel regardless of how many tokens the draft model actually proposed. This PR allows compiling multiple decode specializations in one QPC so the runtime can dispatch to the smallest kernel that covers the actual proposal count, reducing unnecessary compute on short proposals without changing correctness.

### Summary

- `QEFFAutoModelForCausalLM.compile()` now accepts `num_speculative_tokens` as a `List[int]`. Each value K compiles one TLM decode specialization (`seq_len=K+1`, `num_logits_to_keep=K+1`), enabling per-step dispatch to the cheapest kernel that covers the actual proposal count.
- Plain `int` input still works (backward compatible; treated as `[K]`).
- Removes `enable_fallback_decode_spec`; equivalent behavior is `[0, K]`.
- Fixes flat-format `specializations.json` write in `_compile` (named format caused `RuntimeError: Failed to create ExecObj` on 4-device MDP QPCs).
- Improves `find_candidate_pred_tokens` to return the *longest* n-gram continuation across all candidates instead of early-returning on the first match.

### Results

Measured on Llama-3.1-8B-Instruct (mxfp6/mxint8, 4 SOCs), MT-bench, 80 prompts. `num_speculative_tokens=[0, 4]` (K=4) vs the prior fixed-K baseline.

**Request throughput vs nospec (req/s):**

<img width="1052" height="714" alt="image" src="https://github.com/user-attachments/assets/79a8ed1a-76f5-45d4-9655-9ab57f9f25c7" />

| max new seqs | nospec | ngram K=4 (fixed) | ngram varK | suffix K=4 (fixed) | suffix varK |
|---|---|---|---|---|---|
| 1 | 0.137 | 0.108 (−21%) | **0.282 (+106%)** | 0.131 (−4%) | **0.297 (+116%)** |
| 2 | 0.257 | 0.197 (−23%) | **0.467 (+82%)** | 0.237 (−8%) | **0.495 (+93%)** |
| 4 | 0.448 | 0.340 (−24%) | **0.640 (+43%)** | 0.405 (−9%) | **0.758 (+69%)** |
| 8 | 0.674 | 0.580 (−14%) | **1.224 (+82%)** | 0.692 (+3%) | **1.142 (+69%)** |

**varK vs fixed-K improvement:**

| max new seqs | ngram Δ | suffix Δ |
|---|---|---|
| 1 | +162% | +126% |
| 2 | +137% | +109% |
| 4 | +88% | +87% |
| 8 | +111% | +65% |

Key takeaways:

- Fixed-K SpD is 14–24% *slower* than nospec on QAIC (each step runs the full K+1 kernel even when no tokens are proposed).
- Variable-K reverses this regression: **+43–116% throughput vs nospec** across both methods.
- TPOT at mns=1 drops from ~34 ms/token (nospec) to ~16 ms/token with varK (~53% reduction).

### Test plan

- [x] `pytest tests/unit_test/models/test_modeling_auto_cpu.py::TestTLMMultiSpecSpecializations -v`
- [x] `pytest tests/unit_test/transforms/test_speculative_decoding.py::TestTLMForwardExecution::test_tlm_multi_spec_logit_consistency -v`
- [x] `pytest tests/transformers/spd/test_pld_inference.py::test_multi_spec_structure tests/transformers/spd/test_pld_inference.py::test_select_k -v`
- [x] Hardware: `pytest tests/transformers/spd/test_spd_inference.py -m on_qaic -k "pld"` (on QAIC device)

### Notes

- TLM + CCL (`comp_ctx_lengths_decode`) combination raises `NotImplementedError` (not yet supported).
- `speculative_config` in model config overrides user-supplied Ks; a `logger.warning` is emitted when values are discarded.

---------

Signed-off-by: eplatero <eplatero@qti.qualcomm.com>
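The per-step dispatch described in #984 ("smallest kernel that covers the actual proposal count") could look roughly like the sketch below. The helper name `pick_specialization` and its return shape are assumptions for illustration; the repository's own selection logic may be structured differently.

```python
from bisect import bisect_left


def pick_specialization(compiled_ks, num_proposed):
    """Return (K, seq_len) for the smallest compiled K with K >= num_proposed.

    Hypothetical helper: `compiled_ks` stands for the list passed as
    `num_speculative_tokens`; each K is assumed to map to seq_len = K + 1.
    """
    ks = sorted(set(compiled_ks))
    i = bisect_left(ks, num_proposed)
    if i == len(ks):
        raise ValueError(
            f"{num_proposed} proposed tokens exceed the largest compiled K={ks[-1]}"
        )
    k = ks[i]
    return k, k + 1


# With [0, 4], a step with no proposals runs the seq_len=1 kernel,
# while any non-empty proposal falls through to the K=4 kernel.
no_proposal = pick_specialization([0, 4], 0)
short_proposal = pick_specialization([0, 4], 2)
exact_fit = pick_specialization([0, 2, 4], 2)
```

Adding intermediate values such as `[0, 2, 4]` trades extra compiled specializations for a tighter fit on short proposals.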
Added fix for fp16 export in qwen3 and qwen3vl modeling files. --------- Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com> Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com> Co-authored-by: Rishin Raj <rishinr@qti.qualcomm.com>
This PR adds support for the following Qwen3-VL reranker models on AI100:

- `Qwen/Qwen3-VL-Reranker-2B`
- `Qwen/Qwen3-VL-Reranker-8B`

The support is implemented using the existing QEff image-text-to-text flow (dual QPC), with model parity validation focused on **PyTorch(original) vs AI100**.

### Results

| Model | PyTorch score | AI100 score | MAD max | Status |
|---|---:|---:|---:|---|
| Qwen/Qwen3-VL-Reranker-2B | 0.3213230073 | 0.3259495199 | 4.626513e-03 | Pass |
| Qwen/Qwen3-VL-Reranker-8B | 0.6058825254 | 0.6043989062 | 1.483619e-03 | Pass |

---------

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
…ll (#935)

Adds NSP-parallel expert-blocked dispatch to the chunked prefill MoE path for Qwen3MOE and GPT-OSS, replacing the sequential per-expert loop with a batched packed-prefix approach.

Configuration:

```
export EXPERT_BLOCKING_NUM_NSP=16   # default: 1 NSP per expert (best perf at T=256)
export EXPERT_BLOCKING_NUM_NSP=8    # 2 NSPs per expert
export EXPERT_BLOCKING_NUM_NSP=2    # for testing
```

Falls back to the original per-expert loop if `num_experts % EXPERT_BLOCKING_NUM_NSP != 0`.

`EXPERT_BLOCKING_NUM_NSP=2 pytest tests/transformers/models/test_moe_prefill_blocked.py -v`

Update (0429): `export EXPERT_BLOCKING_PACKED_CHUNK_SIZE=256` for chunk PL of 512

**Update (0525):** Configuration is now compile-API driven:

1. num_cores controls NSP parallelism.
2. moe_prefill_packed_chunk_size controls packed chunk size.
3. No EXPERT_BLOCKING_NUM_NSP / EXPERT_BLOCKING_PACKED_CHUNK_SIZE env vars are required.

**Example:**

```
qeff_model.compile(
    prefill_seq_len=512,
    ctx_len=...,
    num_cores=16,
    prefill_only=True,
    enable_chunking=True,
    moe_prefill_packed_chunk_size=256,
    ...
)
```

**Notes:**

- Qwen3-MoE and GPT-OSS disaggregated serving examples are updated to use PL=512 and packed chunk size=256.
- The optimized path requires num_experts % num_cores == 0.
- Qwen3-MoE and GPT-OSS now use the same packed-chunk flow as the standalone benchmark.
- torch.clamp is retained for bench alignment, with tensor bounds to avoid QAIC Clip dtype issues.
- Subfunction-specific ReduceSum/Einsum cleanup is deferred and will be handled separately.

**Validation:**

```
pytest -q \
  tests/transformers/models/test_moe_prefill_blocked.py::test_qwen3moe_blocked_forward_parity \
  tests/transformers/models/test_moe_prefill_blocked.py::test_qwen3moe_prefill_chunked_subfunction_export_contains_cumsum_custom_ops \
  tests/transformers/models/test_moe_prefill_blocked.py::test_gptoss_blocked_forward_parity \
  tests/transformers/models/test_moe_prefill_blocked.py::test_gptoss_prefill_chunked_export_traces_packed_chunks
```

Also verified tiny non-subfunction QAIC compile for Qwen3-MoE and GPT-OSS with:

- prefill_seq_len=512
- moe_prefill_packed_chunk_size=256
- num_cores=2

---------

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
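The divisibility requirement in the notes for #935 amounts to a simple path selection, sketched below. The function names and return labels are hypothetical, and the chunk-count formula (ceiling of `prefill_seq_len / moe_prefill_packed_chunk_size`) is an assumption about how the packed chunks tile the prefill length, not something the PR states.

```python
import math


def choose_moe_prefill_path(num_experts: int, num_cores: int) -> str:
    """Pick the packed path only when experts split evenly across cores."""
    if num_cores > 0 and num_experts % num_cores == 0:
        return "packed"
    # Otherwise fall back to the original sequential per-expert loop.
    return "per_expert_loop"


def num_packed_chunks(prefill_seq_len: int, packed_chunk_size: int) -> int:
    """Assumed chunk count: how many packed chunks cover one prefill window."""
    return math.ceil(prefill_seq_len / packed_chunk_size)


even_split = choose_moe_prefill_path(num_experts=128, num_cores=16)
uneven_split = choose_moe_prefill_path(num_experts=60, num_cores=16)
chunks = num_packed_chunks(prefill_seq_len=512, packed_chunk_size=256)
```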
Adds support for two VLM model architectures: Qwen 3.5 and Qwen 3.6. --------- Signed-off-by: Mohit Soni <mohisoni@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Co-authored-by: Mohit Soni <mohisoni@qti.qualcomm.com> Co-authored-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Adds end-to-end support for Gemma 4, covering both the text-only (ForCausalLM) and multimodal (ForConditionalGeneration) variants, including dense and MoE configurations, optimized chunked prefill, and disaggregated (vision/language split) serving. --------- Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com> Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com> Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com> Co-authored-by: Dipankar Sarkar <dipankar@qti.qualcomm.com> Co-authored-by: Rishin Raj <rishinr@qti.qualcomm.com>
Adds embedding-model support for Qwen/Qwen3-VL-Embedding-8B. [MAD] CPU vs AI100 mean=1.585330e-05, max=3.049895e-04 --------- Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Deprecate support for the meta-llama/Llama-3.2-11B-Vision model in Efficient Transformers --------- Signed-off-by: Hem Agnihotri <hemagnih@qti.qualcomm.com>
…1027)

This PR fixes stale ONNX reuse in the shared compile path.

Problem:

- `_compile()` could reuse `self.onnx_path` from a previous compile.
- In disaggregated serving, decode compile and prefill/chunked prefill compile need different ONNX graphs because the exported model forward/transforms differ.
- Reusing decode ONNX for prefill could produce multiple QPCs from the same graph, which is incorrect for MoE prefill optimization.
- For diffusion pipelines, compile can run after export, and some modules such as `QEffVAE` do not accept generic export kwargs like `offload_pt_weights`, causing CI failures when compile tried to re-export.

Fix:

- In `QEFFBaseModel._compile()`, reuse an existing ONNX only when weights are already offloaded/meta, since re-export is not possible in that state.
- Otherwise, export for the current compile mode so decode and prefill/chunked prefill can produce distinct ONNX graphs.
- In diffusion pipeline modules, pass `onnx_path=self.onnx_path` explicitly into `_compile()` so those modules compile the already-exported graph and avoid unintended re-export.

Validation:

- Qwen3-MoE disaggregated decode + chunked prefill ONNX regression passed.
- VLM subfunction regression passed.

---------

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
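The reuse rule from the fix for stale ONNX reuse can be summarized as a small decision function. This is a sketch of the described behavior, not the body of `QEFFBaseModel._compile()`: `choose_onnx_for_compile`, its parameters, and the file names are all hypothetical.

```python
def choose_onnx_for_compile(cached_onnx_path, weights_offloaded, export_fn, onnx_path=None):
    """Decide which ONNX graph a compile call should use.

    Hypothetical helper mirroring the three rules in the fix description.
    """
    # 1. An explicitly passed path wins (the diffusion-module case).
    if onnx_path is not None:
        return onnx_path
    # 2. Reuse the cached graph only when re-export is impossible.
    if cached_onnx_path is not None and weights_offloaded:
        return cached_onnx_path
    # 3. Otherwise export for the current compile mode.
    return export_fn()


def must_not_export():
    raise AssertionError("re-export should not be attempted here")


# Prefill compile after a decode compile: weights still loaded, so re-export.
prefill = choose_onnx_for_compile("decode.onnx", False, lambda: "prefill.onnx")
# Weights already on meta: the cached graph is the only option.
offloaded = choose_onnx_for_compile("decode.onnx", True, must_not_export)
# Diffusion module passes its exported graph explicitly.
vae = choose_onnx_for_compile("stale.onnx", False, must_not_export, onnx_path="vae.onnx")
```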
…es (#1030) ## Description Expands quickcheck coverage for enable_proxy across causal LM, embedding, sequence classification, Whisper, CTC, and dual-QPC VLM paths. Confirms default enable_proxy=False behavior remains unchanged and validates with the full quickcheck suite. cc: @anujgupt-github @quic-rishinr Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Added Qwen 3.5, Qwen 3.6, and Gemma 4 models to the nightly list --------- Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
…1023) Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com> Signed-off-by: Mohit Soni <mohisoni@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com> Signed-off-by: abhishek-singh591 <sabhis@qti.qualcomm.com> Co-authored-by: Mohit Soni <mohisoni@qti.qualcomm.com> Co-authored-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Co-authored-by: Mamta Singh <mamtsing@qti.qualcomm.com>
This PR fixes the Qwen3-VL-MoE accuracy issue and a compiler issue with
subfunction export.
### Main Fix: Deepstack Accuracy Fix
The deepstack visual merge previously used:
```
hidden_states = hidden_states.clone()
mixed_embeds = hidden_states + visual_embeds
local_this = torch.where(visual_pos_masks, mixed_embeds, hidden_states)
return local_this
```
This has been changed to:
```
visual_mask = visual_pos_masks.to(hidden_states.dtype)
return hidden_states + (visual_embeds * visual_mask)
```
This is mathematically equivalent for boolean visual position masks, but
avoids the `torch.where` merge pattern. The previous `torch.where` form
compiled but produced poor accuracy. The mask-multiply form resolves the
observed accuracy issue.
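A quick eager-mode check of the equivalence claim is sketched below. The tensor shapes are assumptions chosen for illustration (a `(batch, seq, 1)` boolean mask broadcast over the hidden dimension); the real model's shapes may differ.

```python
import torch

torch.manual_seed(0)
# Assumed shapes for illustration only.
hidden_states = torch.randn(2, 5, 8)
visual_embeds = torch.randn(2, 5, 8)
visual_pos_masks = torch.rand(2, 5, 1) > 0.5  # boolean mask

# Previous form: select between the mixed and original values.
mixed_embeds = hidden_states + visual_embeds
old_out = torch.where(visual_pos_masks, mixed_embeds, hidden_states)

# New form: multiply by the mask cast to the activation dtype, then add.
visual_mask = visual_pos_masks.to(hidden_states.dtype)
new_out = hidden_states + (visual_embeds * visual_mask)

forms_match = torch.allclose(old_out, new_out)
```

The two forms agree for finite inputs. They can differ if `visual_embeds` holds `inf` or `NaN` at unmasked positions, because `0 * inf` is `NaN` while `torch.where` would simply discard that value.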
### Summary of changes related to the transformers v5.5.4 upgrade
- Avoids calling the current HF **self.gate(x)** implementation inside
QEfficient MoE blocks, because the newer router internally performs
top-k normalization using a **.sum()** reduction that can export as a
**ReduceSum**, which is unsupported under `-sub-functions`.
- Recreates the older raw-router-logits flow explicitly with F.linear,
then applies TopK and softmax over selected top-k logits.
- Updates decode sparse MoE expert weight gathering to use
pre-transposed expert weights before dynamic index_select, avoiding the
problematic Gather -> Transpose pattern on dynamically selected expert
tensors.
**Key Changes**
- Router path changed from:
```
gate_out = self.gate(x)
router_logits, top_w, top_i = gate_out
```
to:
```
router_logits = F.linear(x.reshape(-1, self.gate.hidden_dim), self.gate.weight)
top_w, top_i = torch.topk(router_logits, self.gate.top_k, dim=-1)
top_w = F.softmax(top_w, dim=-1, dtype=torch.float)
```
- Decode expert gather changed from:
```
w_up = self.experts.gate_up_proj.index_select(0, idx)
w_up = w_up.transpose(1, 2)
```
to:
`w_up = self.experts.gate_up_proj.transpose(1, 2).index_select(0, idx)`
**Why**
- Newer transformers Qwen3-VL-MoE router returns (router_logits,
router_scores, router_indices) and performs normalization internally.
- That normalization introduced reduction ops in the exported ONNX,
which caused QAIC compile failures (segfaults).
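The decode expert gather change reorders two operations that commute, which the following sketch checks in eager mode. The tensor sizes are made up for illustration, and `gate_up_proj` and `idx` here are local stand-ins rather than the model's actual attributes.

```python
import torch

torch.manual_seed(0)
# Hypothetical sizes: 4 experts, each with a (6, 10) weight matrix.
gate_up_proj = torch.randn(4, 6, 10)
idx = torch.tensor([2, 0])  # dynamically selected expert indices

# Previous order: gather the selected experts, then transpose each one.
old_w_up = gate_up_proj.index_select(0, idx).transpose(1, 2)

# New order: transpose the full expert stack first, then gather.
new_w_up = gate_up_proj.transpose(1, 2).index_select(0, idx)

same_weights = torch.equal(old_w_up, new_w_up)
```

Selecting along dimension 0 and swapping dimensions 1 and 2 touch disjoint axes, so the results are identical; only the exported op order (and therefore the Gather -> Transpose pattern) changes.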
---------
Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com>
In the latest transformers upgrade, past_seen_tokens was recalculated every time, causing a performance drop. This commit reverts that change. Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Added bug fix for layerwise export --------- Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com> Signed-off-by: abhishek-singh591 <sabhis@qti.qualcomm.com> Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com> Co-authored-by: vtirumal <vtirumal@qti.qualcomm.com>
Collection of PRs that includes the transformers (TF) v5.5.4 upgrade, Qwen 3.5, Qwen 3.6, Gemma 4, etc.