Release/v1.22.0 tmp by quic-rishinr · Pull Request #1029 · quic/efficient-transformers

quic-rishinr · 2026-06-03T15:12:46Z

Collection of PRs covering the Transformers (TF) 5.5.4 upgrade, Qwen 3.5, Qwen 3.6, Gemma 4, and related fixes.

@quic-rishinr

…changes to v5.5.4 and restore PyTorch/ORT parity (#876)

- Rebased downstream wrapper stack to transformers v5.3.0 and aligned coupled deps (huggingface-hub, peft, diffusers) in project config.
- Updated model wrapper compatibility paths across causal/VLM/audio/export flows to match upstream v5 APIs while preserving downstream public behavior.
- Hardened cache compatibility layer and runtime glue for mixed legacy/new cache semantics used by downstream generation/export paths.
- Fixed attention/mask/rotary call-path mismatches introduced by upstream API changes (including model-specific signature updates).
- Updated AWQ/quantizer and export compatibility paths to remain ONNX-safe.
- Validation evidence:

```
python -m pytest -q tests/test_model_quickcheck.py -n 16
Result: 26 passed.
```

- [x] QAic Verification Pending
- [x] E2E CI read out

cc: @quic-rishinr @quic-hemagnih @asmigosw @anujgupt-github

---------

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com>
Co-authored-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Co-authored-by: Hem Agnihotri <hemagnih@qti.qualcomm.com>

…o_empty() (#952)

fix: improve weight offloading to handle plain tensor attrs and use to_empty()

Replace manual storage resizing with `to_empty(device="meta")` for parameters/buffers and explicitly handle plain tensor attributes (e.g. stacked expert weights in MoE models) that are not registered as parameters or buffers. This ensures all tensors are properly moved to the meta device, reducing memory usage after ONNX export.

Add unit tests for plain tensor attribute clearing.

---------

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>

After inspecting the attributes of `QAicProgramProperties`:

- SDK 1.22.0.120 -> `['dataPathTimeoutMs', 'devMapping', 'selectMask', 'submitNumRetries']`
- SDK 1.22.0.119 -> `['SubmitNumRetries', 'SubmitRetryTimeoutMs', 'dataPathTimeoutMs', 'devMapping', 'selectMask', 'submitNumRetries']`

`SubmitRetryTimeoutMs` is missing in 1.22.0.120 but present in 1.22.0.119.

Root cause: the attribute was removed by the LRT changes in the 1.22.0.120 SDK.

Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>
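Because the attribute set differs between the two SDK builds listed above, one defensive pattern is to set optional properties only when the object exposes them. The sketch below is a minimal illustration under stated assumptions: `PropsV119`, `PropsV120`, and `set_if_supported` are hypothetical stand-ins, nothing here imports `qaicrt`, and the actual fix in this PR may take a different form.

```python
class PropsV119:
    """Stand-in with the attributes listed for SDK 1.22.0.119 (not the real class)."""

    def __init__(self):
        self.SubmitNumRetries = 0
        self.SubmitRetryTimeoutMs = 0
        self.dataPathTimeoutMs = 0
        self.devMapping = None
        self.selectMask = 0
        self.submitNumRetries = 0


class PropsV120:
    """Stand-in with the attributes listed for SDK 1.22.0.120 (not the real class)."""

    # __slots__ makes unknown attributes fail, like a fixed native binding would.
    __slots__ = ("dataPathTimeoutMs", "devMapping", "selectMask", "submitNumRetries")

    def __init__(self):
        self.dataPathTimeoutMs = 0
        self.devMapping = None
        self.selectMask = 0
        self.submitNumRetries = 0


def set_if_supported(props, name, value):
    """Set props.<name> only when the attribute exists; return True if it was set."""
    if hasattr(props, name):
        setattr(props, name, value)
        return True
    return False


old_props, new_props = PropsV119(), PropsV120()
print(set_if_supported(old_props, "SubmitRetryTimeoutMs", 5000))  # attribute present
print(set_if_supported(new_props, "SubmitRetryTimeoutMs", 5000))  # attribute absent
```

A guard like this keeps one code path working across both SDK builds, at the cost of silently skipping a setting on the newer build, so logging the skipped name may be worthwhile.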

…1003) This PR changes the precision of the CCL input from int8 to int64 to align with the compiler for release 1.22. It addresses the JIRA tickets raised when working with CCL enabled on new SDK versions. Signed-off-by: Vahid Janfaza <vjanfaza@qti.qualcomm.com> Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>

…1013) Without this fix, the QPCs are loaded on QID 0 by default. Signed-off-by: sanising <sanising@qti.qualcomm.com>
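The commit title for this fix, listed further down in this PR, says that `device_ids` is sent as an int when it holds exactly one ID. A minimal sketch of that normalization follows; `device_arg` is a hypothetical helper name, returning `None` for an empty input is an assumption, and how `qaicrt.Program` consumes the value is not shown in this PR text.

```python
from typing import List, Optional, Union


def device_arg(device_ids: Optional[List[int]]) -> Union[int, List[int], None]:
    """Collapse a single-element device list to a bare int; pass longer lists through."""
    if not device_ids:
        # Assumption: no explicit devices means "let the runtime choose".
        return None
    if len(device_ids) == 1:
        return device_ids[0]
    return list(device_ids)


print(device_arg([3]))
print(device_arg([0, 1]))
```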

## Summary

Adds GLM4-MOE support for disaggregated serving with chunked prefill.

## Supported

- GLM4-MOE decode path
- Chunked prefill MoE path with packed expert dispatch
- KV-blocked attention path
- Disaggregated prefill/decode serving example
- ONNX subfunction export for decode and prefill

## Tested

- Added GLM4-MOE prefill/blocked export tests
- Verified packed MoE custom-op counts for `prefill_seq_len=512`, packed chunk size `256`
- Ran GLM4-MOE disaggregated example end-to-end with a tiny config.

```bash
pytest -q tests/transformers/models/test_moe_prefill_blocked.py
python examples/disagg_serving/glm4_moe_disagg_mode_with_chunking.py
```

---------

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

…s list (#984)

## Description

Speculative decoding on QAIC requires statically compiled shapes. Previously, a TLM could only be compiled for a single proposal length K, forcing every decode step to run the full `seq_len=K+1` kernel regardless of how many tokens the draft model actually proposed. This PR allows compiling multiple decode specializations in one QPC so the runtime can dispatch to the smallest kernel that covers the actual proposal count, reducing unnecessary compute on short proposals without changing correctness.

### Summary

- `QEFFAutoModelForCausalLM.compile()` now accepts `num_speculative_tokens` as a `List[int]`. Each value K compiles one TLM decode specialization (`seq_len=K+1`, `num_logits_to_keep=K+1`), enabling per-step dispatch to the cheapest kernel that covers the actual proposal count.
- Plain `int` input still works (backward compatible; treated as `[K]`).
- Removes `enable_fallback_decode_spec`; equivalent behavior is `[0, K]`.
- Fixes flat-format `specializations.json` write in `_compile` (named format caused `RuntimeError: Failed to create ExecObj` on 4-device MDP QPCs).
- Improves `find_candidate_pred_tokens` to return the *longest* n-gram continuation across all candidates instead of early-returning on the first match.

### Results

Measured on Llama-3.1-8B-Instruct (mxfp6/mxint8, 4 SOCs), MT-bench, 80 prompts. `num_speculative_tokens=[0, 4]` (K=4) vs the prior fixed-K baseline.

**Request throughput vs nospec (req/s):**

<img width="1052" height="714" alt="image" src="https://github.com/user-attachments/assets/79a8ed1a-76f5-45d4-9655-9ab57f9f25c7" />

| max new seqs | nospec | ngram K=4 (fixed) | ngram varK | suffix K=4 (fixed) | suffix varK |
|---|---|---|---|---|---|
| 1 | 0.137 | 0.108 (−21%) | **0.282 (+106%)** | 0.131 (−4%) | **0.297 (+116%)** |
| 2 | 0.257 | 0.197 (−23%) | **0.467 (+82%)** | 0.237 (−8%) | **0.495 (+93%)** |
| 4 | 0.448 | 0.340 (−24%) | **0.640 (+43%)** | 0.405 (−9%) | **0.758 (+69%)** |
| 8 | 0.674 | 0.580 (−14%) | **1.224 (+82%)** | 0.692 (+3%) | **1.142 (+69%)** |

**varK vs fixed-K improvement:**

| max new seqs | ngram Δ | suffix Δ |
|---|---|---|
| 1 | +162% | +126% |
| 2 | +137% | +109% |
| 4 | +88% | +87% |
| 8 | +111% | +65% |

Key takeaways:

- Fixed-K ngram SpD is 14–24% *slower* than nospec on QAIC (each step runs the full K+1 kernel even when no tokens are proposed); fixed-K suffix SpD ranges from 9% slower to 3% faster.
- Variable-K reverses this regression: **+43–116% throughput vs nospec** across both methods.
- TPOT at mns=1 drops from ~34 ms/token (nospec) to ~16 ms/token with varK (~53% reduction).

### Test plan

- [x] `pytest tests/unit_test/models/test_modeling_auto_cpu.py::TestTLMMultiSpecSpecializations -v`
- [x] `pytest tests/unit_test/transforms/test_speculative_decoding.py::TestTLMForwardExecution::test_tlm_multi_spec_logit_consistency -v`
- [x] `pytest tests/transformers/spd/test_pld_inference.py::test_multi_spec_structure tests/transformers/spd/test_pld_inference.py::test_select_k -v`
- [x] Hardware: `pytest tests/transformers/spd/test_spd_inference.py -m on_qaic -k "pld"` (on QAIC device)

### Notes

- TLM + CCL (`comp_ctx_lengths_decode`) combination raises `NotImplementedError`; it is not yet supported.
- `speculative_config` in model config overrides user-supplied Ks; a `logger.warning` is emitted when values are discarded.

---------

Signed-off-by: eplatero <eplatero@qti.qualcomm.com>
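The per-step dispatch described in this commit (pick the smallest compiled K that covers the actual proposal count) can be sketched as follows. This is an illustration only: `normalize_ks` and `pick_specialization` are hypothetical names, not the functions in this PR, and the returned dict simply mirrors the `seq_len=K+1`, `num_logits_to_keep=K+1` shapes stated above.

```python
from bisect import bisect_left
from typing import Dict, List, Union


def normalize_ks(num_speculative_tokens: Union[int, List[int]]) -> List[int]:
    """Treat a plain int K as [K]; return the sorted, de-duplicated proposal lengths."""
    if isinstance(num_speculative_tokens, int):
        ks = [num_speculative_tokens]
    else:
        ks = list(num_speculative_tokens)
    if not ks or any(k < 0 for k in ks):
        raise ValueError("expected at least one non-negative proposal length")
    return sorted(set(ks))


def pick_specialization(compiled_ks: Union[int, List[int]], num_proposed: int) -> Dict[str, int]:
    """Return the cheapest compiled decode shape that covers num_proposed tokens."""
    ks = normalize_ks(compiled_ks)
    idx = bisect_left(ks, num_proposed)  # first compiled K with K >= num_proposed
    if idx == len(ks):
        raise ValueError(f"{num_proposed} proposed tokens exceed the largest compiled K={ks[-1]}")
    k = ks[idx]
    return {"k": k, "seq_len": k + 1, "num_logits_to_keep": k + 1}


# With [0, 4], a step with no proposals runs the seq_len=1 kernel
# instead of the full seq_len=5 kernel.
print(pick_specialization([0, 4], 0))
print(pick_specialization([0, 4], 2))
```

Under this reading, `[0, K]` reproduces the removed `enable_fallback_decode_spec` behavior: steps with zero proposals fall back to the plain decode shape.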

Added fix for fp16 export in qwen3 and qwen3vl modeling files. --------- Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com> Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com> Co-authored-by: Rishin Raj <rishinr@qti.qualcomm.com>

This PR adds support for the following Qwen3-VL reranker models on AI100:

- `Qwen/Qwen3-VL-Reranker-2B`
- `Qwen/Qwen3-VL-Reranker-8B`

The support is implemented using the existing QEff image-text-to-text flow (dual QPC), with model parity validation focused on **PyTorch (original) vs AI100**.

### Results

| Model | PyTorch score | AI100 score | MAD max | Status |
|---|---:|---:|---:|---|
| Qwen/Qwen3-VL-Reranker-2B | 0.3213230073 | 0.3259495199 | 4.626513e-03 | Pass |
| Qwen/Qwen3-VL-Reranker-8B | 0.6058825254 | 0.6043989062 | 1.483619e-03 | Pass |

---------

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

…ll (#935)

Adds NSP-parallel expert-blocked dispatch to the chunked prefill MoE path for Qwen3MOE and GPT-OSS, replacing the sequential per-expert loop with a batched packed-prefix approach.

Configuration:

```
export EXPERT_BLOCKING_NUM_NSP=16  # default: 1 NSP per expert (best perf at T=256)
export EXPERT_BLOCKING_NUM_NSP=8   # 2 NSPs per expert
export EXPERT_BLOCKING_NUM_NSP=2   # for testing
```

Falls back to the original per-expert loop if `num_experts % EXPERT_BLOCKING_NUM_NSP != 0`.

`EXPERT_BLOCKING_NUM_NSP=2 pytest tests/transformers/models/test_moe_prefill_blocked.py -v`

Update (0429): `export EXPERT_BLOCKING_PACKED_CHUNK_SIZE=256` for chunk PL of 512

**Update (0525):** Configuration is now compile-API driven:

1. `num_cores` controls NSP parallelism.
2. `moe_prefill_packed_chunk_size` controls packed chunk size.
3. No `EXPERT_BLOCKING_NUM_NSP` / `EXPERT_BLOCKING_PACKED_CHUNK_SIZE` env vars are required.

**Example:**

```
qeff_model.compile(
    prefill_seq_len=512,
    ctx_len=...,
    num_cores=16,
    prefill_only=True,
    enable_chunking=True,
    moe_prefill_packed_chunk_size=256,
    ...
)
```

**Notes:**

- Qwen3-MoE and GPT-OSS disaggregated serving examples are updated to use PL=512 and packed chunk size=256.
- The optimized path requires `num_experts % num_cores == 0`.
- Qwen3-MoE and GPT-OSS now use the same packed-chunk flow as the standalone benchmark.
- `torch.clamp` is retained for bench alignment, with tensor bounds to avoid QAIC Clip dtype issues.
- Subfunction-specific ReduceSum/Einsum cleanup is deferred and will be handled separately.

**Validation:**

```
pytest -q \
  tests/transformers/models/test_moe_prefill_blocked.py::test_qwen3moe_blocked_forward_parity \
  tests/transformers/models/test_moe_prefill_blocked.py::test_qwen3moe_prefill_chunked_subfunction_export_contains_cumsum_custom_ops \
  tests/transformers/models/test_moe_prefill_blocked.py::test_gptoss_blocked_forward_parity \
  tests/transformers/models/test_moe_prefill_blocked.py::test_gptoss_prefill_chunked_export_traces_packed_chunks
```

Also verified tiny non-subfunction QAIC compile for Qwen3-MoE and GPT-OSS with:

- prefill_seq_len=512
- moe_prefill_packed_chunk_size=256
- num_cores=2

---------

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
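The divisibility requirement and the PL=512 / packed chunk size=256 pairing from this commit can be checked up front before compiling. The sketch below is a hypothetical pre-flight helper (`moe_prefill_plan` is not a function from this PR); rounding the chunk count up with ceiling division is an assumption, and the expert count of 128 in the usage lines is an illustrative value rather than one taken from the source.

```python
from typing import Dict, Union


def moe_prefill_plan(
    num_experts: int,
    num_cores: int,
    prefill_seq_len: int,
    packed_chunk_size: int,
) -> Dict[str, Union[bool, int]]:
    """Report whether the optimized path's stated requirement holds and how many chunks result."""
    if min(num_experts, num_cores, prefill_seq_len, packed_chunk_size) <= 0:
        raise ValueError("all arguments must be positive")
    # Stated requirement for the optimized path: num_experts % num_cores == 0.
    use_optimized_path = num_experts % num_cores == 0
    # Assumption: a partial trailing chunk still counts as one chunk (ceiling division).
    num_packed_chunks = -(-prefill_seq_len // packed_chunk_size)
    return {"use_optimized_path": use_optimized_path, "num_packed_chunks": num_packed_chunks}


print(moe_prefill_plan(num_experts=128, num_cores=16, prefill_seq_len=512, packed_chunk_size=256))
print(moe_prefill_plan(num_experts=30, num_cores=16, prefill_seq_len=512, packed_chunk_size=256))
```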

Adds support for two VLM model architectures: Qwen 3.5 and Qwen 3.6. --------- Signed-off-by: Mohit Soni <mohisoni@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Co-authored-by: Mohit Soni <mohisoni@qti.qualcomm.com> Co-authored-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>

Adds end-to-end support for Gemma 4, covering both the text-only (ForCausalLM) and multimodal (ForConditionalGeneration) variants, including dense and MoE configurations, optimized chunked prefill, and disaggregated (vision/language split) serving. --------- Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com> Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com> Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com> Co-authored-by: Dipankar Sarkar <dipankar@qti.qualcomm.com> Co-authored-by: Rishin Raj <rishinr@qti.qualcomm.com>

Adds embedding-model support for Qwen/Qwen3-VL-Embedding-8B. [MAD] CPU vs AI100 mean=1.585330e-05, max=3.049895e-04 --------- Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Deprecates support for the meta-llama/Llama-3.2-11B-Vision model in Efficient Transformers. --------- Signed-off-by: Hem Agnihotri <hemagnih@qti.qualcomm.com>

…1027)

This PR fixes stale ONNX reuse in the shared compile path.

Problem:

- `_compile()` could reuse `self.onnx_path` from a previous compile.
- In disaggregated serving, decode compile and prefill/chunked prefill compile need different ONNX graphs because the exported model forward/transforms differ.
- Reusing decode ONNX for prefill could produce multiple QPCs from the same graph, which is incorrect for MoE prefill optimization.
- For diffusion pipelines, compile can run after export, and some modules such as `QEffVAE` do not accept generic export kwargs like `offload_pt_weights`, causing CI failures when compile tried to re-export.

Fix:

- In `QEFFBaseModel._compile()`, reuse an existing ONNX only when weights are already offloaded/meta, since re-export is not possible in that state.
- Otherwise, export for the current compile mode so decode and prefill/chunked prefill can produce distinct ONNX graphs.
- In diffusion pipeline modules, pass `onnx_path=self.onnx_path` explicitly into `_compile()` so those modules compile the already-exported graph and avoid unintended re-export.

Validation:

- Qwen3-MoE disaggregated decode + chunked prefill ONNX regression passed.
- VLM subfunction regression passed.

---------

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
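The reuse rule described in the fix above reduces to a small decision function. The sketch below is one reading of that rule, not the code in `QEFFBaseModel._compile()`: `choose_onnx_action` and its string results are hypothetical, and raising when weights are offloaded but no ONNX exists is an assumption about a case the PR text does not cover.

```python
from typing import Optional


def choose_onnx_action(
    existing_onnx_path: Optional[str],
    weights_offloaded: bool,
    explicit_onnx_path: Optional[str] = None,
) -> str:
    """Decide whether a compile call should use a given ONNX, reuse a prior one, or export."""
    if explicit_onnx_path is not None:
        # Caller passed onnx_path explicitly (the diffusion-pipeline case): compile that graph.
        return "use_explicit"
    if weights_offloaded:
        if existing_onnx_path is None:
            # Assumption: nothing to reuse and re-export is impossible, so fail loudly.
            raise RuntimeError("weights are offloaded and no ONNX is available")
        # Re-export is not possible once weights are on the meta device.
        return "reuse_existing"
    # Weights still present: export for the current compile mode (decode vs prefill).
    return "export"


print(choose_onnx_action("decode.onnx", weights_offloaded=False))
print(choose_onnx_action("decode.onnx", weights_offloaded=True))
```

The key property is that a stale path from an earlier decode compile no longer wins by default; it is only reused when there is no other option.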

@anujgupt-github

…es (#1030)

## Description

Expands quickcheck coverage for `enable_proxy` across causal LM, embedding, sequence classification, Whisper, CTC, and dual-QPC VLM paths. Confirms default `enable_proxy=False` behavior remains unchanged and validates with the full quickcheck suite.

cc: @anujgupt-github @quic-rishinr

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

Added the Qwen 3.5, Qwen 3.6, and Gemma 4 models to the nightly list. --------- Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>

…1023) Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com> Signed-off-by: Mohit Soni <mohisoni@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com> Signed-off-by: abhishek-singh591 <sabhis@qti.qualcomm.com> Co-authored-by: Mohit Soni <mohisoni@qti.qualcomm.com> Co-authored-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Co-authored-by: Mamta Singh <mamtsing@qti.qualcomm.com>

This PR fixes the Qwen3-VL-MoE accuracy issue and a compiler issue related to subfunctions.

### Main Fix: Deepstack Accuracy Fix

The deepstack visual merge previously used:

```
hidden_states = hidden_states.clone()
mixed_embeds = hidden_states + visual_embeds
local_this = torch.where(visual_pos_masks, mixed_embeds, hidden_states)
return local_this
```

This has been changed to:

```
visual_mask = visual_pos_masks.to(hidden_states.dtype)
return hidden_states + (visual_embeds * visual_mask)
```

This is mathematically equivalent for boolean visual position masks, but avoids the `torch.where` merge pattern. The previous `torch.where` form compiled but produced poor accuracy. The mask-multiply form resolves the observed accuracy issue.

### Summary for changes related to transformer upgrade v5.5.4

- Avoids calling the current HF **self.gate(x)** implementation inside QEfficient MoE blocks, because the newer router internally performs top-k normalization using a reduction with **.sum()** that can export as unsupported **ReduceSum under -sub-functions**.
- Recreates the older raw-router-logits flow explicitly with `F.linear`, then applies TopK and softmax over the selected top-k logits.
- Updates decode sparse MoE expert weight gathering to use pre-transposed expert weights before dynamic `index_select`, avoiding the problematic Gather -> Transpose pattern on dynamically selected expert tensors.

**Key Changes**

- Router path changed from:

```
gate_out = self.gate(x)
router_logits, top_w, top_i = gate_out
```

to:

```
router_logits = F.linear(x.reshape(-1, self.gate.hidden_dim), self.gate.weight)
top_w, top_i = torch.topk(router_logits, self.gate.top_k, dim=-1)
top_w = F.softmax(top_w, dim=-1, dtype=torch.float)
```

- Decode expert gather changed from:

```
w_up = self.experts.gate_up_proj.index_select(0, idx)
w_up = w_up.transpose(1, 2)
```

to:

`w_up = self.experts.gate_up_proj.transpose(1, 2).index_select(0, idx)`

Why

- The newer transformers Qwen3-VL-MoE router returns `(router_logits, router_scores, router_indices)` and performs normalization internally.
- That normalization introduced reduction ops in the exported ONNX, which caused QAIC compile failures (segfaults).

---------

Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com>
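The claim that the two deepstack merge forms agree for boolean masks can be checked elementwise without any tensor library. The sketch below uses plain Python lists as a stand-in for tensors; `merge_where` and `merge_mask_multiply` are illustrative names that mirror, but are not, the model code.

```python
from typing import List


def merge_where(hidden: List[float], visual: List[float], mask: List[bool]) -> List[float]:
    """Old form: select hidden + visual where the mask is set, else hidden."""
    return [h + v if m else h for h, v, m in zip(hidden, visual, mask)]


def merge_mask_multiply(hidden: List[float], visual: List[float], mask: List[bool]) -> List[float]:
    """New form: cast the mask to the value dtype, then add visual * mask."""
    return [h + v * float(m) for h, v, m in zip(hidden, visual, mask)]


hidden = [1.0, 2.0, 3.0]
visual = [0.5, 0.25, -1.0]
mask = [True, False, True]

# Both forms add the visual embedding only at masked positions.
print(merge_where(hidden, visual, mask))
print(merge_mask_multiply(hidden, visual, mask))
```

One caveat when relying on this equivalence: it holds for finite values. If `visual` contains `inf` or `NaN` at a position where the mask is false, the multiply form yields `NaN` there (since `inf * 0` is `NaN` in IEEE arithmetic), while the select form returns the untouched hidden value.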

In the latest TF upgrade, past_seen_tokens was recalculated on every call, which caused a performance drop. This PR reverts that change. Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Added bug fix for layerwise export --------- Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com> Signed-off-by: abhishek-singh591 <sabhis@qti.qualcomm.com> Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com> Co-authored-by: vtirumal <vtirumal@qti.qualcomm.com>

vbaddi and others added 18 commits May 25, 2026 22:07

Send device_ids as int to qaicrt.Program when len(device_ids) == 1 (#…

f823cde


Enabled Qwen3-VL embedding model (#923)

f776e80


Deprecate the mllama 3.2 model (llama3.2 vision) (#1018)

6ccce3b


[Nightly]: New Models are added in the Nightly List (#1034)

065543d


quic-akuruvil force-pushed the release/v1.22.0_tmp branch from 2c3d69c to 6b8db51 Compare June 4, 2026 16:40

tv-karthikeya and others added 3 commits June 5, 2026 10:09

Reverting past_seen_token calculation to be based on cache_position (#1032)

6c72846


quic-rishinr wants to merge 21 commits into main from release/v1.22.0_tmp


15 participants
