Release/v1.22.0 tmp #1029

Open
quic-rishinr wants to merge 21 commits into main from release/v1.22.0_tmp

@quic-rishinr
Contributor

A collection of PRs that includes TF 5.5.4, Qwen 3.5, Qwen 3.6, Gemma 4, etc.

vbaddi and others added 18 commits May 25, 2026 22:07
…changes to v5.5.4 and restore PyTorch/ORT parity (#876)

- Rebased downstream wrapper stack to transformers v5.3.0 and aligned
coupled deps (huggingface-hub, peft, diffusers) in project config.
- Updated model wrapper compatibility paths across
causal/VLM/audio/export flows to match upstream v5 APIs while preserving
downstream public behavior.
- Hardened cache compatibility layer and runtime glue for mixed
legacy/new cache semantics used by downstream generation/export paths.
- Fixed attention/mask/rotary call-path mismatches introduced by
upstream API changes (including model-specific signature updates).
- Updated AWQ/quantizer and export compatibility paths to remain
ONNX-safe.
- Validation evidence:
```
python -m pytest -q tests/test_model_quickcheck.py -n 16
Result: 26 passed.
```

- [x] QAic Verification Pending
- [x]  E2E CI read out

cc: @quic-rishinr @quic-hemagnih @asmigosw @anujgupt-github

---------

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com>
Co-authored-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Co-authored-by: Hem Agnihotri <hemagnih@qti.qualcomm.com>
…o_empty() (#952)

fix: improve weight offloading to handle plain tensor attrs and use
to_empty()

Replace manual storage resizing with `to_empty(device="meta")` for
parameters/buffers and explicitly handle plain tensor attributes (e.g.
stacked expert weights in MoE models) that are not registered as
parameters or buffers. This ensures all tensors are properly moved to
the meta device, reducing memory usage after ONNX export.
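The offloading idea above can be sketched as follows. This is only an illustration, not the repository's implementation: `TinyMoE` and `offload_to_meta` are hypothetical names, and the sketch assumes plain tensor attributes live directly in each module's `__dict__`.

```python
import torch
from torch import nn


class TinyMoE(nn.Module):
    """Hypothetical module with a parameter, a buffer, and a plain tensor attribute."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)
        self.register_buffer("scale", torch.ones(4))
        # Plain tensor attribute: neither a registered parameter nor a buffer.
        self.stacked_experts = torch.randn(2, 4, 4)


def offload_to_meta(model: nn.Module) -> None:
    # Parameters and buffers are replaced by storage-free meta tensors.
    model.to_empty(device="meta")
    # Plain tensor attributes are not visited by to_empty(), so handle them explicitly.
    for module in model.modules():
        for name, value in list(vars(module).items()):
            if isinstance(value, torch.Tensor) and not value.is_meta:
                setattr(module, name, torch.empty_like(value, device="meta"))


model = TinyMoE()
offload_to_meta(model)
```

After the call, shapes and dtypes are preserved but no tensor holds real storage, so the module cannot be re-exported without reloading weights.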

Add unit tests for plain tensor attribute clearing

---------

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Inspecting the attributes of `QAicProgramProperties` shows:
- SDK 1.22.0.120 -> ['dataPathTimeoutMs', 'devMapping', 'selectMask', 'submitNumRetries']
- SDK 1.22.0.119 -> ['SubmitNumRetries', 'SubmitRetryTimeoutMs', 'dataPathTimeoutMs', 'devMapping', 'selectMask', 'submitNumRetries']

`SubmitRetryTimeoutMs` is present in 1.22.0.119 but missing in 1.22.0.120 (the capitalized `SubmitNumRetries` is also absent from the 1.22.0.120 list).

Root cause: the attribute was removed by the LRT changes in SDK 1.22.0.120.
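Because the attribute set differs between SDK builds, one defensive option is to probe for an attribute before assigning it. The sketch below uses stand-in classes rather than the real `QAicProgramProperties` binding; the helper name `set_if_supported` and the timeout value are hypothetical.

```python
class PropsSdk119:
    """Stand-in mirroring the attribute list reported for SDK 1.22.0.119."""

    def __init__(self):
        self.SubmitNumRetries = 0
        self.SubmitRetryTimeoutMs = 0
        self.dataPathTimeoutMs = 0
        self.devMapping = None
        self.selectMask = 0
        self.submitNumRetries = 0


class PropsSdk120:
    """Stand-in mirroring the attribute list reported for SDK 1.22.0.120."""

    def __init__(self):
        self.dataPathTimeoutMs = 0
        self.devMapping = None
        self.selectMask = 0
        self.submitNumRetries = 0


def set_if_supported(props, name, value):
    """Assign the attribute only when this SDK build exposes it."""
    if hasattr(props, name):
        setattr(props, name, value)
        return True
    return False


old_props, new_props = PropsSdk119(), PropsSdk120()
applied_old = set_if_supported(old_props, "SubmitRetryTimeoutMs", 5000)
applied_new = set_if_supported(new_props, "SubmitRetryTimeoutMs", 5000)
```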

Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>
…1003)

This PR changes the precision of the CCL input from int8 to int64 to
align with the compiler for release 1.22. It addresses the JIRA tickets
raised when working with CCL enabled on newer SDK versions.

Signed-off-by: Vahid Janfaza <vjanfaza@qti.qualcomm.com>
Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
…1013)

Without this fix, the QPCs are loaded on QID 0 by default.

Signed-off-by: sanising <sanising@qti.qualcomm.com>
## Summary

  Adds GLM4-MOE support for disaggregated serving with chunked prefill.

  ## Supported

  - GLM4-MOE decode path
  - Chunked prefill MoE path with packed expert dispatch
  - KV-blocked attention path
  - Disaggregated prefill/decode serving example
  - ONNX subfunction export for decode and prefill

  ## Tested

  - Added GLM4-MOE prefill/blocked export tests
  - Verified packed MoE custom-op counts for `prefill_seq_len=512`, packed chunk size `256`
  - Ran GLM4-MOE disaggregated example end-to-end with a tiny config.

 ```bash
  pytest -q tests/transformers/models/test_moe_prefill_blocked.py
  python examples/disagg_serving/glm4_moe_disagg_mode_with_chunking.py
```

---------

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
…s list (#984)

## Description

Speculative decoding on QAIC requires statically compiled shapes.
Previously, a TLM
could only be compiled for a single proposal length K, forcing every
decode step to
run the full `seq_len=K+1` kernel regardless of how many tokens the
draft model
actually proposed. This PR allows compiling multiple decode
specializations in one
QPC so the runtime can dispatch to the smallest kernel that covers the
actual proposal
count, reducing unnecessary compute on short proposals without changing
correctness.
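A minimal sketch of the dispatch rule described above, assuming the runtime knows the set of compiled proposal lengths. The function names are hypothetical and do not come from the codebase.

```python
from typing import List, Union


def normalize_ks(num_speculative_tokens: Union[int, List[int]]) -> List[int]:
    """Treat a plain int K as the single-entry list [K]; sort and dedupe lists."""
    if isinstance(num_speculative_tokens, int):
        return [num_speculative_tokens]
    return sorted(set(num_speculative_tokens))


def pick_smallest_covering_k(compiled_ks: List[int], num_proposed: int) -> int:
    """Return the smallest compiled K that can hold num_proposed draft tokens."""
    for k in sorted(compiled_ks):
        if k >= num_proposed:
            return k
    raise ValueError(f"no compiled specialization covers {num_proposed} proposed tokens")


compiled = normalize_ks([0, 4])
# With no proposals, the K=0 kernel (seq_len=1) is enough;
# with one to four proposals, the K=4 kernel (seq_len=5) is needed.
seq_len_no_proposal = pick_smallest_covering_k(compiled, 0) + 1
seq_len_two_proposals = pick_smallest_covering_k(compiled, 2) + 1
```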

### Summary

- `QEFFAutoModelForCausalLM.compile()` now accepts
`num_speculative_tokens` as a
`List[int]`. Each value K compiles one TLM decode specialization
(`seq_len=K+1`,
`num_logits_to_keep=K+1`), enabling per-step dispatch to the cheapest
kernel that
  covers the actual proposal count.
- Plain `int` input still works (backward compatible — treated as
`[K]`).
- Removes `enable_fallback_decode_spec`; equivalent behavior is `[0,
K]`.
- Fixes flat-format `specializations.json` write in `_compile` (named
format caused
  `RuntimeError: Failed to create ExecObj` on 4-device MDP QPCs).
- Improves `find_candidate_pred_tokens` to return the *longest* n-gram
continuation
  across all candidates instead of early-returning on the first match.
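The last bullet can be illustrated with a small prompt-lookup sketch. It assumes candidates come from every n-gram size and every earlier match position, and that ties go to the larger n-gram; the real `find_candidate_pred_tokens` may differ, and `longest_ngram_continuation` is a hypothetical name.

```python
from typing import List


def longest_ngram_continuation(tokens: List[int], max_ngram_size: int, max_pred: int) -> List[int]:
    """Return the longest continuation found after any earlier match of the trailing n-gram."""
    best: List[int] = []
    # Try larger n-grams first so that ties favor the more specific match.
    for n in range(min(max_ngram_size, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan every earlier window; the trailing n-gram itself is excluded.
        for start in range(len(tokens) - n):
            if tokens[start:start + n] == suffix:
                continuation = tokens[start + n:start + n + max_pred]
                if len(continuation) > len(best):
                    best = continuation
    return best


tokens = [5, 6, 7, 8, 9, 3, 5, 3, 5]
# The 2-gram [3, 5] matches late and yields only [3, 5];
# the 1-gram [5] matches at index 0 and yields the longer [6, 7, 8].
proposal = longest_ngram_continuation(tokens, max_ngram_size=2, max_pred=3)
```

An early-return variant would stop at the first 2-gram match and propose two tokens instead of three.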

### Results

Measured on Llama-3.1-8B-Instruct (mxfp6/mxint8, 4 SOCs), MT-bench, 80
prompts.
`num_speculative_tokens=[0, 4]` (K=4) vs the prior fixed-K baseline.

**Request throughput vs nospec (req/s):**

<img width="1052" height="714" alt="image"
src="https://github.com/user-attachments/assets/79a8ed1a-76f5-45d4-9655-9ab57f9f25c7"
/>


| max new seqs | nospec | ngram K=4 (fixed) | ngram varK | suffix K=4 (fixed) | suffix varK |
|---|---|---|---|---|---|
| 1 | 0.137 | 0.108 (−21%) | **0.282 (+106%)** | 0.131 (−4%) | **0.297 (+116%)** |
| 2 | 0.257 | 0.197 (−23%) | **0.467 (+82%)** | 0.237 (−8%) | **0.495 (+93%)** |
| 4 | 0.448 | 0.340 (−24%) | **0.640 (+43%)** | 0.405 (−9%) | **0.758 (+69%)** |
| 8 | 0.674 | 0.580 (−14%) | **1.224 (+82%)** | 0.692 (+3%) | **1.142 (+69%)** |

**varK vs fixed-K improvement:**

| max new seqs | ngram Δ  | suffix Δ |
|---|---|---|
| 1 | +162% | +126% |
| 2 | +137% | +109% |
| 4 | +88%  | +87%  |
| 8 | +111% | +65%  |

Key takeaways:
- Fixed-K ngram SpD is 14–24% *slower* than nospec on QAIC (each step runs the
full K+1 kernel even when no tokens are proposed); fixed-K suffix ranges from −9% to +3%.
- Variable-K reverses this regression: **+43–116% throughput vs nospec**
across both methods.
- TPOT at mns=1 drops from ~34 ms/token (nospec) to ~16 ms/token with
varK (~53% reduction).
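The percentages above follow from a plain relative-change calculation. The sketch below recomputes a few of them from the published throughput figures; because those figures are rounded to three decimals, a recomputed value can differ from the table by about one percentage point.

```python
def pct_change(new: float, base: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


# Throughput (req/s) at max new seqs = 2, copied from the results table.
nospec, ngram_fixed, ngram_var = 0.257, 0.197, 0.467

fixed_vs_nospec = pct_change(ngram_fixed, nospec)   # fixed-K regression vs nospec
var_vs_nospec = pct_change(ngram_var, nospec)       # varK gain vs nospec
var_vs_fixed = pct_change(ngram_var, ngram_fixed)   # varK gain vs fixed-K

# TPOT at mns=1: ~34 ms/token (nospec) to ~16 ms/token (varK).
tpot_change = pct_change(16.0, 34.0)
```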

### Test plan

- [x] `pytest tests/unit_test/models/test_modeling_auto_cpu.py::TestTLMMultiSpecSpecializations -v`
- [x] `pytest tests/unit_test/transforms/test_speculative_decoding.py::TestTLMForwardExecution::test_tlm_multi_spec_logit_consistency -v`
- [x] `pytest tests/transformers/spd/test_pld_inference.py::test_multi_spec_structure tests/transformers/spd/test_pld_inference.py::test_select_k -v`
- [x] Hardware: `pytest tests/transformers/spd/test_spd_inference.py -m on_qaic -k "pld"` (on QAIC device)

### Notes

- TLM + CCL (`comp_ctx_lengths_decode`) combination raises
`NotImplementedError` — not yet supported.
- `speculative_config` in model config overrides user-supplied Ks; a
`logger.warning` is emitted when values are discarded.

---------

Signed-off-by: eplatero <eplatero@qti.qualcomm.com>
Added fix for fp16 export in qwen3 and qwen3vl modeling files.

---------

Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Co-authored-by: Rishin Raj <rishinr@qti.qualcomm.com>
This PR adds support for the following Qwen3-VL reranker models on
AI100:

  - `Qwen/Qwen3-VL-Reranker-2B`
  - `Qwen/Qwen3-VL-Reranker-8B`

The support is implemented using the existing QEff image-text-to-text
flow (dual QPC), with model parity validation focused on
**PyTorch(original) vs AI100**.
  
  ### Results

  | Model | PyTorch score | AI100 score | MAD max | Status |
  |---|---:|---:|---:|---|
  | Qwen/Qwen3-VL-Reranker-2B | 0.3213230073 | 0.3259495199 | 4.626513e-03 | Pass |
  | Qwen/Qwen3-VL-Reranker-8B | 0.6058825254 | 0.6043989062 | 1.483619e-03 | Pass |
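The "MAD max" column can be reproduced from the two score columns, assuming MAD here means the maximum absolute difference between the PyTorch and AI100 outputs. The pass threshold is not stated in the PR, so the sketch only computes the difference.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two score sequences."""
    return max(abs(r - c) for r, c in zip(reference, candidate))


# (PyTorch score, AI100 score) pairs copied from the results table.
scores = {
    "Qwen/Qwen3-VL-Reranker-2B": (0.3213230073, 0.3259495199),
    "Qwen/Qwen3-VL-Reranker-8B": (0.6058825254, 0.6043989062),
}

mad = {name: max_abs_diff([pt], [ai100]) for name, (pt, ai100) in scores.items()}
```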

---------

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
…ll (#935)

Adds NSP-parallel expert-blocked dispatch to the chunked prefill MoE
path for Qwen3MOE and GPT-OSS, replacing the sequential per-expert loop
with a batched packed-prefix approach.

```
Configuration:
  export EXPERT_BLOCKING_NUM_NSP=16   # default: 1 NSP per expert (best perf at T=256)
  export EXPERT_BLOCKING_NUM_NSP=8    # 2 NSPs per expert
  export EXPERT_BLOCKING_NUM_NSP=2    # for testing
```
Falls back to the original per-expert loop if `num_experts % EXPERT_BLOCKING_NUM_NSP != 0`.

To test: `EXPERT_BLOCKING_NUM_NSP=2 pytest tests/transformers/models/test_moe_prefill_blocked.py -v`

Update (0429): 
`export EXPERT_BLOCKING_PACKED_CHUNK_SIZE=256` for chunk PL of 512

**Update (0525):**
Configuration is now compile-API driven:
1. num_cores controls NSP parallelism.
2. moe_prefill_packed_chunk_size controls packed chunk size.
3. No EXPERT_BLOCKING_NUM_NSP / EXPERT_BLOCKING_PACKED_CHUNK_SIZE env
vars are required.

**Example:**
```
  qeff_model.compile(
      prefill_seq_len=512,
      ctx_len=...,
      num_cores=16,
      prefill_only=True,
      enable_chunking=True,
      moe_prefill_packed_chunk_size=256,
      ...
  )
```

**Notes:**
- Qwen3-MoE and GPT-OSS disaggregated serving examples are updated to
use PL=512 and packed chunk size=256.
  - The optimized path requires num_experts % num_cores == 0.
- Qwen3-MoE and GPT-OSS now use the same packed-chunk flow as the
standalone benchmark.
- torch.clamp is retained for bench alignment, with tensor bounds to
avoid QAIC Clip dtype issues.
- Subfunction-specific ReduceSum/Einsum cleanup is deferred and will be
handled separately.
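The divisibility requirement and the chunk arithmetic in these notes can be sketched in plain Python. The helper names are hypothetical, and assigning contiguous expert ids to each core is an assumption made for illustration only.

```python
from typing import List, Optional


def can_use_packed_path(num_experts: int, num_cores: int) -> bool:
    """The optimized path requires num_experts % num_cores == 0."""
    return num_cores > 0 and num_experts % num_cores == 0


def experts_per_core(num_experts: int, num_cores: int) -> Optional[List[List[int]]]:
    """Split expert ids evenly across cores, or return None when the split is uneven."""
    if not can_use_packed_path(num_experts, num_cores):
        return None  # caller would fall back to the per-expert loop
    per_core = num_experts // num_cores
    return [list(range(c * per_core, (c + 1) * per_core)) for c in range(num_cores)]


def num_packed_chunks(prefill_seq_len: int, packed_chunk_size: int) -> int:
    """Ceiling division: how many packed chunks cover one prefill window."""
    return -(-prefill_seq_len // packed_chunk_size)


groups = experts_per_core(num_experts=8, num_cores=2)
chunks = num_packed_chunks(prefill_seq_len=512, packed_chunk_size=256)
```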

**Validation:**
  ```
pytest -q \
  tests/transformers/models/test_moe_prefill_blocked.py::test_qwen3moe_blocked_forward_parity \
  tests/transformers/models/test_moe_prefill_blocked.py::test_qwen3moe_prefill_chunked_subfunction_export_contains_cumsum_custom_ops \
  tests/transformers/models/test_moe_prefill_blocked.py::test_gptoss_blocked_forward_parity \
  tests/transformers/models/test_moe_prefill_blocked.py::test_gptoss_prefill_chunked_export_traces_packed_chunks
```

  Also verified tiny non-subfunction QAIC compile for Qwen3-MoE and GPT-OSS with:
  - prefill_seq_len=512
  - moe_prefill_packed_chunk_size=256
  - num_cores=2

---------

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Adds support for two Qwen 3.5 and 3.6 VLM model architectures

---------

Signed-off-by: Mohit Soni <mohisoni@qti.qualcomm.com>
Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Co-authored-by: Mohit Soni <mohisoni@qti.qualcomm.com>
Co-authored-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Adds end-to-end support for Gemma 4, covering both the text-only
(ForCausalLM) and
multimodal (ForConditionalGeneration) variants, including dense and MoE
configurations, optimized chunked
  prefill, and disaggregated (vision/language split) serving.

---------

Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Co-authored-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Co-authored-by: Rishin Raj <rishinr@qti.qualcomm.com>
Adds embedding-model support for Qwen/Qwen3-VL-Embedding-8B.
  [MAD] CPU vs AI100 mean=1.585330e-05, max=3.049895e-04

---------

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Deprecate support for the meta-llama/Llama-3.2-11B-Vision model in
Efficient Transformers.

---------

Signed-off-by: Hem Agnihotri <hemagnih@qti.qualcomm.com>
…1027)

This PR fixes stale ONNX reuse in the shared compile path.

  Problem:
  - `_compile()` could reuse `self.onnx_path` from a previous compile.
- In disaggregated serving, decode compile and prefill/chunked prefill
compile need different ONNX graphs because the exported model
forward/transforms differ.
- Reusing decode ONNX for prefill could produce multiple QPCs from the
same graph, which is incorrect for MoE prefill optimization.
- For diffusion pipelines, compile can run after export, and some
modules such as `QEffVAE` do not accept generic export kwargs like
`offload_pt_weights`, causing CI failures when compile tried to
re-export.

  Fix:
- In `QEFFBaseModel._compile()`, reuse an existing ONNX only when
weights are already offloaded/meta, since re-export is not possible in
that state.
- Otherwise, export for the current compile mode so decode and
prefill/chunked prefill can produce distinct ONNX graphs.
- In diffusion pipeline modules, pass `onnx_path=self.onnx_path`
explicitly into `_compile()` so those modules compile the
already-exported graph and avoid unintended re-export.
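The decision described in the fix can be summarized as a small pure function. This is a sketch of the logic only: `choose_onnx_action` and its parameters are hypothetical and do not mirror the actual `_compile()` signature.

```python
from typing import Optional, Tuple


def choose_onnx_action(
    existing_onnx_path: Optional[str],
    weights_offloaded: bool,
    explicit_onnx_path: Optional[str] = None,
) -> Tuple[str, Optional[str]]:
    """Decide whether compile should use, reuse, or re-export an ONNX graph."""
    if explicit_onnx_path is not None:
        # Caller passed onnx_path explicitly (the diffusion pipeline case).
        return ("use", explicit_onnx_path)
    if existing_onnx_path is not None and weights_offloaded:
        # Re-export is impossible once weights are on the meta device.
        return ("reuse", existing_onnx_path)
    # Otherwise export for the current compile mode (decode vs prefill).
    return ("export", None)


decode_then_prefill = choose_onnx_action("decode.onnx", weights_offloaded=False)
after_offload = choose_onnx_action("decode.onnx", weights_offloaded=True)
diffusion_module = choose_onnx_action("vae.onnx", False, explicit_onnx_path="vae.onnx")
```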

  Validation:
- Qwen3-MoE disaggregated decode + chunked prefill ONNX regression
passed.
  - VLM subfunction regression passed.

---------

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
…es (#1030)

## Description

Expands quickcheck coverage for enable_proxy across causal LM,
embedding, sequence classification, Whisper, CTC, and dual-QPC VLM
paths. Confirms default enable_proxy=False behavior remains unchanged
and validates with the full quickcheck suite.

cc: @anujgupt-github @quic-rishinr

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Added Qwen 3.5, Qwen 3.6, and Gemma 4 models to the nightly list

---------

Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
…1023)

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>
Signed-off-by: Mohit Soni <mohisoni@qti.qualcomm.com>
Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>
Signed-off-by: abhishek-singh591 <sabhis@qti.qualcomm.com>
Co-authored-by: Mohit Soni <mohisoni@qti.qualcomm.com>
Co-authored-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Co-authored-by: Mamta Singh <mamtsing@qti.qualcomm.com>
@quic-akuruvil force-pushed the release/v1.22.0_tmp branch from 2c3d69c to 6b8db51 on June 4, 2026 16:40
tv-karthikeya and others added 3 commits June 5, 2026 10:09
This PR fixes the Qwen3-VL-MoE accuracy issue and a compiler issue with
subfunctions.
### Main Fix:  Deepstack Accuracy Fix
  The deepstack visual merge previously used:

```
  hidden_states = hidden_states.clone()
  mixed_embeds = hidden_states + visual_embeds

  local_this = torch.where(visual_pos_masks, mixed_embeds, hidden_states)

  return local_this
```

This has been changed to:

```
  visual_mask = visual_pos_masks.to(hidden_states.dtype)
  return hidden_states + (visual_embeds * visual_mask)
```

This is mathematically equivalent for boolean visual position masks, but
avoids the `torch.where` merge pattern. The previous `torch.where` form
compiled but produced poor accuracy. The mask-multiply form resolves the
observed accuracy issue.
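The equivalence can be checked on CPU with a short script. The tensor shapes are assumptions chosen for illustration (one boolean per token, broadcast across the hidden dimension); the real mask layout in the model may differ.

```python
import torch


def merge_with_where(hidden_states, visual_embeds, visual_pos_masks):
    # Previous form: select between mixed and original embeddings.
    hidden_states = hidden_states.clone()
    mixed_embeds = hidden_states + visual_embeds
    return torch.where(visual_pos_masks, mixed_embeds, hidden_states)


def merge_with_mask_multiply(hidden_states, visual_embeds, visual_pos_masks):
    # New form: zero out non-visual positions, then add.
    visual_mask = visual_pos_masks.to(hidden_states.dtype)
    return hidden_states + (visual_embeds * visual_mask)


torch.manual_seed(0)
hidden_states = torch.randn(2, 5, 8)
visual_embeds = torch.randn(2, 5, 8)
visual_pos_masks = torch.rand(2, 5, 1) > 0.5  # assumed mask layout

old_out = merge_with_where(hidden_states, visual_embeds, visual_pos_masks)
new_out = merge_with_mask_multiply(hidden_states, visual_embeds, visual_pos_masks)
forms_match = torch.equal(old_out, new_out)
```

The two forms agree only for finite `visual_embeds`: multiplying an `inf` or `nan` entry by a zero mask yields `nan`, whereas `torch.where` would discard it.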


  ### Summary for changes related to transformer upgrade v5.5.4
- Avoids calling the current HF **self.gate(x)** implementation inside
QEfficient MoE blocks, because the newer router internally performs
top-k normalization using a reduction with **.sum()** that can export as
unsupported **ReduceSum under -sub-functions**.
- Recreates the older raw-router-logits flow explicitly with F.linear,
then applies TopK and softmax over selected top-k logits.
- Updates decode sparse MoE expert weight gathering to use
pre-transposed expert weights before dynamic index_select, avoiding the
problematic Gather -> Transpose pattern on dynamically selected expert
tensors.

**Key Changes**
  - Router path changed from:

```
    gate_out = self.gate(x)
    router_logits, top_w, top_i = gate_out
```

 to:

```
    router_logits = F.linear(x.reshape(-1, self.gate.hidden_dim), self.gate.weight)
    top_w, top_i = torch.topk(router_logits, self.gate.top_k, dim=-1)
    top_w = F.softmax(top_w, dim=-1, dtype=torch.float)
```

  - Decode expert gather changed from:

```
    w_up = self.experts.gate_up_proj.index_select(0, idx)
    w_up = w_up.transpose(1, 2)
```

to:

`w_up = self.experts.gate_up_proj.transpose(1, 2).index_select(0, idx)`
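Both orderings select the same values, so the change affects only the exported op pattern. A quick check, with an assumed `[num_experts, hidden_dim, inter_dim]` weight shape and a standalone tensor standing in for `self.experts.gate_up_proj`:

```python
import torch

torch.manual_seed(0)
num_experts, hidden_dim, inter_dim = 4, 6, 10
gate_up_proj = torch.randn(num_experts, hidden_dim, inter_dim)  # assumed shape
idx = torch.tensor([2, 0])  # dynamically selected expert ids

# Old order: gather the selected experts, then transpose the gathered result.
w_up_old = gate_up_proj.index_select(0, idx).transpose(1, 2)
# New order: transpose the full weight first, then gather.
w_up_new = gate_up_proj.transpose(1, 2).index_select(0, idx)

orders_match = torch.equal(w_up_old, w_up_new)
```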

Why
- Newer transformers Qwen3-VL-MoE router returns (router_logits,
router_scores, router_indices) and performs normalization internally.
- That normalization introduced reduction ops in the exported ONNX,
which caused QAIC compile failures (segfaults).
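The explicit router flow yields the same top-k weights as a softmax-then-renormalize router while avoiding the `.sum()` reduction. The check below assumes the newer router computes a softmax over all experts, takes the top-k, and divides by their sum. That assumed reference is written out by hand here and is not the Hugging Face code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden_dim, num_experts, top_k = 8, 6, 2
x = torch.randn(2, 3, hidden_dim)
gate_weight = torch.randn(num_experts, hidden_dim)  # stands in for self.gate.weight

# Explicit flow from the PR: raw logits, TopK, softmax over the selected logits.
router_logits = F.linear(x.reshape(-1, hidden_dim), gate_weight)
top_w, top_i = torch.topk(router_logits, top_k, dim=-1)
top_w = F.softmax(top_w, dim=-1, dtype=torch.float)

# Assumed reference: softmax over all experts, TopK, renormalize with a sum.
probs = F.softmax(router_logits, dim=-1, dtype=torch.float)
ref_w, ref_i = torch.topk(probs, top_k, dim=-1)
ref_w = ref_w / ref_w.sum(dim=-1, keepdim=True)

same_experts = torch.equal(top_i, ref_i)
same_weights = torch.allclose(top_w, ref_w, atol=1e-6)
```

The match follows because the full-softmax denominator cancels when the top-k probabilities are renormalized.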

---------

Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com>
In the latest TF upgrade, `past_seen_tokens` was calculated every time,
causing a performance drop. This commit reverts that change.

Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Added bug fix for layerwise export

---------

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>
Signed-off-by: abhishek-singh591 <sabhis@qti.qualcomm.com>
Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com>
Co-authored-by: vtirumal <vtirumal@qti.qualcomm.com>