Reranker & Embedding: single-QPC support with KV cache eliminated by quic-amitraj · Pull Request #1045 · quic/efficient-transformers

quic-amitraj · 2026-06-05T10:03:09Z

Summary

Adds single-QPC (kv_offload=False) support for Qwen3-VL reranker and embedding models, with the KV cache fully removed from the compiled binary. Both models run a single prefill pass — no decode loop — so KV state is unnecessary overhead.

Key change: Session inputs drop from 60+ to 4 (image_idx, input_ids, pixel_values, position_ids).

Changes

Bug fixes (blocked single-QPC export)

QEffQwen3VLForConditionalGeneration.forward: self.language_model → self.model.language_model; indices0 batch-dim fix; past_key_values=None default
QEffQwen3VLTextModel.forward: honour use_cache parameter; fix target_length=0 when no KV cache
attention_blocking.py: initialize cache_kwargs = {} before the if past_key_value is not None: block
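The attention_blocking.py fix addresses a common Python pitfall: a name bound only inside a conditional branch is undefined when the branch is skipped. Below is a minimal sketch of that pattern. The function `build_cache_kwargs` and its `position_ids` argument are hypothetical stand-ins, not the code in attention_blocking.py.

```python
def build_cache_kwargs(past_key_value=None, position_ids=None):
    """Hypothetical stand-in for the branch fixed in attention_blocking.py."""
    # Bind the name before the branch so it exists even when no KV cache
    # is passed (the single-shot, cache-free path).
    cache_kwargs = {}
    if past_key_value is not None:
        # Only populated when a cache object is actually present.
        cache_kwargs = {"position_ids": position_ids}
    return cache_kwargs
```

Without the up-front initialization, any later read of `cache_kwargs` on the cache-free path would raise `UnboundLocalError`.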

KV cache removal (`_is_single_shot_mode` gate)

Triggered by qaic_config={"no_kv_cache": True} (reranker) or qaic_config={"export_embedding": True} (embedding):

Strips past_key.* from ONNX inputs, output names, dynamic axes, and specializations
Compiles with retained_state=False and a single Prefill specialization
Sets custom_io["pixel_values"] explicitly (no retained-state to derive it from)

Applies to both reranker_model.py and _embedding_utils.py.
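The gate and the stripping step can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the helper names are hypothetical, and the sketch assumes KV tensors are named with a `past_key.` or `past_value.` prefix.

```python
import re

# Assumption: KV-cache tensors are named like "past_key.0" / "past_value.0".
PAST_KV_PATTERN = re.compile(r"^past_(key|value)\.")


def is_single_shot_mode(qaic_config):
    """Hypothetical gate: true when either config flag from the PR is set."""
    cfg = qaic_config or {}
    return bool(cfg.get("no_kv_cache") or cfg.get("export_embedding"))


def strip_kv_names(names):
    """Drop KV-cache tensor names from a list of input or output names."""
    return [name for name in names if not PAST_KV_PATTERN.match(name)]


def strip_kv_axes(dynamic_axes):
    """Drop KV-cache entries from a dynamic-axes mapping."""
    return {
        name: axes
        for name, axes in dynamic_axes.items()
        if not PAST_KV_PATTERN.match(name)
    }
```

Applied to a list holding the four session inputs plus per-layer KV entries, `strip_kv_names` returns only `image_idx`, `input_ids`, `pixel_values`, and `position_ids`.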

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Mirror of the reranker fix: Qwen3-VL embedding is single-shot prefill (reads last-token hidden state as embedding vector, no decode loop). `get_compile_specs` now returns ctx_len == prefill_seq_len, triggering Solution A in modeling_auto.py to compile only the Prefill kernel.

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
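A rough sketch of the prefill-only rule described in this commit message: when the context length equals the prefill sequence length, no decode specialization is needed. The function name and the dictionary keys are assumptions made for illustration; the actual logic in modeling_auto.py may differ.

```python
def build_specializations(batch_size, prefill_seq_len, ctx_len):
    """Hypothetical helper: list the specializations to compile."""
    prefill = {
        "batch_size": batch_size,
        "seq_len": prefill_seq_len,
        "ctx_len": ctx_len,
    }
    if prefill_seq_len == ctx_len:
        # Single-shot model: one prefill pass covers the whole context,
        # so a decode specialization would never be used.
        return [prefill]
    # Autoregressive model: also compile a one-token decode specialization.
    decode = {"batch_size": batch_size, "seq_len": 1, "ctx_len": ctx_len}
    return [prefill, decode]
```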

Three bugs in QEffQwen3VLForConditionalGeneration.forward (single-QPC path):
1. self.language_model -> self.model.language_model (attribute error)
2. indices0: wrong batch dim via unsqueeze(0) -> use selected.shape[0] with device
3. get_onnx_dynamic_axes: remove deepstack_features from single-QPC axes (computed internally by vision encoder, not a direct ONNX input)

get_specializations kv_offload=False: add grid_height/width/h/w/time to lang specs so qaic-compile can resolve pixel_values dynamic symbols.

modeling_auto.py single-QPC compile: apply Solution A (prefill-only spec) and compile without -retained-state for single-shot models to avoid pixel_values / pixel_values_RetainedState shape mismatch.

reranker_model.py: add _run_ai100_single_qpc_prefill and update process() to dispatch on isinstance(qpc_paths, dict) for dual vs single QPC.

Unit tests: add three tests covering dual/single QPC dispatch.

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Four bugs in QEffQwen3VLForConditionalGeneration (single-QPC path):
1. self.language_model -> self.model.language_model
2. indices0 wrong batch dim (unsqueeze) -> selected.shape[0] + device
3. get_onnx_dynamic_axes: drop deepstack_features from single-QPC axes
4. get_specializations: add grid_height/width/h/w/time for pixel_values

modeling_auto.py single-QPC compile:
- Solution A (prefill-only spec when prefill_seq_len == ctx_len)
- retained_state=False for single-shot to avoid pixel_values shape mismatch

reranker_model.py:
- _run_ai100_single_qpc_prefill: runs fused session with explicit zero KV buffers (retained_state=False requires host-managed buffers)
- process(): dispatch on isinstance(qpc_paths, dict) dual vs single QPC

Unit tests: three new tests covering dual/single QPC dispatch

Note: KV cache removal from single-shot ONNX is a future optimization requiring KVCacheTransform changes (tracked as TODO in code).

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
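The dispatch rule in process() can be illustrated with a small sketch: a dict of QPC paths implies the dual-QPC layout, and anything else is treated as a single fused QPC. The function name `pick_qpc_mode`, the example dict keys, and the paths are hypothetical; only the isinstance check comes from the commit message.

```python
from pathlib import Path
from typing import Dict, Union

# Assumption: dual-QPC callers pass a mapping of component name to path,
# single-QPC callers pass one path.
QpcPaths = Union[str, Path, Dict[str, str]]


def pick_qpc_mode(qpc_paths: QpcPaths) -> str:
    """Hypothetical dispatcher mirroring the isinstance(qpc_paths, dict) check."""
    if isinstance(qpc_paths, dict):
        # Separate vision and language binaries.
        return "dual"
    # One fused binary, run as a single prefill pass.
    return "single"
```

For example, `pick_qpc_mode({"vision": "v.qpc", "lang": "l.qpc"})` selects the dual path, while `pick_qpc_mode("fused.qpc")` selects the single-QPC path.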

quic-amitraj added 14 commits June 4, 2026 21:28

Enabling support of rerankers models 2B and 8B of qwen3vl bucket

08bb022

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Functionality changes to PR and rebase with main branch

711fd81

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Addressed comments and fix CI issue

612ed3e

Signed-off-by: Amit <amitraj@qti.qualcomm.com> Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Updated installation of qwen-vl-utils

c4334c1

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Addressed comments

eee7098

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Rebased and addressed comments

7d1e2f4

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Initial fix

15d0ff1

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Update the example script and modelling files

28dc773

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Address review comments: use ONNX_EXPORT_EXAMPLE_SEQ_LEN constant and…

5fbc0d8

… simplify config path Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Removed past-key values from onnx and qpcs input output

3b433f5

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Enabled embedding as well

52be851

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

quic-amitraj changed the title ~~Reranker single qpc~~ Reranker & Embedding: single-QPC support with KV cache eliminated Jun 6, 2026

quic-amitraj wants to merge 14 commits into quic:release/v1.22.0_tmp from quic-amitraj:reranker_single_qpc
