Reranker & Embedding: single-QPC support with KV cache eliminated #1045

Draft
quic-amitraj wants to merge 14 commits into quic:release/v1.22.0_tmp from quic-amitraj:reranker_single_qpc

@quic-amitraj quic-amitraj commented Jun 5, 2026

Summary

Adds single-QPC (kv_offload=False) support for Qwen3-VL reranker and embedding models, with the KV cache fully removed from the compiled binary. Both models run a single prefill pass — no decode loop — so KV state is unnecessary overhead.

Key change: Session inputs drop from 60+ to 4 (image_idx, input_ids, pixel_values, position_ids).


Changes

Bug fixes (blocked single-QPC export)

  • QEffQwen3VLForConditionalGeneration.forward: self.language_model -> self.model.language_model (attribute error); indices0 batch dimension taken from selected.shape[0] (with device) instead of unsqueeze(0); past_key_values defaults to None
  • QEffQwen3VLTextModel.forward: honour use_cache parameter; fix target_length=0 when no KV cache
  • attention_blocking.py: initialize cache_kwargs = {} before the if past_key_value is not None: block
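
As a rough illustration of the third fix, here is a minimal sketch of the pattern. The function name `build_cache_kwargs` and the `position_ids` key are hypothetical stand-ins, not the actual code in attention_blocking.py; the point is only that the name must be bound before the conditional so the no-KV-cache path does not hit an unbound variable.

```python
def build_cache_kwargs(past_key_value, position_ids):
    """Hypothetical stand-in for the cache-kwargs setup in an attention block."""
    # Bind the name up front so it exists even when there is no KV cache.
    cache_kwargs = {}
    if past_key_value is not None:
        # Only populated when a cache object is actually passed in.
        cache_kwargs = {"position_ids": position_ids}
    return cache_kwargs


# No cache (single-shot prefill): returns an empty dict instead of raising.
print(build_cache_kwargs(None, [0, 1, 2]))
# With a cache object: returns the populated kwargs.
print(build_cache_kwargs(object(), [0, 1, 2]))
```

Without the up-front assignment, any later read of `cache_kwargs` on the `past_key_value is None` path would raise `UnboundLocalError`, which is presumably what blocked the single-QPC export.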

KV cache removal (_is_single_shot_mode gate)

Triggered by qaic_config={"no_kv_cache": True} (reranker) or qaic_config={"export_embedding": True} (embedding):

  • Strips past_key.* from ONNX inputs, output names, dynamic axes, and specializations
  • Compiles with retained_state=False and a single Prefill specialization
  • Sets custom_io["pixel_values"] explicitly (no retained-state to derive it from)

Applies to both reranker_model.py and _embedding_utils.py.
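
The gate and the stripping step described above could look roughly like the sketch below. This is not the code from the PR: `is_single_shot_mode` and `strip_kv` are illustrative helpers, the `logits` output name is a placeholder, and it assumes KV tensors are named `past_key.<i>` / `past_value.<i>` with retained-state outputs sharing that prefix.

```python
import re

# Assumption: KV tensors are named past_key.<i> / past_value.<i>, and their
# retained-state outputs start with the same prefix.
KV_NAME = re.compile(r"^past_(key|value)\.")


def is_single_shot_mode(qaic_config):
    """Hypothetical gate: True when either trigger flag is set."""
    cfg = qaic_config or {}
    return bool(cfg.get("no_kv_cache") or cfg.get("export_embedding"))


def strip_kv(input_names, output_names, dynamic_axes):
    """Return copies of the export metadata without KV-cache entries."""
    inputs = [n for n in input_names if not KV_NAME.match(n)]
    outputs = [n for n in output_names if not KV_NAME.match(n)]
    axes = {n: a for n, a in dynamic_axes.items() if not KV_NAME.match(n)}
    return inputs, outputs, axes


# Illustrative metadata for a one-layer model (names are placeholders).
input_names = ["input_ids", "position_ids", "pixel_values", "image_idx",
               "past_key.0", "past_value.0"]
output_names = ["logits", "past_key.0_RetainedState", "past_value.0_RetainedState"]
dynamic_axes = {
    "input_ids": {0: "batch_size", 1: "seq_len"},
    "past_key.0": {0: "batch_size", 2: "ctx_len"},
    "past_value.0": {0: "batch_size", 2: "ctx_len"},
}

if is_single_shot_mode({"no_kv_cache": True}):
    input_names, output_names, dynamic_axes = strip_kv(
        input_names, output_names, dynamic_axes
    )

print(sorted(input_names))
```

After stripping, the only remaining inputs in this toy example are the four session inputs named in the summary.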


Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Signed-off-by: Amit <amitraj@qti.qualcomm.com>
Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Mirror of the reranker fix: Qwen3-VL embedding is single-shot prefill
(reads last-token hidden state as embedding vector, no decode loop).
`get_compile_specs` now returns ctx_len == prefill_seq_len, triggering
Solution A in modeling_auto.py to compile only the Prefill kernel.

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
… simplify config path

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Three bugs in QEffQwen3VLForConditionalGeneration.forward (single-QPC path):
1. self.language_model -> self.model.language_model (attribute error)
2. indices0: wrong batch dim via unsqueeze(0) -> use selected.shape[0] with device
3. get_onnx_dynamic_axes: remove deepstack_features from single-QPC axes
   (computed internally by vision encoder, not a direct ONNX input)

get_specializations kv_offload=False: add grid_height/width/h/w/time to
lang specs so qaic-compile can resolve pixel_values dynamic symbols.

modeling_auto.py single-QPC compile: apply Solution A (prefill-only spec)
and compile without -retained-state for single-shot models to avoid
pixel_values / pixel_values_RetainedState shape mismatch.

reranker_model.py: add _run_ai100_single_qpc_prefill and update process()
to dispatch on isinstance(qpc_paths, dict) for dual vs single QPC.

Unit tests: add three tests covering dual/single QPC dispatch.

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Four bugs in QEffQwen3VLForConditionalGeneration (single-QPC path):
1. self.language_model -> self.model.language_model
2. indices0 wrong batch dim (unsqueeze) -> selected.shape[0] + device
3. get_onnx_dynamic_axes: drop deepstack_features from single-QPC axes
4. get_specializations: add grid_height/width/h/w/time for pixel_values

modeling_auto.py single-QPC compile:
- Solution A (prefill-only spec when prefill_seq_len == ctx_len)
- retained_state=False for single-shot to avoid pixel_values shape mismatch

reranker_model.py:
- _run_ai100_single_qpc_prefill: runs fused session with explicit zero KV
  buffers (retained_state=False requires host-managed buffers)
- process(): dispatch on isinstance(qpc_paths, dict) dual vs single QPC

Unit tests: three new tests covering dual/single QPC dispatch

Note: KV cache removal from single-shot ONNX is a future optimization
requiring KVCacheTransform changes (tracked as TODO in code).

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
@quic-amitraj quic-amitraj changed the title Reranker single qpc Reranker & Embedding: single-QPC support with KV cache eliminated Jun 6, 2026