Reranker & Embedding: single-QPC support with KV cache eliminated #1045
Draft
quic-amitraj wants to merge 14 commits into
Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Signed-off-by: Amit <amitraj@qti.qualcomm.com> Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Mirror of the reranker fix: Qwen3-VL embedding is single-shot prefill (reads last-token hidden state as embedding vector, no decode loop). `get_compile_specs` now returns ctx_len == prefill_seq_len, triggering Solution A in modeling_auto.py to compile only the Prefill kernel.

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
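The `ctx_len == prefill_seq_len` trigger described in this commit can be sketched roughly as follows. This is an illustrative sketch only, not the code in modeling_auto.py: the helper name `build_specializations` and the dictionary keys are hypothetical.

```python
from typing import Dict, List


def build_specializations(
    prefill_seq_len: int, ctx_len: int, batch_size: int = 1
) -> List[Dict[str, int]]:
    """Hypothetical helper: list the specializations to compile."""
    prefill = {"batch_size": batch_size, "seq_len": prefill_seq_len, "ctx_len": ctx_len}
    if prefill_seq_len == ctx_len:
        # Single-shot model: the whole context is consumed in one prefill
        # pass, so no decode specialization is emitted.
        return [prefill]
    # Generative model: also compile a one-token-per-step decode specialization.
    decode = {"batch_size": batch_size, "seq_len": 1, "ctx_len": ctx_len}
    return [prefill, decode]
```

Under these assumptions, `build_specializations(128, 128)` yields a single prefill entry, while `build_specializations(128, 4096)` yields a prefill and a decode entry.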
… simplify config path

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Three bugs in QEffQwen3VLForConditionalGeneration.forward (single-QPC path):
1. self.language_model -> self.model.language_model (attribute error)
2. indices0: wrong batch dim via unsqueeze(0) -> use selected.shape[0] with device
3. get_onnx_dynamic_axes: remove deepstack_features from single-QPC axes (computed internally by vision encoder, not a direct ONNX input)

get_specializations kv_offload=False: add grid_height/width/h/w/time to lang specs so qaic-compile can resolve pixel_values dynamic symbols.

modeling_auto.py single-QPC compile: apply Solution A (prefill-only spec) and compile without -retained-state for single-shot models to avoid pixel_values / pixel_values_RetainedState shape mismatch.

reranker_model.py: add _run_ai100_single_qpc_prefill and update process() to dispatch on isinstance(qpc_paths, dict) for dual vs single QPC.

Unit tests: add three tests covering dual/single QPC dispatch.

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Four bugs in QEffQwen3VLForConditionalGeneration (single-QPC path):
1. self.language_model -> self.model.language_model
2. indices0 wrong batch dim (unsqueeze) -> selected.shape[0] + device
3. get_onnx_dynamic_axes: drop deepstack_features from single-QPC axes
4. get_specializations: add grid_height/width/h/w/time for pixel_values

modeling_auto.py single-QPC compile:
- Solution A (prefill-only spec when prefill_seq_len == ctx_len)
- retained_state=False for single-shot to avoid pixel_values shape mismatch

reranker_model.py:
- _run_ai100_single_qpc_prefill: runs fused session with explicit zero KV buffers (retained_state=False requires host-managed buffers)
- process(): dispatch on isinstance(qpc_paths, dict) dual vs single QPC

Unit tests: three new tests covering dual/single QPC dispatch

Note: KV cache removal from single-shot ONNX is a future optimization requiring KVCacheTransform changes (tracked as TODO in code).

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Summary

Adds single-QPC (`kv_offload=False`) support for Qwen3-VL reranker and embedding models, with the KV cache fully removed from the compiled binary. Both models run a single prefill pass with no decode loop, so KV state is unnecessary overhead.

Key change: Session inputs drop from 60+ to 4 (`image_idx`, `input_ids`, `pixel_values`, `position_ids`).

Changes

Bug fixes (blocked single-QPC export)
- `QEffQwen3VLForConditionalGeneration.forward`: `self.language_model` → `self.model.language_model`; `indices0` batch-dim fix; `past_key_values=None` default
- `QEffQwen3VLTextModel.forward`: honour the `use_cache` parameter; fix `target_length=0` when there is no KV cache
- `attention_blocking.py`: initialize `cache_kwargs = {}` before the `if past_key_value is not None:` block

KV cache removal (`_is_single_shot_mode` gate)

Triggered by `qaic_config={"no_kv_cache": True}` (reranker) or `qaic_config={"export_embedding": True}` (embedding):
- removes `past_key.*` from ONNX inputs, output names, dynamic axes, and specializations
- compiles with `retained_state=False` and a single Prefill specialization
- sets `custom_io["pixel_values"]` explicitly (no retained-state to derive it from)

Applies to both `reranker_model.py` and `_embedding_utils.py`.
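The name-based filtering described under KV cache removal can be illustrated with a short sketch. It is not the PR's implementation: the helpers `strip_kv_names` and `strip_kv_axes` are hypothetical, and the sketch assumes KV tensors are named with a `past_key.` or `past_value.` prefix (the description only mentions `past_key.*`).

```python
import re
from typing import Dict, List

# Assumption: KV cache tensors are named like "past_key.0" / "past_value.0",
# optionally with a suffix such as "_RetainedState" on outputs.
KV_NAME = re.compile(r"^past_(key|value)\.")


def strip_kv_names(names: List[str]) -> List[str]:
    """Drop KV cache entries from a list of ONNX input or output names."""
    return [n for n in names if not KV_NAME.match(n)]


def strip_kv_axes(dynamic_axes: Dict[str, Dict[int, str]]) -> Dict[str, Dict[int, str]]:
    """Drop KV cache entries from a dynamic-axes mapping keyed by tensor name."""
    return {k: v for k, v in dynamic_axes.items() if not KV_NAME.match(k)}


inputs = ["image_idx", "input_ids", "pixel_values", "position_ids", "past_key.0", "past_value.0"]
remaining = strip_kv_names(inputs)
```

With the example list above, `remaining` holds only the four session inputs named in the summary.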