Reranker & Embedding: Qwen3-VL single-shot inference with single-specialization compile #1031

Open
quic-amitraj wants to merge 11 commits into quic:release/v1.22.0_tmp from quic-amitraj:bugfix_2

@quic-amitraj quic-amitraj commented Jun 3, 2026

Summary

This PR adds end-to-end AI100 inference support for Qwen3-VL multimodal reranker and embedding models, and fixes the compile pipeline so both model types always produce exactly one QPC specialization (Prefill only — no wasted Decode kernel).
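The single-specialization rule can be sketched as follows. This is a minimal illustration, not the code in this PR: `build_specializations` and its dictionary keys are hypothetical names, and the `ctx_len == prefill_seq_len` condition is taken from the commit notes in this conversation.

```python
from typing import Dict, List


def build_specializations(
    prefill_seq_len: int, ctx_len: int, batch_size: int = 1
) -> List[Dict[str, int]]:
    """Hypothetical helper: list the specializations a compile would request.

    Assumption: a model whose ctx_len equals prefill_seq_len never runs a
    decode loop, so only the Prefill specialization is needed.
    """
    if ctx_len < prefill_seq_len:
        raise ValueError("ctx_len must be >= prefill_seq_len")

    prefill = {"batch_size": batch_size, "seq_len": prefill_seq_len, "ctx_len": ctx_len}
    if ctx_len == prefill_seq_len:
        # Single-shot model (reranker / embedding): Prefill only.
        return [prefill]

    # Generative model: Prefill plus a one-token Decode specialization.
    decode = {"batch_size": batch_size, "seq_len": 1, "ctx_len": ctx_len}
    return [prefill, decode]
```

Under these assumptions, `build_specializations(128, 128)` yields one entry, while `build_specializations(128, 4096)` yields two.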

Supported models:

  • Qwen/Qwen3-VL-Reranker-2B and Qwen/Qwen3-VL-Reranker-8B
  • Qwen/Qwen3-VL-Embedding-8B

Test Results

| Model | Type | MAD mean | MAD max | Threshold | Status |
|---|---|---|---|---|---|
| Qwen3-VL-Reranker-2B | Reranker | 2.16e-03 | 4.03e-03 | 5e-03 | ✅ Pass |
| Qwen3-VL-Embedding-8B | Embedding | 3.62e-05 | 1.62e-03 | 2e-03 | ✅ Pass |
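For readers reproducing these numbers, a check of this kind might look like the sketch below. Two assumptions are mine, not stated in the PR: "MAD" is read as the absolute difference between reference outputs and AI100 outputs, and the threshold is compared against the maximum (which is consistent with both rows above). The function names are hypothetical.

```python
from typing import Sequence, Tuple


def abs_diff_stats(
    reference: Sequence[float], candidate: Sequence[float]
) -> Tuple[float, float]:
    """Return (mean, max) of the element-wise absolute difference."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return sum(diffs) / len(diffs), max(diffs)


def passes_threshold(
    reference: Sequence[float], candidate: Sequence[float], threshold: float
) -> bool:
    """Assumed pass criterion: the maximum absolute difference is within threshold."""
    _, diff_max = abs_diff_stats(reference, candidate)
    return diff_max <= threshold
```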

@quic-amitraj quic-amitraj changed the title from "Rerankers refomating" to "Reranker: Qwen3-VL reranker support with single-specialization compile" Jun 4, 2026
@quic-amitraj quic-amitraj marked this pull request as ready for review June 4, 2026 14:37
Comment thread QEfficient/transformers/models/whisper/modeling_whisper.py Outdated
Comment thread tests/unit_test/models/reranker/test_reranker_models_unit.py Outdated
@quic-amitraj quic-amitraj self-assigned this Jun 4, 2026
@quic-amitraj quic-amitraj added labels Jun 4, 2026: embedding (This label is for all the PR related to embedding model.), reranker (This label is for all the PR related to reranker model.), 1.22 (Release 1.22 candidate)
@quic-amitraj quic-amitraj changed the title Reranker: Qwen3-VL reranker support with single-specialization compile Reranker & Embedding: Qwen3-VL single-shot inference with single-specialization compile Jun 4, 2026
Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Mirror of the reranker fix: Qwen3-VL embedding is single-shot prefill
(reads last-token hidden state as embedding vector, no decode loop).
`get_compile_specs` now returns ctx_len == prefill_seq_len, triggering
Solution A in modeling_auto.py to compile only the Prefill kernel.

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
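The "last-token hidden state as embedding vector" step mentioned in the commit message above can be illustrated with a small, framework-free sketch. The function name, the attention-mask handling, and the optional L2 normalization are assumptions for illustration and are not taken from this PR.

```python
import math
from typing import List, Sequence


def last_token_embedding(
    hidden_states: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
    normalize: bool = True,
) -> List[float]:
    """Pick the hidden state of the last non-padded token as the embedding.

    hidden_states: one vector per token position (seq_len x hidden_dim).
    attention_mask: 1 for real tokens, 0 for padding.
    """
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must align")
    valid = [i for i, m in enumerate(attention_mask) if m]
    if not valid:
        raise ValueError("attention_mask has no valid tokens")

    vec = list(hidden_states[valid[-1]])
    if normalize:
        # L2 normalization is an assumption; it is common for embedding models.
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            vec = [x / norm for x in vec]
    return vec
```

Because the vector comes from a single forward pass over the prompt, no decode loop is involved, which is why only the Prefill kernel is needed.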
… simplify config path

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
@quic-amitraj (Contributor, Author) commented

@quic-rishinr @vbaddi Please review; I have added a few more changes:

  1. Removed the keys and values (KV) from the inputs and outputs of the single-shot inference model.
  2. Removed dual specialization from the language model.
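As a rough illustration of the first change, dropping KV tensors from a model's I/O name lists could look like the sketch below. The `past_key` / `past_value` name prefixes and the `strip_kv_names` helper are assumptions for illustration; the actual names and mechanism used in this PR may differ.

```python
from typing import Iterable, List

# Assumed naming convention for KV-cache tensors; adjust to the real model.
KV_PREFIXES = ("past_key", "past_value")


def strip_kv_names(names: Iterable[str]) -> List[str]:
    """Return the I/O names with every KV-cache tensor name removed."""
    return [name for name in names if not name.startswith(KV_PREFIXES)]
```

For a single-shot model the cache is never read back by a later decode step, so exporting it as an input or output only adds transfer and memory cost.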

…out kv input output

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>