Reranker & Embedding: Qwen3-VL single-shot inference with single-specialization compile by quic-amitraj · Pull Request #1031 · quic/efficient-transformers

quic-amitraj · 2026-06-03T19:37:18Z

Summary

This PR adds end-to-end AI100 inference support for Qwen3-VL multimodal reranker and embedding models, and fixes the compile pipeline so both model types always produce exactly one QPC specialization (Prefill only — no wasted Decode kernel).

Supported models:

Qwen/Qwen3-VL-Reranker-2B and Qwen/Qwen3-VL-Reranker-8B
Qwen/Qwen3-VL-Embedding-8B

Test Results

Model	Type	MAD mean	MAD max	Threshold	Status
Qwen3-VL-Reranker-2B	Reranker	2.16e-03	4.03e-03	5e-03	✅ Pass
Qwen3-VL-Embedding-8B	Embedding	3.62e-05	1.62e-03	2e-03	✅ Pass
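
The pass/fail check behind this table can be sketched as follows. This is a minimal illustration, not the PR's test code: it assumes MAD is the element-wise absolute difference between a reference output (for example PyTorch) and the AI100 output, and that the max value is what is compared with the threshold. The names `mad_stats` and `passes` are hypothetical.

```python
from typing import Sequence, Tuple


def mad_stats(reference: Sequence[float], candidate: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, max) of the element-wise absolute differences."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return sum(diffs) / len(diffs), max(diffs)


def passes(reference: Sequence[float], candidate: Sequence[float], threshold: float) -> bool:
    # Assumption: the max absolute difference is the value checked against the threshold.
    _, mad_max = mad_stats(reference, candidate)
    return mad_max <= threshold
```

Under that assumption, both rows pass because their MAD max (4.03e-03 and 1.62e-03) is below the respective threshold (5e-03 and 2e-03).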

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Mirror of the reranker fix: the Qwen3-VL embedding model runs as a single-shot prefill (it reads the last-token hidden state as the embedding vector and has no decode loop). `get_compile_specs` now returns ctx_len == prefill_seq_len, which triggers Solution A in modeling_auto.py and compiles only the Prefill kernel.

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
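
The single-specialization rule described in this commit message can be sketched roughly as below. This is an illustration under stated assumptions, not the code in modeling_auto.py: the function name `build_specializations` and the dictionary keys are hypothetical, and only the condition ctx_len == prefill_seq_len comes from the PR text.

```python
from typing import Dict, List


def build_specializations(batch_size: int, prefill_seq_len: int, ctx_len: int) -> List[Dict[str, int]]:
    """Return the list of specializations to compile (hypothetical layout)."""
    prefill = {"batch_size": batch_size, "seq_len": prefill_seq_len, "ctx_len": ctx_len}
    if ctx_len == prefill_seq_len:
        # Single-shot model: the whole context is consumed in one prefill pass,
        # so no decode specialization is needed.
        return [prefill]
    # Autoregressive model: add a decode specialization that processes one token per step.
    decode = {"batch_size": batch_size, "seq_len": 1, "ctx_len": ctx_len}
    return [prefill, decode]
```

With `prefill_seq_len` equal to `ctx_len` the sketch yields a single entry, which corresponds to the one-specialization QPC the Summary describes.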

quic-amitraj · 2026-06-05T06:55:36Z

@quic-rishinr @vbaddi Please review; I have added a few more changes:

Removed the key and value tensors from the inputs and outputs of the single-shot inference model.
Removed dual specialization from the language model.
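
As a rough illustration of the first change, dropping KV-cache tensors from a model's input and output name lists might look like the sketch below. The `past_key.` and `past_value.` prefixes and the helper name `drop_kv_names` are assumptions made for this example; the PR excerpt does not show the actual names or code.

```python
from typing import Iterable, List

# Assumed naming convention for KV-cache tensors; not confirmed by the PR text.
KV_PREFIXES = ("past_key.", "past_value.")


def drop_kv_names(names: Iterable[str]) -> List[str]:
    """Keep only tensor names that do not belong to the KV cache."""
    return [name for name in names if not name.startswith(KV_PREFIXES)]
```

A single-shot model never runs a decode step, so it has no later pass that would read the cached keys and values.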

quic-amitraj changed the title ~~Rerankers refomating~~ Reranker: Qwen3-VL reranker support with single-specialization compile Jun 4, 2026

quic-amitraj force-pushed the bugfix_2 branch from ed4f805 to a49d025 Compare June 4, 2026 14:05

quic-amitraj requested review from quic-rishinr and vbaddi June 4, 2026 14:21

quic-amitraj marked this pull request as ready for review June 4, 2026 14:37

vbaddi requested changes Jun 4, 2026

Comment thread QEfficient/transformers/models/whisper/modeling_whisper.py Outdated

Comment thread tests/unit_test/models/reranker/test_reranker_models_unit.py Outdated

quic-amitraj self-assigned this Jun 4, 2026

quic-amitraj added the embedding, reranker, and 1.22 labels Jun 4, 2026

quic-amitraj changed the title ~~Reranker: Qwen3-VL reranker support with single-specialization compile~~ Reranker & Embedding: Qwen3-VL single-shot inference with single-specialization compile Jun 4, 2026

quic-amitraj added 8 commits June 4, 2026 21:28

Enable support for the 2B and 8B reranker models of the Qwen3-VL family

08bb022

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Functionality changes to PR and rebase with main branch

711fd81

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Addressed comments and fix CI issue

612ed3e

Signed-off-by: Amit <amitraj@qti.qualcomm.com> Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Updated installation of qwen-vl-utils

c4334c1

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Addressed comments

eee7098

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Rebased and addressed comments

7d1e2f4

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Initial fix

15d0ff1

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Update the example script and modeling files

28dc773

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

quic-amitraj force-pushed the bugfix_2 branch from fd5287c to bc86f85 Compare June 4, 2026 15:58

quic-amitraj force-pushed the bugfix_2 branch from bc86f85 to fc44abe Compare June 4, 2026 16:01

Address review comments: use ONNX_EXPORT_EXAMPLE_SEQ_LEN constant and simplify config path

5fbc0d8

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

Added support for embedding and reranker model export and compile without KV input/output

e14f780

Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>

quic-amitraj force-pushed the bugfix_2 branch from de9b363 to e14f780 Compare June 5, 2026 06:58

quic-amitraj wants to merge 11 commits into quic:release/v1.22.0_tmp from quic-amitraj:bugfix_2