feat(0506): Add optional KV-cache buffer-name prefix for vLLM disaggr… by vbaddi · Pull Request #1046 · quic/efficient-transformers

vbaddi · 2026-06-05T10:22:17Z

Summary

Adds an optional kv_cache_prefix argument to compile() on the supported model classes (QEFFAutoModelForCausalLM, single-QPC VLM, dual-QPC VLM).
When provided, injects the token as an infix into KV-cache retained-state buffer names:
past_key.{i}_RetainedState → past_key.{i}_<prefix>_RetainedState, enabling vLLM to
regex-select LLM KV buffers for disaggregated device-to-device transfer.
Vision/multimodal retained buffers are never renamed. Default behaviour is unchanged.
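
The renaming rule and the regex selection it enables can be sketched as follows. This is a minimal illustration, not the PR's implementation: the helper name add_kv_prefix, the KV_NAME pattern, and the consumer-side select_kv pattern are assumptions introduced here, and only the past_key names quoted above are covered.

```python
import re
from typing import Optional

# Assumed shape of an unprefixed LLM KV buffer name: past_key.<layer>[_RetainedState]
KV_NAME = re.compile(r"^(past_key\.\d+)(_RetainedState)?$")


def add_kv_prefix(name: str, prefix: Optional[str]) -> str:
    """Insert `prefix` after the layer index (hypothetical helper)."""
    if prefix is None:
        return name  # default behaviour: names are unchanged
    match = KV_NAME.match(name)
    if match is None:
        return name  # anything that is not an LLM KV buffer is left alone
    base, suffix = match.group(1), match.group(2) or ""
    return f"{base}_{prefix}{suffix}"


names = ["past_key.0", "past_key.0_RetainedState", "vision_embeds_RetainedState"]
renamed = [add_kv_prefix(n, "VLLM") for n in names]

# A consumer-side pattern that selects only the prefixed LLM KV outputs.
select_kv = re.compile(r"^past_key\.\d+_VLLM_RetainedState$")
selected = [n for n in renamed if select_kv.match(n)]
```

With the prefix in place, a single anchored pattern picks out the LLM KV outputs and cannot match vision_embeds_RetainedState.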

Motivation

The vLLM team integrating on this QEff branch needs a stable regex handle on LLM KV buffers
(distinct from vision_embeds_RetainedState etc.) for disaggregated KV transfer between devices.

Test plan

- python -m pytest -q tests/unit_test/models/test_model_quickcheck.py -n auto
- All 121 existing tests still pass
- 18 new unit tests cover: helper correctness, export prefix on CausalLM, default unchanged, hash-dir collision prevention, compile custom_io consistency, VLM lang-only scope, and input validation
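
The hash-dir collision case can be pictured with a small sketch. QEff's actual hashing scheme is not shown in this PR text, so the function export_hash, the parameter dictionary, and the 16-character digest are assumptions used only to illustrate the idea: the prefix enters the hashed parameters only when it is set, so the default hash stays the same and a prefixed export lands in a different cache directory.

```python
import hashlib
import json
from typing import Optional


def export_hash(params: dict, kv_cache_prefix: Optional[str] = None) -> str:
    """Hash export/compile parameters (hypothetical scheme, not QEff's)."""
    hashed = dict(params)
    if kv_cache_prefix is not None:
        # Only add the key when a prefix is given, so default hashes do not move.
        hashed["kv_cache_prefix"] = kv_cache_prefix
    payload = json.dumps(hashed, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


base = {"prefill_seq_len": 32, "ctx_len": 4096, "num_cores": 16}
default_hash = export_hash(base)
prefixed_hash = export_hash(base, kv_cache_prefix="VLLM")
```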

Usage

  from QEfficient import QEFFAutoModelForCausalLM, QEFFAutoModelForImageTextToText

  # CausalLM
  model = QEFFAutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
  model.compile(prefill_seq_len=32, ctx_len=4096, num_cores=16, kv_cache_prefix="VLLM")
  # KV buffers: past_key.0_VLLM / past_key.0_VLLM_RetainedState

  # VLM, dual QPC (kv_offload=True)
  model = QEFFAutoModelForImageTextToText.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct", kv_offload=True)
  model.compile(prefill_seq_len=32, ctx_len=4096, img_size=560, num_cores=16, kv_cache_prefix="VLLM")

  # Without the flag: behaviour identical to today, no name changes.
  model.compile(prefill_seq_len=32, ctx_len=4096, num_cores=16)

…egated KV transfer

vLLM disaggregated serving needs to regex-select only LLM KV cache buffers for device-to-device transfer, without matching vision/multimodal retained buffers.

Applies to QEFFAutoModelForCausalLM and both VLM implementations. When set to an alphanumeric token P, the KV-cache buffers are renamed:

  past_key.{i}_RetainedState -> past_key.{i}_P_RetainedState (output)
  past_key.{i} -> past_key.{i}_P (input)

Dynamic axes carry over to the renamed inputs. The AIC compiler's retention pairing (output X_RetainedState ↔ input X) is preserved.

Vision/multimodal retained buffers (vision_embeds, pixel_values, image_idx, deepstack_features) are never prefixed. Without the flag, behaviour is byte-for-byte identical to today.

The export/compile hash changes when a prefix is used, preventing cache collisions.

Validation enforces an alphanumeric-only prefix (no dots, underscores, spaces) to keep the `past_key.{i}_{P}` structure unambiguous.

Verified:
- 139 tests pass (18 new) on host CI.
- Manual export/generate smoke on CausalLM + VLM with kv_cache_prefix="VLLM" confirms ONNX graph buffer names and dynamic axis pairing are correct.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
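
The validation rule and the retention pairing described in this commit message can be sketched as below. Both function names are hypothetical, and treating "alphanumeric" as ASCII letters and digits only is an assumption; the PR's own checks may differ.

```python
from typing import Iterable, Optional

RETAINED_SUFFIX = "_RetainedState"


def validate_kv_cache_prefix(prefix: Optional[str]) -> Optional[str]:
    """Accept None or a non-empty ASCII alphanumeric token (hypothetical check)."""
    if prefix is None:
        return None
    if not isinstance(prefix, str) or not prefix.isascii() or not prefix.isalnum():
        raise ValueError(
            f"kv_cache_prefix must be alphanumeric (no dots, underscores, spaces): {prefix!r}"
        )
    return prefix


def retention_pairs_ok(inputs: Iterable[str], outputs: Iterable[str]) -> bool:
    """Check that every output X_RetainedState has a matching input X."""
    input_names = set(inputs)
    retained = [o for o in outputs if o.endswith(RETAINED_SUFFIX)]
    return all(o[: -len(RETAINED_SUFFIX)] in input_names for o in retained)
```

Rejecting dots and underscores matters because both characters already act as separators in past_key.{i}_P_RetainedState; a prefix containing them would make the name ambiguous to parse.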

…ditions Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

vbaddi added 2 commits June 5, 2026 15:45

nit: add test for quick fix on checking for compiler args w/prefix ad…

3deb371

vbaddi self-assigned this Jun 5, 2026

vbaddi wants to merge 2 commits into release/v1.22.0_tmp from feature/kv_cache_buffer_prefix_for_vllm
