feat(0506): Add optional KV-cache buffer-name prefix for vLLM disaggr…#1046

Open
vbaddi wants to merge 2 commits into release/v1.22.0_tmp from feature/kv_cache_buffer_prefix_for_vllm

@vbaddi vbaddi commented Jun 5, 2026

Summary

  • Adds an optional kv_cache_prefix argument to the supported model classes
    (QEFFAutoModelForCausalLM, single-QPC VLM, dual-QPC VLM).
  • When provided, injects the token as an infix into KV-cache retained-state buffer names:
    past_key.{i}_RetainedState -> past_key.{i}_<prefix>_RetainedState, enabling vLLM to
    regex-select LLM KV buffers for disaggregated device-to-device transfer.
  • Vision/multimodal retained buffers are never renamed. Default behaviour is unchanged.

Motivation

The vLLM team integrating on this QEff branch needs a stable regex handle on LLM KV buffers
(distinct from vision_embeds_RetainedState etc.) for disaggregated KV transfer between devices.
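
To illustrate the kind of regex handle this gives a consumer, here is a minimal sketch of selecting only the prefixed KV buffers from a list of buffer names. The helper `select_kv_buffers` and the sample name list are hypothetical, and the `past_value` counterpart is an assumption (the description above only spells out `past_key`); this is not code from vLLM or QEfficient.

```python
import re

# Hypothetical prefix; the Usage example in this PR uses "VLLM".
PREFIX = "VLLM"

# Matches past_key.{i}_<prefix>_RetainedState. The past_value alternative is
# an assumption; only past_key is named in the PR description.
KV_RETAINED = re.compile(
    rf"^past_(?:key|value)\.\d+_{re.escape(PREFIX)}_RetainedState$"
)


def select_kv_buffers(names):
    """Return only the names that look like prefixed LLM KV retained buffers."""
    return [name for name in names if KV_RETAINED.match(name)]


# Illustrative buffer list, not taken from a real QPC.
buffer_names = [
    "past_key.0_VLLM_RetainedState",
    "past_value.0_VLLM_RetainedState",
    "vision_embeds_RetainedState",
    "past_key.1_VLLM_RetainedState",
]
selected = select_kv_buffers(buffer_names)
```

Because the prefix sits between the layer index and `_RetainedState`, the vision buffer cannot match the pattern even though it shares the `_RetainedState` suffix.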

Test plan

  • python -m pytest -q tests/unit_test/models/test_model_quickcheck.py -n auto
  • All 121 existing tests still pass
  • 18 new unit tests cover: helper correctness, export prefix on CausalLM, default unchanged,
    hash-dir collision prevention, compile custom_io consistency, VLM lang-only scope, and input
    validation

Usage

  from QEfficient import QEFFAutoModelForCausalLM, QEFFAutoModelForImageTextToText

  # CausalLM
  model = QEFFAutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
  model.compile(prefill_seq_len=32, ctx_len=4096, num_cores=16, kv_cache_prefix="VLLM")
  # KV buffers: past_key.0_VLLM (input) / past_key.0_VLLM_RetainedState (output)

  # VLM, dual QPC (kv_offload=True)
  model = QEFFAutoModelForImageTextToText.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct", kv_offload=True)
  model.compile(prefill_seq_len=32, ctx_len=4096, img_size=560, num_cores=16, kv_cache_prefix="VLLM")

  # Without the flag, behaviour is identical to today: no name changes.
  model.compile(prefill_seq_len=32, ctx_len=4096, num_cores=16)

vbaddi added 2 commits June 5, 2026 15:45
…egated KV transfer

  vLLM disaggregated serving needs to regex-select only LLM KV cache buffers for
  device-to-device transfer, without matching vision/multimodal retained buffers.

  Adds an optional kv_cache_prefix argument to QEFFAutoModelForCausalLM and both
  VLM implementations. When set to an alphanumeric token P, the KV-cache
  retained-state buffers are renamed:

    past_key.{i}_RetainedState  ->  past_key.{i}_P_RetainedState   (output)
    past_key.{i}                ->  past_key.{i}_P                  (input)

  Dynamic axes carry over to the renamed inputs. The AIC compiler's retention pairing
  (output X_RetainedState ↔ input X) is preserved. Vision/multimodal retained buffers
  (vision_embeds, pixel_values, image_idx, deepstack_features) are never prefixed.
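
The renaming rule and the dynamic-axes carry-over described above can be sketched as follows. This is a minimal illustration, not the helper in this PR: `add_kv_prefix` and `rename_dynamic_axes` are hypothetical names, the axis labels are made up, and treating `past_value.{i}` the same way as `past_key.{i}` is an assumption.

```python
import re

# Matches unprefixed KV names, with or without the _RetainedState suffix.
# Handling past_value like past_key is an assumption of this sketch.
KV_NAME = re.compile(r"^(past_(?:key|value)\.\d+)(_RetainedState)?$")


def add_kv_prefix(name, prefix):
    """Insert the prefix after the layer index; leave non-KV names untouched."""
    match = KV_NAME.match(name)
    if match is None:
        return name  # e.g. vision_embeds_RetainedState, pixel_values
    base, suffix = match.group(1), match.group(2) or ""
    return f"{base}_{prefix}{suffix}"


def rename_dynamic_axes(dynamic_axes, prefix):
    """Re-key a dynamic-axes dict so the axes follow the renamed inputs."""
    return {add_kv_prefix(name, prefix): axes for name, axes in dynamic_axes.items()}


# Illustrative axes only; real axis names may differ.
axes = rename_dynamic_axes(
    {"past_key.0": {0: "batch_size", 2: "ctx_len"}, "input_ids": {0: "batch_size"}},
    "P",
)

# Retention pairing: output name minus "_RetainedState" equals the input name.
renamed_in = add_kv_prefix("past_key.0", "P")
renamed_out = add_kv_prefix("past_key.0_RetainedState", "P")
pairing_ok = renamed_out[: -len("_RetainedState")] == renamed_in
```

Placing the token before `_RetainedState` rather than after it is what keeps the output-minus-suffix name equal to the input name.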

  Without the flag, behaviour is byte-for-byte identical to today. The export/compile
  hash changes when a prefix is used, preventing cache collisions.
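
One way to get both properties (unchanged default hash, distinct hash with a prefix) is to add the prefix to the hashed parameters only when it is set. The sketch below is an assumption about the approach, not QEfficient's actual hashing code; `compile_hash` and its parameter dict are hypothetical.

```python
import hashlib
import json


def compile_hash(params, kv_cache_prefix=None):
    """Hash compile parameters; the prefix contributes only when provided."""
    payload = dict(params)
    if kv_cache_prefix is not None:
        # Omitting the key when unset keeps the default hash identical
        # to the hash computed before the option existed.
        payload["kv_cache_prefix"] = kv_cache_prefix
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


params = {"prefill_seq_len": 32, "ctx_len": 4096, "num_cores": 16}
default_hash = compile_hash(params)
prefixed_hash = compile_hash(params, kv_cache_prefix="VLLM")
```

With this scheme a prefixed build and an unprefixed build land in different cache directories, so neither can pick up the other's artifacts.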

  Validation enforces an alphanumeric-only prefix (no dots, underscores, spaces) to
  keep the `past_key.{i}_{P}` structure unambiguous.
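
A minimal sketch of such a check is shown below. The function name `validate_kv_cache_prefix` is hypothetical, and restricting the token to ASCII letters and digits is an assumption; the text above only says "alphanumeric".

```python
import re

# ASCII letters and digits only (assumption: "alphanumeric" means ASCII here).
_PREFIX_RE = re.compile(r"[A-Za-z0-9]+")


def validate_kv_cache_prefix(prefix):
    """Return the prefix unchanged if valid; None means 'no prefix'."""
    if prefix is None:
        return None
    if not isinstance(prefix, str) or _PREFIX_RE.fullmatch(prefix) is None:
        raise ValueError(
            f"kv_cache_prefix must be a non-empty alphanumeric string, got {prefix!r}"
        )
    return prefix
```

Rejecting dots and underscores matters because both already act as separators in `past_key.{i}_{P}`; a prefix such as `a_b` would make the name impossible to split back into index and token unambiguously.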

  Verified:
  - 139 tests pass (18 new) on host CI.
  - Manual export/generate smoke on CausalLM + VLM with kv_cache_prefix="VLLM"
    confirms ONNX graph buffer names and dynamic axis pairing are correct.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
…ditions

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
@vbaddi vbaddi self-assigned this Jun 5, 2026