feat(0506): Add optional KV-cache buffer-name prefix for vLLM disaggr…#1046
Open
vbaddi wants to merge 2 commits into
…egated KV transfer
vLLM disaggregated serving needs to regex-select only LLM KV cache buffers for
device-to-device transfer, without matching vision/multimodal retained buffers.
This change adds an optional kv_cache_prefix to QEFFAutoModelForCausalLM and both
VLM implementations. When set to an alphanumeric token P, the KV-cache
retained-state buffers are renamed:
past_key.{i}_RetainedState -> past_key.{i}_P_RetainedState (output)
past_key.{i} -> past_key.{i}_P (input)
Dynamic axes carry over to the renamed inputs. The AIC compiler's retention pairing
(output X_RetainedState ↔ input X) is preserved. Vision/multimodal retained buffers
(vision_embeds, pixel_values, image_idx, deepstack_features) are never prefixed.
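The renaming rule above can be sketched in a few lines of Python. This is only an illustration of the naming scheme as described: `rename_buffer` and `KV_BASE` are hypothetical names, not the functions used in this PR, and the sketch covers only the `past_key.{i}` pattern spelled out above.

```python
import re
from typing import Optional

SUFFIX = "_RetainedState"
# Assumption: only the past_key.{i} pattern quoted in the description is matched here.
KV_BASE = re.compile(r"past_key\.\d+")


def rename_buffer(name: str, prefix: Optional[str]) -> str:
    """Return the buffer name with the prefix applied (hypothetical helper)."""
    if not prefix:
        return name  # no prefix: names stay exactly as they are today
    is_output = name.endswith(SUFFIX)
    base = name[: -len(SUFFIX)] if is_output else name
    if not KV_BASE.fullmatch(base):
        return name  # vision/multimodal buffers are never prefixed
    renamed = f"{base}_{prefix}"
    return renamed + SUFFIX if is_output else renamed


if __name__ == "__main__":
    for n in ["past_key.0", "past_key.0_RetainedState", "vision_embeds_RetainedState"]:
        print(n, "->", rename_buffer(n, "VLLM"))
```

Because the prefix is inserted before `_RetainedState`, stripping that suffix from the renamed output still yields the renamed input, which is the pairing the compiler relies on.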
Without the flag, behaviour is byte-for-byte identical to today. The export/compile
hash changes when a prefix is used, preventing cache collisions.
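One way to get both properties (unchanged hash without the flag, distinct hash with it) is to fold the prefix into the hashed parameters only when it is set. The sketch below is an assumption about the general technique, not the hashing code in this repository; `export_hash` and the parameter dictionary are hypothetical.

```python
import hashlib
import json
from typing import Optional


def export_hash(params: dict, kv_cache_prefix: Optional[str] = None) -> str:
    """Hash export parameters; the prefix is included only when set (hypothetical)."""
    payload = dict(params)
    if kv_cache_prefix is not None:
        # Adding the key only when a prefix is given keeps the no-prefix hash
        # identical to what it was before the option existed.
        payload["kv_cache_prefix"] = kv_cache_prefix
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


if __name__ == "__main__":
    base = {"model": "example", "ctx_len": 128}  # illustrative values only
    print(export_hash(base))
    print(export_hash(base, "VLLM"))
```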
Validation enforces an alphanumeric-only prefix (no dots, underscores, spaces) to
keep the `past_key.{i}_{P}` structure unambiguous.
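A minimal sketch of such a check, assuming ASCII letters and digits only and assuming an empty string is rejected; `validate_kv_cache_prefix` is a hypothetical name and the PR's actual validation may differ in both respects.

```python
import re
from typing import Optional

# Assumption: "alphanumeric" means ASCII letters and digits, at least one character.
_PREFIX_RE = re.compile(r"[A-Za-z0-9]+")


def validate_kv_cache_prefix(prefix: Optional[str]) -> Optional[str]:
    """Return the prefix unchanged if valid, raise ValueError otherwise (hypothetical)."""
    if prefix is None:
        return None  # option not used
    if not isinstance(prefix, str) or not _PREFIX_RE.fullmatch(prefix):
        raise ValueError(
            f"kv_cache_prefix must be alphanumeric (no dots, underscores, spaces): {prefix!r}"
        )
    return prefix


if __name__ == "__main__":
    print(validate_kv_cache_prefix("VLLM"))
    for bad in ["a_b", "a.b", "a b"]:
        try:
            validate_kv_cache_prefix(bad)
        except ValueError as err:
            print("rejected:", err)
```

Rejecting underscores and dots means the only `_` after the layer index belongs to the structure itself, so a name like `past_key.0_VLLM_RetainedState` splits back into index, prefix, and suffix in exactly one way.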
Verified:
- 139 tests pass (18 new) on host CI.
- Manual export/generate smoke on CausalLM + VLM with kv_cache_prefix="VLLM"
confirms ONNX graph buffer names and dynamic axis pairing are correct.
Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
…ditions Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Summary
- Adds an optional kv_cache_prefix across the three model paths (QEFFAutoModelForCausalLM, single-QPC VLM, dual-QPC VLM).
- Renames past_key.{i}_RetainedState → past_key.{i}_<prefix>_RetainedState, enabling vLLM to regex-select LLM KV buffers for disaggregated device-to-device transfer.
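On the consumer side, the regex selection might look roughly like the following. This is a sketch under assumptions: `select_kv_buffers` is a hypothetical helper, the buffer list is made up for illustration, and the pattern vLLM actually uses is not shown in this PR.

```python
import re
from typing import Iterable, List


def select_kv_buffers(names: Iterable[str], prefix: str) -> List[str]:
    """Keep only prefixed LLM KV buffers, inputs and retained-state outputs (hypothetical)."""
    pattern = re.compile(rf"past_key\.\d+_{re.escape(prefix)}(?:_RetainedState)?")
    return [n for n in names if pattern.fullmatch(n)]


if __name__ == "__main__":
    buffers = [  # illustrative names only
        "past_key.0_VLLM",
        "past_key.0_VLLM_RetainedState",
        "vision_embeds_RetainedState",
        "pixel_values",
    ]
    print(select_kv_buffers(buffers, "VLLM"))
```

Without a prefix, a pattern keyed on `_RetainedState` alone would also catch `vision_embeds_RetainedState`; the prefix gives the pattern something that only LLM KV buffers carry.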
Motivation
The vLLM team integrating on this QEff branch needs a stable regex handle on LLM KV buffers
(distinct from vision_embeds_RetainedState etc.) for disaggregated KV transfer between devices.
Test plan
python -m pytest -q tests/unit_test/models/test_model_quickcheck.py -n auto
Covers hash-dir collision prevention, compile custom_io consistency, VLM lang-only scope, and input validation.
Usage