Adding PagedAttention support for CausalLM models by vaibverm · Pull Request #982 · quic/efficient-transformers

vaibverm · 2026-05-13T07:33:30Z

This PR adds PagedAttention (https://arxiv.org/pdf/2309.06180) support for all CausalLM models in QEfficient.
The major change is that the KV cache is no longer treated as contiguous memory. It is instead a collection of blocks that may reside non-contiguously in memory, so cache scatter and gather operations happen per KV block.

Summary of changes compared to BlockedKV:

1) The cache shape changes from [BS, num_kv_heads, CL, dh] to [total_num_kv_blocks, num_kv_heads, kv_block_size, dh].
2) num_kv_blocks = -(-ctx_len // kv_block_size), i.e. ceil(ctx_len / kv_block_size) = physical blocks required for 1 batch element in the K cache.
3) total_num_kv_blocks = BS (kv_batch_size) * num_kv_blocks = total physical blocks available for the K cache.
4) Two new inputs, block_table [BS, num_kv_blocks] and slot_id [BS], are passed to the ONNX model.
4) a) Each entry in block_table is a block_id that points to a physical K/V block. The block to read/write for a given position is the (position_id // kv_block_size)th entry in block_table. '-1' signifies an invalid/unallocated block.
4) b) slot_id tells how many entries are already filled in the currently active block => read up to / write after (slot_id - 1).
5) Limitation: cache writes go to only 1 block at a time per batch element => CPL = kv_block_size. Hence, cache writes should not cross a block boundary.
6) vLLM provides a KV Cache Manager implementation that maintains the KV cache block_table (logical-to-physical block mapping) and the slot_id (location within the active block).
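
For readers who want the addressing spelled out, here is a minimal pure-Python sketch of the arithmetic described above. It is not the code in this PR: the helper names (num_blocks_per_sequence, physical_location, write_fits_in_one_block, gather_sequence) are hypothetical, the cache is a toy list of lists rather than a tensor, and using position_id % kv_block_size as the in-block offset is an assumption made for illustration.

```python
from typing import List, Tuple


def num_blocks_per_sequence(ctx_len: int, kv_block_size: int) -> int:
    """Ceiling division, written the same way as in the PR description."""
    return -(-ctx_len // kv_block_size)


def physical_location(
    block_table_row: List[int], position_id: int, kv_block_size: int
) -> Tuple[int, int]:
    """Map a logical token position to (physical block_id, offset in block)."""
    logical_block = position_id // kv_block_size
    block_id = block_table_row[logical_block]
    if block_id < 0:
        # -1 marks an invalid/unallocated block in the block_table.
        raise ValueError(f"logical block {logical_block} is not allocated")
    # Assumption for this sketch: the offset is the position modulo block size.
    return block_id, position_id % kv_block_size


def write_fits_in_one_block(slot_id: int, num_new_tokens: int, kv_block_size: int) -> bool:
    """slot_id counts entries already filled in the active block; a write
    must not spill past the end of that block."""
    return slot_id + num_new_tokens <= kv_block_size


def gather_sequence(cache, block_table_row: List[int], seq_len: int, kv_block_size: int):
    """Read seq_len entries for one sequence from a toy cache[block][slot]."""
    out = []
    for pos in range(seq_len):
        block_id, offset = physical_location(block_table_row, pos, kv_block_size)
        out.append(cache[block_id][offset])
    return out


# Toy setup: BS = 2, ctx_len = 10, kv_block_size = 4.
kv_block_size = 4
num_kv_blocks = num_blocks_per_sequence(10, kv_block_size)
total_num_kv_blocks = 2 * num_kv_blocks

# Each toy cache entry just records where it lives.
cache = [[f"b{b}s{s}" for s in range(kv_block_size)] for b in range(total_num_kv_blocks)]

# One sequence owns physical blocks 5 and 0; its third logical block is unallocated.
row = [5, 0, -1]
tokens = gather_sequence(cache, row, seq_len=6, kv_block_size=kv_block_size)
```

The gather walks physical blocks 5 and then 0, which is the non-contiguous layout the description refers to. The write_fits_in_one_block check mirrors the stated limitation that a single write must stay inside one block.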

vbaddi · 2026-05-27T17:15:12Z

        v_out = torch.where(invalid_mask.unsqueeze(-1), torch.tensor(0.0, dtype=torch.float32), v_out)
        return k_out, v_out

+    def read_only_pagedAttention(self, block_index, updated, cache_kwargs):


nit: can we rename this to read_only_paged_attention()?

I was staying consistent with the name of the technique in the paper: https://arxiv.org/pdf/2309.06180
The attention mechanism is called PagedAttention rather than paged attention, so I kept pagedAttention in our naming. I can change it to paged_attention for all the methods if that fits the snake_case convention better.

vbaddi · 2026-05-27T17:15:33Z

+        v_out = torch.where((invalid_mask.unsqueeze(1)).unsqueeze(-1), torch.tensor(0.0, dtype=torch.float32), v_out)
+        return k_out, v_out
+
+    def write_only_pagedAttention(self, key_states, value_states, cache_kwargs):


write_only_paged_attention()?

vbaddi · 2026-05-27T17:15:54Z

        """
        return self.layers[layer_idx].read_only_blockedKV(start_index, end_index, cache_kwargs)

+    def read_only_pagedAttention(self, block_index, updated, layer_idx, cache_kwargs):


nit: same as above

vbaddi · 2026-05-27T17:16:34Z

+
 _STRATEGIES: Dict[BlockingMode, Callable] = {
    BlockingMode.KV: blocked_kv_attention_forward,
+    BlockingMode.KV_PAGED: blocked_kv_paged_attention_forward,


nit: @vaibverm can we add unit tests to all methods mentioned here?

vbaddi · 2026-05-27T17:17:38Z

+
 _STRATEGIES: Dict[BlockingMode, Callable] = {
    BlockingMode.KV: blocked_kv_attention_forward,
+    BlockingMode.KV_PAGED: blocked_kv_paged_attention_forward,


nit: also, can you add an example to enable this here, similar to this: https://github.com/quic/efficient-transformers/blob/main/examples/text_generation/blocked_attention_inference.py

Done. Added examples for CausalLM and Qwen3 models.

vbaddi · 2026-05-27T17:21:07Z

nit: lint and format are missing, pls check.

Issue: Generated text is not accurate (unable to identify object in given image) 5/25: Found a workaround, seems like compiler issue - debugging further --------- Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com> Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

vaibverm force-pushed the PR_branch branch from 65bd648 to f4eefaa Compare May 13, 2026 19:48

anujgupt-github added the enhancement New feature or request label May 20, 2026

vbaddi requested changes May 27, 2026

vbaddi assigned vaibverm May 27, 2026

vaibverm force-pushed the PR_branch branch from e6d101e to d60baca Compare June 4, 2026 14:03

vbaddi changed the base branch from main to release/v1.22.0_tmp June 5, 2026 18:53

vaibverm force-pushed the PR_branch branch from d60baca to 5dc796a Compare June 6, 2026 00:10

tv-karthikeya and others added 14 commits June 5, 2026 19:11

Revert "[WIP] Fix for acc issue in Qwen3 VL moe" (quic#1019)

e90c410

Reverts quic#1010 Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com> Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Rebased PagedAttention support with latest Qeff for PR

f219599

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Added block_table and slot_id inputs + minor modelling_auto.py changes

99e5566

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Working version with PagedAttention

e580faf

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Minor fixes to specialization builder

7ddff49

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Added support for Qwen2.5_VL PagedAttention

f1f0bec

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

slot_id fix for Qwen2.5_VL PagedAttention decode

4bc8b91

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Added support for Qwen3_VL PagedAttention

ad02fbb

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Removed commented code corrected in rebase

bd428ad

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Minor fix for enum bug

a32fd47

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Optimize attention blocking nested loops (quic#957)

51c04f6

changed the code from doing the exact same math repeatedly. Signed-off-by: Anuj Gupta <anujgupt@qti.qualcomm.com> Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Adding PagedAttetion support for Qwen3_VL_MOE

17fdade

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Adding PagedAttention specific test, unit tests and examples + lint/f…

ee2dadb

…ormat cleanup Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

vaibverm force-pushed the PR_branch branch from 5dc796a to ee2dadb Compare June 6, 2026 00:11

vaibverm wants to merge 14 commits into quic:release/v1.22.0_tmp from vaibverm:PR_branch

5 participants
