KV handoff with DMA slicing APIs to avoid KV input/output copies. by quic-akuruvil · Pull Request #1039 · quic/efficient-transformers

quic-akuruvil · 2026-06-04T17:14:44Z

Prefill-to-decode KV transfer goes through the host via shared memory.
Shared memory is used so that the KV cache is not copied when it moves from the prefill devices to the host.
The KV cache is dumped from the prefill devices into shared memory on the host, and a pointer to that shared memory is passed to the decode instance, which loads the KV cache directly from those host buffers.
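
For readers unfamiliar with the pattern, here is a minimal sketch of a host shared-memory handoff using only the Python standard library. It is not the code in this PR: `prefill_dump_kv`, `decode_load_kv`, and `handoff_demo` are hypothetical names, the float32 list stands in for real KV tensors, and both sides run in one process for brevity. The point is that the consumer attaches to the block by name and reads it through a view rather than receiving a copy.

```python
from array import array
from multiprocessing import shared_memory


def prefill_dump_kv(kv_values):
    """Hypothetical prefill side: place float32 KV values in host shared memory."""
    payload = array("f", kv_values)
    nbytes = len(payload) * payload.itemsize
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    # In the real flow this write would be the device-to-host dump.
    shm.buf[:nbytes] = payload.tobytes()
    return shm, nbytes


def decode_load_kv(name, nbytes):
    """Hypothetical decode side: attach by name and view the bytes in place."""
    shm = shared_memory.SharedMemory(name=name)
    # Typed view over the same memory; no bytes are copied here.
    view = shm.buf[:nbytes].cast("f")
    return shm, view


def handoff_demo():
    producer, nbytes = prefill_dump_kv([0.5, 1.5, -2.0, 4.0])
    consumer, kv = decode_load_kv(producer.name, nbytes)
    try:
        return list(kv)
    finally:
        kv.release()       # views must be released before closing the mapping
        consumer.close()
        producer.close()
        producer.unlink()  # free the shared block once both sides are done
```

Only the block name and byte length cross the prefill/decode boundary in this sketch, which mirrors the "pass the pointer" step described above.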

Adds a new temporary QAICInferenceSession class (cloud_infer_kv_slice.py) that enables zero-copy KV-cache handoff between disaggregated prefill and decode sessions using shared DMA buffers and setDataWithSlices(). On the last prefill chunk, the KV outputs are wired directly into the decode session's input slots via a sliced DMA descriptor, which eliminates the Python/numpy copy at the prefill→decode boundary.
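
To illustrate the idea of wiring outputs into input slots by slice, the sketch below binds named input slots to (offset, size) regions of a single buffer using `memoryview`. It does not reproduce setDataWithSlices() or the QAICInferenceSession class; `SliceDescriptor`, `bind_slices`, and the slot names such as `past_key.0` are assumptions made for illustration, and a `bytearray` stands in for the shared DMA buffer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceDescriptor:
    """Hypothetical descriptor: which byte range of the buffer feeds which input slot."""
    name: str
    offset: int
    size: int


def bind_slices(buffer, descriptors):
    """Map each input-slot name to a view of its byte range; nothing is copied."""
    view = memoryview(buffer)
    bound = {}
    for d in descriptors:
        end = d.offset + d.size
        if d.offset < 0 or d.size < 0 or end > len(view):
            raise ValueError(f"slice for {d.name!r} falls outside the buffer")
        bound[d.name] = view[d.offset:end]
    return bound


# One buffer holds the prefill KV outputs; decode inputs are views into it.
kv_buffer = bytearray(16)
decode_inputs = bind_slices(
    kv_buffer,
    [SliceDescriptor("past_key.0", 0, 8), SliceDescriptor("past_value.0", 8, 8)],
)
```

Because each bound input is a view, anything written into `kv_buffer` after binding is visible through `decode_inputs` without an intermediate copy, which is the property the sliced descriptor is meant to provide.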

cluster_id="prefill" gives a pool of stages+1 slots for concurrent chunk pipelining; cluster_id="decode" gives a single fixed slot because decode is strictly sequential.
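
The slot-count rule above can be sketched as follows. `num_slots` and `slot_for_chunk` are hypothetical helpers, not functions from this PR, and the round-robin assignment of chunks to slots is an assumption about how a pool of this size might be used.

```python
def num_slots(cluster_id, stages):
    """Slot-pool size per the rule above: stages+1 for prefill, 1 for decode."""
    if cluster_id == "prefill":
        return stages + 1
    if cluster_id == "decode":
        return 1
    raise ValueError(f"unknown cluster_id: {cluster_id!r}")


def slot_for_chunk(chunk_index, cluster_id, stages):
    """Assumed round-robin reuse of slots as chunks flow through the pipeline."""
    return chunk_index % num_slots(cluster_id, stages)


# With 4 pipeline stages, prefill chunks cycle through 5 slots.
prefill_slots = [slot_for_chunk(i, "prefill", 4) for i in range(7)]
# Decode always lands on its single slot.
decode_slots = [slot_for_chunk(i, "decode", 4) for i in range(3)]
```

With stages+1 slots, a new chunk can be staged while up to `stages` earlier chunks are still in flight, whereas decode never needs more than one slot because each step waits for the previous one.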

Also adds an end-to-end example (examples/disagg_serving/qwen3moe_disagg_mode_with_chunking_kvslice.py) demonstrating the full disaggregated serving flow for Qwen3-MoE with chunked prefill, PP (stages), TS, and DMA-sliced KV handoff.

Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>

Signed-off-by: Ann <quic_akuruvil@quicinc.com>

quic-mohmeh and others added 5 commits June 4, 2026 22:24

Added MDP generation to QEff Compile

16833df

Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>

Formatting and Linting

bc006dd

Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>

Add compiler options - 'stages'

7a0d651

Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>

Added support for layerwise export

8193f30

Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>

Added inference serving with DMA slicing for KV handoff

fe974d0

Signed-off-by: Ann <quic_akuruvil@quicinc.com>

quic-akuruvil requested review from anujgupt-github, quic-hemagnih, quic-rishinr and vbaddi June 4, 2026 17:16

quic-akuruvil assigned ochougul and quic-akuruvil and unassigned ochougul Jun 4, 2026

quic-akuruvil requested a review from ochougul June 4, 2026 17:20

quic-akuruvil wants to merge 5 commits into quic:release/v1.22.0_tmp from quic-akuruvil:dma_slice

3 participants
