
KV handoff with DMA slicing APIs to avoid KV input/output copies. #1039

Open
quic-akuruvil wants to merge 5 commits into quic:release/v1.22.0_tmp from quic-akuruvil:dma_slice


quic-akuruvil (Contributor) commented Jun 4, 2026

Prefill-to-decode KV transfer goes through shared memory on the host.
Shared memory is used so that the KV cache is not copied when it is transferred from the prefill devices to the host.
The prefill devices dump the KV cache into shared memory on the host, and a pointer to that shared memory is passed to the decode instance, which loads the KV cache directly from those host buffers.
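
The host-side handoff described above can be illustrated with a minimal sketch. It uses only Python's standard `multiprocessing.shared_memory` module as a stand-in for the host buffers; it is not the QAIC runtime API, and `prefill_dump`, `decode_attach`, and `KV_BYTES` are hypothetical names chosen for illustration.

```python
from multiprocessing import shared_memory

KV_BYTES = 16  # toy size; a real KV cache is far larger


def prefill_dump(kv_bytes: bytes) -> shared_memory.SharedMemory:
    """Prefill side (hypothetical): write the KV cache once into host shared memory."""
    shm = shared_memory.SharedMemory(create=True, size=len(kv_bytes))
    shm.buf[: len(kv_bytes)] = kv_bytes
    return shm


def decode_attach(name: str) -> shared_memory.SharedMemory:
    """Decode side (hypothetical): attach to the same block by name, without copying it."""
    return shared_memory.SharedMemory(name=name)


kv = bytes(range(KV_BYTES))
producer = prefill_dump(kv)
consumer = decode_attach(producer.name)

# The decode side reads exactly what the prefill side wrote.
handoff_ok = bytes(consumer.buf[:KV_BYTES]) == kv

# Both handles map the same memory: a write through one is visible through the other.
producer.buf[0] = 255
shared_view_ok = consumer.buf[0] == 255

# Release the mappings, then free the block.
consumer.close()
producer.close()
producer.unlink()
```

Only the block's name (the "pointer" in the description) crosses from the prefill side to the decode side; the KV bytes themselves stay where the prefill side wrote them.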

Adds a new temporary QAICInferenceSession class (cloud_infer_kv_slice.py) that enables zero-copy KV-cache handoff between disaggregated prefill and decode sessions using shared DMA buffers and setDataWithSlices(). On the last prefill chunk, the KV outputs are wired directly into the decode session's input slots via a sliced DMA descriptor, which eliminates the Python/numpy copy at the prefill→decode boundary.
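
The idea of wiring outputs into input slots through slices of one shared buffer can be sketched in plain Python. This is only an analogy built on `memoryview`: it does not call setDataWithSlices() or any QAIC API, and the `(offset, length)` layout, `kv_slices`, and `bind_inputs` are assumptions made for illustration.

```python
from typing import List, Tuple

Slice = Tuple[int, int]  # (byte offset, byte length); hypothetical descriptor layout


def kv_slices(num_layers: int, bytes_per_layer: int) -> List[Slice]:
    """One (offset, length) entry per layer inside a single contiguous buffer."""
    return [(i * bytes_per_layer, bytes_per_layer) for i in range(num_layers)]


def bind_inputs(buffer: bytearray, slices: List[Slice]) -> List[memoryview]:
    """Return zero-copy views; each view aliases a region of the shared buffer."""
    whole = memoryview(buffer)
    return [whole[off : off + length] for off, length in slices]


shared = bytearray(12)  # stand-in for a shared DMA buffer
slices = kv_slices(num_layers=3, bytes_per_layer=4)
decode_inputs = bind_inputs(shared, slices)

# "Prefill" writes its last-chunk KV output for layer 1 into the buffer.
shared[4:8] = b"\x01\x02\x03\x04"

# "Decode" sees the data through its pre-bound view; nothing was copied in between.
layer1 = bytes(decode_inputs[1])
```

Because the views are bound before the data is written, the consumer never receives a separate copy; it reads the producer's bytes in place, which is the property the sliced descriptor is described as providing.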

cluster_id="prefill" gives a pool of stages+1 slots so that chunks can be pipelined concurrently; cluster_id="decode" gives a single fixed slot because decode is strictly sequential.
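
The slot-sizing rule can be sketched as follows. Only the `cluster_id` values and the stages+1 versus single-slot sizing come from the description above; `slot_count` and `SlotPool` are hypothetical helpers and do not reflect how the PR implements its pool.

```python
import queue


def slot_count(cluster_id: str, stages: int) -> int:
    """Number of buffer slots for a cluster, following the rule in the description."""
    if cluster_id == "prefill":
        return stages + 1  # one slot per pipeline stage, plus one spare
    if cluster_id == "decode":
        return 1  # decode is strictly sequential
    raise ValueError(f"unknown cluster_id: {cluster_id!r}")


class SlotPool:
    """Hypothetical fixed-size pool of slot indices backed by a thread-safe queue."""

    def __init__(self, cluster_id: str, stages: int) -> None:
        self._free: "queue.Queue[int]" = queue.Queue()
        for slot in range(slot_count(cluster_id, stages)):
            self._free.put(slot)

    def acquire(self, timeout: float = 1.0) -> int:
        """Take a free slot; raises queue.Empty if none frees up within the timeout."""
        return self._free.get(timeout=timeout)

    def release(self, slot: int) -> None:
        """Return a slot to the pool once its chunk has finished."""
        self._free.put(slot)

    def available(self) -> int:
        return self._free.qsize()


prefill_pool = SlotPool("prefill", stages=4)
decode_pool = SlotPool("decode", stages=4)

in_flight = [prefill_pool.acquire() for _ in range(5)]  # five chunks in flight
prefill_pool.release(in_flight.pop())                   # one chunk completes
```

With four stages, up to five prefill chunks can hold a slot at once, while the decode pool never hands out more than one.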

Also adds an end-to-end example (examples/disagg_serving/qwen3moe_disagg_mode_with_chunking_kvslice.py) demonstrating the full disaggregated serving flow for Qwen3-MoE with chunked prefill, PP (stages), TS, and DMA-sliced KV handoff.

quic-mohmeh and others added 5 commits June 4, 2026 22:24
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
Signed-off-by: Ann <quic_akuruvil@quicinc.com>

3 participants