
# Implementing distributed prefill and decode stages for an LLM

## Introduction

LLM inference can be split into two stages: prefill and decode. The prefill stage takes the entire input token stream as its input and produces a Key-Value (KV) cache as its output. Because it processes all input tokens in a single pass, it is computation-intensive and best run on a high-compute GPU. The decode stage takes the KV cache as its input and generates the next output token; running the decode stage repeatedly produces the full output token stream. Since the KV cache is large and must be read on every step, the decode stage is memory bandwidth-intensive.
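As a concrete illustration, the sketch below (a minimal example using PyTorch via the Hugging Face `transformers` API; the model name, prompt, and token count are placeholders) runs the prefill pass once over the whole prompt and then loops the decode step, reusing the KV cache returned by each forward pass:

```python
# Minimal prefill/decode sketch. Assumes the Hugging Face transformers
# API; "gpt2" and the prompt are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

inputs = tokenizer("The Enlightenment movement was", return_tensors="pt")

with torch.no_grad():
    # Prefill: a single pass over the full prompt builds the KV cache.
    out = model(**inputs, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1:].argmax(dim=-1)

    # Decode: one token per step, reusing (and extending) the KV cache.
    generated = [next_token]
    for _ in range(16):
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1:].argmax(dim=-1)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=1)[0]))
```

Greedy decoding is used here only to keep the loop short; the point is that every decode step depends on the KV cache rather than on re-reading the full prompt.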

In some use cases, we may need to run the prefill stage on one processor node (node 1) that handles matrix multiplication well, and the decode stage on another node (node 2) that handles memory transfers well. This requires efficiently transferring the KV cache from node 1 to node 2.

## Goal of the prototype

The ultimate goal of this prototype is to demonstrate a working setup where the prefill stage runs on one GPU (node 1), the KV cache is transmitted to a second GPU (node 2), and the decode stage runs on node 2.

We plan to reach this goal through the following intermediate steps.

- **Step 0:** Run prefill and decode (PD) together on an Nvidia GPU.
- **Step 0.5:** Run PD together on an Intel XPU.
- **Step 1:** Split PD into separate processes, both on the same Nvidia GPU. Demonstrate that prefill can run in one process on a GPU, the KV cache can be saved, and decode can run in a second process on the same GPU.
- **Step 1.5:** Split PD into separate processes, both on the same Intel XPU.
- **Step 2:** Split PD between an Nvidia GPU and an Intel XPU (same host). Run the prefill step on the Nvidia GPU and the decode step on an Intel XPU on the same host. The KV cache can be communicated through PCIe or other means.
- **Step 3:** Split PD between an Nvidia GPU and an Intel XPU (different hosts). The KV cache should be communicated by an appropriate protocol such as RDMA.

These intermediate steps may be accomplished through any of the following software stacks:

- PyTorch,
- vLLM, or
- llm-d with vLLM as the inference engine.

In addition to demonstrating the prototype, we will also benchmark the performance of each of these approaches.

## Status of the prototype

| Software stack | Step 0 | Step 0.5 | Step 1 | Step 1.5 | Step 2 | Step 3 |
|----------------|--------|----------|--------|----------|--------|--------|
| PyTorch        | X      | X        | X      |          | X      |        |
| vLLM           | X      | X        |        |          |        |        |
| llm-d          | X      |          |        |          |        |        |

An "X" indicates that the step is complete.

The following sections describe how to reproduce each step. For all Python scripts, activate the virtual environment before running them.
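For example, assuming the virtual environment lives at `.venv` in the repository root:

```shell
source .venv/bin/activate
```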

## PyTorch instructions

### Step 0

### Step 1

Demonstrated using file copy.
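A minimal sketch of how this might look, assuming the Hugging Face `transformers` API and a picklable KV cache (newer `transformers` versions return a `Cache` object, which can be converted with `to_legacy_cache()` if needed); the script and file names (`prefill.py`, `decode.py`, `kv_cache.pt`) are placeholders:

```python
# prefill.py -- sketch: run prefill on the GPU and persist the KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval().cuda()

inputs = tokenizer("Write a 100-word essay", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model(**inputs, use_cache=True)

# Save the cache plus the logits of the last prompt token so the decode
# process can pick the first output token without redoing prefill.
torch.save(
    {"past_key_values": out.past_key_values, "next_logits": out.logits[:, -1:]},
    "kv_cache.pt",
)
```

```python
# decode.py -- sketch: load the KV cache in a second process and decode.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # must match the prefill process
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval().cuda()

# weights_only=False because the file holds a cache object, not just tensors.
state = torch.load("kv_cache.pt", map_location="cuda", weights_only=False)
past_key_values = state["past_key_values"]
next_token = state["next_logits"].argmax(dim=-1)

generated = [next_token]
with torch.no_grad():
    for _ in range(32):
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1:].argmax(dim=-1)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=1)[0]))
```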

### Step 2

Demonstrated using file copy.
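The same file-copy approach can bridge devices on one host: the decode process loads the cache onto the XPU instead of the CUDA device. A sketch, assuming a PyTorch build with XPU support (`torch.xpu`) and a legacy tuple-of-tensors cache layout:

```python
# Sketch: load a KV cache produced on the Nvidia GPU and move it to the
# Intel XPU for decode. Assumes torch.xpu support and a legacy
# tuple-of-(key, value) cache layout; newer Cache objects can be
# converted with to_legacy_cache() / from_legacy_cache().
import torch

state = torch.load("kv_cache.pt", map_location="cpu", weights_only=False)
past_key_values = tuple(
    (k.to("xpu"), v.to("xpu")) for k, v in state["past_key_values"]
)
next_token = state["next_logits"].to("xpu").argmax(dim=-1)
# ...continue with the decode loop from Step 1, with the model on "xpu".
```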

## vLLM instructions

### Step 0

#### Running on a combined Intel XPU/Nvidia GPU machine

TBD
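Until these instructions are filled in, one plausible starting point is vLLM's offline inference API (the model name is a placeholder):

```python
# Sketch: run prefill and decode together with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # illustrative placeholder
params = SamplingParams(max_tokens=100)
outputs = llm.generate(
    ["Write a 100-word essay on the Enlightenment movement"], params
)
print(outputs[0].outputs[0].text)
```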

### Step 0.5

#### Running on a combined Intel XPU/Nvidia GPU machine

TBD

## llm-d instructions

Follow the instructions in the llm-d installation guide to set up.

### Step 0

```shell
python split_vllm.py --remote --prompt "Write a 100-word essay on the Enlightenment movement"
```

## Limitations of vLLM

The vLLM software package supports benchmarking LLM inference and splitting the benchmark results between the prefill and decode stages. However, as of Feb 2026, it may not support running the prefill and decode stages on arbitrary separate nodes.

It only supports a specific use case where all the nodes are Nvidia GPU nodes. Under the hood, it supports this feature by transferring the KV cache between nodes using the NVIDIA Inference Xfer Library (NIXL). This is not obviously reusable for other GPUs.
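For reference, recent vLLM versions expose this through the `--kv-transfer-config` option; the connector names and JSON fields vary by vLLM version, so treat the following as an unverified sketch rather than a tested command:

```shell
# Unverified sketch: NIXL-backed KV transfer in vLLM's disaggregated
# prefill setup; connector names and fields vary by vLLM version.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
```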

We will need to investigate whether this is a real limitation.

## References