LLM inference can be split into two stages: prefill and decode. The prefill stage takes the entire input token stream and produces the Key-Value (KV) cache. It is compute-intensive and best run on a high-compute GPU. The decode stage takes the KV cache as its input and produces the next output token; running it repeatedly generates the full output token stream. Because the KV cache is large and must be read on every step, the decode stage is memory-bandwidth-intensive.
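The two stages can be sketched with a toy model in plain Python. This is purely conceptual: `toy_kv` and the modular arithmetic are invented stand-ins for a real transformer's per-layer key/value computation.

```python
# Conceptual sketch only: illustrates how prefill builds the KV cache in one
# batched pass and decode extends it one token at a time. toy_kv is an
# invented stand-in; a real model computes per-layer key/value tensors.

def toy_kv(token):
    # Stand-in for a layer's key/value projection of a single token.
    return ((31 * token + 7) % 97, (17 * token + 5) % 97)

def prefill(prompt_tokens):
    # Processes the whole prompt at once -> compute-bound in a real model.
    return [toy_kv(t) for t in prompt_tokens]

def decode_step(kv_cache):
    # Reads the entire (large) cache to emit one token -> bandwidth-bound.
    next_token = sum(k + v for k, v in kv_cache) % 50
    kv_cache.append(toy_kv(next_token))  # cache grows by one entry per step
    return next_token

prompt = [3, 14, 15, 9]
cache = prefill(prompt)                           # stage 1: build the cache
output = [decode_step(cache) for _ in range(5)]   # stage 2: run decode 5 times
```

Note that decode touches every cache entry per emitted token, which is why the cost profile shifts from compute to memory bandwidth once the cache is large.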
In some use cases, we may need to run the prefill stage on one processor node (node 1) which handles matrix multiplication well and the decode stage on another node (node 2) that handles memory transfers well. This involves efficiently transferring the KV cache from node 1 to node 2.
The ultimate goal of this prototype is to demonstrate a working setup where the prefill stage is implemented on one GPU (node 1), the KV cache is transmitted to a second GPU (node 2), and the decode stage is implemented on node 2.
We plan to accomplish this ultimate goal through the following intermediate steps.
- Step 0
- Run prefill and decode (PD) together on an Nvidia GPU
- Step 0.5
- Run PD together on an Intel XPU
- Step 1
- Split PD processes but run both on the same Nvidia GPU. Demonstrate that prefill can run in one process on a GPU, the KV cache can be saved, and decode can run in a second process on the same GPU.
- Step 1.5
- Split PD processes but run both on the same Intel XPU
- Step 2
- Split PD between an Nvidia GPU and an Intel XPU (same host). Run the prefill step on the Nvidia GPU and the decode step on an Intel XPU on the same host. The KV cache can be communicated through PCIe or other means.
- Step 3
- Split PD between an Nvidia GPU and an Intel XPU (different hosts). The KV cache should be communicated over an appropriate protocol such as RDMA.
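The split-process pattern of Steps 1 and 1.5 can be sketched as follows, with the KV cache handed off through a file. The toy cache and the filename `kv_cache.pkl` are illustrative; a real implementation would serialize framework tensors.

```python
# Minimal sketch of the Step 1 / 1.5 pattern: the prefill "process" writes
# the KV cache to a file, and a separate decode "process" reads it back.
# Plain Python lists stand in for real KV tensors.
import os
import pickle
import tempfile

def prefill_process(prompt_tokens, path):
    kv_cache = [(t * 31 % 97, t * 17 % 97) for t in prompt_tokens]  # toy K/V
    with open(path, "wb") as f:
        pickle.dump(kv_cache, f)  # hand-off point between the two processes

def decode_process(path, n_steps):
    with open(path, "rb") as f:
        kv_cache = pickle.load(f)  # decode starts from the transferred cache
    out = []
    for _ in range(n_steps):
        tok = sum(k + v for k, v in kv_cache) % 50
        kv_cache.append((tok * 31 % 97, tok * 17 % 97))
        out.append(tok)
    return out

path = os.path.join(tempfile.mkdtemp(), "kv_cache.pkl")
prefill_process([3, 14, 15, 9], path)
tokens = decode_process(path, 3)
```

The same pattern extends to Steps 2 and 3 by replacing the file write/read with a PCIe or RDMA transfer between the two nodes.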
These intermediate steps may be accomplished through any of the following software stacks.
- PyTorch,
- vLLM, or
- llm-d with vLLM as the inference engine.
In addition to demonstrating the prototype, we will also benchmark the performance of each of these approaches.
| Software stack | Step 0 | Step 0.5 | Step 1 | Step 1.5 | Step 2 | Step 3 |
|---|---|---|---|---|---|---|
| PyTorch | X | X | X | X | | |
| vLLM | X | X | | | | |
| llm-d | X | | | | | |
An “X” indicates that the step is complete.
The following sections describe how to reproduce each step. For all Python scripts, activate the virtual environment before running them.
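A typical invocation looks like the following; the `.venv` location is an assumption about this repo's layout (adjust the path to wherever you created the virtual environment), and the command itself is the one used for the split-PD steps below.

```shell
# Assumed layout: the virtual environment lives at .venv in the repo root.
source .venv/bin/activate
python split_vllm.py --remote --prompt "Write a 100-word essay on the Enlightenment movement"
```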
Demonstrated using file copy.
Demonstrated using file copy.
Running on combined Intel XPU/Nvidia GPU machine
TBD

Running on combined Intel XPU/Nvidia GPU machine
TBD

Follow the instructions in llm-d installation to set up.
python split_vllm.py --remote --prompt "Write a 100-word essay on the Enlightenment movement"

The vLLM software package supports benchmarking LLM inference and splitting the benchmark results between the prefill and decode stages. However, as of Feb 2026, it may not support running the prefill and decode stages on arbitrary separate nodes.
It only supports a specific use case where all the nodes are Nvidia GPU nodes. Under the hood, it supports this feature by enabling transfers of the KV cache between nodes using the NVIDIA Inference Xfer Library (NIXL). This is not obviously reusable for other GPUs.
We will need to investigate whether this is a real limitation.
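For reference, the Nvidia-only disaggregated setup looks roughly like the following, based on vLLM's disaggregated-prefill example. The exact flags and connector names vary across vLLM versions (newer releases use a NIXL-based connector), and the model name is illustrative.

```shell
# Sketch of vLLM's Nvidia-only disaggregated prefill/decode launch.
# Flags and connector names are version-dependent; treat as illustrative.

# Prefill instance (KV producer) on GPU 0
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --port 8100 \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'

# Decode instance (KV consumer) on GPU 1
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --port 8200 \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}'
```

A proxy then routes each request to the prefill instance first and hands the response off to the decode instance; the question for this prototype is whether the connector layer can be backed by something other than Nvidia-specific transports.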