Local decode with distributed prefill by lobanov · Pull Request #343 · antirez/ds4

lobanov · 2026-06-06T16:15:34Z

This is a draft PR for a work-in-progress feature. The branch is not ready to be merged as is, because it contains a lot of research artifacts and code instrumentation that will likely be unnecessary in the final result. Still, I want to share it with the community in case someone else wants to benchmark it. I would also like feedback on whether this is going in the right direction.

What is it? See #304. This runs distributed prefill with fully local generation. Unlike standard distributed prefill, it is not intended for running a big model straddled across two or more nodes; it is a performance optimisation. The entire model needs to fit into the memory of the worker owning the output layer (with the --local-decode flag passed) for generation to be fully resident.

Current status: usable for evaluation, but still largely research-focused work. Generation with local decode can work with the ds4 and ds4-server frontends. There is also a work-in-progress implementation of the ds4-eval frontend for quality control. The code compiles for the metal and cuda-spark targets.

Current focus of work: evaluating the quality of local decode after distributed prefill. The official logits gate passes, but I see divergent generation trajectories. I'm focusing on longer evaluations using a modified ds4-eval. After that I will look at options to speed up the distributed handoff, as I sense lots of opportunities for improvement. Evaluation-wise, the first 4 questions of ds4-eval pass on CUDA->Metal (local decode), but the 2nd question fails on Metal->CUDA (local decode). I'm not sure yet whether it's a bug in the evaluation or whether the inference quality is poorer. Running the full suite is currently blocked by some kind of stability issue: the coordinator running ds4-eval fails after 5-6 questions.

How to run:

Connect the two nodes with a high-speed link. I use a USB-C-to-5GbE adapter on the Mac side and a short Cat8 cable plugged into the DGX Spark, but I have not seen traffic above 200MB/s, so 2.5GbE with a Cat6 cable should be adequate. 1GbE will be too tight unless you switch to --dist-activation-bits 16 (not tested).
Run the coordinator on the node that will own early-layer prefill, e.g. ./ds4 --role coordinator --layers 0:21 --listen 0.0.0.0 1234 --prompt-file speed-bench/promessi_sposi_1k.txt --ctx 200000 -n 1000. Here speed-bench/promessi_sposi_1k.txt is the first 1000 lines of speed-bench/promessi_sposi.txt, created with head -n 1000 speed-bench/promessi_sposi.txt > speed-bench/promessi_sposi_1k.txt; it is around 18k tokens. I use the DGX Spark for this role. Only the owned layers (0-21 in this case) are loaded into this instance.
Run the worker on the node that will own later-layer prefill and local generation, e.g. ./ds4 --role worker --layers 22:output --coordinator <coordinator_ip> 1234 --local-decode --ctx 200000. I use a MacBook M5 Max 128GB for that. Note that the worker with --local-decode must have enough memory to fit the entire model plus context.
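As a rough sanity check on the link sizing in the first step, the arithmetic can be sketched as below. Only the 200MB/s peak comes from the measurements above; the sketch assumes decimal megabytes and raw line rates with no protocol overhead, and the function names are made up for illustration.

```python
# Rough link-capacity check for the two-node setup.
# Assumptions: MB means 10^6 bytes, and protocol overhead is ignored.

OBSERVED_PEAK_MB_S = 200.0  # peak traffic reported in the text, in MB/s


def link_capacity_mb_s(gbit_per_s: float) -> float:
    """Raw line rate of a link in MB/s (1 Gbit/s = 125 MB/s)."""
    return gbit_per_s * 1000.0 / 8.0


def headroom(gbit_per_s: float, demand_mb_s: float = OBSERVED_PEAK_MB_S) -> float:
    """Ratio of raw capacity to demand; below 1.0 the link would saturate."""
    return link_capacity_mb_s(gbit_per_s) / demand_mb_s


if __name__ == "__main__":
    for name, rate in [("1GbE", 1.0), ("2.5GbE", 2.5), ("5GbE", 5.0)]:
        cap = link_capacity_mb_s(rate)
        print(f"{name}: {cap:.1f} MB/s raw, headroom x{headroom(rate):.3f}")
```

Under these assumptions 1GbE tops out at 125 MB/s, below the observed peak, while 2.5GbE offers 312.5 MB/s, which matches the recommendation above.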

On this reference setup I'm currently getting ds4: prefill: 602.78 t/s, generation: 30.10 t/s. The prefill speed is consistent with running fully distributed prefill across DGX Spark and M5 Max, and the generation speed with fully local generation on M5 Max. Best of both worlds!
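To put those rates in wall-clock terms for the example run, here is a back-of-the-envelope estimate. It assumes the prompt is exactly 18,000 tokens (the text only says "around 18k") and that all 1000 tokens requested with -n 1000 are generated; it ignores the handoff time between prefill and decode.

```python
# Back-of-the-envelope timing for the example run.
# PROMPT_TOKENS is an approximation; the source only says "around 18k".

PREFILL_TPS = 602.78    # reported prefill throughput, tokens/s
DECODE_TPS = 30.10      # reported generation throughput, tokens/s
PROMPT_TOKENS = 18_000  # assumed prompt length
GEN_TOKENS = 1_000      # from "-n 1000"


def phase_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time spent in one phase at a constant throughput."""
    return tokens / tokens_per_second


prefill_s = phase_seconds(PROMPT_TOKENS, PREFILL_TPS)
decode_s = phase_seconds(GEN_TOKENS, DECODE_TPS)

if __name__ == "__main__":
    print(f"prefill: {prefill_s:.1f} s, decode: {decode_s:.1f} s, "
          f"total (excluding handoff): {prefill_s + decode_s:.1f} s")
```

With these inputs prefill takes roughly 30 seconds and generation roughly 33 seconds, so neither phase dominates at this prompt length.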

The implementation hands off the KV cache for early layers to the node owning local decode, and catches up the KV cache of early layers once the decode produces tokens, so repeated generation can resume correctly.
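The following toy model is one possible reading of that handoff, not the code in this branch. It assumes the coordinator later receives the early-layer KV entries for the decoded tokens so that its cache stays aligned with the worker's. Every name here (Node, handoff_early_layers, catch_up, N_LAYERS) is hypothetical, and the total layer count is a placeholder.

```python
from dataclasses import dataclass, field

N_LAYERS = 30  # hypothetical total layer count, chosen only for this demo
SPLIT = 22     # first layer owned by the worker, as in "--layers 22:output"


@dataclass
class Node:
    name: str
    owned_layers: range
    kv: dict = field(default_factory=dict)  # layer -> list of per-token entries

    def append_kv(self, layer: int, entries: list) -> None:
        self.kv.setdefault(layer, []).extend(entries)

    def cached_len(self, layer: int) -> int:
        return len(self.kv.get(layer, []))


def kv_entry(layer: int, pos: int) -> str:
    """Stand-in for the real key/value tensors of one token at one layer."""
    return f"kv(L{layer},t{pos})"


def distributed_prefill(coord: Node, worker: Node, n_prompt: int) -> None:
    """Each node fills the cache only for the layers it owns."""
    for node in (coord, worker):
        for layer in node.owned_layers:
            node.append_kv(layer, [kv_entry(layer, t) for t in range(n_prompt)])


def handoff_early_layers(coord: Node, worker: Node) -> None:
    """Copy the early-layer cache to the worker so it can decode alone."""
    for layer in coord.owned_layers:
        worker.kv[layer] = list(coord.kv[layer])


def local_decode(worker: Node, n_new: int) -> None:
    """The worker runs every layer locally, extending all caches."""
    start = worker.cached_len(0)
    for pos in range(start, start + n_new):
        for layer in range(N_LAYERS):
            worker.append_kv(layer, [kv_entry(layer, pos)])


def catch_up(coord: Node, worker: Node) -> None:
    """Assumed direction: send the coordinator the entries it is missing."""
    for layer in coord.owned_layers:
        have = coord.cached_len(layer)
        coord.append_kv(layer, worker.kv[layer][have:])


def demo(n_prompt: int = 8, n_new: int = 3):
    coord = Node("coordinator", range(0, SPLIT))
    worker = Node("worker", range(SPLIT, N_LAYERS))
    distributed_prefill(coord, worker, n_prompt)
    handoff_early_layers(coord, worker)
    local_decode(worker, n_new)
    catch_up(coord, worker)
    return coord, worker
```

After demo() both nodes hold the same number of positions for the early layers, which is the property a follow-up prompt would need in order to resume distributed prefill from the end of the previous generation.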

Two questions for @antirez

Is this something you want to have in DwarfStar at all? It seems useful, but admittedly rather niche.
There's currently a topology limitation in distributed prefill whereby the coordinator must own the early layers. This creates an awkward setup if I want to use the MacBook for local decode (it's 2x faster than the DGX) and also use it as a frontend. Currently I have to run whatever frontend I need on the DGX to get the MacBook to run local decode. I'd love to be able to reverse the topology and run the coordinator with local decode and ownership of the later layers. Is this something you envisage working on? I can give it a go once I am comfortable with this PR.

lobanov · 2026-06-07T15:50:01Z

Completed a full ds4-eval run just now; I had to modify the eval code to work as a coordinator in distributed-prefill local-decode mode. All runs use --tokens 2048 --nothink with the q2-imatrix model, and the code is based on upstream commit 477c0e8.

Results (full results are checked in artifacts/issue-304/ds4-eval/2026-06-06-ds4-eval-*.log):

Fully local Metal on M5 Max: 67/92 passed, 25 failed, runtime 00h:47m
Fully local CUDA on DGX Spark: 69/92 passed, 23 failed, runtime 01h:39m
Distributed prefill CUDA->Metal then local decode on Metal: 65/92 passed, 27 failed, runtime 03h:01m

Interestingly, local decode with distributed prefill performs differently from either engine on its own, with a mild loss of precision of around 3-4% on one-shot benchmarks. This could be within normal variance; I'm doing more runs. I'm also going to evaluate the model on agentic coding tasks and maybe SWE-Bench.
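To gauge whether the gap could be noise, the pass rates and a simple binomial standard error can be computed from the three results above. This sketch treats the 92 questions as independent pass/fail trials, which is an assumption; it is not part of ds4-eval.

```python
import math

# (passed, total) for each configuration, copied from the results above.
RESULTS = {
    "Metal local": (67, 92),
    "CUDA local": (69, 92),
    "CUDA->Metal prefill, Metal decode": (65, 92),
}


def pass_rate(passed: int, total: int) -> float:
    return passed / total


def standard_error(passed: int, total: int) -> float:
    """Binomial standard error of the pass rate, assuming independent questions."""
    p = pass_rate(passed, total)
    return math.sqrt(p * (1.0 - p) / total)


if __name__ == "__main__":
    for name, (passed, total) in RESULTS.items():
        rate = 100.0 * pass_rate(passed, total)
        se = 100.0 * standard_error(passed, total)
        print(f"{name}: {rate:.1f}% passed (standard error about {se:.1f} points)")
```

The pass rates come out at 72.8%, 75.0% and 70.7%, so the distributed run trails by 2.2 and 4.3 percentage points, while the standard error on a single 92-question run is about 4.7 points. Under the independence assumption, the observed gap is within one standard error.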

Run time is a lot longer because of how ds4-eval interacts with the worker owning decode; I have not tried to optimise that yet.

lobanov added 14 commits June 3, 2026 14:45

create plan

9952733

phase 0 and phase 1 completed

116e358

phase 2 complete

9c83dac

refine phase 3 plan

f7a0baa

phase 3.5, route-specificity

85c13af

refine phase 4 scope

c05eed4

phase 4 closure

2ad3829

phases 5+ scope refinement

e9e5a48

remove compiled binaries

5c621a0

phase 5 plan

30c9832

phase 5 first implementation with initial tests

e2bd47a

phase 5 local decode with sampling

b7ed1ec

fix: stale coordinator logits

a9d673b

phase 5 frontend testing and fixes

42f6021

lobanov mentioned this pull request Jun 6, 2026

Support distributed prefill with local generation (add KV cache chunks pipelining) #304

Open

lobanov force-pushed the local-gen-with-dist-prefill branch from 9b00880 to c3be0c2 Compare June 6, 2026 19:50

phase 5 performance eval [WIP]

97d0688

lobanov force-pushed the local-gen-with-dist-prefill branch from c3be0c2 to 97d0688 Compare June 7, 2026 09:57

lobanov wants to merge 15 commits into antirez:main from lobanov:local-gen-with-dist-prefill
