Local decode with distributed prefill #343

Draft
lobanov wants to merge 15 commits into antirez:main from lobanov:local-gen-with-dist-prefill


lobanov commented Jun 6, 2026

This is a draft PR for a work-in-progress feature. The branch is not ready to be merged as is, because it contains a lot of research artifacts and code instrumentation that would likely be unnecessary in the final result. Still, I want to share it with the community in case someone else wants to benchmark it. I would also like feedback on whether this is going in the right direction.

What is it? See #304. This allows running distributed prefill with fully local generation. Unlike standard distributed prefill, it is not intended to let a big model straddle two or more nodes; it is a performance optimisation. For generation to be fully resident, the entire model needs to fit into the memory of the worker owning the output layer, and the --local-decode flag must be passed.

Current status: usable for evaluation, but still largely research-focused work. Generation with local decode works with the ds4 and ds4-server frontends. There is also a work-in-progress implementation of the ds4-eval frontend for quality control. The code compiles for the metal and cuda-spark targets.

Current focus of work: evaluating the quality of local decode after distributed prefill. The official logits gate passes, but I see divergent generation trajectories, so I'm focusing on longer evaluations using a modified ds4-eval. After that I will look at options to speed up the distributed handoff, as I sense lots of room for improvement. Evaluation-wise, the first 4 questions of ds4-eval pass on CUDA->Metal (local decode), but the 2nd question fails on Metal->CUDA (local decode). I'm not sure yet whether that is a bug in the evaluation or whether the inference quality is poorer. Running the full suite is currently blocked by some kind of stability issue: the coordinator running ds4-eval fails after 5-6 questions.

How to run:

  1. Connect two nodes with a high-speed link. I use a USB-C-to-5GbE adapter on the Mac side and a short Cat8 cable plugged into the DGX Spark, but I have not seen traffic of more than 200MB/s, so 2.5GbE with a Cat6 cable should be adequate. 1GbE will be too tight unless you switch to --dist-activation-bits 16 (not tested).
  2. Run the coordinator on the node that will own prefill for the early layers, e.g. ./ds4 --role coordinator --layers 0:21 --listen 0.0.0.0 1234 --prompt-file speed-bench/promessi_sposi_1k.txt --ctx 200000 -n 1000. Here speed-bench/promessi_sposi_1k.txt is the first 1000 lines of speed-bench/promessi_sposi.txt, created with head -n 1000 speed-bench/promessi_sposi.txt > speed-bench/promessi_sposi_1k.txt; it is around 18k tokens. I use a DGX Spark for this role. Only the owned layers (0-21 in this case) are loaded into this instance.
  3. Run the worker on the node that will own prefill for the later layers and local generation, e.g. ./ds4 --role worker --layers 22:output --coordinator <coordinator_ip> 1234 --local-decode --ctx 200000. I use a MacBook M5 Max 128GB for that. Note that the worker with --local-decode must have enough memory to fit the entire model plus context.
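The link-speed reasoning in step 1 can be checked with a little arithmetic. The sketch below compares the raw line rate of common Ethernet speeds against the 200MB/s peak mentioned above. It ignores protocol overhead, so real throughput will be somewhat lower, and the function names are only illustrative.

```python
# Compare raw Ethernet line rates with the observed peak traffic.
# Assumption: decimal units (1 Gbit/s = 1000 Mbit/s, 8 bits per byte),
# and no allowance for protocol overhead.

OBSERVED_PEAK_MB_PER_S = 200.0  # peak traffic reported in step 1


def link_capacity_mb_per_s(gbit_per_s: float) -> float:
    """Raw line rate of a link in MB/s."""
    return gbit_per_s * 1000 / 8


def headroom_mb_per_s(gbit_per_s: float,
                      observed: float = OBSERVED_PEAK_MB_PER_S) -> float:
    """Positive when the link can carry the observed peak, negative otherwise."""
    return link_capacity_mb_per_s(gbit_per_s) - observed


if __name__ == "__main__":
    for gbit in (1.0, 2.5, 5.0):
        capacity = link_capacity_mb_per_s(gbit)
        verdict = "ok" if headroom_mb_per_s(gbit) > 0 else "too tight"
        print(f"{gbit:>4} GbE: {capacity:6.1f} MB/s raw -> {verdict}")
```

At raw line rate, 1GbE tops out at 125 MB/s, below the observed peak, while 2.5GbE offers 312.5 MB/s.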

On this reference setup I'm currently getting ds4: prefill: 602.78 t/s, generation: 30.10 t/s. Prefill speed is consistent with running fully distributed prefill across DGX Spark and M5 Max, and generation speed is consistent with fully local generation on M5 Max. Best of both worlds!

The implementation hands off the KV cache for the early layers to the node owning local decode, and catches up the KV cache of the early layers once decode has produced tokens, so repeated generation can resume correctly.
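To make the bookkeeping concrete, here is a toy model of the handoff and catch-up described above. It tracks only how many tokens each layer's KV cache covers on each node. Everything in it is hypothetical: the Node class, the function names and the 4-layer split are made up for illustration and are not taken from the ds4 code. It also assumes that the cache being caught up is the coordinator's copy of the early layers, and it does not model how that catch-up is actually performed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node: maps layer index -> number of tokens cached."""
    name: str
    kv_len: dict = field(default_factory=dict)


def prefill(node: Node, layers, n_tokens: int) -> None:
    # Each node fills the KV cache only for the layers it owns.
    for layer in layers:
        node.kv_len[layer] = n_tokens


def hand_off(src: Node, dst: Node, layers) -> None:
    # The decoding node receives the early-layer cache from the other node.
    for layer in layers:
        dst.kv_len[layer] = src.kv_len[layer]


def local_decode(node: Node, n_new: int) -> None:
    # Decoding extends the cache of every layer held by the decoding node.
    for layer in node.kv_len:
        node.kv_len[layer] += n_new


def catch_up(stale: Node, fresh: Node, layers) -> None:
    # Assumed direction: bring the stale early-layer cache up to date.
    for layer in layers:
        stale.kv_len[layer] = fresh.kv_len[layer]


def simulate(prompt_tokens: int, new_tokens: int, early=(0, 1), late=(2, 3)):
    coordinator, worker = Node("coordinator"), Node("worker")
    prefill(coordinator, early, prompt_tokens)
    prefill(worker, late, prompt_tokens)
    hand_off(coordinator, worker, early)   # worker now holds all layers
    local_decode(worker, new_tokens)       # coordinator's cache is now behind
    stale_by = worker.kv_len[early[0]] - coordinator.kv_len[early[0]]
    catch_up(coordinator, worker, early)
    return coordinator, worker, stale_by


if __name__ == "__main__":
    coordinator, worker, stale_by = simulate(18_000, 1_000)
    print("coordinator was behind by", stale_by, "tokens before catch-up")
    print("coordinator:", coordinator.kv_len)
    print("worker:     ", worker.kv_len)
```

In this toy run, after an 18k-token prompt and 1000 generated tokens the early-layer cache on the non-decoding node is 1000 tokens behind until the catch-up step, which is the gap that has to be closed before a follow-up prompt can be prefilled.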

Two questions for @antirez

  1. Is this something you want to have in DwarfStar at all? It seems useful, but admittedly rather niche.
  2. There's currently a topology limitation in distributed prefill whereby the coordinator must own the early layers. This creates an awkward setup if I want to use the MacBook for local decode (it's 2x faster than the DGX) and also use it as a frontend. Currently I have to run whatever frontend I need on the DGX to get the MacBook to run local decode. I'd love to be able to reverse the topology and run the coordinator with local decode and ownership of the later layers. Is this something you envisage working on? I can give it a go once I'm comfortable with this PR.

@lobanov lobanov force-pushed the local-gen-with-dist-prefill branch from c3be0c2 to 97d0688 Compare June 7, 2026 09:57
lobanov commented Jun 7, 2026

Completed a full ds4-eval run just now. I had to modify the eval code to work as a coordinator in distributed-prefill, local-decode mode. All runs use --tokens 2048 --nothink with the q2-imatrix model, and the code is based on upstream commit 477c0e8.

Results (full logs are checked in under artifacts/issue-304/ds4-eval/2026-06-06-ds4-eval-*.log):

  1. Fully local Metal on M5 Max: 67/92 passed, 25 failed, runtime 00h:47m
  2. Fully local CUDA on DGX Spark: 69/92 passed, 23 failed, runtime 01h:39m
  3. Distributed prefill CUDA->Metal then local decode on Metal: 65/92 passed, 27 failed, runtime 03h:01m

Interestingly, local decode with distributed prefill scores differently from either engine on its own, with a mild loss of roughly 2-4 percentage points on one-shot benchmarks (65/92 versus 67/92 and 69/92). This could be within normal variance; I'm doing more runs. I'm also going to evaluate the model on agentic coding tasks and maybe SWE-Bench.
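As a rough sanity check on the "within normal variance" question, the pass rates and a simple binomial standard error can be computed from the counts above. This is only a sketch: it treats the 92 questions as independent pass/fail trials, which is an assumption, and the dictionary keys are labels invented here.

```python
import math

TOTAL_QUESTIONS = 92

# Pass counts taken from the results listed above.
PASSED = {
    "metal_local": 67,
    "cuda_local": 69,
    "dist_prefill_local_decode": 65,
}


def pass_rate(passed: int, total: int = TOTAL_QUESTIONS) -> float:
    return passed / total


def binomial_std_error(passed: int, total: int = TOTAL_QUESTIONS) -> float:
    """Standard error of a pass rate, assuming independent questions."""
    p = passed / total
    return math.sqrt(p * (1 - p) / total)


if __name__ == "__main__":
    baseline = pass_rate(PASSED["dist_prefill_local_decode"])
    for name, passed in PASSED.items():
        rate = pass_rate(passed)
        se = binomial_std_error(passed)
        gap = rate - baseline
        print(f"{name:28s} {rate:6.1%}  +/- {se:5.1%}  gap {gap:+6.1%}")
```

Under that assumption the standard error of a single run is a bit under 5 percentage points, so gaps of about 2 and 4 points between runs do not by themselves separate the configurations.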

Run time is a lot longer because of how ds4-eval interacts with the worker owning decode; I have not tried to optimize that yet.
