Local decode with distributed prefill #343
Completed full Results (full results are checked in
Interestingly, local decode with distributed prefill performs differently than either engine does on its own, with a mild loss of precision of around 3-4% on one-shot benchmarks. This could be within normal variance; I'm doing more runs. I'm also going to evaluate the model on agentic coding tasks and maybe SWE-Bench. Run time is a lot longer because of how `ds4-eval` interacts with the worker owning decode; I have not tried to optimize that yet.
This is a draft PR for a work-in-progress feature. This branch is not ready to be merged as it is, because it contains a lot of research artifacts and code instrumentation that would likely be unnecessary in the final result. Still, I want to share it with the community in case someone else wants to benchmark it. I also need some feedback on whether this is going in the right direction.
What is it? See #304. This allows running distributed prefill with fully local generation. Unlike standard distributed prefill, this is not intended to run a big model straddled across two or more nodes; it is a performance optimisation. The entire model needs to fit into the memory of the worker owning the `output` layer for the generation to be fully resident (with the `--local-decode` flag passed).

Current status: usable for evaluation, but still largely research-focused work. Generation with local decode works with the `ds4` and `ds4-server` frontends. There is also a work-in-progress implementation of the `ds4-eval` frontend for quality controls. Code compiles for the `metal` and `cuda-spark` targets.

Current focus of work: evaluating the quality of local decode after distributed prefill. The official logits gate passes, but I see divergent generation trajectories. I'm focusing on longer evaluations using a modified `ds4-eval`. After that I will look at options to speed up the distributed handoff, as I sense lots of opportunities for improvement. Evaluation-wise, the first 4 questions of `ds4-eval` pass on CUDA->Metal (local decode), but the 2nd question fails on Metal->CUDA (local decode). I'm not sure yet whether it's a bug in the evaluation or the inference quality is poorer. Running the full suite is currently blocked by some kind of stability issue: the coordinator running `ds4-eval` fails after 5-6 questions.

How to run:

- `--dist-activation-bits 16` (not tested).
- `./ds4 --role coordinator --layers 0:21 --listen 0.0.0.0 1234 --prompt-file speed-bench/promessi_sposi_1k.txt --ctx 200000 -n 1000`, where `speed-bench/promessi_sposi_1k.txt` is the first 1000 lines of `speed-bench/promessi_sposi.txt`, created by `head -n 1000 speed-bench/promessi_sposi.txt > speed-bench/promessi_sposi_1k.txt`; it's around 18k tokens. I use a DGX Spark for this role. Only the owned layers (0-21 in this case) will be loaded into this instance.
- `./ds4 --role worker --layers 22:output --coordinator <coordinator_ip> 1234 --local-decode --ctx 200000`. I use a MacBook M5 Max 128GB for that. Note that the worker with `--local-decode` must have enough memory to fit the entire model plus context.

On this reference setup I'm currently getting `ds4: prefill: 602.78 t/s, generation: 30.10 t/s`. Prefill speed is consistent with fully distributed prefill across the DGX Spark and M5 Max, and generation speed with fully local generation on the M5 Max. Best of both worlds!

The implementation hands off the KV cache for the early layers to the node owning local decode, and catches up the KV cache of the early layers once the decode has produced tokens, so repeated generation can resume correctly.
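For readers who want a mental model of the handoff and catch-up, here is a minimal toy sketch in Python. It is not the code in this branch: the `Node` class, the function names, the total layer count of 30, and the reading that the catch-up flows back to the coordinator are all assumptions made for illustration, and the "KV cache" is just a list of token positions per layer.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy node (hypothetical): per-layer KV cache as a list of token positions."""
    name: str
    owned: range                            # layers this node computes in prefill
    kv: dict = field(default_factory=dict)  # layer index -> cached positions

    def cached_len(self, layer: int) -> int:
        return len(self.kv.get(layer, []))


def prefill(coordinator: Node, worker: Node, n_prompt: int) -> None:
    # Distributed prefill: each node fills KV only for the layers it owns.
    for node in (coordinator, worker):
        for layer in node.owned:
            node.kv.setdefault(layer, []).extend(range(n_prompt))


def handoff_to_decoder(coordinator: Node, worker: Node) -> None:
    # Copy the early-layer KV to the worker so decode can run fully locally.
    for layer in coordinator.owned:
        worker.kv[layer] = list(coordinator.kv[layer])


def local_decode(worker: Node, n_layers: int, n_new: int) -> None:
    # The worker runs every layer itself, so it extends KV for all layers.
    for layer in range(n_layers):
        start = worker.cached_len(layer)
        worker.kv.setdefault(layer, []).extend(range(start, start + n_new))


def catch_up(coordinator: Node, worker: Node) -> None:
    # Assumed direction: send back only the positions the coordinator lacks.
    for layer in coordinator.owned:
        have = coordinator.cached_len(layer)
        coordinator.kv[layer].extend(worker.kv[layer][have:])


def run_demo(n_layers: int = 30, split: int = 22,
             n_prompt: int = 8, n_new: int = 3) -> dict:
    # n_layers=30 is an arbitrary toy value; split=22 mirrors 0:21 / 22:output.
    coordinator = Node("coordinator", range(0, split))
    worker = Node("worker", range(split, n_layers))
    prefill(coordinator, worker, n_prompt)
    handoff_to_decoder(coordinator, worker)
    local_decode(worker, n_layers, n_new)
    before = coordinator.cached_len(0)
    catch_up(coordinator, worker)
    return {
        "coordinator_before": before,
        "coordinator_after": coordinator.cached_len(0),
        "worker_min": min(worker.cached_len(l) for l in range(n_layers)),
    }


if __name__ == "__main__":
    print(run_demo())
```

In this toy model the coordinator's early-layer cache lags by exactly the number of decoded tokens until `catch_up` runs, which is the gap that has to be closed before a follow-up prefill can resume from the right position.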
Two questions for @antirez