Make generate() async to serialize back-to-back turns #80
Open
stikves wants to merge 3 commits into
InferenceEngine.generate() is now async throws instead of throws. The pipelined engine awaits the prior generation Task before starting a new one, preventing the fatalError on rapid multi-turn conversations. The serialization preserves KV cache state -- prefix caching handles reuse automatically across turns. No data is lost; the engine just waits for the GPU pipeline to drain before restarting. Adds stress tests: back-to-back turns, rapid-fire 10-turn, and generate-after-cancel sequences.
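Because generate() gains the async effect, existing call sites need `try await` and an async context. A minimal migration sketch follows. It assumes a hypothetical TextGenerating protocol and an EchoEngine stand-in, since the real parameters and return type of generate() are not shown here:

```swift
// Hypothetical protocol: only the `async throws` effect is taken from the PR;
// the parameter and return type are placeholders.
protocol TextGenerating {
    func generate(prompt: String) async throws -> String
}

// Trivial stand-in so the sketch runs without a model.
struct EchoEngine: TextGenerating {
    func generate(prompt: String) async throws -> String {
        "echo: \(prompt)"
    }
}

// Before the change a caller would write `try engine.generate(...)`.
// After it, each turn is awaited, so the next call is issued only
// once the previous call has returned.
func runTurns<E: TextGenerating>(_ engine: E, prompts: [String]) async throws -> [String] {
    var replies: [String] = []
    for prompt in prompts {
        let reply = try await engine.generate(prompt: prompt)
        replies.append(reply)
    }
    return replies
}
```

Callers that are still synchronous would need to hop into an async context first, for example by wrapping the call in a Task.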
Contributor
Thanks for looking into this! The root-cause tracing is helpful :)
Summary
Fixes a crash when reusing an engine for rapid multi-turn conversations:
The InferenceEngine.generate() method is now async throws (was throws). The pipelined engine awaits the prior generation's Task before starting a new one, preventing the fatalError on rapid back-to-back calls.
Root cause
PR #64 removed the per-turn engine.reset() call to enable multi-turn KV cache reuse. That reset() also served as the serialization point between consecutive turns (it called drain() internally). Without it, a new generate() call can race with the prior generation's GPU pipeline drain.
Fix
The pipelined engine now cancels and awaits any in-flight generation at the top of generate(). This preserves KV cache state; prefix caching handles reuse automatically across turns.
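The cancel-and-await step described above can be sketched roughly as follows. This is an illustration, not the PR's actual code: the actor PipelinedEngineSketch, the turn parameter, drain(), and the simulated decode loop are all hypothetical. Only the pattern of cancelling and awaiting the prior Task at the top of an async throws generate() comes from the description.

```swift
// Hypothetical stand-in for the pipelined engine. Only the pattern (cancel and
// await the prior Task at the top of an `async throws` generate()) is from the PR.
actor PipelinedEngineSketch {
    private var inFlight: Task<Void, Never>?   // most recent generation, if any
    private var isRunning = false
    private(set) var overlapDetected = false   // true if two turns ever ran at once
    private(set) var finishedTurns: [Int] = []

    func generate(turn: Int) async throws {
        // Serialization point: stop the previous turn and wait for it to drain.
        if let previous = inFlight {
            previous.cancel()
            await previous.value
        }
        // Give up if the caller itself was cancelled while waiting.
        try Task.checkCancellation()
        // Start the new turn in the background and return to the caller.
        inFlight = Task { await self.decode(turn: turn) }
    }

    // Waits for the current turn, if any, to finish.
    func drain() async {
        if let current = inFlight { await current.value }
    }

    // Simulated decode loop standing in for GPU work; stops early when cancelled.
    private func decode(turn: Int) async {
        if isRunning { overlapDetected = true }
        isRunning = true
        for _ in 0..<5 {
            if Task.isCancelled { break }
            try? await Task.sleep(nanoseconds: 1_000_000)
        }
        isRunning = false
        finishedTurns.append(turn)
    }
}
```

Actors are reentrant at each await, so this sketch assumes turns are issued by a single caller in sequence. Concurrent callers could both pass the wait on the same prior Task and would need extra coordination.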
Test plan
- CancelAPITests pass
- UnifiedGenerationAPITests pass (prefix caching still works)