perf(stream): render ghost text while the model is still decoding #687
Merged
Generation was single-shot end to end: the llama decode loop accumulated pieces privately and the FM engine consumed Apple's cumulative snapshots internally, so nothing reached the overlay until the full completion finished and perceived latency was prefill plus the entire decode. Both engines now forward cumulative, normalized partial results through a new streaming variant of the generation contract, and the coordinator paints them as real acceptable sessions: coalesced to one render per runloop turn, monotonic by policy (reordered hops and normalizer rewrites can never shrink visible text), guarded by the same work-id and materialize checks the final apply uses, and frozen at whatever was streamed if the user Tabs mid-decode. The final result remains authoritative and flows through the unchanged apply path.
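The "one render per runloop turn" coalescing can be pictured with a minimal sketch. It is not the coordinator's actual code: `PartialRenderCoalescer`, `enqueue`, and the injectable `schedule` parameter are hypothetical names introduced here, and the only detail taken from the PR is that the drain is scheduled with `DispatchQueue.main.async`. The sketch assumes it is called on the main thread and is written for the Swift 5 language mode.

```swift
import Dispatch

/// Hypothetical coalescer: remembers only the newest partial and schedules
/// at most one drain per main-queue turn, so token-rate deliveries cannot
/// stack layout work.
final class PartialRenderCoalescer {
    typealias Scheduler = (@escaping () -> Void) -> Void

    private var pending: String?
    private var drainScheduled = false
    private let schedule: Scheduler
    private let render: (String) -> Void

    /// `schedule` defaults to a main-queue hop; tests can inject a manual one.
    init(schedule: @escaping Scheduler = { work in DispatchQueue.main.async { work() } },
         render: @escaping (String) -> Void) {
        self.schedule = schedule
        self.render = render
    }

    /// Assumed to be called on the main thread only.
    func enqueue(_ partial: String) {
        pending = partial                      // newer partial overwrites older one
        guard !drainScheduled else { return }  // a drain is already queued
        drainScheduled = true
        schedule { [weak self] in self?.drain() }
    }

    private func drain() {
        drainScheduled = false
        guard let latest = pending else { return }
        pending = nil
        render(latest)                         // one render per drain
    }
}
```

Injecting the scheduler keeps the sketch testable without spinning a run loop: several `enqueue` calls before the drain produce a single render of the newest text.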
Summary
This is the largest perceived-latency lever from the performance research pass. Generation was single-shot end to end: the llama decode loop accumulated pieces privately, the FM engine consumed Apple's cumulative snapshots internally (with a comment explicitly deferring partial rendering), and nothing reached the overlay until the entire completion finished. Perceived time-to-ghost-text was therefore prefill plus the full decode (~250ms-1.3s depending on model); with streaming it becomes prefill plus the first words.
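A rough sketch of the contract change follows: a streaming variant sits next to the single-shot entry point, with a default implementation so that existing conformers need no changes. The protocol and type names (`TextGenerating`, `SingleShotFake`, `WordStreamingFake`) and the exact signature are illustrative assumptions, not the real `LlamaRuntimeGenerating` declaration.

```swift
/// Hypothetical generation contract with a streaming variant.
protocol TextGenerating {
    /// Single-shot entry point: returns only the finished completion.
    func generate(prompt: String) async throws -> String

    /// Streaming variant: reports cumulative partial text while decoding,
    /// then returns the authoritative final completion.
    func generate(prompt: String,
                  onPartial: (@Sendable (String) -> Void)?) async throws -> String
}

extension TextGenerating {
    /// Default implementation: conformers that only know the single-shot
    /// path keep compiling and simply never emit partials.
    func generate(prompt: String,
                  onPartial: (@Sendable (String) -> Void)?) async throws -> String {
        try await generate(prompt: prompt)
    }
}

/// A fake that implements only the single-shot requirement.
struct SingleShotFake: TextGenerating {
    func generate(prompt: String) async throws -> String { "done" }
}

/// A fake that streams one cumulative partial per word.
struct WordStreamingFake: TextGenerating {
    let words: [String]

    func generate(prompt: String) async throws -> String {
        words.joined(separator: " ")
    }

    func generate(prompt: String,
                  onPartial: (@Sendable (String) -> Void)?) async throws -> String {
        var cumulative = ""
        for word in words {
            cumulative += cumulative.isEmpty ? word : " " + word
            onPartial?(cumulative)   // cumulative text, not a delta
        }
        return cumulative
    }
}
```

Passing `nil` for `onPartial` leaves both fakes behaving like the old single-shot path, which matches the "degrade gracefully" behaviour the review describes.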
How it works, layer by layer:
LlamaRuntimeCore's decode loop reports the cumulative raw completion after each sampled token; LlamaRuntimeManager exposes it as a streaming generate variant on LlamaRuntimeGenerating (default implementation keeps every fake compiling).
LlamaSuggestionEngine normalizes each cumulative partial through the same SuggestionTextNormalizer path as final results and forwards non-empty ones to the main actor. FoundationModelSuggestionEngine does the same inline from Apple's snapshot stream.
SuggestionEngineRouter threads the hook through both engines and the locale-fallback path.
… state), so the user can Tab into a stream the moment the first words appear, and accepting bumps the work id, which freezes the suggestion at what was streamed. Renders are coalesced to at most one per runloop turn (token-rate deliveries cannot stack layout work), monotonic via the new pure StreamedGhostTextPolicy (reordered main-actor hops and normalizer rewrites can never shrink visible text), and guarded by the same work-id and context-materialize staleness checks the final apply uses. The final result remains authoritative and flows through the unchanged apply path.
Validation
New tests:
StreamedGhostTextPolicyTests (monotonicity incl. stale-shorter, equal-redundant, and divergent-rewrite cases) and LlamaSuggestionEngineStreamingTests (cumulative partials normalized and forwarded with the request generation; the single-shot entry point provably never touches the streaming runtime path).
Linked issues
Refs #661
Risk / rollout notes
… LlamaRuntimeGenerating and the runtime files); whichever lands second rebases mechanically.
🤖 Generated with Claude Code
Greptile Summary
This PR wires streaming ghost text end-to-end: the llama decode loop and Apple's Foundation Model stream now forward cumulative normalized partials to the coordinator while decoding, so the first words of a suggestion appear after prefill + first tokens rather than after the full decode.
LlamaRuntimeCore calls onPartialRawText after each sampled token; LlamaSuggestionEngine hops each partial to the main actor, normalizes it, and delivers it via the new onPartial callback. FoundationModelSuggestionEngine does the same inline from Apple's snapshot stream. Both engines degrade gracefully when onPartial is nil.
… DispatchQueue.main.async drain and guarded by work-id, context-generation, and StreamedGhostTextPolicy monotonicity checks before being committed as real, acceptable sessions.
… SuggestionGenerating and LlamaRuntimeGenerating route non-streaming callers to the single-shot path; the existing apply path for the authoritative final result is untouched.
Confidence Score: 5/5
Safe to merge. Every render path is guarded by work-id checks, context-generation matching, and the monotonic extension policy, so stale or reordered partials are silently dropped and the authoritative final result still owns the last word.
The streaming pipeline adds multiple independent staleness guards that together make incorrect renders impossible. The final apply path is unchanged. The two noted issues, a stale file comment and a missing @Sendable annotation on an internal implementation method, have no runtime impact.
… LlamaRuntimeCore.swift for the missing @Sendable annotation on the onPartialRawText closure parameter.
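The monotonicity guard can be sketched as a pure function covering the three cases the tests name (stale-shorter, equal-redundant, divergent-rewrite). This is an assumption-laden illustration, not the real StreamedGhostTextPolicy: the name `GhostTextExtensionPolicy`, its signature, and the choice to reject every divergent rewrite (even a longer one) are guesses made for the sketch.

```swift
/// Hypothetical monotonic-extension policy: an incoming partial may replace
/// the visible ghost text only if it strictly extends it.
enum GhostTextExtensionPolicy {
    static func shouldRender(_ incoming: String, over visible: String?) -> Bool {
        // Nothing to paint.
        guard !incoming.isEmpty else { return false }
        // Nothing painted yet: the first non-empty partial is accepted.
        guard let visible = visible, !visible.isEmpty else { return true }
        // Equal text is redundant; shorter or divergent text would shrink or
        // rewrite what the user already sees, so both are dropped.
        return incoming != visible && incoming.hasPrefix(visible)
    }
}
```

Because the function is pure, each case reduces to a single call: `shouldRender(" wor", over: " world ag")` is false (stale-shorter) and `shouldRender(" world ag", over: " wor")` is true (extension).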
Important Files Changed
… onPartialRawText through the decode loop on the detached thread. The closure parameter is missing @Sendable here (though enforced at the LlamaRuntimeManager boundary).
Sequence Diagram
sequenceDiagram
    participant RT as LlamaRuntimeCore (decode thread)
    participant SE as LlamaSuggestionEngine
    participant CO as SuggestionCoordinator
    participant OV as Overlay
    CO->>SE: generateSuggestion(for:onPartial:)
    loop each sampled token
        RT-->>SE: onPartialRawText(cumulativeRaw)
        SE-->>CO: "Task @MainActor onPartial(normalized)"
        CO->>CO: work-id + generation + monotonicity guards
        CO->>OV: presentOverlay(partial.text)
    end
    SE-->>CO: return SuggestionResult (final)
    CO->>OV: presentOverlay(final.text)
Comments Outside Diff (1)
CotabbyTests/LlamaSuggestionEngineStreamingTests.swift, line 685-689
drainUntil polls with up to 200 × 2 ms = 400 ms of wall time. On a loaded CI runner this budget can be exhausted before the main-actor Task { @MainActor in } deliveries complete, causing the assertion to fail with [] != [" wor", " world ag"] rather than an explicit timeout message. A continuation-based approach (e.g. wrapping the onPartial callback to fulfill a checked continuation once N partials are received) would be deterministic and produce a clear failure on timeout.
Reviews (3): Last reviewed commit: "review: make queueStreamedPartial privat..."
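The continuation-based wait suggested in the comment above could look roughly like the following. It is a sketch under assumptions: `PartialCollector`, `PartialTimeout`, and `waitForPartials` are hypothetical names that do not appear in the PR, and the sketch supports only one waiter at a time.

```swift
/// Hypothetical error carrying what was actually received, so a timeout
/// produces an explicit message instead of a bare array mismatch.
struct PartialTimeout: Error, CustomStringConvertible {
    let expected: Int
    let received: [String]
    var description: String {
        "timed out waiting for \(expected) partials; received \(received)"
    }
}

/// Hypothetical collector: records partials and resumes a single waiter
/// once the expected number has arrived, or fails it on timeout.
actor PartialCollector {
    private var received: [String] = []
    private var waiter: (count: Int, continuation: CheckedContinuation<[String], Error>)?

    func record(_ partial: String) {
        received.append(partial)
        resumeIfSatisfied()
    }

    func waitForPartials(count: Int, timeoutNanoseconds: UInt64) async throws -> [String] {
        let timeoutTask = Task {
            // A cancelled sleep throws; in that case the wait already finished.
            do { try await Task.sleep(nanoseconds: timeoutNanoseconds) } catch { return }
            await self.failWaiter()
        }
        defer { timeoutTask.cancel() }
        return try await withCheckedThrowingContinuation { continuation in
            waiter = (count: count, continuation: continuation)
            resumeIfSatisfied()   // covers partials recorded before the wait began
        }
    }

    private func resumeIfSatisfied() {
        guard let pending = waiter, received.count >= pending.count else { return }
        waiter = nil
        pending.continuation.resume(returning: received)
    }

    private func failWaiter() {
        guard let pending = waiter else { return }
        waiter = nil
        pending.continuation.resume(
            throwing: PartialTimeout(expected: pending.count, received: received))
    }
}
```

A test would forward each `onPartial` delivery into `record` and then `try await collector.waitForPartials(count: 2, timeoutNanoseconds: ...)`. The wait returns as soon as the second partial lands, and a slow runner surfaces as a `PartialTimeout` that lists what did arrive.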