perf(stream): render ghost text while the model is still decoding by FuJacob · Pull Request #687 · FuJacob/cotabby

FuJacob · 2026-06-12T02:05:21Z

Summary

This is the largest perceived-latency lever from the performance research pass. Generation was single-shot end to end: the llama decode loop accumulated pieces privately, the Foundation Model (FM) engine consumed Apple's cumulative snapshots internally (with a comment explicitly deferring partial rendering), and nothing reached the overlay until the entire completion finished. Perceived time-to-ghost-text was therefore prefill plus the full decode (roughly 250 ms to 1.3 s, depending on the model); with streaming it becomes prefill plus the time to decode the first words.

How it works, layer by layer:

LlamaRuntimeCore's decode loop reports the cumulative raw completion after each sampled token; LlamaRuntimeManager exposes it as a streaming generate variant on LlamaRuntimeGenerating (default implementation keeps every fake compiling).
LlamaSuggestionEngine normalizes each cumulative partial through the same SuggestionTextNormalizer path as final results and forwards non-empty ones to the main actor. FoundationModelSuggestionEngine does the same inline from Apple's snapshot stream. SuggestionEngineRouter threads the hook through both engines and the locale-fallback path.
The coordinator renders partials as REAL sessions, not cosmetic overlays: acceptance gates on the live session (never on state), so the user can Tab into a stream the moment the first words appear, and accepting bumps the work id which freezes the suggestion at what was streamed. Renders are coalesced to at most one per runloop turn (token-rate deliveries cannot stack layout work), monotonic via the new pure StreamedGhostTextPolicy (reordered main-actor hops and normalizer rewrites can never shrink visible text), and guarded by the same work-id and context-materialize staleness checks the final apply uses. The final result remains authoritative and flows through the unchanged apply path.
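The coalescing step described above can be sketched as follows. This is a minimal illustration, not the coordinator's actual code: `PartialCoalescer`, `queue`, and the injected `schedule` closure are hypothetical names, and in the app the scheduled hop would be the main-queue async drain the review summary mentions.

```swift
/// Hypothetical coalescer: remembers only the newest partial and schedules
/// at most one drain at a time, so token-rate deliveries cannot stack renders.
final class PartialCoalescer {
    private var pending: String?
    private var drainScheduled = false
    private let schedule: (@escaping () -> Void) -> Void
    private let render: (String) -> Void

    /// `schedule` defers work to a later turn (assumed: a main-queue async hop).
    init(schedule: @escaping (@escaping () -> Void) -> Void,
         render: @escaping (String) -> Void) {
        self.schedule = schedule
        self.render = render
    }

    /// Called once per delivered partial; newer text overwrites older text.
    func queue(_ partial: String) {
        pending = partial
        guard !drainScheduled else { return }  // a drain is already on its way
        drainScheduled = true
        schedule { [weak self] in self?.drain() }
    }

    /// Renders the newest pending partial, if any, and re-arms scheduling.
    private func drain() {
        drainScheduled = false
        guard let latest = pending else { return }
        pending = nil
        render(latest)
    }
}
```

Injecting the scheduler keeps the sketch testable without a run loop: three partials queued back to back produce one scheduled drain and one render of the newest text.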

Validation

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build-for-testing \
  -derivedDataPath build/DerivedData CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO
# ** TEST BUILD SUCCEEDED **

xcodebuild ... test-without-building \
  -only-testing:CotabbyTests/StreamedGhostTextPolicyTests \
  -only-testing:CotabbyTests/LlamaSuggestionEngineStreamingTests \
  -only-testing:CotabbyTests/LlamaSuggestionEngineCancellationTests \
  -only-testing:CotabbyTests/SuggestionCoordinatorAcceptanceTests \
  -only-testing:CotabbyTests/SuggestionSessionReconcilerTests
# 121 tests, 0 failures (cancel-must-not-wipe-KV and the 101 reconciler invariants stay green)

swiftlint lint --quiet
# exit 0

New tests: StreamedGhostTextPolicyTests (monotonicity incl. stale-shorter, equal-redundant, and divergent-rewrite cases) and LlamaSuggestionEngineStreamingTests (cumulative partials normalized and forwarded with the request generation; the single-shot entry point provably never touches the streaming runtime path).
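For readers who have not opened the diff, the monotonicity rule those tests exercise can be sketched roughly as below. The enum name, method name, and signature are assumptions made for illustration; only the rule itself (strict length increase plus prefix check) comes from the review's file overview.

```swift
/// Hypothetical stand-in for StreamedGhostTextPolicy; the real API is not
/// shown in this description.
enum StreamedGhostTextPolicySketch {
    /// A candidate may replace the visible partial only if it strictly
    /// extends it: longer, and starting with the text already on screen.
    static func shouldRender(candidate: String, over visible: String?) -> Bool {
        guard let visible = visible else {
            // Nothing on screen yet: any non-empty partial may render.
            return !candidate.isEmpty
        }
        return candidate.count > visible.count && candidate.hasPrefix(visible)
    }
}
```

Under this rule a stale shorter partial, a redundant equal partial, and a divergent rewrite are all rejected during the stream, which matches the three cases the new policy tests name.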

Linked issues

Refs #661

Risk / rollout notes

A partial can render and then the authoritative final can shrink it once at completion (the normalizer's final pass wins). The monotonic policy prevents shrinking DURING the stream; the single end-of-stream correction is the show-then-correct tradeoff and in practice the final usually extends the last partial.
Typing the exact next character mid-stream advances the session in place and cancels the in-flight work (existing reconciler behavior), so the ghost stops extending at that point rather than waiting for the rest of the decode. Versus the old behavior (nothing visible until complete), the user strictly sees text sooner.
Mid-stream non-keyboard edits (mouse paste) stop partial rendering via the materialize-generation check at the next drain, and the poll-driven reconciler tears the session down exactly as it does for completed suggestions today.
This PR will conflict textually with perf(runtime): llama prewarm-on-focus, mid-prefill abort, and KV-reuse visibility #681 (both extend LlamaRuntimeGenerating and the runtime files); whichever lands second rebases mechanically.
Quality guardrail note: normalization runs on every partial through the same code path as finals, and nothing about sampling, stop logic, or prompts changed. A llama-side golden eval harness (seeded decode) remains the right follow-up before tuning any of those.
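The interplay between the work id, mid-stream acceptance, and the authoritative final can be illustrated with a small model. Everything here (`GhostSession`, its methods, the integer work id) is a hypothetical simplification of the behavior described in these notes, not the coordinator's real types.

```swift
/// Hypothetical model of one ghost-text session guarded by a work id.
struct GhostSession {
    private(set) var workID = 0
    private(set) var visibleText: String?

    /// Starts a new generation; earlier work ids become stale.
    @discardableResult
    mutating func beginWork() -> Int {
        workID += 1
        visibleText = nil
        return workID
    }

    /// Partials render only for the current work id and only if they
    /// strictly extend what is already visible.
    mutating func applyPartial(_ text: String, workID id: Int) -> Bool {
        guard id == workID, !text.isEmpty else { return false }
        if let visible = visibleText,
           !(text.count > visible.count && text.hasPrefix(visible)) {
            return false
        }
        visibleText = text
        return true
    }

    /// Accepting bumps the work id, so later partials and the final are
    /// dropped and the suggestion stays frozen at what was streamed.
    mutating func accept() -> String? {
        workID += 1
        return visibleText
    }

    /// The final is authoritative: it may shrink the last partial once.
    mutating func applyFinal(_ text: String, workID id: Int) -> Bool {
        guard id == workID else { return false }
        visibleText = text
        return true
    }
}
```

The sketch shows both tradeoffs from the notes above: a final that is shorter than the last partial still wins at completion, and anything arriving after an accept is discarded as stale.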

🤖 Generated with Claude Code

Greptile Summary

This PR wires streaming ghost text end-to-end: the llama decode loop and Apple's Foundation Model stream now forward cumulative normalized partials to the coordinator while decoding, so the first words of a suggestion appear after prefill + first tokens rather than after the full decode.

Decode loop → coordinator pipeline: LlamaRuntimeCore calls onPartialRawText after each sampled token; LlamaSuggestionEngine hops each partial to the main actor, normalizes it, and delivers it via the new onPartial callback. FoundationModelSuggestionEngine does the same inline from Apple's snapshot stream. Both engines degrade gracefully when onPartial is nil.
Coordinator rendering: partials are coalesced to one render per runloop turn via a DispatchQueue.main.async drain and guarded by work-id, context-generation, and StreamedGhostTextPolicy monotonicity checks before being committed as real, acceptable sessions.
Backward compatibility: default protocol implementations on both SuggestionGenerating and LlamaRuntimeGenerating route non-streaming callers to the single-shot path; the existing apply path for the authoritative final result is untouched.
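The backward-compatibility point can be shown with a reduced protocol. The names and signatures below are illustrative assumptions (the PR's real protocol requirements are not reproduced in this description), and the sketch is synchronous only to stay short.

```swift
/// Hypothetical, simplified stand-in for a generating protocol.
protocol RuntimeGeneratingSketch {
    /// Existing single-shot entry point.
    func generate(prompt: String) throws -> String
    /// New streaming variant; the callback receives cumulative raw text.
    func generate(prompt: String,
                  onPartialRawText: (@Sendable (String) -> Void)?) throws -> String
}

extension RuntimeGeneratingSketch {
    /// Default: ignore the callback and fall back to the single-shot path,
    /// so conformers written before streaming existed keep compiling.
    func generate(prompt: String,
                  onPartialRawText: (@Sendable (String) -> Void)?) throws -> String {
        try generate(prompt: prompt)
    }
}

/// A test fake that only implements the single-shot method.
struct FakeRuntime: RuntimeGeneratingSketch {
    func generate(prompt: String) throws -> String {
        prompt + " world"
    }
}
```

A fake like `FakeRuntime` never sees the callback, yet callers can still invoke the streaming signature on it and receive the single-shot result.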

Confidence Score: 5/5

Safe to merge. Every render path is guarded by work-id checks, context-generation matching, and the monotonic extension policy, so stale or reordered partials are silently dropped and the authoritative final result still owns the last word.

The streaming pipeline adds multiple independent staleness guards that together make incorrect renders impossible. The final apply path is unchanged. The two noted issues (a stale file comment and a missing @Sendable annotation on an internal implementation method) have no runtime impact.

LlamaRuntimeCore.swift for the missing @Sendable annotation on the onPartialRawText closure parameter.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/App/Coordinators/SuggestionCoordinator+Prediction.swift | Adds streaming partial rendering: coalesces partials via DispatchQueue drain, applies monotonic extension policy, and guards every render with work-id and context-generation checks. |
| Cotabby/Services/Runtime/LlamaRuntimeCore.swift | Threads `onPartialRawText` through the decode loop on the detached thread. The closure parameter is missing `@Sendable` here (though enforced at the `LlamaRuntimeManager` boundary). |
| Cotabby/Services/Runtime/FoundationModelSuggestionEngine.swift | Delegates single-shot to streaming variant (onPartial: nil); forwards inline partials already on main actor. Contains one stale file-comment sentence superseded by this PR. |
| Cotabby/Support/StreamedGhostTextPolicy.swift | New pure enum implementing monotonic-extension predicate with correct strict-length-increase + prefix-check logic and full test coverage. |
| Cotabby/Models/SuggestionSubsystemContracts.swift | Adds streaming signatures to both protocols with default single-shot fallbacks, preserving backward compatibility for all fakes. |

Sequence Diagram

sequenceDiagram
    participant RT as LlamaRuntimeCore (decode thread)
    participant SE as LlamaSuggestionEngine
    participant CO as SuggestionCoordinator
    participant OV as Overlay
    CO->>SE: generateSuggestion(for:onPartial:)
    loop each sampled token
        RT-->>SE: onPartialRawText(cumulativeRaw)
        SE-->>CO: "Task @MainActor onPartial(normalized)"
        CO->>CO: work-id + generation + monotonicity guards
        CO->>OV: presentOverlay(partial.text)
    end
    SE-->>CO: return SuggestionResult (final)
    CO->>OV: presentOverlay(final.text)

Comments Outside Diff (1)

CotabbyTests/LlamaSuggestionEngineStreamingTests.swift, line 685-689 (link)

drainUntil polls with up to 200 × 2 ms = 400 ms of wall time. On a loaded CI runner this budget can be exhausted before the main-actor Task { @MainActor in } deliveries complete, causing the assertion to fail with [] != [" wor", " world ag"] rather than an explicit timeout message. A continuation-based approach (e.g. wrapping the onPartial callback to fulfill a checked continuation once N partials are received) would be deterministic and produce a clear failure on timeout.
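One way to realize the suggested deterministic wait is sketched below. It uses an `AsyncStream` plus a task-group timeout rather than a raw checked continuation, and every name in it (`firstPartials`, `PartialTimeout`) is hypothetical; in the real test the `onPartial` callback would yield into the stream's continuation.

```swift
/// Thrown when the expected partials do not arrive in time.
struct PartialTimeout: Error {}

/// Waits for the first `count` partials from `stream`, or throws an explicit
/// timeout error instead of failing later on an empty-array comparison.
func firstPartials(_ count: Int,
                   from stream: AsyncStream<String>,
                   timeoutNanoseconds: UInt64) async throws -> [String] {
    try await withThrowingTaskGroup(of: [String].self,
                                    returning: [String].self) { group in
        // Collector: finishes as soon as enough partials have arrived.
        group.addTask {
            var received: [String] = []
            for await partial in stream {
                received.append(partial)
                if received.count == count { break }
            }
            return received
        }
        // Watchdog: turns a stall into a clear, named failure.
        group.addTask {
            try await Task.sleep(nanoseconds: timeoutNanoseconds)
            throw PartialTimeout()
        }
        defer { group.cancelAll() }  // stop whichever task lost the race
        guard let winner = try await group.next() else { throw PartialTimeout() }
        return winner
    }
}
```

Because the collector resumes on delivery rather than on a polling interval, a slow runner only lengthens the wait up to the timeout, and a genuine stall surfaces as `PartialTimeout`.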

Reviews (3): Last reviewed commit: "review: make queueStreamedPartial privat..."

Generation was single-shot end to end: the llama decode loop accumulated pieces privately and the FM engine consumed Apple's cumulative snapshots internally, so nothing reached the overlay until the full completion finished and perceived latency was prefill plus the entire decode. Both engines now forward cumulative, normalized partial results through a new streaming variant of the generation contract, and the coordinator paints them as real acceptable sessions: coalesced to one render per runloop turn, monotonic by policy (reordered hops and normalizer rewrites can never shrink visible text), guarded by the same work-id and materialize checks the final apply uses, and frozen at whatever was streamed if the user Tabs mid-decode. The final result remains authoritative and flows through the unchanged apply path.

greptile-apps Bot reviewed Jun 12, 2026

Comment thread Cotabby/App/Coordinators/SuggestionCoordinator+Prediction.swift Outdated

Comment thread Cotabby/App/Coordinators/SuggestionCoordinator+Prediction.swift

FuJacob force-pushed the perf/streaming-ghost-text branch from fd40fbb to 78f8701 Compare June 12, 2026 02:40

review: make queueStreamedPartial private; document why the drain flag survives a dispatch reset

99a9e26

FuJacob force-pushed the perf/streaming-ghost-text branch from 78f8701 to 99a9e26 Compare June 12, 2026 02:40

FuJacob merged commit 3d251cf into main Jun 12, 2026
4 checks passed

FuJacob mentioned this pull request Jun 12, 2026

feat(settings): make token-by-token streaming reveal opt-in (default off) #692

Merged

FuJacob merged 2 commits into main from perf/streaming-ghost-text
