
perf(stream): render ghost text while the model is still decoding #687

Merged
FuJacob merged 2 commits into main from perf/streaming-ghost-text
Jun 12, 2026
@FuJacob FuJacob commented Jun 12, 2026

Summary

This is the largest perceived-latency lever from the performance research pass. Generation was single-shot end to end: the llama decode loop accumulated pieces privately, the FM engine consumed Apple's cumulative snapshots internally (with a comment explicitly deferring partial rendering), and nothing reached the overlay until the entire completion finished. Perceived time-to-ghost-text was therefore prefill plus the full decode (roughly 250 ms to 1.3 s, depending on the model); with streaming it becomes prefill plus the time to decode the first words.

How it works, layer by layer:

  • LlamaRuntimeCore's decode loop reports the cumulative raw completion after each sampled token; LlamaRuntimeManager exposes it as a streaming generate variant on LlamaRuntimeGenerating (default implementation keeps every fake compiling).
  • LlamaSuggestionEngine normalizes each cumulative partial through the same SuggestionTextNormalizer path as final results and forwards non-empty ones to the main actor. FoundationModelSuggestionEngine does the same inline from Apple's snapshot stream. SuggestionEngineRouter threads the hook through both engines and the locale-fallback path.
  • The coordinator renders partials as REAL sessions, not cosmetic overlays: acceptance gates on the live session (never on state), so the user can Tab into a stream the moment the first words appear, and accepting bumps the work id which freezes the suggestion at what was streamed. Renders are coalesced to at most one per runloop turn (token-rate deliveries cannot stack layout work), monotonic via the new pure StreamedGhostTextPolicy (reordered main-actor hops and normalizer rewrites can never shrink visible text), and guarded by the same work-id and context-materialize staleness checks the final apply uses. The final result remains authoritative and flows through the unchanged apply path.
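The coalescing described in the last bullet can be sketched as below. This is a minimal illustration under stated assumptions, not the coordinator's code: `PartialCoalescer`, `schedule`, and `render` are hypothetical names. In the app the scheduling hop would presumably be `DispatchQueue.main.async`; here it is injected as a closure so the sketch runs without a run loop.

```swift
/// Hypothetical sketch: keeps only the newest partial and renders at most
/// once per scheduled drain, so token-rate deliveries cannot stack work.
final class PartialCoalescer {
    private var pending: String?
    private var drainScheduled = false
    private let schedule: (@escaping () -> Void) -> Void
    private let render: (String) -> Void

    /// `schedule` stands in for a hop such as `DispatchQueue.main.async` (assumption).
    init(schedule: @escaping (@escaping () -> Void) -> Void,
         render: @escaping (String) -> Void) {
        self.schedule = schedule
        self.render = render
    }

    /// Records the latest cumulative partial; schedules a drain only if none is queued.
    func queue(_ partial: String) {
        pending = partial
        guard !drainScheduled else { return }
        drainScheduled = true
        schedule { [weak self] in self?.drain() }
    }

    /// Renders whatever partial is newest at drain time, then resets.
    private func drain() {
        drainScheduled = false
        guard let text = pending else { return }
        pending = nil
        render(text)
    }
}
```

Because the partials are cumulative, dropping the intermediate ones loses nothing: the newest partial already contains the earlier text.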

Validation

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build-for-testing \
  -derivedDataPath build/DerivedData CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO
# ** TEST BUILD SUCCEEDED **

xcodebuild ... test-without-building \
  -only-testing:CotabbyTests/StreamedGhostTextPolicyTests \
  -only-testing:CotabbyTests/LlamaSuggestionEngineStreamingTests \
  -only-testing:CotabbyTests/LlamaSuggestionEngineCancellationTests \
  -only-testing:CotabbyTests/SuggestionCoordinatorAcceptanceTests \
  -only-testing:CotabbyTests/SuggestionSessionReconcilerTests
# 121 tests, 0 failures (cancel-must-not-wipe-KV and the 101 reconciler invariants stay green)

swiftlint lint --quiet
# exit 0

New tests: StreamedGhostTextPolicyTests (monotonicity incl. stale-shorter, equal-redundant, and divergent-rewrite cases) and LlamaSuggestionEngineStreamingTests (cumulative partials normalized and forwarded with the request generation; the single-shot entry point provably never touches the streaming runtime path).
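For readers unfamiliar with the policy under test, the three monotonicity cases can be illustrated with a small sketch. `GhostTextPolicySketch` and `shouldReplace(visible:with:)` are hypothetical names; the real `StreamedGhostTextPolicy` may differ in shape, and the only behavior assumed here is the one this PR describes (a strict length increase plus a prefix check).

```swift
/// Hypothetical stand-in for the monotonic-extension predicate.
enum GhostTextPolicySketch {
    /// Returns true only when `candidate` strictly extends the visible text.
    static func shouldReplace(visible: String?, with candidate: String) -> Bool {
        guard !candidate.isEmpty else { return false }
        // Nothing on screen yet: any non-empty partial may render.
        guard let visible = visible, !visible.isEmpty else { return true }
        // Must be longer AND keep the visible text as a prefix.
        return candidate.count > visible.count && candidate.hasPrefix(visible)
    }
}

// Stale-shorter: a reordered, older partial must not shrink the text.
let staleShorter = GhostTextPolicySketch.shouldReplace(visible: " world", with: " wor")
// Equal-redundant: re-rendering identical text is wasted work.
let equalRedundant = GhostTextPolicySketch.shouldReplace(visible: " world", with: " world")
// Divergent-rewrite: longer, but no longer an extension of what is shown.
let divergent = GhostTextPolicySketch.shouldReplace(visible: " world", with: " worked on")
// Plain extension: the only case that renders.
let extends = GhostTextPolicySketch.shouldReplace(visible: " wor", with: " world")
```

In this sketch the first three results are `false` and the last is `true`.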

Linked issues

Refs #661

Risk / rollout notes

  • A partial can render and then the authoritative final can shrink it once at completion (the normalizer's final pass wins). The monotonic policy prevents shrinking DURING the stream; the single end-of-stream correction is the show-then-correct tradeoff and in practice the final usually extends the last partial.
  • Typing the exact next character mid-stream advances the session in place and cancels the in-flight work (existing reconciler behavior), so the ghost stops extending at that point rather than waiting for the rest of the decode. Versus the old behavior (nothing visible until complete), the user strictly sees text sooner.
  • Mid-stream non-keyboard edits (mouse paste) stop partial rendering via the materialize-generation check at the next drain, and the poll-driven reconciler tears the session down exactly as it does for completed suggestions today.
  • This PR will conflict textually with perf(runtime): llama prewarm-on-focus, mid-prefill abort, and KV-reuse visibility #681 (both extend LlamaRuntimeGenerating and the runtime files); whichever lands second rebases mechanically.
  • Quality guardrail note: normalization runs on every partial through the same code path as finals, and nothing about sampling, stop logic, or prompts changed. A llama-side golden eval harness (seeded decode) remains the right follow-up before tuning any of those.

🤖 Generated with Claude Code

Greptile Summary

This PR wires streaming ghost text end-to-end: the llama decode loop and Apple's Foundation Model stream now forward cumulative normalized partials to the coordinator while decoding, so the first words of a suggestion appear after prefill + first tokens rather than after the full decode.

  • Decode loop → coordinator pipeline: LlamaRuntimeCore calls onPartialRawText after each sampled token; LlamaSuggestionEngine hops each partial to the main actor, normalizes it, and delivers it via the new onPartial callback. FoundationModelSuggestionEngine does the same inline from Apple's snapshot stream. Both engines degrade gracefully when onPartial is nil.
  • Coordinator rendering: partials are coalesced to one render per runloop turn via a DispatchQueue.main.async drain and guarded by work-id, context-generation, and StreamedGhostTextPolicy monotonicity checks before being committed as real, acceptable sessions.
  • Backward compatibility: default protocol implementations on both SuggestionGenerating and LlamaRuntimeGenerating route non-streaming callers to the single-shot path; the existing apply path for the authoritative final result is untouched.
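The backward-compatibility pattern in the last bullet can be sketched as follows. The protocol and type names are hypothetical, and the signatures are simplified to synchronous calls for illustration; the real `LlamaRuntimeGenerating` and `SuggestionGenerating` contracts are presumably asynchronous and are not reproduced here.

```swift
/// Hypothetical, simplified contract with a single-shot and a streaming variant.
protocol RuntimeGeneratingSketch {
    func generate(prompt: String) -> String
    func generate(prompt: String, onPartialRawText: ((String) -> Void)?) -> String
}

extension RuntimeGeneratingSketch {
    /// Default: conformers that only implement single-shot keep compiling,
    /// and simply never report partials.
    func generate(prompt: String, onPartialRawText: ((String) -> Void)?) -> String {
        generate(prompt: prompt)
    }
}

/// A test fake that predates streaming; it needs no changes.
struct SingleShotFake: RuntimeGeneratingSketch {
    func generate(prompt: String) -> String { prompt + " world" }
}

/// A streaming conformer: reports the cumulative text after each piece.
struct StreamingRuntimeSketch: RuntimeGeneratingSketch {
    let pieces: [String]

    func generate(prompt: String) -> String {
        generate(prompt: prompt, onPartialRawText: nil)
    }

    func generate(prompt: String, onPartialRawText: ((String) -> Void)?) -> String {
        var cumulative = ""
        for piece in pieces {
            cumulative += piece
            onPartialRawText?(cumulative) // no-op when the caller passes nil
        }
        return cumulative
    }
}
```

Passing `nil` for the callback makes the streaming path behave like the single-shot one, which matches the "degrade gracefully when onPartial is nil" note above.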

Confidence Score: 5/5

Safe to merge. Every render path is guarded by work-id checks, context-generation matching, and the monotonic extension policy, so stale or reordered partials are silently dropped and the authoritative final result still owns the last word.

The streaming pipeline adds multiple independent staleness guards that together make incorrect renders impossible. The final apply path is unchanged. The two noted issues (a stale file comment and a missing @Sendable annotation on an internal implementation method) have no runtime impact.

Worth a second look: LlamaRuntimeCore.swift, for the missing @Sendable annotation on the onPartialRawText closure parameter.
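The work-id guard mentioned above can be illustrated with a toy model. Everything here is an assumption made for illustration: `GhostSessionSketch` and its methods are hypothetical, and the real coordinator also checks context generation and monotonicity, which this sketch omits.

```swift
/// Hypothetical model of the work-id staleness guard.
struct GhostSessionSketch {
    private(set) var workID = 0
    private(set) var visibleText: String?

    /// Starts a new generation request and returns the id its partials must carry.
    mutating func beginWork() -> Int {
        workID += 1
        visibleText = nil
        return workID
    }

    /// Applies a partial only if it belongs to the current work; stale ones are dropped.
    mutating func applyPartial(_ text: String, forWork id: Int) -> Bool {
        guard id == workID else { return false }
        visibleText = text
        return true
    }

    /// Accepting bumps the work id, so later partials from the same decode
    /// no longer match and the accepted text stays frozen.
    mutating func accept() -> String? {
        workID += 1
        return visibleText
    }
}
```

In this model, a partial that arrives after `accept()` carries the old id and is rejected, which is the "frozen at what was streamed" behavior the PR describes.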

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/App/Coordinators/SuggestionCoordinator+Prediction.swift | Adds streaming partial rendering: coalesces partials via DispatchQueue drain, applies monotonic extension policy, and guards every render with work-id and context-generation checks. |
| Cotabby/Services/Runtime/LlamaRuntimeCore.swift | Threads onPartialRawText through the decode loop on the detached thread. The closure parameter is missing @Sendable here (though enforced at the LlamaRuntimeManager boundary). |
| Cotabby/Services/Runtime/FoundationModelSuggestionEngine.swift | Delegates single-shot to streaming variant (onPartial: nil); forwards inline partials already on main actor. Contains one stale file-comment sentence superseded by this PR. |
| Cotabby/Support/StreamedGhostTextPolicy.swift | New pure enum implementing monotonic-extension predicate with correct strict-length-increase + prefix-check logic and full test coverage. |
| Cotabby/Models/SuggestionSubsystemContracts.swift | Adds streaming signatures to both protocols with default single-shot fallbacks, preserving backward compatibility for all fakes. |

Sequence Diagram

sequenceDiagram
    participant RT as LlamaRuntimeCore (decode thread)
    participant SE as LlamaSuggestionEngine
    participant CO as SuggestionCoordinator
    participant OV as Overlay
    CO->>SE: generateSuggestion(for:onPartial:)
    loop each sampled token
        RT-->>SE: onPartialRawText(cumulativeRaw)
        SE-->>CO: "Task @MainActor onPartial(normalized)"
        CO->>CO: work-id + generation + monotonicity guards
        CO->>OV: presentOverlay(partial.text)
    end
    SE-->>CO: return SuggestionResult (final)
    CO->>OV: presentOverlay(final.text)

Comments Outside Diff (1)

  1. CotabbyTests/LlamaSuggestionEngineStreamingTests.swift, line 685-689 (link)

    P2 drainUntil polls with up to 200 × 2 ms = 400 ms of wall time. On a loaded CI runner this budget can be exhausted before the main-actor Task { @MainActor in } deliveries complete, causing the assertion to fail with [] != [" wor", " world ag"] rather than an explicit timeout message. A continuation-based approach (e.g. wrapping the onPartial callback to fulfill a checked continuation once N partials are received) would be deterministic and produce a clear failure on timeout.
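One way the suggested deterministic wait might look is sketched below. This is not the test suite's code: `awaitPartials`, `PartialTimeout`, and the `start` hook are hypothetical, and the sketch uses an `AsyncStream` rather than a single checked continuation so that several partials can be collected. `AsyncStream.makeStream()` requires Swift 5.9 or later.

```swift
/// Hypothetical error thrown when the expected partials do not arrive in time.
struct PartialTimeout: Error {}

/// Waits until `count` partials have been delivered, or throws `PartialTimeout`.
/// `start` receives the callback that the code under test should invoke per partial.
func awaitPartials(
    count: Int,
    timeoutNanoseconds: UInt64,
    start: (@escaping @Sendable (String) -> Void) -> Void
) async throws -> [String] {
    let (stream, continuation) = AsyncStream<String>.makeStream()
    start { partial in _ = continuation.yield(partial) }

    return try await withThrowingTaskGroup(of: [String].self) { group in
        // Collector: finishes as soon as the expected number of partials arrived.
        group.addTask {
            var received: [String] = []
            for await partial in stream {
                received.append(partial)
                if received.count == count { break }
            }
            return received
        }
        // Watchdog: turns a stall into an explicit, named failure.
        group.addTask {
            try await Task.sleep(nanoseconds: timeoutNanoseconds)
            throw PartialTimeout()
        }
        // Whichever child finishes first decides the outcome.
        let result = try await group.next()!
        group.cancelAll()
        return result
    }
}
```

Compared with polling, a stall surfaces as `PartialTimeout` instead of a confusing `[] != [...]` assertion, and the happy path returns as soon as the last partial lands.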


Reviews (3): Last reviewed commit: "review: make queueStreamedPartial privat..."

Comment thread Cotabby/App/Coordinators/SuggestionCoordinator+Prediction.swift Outdated
Comment thread Cotabby/App/Coordinators/SuggestionCoordinator+Prediction.swift
Generation was single-shot end to end: the llama decode loop accumulated
pieces privately and the FM engine consumed Apple's cumulative snapshots
internally, so nothing reached the overlay until the full completion
finished and perceived latency was prefill plus the entire decode. Both
engines now forward cumulative, normalized partial results through a new
streaming variant of the generation contract, and the coordinator paints
them as real acceptable sessions: coalesced to one render per runloop
turn, monotonic by policy (reordered hops and normalizer rewrites can
never shrink visible text), guarded by the same work-id and materialize
checks the final apply uses, and frozen at whatever was streamed if the
user Tabs mid-decode. The final result remains authoritative and flows
through the unchanged apply path.
@FuJacob FuJacob force-pushed the perf/streaming-ghost-text branch from fd40fbb to 78f8701 Compare June 12, 2026 02:40
@FuJacob FuJacob force-pushed the perf/streaming-ghost-text branch from 78f8701 to 99a9e26 Compare June 12, 2026 02:40
@FuJacob FuJacob merged commit 3d251cf into main Jun 12, 2026
4 checks passed