perf(runtime): llama prewarm-on-focus, mid-prefill abort, and KV-reuse visibility by FuJacob · Pull Request #681 · FuJacob/cotabby

FuJacob · 2026-06-12T01:15:01Z

Summary

Three gaps in the llama path, found during the speed/footprint research pass:

Prewarm-on-focus. The router already calls prewarm(for:) on focus change, and Apple Foundation Models uses it, but the llama engine inherited the protocol's no-op default, whose comment claimed "llama already keeps its KV cache hot". In reality the focus-change reset destroys the native sequence, so the first suggestion in every field paid the full cold prompt decode (~30-300ms, model-dependent). LlamaSuggestionEngine.prewarm now prefills the new field's prompt KV through a new prefill runtime entry point (tokenize + decode, no sampling) and primes the byte-prefix reuse hint only after the native decode succeeds. A real generation cancels an in-flight warmup on entry, so it can never queue behind one.
Mid-prefill abort. Swift cancellation is polled between sampled tokens, but prompt prefill (decodePrompt) was uninterruptible: the engine's per-sequence atomic abort flag already existed with a per-chunk check, but the app never set it. A superseding keystroke therefore waited out the entire stale prefill while that prefill held the autocomplete lock (worst case ~0.15-2s on a cold long prompt on the larger models). The manager's cancellation handlers now call core.abortInFlightGeneration(), which targets the in-flight sequence through a lock-guarded handshake and fires engine.cancelSequence. A cancelled prefill surfaces as a quiet CancellationError (never a runtime error, never a fresh rebuild of the stale prompt), and aborted sequences are destroyed because the native flag is set-once per sequence.
KV-reuse visibility. trimKV is a partial llama_memory_seq_rm, which llama.cpp rejects on hybrid/recurrent and SWA caches; the catalog models read as qwen35/gemma4 in their GGUF headers, so the prefix-reuse fast path very likely falls back to a full prompt re-prefill on every request today, silently. The rejection now logs once per model load at info (plus per-event detail and reuse-hit stats at debug), so a single log line answers whether a given model reuses its prompt KV. This is the instrumentation step before any repair (state-checkpoint or two-slot copy designs).
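
As a rough illustration of the per-chunk abort check described in the mid-prefill item, the sketch below polls a lock-guarded flag before each chunk of a prompt decode and throws CancellationError when it is raised. AbortFlag, chunkedPrefill, and the decodeChunk closure are hypothetical names made up for this example; the real check lives in the native engine and is not shown in this PR description.

```swift
import Foundation

// Hypothetical stand-in for the engine's per-sequence abort flag.
final class AbortFlag {
    private let lock = NSLock()
    private var raised = false

    // Set-once: there is deliberately no way to lower the flag again.
    func raise() {
        lock.lock()
        raised = true
        lock.unlock()
    }

    var isRaised: Bool {
        lock.lock()
        defer { lock.unlock() }
        return raised
    }
}

/// Decodes `tokens` chunk by chunk, checking the flag before every chunk.
/// `decodeChunk` stands in for the native decode call (an assumption).
func chunkedPrefill(
    tokens: [Int32],
    chunkSize: Int,
    flag: AbortFlag,
    decodeChunk: (ArraySlice<Int32>) -> Void
) throws {
    var start = 0
    while start < tokens.count {
        // A raised flag stops the prefill at the next chunk boundary.
        if flag.isRaised { throw CancellationError() }
        let end = min(start + chunkSize, tokens.count)
        decodeChunk(tokens[start..<end])
        start = end
    }
}
```

With a 10-token prompt and a chunk size of 4, raising the flag during the first chunk means only those 4 tokens are decoded before the error is thrown, which is the latency bound the abort is meant to provide.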

Validation

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build-for-testing \
  -derivedDataPath build/DerivedData CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO
# ** TEST BUILD SUCCEEDED **

xcodebuild ... test-without-building \
  -only-testing:CotabbyTests/LlamaSuggestionEngineCancellationTests \
  -only-testing:CotabbyTests/LlamaSuggestionEnginePrewarmTests \
  -only-testing:CotabbyTests/LlamaPromptCacheHintTrackerTests
# 12 tests, 0 failures (the cancel-must-not-wipe-KV invariant suite stays green)

swiftlint lint --quiet
# exit 0

New LlamaSuggestionEnginePrewarmTests pin the prewarm contract: prefill primes the reuse hint, a failed prefill leaves it cold, and a context reset clears it. End-to-end abort behavior (engine flag firing mid-chunk) requires a loaded model and is validated by the existing middleware-level cancellation plumbing it reuses; flagged explicitly as not unit-coverable here.
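
The three contract cases can be pictured with a small sketch. Everything here (HintTracker, FakeRuntime, PrefillFailure, and the free function prewarm) is a hypothetical stand-in written for illustration, not the PR's RecordingPrewarmRuntime or the real tracker API.

```swift
// Hypothetical error type for a failed native prefill.
struct PrefillFailure: Error {}

// Hypothetical stand-in for the prompt-cache hint tracker.
final class HintTracker {
    private(set) var warmedPrompt: String?
    func recordSuccessfulRequest(prompt: String) { warmedPrompt = prompt }
    func reset() { warmedPrompt = nil }
}

// Hypothetical fake runtime whose prefill can be forced to fail.
struct FakeRuntime {
    var shouldFail: Bool
    func prefill(prompt: String) throws {
        if shouldFail { throw PrefillFailure() }
    }
}

/// Primes the hint only after the prefill returns without throwing.
func prewarm(prompt: String, runtime: FakeRuntime, tracker: HintTracker) {
    do {
        try runtime.prefill(prompt: prompt)
        tracker.recordSuccessfulRequest(prompt: prompt)
    } catch {
        // Failed (or cancelled) prefill: the hint stays cold.
    }
}
```

Recording the hint inside the do block, after the throwing call, is what keeps a failed warmup from claiming KV state that was never built.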

Linked issues

Refs #661

Risk / rollout notes

prefill shares the exact tokenize/truncate/options path with generate (one PreparedPrompt helper, one options factory), so a warmup can never poison reuse validation with a differently-built prompt.
The abort target is registered before each native decode and cleared on every exit, including throw paths, so a late cancel cannot flag a recycled sequence slot; the engine-side lookup is mutex-guarded and null-safe.
Focus changes now cost one prompt prefill on the llama engine (tens of ms GPU for typical prompts, bounded by the same context-window truncation as generation). This mirrors what the FM engine already does on focus, and a keystroke arriving mid-warmup aborts it.
Cancelled prefills/generations keep diagnostics.lastError untouched.
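
The register-before-decode, clear-on-every-exit rule for the abort target could look roughly like the following. AbortTarget and decodeWithAbortTarget are hypothetical names for this sketch; the PR only names abortTargetLock, abortTargetSequenceID, and setAbortTarget, and their real signatures are not shown here.

```swift
import Foundation

// Hypothetical lock-guarded holder for the in-flight sequence ID.
final class AbortTarget {
    private let lock = NSLock()
    private var sequenceID: Int32?

    func set(_ id: Int32) {
        lock.lock()
        sequenceID = id
        lock.unlock()
    }

    func clear() {
        lock.lock()
        sequenceID = nil
        lock.unlock()
    }

    /// Called from a cancellation handler. Flags the registered sequence,
    /// if any, and returns its ID; returns nil when nothing is in flight.
    func abortInFlight(cancel: (Int32) -> Void) -> Int32? {
        lock.lock()
        defer { lock.unlock() }
        guard let id = sequenceID else { return nil }
        cancel(id)
        return id
    }
}

/// Registers the target before the decode and clears it on every exit.
func decodeWithAbortTarget(
    sequenceID: Int32,
    target: AbortTarget,
    decode: () throws -> Void
) rethrows {
    target.set(sequenceID)
    defer { target.clear() } // runs on normal return and on throw
    try decode()
}
```

Because the clear sits in a defer, a cancel that arrives after the decode has exited finds no target and does nothing, so it cannot flag a sequence slot that has since been recycled.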

🤖 Generated with Claude Code

Greptile Summary

This PR adds three runtime improvements to the llama autocomplete path: prewarm-on-focus (prefills the KV cache when a field receives focus), mid-prefill abort (a new abortInFlightGeneration engine-level mechanism that interrupts uninterruptible prompt decodes so superseded requests don't block behind stale work), and KV-reuse visibility (logs trim rejections once per model load to surface silent fallbacks on hybrid/SWA cache models).

Prewarm-on-focus: LlamaSuggestionEngine.prewarm now issues a prefill call through a new LlamaRuntimeCore.prefill entry point; the hint tracker is updated only after the native decode succeeds, and a real generation cancels any in-flight warmup before acquiring the autocomplete lock.
Mid-prefill abort: LlamaRuntimeManager.generate and .prefill both register an abortInFlightGeneration call in their withTaskCancellationHandler onCancel closures; the abort target is guarded by a separate abortTargetLock and cleared on every exit path.
KV-reuse visibility: modelRejectsPartialTrims is learned from the first failed trim after model load and gates the prewarm path; one info-level log fires per model load when the engine rejects partial trims.
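
A minimal sketch of the learn-once behavior in the last bullet, assuming a hypothetical TrimRejectionTracker type: the first failed trim after a model load sets the flag and emits one info-level line, later failures stay quiet at info, and a model load resets it. The log text is a placeholder, not the message the app actually prints.

```swift
// Hypothetical tracker; only the flag name comes from the PR.
final class TrimRejectionTracker {
    private(set) var modelRejectsPartialTrims = false
    // Captured instead of printed so the behavior is easy to inspect.
    private(set) var infoLogs: [String] = []

    /// Feed the Bool result of every trim attempt into this method.
    func recordTrimResult(_ succeeded: Bool) {
        // Only the first rejection after a model load is interesting.
        guard !succeeded, !modelRejectsPartialTrims else { return }
        modelRejectsPartialTrims = true
        infoLogs.append("partial KV trim rejected (placeholder message)")
    }

    /// A new model load forgets what was learned about the old model.
    func modelDidLoad() {
        modelRejectsPartialTrims = false
    }

    /// Gate for the prewarm path: skip prefill once rejection is known.
    var shouldPrefill: Bool { !modelRejectsPartialTrims }
}
```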

Confidence Score: 4/5

Safe to merge with one fix: the KV-trim defer in generate calls engine.trimKV on a sequence that was already destroyed by the engine-abort path, which can incorrectly mark the model as incapable of prefix reuse for the rest of the session.

The abort path introduced by this PR creates a new state where a sequence is explicitly destroyed mid-function (decode.engineCancelled branch) before the KV-trim defer runs. The defer still calls engine.trimKV with the destroyed sequence ID; if the engine returns false for an unknown ID (the typical llama.cpp behavior), modelRejectsPartialTrims is set to true. Every subsequent prefill call is then skipped silently for the lifetime of the model session — exactly the opposite of what the prewarm feature is trying to achieve. The bug is triggered the first time a user types fast enough to cancel a generation mid-prefill, which is the primary scenario the abort mechanism is designed for. The remaining changes are well-structured and handle their edge cases correctly.

Cotabby/Services/Runtime/LlamaRuntimeCore.swift — specifically the KV-trim defer block inside generate and the interaction with the if decode.engineCancelled destroy branch immediately after.

Important Files Changed

Filename	Overview
Cotabby/Services/Runtime/LlamaRuntimeCore.swift	Core inference engine refactored to extract `preparedPrompt` helper, add `prefill` entry point, mid-prefill abort via `abortTargetLock`/`abortTargetSequenceID`, and `modelRejectsPartialTrims` flag. A P1 bug exists: the KV-trim defer calls `engine.trimKV` on an already-destroyed sequence when engine cancellation fires, which can incorrectly set `modelRejectsPartialTrims = true` and permanently disable the prewarm path for the session.
Cotabby/Services/Runtime/LlamaRuntimeManager.swift	Adds `prefill` forwarding method with the same cancellation-handler pattern as `generate`, correctly calling `abortInFlightGeneration` in both cancel handlers. Clean implementation.
Cotabby/Services/Runtime/LlamaSuggestionEngine.swift	Adds `prewarm` implementation and shared `makeGenerationOptions` helper. Task management for `inflightPrewarmTask` is correct: cancelled on new prewarm, on generation entry, and on context reset; tracker updated only after successful prefill.
Cotabby/Models/SuggestionSubsystemContracts.swift	Adds `prefill` to `LlamaRuntimeGenerating` with a default no-op extension so test doubles continue compiling; updates `prewarm` doc-comment. Straightforward protocol extension.
CotabbyTests/LlamaSuggestionEnginePrewarmTests.swift	New test file covering the three prewarm contract cases: successful prefill primes the hint, failed prefill leaves it cold, and context reset clears it. `RecordingPrewarmRuntime` fake is minimal and correct.
Cotabby.xcodeproj/project.pbxproj	Adds `LlamaSuggestionEnginePrewarmTests.swift` to the test target. Mechanical project file change.

Sequence Diagram

sequenceDiagram
    participant Coord as Coordinator
    participant Engine as LlamaSuggestionEngine
    participant Manager as LlamaRuntimeManager
    participant Core as LlamaRuntimeCore

    Note over Coord,Core: Focus change
    Coord->>Engine: prewarm(for: request)
    Engine->>Engine: inflightPrewarmTask?.cancel()
    Engine->>Manager: prefill(prompt:cachedPrefixBytes:options:)
    Manager->>Core: core.prefill(...) [detached]
    Core->>Core: acquires autocompleteLock
    Core->>Core: guard !modelRejectsPartialTrims
    Core->>Core: obtainAutocompleteSequence → buildFreshSequence
    Core->>Core: setAbortTarget(seqID)
    Core->>Core: engine.decodePrompt (full prompt + seed token)
    Core->>Core: engine.trimKV (remove seed token)
    Core-->>Manager: ()
    Manager-->>Engine: ()
    Engine->>Engine: promptCacheHintTracker.recordSuccessfulRequest

    Note over Coord,Core: User types (keystroke)
    Coord->>Engine: generateSuggestion(for: request)
    Engine->>Engine: inflightPrewarmTask?.cancel() → abortInFlightGeneration()
    Engine->>Manager: generate(prompt:cachedPrefixBytes:options:)
    Manager->>Core: core.generate(...) [detached]
    Core->>Core: acquires autocompleteLock
    Core->>Core: obtainAutocompleteSequence (reuse path)
    Core->>Core: setAbortTarget(autocompleteSequenceID)
    Core->>Core: engine.decodePrompt (delta tokens only)
    Core->>Core: runEngineSampledDecode
    Core-->>Manager: result text
    Manager-->>Engine: rawSuggestion
    Engine-->>Coord: SuggestionResult

    Note over Coord,Core: Engine abort fires mid-prefill
    Core->>Core: sampleNext returns was_cancelled
    Core->>Core: engine.destroySequence(sequenceID)
    Core->>Core: defer trimKV on destroyed seq ⚠️

Reviews (3): Last reviewed commit: "review: stop prewarm from double-decodin..."

Greptile also left 1 inline comment on this PR.

…e visibility Three llama-path gaps. The engine's prewarm hook was the protocol no-op while a focus change destroys the native sequence, so the first suggestion in every field paid the full cold prompt decode; prewarm now prefills the new field's prompt KV (no sampling) and primes the reuse hint only after the native decode succeeded. Prompt prefill was uninterruptible: Swift cancellation is polled between sampled tokens, and the engine's per-sequence abort flag was never set by the app, so a superseding request waited out the entire stale decode while it held the autocomplete lock; the manager's cancellation handlers now fire engine.cancelSequence through an abort-target handshake, with cancelled prefills surfacing as quiet CancellationError and aborted sequences destroyed (the native flag is set-once). And trimKV rejection, which llama.cpp returns for partial removals on hybrid/SWA caches and which silently degrades every request to a full re-prefill, now logs once per model load at info plus per-event reuse stats at debug.

…al KV trims On hybrid/SWA models (the current catalog) trimKV is rejected unconditionally, so a warmed sequence still carries its seed token, the following generate's reuse trim is rejected too, and the prompt gets fully decoded twice. The core now learns the rejection from the first failed trim after model load (generate's restore-trim, the reuse path, or the prefill tail), drops a warmed sequence it cannot trim instead of recording tracker facts the KV does not match, and turns subsequent prefills into no-ops until the next model load.

greptile-apps · 2026-06-12T02:37:54Z

+        defer {
+            // Trim sampled tokens so KV retains only the prompt for the next request. A rejected
+            // trim leaves the sampled tokens in KV while the tracker records prompt-only state;
+            // that mismatch self-heals (the next reuse trim is rejected too and rebuilds fresh),
+            // but it also proves this model can never reuse, so remember that for `prefill`.
+            if !engine.trimKV(sequenceID, Int32(preparation.promptTokens.count)) {
+                modelRejectsPartialTrims = true
+            }
+            autocompletePromptBytes = preparation.promptBytes
+            autocompletePromptTokens = preparation.promptTokens
+            autocompleteSamplingFingerprint = preparation.fingerprint
+        }


engine.trimKV called on already-destroyed sequence when generation is engine-cancelled

When runEngineSampledDecode returns engineCancelled: true, the code explicitly destroys the sequence and sets autocompleteSequenceID = -1. However, the KV-trim defer block — registered before the destroy logic — runs afterward and still calls engine.trimKV(sequenceID, Int32(preparation.promptTokens.count)) on the already-freed sequence.

If the engine returns false for a trim on a non-existent sequence (the common behavior in llama.cpp when the sequence ID is unknown), modelRejectsPartialTrims is incorrectly set to true. From that point, every subsequent prefill call is skipped with "Prefill skipped: the loaded model rejects partial KV trims" — permanently disabling the prewarm fast path for the current model session even on models that fully support prefix reuse. The flag is only reset on model reload.

The concrete failure path: user types quickly → first generation's prefill is aborted by abortInFlightGeneration → sampleNext returns was_cancelled → sequence destroyed → trim defer calls trimKV on freed ID → returns false → modelRejectsPartialTrims = true → all subsequent prewarms are no-ops for the rest of the session.

The fix is to guard the trimKV call in the defer with a check that the sequence was not destroyed, e.g. by gating on autocompleteSequenceID >= 0 (which is set to -1 in the engine-cancelled branch before the defer runs).
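
The ordering problem and the suggested guard can be reproduced in a small sketch. A Swift defer body runs when its scope exits, so it executes after the destroy branch in the function body even though it is registered first. FakeEngine and SketchCore are hypothetical stand-ins, and the assumption that trimKV returns false for an unknown sequence ID is taken from the comment above rather than verified against the engine.

```swift
// Hypothetical engine that only tracks which sequence IDs are live.
final class FakeEngine {
    private var live: Set<Int32> = []
    func create(_ id: Int32) { live.insert(id) }
    func destroySequence(_ id: Int32) { live.remove(id) }
    // Assumption: a trim on an unknown sequence reports failure.
    func trimKV(_ id: Int32, _ keep: Int32) -> Bool { live.contains(id) }
}

final class SketchCore {
    let engine = FakeEngine()
    var autocompleteSequenceID: Int32 = -1
    var modelRejectsPartialTrims = false

    /// `guardTrim` switches between the reported behavior (false)
    /// and the proposed fix (true).
    func generate(promptTokenCount: Int32, engineCancelled: Bool, guardTrim: Bool) {
        let sequenceID: Int32 = 1
        engine.create(sequenceID)
        autocompleteSequenceID = sequenceID

        defer {
            // Runs at scope exit, i.e. after the destroy branch below.
            if !guardTrim || autocompleteSequenceID >= 0 {
                if !engine.trimKV(sequenceID, promptTokenCount) {
                    modelRejectsPartialTrims = true
                }
            }
        }

        if engineCancelled {
            engine.destroySequence(sequenceID)
            autocompleteSequenceID = -1
        }
    }
}
```

Without the guard, a single engine-cancelled generation flips modelRejectsPartialTrims in this model; with the guard, the trim is skipped for the destroyed sequence and the flag stays false.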

greptile-apps Bot reviewed Jun 12, 2026

Comment thread Cotabby/Services/Runtime/LlamaSuggestionEngine.swift

Comment thread Cotabby/Services/Runtime/LlamaRuntimeCore.swift Outdated

FuJacob force-pushed the perf/llama-prewarm-abort branch from a21317b to fa49f43 Compare June 12, 2026 01:58

FuJacob mentioned this pull request Jun 12, 2026

perf(stream): render ghost text while the model is still decoding #687

Merged

FuJacob added 2 commits June 11, 2026 19:30

FuJacob force-pushed the perf/llama-prewarm-abort branch from 968e454 to 00c4968 Compare June 12, 2026 02:31

FuJacob merged commit 7122deb into main Jun 12, 2026
4 checks passed

greptile-apps Bot reviewed Jun 12, 2026
