perf(runtime): llama prewarm-on-focus, mid-prefill abort, and KV-reuse visibility #681
…e visibility Three llama-path gaps. The engine's prewarm hook was the protocol no-op while a focus change destroys the native sequence, so the first suggestion in every field paid the full cold prompt decode; prewarm now prefills the new field's prompt KV (no sampling) and primes the reuse hint only after the native decode succeeded. Prompt prefill was uninterruptible: Swift cancellation is polled between sampled tokens, and the engine's per-sequence abort flag was never set by the app, so a superseding request waited out the entire stale decode while it held the autocomplete lock; the manager's cancellation handlers now fire engine.cancelSequence through an abort-target handshake, with cancelled prefills surfacing as quiet CancellationError and aborted sequences destroyed (the native flag is set-once). And trimKV rejection, which llama.cpp returns for partial removals on hybrid/SWA caches and which silently degrades every request to a full re-prefill, now logs once per model load at info plus per-event reuse stats at debug.
…al KV trims On hybrid/SWA models (the current catalog) trimKV is rejected unconditionally, so a warmed sequence still carries its seed token, the following generate's reuse trim is rejected too, and the prompt gets fully decoded twice. The core now learns the rejection from the first failed trim after model load (generate's restore-trim, the reuse path, or the prefill tail), drops a warmed sequence it cannot trim instead of recording tracker facts the KV does not match, and turns subsequent prefills into no-ops until the next model load.
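A minimal sketch of the "learn the rejection once per model load" behavior described in the commit above. `TrimRejectionTracker`, `recordTrim`, `shouldPrefill`, and `modelDidLoad` are hypothetical names chosen for illustration; only `modelRejectsPartialTrims` appears in the PR, and the real core may structure this differently.

```swift
import Foundation

// Hypothetical stand-in for the core's trim bookkeeping (illustrative names only).
final class TrimRejectionTracker {
    private(set) var modelRejectsPartialTrims = false
    private(set) var infoLogCount = 0

    /// Feed the result of every partial trim attempt through here.
    func recordTrim(succeeded: Bool) {
        // Only the first failure after a model load flips the flag and logs.
        guard !succeeded, !modelRejectsPartialTrims else { return }
        modelRejectsPartialTrims = true
        infoLogCount += 1
        print("info: loaded model rejects partial KV trims")
    }

    /// Prefill turns into a no-op once a rejection has been learned.
    var shouldPrefill: Bool { !modelRejectsPartialTrims }

    /// A model load forgets what was learned about the previous model.
    func modelDidLoad() {
        modelRejectsPartialTrims = false
    }
}

let tracker = TrimRejectionTracker()
tracker.recordTrim(succeeded: true)   // accepted trim: nothing learned
tracker.recordTrim(succeeded: false)  // first rejection: flag set, one info log
tracker.recordTrim(succeeded: false)  // later rejections: no further info logs
```

Keeping the flag and the log guard in one place means any of the three trim sites (restore-trim, reuse path, prefill tail) can report a failure without coordinating who logs.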
defer {
    // Trim sampled tokens so KV retains only the prompt for the next request. A rejected
    // trim leaves the sampled tokens in KV while the tracker records prompt-only state;
    // that mismatch self-heals (the next reuse trim is rejected too and rebuilds fresh),
    // but it also proves this model can never reuse, so remember that for `prefill`.
    if !engine.trimKV(sequenceID, Int32(preparation.promptTokens.count)) {
        modelRejectsPartialTrims = true
    }
    autocompletePromptBytes = preparation.promptBytes
    autocompletePromptTokens = preparation.promptTokens
    autocompleteSamplingFingerprint = preparation.fingerprint
}
engine.trimKV called on already-destroyed sequence when generation is engine-cancelled
When runEngineSampledDecode returns engineCancelled: true, the code explicitly destroys the sequence and sets autocompleteSequenceID = -1. However, the KV-trim defer block — registered before the destroy logic — runs afterward and still calls engine.trimKV(sequenceID, Int32(preparation.promptTokens.count)) on the already-freed sequence.
If the engine returns false for a trim on a non-existent sequence (the common behavior in llama.cpp when the sequence ID is unknown), modelRejectsPartialTrims is incorrectly set to true. From that point, every subsequent prefill call is skipped with "Prefill skipped: the loaded model rejects partial KV trims" — permanently disabling the prewarm fast path for the current model session even on models that fully support prefix reuse. The flag is only reset on model reload.
The concrete failure path: user types quickly → first generation's prefill is aborted by abortInFlightGeneration → sampleNext returns was_cancelled → sequence destroyed → trim defer calls trimKV on freed ID → returns false → modelRejectsPartialTrims = true → all subsequent prewarms are no-ops for the rest of the session.
The fix is to guard the trimKV call in the defer with a check that the sequence was not destroyed, e.g. by gating on autocompleteSequenceID >= 0 (which is set to -1 in the engine-cancelled branch before the defer runs).
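The failure path and the proposed guard can be reproduced in a small model. This is a sketch under stated assumptions: `FakeEngine` and `Core` are hypothetical stand-ins, the fixed sequence ID is arbitrary, and the fake's `trimKV` returns `false` for an unknown sequence ID because that is the behavior the comment above assumes, not something verified against the real engine.

```swift
import Foundation

// Hypothetical fake: trimKV fails for sequence IDs that are not live (assumed behavior).
final class FakeEngine {
    private var live: Set<Int32> = []
    func createSequence(_ id: Int32) { live.insert(id) }
    func destroySequence(_ id: Int32) { live.remove(id) }
    func trimKV(_ id: Int32, _ keep: Int32) -> Bool { live.contains(id) }
}

// Hypothetical, heavily reduced model of the generate path.
final class Core {
    let engine = FakeEngine()
    var autocompleteSequenceID: Int32 = -1
    var modelRejectsPartialTrims = false

    func generate(promptTokenCount: Int32, engineCancelled: Bool, guarded: Bool) {
        let sequenceID: Int32 = 7
        engine.createSequence(sequenceID)
        autocompleteSequenceID = sequenceID
        defer {
            // Guarded variant: skip the trim when the cancel branch below
            // already destroyed the sequence and reset the ID to -1.
            if !guarded || autocompleteSequenceID >= 0 {
                if !engine.trimKV(sequenceID, promptTokenCount) {
                    modelRejectsPartialTrims = true
                }
            }
        }
        if engineCancelled {
            engine.destroySequence(sequenceID)
            autocompleteSequenceID = -1
            return
        }
    }
}

let unguardedCore = Core()
unguardedCore.generate(promptTokenCount: 32, engineCancelled: true, guarded: false)

let guardedCore = Core()
guardedCore.generate(promptTokenCount: 32, engineCancelled: true, guarded: true)

let normalCore = Core()
normalCore.generate(promptTokenCount: 32, engineCancelled: false, guarded: true)
```

In this model the unguarded run ends with `modelRejectsPartialTrims` set even though no partial trim was ever rejected on a live sequence, while the guarded run and the uncancelled run leave it clear.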
Summary
Three gaps in the llama path, found during the speed/footprint research pass:
- Prewarm-on-focus. The router already calls `prewarm(for:)` on focus change, and Apple Foundation Models uses it, but the llama engine inherited the protocol's no-op default whose comment claimed "llama already keeps its KV cache hot". In reality the focus-change reset destroys the native sequence, so the first suggestion in every field paid the full cold prompt decode (~30-300 ms, model-dependent). `LlamaSuggestionEngine.prewarm` now prefills the new field's prompt KV through a new `prefill` runtime entry point (tokenize + decode, no sampling) and primes the byte-prefix reuse hint only after the native decode succeeded. A real generation cancels an in-flight warmup on entry so it can never queue behind one.
- Mid-prefill abort. Swift cancellation is polled between sampled tokens, but prompt prefill (`decodePrompt`) was uninterruptible: the engine's per-sequence atomic abort flag exists with a per-chunk check, and the app never set it. A superseding keystroke therefore waited out the entire stale prefill while it held the autocomplete lock (worst case ~0.15-2 s on a cold long prompt on the larger models). The manager's cancellation handlers now call `core.abortInFlightGeneration()`, which targets the in-flight sequence through a lock-guarded handshake and fires `engine.cancelSequence`. A cancelled prefill surfaces as quiet `CancellationError` (never a runtime error, never a fresh rebuild of the stale prompt), and aborted sequences are destroyed because the native flag is set-once per sequence.
- KV-reuse visibility. `trimKV` is a partial `llama_memory_seq_rm`, which llama.cpp rejects on hybrid/recurrent and SWA caches; the catalog models read as `qwen35`/`gemma4` in their GGUF headers, so the prefix-reuse fast path very likely falls back to a full prompt re-prefill on every request today, silently. The rejection now logs once per model load at info (plus per-event detail and reuse-hit stats at debug), so a single log line answers whether a given model reuses its prompt KV. This is the instrumentation step before any repair (state-checkpoint or two-slot copy designs).

Validation
New `LlamaSuggestionEnginePrewarmTests` pin the prewarm contract: prefill primes the reuse hint, a failed prefill leaves it cold, and a context reset clears it. End-to-end abort behavior (engine flag firing mid-chunk) requires a loaded model and is validated by the existing middleware-level cancellation plumbing it reuses; flagged explicitly as not unit-coverable here.

Linked issues
Refs #661
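The lock-guarded abort handshake from the Summary can be pictured roughly as follows. This is a sketch, not the PR's code: `AbortTarget`, its method names, the `CancelLog` helper, and the closure-based `cancelSequence` are all hypothetical, and it assumes the cancel callback never re-enters the same lock.

```swift
import Foundation

// Hypothetical sketch of a lock-guarded abort target (names are illustrative).
final class AbortTarget {
    private let lock = NSLock()
    private var sequenceID: Int32?
    private let cancelSequence: (Int32) -> Void

    init(cancelSequence: @escaping (Int32) -> Void) {
        self.cancelSequence = cancelSequence
    }

    /// Called by the decode path once it owns a sequence.
    func set(_ id: Int32) {
        lock.lock(); defer { lock.unlock() }
        sequenceID = id
    }

    /// Called on every exit path so a late cancel cannot hit a stale ID.
    func clear() {
        lock.lock(); defer { lock.unlock() }
        sequenceID = nil
    }

    /// Called from a cancellation handler, possibly on another thread.
    func abortInFlight() {
        lock.lock(); defer { lock.unlock() }
        if let id = sequenceID { cancelSequence(id) }
    }
}

// Small recorder standing in for the engine's cancel entry point.
final class CancelLog { var ids: [Int32] = [] }

let cancelLog = CancelLog()
let abortTarget = AbortTarget { cancelLog.ids.append($0) }

abortTarget.abortInFlight()  // no target set yet: nothing fires
abortTarget.set(3)
abortTarget.abortInFlight()  // fires for sequence 3
abortTarget.clear()
abortTarget.abortInFlight()  // target cleared: nothing fires
```

Holding the lock across the callback keeps "read the target" and "fire the cancel" atomic with respect to `clear()`, which is the property that stops a late cancellation from hitting a sequence the decode path has already released.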
Risk / rollout notes
- `prefill` shares the exact tokenize/truncate/options path with `generate` (one `PreparedPrompt` helper, one options factory), so a warmup can never poison reuse validation with a differently-built prompt.
- `diagnostics.lastError` untouched.

🤖 Generated with Claude Code
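To illustrate why a single preparation path matters, here is a toy sketch in which both entry points go through one helper and one options factory. The field layout of `PreparedPrompt`, the `GenerationOptions` type, the byte-level "tokenizer", and the function bodies are all assumptions made for the example; the real types are not shown on this page.

```swift
import Foundation

// Hypothetical shapes; the real options and prepared-prompt types are not shown here.
struct GenerationOptions: Equatable {
    var maxPromptTokens: Int
    var temperature: Double
}

struct PreparedPrompt: Equatable {
    let promptTokens: [Int32]
    let promptBytes: [UInt8]
    let fingerprint: String
}

/// The single options factory used by both entry points.
func makeGenerationOptions() -> GenerationOptions {
    GenerationOptions(maxPromptTokens: 8, temperature: 0.2)
}

/// The single preparation helper: toy byte-level tokenization, keeping the last N bytes.
func preparePrompt(_ text: String, options: GenerationOptions) -> PreparedPrompt {
    let bytes = Array(text.utf8).suffix(options.maxPromptTokens)
    return PreparedPrompt(
        promptTokens: bytes.map { Int32($0) },
        promptBytes: Array(bytes),
        fingerprint: "t=\(options.temperature)"
    )
}

// Both stand-ins only return the prepared prompt; a real prefill would decode it
// without sampling and a real generate would decode and then sample.
func prefill(_ text: String) -> PreparedPrompt {
    preparePrompt(text, options: makeGenerationOptions())
}

func generate(_ text: String) -> PreparedPrompt {
    preparePrompt(text, options: makeGenerationOptions())
}

let warmed = prefill("Dear team, thanks")
let requested = generate("Dear team, thanks")
```

Because both calls produce identical tokens, bytes, and fingerprint for the same input, a reuse check that compares those fields cannot be tripped by a warmup that was built differently from the request that follows it.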
Greptile Summary
This PR adds three runtime improvements to the llama autocomplete path: prewarm-on-focus (prefills the KV cache when a field receives focus), mid-prefill abort (a new `abortInFlightGeneration` engine-level mechanism that interrupts uninterruptible prompt decodes so superseded requests don't block behind stale work), and KV-reuse visibility (logs trim rejections once per model load to surface silent fallbacks on hybrid/SWA cache models).

- `LlamaSuggestionEngine.prewarm` now issues a `prefill` call through a new `LlamaRuntimeCore.prefill` entry point; the hint tracker is updated only after the native decode succeeds, and a real generation cancels any in-flight warmup before acquiring the autocomplete lock.
- `LlamaRuntimeManager.generate` and `.prefill` both register an `abortInFlightGeneration` call in their `withTaskCancellationHandler` `onCancel` closures; the abort target is guarded by a separate `abortTargetLock` and cleared on every exit path.
- `modelRejectsPartialTrims` is learned from the first failed trim after model load and gates the prewarm path; one info-level log fires per model load when the engine rejects partial trims.

Confidence Score: 4/5
Safe to merge with one fix: the KV-trim defer in `generate` calls `engine.trimKV` on a sequence that was already destroyed by the engine-abort path, which can incorrectly mark the model as incapable of prefix reuse for the rest of the session.

The abort path introduced by this PR creates a new state where a sequence is explicitly destroyed mid-function (`decode.engineCancelled` branch) before the KV-trim `defer` runs. The defer still calls `engine.trimKV` with the destroyed sequence ID; if the engine returns `false` for an unknown ID (the typical llama.cpp behavior), `modelRejectsPartialTrims` is set to `true`. Every subsequent `prefill` call is then skipped silently for the lifetime of the model session, exactly the opposite of what the prewarm feature is trying to achieve. The bug is triggered the first time a user types fast enough to cancel a generation mid-prefill, which is the primary scenario the abort mechanism is designed for. The remaining changes are well-structured and handle their edge cases correctly.

Cotabby/Services/Runtime/LlamaRuntimeCore.swift: specifically the KV-trim `defer` block inside `generate` and the interaction with the `if decode.engineCancelled` destroy branch immediately after.

Important Files Changed
- … `preparedPrompt` helper, add `prefill` entry point, mid-prefill abort via `abortTargetLock`/`abortTargetSequenceID`, and `modelRejectsPartialTrims` flag. A P1 bug exists: the KV-trim defer calls `engine.trimKV` on an already-destroyed sequence when engine cancellation fires, which can incorrectly set `modelRejectsPartialTrims = true` and permanently disable the prewarm path for the session.
- … `prefill` forwarding method with the same cancellation-handler pattern as `generate`, correctly calling `abortInFlightGeneration` in both cancel handlers. Clean implementation.
- … `prewarm` implementation and shared `makeGenerationOptions` helper. Task management for `inflightPrewarmTask` is correct: cancelled on new prewarm, on generation entry, and on context reset; tracker updated only after successful prefill.
- … `prefill` to `LlamaRuntimeGenerating` with a default no-op extension so test doubles continue compiling; updates `prewarm` doc-comment. Straightforward protocol extension.
- … `RecordingPrewarmRuntime` fake is minimal and correct.
- … `LlamaSuggestionEnginePrewarmTests.swift` to the test target. Mechanical project file change.

Sequence Diagram
sequenceDiagram
    participant Coord as Coordinator
    participant Engine as LlamaSuggestionEngine
    participant Manager as LlamaRuntimeManager
    participant Core as LlamaRuntimeCore
    Note over Coord,Core: Focus change
    Coord->>Engine: prewarm(for: request)
    Engine->>Engine: inflightPrewarmTask?.cancel()
    Engine->>Manager: prefill(prompt:cachedPrefixBytes:options:)
    Manager->>Core: core.prefill(...) [detached]
    Core->>Core: acquires autocompleteLock
    Core->>Core: guard !modelRejectsPartialTrims
    Core->>Core: obtainAutocompleteSequence → buildFreshSequence
    Core->>Core: setAbortTarget(seqID)
    Core->>Core: engine.decodePrompt (full prompt + seed token)
    Core->>Core: engine.trimKV (remove seed token)
    Core-->>Manager: ()
    Manager-->>Engine: ()
    Engine->>Engine: promptCacheHintTracker.recordSuccessfulRequest
    Note over Coord,Core: User types (keystroke)
    Coord->>Engine: generateSuggestion(for: request)
    Engine->>Engine: inflightPrewarmTask?.cancel() → abortInFlightGeneration()
    Engine->>Manager: generate(prompt:cachedPrefixBytes:options:)
    Manager->>Core: core.generate(...) [detached]
    Core->>Core: acquires autocompleteLock
    Core->>Core: obtainAutocompleteSequence (reuse path)
    Core->>Core: setAbortTarget(autocompleteSequenceID)
    Core->>Core: engine.decodePrompt (delta tokens only)
    Core->>Core: runEngineSampledDecode
    Core-->>Manager: result text
    Manager-->>Engine: rawSuggestion
    Engine-->>Coord: SuggestionResult
    Note over Coord,Core: Engine abort fires mid-prefill
    Core->>Core: sampleNext returns was_cancelled
    Core->>Core: engine.destroySequence(sequenceID)
    Core->>Core: defer trimKV on destroyed seq ⚠️

Reviews (3): Last reviewed commit: "review: stop prewarm from double-decodin..."