perf(runtime): llama prewarm-on-focus, mid-prefill abort, and KV-reuse visibility #681

Merged
FuJacob merged 2 commits into main from perf/llama-prewarm-abort
Jun 12, 2026
@FuJacob FuJacob commented Jun 12, 2026

Summary

Three gaps in the llama path, found during the speed/footprint research pass:

  1. Prewarm-on-focus. The router already calls prewarm(for:) on focus change, and Apple Foundation Models uses it, but the llama engine inherited the protocol's no-op default whose comment claimed "llama already keeps its KV cache hot". In reality the focus-change reset destroys the native sequence, so the first suggestion in every field paid the full cold prompt decode (~30-300ms model-dependent). LlamaSuggestionEngine.prewarm now prefills the new field's prompt KV through a new prefill runtime entry point (tokenize + decode, no sampling) and primes the byte-prefix reuse hint only after the native decode succeeded. A real generation cancels an in-flight warmup on entry so it can never queue behind one.

  2. Mid-prefill abort. Swift cancellation is polled between sampled tokens, but prompt prefill (decodePrompt) was uninterruptible: the engine already has a per-sequence atomic abort flag that it checks once per chunk, but the app never set it. A superseding keystroke therefore waited out the entire stale prefill while that prefill held the autocomplete lock (worst case ~0.15-2s on a cold long prompt on the larger models). The manager's cancellation handlers now call core.abortInFlightGeneration(), which targets the in-flight sequence through a lock-guarded handshake and fires engine.cancelSequence. A cancelled prefill surfaces as a quiet CancellationError (never a runtime error, never a fresh rebuild of the stale prompt), and aborted sequences are destroyed because the native flag is set-once per sequence.

  3. KV-reuse visibility. trimKV is a partial llama_memory_seq_rm, which llama.cpp rejects on hybrid/recurrent and SWA caches; the catalog models read as qwen35/gemma4 in their GGUF headers, so the prefix-reuse fast path very likely falls back to a full prompt re-prefill on every request today, silently. The rejection now logs once per model load at info (plus per-event detail and reuse-hit stats at debug), so a single log line answers whether a given model reuses its prompt KV. This is the instrumentation step before any repair (state-checkpoint or two-slot copy designs).
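
As a rough illustration of the "logs once per model load" behavior in item 3, here is a minimal sketch. `TrimRejectionReporter`, its method names, and the log text are hypothetical and not taken from the PR; the sketch collects lines in an array instead of calling a real logger.

```swift
import Foundation

/// Hypothetical helper (not from the PR): counts partial-trim outcomes and
/// emits a single info line for the first rejection after each model load.
final class TrimRejectionReporter {
    private(set) var infoLines: [String] = []
    private(set) var reuseHits = 0
    private(set) var rejections = 0
    private var loggedForCurrentLoad = false

    /// Call on every model (re)load so the next rejection is reported again.
    func modelDidLoad() {
        loggedForCurrentLoad = false
        reuseHits = 0
        rejections = 0
    }

    /// Record the Bool result of one partial KV trim.
    func record(trimSucceeded: Bool, model: String) {
        if trimSucceeded {
            reuseHits += 1
            return
        }
        rejections += 1
        // Only the first rejection per model load reaches the info level.
        guard !loggedForCurrentLoad else { return }
        loggedForCurrentLoad = true
        infoLines.append("partial KV trim rejected by \(model); requests fall back to a full re-prefill")
    }
}

let reporter = TrimRejectionReporter()
reporter.modelDidLoad()
reporter.record(trimSucceeded: false, model: "example-model")
reporter.record(trimSucceeded: false, model: "example-model")
print(reporter.infoLines.count) // prints 1
```

The per-event counters stand in for the debug-level detail; only the one-shot line is meant to answer whether a model reuses its prompt KV.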

Validation

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build-for-testing \
  -derivedDataPath build/DerivedData CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO
# ** TEST BUILD SUCCEEDED **

xcodebuild ... test-without-building \
  -only-testing:CotabbyTests/LlamaSuggestionEngineCancellationTests \
  -only-testing:CotabbyTests/LlamaSuggestionEnginePrewarmTests \
  -only-testing:CotabbyTests/LlamaPromptCacheHintTrackerTests
# 12 tests, 0 failures (the cancel-must-not-wipe-KV invariant suite stays green)

swiftlint lint --quiet
# exit 0

New LlamaSuggestionEnginePrewarmTests pin the prewarm contract: prefill primes the reuse hint, a failed prefill leaves it cold, and a context reset clears it. End-to-end abort behavior (engine flag firing mid-chunk) requires a loaded model and is validated by the existing middleware-level cancellation plumbing it reuses; flagged explicitly as not unit-coverable here.
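
The three contract cases above can be sketched against a simplified, synchronous model. Everything here (`PromptHintTracker`, `PrewarmingEngine`, the closure-based runtime fake) is a hypothetical stand-in; the real engine is presumably asynchronous and its types differ.

```swift
/// Hypothetical error used by the failing fake runtime.
struct PrefillFailed: Error {}

/// Hypothetical stand-in for the prompt-cache hint tracker.
struct PromptHintTracker {
    private(set) var cachedPromptBytes: [UInt8]? = nil

    mutating func recordSuccessfulRequest(promptBytes: [UInt8]) {
        cachedPromptBytes = promptBytes
    }

    mutating func reset() {
        cachedPromptBytes = nil
    }
}

/// Hypothetical engine: the reuse hint is primed only after prefill returns
/// without throwing.
final class PrewarmingEngine {
    private(set) var tracker = PromptHintTracker()
    private let prefill: ([UInt8]) throws -> Void

    init(prefill: @escaping ([UInt8]) throws -> Void) {
        self.prefill = prefill
    }

    func prewarm(prompt: String) {
        let bytes = Array(prompt.utf8)
        do {
            try prefill(bytes)
            tracker.recordSuccessfulRequest(promptBytes: bytes)
        } catch {
            // A failed prefill leaves the hint cold on purpose.
        }
    }

    func resetContext() {
        tracker.reset()
    }
}

let warmEngine = PrewarmingEngine(prefill: { _ in })
warmEngine.prewarm(prompt: "hello")

let coldEngine = PrewarmingEngine(prefill: { _ in throw PrefillFailed() })
coldEngine.prewarm(prompt: "hello")
```

After these calls `warmEngine.tracker.cachedPromptBytes` holds the prompt bytes, `coldEngine.tracker.cachedPromptBytes` is still nil, and `warmEngine.resetContext()` clears the hint again.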

Linked issues

Refs #661

Risk / rollout notes

  • prefill shares the exact tokenize/truncate/options path with generate (one PreparedPrompt helper, one options factory), so a warmup can never poison reuse validation with a differently-built prompt.
  • The abort target is registered before each native decode and cleared on every exit, including throw paths, so a late cancel cannot flag a recycled sequence slot; the engine-side lookup is mutex-guarded and null-safe.
  • Focus changes now cost one prompt prefill on the llama engine (tens of ms GPU for typical prompts, bounded by the same context-window truncation as generation). This mirrors what the FM engine already does on focus, and a keystroke arriving mid-warmup aborts it.
  • Cancelled prefills/generations keep diagnostics.lastError untouched.
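
The second note (register the abort target before the decode, clear it on every exit) maps naturally onto `defer`. The sketch below is an assumption about the shape of that handshake, not the PR's code: `AbortTargetRegistry`, `withAbortTarget`, and the injected `cancelSequence` closure are hypothetical names.

```swift
import Foundation

/// Hypothetical registry for the sequence that a late cancel may target.
final class AbortTargetRegistry {
    private let lock = NSLock()
    private var targetSequenceID: Int32? = nil
    private let cancelSequence: (Int32) -> Void

    init(cancelSequence: @escaping (Int32) -> Void) {
        self.cancelSequence = cancelSequence
    }

    /// Registers `sequenceID` for the duration of `decode` and clears it on
    /// every exit, including when `decode` throws.
    func withAbortTarget<T>(_ sequenceID: Int32, _ decode: () throws -> T) rethrows -> T {
        setTarget(sequenceID)
        defer { setTarget(nil) }
        return try decode()
    }

    /// Fires the cancel only if a decode is currently registered. The lock is
    /// held across the lookup and the call so the target cannot be cleared
    /// (and its slot recycled) in between.
    @discardableResult
    func abortInFlight() -> Bool {
        lock.lock()
        defer { lock.unlock() }
        guard let id = targetSequenceID else { return false }
        cancelSequence(id)
        return true
    }

    private func setTarget(_ id: Int32?) {
        lock.lock()
        targetSequenceID = id
        lock.unlock()
    }
}

var cancelledIDs: [Int32] = []
let registry = AbortTargetRegistry(cancelSequence: { cancelledIDs.append($0) })

// A cancel that arrives while sequence 7 is decoding reaches the engine.
registry.withAbortTarget(7) { registry.abortInFlight() }

// A late cancel after the decode has exited finds no target and does nothing.
registry.abortInFlight()
print(cancelledIDs) // prints [7]
```

Holding the lock while calling `cancelSequence` is a design choice of this sketch; it assumes the cancel call is cheap and never re-enters the registry.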

🤖 Generated with Claude Code

Greptile Summary

This PR adds three runtime improvements to the llama autocomplete path: prewarm-on-focus (prefills the KV cache when a field receives focus), mid-prefill abort (a new abortInFlightGeneration engine-level mechanism that interrupts uninterruptible prompt decodes so superseded requests don't block behind stale work), and KV-reuse visibility (logs trim rejections once per model load to surface silent fallbacks on hybrid/SWA cache models).

  • Prewarm-on-focus: LlamaSuggestionEngine.prewarm now issues a prefill call through a new LlamaRuntimeCore.prefill entry point; the hint tracker is updated only after the native decode succeeds, and a real generation cancels any in-flight warmup before acquiring the autocomplete lock.
  • Mid-prefill abort: LlamaRuntimeManager.generate and .prefill both register an abortInFlightGeneration call in their withTaskCancellationHandler onCancel closures; the abort target is guarded by a separate abortTargetLock and cleared on every exit path.
  • KV-reuse visibility: modelRejectsPartialTrims is learned from the first failed trim after model load and gates the prewarm path; one info-level log fires per model load when the engine rejects partial trims.
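
For readers unfamiliar with the cancellation-handler pattern mentioned in the second bullet, the following sketch shows the general idea with Swift's `withTaskCancellationHandler`. `AbortFlag` and the simulated chunk loop are hypothetical; in the PR the handler calls abortInFlightGeneration and the per-chunk check lives in the native engine.

```swift
import Foundation

/// Hypothetical stand-in for the engine's per-sequence abort flag.
final class AbortFlag: @unchecked Sendable {
    private let lock = NSLock()
    private var fired = false

    func fire() {
        lock.lock()
        fired = true
        lock.unlock()
    }

    var isFired: Bool {
        lock.lock()
        defer { lock.unlock() }
        return fired
    }
}

/// Simulated prefill that checks the flag once per chunk. Task cancellation
/// reaches the loop only through the flag set in `onCancel`.
func simulatedPrefill(chunks: Int, abort: AbortFlag) async throws -> Int {
    try await withTaskCancellationHandler {
        var decoded = 0
        for _ in 0..<chunks {
            if abort.isFired { throw CancellationError() }
            decoded += 1
            await Task.yield()
        }
        return decoded
    } onCancel: {
        // Runs as soon as the task is cancelled, even mid-operation.
        abort.fire()
    }
}

let flag = AbortFlag()
let prefillTask = Task { try await simulatedPrefill(chunks: 1_000_000, abort: flag) }
prefillTask.cancel()
let outcome = await prefillTask.result
```

In this sketch the cancelled task is expected to end with a `CancellationError` rather than a decoded count, which mirrors the "quiet CancellationError" behavior the PR describes.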

Confidence Score: 4/5

Safe to merge with one fix: the KV-trim defer in generate calls engine.trimKV on a sequence that was already destroyed by the engine-abort path, which can incorrectly mark the model as incapable of prefix reuse for the rest of the session.

The abort path introduced by this PR creates a new state where a sequence is explicitly destroyed mid-function (decode.engineCancelled branch) before the KV-trim defer runs. The defer still calls engine.trimKV with the destroyed sequence ID; if the engine returns false for an unknown ID (the typical llama.cpp behavior), modelRejectsPartialTrims is set to true. Every subsequent prefill call is then skipped silently for the lifetime of the model session — exactly the opposite of what the prewarm feature is trying to achieve. The bug is triggered the first time a user types fast enough to cancel a generation mid-prefill, which is the primary scenario the abort mechanism is designed for. The remaining changes are well-structured and handle their edge cases correctly.

Cotabby/Services/Runtime/LlamaRuntimeCore.swift — specifically the KV-trim defer block inside generate and the interaction with the if decode.engineCancelled destroy branch immediately after.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/Services/Runtime/LlamaRuntimeCore.swift | Core inference engine refactored to extract preparedPrompt helper, add prefill entry point, mid-prefill abort via abortTargetLock/abortTargetSequenceID, and modelRejectsPartialTrims flag. A P1 bug exists: the KV-trim defer calls engine.trimKV on an already-destroyed sequence when engine cancellation fires, which can incorrectly set modelRejectsPartialTrims = true and permanently disable the prewarm path for the session. |
| Cotabby/Services/Runtime/LlamaRuntimeManager.swift | Adds prefill forwarding method with the same cancellation-handler pattern as generate, correctly calling abortInFlightGeneration in both cancel handlers. Clean implementation. |
| Cotabby/Services/Runtime/LlamaSuggestionEngine.swift | Adds prewarm implementation and shared makeGenerationOptions helper. Task management for inflightPrewarmTask is correct: cancelled on new prewarm, on generation entry, and on context reset; tracker updated only after successful prefill. |
| Cotabby/Models/SuggestionSubsystemContracts.swift | Adds prefill to LlamaRuntimeGenerating with a default no-op extension so test doubles continue compiling; updates prewarm doc-comment. Straightforward protocol extension. |
| CotabbyTests/LlamaSuggestionEnginePrewarmTests.swift | New test file covering the three prewarm contract cases: successful prefill primes the hint, failed prefill leaves it cold, and context reset clears it. RecordingPrewarmRuntime fake is minimal and correct. |
| Cotabby.xcodeproj/project.pbxproj | Adds LlamaSuggestionEnginePrewarmTests.swift to the test target. Mechanical project file change. |

Sequence Diagram

sequenceDiagram
    participant Coord as Coordinator
    participant Engine as LlamaSuggestionEngine
    participant Manager as LlamaRuntimeManager
    participant Core as LlamaRuntimeCore

    Note over Coord,Core: Focus change
    Coord->>Engine: prewarm(for: request)
    Engine->>Engine: inflightPrewarmTask?.cancel()
    Engine->>Manager: prefill(prompt:cachedPrefixBytes:options:)
    Manager->>Core: core.prefill(...) [detached]
    Core->>Core: acquires autocompleteLock
    Core->>Core: guard !modelRejectsPartialTrims
    Core->>Core: obtainAutocompleteSequence → buildFreshSequence
    Core->>Core: setAbortTarget(seqID)
    Core->>Core: engine.decodePrompt (full prompt + seed token)
    Core->>Core: engine.trimKV (remove seed token)
    Core-->>Manager: ()
    Manager-->>Engine: ()
    Engine->>Engine: promptCacheHintTracker.recordSuccessfulRequest

    Note over Coord,Core: User types (keystroke)
    Coord->>Engine: generateSuggestion(for: request)
    Engine->>Engine: inflightPrewarmTask?.cancel() → abortInFlightGeneration()
    Engine->>Manager: generate(prompt:cachedPrefixBytes:options:)
    Manager->>Core: core.generate(...) [detached]
    Core->>Core: acquires autocompleteLock
    Core->>Core: obtainAutocompleteSequence (reuse path)
    Core->>Core: setAbortTarget(autocompleteSequenceID)
    Core->>Core: engine.decodePrompt (delta tokens only)
    Core->>Core: runEngineSampledDecode
    Core-->>Manager: result text
    Manager-->>Engine: rawSuggestion
    Engine-->>Coord: SuggestionResult

    Note over Coord,Core: Engine abort fires mid-prefill
    Core->>Core: sampleNext returns was_cancelled
    Core->>Core: engine.destroySequence(sequenceID)
    Core->>Core: defer trimKV on destroyed seq ⚠️


Reviews (3): Last reviewed commit: "review: stop prewarm from double-decodin..."

Greptile also left 1 inline comment on this PR.

Comment thread Cotabby/Services/Runtime/LlamaSuggestionEngine.swift
Comment thread Cotabby/Services/Runtime/LlamaRuntimeCore.swift Outdated
FuJacob added 2 commits June 11, 2026 19:30
…e visibility

Three llama-path gaps. The engine's prewarm hook was the protocol no-op
while a focus change destroys the native sequence, so the first
suggestion in every field paid the full cold prompt decode; prewarm now
prefills the new field's prompt KV (no sampling) and primes the reuse
hint only after the native decode succeeded. Prompt prefill was
uninterruptible: Swift cancellation is polled between sampled tokens,
and the engine's per-sequence abort flag was never set by the app, so a
superseding request waited out the entire stale decode while it held
the autocomplete lock; the manager's cancellation handlers now fire
engine.cancelSequence through an abort-target handshake, with cancelled
prefills surfacing as quiet CancellationError and aborted sequences
destroyed (the native flag is set-once). And trimKV rejection, which
llama.cpp returns for partial removals on hybrid/SWA caches and which
silently degrades every request to a full re-prefill, now logs once per
model load at info plus per-event reuse stats at debug.
…al KV trims

On hybrid/SWA models (the current catalog) trimKV is rejected
unconditionally, so a warmed sequence still carries its seed token, the
following generate's reuse trim is rejected too, and the prompt gets
fully decoded twice. The core now learns the rejection from the first
failed trim after model load (generate's restore-trim, the reuse path,
or the prefill tail), drops a warmed sequence it cannot trim instead of
recording tracker facts the KV does not match, and turns subsequent
prefills into no-ops until the next model load.
@FuJacob FuJacob force-pushed the perf/llama-prewarm-abort branch from 968e454 to 00c4968 Compare June 12, 2026 02:31
@FuJacob FuJacob merged commit 7122deb into main Jun 12, 2026
4 checks passed
Comment on lines +173 to +184
defer {
    // Trim sampled tokens so KV retains only the prompt for the next request. A rejected
    // trim leaves the sampled tokens in KV while the tracker records prompt-only state;
    // that mismatch self-heals (the next reuse trim is rejected too and rebuilds fresh),
    // but it also proves this model can never reuse, so remember that for `prefill`.
    if !engine.trimKV(sequenceID, Int32(preparation.promptTokens.count)) {
        modelRejectsPartialTrims = true
    }
    autocompletePromptBytes = preparation.promptBytes
    autocompletePromptTokens = preparation.promptTokens
    autocompleteSamplingFingerprint = preparation.fingerprint
}

P1 engine.trimKV called on already-destroyed sequence when generation is engine-cancelled

When runEngineSampledDecode returns engineCancelled: true, the code explicitly destroys the sequence and sets autocompleteSequenceID = -1. However, the KV-trim defer block — registered before the destroy logic — runs afterward and still calls engine.trimKV(sequenceID, Int32(preparation.promptTokens.count)) on the already-freed sequence.

If the engine returns false for a trim on a non-existent sequence (the common behavior in llama.cpp when the sequence ID is unknown), modelRejectsPartialTrims is incorrectly set to true. From that point, every subsequent prefill call is skipped with "Prefill skipped: the loaded model rejects partial KV trims" — permanently disabling the prewarm fast path for the current model session even on models that fully support prefix reuse. The flag is only reset on model reload.

The concrete failure path: user types quickly → first generation's prefill is aborted by abortInFlightGeneration → sampleNext returns was_cancelled → sequence destroyed → trim defer calls trimKV on freed ID → returns false → modelRejectsPartialTrims = true → all subsequent prewarms are no-ops for the rest of the session.

The fix is to guard the trimKV call in the defer with a check that the sequence was not destroyed, e.g. by gating on autocompleteSequenceID >= 0 (which is set to -1 in the engine-cancelled branch before the defer runs).
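
A minimal sketch of the suggested guard, under stated assumptions: `FakeEngine` and `SketchCore` are hypothetical stand-ins, and the fake's `trimKV` returning false for an unknown sequence ID is the behavior this comment assumes, not something verified against the native engine.

```swift
/// Hypothetical fake engine: a trim on an unknown (destroyed) sequence ID
/// returns false, as does any trim on a model without partial-trim support.
final class FakeEngine {
    private var liveSequences: Set<Int32> = []
    let supportsPartialTrim: Bool

    init(supportsPartialTrim: Bool) {
        self.supportsPartialTrim = supportsPartialTrim
    }

    func createSequence(_ id: Int32) { liveSequences.insert(id) }
    func destroySequence(_ id: Int32) { liveSequences.remove(id) }

    func trimKV(_ id: Int32, _ keepTokens: Int32) -> Bool {
        liveSequences.contains(id) && supportsPartialTrim
    }
}

/// Hypothetical, heavily reduced version of the generate path.
final class SketchCore {
    let engine: FakeEngine
    var autocompleteSequenceID: Int32 = -1
    var modelRejectsPartialTrims = false

    init(engine: FakeEngine) {
        self.engine = engine
    }

    func generate(promptTokenCount: Int32, engineCancelled: Bool) {
        let sequenceID: Int32 = 1
        engine.createSequence(sequenceID)
        autocompleteSequenceID = sequenceID

        defer {
            // Guard: learn from the trim result only while the sequence still
            // exists. The cancelled branch below sets the ID to -1 first.
            if autocompleteSequenceID >= 0,
               !engine.trimKV(sequenceID, promptTokenCount) {
                modelRejectsPartialTrims = true
            }
        }

        if engineCancelled {
            engine.destroySequence(sequenceID)
            autocompleteSequenceID = -1
            return
        }
    }
}

let reuseCapable = SketchCore(engine: FakeEngine(supportsPartialTrim: true))
reuseCapable.generate(promptTokenCount: 4, engineCancelled: true)
print(reuseCapable.modelRejectsPartialTrims) // prints false
```

With the guard in place, a cancelled generation on a reuse-capable model no longer flips the flag, while a genuine rejection on a live sequence still does.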

