kv-cache: SWA checkpoints store only non-masked cells (cherry-pick #23981) by lalalune · Pull Request #27 · elizaOS/llama.cpp

lalalune · 2026-06-22T19:28:09Z

Cherry-picks upstream ggml-org/llama.cpp#23981 (commit 2365315) into the eliza fork.

Why: Gemma 4 is notorious for KV-checkpoint RAM blow-up (upstream #21690 OOM). This makes llama_kv_cache::state_write skip SWA-masked cells, shrinking every spec-decode rollback checkpoint (our FFI common_prompt_checkpoint → llama_state_seq_get_data_ext → state_write path). Directly relevant to the eliza-1 Gemma 4 cutover (#9033 in elizaOS/eliza).

Verified: cherry-pick clean; CPU rebuild green (build 10027); Gemma 4 E2B (Q8_0) still runs (llama-bench pp64/tg32 nominal). Deps (is_masked_swa/n_swa/swa_type) already present in the fork.
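
To illustrate why skipping SWA-masked cells shrinks a checkpoint, here is a minimal sketch, not the upstream patch. It assumes a standard sliding window in which a cell is masked once it is `n_swa` or more positions behind the reference position, and it takes the highest stored position as that reference. Which reference upstream uses is not stated in this PR. `is_masked_swa_sketch` and `cells_to_write` are hypothetical names made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the fork's is_masked_swa(): with a standard
// sliding window of n_swa tokens, a cell at position p0 is not visible to a
// query at position p1 once it is n_swa or more positions behind.
inline bool is_masked_swa_sketch(uint32_t n_swa, int32_t p0, int32_t p1) {
    if (n_swa == 0) {
        return false; // assumption: 0 means "no sliding window"
    }
    return p1 - p0 >= static_cast<int32_t>(n_swa);
}

// Returns the indices of the cells a checkpoint would need to store:
// non-empty cells (pos >= 0) that are still inside the window relative to
// the highest position currently held in the cache.
inline std::vector<size_t> cells_to_write(const std::vector<int32_t> & cell_pos,
                                          uint32_t n_swa) {
    int32_t pos_max = -1;
    for (int32_t p : cell_pos) {
        if (p > pos_max) {
            pos_max = p;
        }
    }

    std::vector<size_t> out;
    for (size_t i = 0; i < cell_pos.size(); ++i) {
        if (cell_pos[i] < 0) {
            continue; // empty cell, nothing to serialize
        }
        if (is_masked_swa_sketch(n_swa, cell_pos[i], pos_max)) {
            continue; // outside the window, skipped by the checkpoint
        }
        out.push_back(i);
    }
    return out;
}
```

Under these assumptions, a cache holding positions 0 through 9 with `n_swa = 4` would serialize only the four most recent cells instead of all ten, so the per-checkpoint cost scales with the window rather than with the context length.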

🤖 Generated with Claude Code

…eRT/MLX scaffolds (M4/M5)

Lets the one streaming-LLM FFI pipe (eliza_inference_llm_stream_*) be served by more than one in-process runtime, selected per-_open, without touching the default llama.cpp path. Realizes M3 of the Gemma 4 cutover and lands the device-gated M4/M5 backends on top.

M3 seam (always compiled, inert by default):
- src/llm-backend.h: LlmBackendSession / LlmBackendFactory pure-virtual interfaces mirroring the FFI 1:1, plus llm_backend_context_bundle_dir(ctx), the one accessor a backend uses to read the bundle root from the otherwise opaque EliInferenceContext (no can_serve->open bundle-dir caching).
- src/llm-backend-selector.cpp: idempotent registry + selection: ELIZA_LLM_BACKEND env hard-select, else highest preference_rank among available()+can_serve(); nullptr+no-error => keep in-tree llama.cpp. With no -DELIZA_ENABLE_* gate, no backend registers, so select() always returns nullptr.
- eliza-inference-ffi.cpp: one `if (stream->backend) return stream->backend->X()` branch inserted ABOVE each existing llama.cpp/MTP branch in open/prefill/next/cancel/reset/reset_keep/save_slot/restore_slot/close. Device-critical path untouched, just guarded.

M4 LiteRT-LM (gate -DELIZA_ENABLE_LITERT, OFF): src/backends/litert-backend.{h,cpp}: Engine/Session against the researched LiteRT-LM C++ API, NPU->GPU->CPU ladder, text/*.litertlm probe; no-SDK stub when OFF.

M5 CoreML/MLX (gate -DELIZA_ENABLE_MLX + __APPLE__, OFF; FATALs on non-Apple): src/backends/mlx-coreml-backend.{h,mm}: MLX-primary (mlx-c decode graph) + CoreML-alternate (stateful MLState KV); no-SDK stub when OFF.

CMake: selector folded into OMNIVOICE_FFI_SOURCES (always built); the two accelerator backends gated with SDK include/link knobs.

Default fused build verified on Linux: libelizainference.so links, the FFI pipe stays exported, litert_/mlx_coreml_backend_factory absent (gates OFF), byte-for-byte the prior llama.cpp path. Every hardware assumption tagged DEVICE-VERIFY.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
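
The selection rule in the commit message above (env hard-select, else highest preference_rank among available()+can_serve(), else nullptr) can be sketched roughly as follows. This is an illustration under assumptions, not the contents of src/llm-backend-selector.cpp: `BackendFactorySketch`, `SelectResultSketch`, and `select_backend_sketch` are hypothetical names, the forced name is passed in as a string rather than read from ELIZA_LLM_BACKEND, and reporting an error when a hard-selected backend cannot serve is an assumption the commit message does not spell out.

```cpp
#include <string>
#include <vector>

// Hypothetical, flattened stand-in for a backend factory. The real
// LlmBackendFactory is described as a pure-virtual interface.
struct BackendFactorySketch {
    std::string name;
    int         preference_rank;
    bool        is_available;   // SDK/hardware present on this device
    bool        serves_bundle;  // bundle ships the files this backend needs

    bool available() const { return is_available; }
    bool can_serve() const { return serves_bundle; }
};

struct SelectResultSketch {
    const BackendFactorySketch * factory; // nullptr => keep the in-tree llama.cpp path
    std::string                  error;   // empty   => no error
};

// forced: value of the hard-select override, empty when unset.
inline SelectResultSketch select_backend_sketch(
        const std::vector<BackendFactorySketch> & registry,
        const std::string & forced) {
    if (!forced.empty()) {
        for (const auto & f : registry) {
            if (f.name == forced) {
                if (f.available() && f.can_serve()) {
                    return {&f, ""};
                }
                // Assumption: a hard-select that cannot be honored is an error.
                return {nullptr, "backend '" + forced + "' cannot serve this bundle"};
            }
        }
        return {nullptr, "unknown backend '" + forced + "'"};
    }

    // No override: pick the highest-ranked backend that is usable here.
    const BackendFactorySketch * best = nullptr;
    for (const auto & f : registry) {
        if (!f.available() || !f.can_serve()) {
            continue;
        }
        if (best == nullptr || f.preference_rank > best->preference_rank) {
            best = &f;
        }
    }
    return {best, ""}; // empty registry => nullptr with no error
}
```

With an empty registry, which is what the commit describes for a build with no -DELIZA_ENABLE_* gate, the function returns nullptr with no error, so a caller guarded by `if (stream->backend)` would fall straight through to the existing llama.cpp branch.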

coderabbitai · 2026-06-22T19:28:17Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

#35)

Generalize the M3 streaming-LLM seam to ALL on-device model ops. A shared eliza_backend::Registry<F> (backend-registry.h) holds the resolution logic (ELIZA_<MOD>_BACKEND/ELIZA_BACKEND hard-select -> highest preference_rank among available()+can_serve() -> nullptr=ggml); each modality adds a tiny factory interface + selector + one FFI chokepoint.

Wired for embed/vision/asr/tts/eot: each routes to a backend that ships <bundle>/<modality>/* when present, else falls through to the in-tree ggml path. Inert-by-default (no backend registered => select() returns nullptr => every op byte-identical to before).

First real backend: LiteRT text embedding (backends/litert-embed-backend.cpp, gated ELIZA_ENABLE_LITERT) on the LiteRT Next *C* API (the C++ cc/ wrappers are not standalone): env/model/compiled-model lifecycle + NPU->GPU->CPU accelerator ladder (rank 100/20/0) + reads the in-graph-pooled [1,384] output; the WordPiece tokenizer + tensor binding are the one model-specific step (MANIFEST-gated). Serves <bundle>/embedding/*.tflite; auto-promotes to NPU on Pixel-10/G5 or Qualcomm/MediaTek silicon, GPU-delegate (Mali) on a Tensor-G4.

Split the LiteRT gates: ELIZA_ENABLE_LITERT = the LiteRT C-API per-op backends (embed); new ELIZA_ENABLE_LITERT_LM = the streaming-LLM backend on the heavier LiteRT-LM Engine SDK (off until that SDK is built). SESSION-OPS-TODO.md documents the vad/wakeword/speaker/diariz extension.

Verified: 11/11 TUs compile (inert selectors + self-contained headers + the gated embed backend against the LiteRT SDK); adversarial review confirms inert-by-default + correct chokepoints across all 5 modalities.

Co-authored-by: claude <noreply@anthropic.com>
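
The NPU->GPU->CPU accelerator ladder with ranks 100/20/0 mentioned above could look roughly like the sketch below. It is only an illustration: `AcceleratorSketch`, `LadderRungSketch`, and `pick_accelerator_sketch` are hypothetical names, and the `try_compile` callback stands in for whatever LiteRT call actually compiles the model for a given accelerator, which this commit message does not show.

```cpp
#include <functional>
#include <optional>

// Hypothetical accelerator tags; the real backend would use LiteRT's own.
enum class AcceleratorSketch { NPU, GPU, CPU };

struct LadderRungSketch {
    AcceleratorSketch accelerator;
    int               preference_rank; // ranks taken from the commit message
};

// Walks the ladder from most to least preferred and returns the first rung
// for which try_compile succeeds, or nullopt when nothing compiles.
inline std::optional<LadderRungSketch> pick_accelerator_sketch(
        const std::function<bool(AcceleratorSketch)> & try_compile) {
    static constexpr LadderRungSketch kLadder[] = {
        {AcceleratorSketch::NPU, 100},
        {AcceleratorSketch::GPU, 20},
        {AcceleratorSketch::CPU, 0},
    };
    for (const auto & rung : kLadder) {
        if (try_compile(rung.accelerator)) {
            return rung;
        }
    }
    return std::nullopt; // assumption: caller then falls back to the ggml path
}
```

One plausible reading of the ranks is that the rung that compiled becomes the factory's preference_rank, so a device where only the CPU rung works would lose to any higher-ranked registered backend. That mapping is an inference from the commit message rather than something it states.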

ggerganov and others added 2 commits June 22, 2026 12:26

kv-cache : SWA checkpoints store only non-masked cells (#23981)

3e81729

github-actions Bot added the examples label Jun 22, 2026

This was referenced Jun 22, 2026

Sync fork to upstream master (+604 commits) — Gemma4 verified #29

Open

feat(elizainference): per-op backend seam + LiteRT C-API embed backend #35

Merged

lalalune merged commit 02020a6 into main Jun 24, 2026
15 of 98 checks passed

lalalune merged 3 commits into main from eliza/gemma-kv-swa-checkpoint-fix
