
kv-cache: SWA checkpoints store only non-masked cells (cherry-pick #23981) #27

Merged: lalalune merged 3 commits into main from eliza/gemma-kv-swa-checkpoint-fix on Jun 24, 2026

@lalalune

Member

Cherry-picks upstream ggml-org/llama.cpp#23981 (commit 2365315) into the eliza fork.

Why: Gemma 4 is notorious for KV-checkpoint RAM blow-up (upstream #21690 OOM). This makes llama_kv_cache::state_write skip SWA-masked cells, shrinking every spec-decode rollback checkpoint (our FFI common_prompt_checkpoint → llama_state_seq_get_data_ext → state_write path). Directly relevant to the eliza-1 Gemma 4 cutover (#9033 in elizaOS/eliza).
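For readers unfamiliar with the mechanism, here is a minimal sketch of the idea, not the upstream patch. It assumes the standard sliding-window rule (a cell at position p0 is masked for a query at p1 once p1 - p0 >= n_swa) and assumes the reference position is the one right after the newest cell. `ToyCell`, `toy_is_masked_swa` and `toy_checkpoint_write` are hypothetical names for illustration only; the real logic lives in llama_kv_cache::state_write and is_masked_swa.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-in for one KV cache cell; not the llama.cpp structure.
struct ToyCell {
    int32_t pos;             // token position held by this cell
    std::vector<float> kv;   // placeholder for the cell's K/V payload
};

// Assumed standard sliding-window rule: the cell at p0 is masked for a query
// at p1 once it has fallen n_swa or more positions behind.
inline bool toy_is_masked_swa(int32_t n_swa, int32_t p0, int32_t p1) {
    assert(n_swa > 0);
    return p1 - p0 >= n_swa;
}

// Return only the cells the next token could still attend to. A checkpoint
// writer that serializes this subset is smaller than one that dumps every cell.
inline std::vector<ToyCell> toy_checkpoint_write(const std::vector<ToyCell>& cells,
                                                 int32_t n_swa) {
    std::vector<ToyCell> out;
    if (cells.empty()) {
        return out;
    }
    int32_t pos_max = cells.front().pos;
    for (const ToyCell& c : cells) {
        pos_max = std::max(pos_max, c.pos);
    }
    const int32_t p_next = pos_max + 1;  // assumption: mask relative to the next position
    for (const ToyCell& c : cells) {
        if (!toy_is_masked_swa(n_swa, c.pos, p_next)) {
            out.push_back(c);
        }
    }
    return out;
}
```

With ten cells at positions 0..9 and n_swa = 4, only positions 7, 8 and 9 survive, so the toy checkpoint shrinks from ten cells to three.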

Verified: cherry-pick clean; CPU rebuild green (build 10027); Gemma 4 E2B (Q8_0) still runs (llama-bench pp64/tg32 nominal). Deps (is_masked_swa/n_swa/swa_type) already present in the fork.

🤖 Generated with Claude Code

ggerganov and others added 2 commits June 22, 2026 12:26
…eRT/MLX scaffolds (M4/M5)

Lets the one streaming-LLM FFI pipe (eliza_inference_llm_stream_*) be served
by more than one in-process runtime, selected per-_open, without touching the
default llama.cpp path. Realizes M3 of the Gemma 4 cutover and lands the
device-gated M4/M5 backends on top.

M3 seam (always compiled, inert by default):
- src/llm-backend.h — LlmBackendSession / LlmBackendFactory pure-virtual
  interfaces mirroring the FFI 1:1, plus llm_backend_context_bundle_dir(ctx),
  the one accessor a backend uses to read the bundle root from the otherwise
  opaque EliInferenceContext (no can_serve->open bundle-dir caching).
- src/llm-backend-selector.cpp — idempotent registry + selection: ELIZA_LLM_BACKEND
  env hard-select, else highest preference_rank among available()+can_serve();
  nullptr+no-error => keep in-tree llama.cpp. With no -DELIZA_ENABLE_* gate, no
  backend registers, so select() always returns nullptr.
- eliza-inference-ffi.cpp — one `if (stream->backend) return stream->backend->X()`
  branch inserted ABOVE each existing llama.cpp/MTP branch in open/prefill/next/
  cancel/reset/reset_keep/save_slot/restore_slot/close. Device-critical path
  untouched, just guarded.
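
The guard pattern described for eliza-inference-ffi.cpp can be sketched roughly as follows. This is an illustration under assumptions, not the fork's code: `ToyBackendSession`, `ToyStream`, `toy_llama_next` and `toy_stream_next` are hypothetical stand-ins, and only one of the nine entry points is shown.

```cpp
#include <string>

// Hypothetical session interface. The real LlmBackendSession is described as
// mirroring the FFI 1:1; only one method is modelled here.
struct ToyBackendSession {
    virtual ~ToyBackendSession() = default;
    virtual std::string next() = 0;
};

// Hypothetical stream handle. A null backend means "no alternate runtime was
// selected at open time".
struct ToyStream {
    ToyBackendSession* backend = nullptr;
};

// Stand-in for the existing llama.cpp/MTP branch.
inline std::string toy_llama_next(ToyStream&) {
    return "llama.cpp";
}

// One FFI entry point: the backend branch sits above the existing branch, so
// with no backend attached the original path runs unchanged.
inline std::string toy_stream_next(ToyStream& s) {
    if (s.backend) {
        return s.backend->next();
    }
    return toy_llama_next(s);
}

// Trivial alternate backend used only to exercise the guard.
struct ToyEchoSession final : ToyBackendSession {
    std::string next() override { return "alt-backend"; }
};
```

A default-constructed `ToyStream` takes the llama.cpp branch; attaching a `ToyEchoSession` routes the same call to the alternate runtime.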

M4 LiteRT-LM (gate -DELIZA_ENABLE_LITERT, OFF): src/backends/litert-backend.{h,cpp}
  — Engine/Session against the researched LiteRT-LM C++ API, NPU->GPU->CPU ladder,
  text/*.litertlm probe; no-SDK stub when OFF.

M5 CoreML/MLX (gate -DELIZA_ENABLE_MLX + __APPLE__, OFF; FATALs on non-Apple):
  src/backends/mlx-coreml-backend.{h,mm} — MLX-primary (mlx-c decode graph) +
  CoreML-alternate (stateful MLState KV); no-SDK stub when OFF.

CMake: selector folded into OMNIVOICE_FFI_SOURCES (always built); the two
accelerator backends gated with SDK include/link knobs. Default fused build
verified on Linux: libelizainference.so links, the FFI pipe stays exported,
litert_/mlx_coreml_backend_factory absent (gates OFF) — byte-for-byte the prior
llama.cpp path. Every hardware assumption tagged DEVICE-VERIFY.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 22, 2026


Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


#35)

Generalize the M3 streaming-LLM seam to ALL on-device model ops. A shared
eliza_backend::Registry<F> (backend-registry.h) holds the resolution logic
(ELIZA_<MOD>_BACKEND/ELIZA_BACKEND hard-select -> highest preference_rank among
available()+can_serve() -> nullptr=ggml); each modality adds a tiny factory
interface + selector + one FFI chokepoint. Wired for embed/vision/asr/tts/eot:
each routes to a backend that ships <bundle>/<modality>/* when present, else
falls through to the in-tree ggml path. Inert-by-default (no backend registered
=> select() returns nullptr => every op byte-identical to before).
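
The resolution order above could look roughly like the sketch below. It is a guess at the shape, not the contents of backend-registry.h: `ToyRegistry`, `ToyEmbedFactory` and `toy_forced_name` are hypothetical, the forced name is passed in as a string rather than read from the environment, and treating an unknown forced name as an error is an assumption.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Pick the hard-select name: per-modality variable first, then the global one.
// Callers would pass e.g. std::getenv(...) results; both may be null.
inline std::string toy_forced_name(const char* per_modality, const char* global) {
    if (per_modality && *per_modality) return per_modality;
    if (global && *global) return global;
    return {};
}

template <typename F>
class ToyRegistry {
public:
    // Idempotent: registering the same factory twice is a no-op.
    void add(F* f) {
        if (std::find(items_.begin(), items_.end(), f) == items_.end()) {
            items_.push_back(f);
        }
    }

    // Returns nullptr to mean "fall through to the in-tree ggml path".
    // *error is set only when a forced name matches nothing (assumption).
    F* select(const std::string& bundle_dir, const std::string& forced,
              bool* error = nullptr) const {
        if (error) *error = false;
        if (!forced.empty()) {
            for (F* f : items_) {
                if (forced == f->name()) return f;
            }
            if (error) *error = true;
            return nullptr;
        }
        F* best = nullptr;
        for (F* f : items_) {
            if (!f->available() || !f->can_serve(bundle_dir)) continue;
            if (!best || f->preference_rank() > best->preference_rank()) best = f;
        }
        return best;
    }

private:
    std::vector<F*> items_;
};

// Hypothetical factory with fixed answers, used only to exercise the registry.
struct ToyEmbedFactory {
    std::string id;
    int rank;
    bool avail;
    bool serves;
    const std::string& name() const { return id; }
    int preference_rank() const { return rank; }
    bool available() const { return avail; }
    bool can_serve(const std::string&) const { return serves; }
};
```

An empty registry always yields nullptr, which matches the inert-by-default property claimed in the commit message.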

First real backend: LiteRT text embedding (backends/litert-embed-backend.cpp,
gated ELIZA_ENABLE_LITERT) on the LiteRT Next *C* API (the C++ cc/ wrappers are
not standalone): env/model/compiled-model lifecycle + NPU->GPU->CPU accelerator
ladder (rank 100/20/0) + reads the in-graph-pooled [1,384] output; the WordPiece
tokenizer + tensor binding are the one model-specific step (MANIFEST-gated).
Serves <bundle>/embedding/*.tflite; auto-promotes to NPU on Pixel-10/G5 or
Qualcomm/MediaTek silicon, GPU-delegate (Mali) on a Tensor-G4.
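
The NPU->GPU->CPU ladder can be pictured as a first-success walk over accelerators in descending rank. The sketch below is hypothetical and does not call the LiteRT C API: `ToyAccel`, `ToyRung` and `toy_pick_accelerator` are made-up names, and `try_compile` stands in for whatever the backend does to build a compiled model on a given accelerator. Only the rank values 100/20/0 come from the commit message.

```cpp
#include <functional>

enum class ToyAccel { NPU, GPU, CPU };

struct ToyRung {
    ToyAccel accel;
    int rank;  // higher is preferred
};

// Ladder in descending preference, using the ranks quoted above.
static const ToyRung kToyLadder[] = {
    {ToyAccel::NPU, 100},
    {ToyAccel::GPU, 20},
    {ToyAccel::CPU, 0},
};

// Walk the ladder and return the first rung whose compile attempt succeeds,
// or nullptr if none does. try_compile is a hypothetical callback.
inline const ToyRung* toy_pick_accelerator(
        const std::function<bool(ToyAccel)>& try_compile) {
    for (const ToyRung& rung : kToyLadder) {
        if (try_compile(rung.accel)) {
            return &rung;
        }
    }
    return nullptr;
}
```

On a device where the NPU attempt fails but the GPU delegate works, this returns the GPU rung with rank 20, which mirrors the Tensor-G4 case mentioned above.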

Split the LiteRT gates: ELIZA_ENABLE_LITERT = the LiteRT C-API per-op backends
(embed); new ELIZA_ENABLE_LITERT_LM = the streaming-LLM backend on the heavier
LiteRT-LM Engine SDK (off until that SDK is built). SESSION-OPS-TODO.md documents
the vad/wakeword/speaker/diariz extension.

Verified: 11/11 TUs compile (inert selectors + self-contained headers + the
gated embed backend against the LiteRT SDK); adversarial review confirms
inert-by-default + correct chokepoints across all 5 modalities.

Co-authored-by: claude <noreply@anthropic.com>
@lalalune lalalune merged commit 02020a6 into main Jun 24, 2026
15 of 98 checks passed