kv-cache: SWA checkpoints store only non-masked cells (cherry-pick #23981) #27
Merged
…eRT/MLX scaffolds (M4/M5)
Lets the one streaming-LLM FFI pipe (eliza_inference_llm_stream_*) be served
by more than one in-process runtime, selected per-_open, without touching the
default llama.cpp path. Realizes M3 of the Gemma 4 cutover and lands the
device-gated M4/M5 backends on top.
M3 seam (always compiled, inert by default):
- src/llm-backend.h — LlmBackendSession / LlmBackendFactory pure-virtual
interfaces mirroring the FFI 1:1, plus llm_backend_context_bundle_dir(ctx),
the one accessor a backend uses to read the bundle root from the otherwise
opaque EliInferenceContext (no can_serve->open bundle-dir caching).
- src/llm-backend-selector.cpp — idempotent registry + selection: ELIZA_LLM_BACKEND
env hard-select, else highest preference_rank among available()+can_serve();
nullptr+no-error => keep in-tree llama.cpp. With no -DELIZA_ENABLE_* gate, no
backend registers, so select() always returns nullptr.
- eliza-inference-ffi.cpp — one `if (stream->backend) return stream->backend->X()`
branch inserted ABOVE each existing llama.cpp/MTP branch in open/prefill/next/
cancel/reset/reset_keep/save_slot/restore_slot/close. Device-critical path
untouched, just guarded.
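
The selection rule described in the M3 bullets can be sketched roughly as follows. This is a minimal illustration, not the code in src/llm-backend-selector.cpp: `BackendFactory`, `StaticFactory`, `select_backend`, and `select_backend_from_env` are hypothetical names, and the handling of an unknown hard-selected name is an assumption.

```cpp
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the LlmBackendFactory interface; the real
// interface in src/llm-backend.h may use different names and signatures.
struct BackendFactory {
    virtual ~BackendFactory() = default;
    virtual std::string name() const = 0;
    virtual int preference_rank() const = 0;
    virtual bool available() const = 0;  // SDK and device present
    virtual bool can_serve(const std::string& bundle_dir) const = 0;
};

// Trivial concrete factory, only here to exercise the selector.
struct StaticFactory final : BackendFactory {
    std::string name_;
    int rank_;
    bool available_;
    bool can_serve_;
    StaticFactory(std::string n, int r, bool a, bool c)
        : name_(std::move(n)), rank_(r), available_(a), can_serve_(c) {}
    std::string name() const override { return name_; }
    int preference_rank() const override { return rank_; }
    bool available() const override { return available_; }
    bool can_serve(const std::string&) const override { return can_serve_; }
};

// Returns the chosen factory, or nullptr meaning "keep the in-tree llama.cpp
// path". `forced` models the ELIZA_LLM_BACKEND value (nullptr or "" if unset).
inline const BackendFactory* select_backend(
        const std::vector<const BackendFactory*>& registry,
        const std::string& bundle_dir,
        const char* forced) {
    if (forced != nullptr && *forced != '\0') {
        for (const BackendFactory* f : registry) {
            if (f->name() == forced) return f;  // hard-select by name
        }
        // Assumption: the real selector would also report an error here.
        return nullptr;
    }
    const BackendFactory* best = nullptr;
    for (const BackendFactory* f : registry) {
        if (!f->available() || !f->can_serve(bundle_dir)) continue;
        if (best == nullptr || f->preference_rank() > best->preference_rank()) {
            best = f;
        }
    }
    return best;  // nullptr when nothing is registered or nothing qualifies
}

// Convenience wrapper that reads the environment variable named in the text.
inline const BackendFactory* select_backend_from_env(
        const std::vector<const BackendFactory*>& registry,
        const std::string& bundle_dir) {
    return select_backend(registry, bundle_dir, std::getenv("ELIZA_LLM_BACKEND"));
}
```

With an empty registry (the default build, where no -DELIZA_ENABLE_* gate is set) the function returns nullptr, which is what keeps each `if (stream->backend)` guard in the FFI false and the llama.cpp path unchanged.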
M4 LiteRT-LM (gate -DELIZA_ENABLE_LITERT, OFF): src/backends/litert-backend.{h,cpp}
— Engine/Session against the researched LiteRT-LM C++ API, NPU->GPU->CPU ladder,
text/*.litertlm probe; no-SDK stub when OFF.
M5 CoreML/MLX (gate -DELIZA_ENABLE_MLX + __APPLE__, OFF; FATALs on non-Apple):
src/backends/mlx-coreml-backend.{h,mm} — MLX-primary (mlx-c decode graph) +
CoreML-alternate (stateful MLState KV); no-SDK stub when OFF.
CMake: selector folded into OMNIVOICE_FFI_SOURCES (always built); the two
accelerator backends gated with SDK include/link knobs. Default fused build
verified on Linux: libelizainference.so links, the FFI pipe stays exported,
litert_/mlx_coreml_backend_factory absent (gates OFF) — byte-for-byte the prior
llama.cpp path. Every hardware assumption tagged DEVICE-VERIFY.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 22, 2026
#35) Generalize the M3 streaming-LLM seam to ALL on-device model ops.
A shared eliza_backend::Registry<F> (backend-registry.h) holds the resolution logic (ELIZA_<MOD>_BACKEND/ELIZA_BACKEND hard-select -> highest preference_rank among available()+can_serve() -> nullptr=ggml); each modality adds a tiny factory interface + selector + one FFI chokepoint.
Wired for embed/vision/asr/tts/eot: each routes to a backend that ships <bundle>/<modality>/* when present, else falls through to the in-tree ggml path. Inert-by-default (no backend registered => select() returns nullptr => every op byte-identical to before).
First real backend: LiteRT text embedding (backends/litert-embed-backend.cpp, gated ELIZA_ENABLE_LITERT) on the LiteRT Next *C* API (the C++ cc/ wrappers are not standalone): env/model/compiled-model lifecycle + NPU->GPU->CPU accelerator ladder (rank 100/20/0) + reads the in-graph-pooled [1,384] output; the WordPiece tokenizer + tensor binding are the one model-specific step (MANIFEST-gated). Serves <bundle>/embedding/*.tflite; auto-promotes to NPU on Pixel-10/G5 or Qualcomm/MediaTek silicon, GPU-delegate (Mali) on a Tensor-G4.
Split the LiteRT gates: ELIZA_ENABLE_LITERT = the LiteRT C-API per-op backends (embed); new ELIZA_ENABLE_LITERT_LM = the streaming-LLM backend on the heavier LiteRT-LM Engine SDK (off until that SDK is built). SESSION-OPS-TODO.md documents the vad/wakeword/speaker/diariz extension.
Verified: 11/11 TUs compile (inert selectors + self-contained headers + the gated embed backend against the LiteRT SDK); adversarial review confirms inert-by-default + correct chokepoints across all 5 modalities.
Co-authored-by: claude <noreply@anthropic.com>
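
The NPU->GPU->CPU ladder mentioned above might look something like the sketch below. It assumes the ranks 100/20/0 map to NPU/GPU/CPU in that order, and it replaces the actual LiteRT compile step with a hypothetical `try_compile` callback; none of these names come from the LiteRT API or from backends/litert-embed-backend.cpp.

```cpp
#include <array>
#include <functional>
#include <optional>

enum class Accelerator { NPU, GPU, CPU };

// One rung of the ladder: which accelerator to try and the rank to report.
struct LadderRung {
    Accelerator accel;
    int rank;
};

// Ordered from most to least preferred (assumed mapping of rank 100/20/0).
constexpr std::array<LadderRung, 3> kLadder{{
    {Accelerator::NPU, 100},
    {Accelerator::GPU, 20},
    {Accelerator::CPU, 0},
}};

// Walks the ladder top-down and returns the first rung whose compile attempt
// succeeds. `try_compile` is a hypothetical hook standing in for the real
// "compile the model for this accelerator" call.
inline std::optional<LadderRung> pick_accelerator(
        const std::function<bool(Accelerator)>& try_compile) {
    for (const LadderRung& rung : kLadder) {
        if (try_compile(rung.accel)) return rung;
    }
    return std::nullopt;  // nothing compiled; caller falls back to ggml
}
```

Returning the rung rather than only the accelerator lets the factory report the matching preference_rank to the registry, so a backend that only reached the CPU rung would not outrank one that reached the NPU.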
Cherry-picks upstream ggml-org/llama.cpp#23981 (commit 2365315) into the eliza fork.
Why: Gemma 4 is notorious for KV-checkpoint RAM blow-up (upstream #21690 OOM). This makes `llama_kv_cache::state_write` skip SWA-masked cells, shrinking every spec-decode rollback checkpoint (our FFI `common_prompt_checkpoint → llama_state_seq_get_data_ext → state_write` path). Directly relevant to the eliza-1 Gemma 4 cutover (#9033 in elizaOS/eliza).
Verified: cherry-pick clean; CPU rebuild green (build 10027); Gemma 4 E2B (Q8_0) still runs (llama-bench pp64/tg32 nominal). Deps (`is_masked_swa`/`n_swa`/`swa_type`) already present in the fork.
🤖 Generated with Claude Code
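
To illustrate why skipping SWA-masked cells shrinks a checkpoint, here is a minimal standalone sketch. It is not the upstream patch: `is_masked_swa_sketch` and `cells_to_store` are hypothetical helpers, the masking rule assumes a standard sliding window (a key at `p0` is invisible to a query at `p1` once `p1 - p0 >= n_swa`), and the choice of the newest position as the reference query is also an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed standard sliding-window rule: the key at position p0 is masked for
// a query at position p1 once it has fallen n_swa or more positions behind.
// n_swa == 0 is treated here as "no sliding window".
inline bool is_masked_swa_sketch(uint32_t n_swa, int32_t p0, int32_t p1) {
    return n_swa > 0 && p1 - p0 >= static_cast<int32_t>(n_swa);
}

// Returns the indices of the cells a checkpoint would need to store, given
// each cell's position and the newest position in the sequence. Cells that
// the window has already masked out are skipped.
inline std::vector<std::size_t> cells_to_store(
        const std::vector<int32_t>& cell_pos,
        int32_t pos_max,
        uint32_t n_swa) {
    std::vector<std::size_t> keep;
    for (std::size_t i = 0; i < cell_pos.size(); ++i) {
        if (!is_masked_swa_sketch(n_swa, cell_pos[i], pos_max)) {
            keep.push_back(i);
        }
    }
    return keep;
}
```

Under these assumptions, a sequence at positions 0..9 with `n_swa = 4` keeps only the four cells at positions 6..9, so the stored state scales with the window size rather than the full context length.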