feat(elizainference): per-op backend seam + LiteRT C-API embed backend #35
Merged
lalalune merged 1 commit into eliza/gemma-kv-swa-checkpoint-fix on Jun 24, 2026
Generalize the M3 streaming-LLM seam to ALL on-device model ops.

A shared eliza_backend::Registry<F> (backend-registry.h) holds the resolution logic (ELIZA_<MOD>_BACKEND/ELIZA_BACKEND hard-select -> highest preference_rank among available()+can_serve() -> nullptr=ggml); each modality adds a tiny factory interface + selector + one FFI chokepoint.

Wired for embed/vision/asr/tts/eot: each routes to a backend that ships <bundle>/<modality>/* when present, else falls through to the in-tree ggml path. Inert by default (no backend registered => select() returns nullptr => every op is byte-identical to before).

First real backend: LiteRT text embedding (backends/litert-embed-backend.cpp, gated by ELIZA_ENABLE_LITERT) on the LiteRT Next *C* API (the C++ cc/ wrappers are not standalone): env/model/compiled-model lifecycle + NPU->GPU->CPU accelerator ladder (rank 100/20/0) + a read of the in-graph-pooled [1,384] output; the WordPiece tokenizer + tensor binding are the one model-specific step (MANIFEST-gated). Serves <bundle>/embedding/*.tflite; auto-promotes to NPU on Pixel-10/G5 or Qualcomm/MediaTek silicon, and to the GPU delegate (Mali) on a Tensor-G4.

Split the LiteRT gates: ELIZA_ENABLE_LITERT = the LiteRT C-API per-op backends (embed); new ELIZA_ENABLE_LITERT_LM = the streaming-LLM backend on the heavier LiteRT-LM Engine SDK (off until that SDK is built).

SESSION-OPS-TODO.md documents the vad/wakeword/speaker/diariz extension.

Verified: 11/11 TUs compile (inert selectors + self-contained headers + the gated embed backend against the LiteRT SDK); adversarial review confirms inert-by-default + correct chokepoints across all 5 modalities.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
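The resolution order in the commit message (hard-select, then highest `preference_rank` among `available()` + `can_serve()`, then `nullptr` meaning ggml) can be sketched as below. This is a minimal illustration, not the contents of `backend-registry.h`: the `Factory` signatures, `read_override`, `StubBackend`, and the choice to return `nullptr` for an unknown hard-selected name are all assumptions made for the example. It needs C++11 or later.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical factory interface. Method names follow the PR text;
// the exact signatures are assumptions.
struct Factory {
    virtual ~Factory() = default;
    virtual const char* name() const = 0;
    virtual bool available() const = 0;                           // runtime/hardware usable?
    virtual bool can_serve(const std::string& bundle) const = 0;  // bundle ships our files?
    virtual int preference_rank() const = 0;                      // higher wins
};

// Reads ELIZA_<MOD>_BACKEND first, then ELIZA_BACKEND.
// Returns an empty string when neither is set (hypothetical helper).
inline std::string read_override(const char* modality_var) {
    const char* v = std::getenv(modality_var);
    if (!v || !*v) v = std::getenv("ELIZA_BACKEND");
    return (v && *v) ? std::string(v) : std::string();
}

template <typename F>
class Registry {
public:
    void add(const F* f) { factories_.push_back(f); }

    // Returns the chosen factory, or nullptr meaning "use the in-tree ggml path".
    const F* select(const std::string& forced, const std::string& bundle) const {
        if (!forced.empty()) {
            // Hard-select by name. Assumption: an unknown name falls back to ggml.
            for (const F* f : factories_)
                if (forced == f->name()) return f;
            return nullptr;
        }
        const F* best = nullptr;
        for (const F* f : factories_) {
            if (!f->available() || !f->can_serve(bundle)) continue;
            if (!best || f->preference_rank() > best->preference_rank()) best = f;
        }
        return best;  // stays nullptr when nothing is registered or eligible
    }

private:
    std::vector<const F*> factories_;
};

// Stub backend used only to exercise the sketch.
struct StubBackend : Factory {
    std::string id;
    bool avail;
    bool serves;
    int rank;
    StubBackend(std::string i, bool a, bool s, int r)
        : id(std::move(i)), avail(a), serves(s), rank(r) {}
    const char* name() const override { return id.c_str(); }
    bool available() const override { return avail; }
    bool can_serve(const std::string&) const override { return serves; }
    int preference_rank() const override { return rank; }
};

}  // namespace sketch
```

A caller would pass `sketch::read_override("ELIZA_EMBED_BACKEND")` as `forced` (the variable name is a guess at what `ELIZA_<MOD>_BACKEND` expands to for embed). With an empty registry, `select` returns `nullptr`, which is the inert-by-default property the commit relies on.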
9be54e3 into eliza/gemma-kv-swa-checkpoint-fix
6 of 36 checks passed
Stacks on the M3 multi-backend seam (#27 / `eliza/gemma-kv-swa-checkpoint-fix`) and generalizes it from the streaming-LLM op to every on-device model op, plus lands the first real accelerator backend (LiteRT text embedding).

## What

- Shared registry (`backend-registry.h`): one `eliza_backend::Registry<F>` holds the resolution policy: `ELIZA_<MOD>_BACKEND`/`ELIZA_BACKEND` hard-select → highest `preference_rank()` among `available()` + `can_serve()` → `nullptr` (the in-tree ggml path). Each modality adds a tiny factory interface + selector + a single FFI chokepoint that reuses it.
- Wired for embed/vision/asr/tts/eot: each routes to a backend that ships `<bundle>/<modality>/*` when present, else falls through to ggml. Inert by default: with no `-DELIZA_ENABLE_*` gate, nothing registers, `select()` returns `nullptr`, and every op is byte-identical to before.
- First real backend: LiteRT text embedding (`backends/litert-embed-backend.cpp`, gated by `ELIZA_ENABLE_LITERT`) on the LiteRT Next C API (the C++ `cc/` wrappers aren't standalone): env/model/compiled-model lifecycle + an NPU→GPU→CPU accelerator ladder (`preference_rank` 100/20/0) reading the in-graph-pooled `[1,384]` output. Serves `<bundle>/embedding/*.tflite`; auto-promotes to NPU on a Pixel-10/G5 or Qualcomm/MediaTek device, and to the GPU delegate (Mali) on a Tensor-G4. The WordPiece tokenizer + tensor binding are the one model-specific step (a converted all-MiniLM-L6-v2 `.tflite` and its I/O signature exist; the binding is marked `TODO(MANIFEST)`).
- Split the LiteRT gates: `ELIZA_ENABLE_LITERT` = the LiteRT C-API per-op backends (embed); new `ELIZA_ENABLE_LITERT_LM` = the streaming-LLM backend on the heavier LiteRT-LM Engine SDK (off until that SDK is built). This unblocks the C-API build, which otherwise tried to compile the LLM backend's `litert::lm` deps.

## Verification

- 11/11 TUs compile: the inert selectors, the self-contained headers, and `litert-embed-backend.cpp` against the staged LiteRT C SDK (load-bearing: it fails without `-I .../litert/include`).
- Adversarial review confirms inert-by-default behavior and correct chokepoints across all 5 modalities.

`SESSION-OPS-TODO.md` documents extending the same pattern to the session ops (vad/wakeword/speaker/diariz).

🤖 Generated with Claude Code
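The NPU→GPU→CPU accelerator ladder described above can be pictured as a first-success walk over a fixed order. The sketch below deliberately avoids the LiteRT C API: `try_compile` is a hypothetical stand-in for whatever call creates the compiled model on a given accelerator, and the `Accel` enum, `preference_rank`, and `pick_accelerator` names are invented for illustration. The rank values assume 100/20/0 map to NPU/GPU/CPU in that order. It needs C++17 for `std::optional`.

```cpp
#include <array>
#include <cassert>
#include <functional>
#include <optional>

namespace sketch {

// Hypothetical accelerator tags; the real backend uses LiteRT's own types.
enum class Accel { Npu, Gpu, Cpu };

// Rank per rung, assuming the PR's 100/20/0 maps to NPU/GPU/CPU in that order.
constexpr int preference_rank(Accel a) {
    switch (a) {
        case Accel::Npu: return 100;
        case Accel::Gpu: return 20;
        case Accel::Cpu: return 0;
    }
    return 0;
}

// Walks the ladder from the highest-ranked accelerator down and returns the
// first one for which try_compile succeeds, or std::nullopt if none does.
// try_compile stands in for the real compiled-model creation call.
inline std::optional<Accel> pick_accelerator(
    const std::function<bool(Accel)>& try_compile) {
    constexpr std::array<Accel, 3> ladder{Accel::Npu, Accel::Gpu, Accel::Cpu};
    for (Accel a : ladder) {
        if (try_compile(a)) return a;
    }
    return std::nullopt;
}

}  // namespace sketch
```

On a device without an NPU, a `try_compile` that rejects `Accel::Npu` makes the walk settle on `Accel::Gpu`, which matches the Tensor-G4 behavior the description mentions. How the real backend reports the chosen rung's rank to the registry is not shown in this PR text.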