feat(elizainference): per-op backend seam + LiteRT C-API embed backend #35

Merged: lalalune merged 1 commit into eliza/gemma-kv-swa-checkpoint-fix from feat/per-op-backend-seam-litert-embed on Jun 24, 2026

@lalalune

Stacks on the M3 multi-backend seam (#27 / eliza/gemma-kv-swa-checkpoint-fix) and generalizes it from the streaming-LLM op to every on-device model op, plus lands the first real accelerator backend (LiteRT text embedding).

What

  • Generic per-op registry (backend-registry.h): one eliza_backend::Registry<F> holds the resolution policy: ELIZA_<MOD>_BACKEND/ELIZA_BACKEND hard-select → highest preference_rank() among available()+can_serve() → nullptr (the in-tree ggml path). Each modality adds a tiny factory interface + selector + a single FFI chokepoint that reuses it.
  • Per-op seams for embed, vision, asr, tts, eot — each routes to a backend that ships <bundle>/<modality>/* when present, else falls through to ggml. Inert by default: with no -DELIZA_ENABLE_* gate, nothing registers, select() returns nullptr, and every op is byte-identical to before.
  • LiteRT embed backend (backends/litert-embed-backend.cpp, gated ELIZA_ENABLE_LITERT) on the LiteRT Next C API (the C++ cc/ wrappers aren't standalone): env/model/compiled-model lifecycle + an NPU→GPU→CPU accelerator ladder (preference_rank 100/20/0) reading the in-graph-pooled [1,384] output. Serves <bundle>/embedding/*.tflite; auto-promotes to NPU on a Pixel-10/G5 or Qualcomm/MediaTek device, GPU-delegate (Mali) on a Tensor-G4. The WordPiece tokenizer + tensor binding are the one model-specific step (a converted all-MiniLM-L6-v2 .tflite + I/O signature exist; the binding is marked TODO(MANIFEST)).
  • Gate split: ELIZA_ENABLE_LITERT = the LiteRT C-API per-op backends (embed); new ELIZA_ENABLE_LITERT_LM = the streaming-LLM backend on the heavier LiteRT-LM Engine SDK (off until that SDK is built). This unblocks the C-API build, which otherwise tried to compile the LLM backend's litert::lm deps.
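
For readers who have not opened backend-registry.h, the resolution policy described in the first bullet can be sketched roughly as below. This is an illustration under stated assumptions, not the code in this PR: the method names (name, available, can_serve, preference_rank, select) follow the description, but the signatures, the StubFactory helper, the forced_backend helper, the precedence of ELIZA_<MOD>_BACKEND over ELIZA_BACKEND, and the behavior on an unknown hard-select name are all guesses.

```cpp
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Hypothetical factory interface. Method names follow the PR description;
// the exact signatures are assumptions.
struct EmbedBackendFactory {
    virtual ~EmbedBackendFactory() = default;
    virtual std::string name() const = 0;
    virtual bool available() const = 0;                          // runtime/hardware present
    virtual bool can_serve(const std::string& bundle) const = 0; // bundle ships our assets
    virtual int preference_rank() const = 0;                     // higher wins
};

// Illustrative stub with fixed answers (not part of the PR).
struct StubFactory : EmbedBackendFactory {
    std::string name_;
    bool available_;
    bool serves_;
    int rank_;
    StubFactory(std::string n, bool a, bool s, int r)
        : name_(std::move(n)), available_(a), serves_(s), rank_(r) {}
    std::string name() const override { return name_; }
    bool available() const override { return available_; }
    bool can_serve(const std::string&) const override { return serves_; }
    int preference_rank() const override { return rank_; }
};

// Reads the hard-select override. Assumption: the per-modality variable
// (e.g. "ELIZA_EMBED_BACKEND") is consulted before the global ELIZA_BACKEND.
std::string forced_backend(const char* mod_env) {
    const char* v = std::getenv(mod_env);
    if (!v || !*v) v = std::getenv("ELIZA_BACKEND");
    return v ? std::string(v) : std::string();
}

template <typename F>
class Registry {
public:
    void add(const F* factory) { factories_.push_back(factory); }

    // forced: hard-selected backend name, or empty for automatic selection.
    // Returns nullptr when nothing qualifies, which the caller treats as
    // "use the in-tree ggml path".
    const F* select(const std::string& bundle, const std::string& forced) const {
        if (!forced.empty()) {
            for (const F* f : factories_)
                if (f->name() == forced) return f;
            return nullptr;  // assumption: unknown name falls back to ggml
        }
        const F* best = nullptr;
        for (const F* f : factories_) {
            if (!f->available() || !f->can_serve(bundle)) continue;
            if (!best || f->preference_rank() > best->preference_rank()) best = f;
        }
        return best;
    }

private:
    std::vector<const F*> factories_;
};
```

With nothing registered, select() returns nullptr, which matches the inert-by-default property claimed above.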

Verification

  • 11/11 TUs compile (NDK 29 aarch64): the 5 inert per-op selectors with no gate, all 5 backend headers self-contained, and the gated litert-embed-backend.cpp against the staged LiteRT C SDK (load-bearing — fails without -I .../litert/include).
  • Adversarial review: inert-by-default + correct chokepoint placement/args confirmed across all 5 modalities; no correctness bugs.

SESSION-OPS-TODO.md documents extending the same pattern to the session ops (vad/wakeword/speaker/diariz).

🤖 Generated with Claude Code

Generalize the M3 streaming-LLM seam to ALL on-device model ops. A shared
eliza_backend::Registry<F> (backend-registry.h) holds the resolution logic
(ELIZA_<MOD>_BACKEND/ELIZA_BACKEND hard-select -> highest preference_rank among
available()+can_serve() -> nullptr=ggml); each modality adds a tiny factory
interface + selector + one FFI chokepoint. Wired for embed/vision/asr/tts/eot:
each routes to a backend that ships <bundle>/<modality>/* when present, else
falls through to the in-tree ggml path. Inert-by-default (no backend registered
=> select() returns nullptr => every op byte-identical to before).
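
As a rough picture of what a single FFI chokepoint with ggml fall-through could look like, here is a minimal sketch. The names EmbedBackend, ggml_embed, and eliza_embed are hypothetical stand-ins, and the selected backend is passed in directly rather than resolved through the registry.

```cpp
#include <functional>
#include <string>
#include <vector>

using Embedding = std::vector<float>;

// Hypothetical backend handle: anything that can embed a string.
struct EmbedBackend {
    std::function<Embedding(const std::string&)> embed;
};

// Placeholder for the in-tree ggml path; returns a marker vector so the
// two paths can be told apart in this sketch.
Embedding ggml_embed(const std::string& /*text*/) {
    return Embedding{0.0f};
}

// The chokepoint: `selected` is whatever the selector returned.
// nullptr means no backend registered or qualified, so behavior is
// exactly the pre-existing ggml path.
Embedding eliza_embed(const std::string& text, const EmbedBackend* selected) {
    if (selected != nullptr) return selected->embed(text);
    return ggml_embed(text);
}
```

Because the branch is taken only when a backend was actually selected, a build with no gate enabled never leaves the ggml path.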

First real backend: LiteRT text embedding (backends/litert-embed-backend.cpp,
gated ELIZA_ENABLE_LITERT) on the LiteRT Next *C* API (the C++ cc/ wrappers are
not standalone): env/model/compiled-model lifecycle + NPU->GPU->CPU accelerator
ladder (rank 100/20/0) + reads the in-graph-pooled [1,384] output; the WordPiece
tokenizer + tensor binding are the one model-specific step (MANIFEST-gated).
Serves <bundle>/embedding/*.tflite; auto-promotes to NPU on Pixel-10/G5 or
Qualcomm/MediaTek silicon, GPU-delegate (Mali) on a Tensor-G4.
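
The NPU->GPU->CPU ladder can be pictured as a first-success walk over a fixed, rank-ordered list. The sketch below uses the ranks quoted above (100/20/0); the Accel enum, Rung struct, and the try_compile callback are hypothetical and stand in for whatever LiteRT C-API calls the backend really makes when compiling the model for a given accelerator.

```cpp
#include <functional>

enum class Accel { NPU, GPU, CPU };

struct Rung {
    Accel accel;
    int rank;  // preference rank reported when this rung is the one in use
};

// Ordered from most to least preferred, with the ranks from the PR text.
constexpr Rung kLadder[] = {
    {Accel::NPU, 100},
    {Accel::GPU, 20},
    {Accel::CPU, 0},
};

// try_compile is a hypothetical callback: it attempts to build a compiled
// model for one accelerator and returns true on success.
// Returns the first rung that succeeds, or nullptr if none does.
const Rung* pick_accelerator(const std::function<bool(Accel)>& try_compile) {
    for (const Rung& rung : kLadder) {
        if (try_compile(rung.accel)) return &rung;
    }
    return nullptr;
}
```

On a device where NPU compilation fails but the GPU delegate works, this walk settles on the GPU rung with rank 20.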

Split the LiteRT gates: ELIZA_ENABLE_LITERT = the LiteRT C-API per-op backends
(embed); new ELIZA_ENABLE_LITERT_LM = the streaming-LLM backend on the heavier
LiteRT-LM Engine SDK (off until that SDK is built). SESSION-OPS-TODO.md documents
the vad/wakeword/speaker/diariz extension.
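
One way to read the gate split: each macro independently controls whether its backend is compiled in at all, so enabling ELIZA_ENABLE_LITERT no longer pulls in the LiteRT-LM dependencies. A minimal sketch, assuming the gates are plain -D defines tested with #ifdef; the function and the backend name strings are hypothetical.

```cpp
#include <string>
#include <vector>

// Lists which LiteRT-based backends this translation unit was built with.
// With neither gate defined the list is empty, i.e. the build is inert.
std::vector<std::string> compiled_litert_backends() {
    std::vector<std::string> out;
#ifdef ELIZA_ENABLE_LITERT
    out.push_back("litert-embed");  // C-API per-op backend (embed)
#endif
#ifdef ELIZA_ENABLE_LITERT_LM
    out.push_back("litert-lm");     // streaming-LLM backend on the LiteRT-LM Engine SDK
#endif
    return out;
}
```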

Verified: 11/11 TUs compile (inert selectors + self-contained headers + the
gated embed backend against the LiteRT SDK); adversarial review confirms
inert-by-default + correct chokepoints across all 5 modalities.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c53ac6a0-e57e-4a08-bee0-52d6749f0078

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@lalalune lalalune merged commit 9be54e3 into eliza/gemma-kv-swa-checkpoint-fix Jun 24, 2026
6 of 36 checks passed