feat(omnivoice): token-by-token streaming vision describe (ABI v13)#33

Merged
lalalune merged 2 commits into main from feat/vision-stream-describe-abi-v13
Jun 24, 2026

@lalalune

Member

Adds eliza_inference_describe_image_stream_open + eliza_inference_vision_stream_supported (ABI 12 → 13).

_stream_open runs the same mmproj prefill as eliza_inference_describe_image (mtmd_tokenize + mtmd_helper_eval_chunks) but, instead of decoding into a buffer, returns an EliLlmStream * primed with the image+prompt KV. The caller then pulls tokens with the existing eliza_inference_llm_stream_next loop and frees the handle via eliza_inference_llm_stream_close. This reuses the entire streaming-LLM machinery, so a vision description streams token-by-token through the same path as chat text.

Pull model (not a callback/push): the host event loop yields between _next steps, so chunks reach the UI live; a push callback would block the caller for the whole decode. The stream carries a greedy sampler + ELIZA_VISION_MAX_TOKENS cap and mtp = null (plain fixed-KV decode path).
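To make the pull model concrete, here is a minimal sketch of a caller-driven drain loop. It does not use the real ABI: `MockStream`, `mock_stream_next`, `drain_stream`, `Sink` and `append_chunk` are hypothetical stand-ins, and the real `eliza_inference_llm_stream_next` signature may well differ. The point is only that each iteration performs one step and hands one chunk to the caller, so the host can yield between iterations, and that a token cap ends the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a primed stream handle (NOT the real EliLlmStream). */
typedef struct {
    const char *const *pieces; /* canned token pieces to "decode" */
    size_t count;              /* number of canned pieces */
    size_t pos;                /* next piece to hand out */
    size_t max_tokens;         /* cap, in the spirit of ELIZA_VISION_MAX_TOKENS */
} MockStream;

/* One pull step: returns 1 and sets *piece, or 0 at end of text or at the cap. */
static int mock_stream_next(MockStream *s, const char **piece) {
    assert(s != NULL && piece != NULL);
    if (s->pos >= s->count || s->pos >= s->max_tokens) {
        return 0;
    }
    *piece = s->pieces[s->pos++];
    return 1;
}

typedef void (*on_chunk_fn)(const char *piece, void *user);

/* Caller-driven loop: control returns to the caller after every step, which is
 * where a host event loop would get its chance to yield and repaint. */
static size_t drain_stream(MockStream *s, on_chunk_fn on_chunk, void *user) {
    size_t produced = 0;
    const char *piece = NULL;
    while (mock_stream_next(s, &piece)) {
        on_chunk(piece, user);
        produced++;
    }
    return produced;
}

/* Example sink: accumulates chunks the way a UI text field might. */
typedef struct {
    char text[128];
    size_t chunks;
} Sink;

static void append_chunk(const char *piece, void *user) {
    Sink *sink = user;
    strncat(sink->text, piece, sizeof sink->text - strlen(sink->text) - 1);
    sink->chunks++;
}
```

With four canned pieces and a cap of three, `drain_stream` delivers three chunks and stops. In a real host the loop body would be one scheduled task per step rather than a tight `while`.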

Additive + gated on the existing -DELIZA_ENABLE_VISION: a v12 caller is unaffected; a v12 library reports vision_stream_supported() == 0 so loaders fall back to the buffered _describe_image.
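The fallback rule can be sketched as a small loader-side decision. This is an illustration under assumptions, not the actual loader: `MockLib`, `choose_describe_path` and the `DescribePath` enum are hypothetical, and treating a missing symbol the same as a `0` result is an assumption about how a loader would report "unsupported" for a v12 library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of a loaded library (NOT the real loader's type). */
typedef struct {
    int abi_version;                      /* e.g. 12 or 13 */
    int (*vision_stream_supported)(void); /* NULL if the symbol was not found */
} MockLib;

typedef enum {
    DESCRIBE_BUFFERED, /* existing buffered describe path */
    DESCRIBE_STREAMING /* new stream-open path */
} DescribePath;

/* Pick streaming only when the library is new enough AND reports support;
 * everything else falls back to the buffered call. */
static DescribePath choose_describe_path(const MockLib *lib) {
    assert(lib != NULL);
    if (lib->abi_version >= 13 &&
        lib->vision_stream_supported != NULL &&
        lib->vision_stream_supported() != 0) {
        return DESCRIBE_STREAMING;
    }
    return DESCRIBE_BUFFERED;
}

/* Canned probes standing in for a build with and without vision streaming. */
static int supported_yes(void) { return 1; }
static int supported_no(void) { return 0; }
```

Checking both the version and the probe keeps the decision safe for a v13 library built without -DELIZA_ENABLE_VISION, which would still need the buffered path.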

Validated on Windows CPU with SmolVLM-500M (mtmd): streams 256 token chunks with real OCR. Consumed by elizaOS/eliza#9289 (JS cascade + handler wiring) — that PR degrades gracefully until this lands and the gitlink is bumped.

🤖 Generated with Claude Code

lalalune and others added 2 commits June 22, 2026 13:13
…Apple/Android-DL)

The kokoro subtree failed three distinct CI lanes, none backend-specific:

- PIC: kokoro_lib is a STATIC archive folded PRIVATE into the fused SHARED
  libelizainference.so, but it never set POSITION_INDEPENDENT_CODE, so ld
  rejected its objects on every BUILD_SHARED_LIBS=ON link ("recompile with
  -fPIC", R_X86_64_PC32 on x86-64 / R_AARCH64_ADR_PREL_PG_HI21 on arm64) —
  breaking the openvino, sycl, vulkan and virtgpu builds. Set PIC ON, mirroring
  eliza_voice_classifiers in the sibling omnivoice subtree.
- Apple: kokoro-tts is a CLI harness but CMake defaults Apple executables to
  MACOSX_BUNDLE, so `install(TARGETS kokoro-tts RUNTIME)` failed configure with
  "no BUNDLE DESTINATION for MACOSX_BUNDLE executable" on every ios/tvos/
  visionos/macos target. Force MACOSX_BUNDLE OFF.
- Android: kokoro.cpp called ggml_backend_cpu_init() directly, which is an
  undefined symbol under -DGGML_BACKEND_DL (the CPU backend is a loadable
  module). Switch to the registry API (ggml_backend_load_all() +
  ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, nullptr)), matching
  omnivoice; works in both DL and statically-linked builds.

Compile-validated on MSVC (kokoro_lib builds); the Linux/Apple effects are CMake
config + a portable registry call requiring no backend SDK to be correct.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add `eliza_inference_describe_image_stream_open` + `eliza_inference_vision_stream_supported`
(ABI 12 -> 13). The open call runs the SAME mmproj prefill as `eliza_inference_describe_image`
(mtmd_tokenize + mtmd_helper_eval_chunks) but, instead of decoding the whole
description into a buffer, returns an `EliLlmStream *` primed with the image+prompt
KV. The caller then PULLS tokens with the existing `eliza_inference_llm_stream_next`
loop and frees the handle with `eliza_inference_llm_stream_close` — reusing the entire
streaming-LLM machinery, so a vision description streams token-by-token through the
same path as chat text (a pull model, so the host event loop yields between steps;
a callback/push model would block the caller for the whole decode).

The returned stream carries a greedy sampler + ELIZA_VISION_MAX_TOKENS cap and no MTP
engine (vision uses the plain fixed-KV decode path). Additive + gated on the existing
-DELIZA_ENABLE_VISION flag: a v12 caller is unaffected and a v12 library reports
vision_stream_supported() == 0, so loaders fall back to the buffered _describe_image.

Validated on Windows CPU (SmolVLM-500M mtmd): streams 256 token chunks with real OCR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@coderabbitai

coderabbitai Bot commented Jun 24, 2026


Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

