feat(omnivoice): token-by-token streaming vision describe (ABI v13) #33
Merged
…Apple/Android-DL)
The kokoro subtree failed three distinct CI lanes, none backend-specific:
- PIC: kokoro_lib is a STATIC archive folded PRIVATE into the fused SHARED
libelizainference.so, but it never set POSITION_INDEPENDENT_CODE, so ld
rejected its objects on every BUILD_SHARED_LIBS=ON link ("recompile with
-fPIC", R_X86_64_PC32 on x86-64 / R_AARCH64_ADR_PREL_PG_HI21 on arm64) —
breaking the openvino, sycl, vulkan and virtgpu builds. Set PIC ON, mirroring
eliza_voice_classifiers in the sibling omnivoice subtree.
- Apple: kokoro-tts is a CLI harness but CMake defaults Apple executables to
MACOSX_BUNDLE, so `install(TARGETS kokoro-tts RUNTIME)` failed configure with
"no BUNDLE DESTINATION for MACOSX_BUNDLE executable" on every ios/tvos/
visionos/macos target. Force MACOSX_BUNDLE OFF.
- Android: kokoro.cpp called ggml_backend_cpu_init() directly, which is an
undefined symbol under -DGGML_BACKEND_DL (the CPU backend is a loadable
module). Switch to the registry API (ggml_backend_load_all() +
ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, nullptr)), matching
omnivoice; works in both DL and statically-linked builds.
Compile-validated on MSVC (kokoro_lib builds); the Linux/Apple effects are CMake
config + a portable registry call requiring no backend SDK to be correct.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add `eliza_inference_describe_image_stream_open` + `eliza_inference_vision_stream_supported` (ABI 12 -> 13).

The open call runs the same mmproj prefill as `eliza_inference_describe_image` (mtmd_tokenize + mtmd_helper_eval_chunks) but, instead of decoding the whole description into a buffer, returns an `EliLlmStream *` primed with the image+prompt KV. The caller then pulls tokens with the existing `eliza_inference_llm_stream_next` loop and frees the handle with `eliza_inference_llm_stream_close`, reusing the entire streaming-LLM machinery, so a vision description streams token-by-token through the same path as chat text. This is a pull model, so the host event loop yields between steps; a callback/push model would block the caller for the whole decode.

The returned stream carries a greedy sampler + ELIZA_VISION_MAX_TOKENS cap and no MTP engine (vision uses the plain fixed-KV decode path).

Additive + gated on the existing -DELIZA_ENABLE_VISION flag: a v12 caller is unaffected and a v12 library reports vision_stream_supported() == 0, so loaders fall back to the buffered _describe_image.

Validated on Windows CPU (SmolVLM-500M mtmd): streams 256 token chunks with real OCR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
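The pull model described in this commit can be illustrated with a small self-contained mock. This is only a sketch: `MockStream`, `mock_stream_open`, `mock_stream_next`, `mock_stream_close` and `drain` are hypothetical stand-ins, not the real `EliLlmStream` API, whose signatures are not shown here. The return-code convention (1 = piece produced, 0 = end of stream, -1 = buffer too small) is likewise an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for an opaque stream handle; NOT the real EliLlmStream.
struct MockStream {
    std::vector<std::string> pieces;  // canned "decoded" token pieces
    std::size_t pos = 0;              // index of the next piece to emit
    std::size_t max_tokens = 0;       // cap, in the spirit of ELIZA_VISION_MAX_TOKENS
};

// Stand-in for the open call. In the real library this is where the image
// prefill would run; here we only preload canned pieces.
MockStream *mock_stream_open(std::size_t max_tokens) {
    MockStream *s = new MockStream;
    s->pieces = {"A ", "red ", "stop ", "sign", "."};
    s->max_tokens = max_tokens;
    return s;
}

// Pull exactly one piece. Assumed convention: 1 = piece written to buf,
// 0 = end of stream or cap reached, -1 = buf too small.
int mock_stream_next(MockStream *s, char *buf, std::size_t cap) {
    assert(s != nullptr);
    if (s->pos >= s->pieces.size() || s->pos >= s->max_tokens) return 0;
    const std::string &p = s->pieces[s->pos];
    if (p.size() + 1 > cap) return -1;
    std::memcpy(buf, p.c_str(), p.size() + 1);  // copy including the NUL
    ++s->pos;
    return 1;
}

void mock_stream_close(MockStream *s) { delete s; }

// Caller-side pull loop: control returns to the caller after every step,
// which is the point where a host event loop could yield or render.
std::string drain(MockStream *s, std::size_t *steps) {
    std::string out;
    char buf[64];
    *steps = 0;
    while (mock_stream_next(s, buf, sizeof buf) == 1) {
        out += buf;  // a UI would display this piece immediately
        ++*steps;
    }
    return out;
}
```

Because the caller drives each step, it decides when to stop, yield, or render. With a push callback, the decode loop would own control until it finished.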
Adds `eliza_inference_describe_image_stream_open` + `eliza_inference_vision_stream_supported` (ABI 12 → 13). `_stream_open` runs the same mmproj prefill as `eliza_inference_describe_image` (`mtmd_tokenize` + `mtmd_helper_eval_chunks`) but returns an `EliLlmStream *` primed with the image+prompt KV instead of decoding into a buffer. The caller then pulls tokens with the existing `eliza_inference_llm_stream_next` loop and frees via `eliza_inference_llm_stream_close`, reusing the entire streaming-LLM machinery so a vision description streams token-by-token through the same path as chat text.

Pull model (not a callback/push): the host event loop yields between `_next` steps, so chunks reach the UI live; a push callback would block the caller for the whole decode. The stream carries a greedy sampler + `ELIZA_VISION_MAX_TOKENS` cap and `mtp = null` (plain fixed-KV decode path).

Additive + gated on the existing `-DELIZA_ENABLE_VISION`: a v12 caller is unaffected; a v12 library reports `vision_stream_supported() == 0` so loaders fall back to the buffered `_describe_image`.

Validated on Windows CPU with SmolVLM-500M (mtmd): streams 256 token chunks with real OCR. Consumed by elizaOS/eliza#9289 (JS cascade + handler wiring); that PR degrades gracefully until this lands and the gitlink is bumped.
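The loader-side fallback described above might look roughly like the following sketch. Everything here is hypothetical: `MockLib`, `Path`, `choose_path` and the two probe functions are illustrative names, and treating a missing symbol as "unsupported" is an assumption about how a loader would handle an older library, not something this PR specifies.

```cpp
// Which describe path a loader would take.
enum class Path { Streamed, Buffered };

// Hypothetical view of a loaded library: the ABI it reports and the
// capability probe, which is nullptr when the symbol could not be resolved.
struct MockLib {
    int abi;
    int (*vision_stream_supported)();
};

// Canned probes standing in for a real library's answer.
int mock_supported_yes() { return 1; }
int mock_supported_no() { return 0; }

// Prefer streaming only when the probe exists and returns non-zero;
// otherwise fall back to the buffered describe call.
Path choose_path(const MockLib &lib) {
    if (lib.vision_stream_supported != nullptr &&
        lib.vision_stream_supported() != 0) {
        return Path::Streamed;
    }
    return Path::Buffered;
}
```

Keying the decision on the probe's answer and not on the ABI number alone means a v13 build compiled without the vision flag would also land on the buffered path.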
🤖 Generated with Claude Code