Add VLM inference infrastructure: engine, protocol, and CLI support by stikves · Pull Request #65 · apple/coreai-models

stikves · 2026-06-25T17:52:21Z

Runtime:

- MultimodalInferenceEngine protocol with encodeImage() and generate()
- CoreAISequentialVLMEngine: vision encoder + projector + embed_tokens + LLM decoder with scatter-merge of image embeddings at placeholder positions
- EmbeddedInput type wrapping NDArray embeddings with position metadata
- VisionConfig in LanguageConfig for image_size, patch_size, token count/id
- LanguageBundle parses top-level "vision" block from metadata.json
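
For readers unfamiliar with the scatter-merge step, here is a minimal Python sketch of the idea, not the engine's actual code. It assumes the text embeddings already contain one row per token, and that each placeholder token is replaced, in order, by one row of projected image embeddings. The function name `scatter_merge` and its arguments are hypothetical.

```python
def scatter_merge(token_ids, text_embeddings, image_embeddings, image_token_id):
    """Replace the embedding row at each placeholder token with the next image row.

    Illustrative only: names and list-of-lists layout are assumptions,
    not the CoreAISequentialVLMEngine API.
    """
    positions = [i for i, t in enumerate(token_ids) if t == image_token_id]
    if len(positions) != len(image_embeddings):
        raise ValueError(
            f"expected {len(positions)} image rows, got {len(image_embeddings)}"
        )
    merged = [row[:] for row in text_embeddings]  # copy so the input is untouched
    for pos, image_row in zip(positions, image_embeddings):
        merged[pos] = image_row[:]
    return merged, positions
```

The returned positions correspond to the position metadata that EmbeddedInput carries alongside the merged embeddings.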

CLI (llm-runner):

- --image flag routes through VLM engine when bundle kind is .vlm
- Chat template detection with generic fallback for prompt construction
- Accumulated token decode for correct spacing
- Stop sequence support in VLM path
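
To illustrate why accumulated decoding matters for spacing, here is a small Python sketch using a toy SentencePiece-style vocabulary. Everything in it (`VOCAB`, `decode`, `stream_text`) is hypothetical and stands in for the real tokenizer and runner loop. The idea is to re-decode all tokens generated so far and emit only the new suffix, stopping once a stop sequence appears.

```python
# Toy vocabulary: "▁" marks the start of a word, as in SentencePiece.
VOCAB = {0: "▁Hello", 1: "▁wor", 2: "ld", 3: "▁STOP", 4: "▁ignored"}


def decode(token_ids):
    """Toy decode: join pieces, turn '▁' into a space, drop one leading space."""
    text = "".join(VOCAB[t] for t in token_ids).replace("▁", " ")
    return text[1:] if text.startswith(" ") else text


def stream_text(token_iter, stop_sequences=()):
    """Yield text deltas by re-decoding every token so far; halt at a stop sequence."""
    tokens, emitted = [], ""
    for tok in token_iter:
        tokens.append(tok)
        full = decode(tokens)
        hits = [i for i in (full.find(s) for s in stop_sequences) if i != -1]
        if hits:
            full = full[:min(hits)]  # cut the text before the earliest stop sequence
        if len(full) > len(emitted):
            yield full[len(emitted):]
            emitted = full
        if hits:
            return
```

Decoding each token in isolation would drop the space before "wor", because the toy decoder strips a leading space from every call. This sketch does not hold back text that might be the beginning of a stop sequence, which a production loop would likely need to do.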

Supports any VLM that exports 3 components (vision.aimodel, embed.aimodel, model.aimodel) with a vision config block in metadata.json. Model-family-specific export code lives in internal/python.
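
As a rough illustration of the metadata contract, the Python sketch below parses a top-level "vision" block and rejects a VLM bundle that lacks one. The keys `image_size` and `patch_size` follow the description above; `image_token_count` and `image_token_id` are guessed names for the "token count/id" fields, and the numeric values in the usage example are made up.

```python
import json
from dataclasses import dataclass


@dataclass
class VisionConfig:
    image_size: int
    patch_size: int
    image_token_count: int  # hypothetical key name
    image_token_id: int     # hypothetical key name


def parse_vision_config(metadata_text, is_vlm):
    """Return the parsed vision block, or None for a non-VLM bundle without one."""
    metadata = json.loads(metadata_text)
    vision = metadata.get("vision")
    if vision is None:
        if is_vlm:
            raise ValueError("VLM bundle metadata.json has no top-level 'vision' block")
        return None
    return VisionConfig(
        image_size=int(vision["image_size"]),
        patch_size=int(vision["patch_size"]),
        image_token_count=int(vision["image_token_count"]),
        image_token_id=int(vision["image_token_id"]),
    )
```

Usage, with invented values:

```python
example = '{"vision": {"image_size": 224, "patch_size": 14, "image_token_count": 256, "image_token_id": 99}}'
config = parse_vision_config(example, is_vlm=True)
```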


Read per-channel normalization from metadata instead of hardcoding it. The fields are optional: bundles without them default to the CLIP values, the most common choice across VLMs. Gemma/SigLIP bundles specify their own [0.5, 0.5, 0.5] values explicitly in metadata.json.
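
The fallback behavior can be sketched in Python as follows. The key names `image_mean` and `image_std` are assumptions (the actual metadata field names are not shown here); the default constants are the widely published CLIP preprocessing values.

```python
# Standard CLIP preprocessing constants (RGB order).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalization_from_metadata(vision):
    """Read optional per-channel mean/std, falling back to CLIP defaults.

    'image_mean' and 'image_std' are hypothetical key names.
    """
    mean = tuple(vision.get("image_mean", CLIP_MEAN))
    std = tuple(vision.get("image_std", CLIP_STD))
    if len(mean) != 3 or len(std) != 3:
        raise ValueError("expected three per-channel values for mean and std")
    return mean, std


def normalize_pixel(rgb, mean, std):
    """Scale 8-bit RGB to [0, 1], then apply (x - mean) / std per channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))
```

With the [0.5, 0.5, 0.5] values used for both mean and std, pixel values map to the range [-1, 1].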

…bfloat16
- LanguageBundle init throws if a .vlm bundle omits the vision block
- Rename imageTokenPositions → embeddingPositions (supports future audio/multi-modal embedding injection, not just images)
- Accept bfloat16 logits in addition to float16
- Scatter merge uses UInt16 view (type-agnostic for f16/bf16)
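
The reason a UInt16 view is type-agnostic is that scatter-merge only copies values and never does arithmetic on them, so the 16-bit patterns can be moved without knowing whether they encode float16 or bfloat16. A small Python sketch of that idea follows; the helper names are hypothetical, and the bfloat16 conversion here truncates rather than rounds.

```python
import struct
from array import array


def f16_bits(x):
    """Bit pattern of x as IEEE 754 half precision."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def bf16_bits(x):
    """Bit pattern of x as bfloat16: the top 16 bits of float32 (truncated)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


def scatter_rows_u16(dest, src, positions, hidden):
    """Copy row i of src into row positions[i] of dest, as raw 16-bit words."""
    for row, pos in enumerate(positions):
        dest[pos * hidden:(pos + 1) * hidden] = src[row * hidden:(row + 1) * hidden]
```

The same `scatter_rows_u16` call works on `array("H", ...)` buffers filled by either `f16_bits` or `bf16_bits`, since it never interprets the words it copies.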

carinapeng · 2026-06-29T23:15:56Z

The Qwen3-VL export in #68 verifies the export and the runner with the latest changes. Thanks for addressing the comments, @stikves!

…arnings
- Guard against currentKVCapacity==0 in growKVCache (would loop forever)
- Prefix unused protocol args with _ in warmup()
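
A sketch of why a zero capacity is dangerous, assuming the cache grows by doubling (an assumption; the actual growth policy in growKVCache is not shown here). Doubling zero yields zero, so the loop condition would never become false. The function name and the choice to clamp to 1 rather than raise are both hypothetical.

```python
def grown_capacity(current, required):
    """Double the capacity until it covers `required` tokens.

    Starting from max(current, 1) avoids the 0 * 2 == 0 infinite loop.
    """
    if current < 0 or required < 0:
        raise ValueError("capacities must be non-negative")
    capacity = max(current, 1)
    while capacity < required:
        capacity *= 2
    return capacity
```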

carinapeng mentioned this pull request Jun 29, 2026

Qwen3-VL-2B export #68

Open