Add VLM inference infrastructure: engine, protocol, and CLI support#65

Merged
stikves merged 7 commits into
apple:mainfrom
stikves:sukru/vlm-infra
Jun 30, 2026

@stikves stikves commented Jun 25, 2026

Contributor

Runtime:

  • MultimodalInferenceEngine protocol with encodeImage() and generate()
  • CoreAISequentialVLMEngine: vision encoder + projector + embed_tokens + LLM decoder with scatter-merge of image embeddings at placeholder positions
  • EmbeddedInput type wrapping NDArray embeddings with position metadata
  • VisionConfig in LanguageConfig for image_size, patch_size, token count/id
  • LanguageBundle parses top-level "vision" block from metadata.json
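
The scatter-merge step in the engine bullet above can be illustrated with a small sketch. This is not the PR's implementation: `scatterMerge` and its parameters are hypothetical names, and embeddings are modeled as plain `[[Float]]` rows rather than NDArray buffers. It only shows the idea of overwriting the rows at image placeholder positions with the projected image embeddings.

```swift
// Hypothetical helper, for illustration only (not the engine's real API).
// Each inner array is one embedding row of width hiddenSize.
func scatterMerge(
    textEmbeddings: [[Float]],
    imageEmbeddings: [[Float]],
    tokenIDs: [Int],
    imageTokenID: Int
) -> [[Float]]? {
    // Positions in the prompt that hold the image placeholder token.
    let positions = tokenIDs.indices.filter { tokenIDs[$0] == imageTokenID }

    // One image embedding row is needed per placeholder, and one text row per token.
    guard positions.count == imageEmbeddings.count,
          textEmbeddings.count == tokenIDs.count else { return nil }

    // Overwrite placeholder rows in order; all other rows keep the text embedding.
    var merged = textEmbeddings
    for (row, position) in zip(imageEmbeddings, positions) {
        merged[position] = row
    }
    return merged
}
```

Returning `nil` on a count mismatch is a choice made for this sketch; a real engine might throw instead.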

CLI (llm-runner):

  • --image flag routes through VLM engine when bundle kind is .vlm
  • Chat template detection with generic fallback for prompt construction
  • Accumulated token decode for correct spacing
  • Stop sequence support in VLM path
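
The accumulated-decode and stop-sequence bullets above can be sketched as follows. This is a minimal illustration under assumptions, not the llm-runner code: `StreamingDecoder` and the `decode` closure are hypothetical. The idea is that decoding each token in isolation can drop the leading-space marker some tokenizers use, so the sketch decodes the whole token list every step and emits only the new suffix.

```swift
import Foundation

// Hypothetical type for illustration; `decode` stands in for a tokenizer's decode call.
struct StreamingDecoder {
    let decode: ([Int]) -> String
    let stopSequences: [String]
    private(set) var tokens: [Int] = []
    private(set) var emitted = ""

    init(decode: @escaping ([Int]) -> String, stopSequences: [String]) {
        self.decode = decode
        self.stopSequences = stopSequences
    }

    /// Appends one token, returning the newly produced text and whether to stop.
    mutating func append(_ token: Int) -> (delta: String, shouldStop: Bool) {
        tokens.append(token)
        // Decode everything generated so far, so spacing between tokens is preserved.
        var full = decode(tokens)
        var shouldStop = false
        // Cut the text at any stop sequence that appears in it.
        for stop in stopSequences where !stop.isEmpty {
            if let range = full.range(of: stop) {
                full = String(full[..<range.lowerBound])
                shouldStop = true
            }
        }
        // If truncation cut into text that was already emitted, emit nothing new.
        guard full.hasPrefix(emitted) else { return ("", shouldStop) }
        let delta = String(full.dropFirst(emitted.count))
        emitted = full
        return (delta, shouldStop)
    }
}
```

Re-decoding the full list on every step costs more than decoding one token, which is the trade-off this sketch accepts for correct spacing.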

Supports any VLM that exports 3 components (vision.aimodel, embed.aimodel, model.aimodel) with a vision config block in metadata.json. Model-family-specific export code lives in internal/python.
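
As a rough sketch of what reading the vision block from metadata.json could look like, the following uses `Codable`. The type names are hypothetical, and only `image_size` and `patch_size` are field names taken from the description; the keys `image_token_count` and `image_token_id` are assumptions, as are the sample values.

```swift
import Foundation

// Hypothetical mirror of the vision block; not the PR's VisionConfig type.
struct VisionConfigSketch: Codable, Equatable {
    let imageSize: Int
    let patchSize: Int
    let imageTokenCount: Int   // assumed JSON key: image_token_count
    let imageTokenID: Int      // assumed JSON key: image_token_id

    enum CodingKeys: String, CodingKey {
        case imageSize = "image_size"
        case patchSize = "patch_size"
        case imageTokenCount = "image_token_count"
        case imageTokenID = "image_token_id"
    }
}

// Only the top-level "vision" key is modeled; other keys are ignored by the decoder.
struct MetadataSketch: Codable {
    let vision: VisionConfigSketch?
}

/// Returns the vision block, or nil when metadata.json has none.
func parseVision(from json: Data) throws -> VisionConfigSketch? {
    try JSONDecoder().decode(MetadataSketch.self, from: json).vision
}

// Illustrative input; the numbers are made up for the example.
let sampleMetadata = Data("""
{"vision": {"image_size": 224, "patch_size": 14,
            "image_token_count": 256, "image_token_id": 32000}}
""".utf8)
```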

@carinapeng carinapeng mentioned this pull request Jun 29, 2026
Comment thread swift/Sources/CoreAILanguageModels/Bundle/LanguageConfig.swift
stikves added 2 commits June 29, 2026 12:27
Read per-channel normalization from metadata instead of hardcoding.
Fields are optional — bundles without them default to CLIP values
(the most common across VLMs). Gemma/SigLIP bundles specify their
own [0.5, 0.5, 0.5] values explicitly in metadata.json.
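
A minimal sketch of the optional-with-default behavior described in this commit, assuming hypothetical names (`Normalization`, `resolveNormalization`); the metadata field names are not shown in this conversation, so the sketch takes the optional values as plain parameters. The default constants are the standard CLIP per-channel mean and standard deviation.

```swift
// Hypothetical types for illustration; not the PR's code.
struct Normalization: Equatable {
    var mean: [Float]
    var std: [Float]

    // Standard CLIP preprocessing constants (RGB order).
    static let clip = Normalization(
        mean: [0.48145466, 0.4578275, 0.40821073],
        std: [0.26862954, 0.26130258, 0.27577711]
    )
}

/// Uses the metadata values when present, otherwise falls back to CLIP defaults.
func resolveNormalization(mean: [Float]?, std: [Float]?) -> Normalization {
    Normalization(mean: mean ?? Normalization.clip.mean,
                  std: std ?? Normalization.clip.std)
}

/// Normalizes one RGB pixel with 0...255 channel values: (x / 255 - mean) / std.
func normalize(pixel: [UInt8], with n: Normalization) -> [Float] {
    (0..<3).map { c in (Float(pixel[c]) / 255 - n.mean[c]) / n.std[c] }
}
```

With the `[0.5, 0.5, 0.5]` values mentioned for Gemma/SigLIP, a channel value of 255 maps to 1 and a value of 0 maps to -1.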
…bfloat16

- LanguageBundle init throws if a .vlm bundle omits the vision block
- Rename imageTokenPositions → embeddingPositions (supports future
  audio/multi-modal embedding injection, not just images)
- Accept bfloat16 logits in addition to float16
- Scatter merge uses UInt16 view (type-agnostic for f16/bf16)
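
To illustrate why 16-bit storage can be handled as raw `UInt16` values, here is a sketch of widening bfloat16 logits to `Float` and taking the argmax. The function names are hypothetical and this is not the engine's code. It relies on the fact that bfloat16 is the upper 16 bits of an IEEE 754 single-precision value; float16 would need a different conversion, which is omitted here.

```swift
/// Widens a bfloat16 bit pattern to Float by placing it in the upper 16 bits.
func floatFromBFloat16(_ bits: UInt16) -> Float {
    Float(bitPattern: UInt32(bits) << 16)
}

/// Returns the index of the largest logit, or nil for an empty buffer.
/// `bits` holds raw bfloat16 bit patterns (hypothetical input layout).
func argmax(bfloat16Logits bits: [UInt16]) -> Int? {
    guard !bits.isEmpty else { return nil }
    var bestIndex = 0
    var bestValue = floatFromBFloat16(bits[0])
    for i in 1..<bits.count {
        let value = floatFromBFloat16(bits[i])
        if value > bestValue {
            bestValue = value
            bestIndex = i
        }
    }
    return bestIndex
}
```

Copying rows during a scatter merge never needs this conversion, since moving `UInt16` values preserves the bit pattern for either format.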
@stikves stikves marked this pull request as ready for review June 29, 2026 19:49
Comment thread swift/Sources/CoreAILanguageModels/Bundle/LanguageConfig.swift
Comment thread swift/Sources/CoreAILanguageModels/InferenceEngines/InferenceEngine.swift Outdated
Comment thread swift/Sources/Tools/llm-runner/LLMRunnerMain.swift
@carinapeng

Contributor

#68 (export of Qwen3 VL) verifies the runner with the latest changes. Thanks for addressing the comments, @stikves!

stikves and others added 3 commits June 29, 2026 18:16
…arnings

- Guard against currentKVCapacity==0 in growKVCache (would loop forever)
- Prefix unused protocol args with _ in warmup()
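
The zero-capacity hazard mentioned above can be sketched with a doubling loop. This assumes `growKVCache` grows capacity by doubling, which the commit message implies but does not show; `grownCapacity` and its `minimum` parameter are hypothetical, and clamping to a minimum is only one possible guard.

```swift
/// Returns a capacity of at least `required`, growing by doubling.
/// Hypothetical helper; the PR's actual guard may differ (for example, throwing).
func grownCapacity(current: Int, required: Int, minimum: Int = 1) -> Int {
    precondition(minimum > 0, "minimum capacity must be positive")
    // Without this clamp, a current capacity of 0 would stay 0 after doubling,
    // and the loop below would never terminate.
    var capacity = max(current, minimum)
    while capacity < required {
        capacity *= 2
    }
    return capacity
}
```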
@stikves stikves merged commit 9e1ffa5 into apple:main Jun 30, 2026
3 checks passed
@stikves stikves deleted the sukru/vlm-infra branch June 30, 2026 01:55