Skip to content

VLM: fix EmbeddedInput shape handling, error recovery, and parallel loading#79

Open
stikves wants to merge 6 commits into
apple:mainfrom
stikves:sukru/vlm-fixes
Open

VLM: fix EmbeddedInput shape handling, error recovery, and parallel loading#79
stikves wants to merge 6 commits into
apple:mainfrom
stikves:sukru/vlm-fixes

Conversation

@stikves

@stikves stikves commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Summary

  • EmbeddedInput.tokenCount: correctly handle 2D [seq_len, hidden_dim] tensors (was returning hidden_dim instead of seq_len). Now a stored property computed once at init.
  • EmbeddedInput.seqLen(of:): centralized helper for seq_len extraction — scatterMerge and other call sites use one canonical path instead of inline shape checks.
  • scatterMerge: replace precondition(float16) with a thrown error for bfloat16 model compatibility.
  • LLMRunnerMain: add comment explaining sequential PreparedModel.prepare (parallel caused runtime errors).
  • Removed premature --max-tiles flag (tiling model not implemented yet).

Test plan

  • Verify tokenCount returns correct seq_len for both 2D and 3D embeddings
  • Confirm bfloat16 model gets a catchable error instead of a crash
  • VLM inference still works end-to-end

- EmbeddedInput.tokenCount: handle 2D [seq_len, hidden_dim] vs 3D [batch, seq_len, hidden_dim]
- scatterMerge: replace precondition(float16) with guard/throw for bfloat16 compatibility
@stikves stikves force-pushed the sukru/vlm-fixes branch from e87f992 to 7fdd585 Compare July 2, 2026 20:20
@stikves stikves self-assigned this Jul 2, 2026
@stikves stikves marked this pull request as ready for review July 2, 2026 20:22
Comment thread swift/Sources/Tools/llm-runner/LLMRunnerMain.swift
Comment thread swift/Sources/CoreAILanguageModels/InferenceEngines/EmbeddedInput.swift Outdated
stikves and others added 2 commits July 2, 2026 14:34
The @option attribute for "Maximum tiles" had no associated var,
causing a compile error with ArgumentParser.
Comment thread swift/Sources/Tools/llm-runner/LLMRunnerMain.swift Outdated
Comment thread swift/Sources/CoreAILanguageModels/InferenceEngines/EmbeddedInput.swift Outdated

@carinapeng carinapeng left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me overall but before approving I think a few things should be fixed, one is on how we read maxTiles, and the other is consistency - seems correct for the 2d case but scatter merge still issues 3D: [1, seq, hidden]

And let's do the testing on the Qwen3 VL as well since that is released :)

stikves added 3 commits July 2, 2026 18:31
…shape handling

- Remove --max-tiles CLI option (tiling model not implemented yet)
- Move seq_len extraction into EmbeddedInput.seqLen(of:) so scatterMerge
  and other call sites use one canonical shape resolution path
- tokenCount is now a stored property computed once at init
All current VLM models produce 3D embeddings. Drop the 2D fallback
and the seqLen helper -- tokenCount is just shape[1].
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants