feat: Add image analysis support for Gemini models by chrisbraddock · Pull Request #1 · szeider/consult7

chrisbraddock · 2025-06-28T18:35:52Z

Note: this was vibed. I put enough effort into it that it's working locally for me, and that's as much as I can do at the moment.

I'm submitting it in case it's useful to you, but don't treat it as merge ready.

On the plus side, Claude Code collaborating with Gemini on actual design (via screenshots of running code from Playwright MCP) is working out pretty nice at the moment.

This commit introduces image analysis capabilities to consult7, enabling it to process and send image files to compatible multimodal models, initially targeting Google Gemini.

Key Features & Changes:

Multimodal Content Handling:
- Added an --include-images command-line flag to enable the processing of image files.
- file_processor.py now differentiates between text and image files based on common extensions (PNG, JPG, GIF, WebP, BMP, SVG).
- Image files are read as bytes and base64-encoded.
- format_content now structures image parts as {"inline_data": {"mime_type": ..., "data": ...}} to comply with Google Gemini API expectations.
- File size calculations account for base64 encoding overhead.
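A minimal sketch of the image handling described above, not the actual `file_processor.py` code. The helper names (`is_image_file`, `make_image_part`, `base64_size`) and the extension-to-MIME table are assumptions made for illustration; only the extension list and the `inline_data` shape come from this PR.

```python
import base64
from pathlib import Path

# Extension list taken from the PR description; the MIME mapping is an assumption.
MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
    ".bmp": "image/bmp",
    ".svg": "image/svg+xml",
}


def is_image_file(path: str) -> bool:
    """Classify a file as an image purely by its (case-insensitive) extension."""
    return Path(path).suffix.lower() in MIME_TYPES


def make_image_part(path: str, raw: bytes) -> dict:
    """Wrap raw image bytes in the inline_data structure quoted in the PR."""
    return {
        "inline_data": {
            "mime_type": MIME_TYPES[Path(path).suffix.lower()],
            "data": base64.b64encode(raw).decode("ascii"),
        }
    }


def base64_size(raw_size: int) -> int:
    """Length in characters of the padded base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)
```

Base64 turns every 3 input bytes into 4 output characters, so an encoded image is roughly a third larger than the file on disk. That is the overhead the size calculation has to account for.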
Provider-Specific Logic:
- Introduced a supports_images flag in the model_info dictionary (set in consultation.py's get_model_context_info) to determine if a model/provider combination can handle multimodal input.
- consultation_impl uses this flag along with the --include-images CLI flag to decide whether to send structured multimodal content_parts or a concatenated text string to the provider.
- GoogleProvider (providers/google.py) was updated to:
  - Accept List[Dict[str, Any]] (multimodal parts) as input.
  - Correctly assemble the contents list for the Gemini API, including properly formatted inline_data parts for images.
  - Use config= instead of generation_config= in the generate_content API call.
  - Estimate image token costs (currently a fixed 258 tokens per image based on Gemini documentation).
- Text-only providers (OpenAI, OpenRouter) continue to receive concatenated text strings. Warnings are logged if image processing is attempted with them.
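The dispatch decision described in this section could look roughly like the sketch below. `build_payload` is a hypothetical helper, and the `{"text": ...}` shape for text parts is an assumption; the PR only specifies the `inline_data` shape for images and the two inputs to the decision (`supports_images` and `--include-images`).

```python
import logging
from typing import Any, Dict, List, Union

logger = logging.getLogger(__name__)

Part = Dict[str, Any]


def build_payload(
    parts: List[Part], supports_images: bool, include_images: bool
) -> Union[List[Part], str]:
    """Return structured parts for multimodal models, plain text otherwise."""
    if supports_images and include_images:
        # Multimodal path: hand the structured parts to the provider unchanged.
        return parts

    # Text-only path: warn if images were requested but cannot be sent.
    has_images = any("inline_data" in part for part in parts)
    if include_images and has_images:
        logger.warning("Model does not support images; sending text only.")

    # Drop image parts and concatenate the remaining text.
    return "\n".join(part["text"] for part in parts if "text" in part)
```

Keeping the decision in one place means text-only providers never see the structured list, which matches the compatibility goal stated above.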
Bug Fixes & Robustness:
- Resolved issues with MCP tool parameter passing (consultation_impl argument mismatches) by consistently using keyword arguments for optional and server-provided parameters in server.py.
- Worked around an MCP tool registration issue by simplifying list_tools in server.py. Restoring the full list_tools incrementally is planned, but the final step was deferred once the core vision functionality was confirmed.
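To illustrate why keyword arguments help with the parameter-passing fix, here is a sketch with a hypothetical signature. `consultation_impl_sketch` and its parameter names are invented for illustration and are not the real `consultation_impl` signature.

```python
from typing import Optional


def consultation_impl_sketch(
    path: str,
    query: str,
    model: str,
    *,  # everything after this marker must be passed by keyword
    exclude_pattern: Optional[str] = None,
    include_images: bool = False,
    api_key: Optional[str] = None,
) -> dict:
    """Echo the arguments back so the binding is easy to inspect."""
    return {
        "path": path,
        "query": query,
        "model": model,
        "exclude_pattern": exclude_pattern,
        "include_images": include_images,
        "api_key": api_key,
    }


# Keyword call: each optional value lands in the intended parameter,
# regardless of the order in which the parameters were declared.
result = consultation_impl_sketch(
    "/repo", "Describe the UI", "some-model", include_images=True, api_key="KEY"
)
```

With positional passing, adding a new optional parameter in the middle of the signature would silently shift later arguments. The keyword-only marker turns that mistake into an immediate `TypeError`.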
Token Handling & Utilities:
- Added estimate_image_tokens to token_utils.py.
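A sketch of what a fixed-cost estimate might look like, assuming the helper takes an image count. The real signature in `token_utils.py` is not shown in this PR, and the 4-characters-per-token text heuristic in `estimate_total_tokens` is an assumption added for illustration.

```python
import math

# Fixed per-image cost stated in the PR description.
TOKENS_PER_IMAGE = 258


def estimate_image_tokens(image_count: int) -> int:
    """Estimate tokens for images using a flat per-image cost."""
    if image_count < 0:
        raise ValueError("image_count must be non-negative")
    return TOKENS_PER_IMAGE * image_count


def estimate_total_tokens(
    text: str, image_count: int, chars_per_token: float = 4.0
) -> int:
    """Combine a rough character-based text estimate with the image estimate."""
    text_tokens = math.ceil(len(text) / chars_per_token)
    return text_tokens + estimate_image_tokens(image_count)
```

Under these assumptions, three images cost 3 × 258 = 774 tokens before any text is counted.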
Documentation:
- README.md updated to include the --include-images flag, image analysis capabilities for Gemini, supported formats, token usage, and example use cases.

This series of changes allows consult7 to effectively leverage Gemini's vision capabilities for tasks involving image analysis alongside text or code, while maintaining compatibility with existing text-only providers.


chrisbraddock wants to merge 1 commit into szeider:main from chrisbraddock:feature/gemini-vision-support
