feat: Add image analysis support for Gemini models #1
Open
chrisbraddock wants to merge 1 commit into
This commit introduces image analysis capabilities to `consult7`, enabling it to process and send image files to compatible multimodal models, initially targeting Google Gemini.
Key Features & Changes:
* **Multimodal Content Handling:**
    * Added an `--include-images` command-line flag to enable the processing of image files.
    * `file_processor.py` now differentiates between text and image files based on common extensions (PNG, JPG, GIF, WebP, BMP, SVG).
    * Image files are read as bytes and base64-encoded.
    * `format_content` now structures image parts as `{"inline_data": {"mime_type": ..., "data": ...}}` to comply with Google Gemini API expectations.
    * File size calculations account for base64 encoding overhead (encoded data is roughly 4/3 the size of the raw bytes).
* **Provider-Specific Logic:**
    * Introduced a `supports_images` flag in the `model_info` dictionary (set in `consultation.py`'s `get_model_context_info`) to determine if a model/provider combination can handle multimodal input.
    * `consultation_impl` uses this flag along with the `--include-images` CLI flag to decide whether to send structured multimodal `content_parts` or a concatenated text string to the provider.
    * `GoogleProvider` (`providers/google.py`) was updated to:
        * Accept `List[Dict[str, Any]]` (multimodal parts) as input.
        * Correctly assemble the `contents` list for the Gemini API, including properly formatted `inline_data` parts for images.
        * Use `config=` instead of `generation_config=` in the `generate_content` API call.
        * Estimate image token costs (currently a fixed 258 tokens per image based on Gemini documentation).
    * Text-only providers (OpenAI, OpenRouter) continue to receive concatenated text strings. Warnings are logged if image processing is attempted with them.
* **Bug Fixes & Robustness:**
    * Resolved issues with MCP tool parameter passing (`consultation_impl` argument mismatches) by consistently using keyword arguments for optional and server-provided parameters in `server.py`.
    * Worked around an MCP tool registration issue by simplifying `list_tools` in `server.py`. The plan is to restore it incrementally; that final step was deferred once the core vision functionality was confirmed to work.
* **Token Handling & Utilities:**
    * Added `estimate_image_tokens` to `token_utils.py`.
* **Documentation:**
    * Updated `README.md` to cover the `--include-images` flag, image analysis capabilities for Gemini, supported formats, token usage, and example use cases.
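To make the image-handling bullets above concrete, here is a minimal sketch of extension-based classification, base64 encoding, and the `inline_data` part shape. It is not the code from this PR: `IMAGE_MIME_TYPES`, `is_image_file`, `encoded_size`, and `make_image_part` are hypothetical names chosen for illustration, and the extension-to-MIME mapping is an assumption based on the formats listed.

```python
import base64
from pathlib import Path
from typing import Any, Dict

# Hypothetical mapping covering the formats named in the description.
IMAGE_MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
    ".bmp": "image/bmp",
    ".svg": "image/svg+xml",
}


def is_image_file(path: str) -> bool:
    """Classify a file as an image purely by its (case-insensitive) extension."""
    return Path(path).suffix.lower() in IMAGE_MIME_TYPES


def encoded_size(raw_size: int) -> int:
    """Size in bytes after padded base64 encoding (about 4/3 of the raw size)."""
    return 4 * ((raw_size + 2) // 3)


def make_image_part(path: str, data: bytes) -> Dict[str, Any]:
    """Wrap raw image bytes in the inline_data structure quoted above."""
    return {
        "inline_data": {
            "mime_type": IMAGE_MIME_TYPES[Path(path).suffix.lower()],
            "data": base64.b64encode(data).decode("ascii"),
        }
    }
```

Computing `encoded_size` from the raw byte count lets a size limit be checked before the file is actually encoded.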
Together, these changes let `consult7` use Gemini's vision capabilities for tasks that combine image analysis with text or code, while remaining compatible with the existing text-only providers.
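The provider dispatch and token estimate described in the list can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `select_payload` and `count_image_tokens` are hypothetical helpers (the real `estimate_image_tokens` signature is not shown here), and text parts are assumed to have the shape `{"text": ...}`.

```python
from typing import Any, Dict, List, Union

# Fixed per-image estimate quoted in the description.
TOKENS_PER_IMAGE = 258

Part = Dict[str, Any]


def count_image_tokens(parts: List[Part]) -> int:
    """Estimate image tokens as a flat cost per inline_data part."""
    return TOKENS_PER_IMAGE * sum(1 for part in parts if "inline_data" in part)


def select_payload(
    parts: List[Part], model_info: Dict[str, Any], include_images: bool
) -> Union[List[Part], str]:
    """Send structured parts only when both the CLI flag and the model allow it."""
    if include_images and model_info.get("supports_images", False):
        return parts
    # Text-only fallback: drop image parts and concatenate the text parts.
    return "\n".join(part["text"] for part in parts if "text" in part)
```

Requiring both conditions keeps the existing behavior for text-only providers: they receive a plain string regardless of whether `--include-images` was passed.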