feat: Add Automatic Multimodal Scoring to llm_judge scorer #302
Merged
Enables llm_judge to automatically detect and score Message outputs containing images and audio alongside text. When a Message with images/audio is provided, they are automatically included in the evaluation using vision-capable models.

Key changes:
- Automatic multimodal detection via Message.image_parts/audio_parts
- Zero API changes - backward compatible with text-only scoring
- Single combined score for text + images + audio
- Extract helper functions to improve code quality
- Add observability attributes (has_multimodal, num_images, num_audio)
- Example notebook demonstrating text-only, image-only, and multimodal scoring
mkultraWasHere pushed a commit that referenced this pull request on Jan 21, 2026
Enables llm_judge to automatically detect and score Message outputs containing images and audio alongside text. When a Message with images/audio is provided, they are automatically included in the evaluation using vision-capable models.
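The detection step described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Message` dataclass here is a hypothetical stand-in that only mimics the `image_parts` and `audio_parts` attributes named in the description, and `build_judge_content` is an invented name. The real types come from the library and may look different.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical stand-in for the real Message type (assumption, not the library class)."""

    text: str = ""
    image_parts: list[bytes] = field(default_factory=list)
    audio_parts: list[bytes] = field(default_factory=list)


def build_judge_content(output: Message | str) -> tuple[list, bool]:
    """Return (content parts, has_multimodal) for a judge prompt.

    Plain strings take the text-only path unchanged, which is what keeps
    the call site backward compatible.
    """
    if isinstance(output, str):
        return [output], False

    parts: list = []
    if output.text:
        parts.append(output.text)
    # Images and audio are appended after the text so that a single judge
    # call sees every part and can return one combined score.
    parts.extend(output.image_parts)
    parts.extend(output.audio_parts)

    has_multimodal = bool(output.image_parts or output.audio_parts)
    return parts, has_multimodal
```

Under these assumptions, a caller passes either a string or a Message and never has to state which modality it is using; the flag is derived from the parts that are present.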
Key Changes:
Added:
- Automatic multimodal detection in the llm_judge scorer via Message.image_parts and audio_parts
- Observability attributes has_multimodal, num_images, num_audio in metrics
- examples/airt/multimodal_llm_judge.ipynb - Example notebook demonstrating text-only, image-only, and multimodal scoring scenarios

Changed:
- dreadnode/scorers/judge.py:
  - _build_multimodal_content() helper for building rigging content from Message
  - _create_judge_pipeline() helper for pipeline creation (keyword-only has_multimodal parameter)
  - _create_judge_metrics() helper for metric creation
  - judge() prompt docstring to mention multimodal evaluation
  - evaluate()

Removed:
Generated Summary:
- llm_judge function for evaluating text, images, and audio.
- _build_multimodal_content and _create_judge_pipeline to handle content construction and pipeline generation for multimodal messages.
- _create_judge_metrics to include information on images and audio.
- judge function documentation to clarify that it evaluates all provided content when making judgments.
- multimodal_llm_judge.ipynb to demonstrate usage of the new multimodal judging capabilities.

These updates extend the judge to outputs that combine text, images, and audio. Existing functionality remains intact while the new features are added.
This summary was generated with ❤️ by rigging
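The two ideas from the Changed list, a keyword-only has_multimodal parameter and the has_multimodal/num_images/num_audio attributes, can be illustrated with the sketch below. Both function names and the returned dictionaries are hypothetical and chosen only for illustration; they are not the signatures of _create_judge_pipeline() or _create_judge_metrics() in dreadnode/scorers/judge.py.

```python
from __future__ import annotations


def judge_attributes(num_images: int, num_audio: int) -> dict[str, object]:
    """Build the observability attributes named in the PR (hypothetical helper)."""
    return {
        "has_multimodal": (num_images + num_audio) > 0,
        "num_images": num_images,
        "num_audio": num_audio,
    }


def select_pipeline(model: str, *, has_multimodal: bool) -> dict[str, object]:
    """Describe which judge path to use (hypothetical helper).

    The bare `*` makes has_multimodal keyword-only, so a call site must
    spell out the flag instead of passing an unlabeled positional boolean.
    """
    mode = "multimodal" if has_multimodal else "text"
    return {"model": model, "mode": mode}
```

Calling `select_pipeline("some-model", True)` raises a TypeError in this sketch, which is the point of a keyword-only flag: the intent stays readable wherever the helper is called.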