feat: Scorer Improvements #115
Pull Request Overview
This PR implements significant improvements to the scorer system by replacing the TaskInput pattern with a new Lookup system and adding many new scoring capabilities. The changes focus on enhancing parameter resolution, expanding scorer functionality, and improving documentation.
- Migrates all scorers from TaskInput to Lookup pattern for better parameter handling
- Adds 6 new scorer modules with comprehensive functionality (classification, format validation, harm detection, lexical analysis, operators)
- Introduces comprehensive scorer documentation with usage examples
Reviewed Changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| dreadnode/task.py | Removes TaskInput class and related functionality |
| dreadnode/lookup.py | Introduces new Lookup system for dynamic parameter resolution |
| dreadnode/scorers/*.py | Updates existing scorers to use Lookup pattern and adds new scorer modules |
| docs/usage/scorers.mdx | Adds comprehensive scorer documentation |
| docs/usage/metrics.mdx | Removes scorer content moved to dedicated scorers page |

    import typing as t
    from difflib import SequenceMatcher

    import litellm

The litellm import should be moved inside the function where it's used or made conditional. This reduces startup time for modules that don't use litellm-based similarity.
Suggested change:

    - import litellm
    + # Removed the top-level import of litellm. It will be imported inside the relevant function(s).


    if min_length < 0 or max_length < min_length:
        raise ValueError("Invalid length bounds. Must have 0 <= min <= max.")

    def evaluate(data: t.Any) -> Metric:

The validation logic for min_length and max_length should be moved outside the evaluate function to fail fast during scorer creation rather than during evaluation.
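A rough illustration of fail-fast validation in a scorer factory. The name `length_in_range` and the plain `float` return value are simplifications for this sketch; the PR's scorers return `Metric` objects.

```python
import typing as t


def length_in_range(
    min_length: int = 0, max_length: float = float("inf")
) -> t.Callable[[t.Any], float]:
    # Validate once, when the scorer is built, so a bad configuration
    # surfaces immediately instead of on the first evaluated sample.
    if min_length < 0 or max_length < min_length:
        raise ValueError("Invalid length bounds. Must have 0 <= min <= max.")

    def evaluate(data: t.Any) -> float:
        text_len = len(str(data))
        return 1.0 if min_length <= text_len <= max_length else 0.0

    return evaluate
```

With this layout, `length_in_range(5, 2)` raises `ValueError` at construction time, before any data is scored.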

    def evaluate(data: t.Any) -> Metric:
        nonlocal target_length

        target_length = int(resolve_lookup(target_length))
        if target_length < 0:
            raise ValueError("Target length must be non-negative.")

        text = str(data)
        text_len = len(text)

The validation logic for target_length should be moved outside the evaluate function to fail fast during scorer creation rather than during evaluation.
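Because `target_length` may only be known once it is resolved, one possible compromise is to validate literal values at creation time and deferred values at evaluation time. In the sketch below, `Deferred` is a hypothetical stand-in for a lookup, not the actual `Lookup` API, and the scoring formula is purely illustrative.

```python
import typing as t
from dataclasses import dataclass


@dataclass(frozen=True)
class Deferred:
    """Hypothetical stand-in for a lookup: resolved only at evaluation time."""

    resolve: t.Callable[[], t.Any]


def _check_target(value: int) -> int:
    if value < 0:
        raise ValueError("Target length must be non-negative.")
    return value


def length_target(
    target_length: t.Union[int, Deferred],
) -> t.Callable[[t.Any], float]:
    # Literal values can be checked right away (fail fast).
    if not isinstance(target_length, Deferred):
        _check_target(int(target_length))

    def evaluate(data: t.Any) -> float:
        # Resolve into a local instead of rebinding the closure variable,
        # so a deferred value is resolved again on every call.
        target = target_length
        if isinstance(target, Deferred):
            target = _check_target(int(target.resolve()))
        text_len = len(str(data))
        if target == 0:
            return 1.0 if text_len == 0 else 0.0
        # Illustrative score: 1.0 at the target, falling linearly to 0.0.
        return max(0.0, 1.0 - abs(text_len - target) / target)

    return evaluate
```

Keeping the resolved value in a local also sidesteps the `nonlocal` rebinding in the snippet above, which would overwrite the unresolved value after the first call (assuming `resolve_lookup` passes plain values through unchanged).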

        return Metric(value=inverted_value, attributes=original_metric.attributes)

    name = name or f"{scorer.name}_inverted"
    return Scorer.from_callable(evaluate, name=name)  # type: ignore [return-value]

The type ignore comment suggests a type mismatch. Consider fixing the return type annotation or the function signature to avoid needing type ignore.
Suggested change:

    - return Scorer.from_callable(evaluate, name=name)  # type: ignore [return-value]
    + return t.cast(ScorerT, Scorer.from_callable(evaluate, name=name))

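To make the suggestion concrete, here is a small self-contained sketch of the `t.cast` approach. The `Scorer` class below is a minimal hypothetical stand-in, and the `invert` signature (including `max_value`) is assumed for illustration rather than copied from the PR.

```python
import typing as t


class Scorer:
    """Minimal stand-in for the real Scorer class (hypothetical shape)."""

    def __init__(self, func: t.Callable[[t.Any], float], name: str) -> None:
        self.func = func
        self.name = name

    @classmethod
    def from_callable(
        cls, func: t.Callable[[t.Any], float], name: str
    ) -> "Scorer":
        return cls(func, name)


ScorerT = t.TypeVar("ScorerT", bound=Scorer)


def invert(
    scorer: ScorerT, max_value: float = 1.0, name: t.Optional[str] = None
) -> ScorerT:
    def evaluate(data: t.Any) -> float:
        return max(0.0, max_value - scorer.func(data))

    name = name or f"{scorer.name}_inverted"
    # t.cast does nothing at runtime; it only tells the type checker to
    # treat the returned Scorer as ScorerT.
    return t.cast(ScorerT, Scorer.from_callable(evaluate, name=name))
```

Note that `t.cast` performs no verification: if a `Scorer` subclass is passed in, the returned object is still a plain `Scorer`. If that distinction matters, annotating the return type as `Scorer` may be the more honest fix.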
Scorer Improvements
Key Changes:
Added:
Changed:
Removed:
Generated Summary:
- detect_refusal_with_zero_shot: Detects refusal to answer using zero-shot classification.
- detect_bias: Scores presence of potentially biased language in data.
- is_json: Validates if a string is properly formatted JSON.
- is_xml: Validates if a string is properly formatted XML.
- character_consistency and contains to improve data handling.

This summary was generated with ❤️ by rigging
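For readers unfamiliar with format validators such as the `is_json` and `is_xml` scorers listed in the summary, the underlying checks could look roughly like this. This is a standard-library sketch returning plain booleans, not the PR's implementation, which wraps its results in scorer and metric objects.

```python
import json
import xml.etree.ElementTree as ET


def is_json(text: str) -> bool:
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def is_xml(text: str) -> bool:
    """Return True if `text` parses as well-formed XML."""
    try:
        ET.fromstring(text)
    except (ET.ParseError, TypeError):
        return False
    return True
```

Both checks confirm only that the input is well formed, not that it matches a schema. For untrusted XML, a hardened parser is generally advisable, since the standard-library parser is not secured against maliciously constructed input.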