fix(eval): improve rubric text normalization for judge-garbled output#6080

Open
tottenjordan wants to merge 3 commits into google:main from tottenjordan:fix/rubric-text-normalization

Conversation

@tottenjordan

Summary

Fixes #6072

_normalize_text currently only does .lower().strip(), so judge-model garbling (markdown bullets, smart quotes, bold formatting, extra whitespace) makes exact rubric matching fail. The affected rubric scores are dropped, with only a warning in the logs.

Changes:

  • Replace _normalize_text with NFKC unicode normalization, smart-quote/dash translation, and markdown artifact stripping
  • Add substring fallback with uniqueness guard to convert_auto_rater_response_to_score — accepts a match only when exactly one rubric candidate matches, preventing ambiguous cross-matching

Garbling patterns handled:

| Input | Normalized match |
| --- | --- |
| `- The response correctly uses tools` | the response correctly uses tools |
| `* **The response correctly uses tools**` | the response correctly uses tools |
| `"The response correctly uses tools"` (smart quotes) | the response correctly uses tools |
| `— The response correctly uses tools` (em dash) | the response correctly uses tools |
| `– The response correctly uses tools` (en dash) | the response correctly uses tools |
| `• The response correctly uses tools` (unicode bullet) | the response correctly uses tools |
| `The response correctly uses tools` (double spaces) | the response correctly uses tools |
| `The response… uses tools` (ellipsis) | the response... uses tools |
| `réponse` (accented chars) | réponse (preserved) |
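
The normalization steps described above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the function name `normalize_text_sketch`, the `_SMART_CHARS` table, and the exact regular expressions are assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical translation table (an assumption, not the PR's actual mapping):
# smart quotes become ASCII quotes; dashes and the unicode bullet become "-".
_SMART_CHARS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2022": "-",                  # unicode bullet
})


def normalize_text_sketch(text: str) -> str:
    """Illustrative normalizer for judge-garbled rubric text."""
    # NFKC folds compatibility characters (e.g. the ellipsis becomes "...")
    # while keeping accented letters such as "é" intact.
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_SMART_CHARS)
    # Drop leading list markers such as "- ", "* ", "+ ".
    text = re.sub(r"^\s*(?:[-*+]\s+)+", "", text)
    # Drop bold/underline emphasis markers and backticks.
    text = re.sub(r"\*\*|__|`", "", text)
    # Drop quotes that wrap the whole string.
    text = text.strip().strip("\"'")
    # Collapse runs of whitespace to a single space.
    text = re.sub(r"\s+", " ", text)
    return text.lower().strip()
```

Because NFKC is used rather than an ASCII-only encode, a rubric such as `réponse` keeps its accent, which is the property requested in #6072.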

Per @surajksharma07's suggestion in #6072: uses NFKC normalization instead of ascii-ignore (preserves non-English rubrics), and adds uniqueness guard on the substring fallback.
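
The substring fallback with a uniqueness guard can be sketched as below. The helper name `match_rubric_sketch`, its signature, and the bidirectional substring check are assumptions for illustration; the real logic lives inside convert_auto_rater_response_to_score and may be shaped differently. Inputs are assumed to be normalized already.

```python
from typing import Mapping, Optional


def match_rubric_sketch(
    judge_text: str, rubrics: Mapping[str, str]
) -> Optional[str]:
    """Return the rubric id for judge_text, or None if no safe match exists.

    `rubrics` maps normalized rubric text to a rubric id (hypothetical shape).
    """
    # 1. Exact match on normalized text is always preferred.
    if judge_text in rubrics:
        return rubrics[judge_text]
    # An empty string is a substring of everything, so never fall back on it.
    if not judge_text:
        return None
    # 2. Substring fallback: collect every rubric that contains, or is
    #    contained in, the judge's text.
    candidates = [
        rubric_id
        for text, rubric_id in rubrics.items()
        if text and (text in judge_text or judge_text in text)
    ]
    # 3. Uniqueness guard: accept only an unambiguous single candidate.
    return candidates[0] if len(candidates) == 1 else None
```

Rejecting the match when more than one rubric qualifies keeps a short or generic judge string from being scored against the wrong rubric.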

Validation

  • Unit tests: 46 tests pass (44 existing + 2 new) in test_rubric_based_evaluator.py
  • E2E pipeline: Ran full GEPA optimization pipeline (gepa-run-8fb68a8f52-20260611-115752) with 4 rubric-based criteria, gemini-2.5-pro judge — zero "not found in rubrics" warnings across all generations

Test plan

  • pytest tests/unittests/evaluation/test_rubric_based_evaluator.py -v — all 46 pass
  • Parametrized TestNormalizeText covers all garbling patterns from issue
  • TestSubstringFallbackUniquenessGuard verifies unique match accepted, ambiguous match rejected
  • All existing tests unchanged and passing

@tottenjordan

Author

@surajksharma07 PR is up per your suggestion in #6072. Includes the NFKC normalization, smart-char mapping, and uniqueness guard on the substring fallback. 46 tests pass (44 existing + 2 new).

@rohityan rohityan self-assigned this Jun 11, 2026
@rohityan rohityan added the eval [Component] This issue is related to evaluation label Jun 11, 2026
@rohityan

Collaborator

/adk-pr-analyze

@rohityan

Collaborator

Hi @tottenjordan, thank you for your contribution! We appreciate you taking the time to submit this pull request. Please fix the formatting errors.

@tottenjordan

Author

@googlebot I signed it.

@tottenjordan tottenjordan force-pushed the fix/rubric-text-normalization branch 2 times, most recently from 67faae3 to e29c0f3 Compare June 23, 2026 23:16
Replace _normalize_text's simple lower().strip() with NFKC unicode
normalization, smart-quote/dash translation, and markdown artifact
stripping. Add substring fallback with uniqueness guard to
convert_auto_rater_response_to_score for cases where normalization
alone isn't sufficient.

Fixes google#6072

Address reviewer feedback on google#6072:
- Guard `if not rubric and normalized_rubric_text:` prevents empty
  judge Property: lines from matching every rubric via substring
- Guard `if ct and` prevents empty rubric keys from matching
- Add logger.debug when substring fallback rescues a match to track
  judge drift in eval logs
- Add test_empty_property_text_does_not_match test case
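
The reason for the two empty-string guards is that in Python `"" in s` is true for every string `s`. The sketch below contrasts guarded and unguarded candidate collection; `substring_candidates_sketch` and its `guarded` flag are hypothetical names used only to show the effect, and the debug message wording is an assumption.

```python
import logging
from typing import List, Sequence

logger = logging.getLogger(__name__)


def substring_candidates_sketch(
    judge_text: str, rubric_texts: Sequence[str], guarded: bool = True
) -> List[str]:
    """Collect rubric texts that substring-match judge_text."""
    # Guard 1: an empty judge line would be "contained in" every rubric.
    if guarded and not judge_text:
        return []
    hits = []
    for ct in rubric_texts:
        # Guard 2: an empty rubric key would be "contained in" any judge text.
        if guarded and not ct:
            continue
        if ct in judge_text or judge_text in ct:
            hits.append(ct)
    if guarded and len(hits) == 1:
        # Record rescued matches so judge drift is visible in eval logs.
        logger.debug(
            "Substring fallback matched %r to rubric %r", judge_text, hits[0]
        )
    return hits
```

Without the guards, an empty judge line matches a lone rubric outright, and an empty rubric key adds a spurious second candidate that makes an otherwise valid match look ambiguous.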
@tottenjordan tottenjordan force-pushed the fix/rubric-text-normalization branch from e29c0f3 to 72ece22 Compare June 23, 2026 23:18

Labels

eval [Component] This issue is related to evaluation
request clarification [Status] The maintainer need clarification or more information from the author

Development

Successfully merging this pull request may close these issues.

RubricBasedEvaluator _normalize_text too basic — fails on judge model markdown output

2 participants