Fix IFEvalG word-counting checkers to ignore punctuation#1718
Conversation
There was a problem hiding this comment.
Code Review
This pull request improves the IFEvalG evaluation logic by addressing issues with word counting and paragraph splitting. It introduces a word_tokens helper function that filters out pure punctuation tokens, ensuring that trailing punctuation does not inflate word counts or cause false duplicate word detections. Additionally, it updates the paragraph splitter to support both spaced (* * *) and space-less (***) dividers, and adds corresponding regression tests. The reviewer suggested making the paragraph splitting regex more robust by using \s* instead of \s? to handle variable spacing.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| # The divider shown to the model is "* * *" (with spaces); also accept the | ||
| # space-less "***" form. The previous pattern only matched "***", so a | ||
| # response that used the divider exactly as instructed was never split. | ||
| paragraphs = re.split(r"\s?\*\s?\*\s?\*\s?", value) |
There was a problem hiding this comment.
Using \s? only matches at most one whitespace character. If the model outputs multiple spaces or tabs around the asterisks (e.g., * * *), the split will fail. Using \s* instead of \s? makes the regex much more robust against variable spacing.
| paragraphs = re.split(r"\s?\*\s?\*\s?\*\s?", value) | |
| paragraphs = re.split(r"\s*\*\s*\*\s*\*\s*", value) |
`count:counting_composition` and `count:count_unique` counted words with
`nltk.word_tokenize`, which emits punctuation (".", ",", ":") as standalone
tokens. As a result a correctly-formed response scored 0:
- counting_composition: `split_into_sentences` keeps each sentence's terminal
".", so word_tokenize counts it as an extra token and "exactly N words per
sentence" was off by one. Separately, the paragraph split matched "***" while
this instruction specifies the spaced "* * *" divider, so a compliant response
was never split into paragraphs (in the IF_multi_constraints_upto5 train set,
~97% of responses for this constraint use "* * *"). Both defects independently
drove this checker to ~100% false rejection.
- count_unique: a multi-sentence response repeats the "." (and ",") tokens, so
`len(words) == len(set(words))` was False regardless of the words used.
Add `instructions_util.word_tokens`, which keeps word_tokenize's tokenization
(including for non-Latin scripts) but drops pure-punctuation tokens (a word has
at least one alphanumeric char). Both checkers now use it. counting_composition
splits on the "* * *" divider the instruction actually specifies, and skips
empty leading/trailing segments.
Validated on 4,000 real model responses carrying these constraints: 64
counting_composition and 38 count_unique verdicts flip, all False->True
(previously-rejected compliant responses), with zero True->False regressions.
The surgical punctuation-drop avoids the regressions a plain `\w+` tokenizer
causes on non-Latin scripts.
Adds regression tests in test_ground_truth_utils.py.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1b9ca21 to
6e1d9d2
Compare
|
I've requested @ValentinaPy take a look. |
There was a problem hiding this comment.
Please simplify this by making it a parameterized test.
Problem
count:counting_compositionandcount:count_uniquecount words withnltk.word_tokenize, which emits punctuation (.,,,:) as standalone tokens. As a result a correctly-formed response scores 0:counting_composition—split_into_sentenceskeeps each sentence's terminal., soword_tokenize("Cats run.")→["Cats", "run", "."](length 3). "exactly N words per sentence" is therefore off by one and a compliant sentence can never pass. Separately, the paragraph-split regex matched***, while this instruction specifies the spaced* * *divider — so a response that used the divider exactly as instructed was never split into paragraphs. In theIF_multi_constraints_upto5train set ~97% of responses for this constraint use* * *. Either defect alone drives this checker to ~100% false rejection.count_unique— a multi-sentence response repeats the.(and,) tokens, solen(words) == len(set(words))isFalseregardless of the actual words used. Any response longer than one sentence fails.These are the RLVR reward functions for these constraints, so the false negatives directly penalize compliant generations. Because a broken-but-non-crashing reward just looks like "a very hard constraint" when averaged across many constraints, it isn't obvious from inside training — it only shows up when you check pass rates against known-compliant text.
Fix
Add
instructions_util.word_tokens(text), which tokenizes withword_tokenize(so non-Latin scripts tokenize correctly) but keeps only tokens containing at least one alphanumeric character. Both checkers use it for counting / de-duplication.counting_compositionalso splits on the* * *divider the instruction actually specifies and skips empty leading/trailing segments.I used this surgical punctuation-drop rather than switching to a
\w+tokenizer on purpose: a plain\w+tokenizer regresses non-Latin scripts (e.g. it mis-splits Tamil), whereas dropping punctuation tokens fromword_tokenizepreserves the existing tokenization for real words.Tests
Adds
TestIFEvalGWordCountinginopen_instruct/test_ground_truth_utils.pycovering: compliant* * *divider, rejection of the non-instructed***divider, rejection of a wrong word count, unique vs. repeated-wordcount_unique, and an end-to-endIFEvalVerifiercase.🤖 Generated with Claude Code