Fix IFEvalG word-counting checkers to ignore punctuation by pdudeia · Pull Request #1718 · allenai/open-instruct

pdudeia · 2026-06-09T05:25:22Z

Problem

count:counting_composition and count:count_unique count words with nltk.word_tokenize, which emits punctuation (., ,, :) as standalone tokens. As a result a correctly-formed response scores 0:

counting_composition — split_into_sentences keeps each sentence's terminal ., so word_tokenize("Cats run.") → ["Cats", "run", "."] (length 3). "exactly N words per sentence" is therefore off by one and a compliant sentence can never pass. Separately, the paragraph-split regex matched ***, while this instruction specifies the spaced * * * divider — so a response that used the divider exactly as instructed was never split into paragraphs. In the IF_multi_constraints_upto5 train set ~97% of responses for this constraint use * * *. Either defect alone drives this checker to ~100% false rejection.
count_unique — a multi-sentence response repeats the . (and ,) tokens, so len(words) == len(set(words)) is False regardless of the actual words used. Any response longer than one sentence fails.

These are the RLVR reward functions for these constraints, so the false negatives directly penalize compliant generations. Because a broken-but-non-crashing reward just looks like "a very hard constraint" when averaged across many constraints, it isn't obvious from inside training — it only shows up when you check pass rates against known-compliant text.

Fix

Add instructions_util.word_tokens(text), which tokenizes with word_tokenize (so non-Latin scripts tokenize correctly) but keeps only tokens containing at least one alphanumeric character. Both checkers use it for counting / de-duplication. counting_composition also splits on the * * * divider the instruction actually specifies and skips empty leading/trailing segments.

I used this surgical punctuation-drop rather than switching to a \w+ tokenizer on purpose: a plain \w+ tokenizer regresses non-Latin scripts (e.g. it mis-splits Tamil), whereas dropping punctuation tokens from word_tokenize preserves the existing tokenization for real words.

Tests

Adds TestIFEvalGWordCounting in open_instruct/test_ground_truth_utils.py covering: compliant * * * divider, rejection of the non-instructed *** divider, rejection of a wrong word count, unique vs. repeated-word count_unique, and an end-to-end IFEvalVerifier case.

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request improves the IFEvalG evaluation logic by addressing issues with word counting and paragraph splitting. It introduces a word_tokens helper function that filters out pure punctuation tokens, ensuring that trailing punctuation does not inflate word counts or cause false duplicate word detections. Additionally, it updates the paragraph splitter to support both spaced (* * *) and space-less (***) dividers, and adds corresponding regression tests. The reviewer suggested making the paragraph splitting regex more robust by using \s* instead of \s? to handle variable spacing.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-09T05:26:57Z

+        # The divider shown to the model is "* * *" (with spaces); also accept the
+        # space-less "***" form. The previous pattern only matched "***", so a
+        # response that used the divider exactly as instructed was never split.
+        paragraphs = re.split(r"\s?\*\s?\*\s?\*\s?", value)


Using \s? only matches at most one whitespace character. If the model outputs multiple spaces or tabs around the asterisks (e.g., * * *), the split will fail. Using \s* instead of \s? makes the regex much more robust against variable spacing.

Suggested change

paragraphs = re.split(r"\s?\*\s?\*\s?\*\s?", value)

paragraphs = re.split(r"\s*\*\s*\*\s*\*\s*", value)

`count:counting_composition` and `count:count_unique` counted words with `nltk.word_tokenize`, which emits punctuation (".", ",", ":") as standalone tokens. As a result a correctly-formed response scored 0: - counting_composition: `split_into_sentences` keeps each sentence's terminal ".", so word_tokenize counts it as an extra token and "exactly N words per sentence" was off by one. Separately, the paragraph split matched "***" while this instruction specifies the spaced "* * *" divider, so a compliant response was never split into paragraphs (in the IF_multi_constraints_upto5 train set, ~97% of responses for this constraint use "* * *"). Both defects independently drove this checker to ~100% false rejection. - count_unique: a multi-sentence response repeats the "." (and ",") tokens, so `len(words) == len(set(words))` was False regardless of the words used. Add `instructions_util.word_tokens`, which keeps word_tokenize's tokenization (including for non-Latin scripts) but drops pure-punctuation tokens (a word has at least one alphanumeric char). Both checkers now use it. counting_composition splits on the "* * *" divider the instruction actually specifies, and skips empty leading/trailing segments. Validated on 4,000 real model responses carrying these constraints: 64 counting_composition and 38 count_unique verdicts flip, all False->True (previously-rejected compliant responses), with zero True->False regressions. The surgical punctuation-drop avoids the regressions a plain `\w+` tokenizer causes on non-Latin scripts. Adds regression tests in test_ground_truth_utils.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

finbarrtimbers · 2026-06-09T22:49:59Z

I've requested @ValentinaPy take a look.

finbarrtimbers · 2026-06-09T22:47:05Z

Please simplify this by making it a parameterized test.

gemini-code-assist Bot reviewed Jun 9, 2026

View reviewed changes

pdudeia force-pushed the fix-ifevalg-punctuation-tokenization branch from 1b9ca21 to 6e1d9d2 Compare June 9, 2026 05:44

finbarrtimbers requested a review from ValentinaPy June 9, 2026 22:49

finbarrtimbers reviewed Jun 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix IFEvalG word-counting checkers to ignore punctuation#1718

Fix IFEvalG word-counting checkers to ignore punctuation#1718
pdudeia wants to merge 1 commit into
allenai:mainfrom
pdudeia:fix-ifevalg-punctuation-tokenization

pdudeia commented Jun 9, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Uh oh!

finbarrtimbers commented Jun 9, 2026

Uh oh!

finbarrtimbers Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	paragraphs = re.split(r"\s?\\s?\\s?\*\s?", value)
	paragraphs = re.split(r"\s\\s\\s\\s*", value)

Uh oh!

Conversation

pdudeia commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Tests

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

finbarrtimbers commented Jun 9, 2026

Uh oh!

finbarrtimbers Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pdudeia commented Jun 9, 2026 •

edited

Loading