Skip to content

Fix IFEvalG word-counting checkers to ignore punctuation#1718

Open
pdudeia wants to merge 1 commit into
allenai:mainfrom
pdudeia:fix-ifevalg-punctuation-tokenization
Open

Fix IFEvalG word-counting checkers to ignore punctuation#1718
pdudeia wants to merge 1 commit into
allenai:mainfrom
pdudeia:fix-ifevalg-punctuation-tokenization

Conversation

@pdudeia

@pdudeia pdudeia commented Jun 9, 2026

Copy link
Copy Markdown

Problem

count:counting_composition and count:count_unique count words with nltk.word_tokenize, which emits punctuation (., ,, :) as standalone tokens. As a result a correctly-formed response scores 0:

  • counting_compositionsplit_into_sentences keeps each sentence's terminal ., so word_tokenize("Cats run.")["Cats", "run", "."] (length 3). "exactly N words per sentence" is therefore off by one and a compliant sentence can never pass. Separately, the paragraph-split regex matched ***, while this instruction specifies the spaced * * * divider — so a response that used the divider exactly as instructed was never split into paragraphs. In the IF_multi_constraints_upto5 train set ~97% of responses for this constraint use * * *. Either defect alone drives this checker to ~100% false rejection.
  • count_unique — a multi-sentence response repeats the . (and ,) tokens, so len(words) == len(set(words)) is False regardless of the actual words used. Any response longer than one sentence fails.

These are the RLVR reward functions for these constraints, so the false negatives directly penalize compliant generations. Because a broken-but-non-crashing reward just looks like "a very hard constraint" when averaged across many constraints, it isn't obvious from inside training — it only shows up when you check pass rates against known-compliant text.

Fix

Add instructions_util.word_tokens(text), which tokenizes with word_tokenize (so non-Latin scripts tokenize correctly) but keeps only tokens containing at least one alphanumeric character. Both checkers use it for counting / de-duplication. counting_composition also splits on the * * * divider the instruction actually specifies and skips empty leading/trailing segments.

I used this surgical punctuation-drop rather than switching to a \w+ tokenizer on purpose: a plain \w+ tokenizer regresses non-Latin scripts (e.g. it mis-splits Tamil), whereas dropping punctuation tokens from word_tokenize preserves the existing tokenization for real words.

Tests

Adds TestIFEvalGWordCounting in open_instruct/test_ground_truth_utils.py covering: compliant * * * divider, rejection of the non-instructed *** divider, rejection of a wrong word count, unique vs. repeated-word count_unique, and an end-to-end IFEvalVerifier case.

🤖 Generated with Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves the IFEvalG evaluation logic by addressing issues with word counting and paragraph splitting. It introduces a word_tokens helper function that filters out pure punctuation tokens, ensuring that trailing punctuation does not inflate word counts or cause false duplicate word detections. Additionally, it updates the paragraph splitter to support both spaced (* * *) and space-less (***) dividers, and adds corresponding regression tests. The reviewer suggested making the paragraph splitting regex more robust by using \s* instead of \s? to handle variable spacing.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread open_instruct/IFEvalG/instructions.py Outdated
# The divider shown to the model is "* * *" (with spaces); also accept the
# space-less "***" form. The previous pattern only matched "***", so a
# response that used the divider exactly as instructed was never split.
paragraphs = re.split(r"\s?\*\s?\*\s?\*\s?", value)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using \s? only matches at most one whitespace character. If the model outputs multiple spaces or tabs around the asterisks (e.g., * * *), the split will fail. Using \s* instead of \s? makes the regex much more robust against variable spacing.

Suggested change
paragraphs = re.split(r"\s?\*\s?\*\s?\*\s?", value)
paragraphs = re.split(r"\s*\*\s*\*\s*\*\s*", value)

`count:counting_composition` and `count:count_unique` counted words with
`nltk.word_tokenize`, which emits punctuation (".", ",", ":") as standalone
tokens. As a result a correctly-formed response scored 0:

- counting_composition: `split_into_sentences` keeps each sentence's terminal
  ".", so word_tokenize counts it as an extra token and "exactly N words per
  sentence" was off by one. Separately, the paragraph split matched "***" while
  this instruction specifies the spaced "* * *" divider, so a compliant response
  was never split into paragraphs (in the IF_multi_constraints_upto5 train set,
  ~97% of responses for this constraint use "* * *"). Both defects independently
  drove this checker to ~100% false rejection.
- count_unique: a multi-sentence response repeats the "." (and ",") tokens, so
  `len(words) == len(set(words))` was False regardless of the words used.

Add `instructions_util.word_tokens`, which keeps word_tokenize's tokenization
(including for non-Latin scripts) but drops pure-punctuation tokens (a word has
at least one alphanumeric char). Both checkers now use it. counting_composition
splits on the "* * *" divider the instruction actually specifies, and skips
empty leading/trailing segments.

Validated on 4,000 real model responses carrying these constraints: 64
counting_composition and 38 count_unique verdicts flip, all False->True
(previously-rejected compliant responses), with zero True->False regressions.
The surgical punctuation-drop avoids the regressions a plain `\w+` tokenizer
causes on non-Latin scripts.

Adds regression tests in test_ground_truth_utils.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@pdudeia pdudeia force-pushed the fix-ifevalg-punctuation-tokenization branch from 1b9ca21 to 6e1d9d2 Compare June 9, 2026 05:44
@finbarrtimbers

Copy link
Copy Markdown
Collaborator

I've requested @ValentinaPy take a look.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please simplify this by making it a parameterized test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants