fix(inference): SentencePiece trailing-whitespace handling matches HF Metaspace#110
Closed
ohdearquant wants to merge 1 commit into
Closed
fix(inference): SentencePiece trailing-whitespace handling matches HF Metaspace#110ohdearquant wants to merge 1 commit into
ohdearquant wants to merge 1 commit into
Conversation
… Metaspace Root cause: the normalize() loop emitted a META_SPACE (▁) character for every whitespace character in the input, including trailing whitespace. HF's Metaspace pre-tokenizer treats ▁ as the space *before* a word — trailing whitespace has no following word, so no ▁ should be emitted. Fix: after the normalize() loop, strip any trailing META_SPACE characters when `escape_whitespaces=true` (the E5/XLM-RoBERTa path), or strip trailing ASCII spaces when `escape_whitespaces=false`. Leading whitespace was already handled correctly: `dummy_prefix=true` sets `prev_was_space=true` before the loop, so the first leading space character is collapsed (remove_extra_whitespaces) or skipped entirely. Regression tests added to audit_tokenizer_parity.rs: - " leading whitespace and multiple spaces " → 8-token seq - "trailing space " → 4-token seq Both verified against HF tokenizers==0.23.1 reference. Closes parity regression: E5-multilingual-small input with trailing spaces went from cosine 0.9659 (extra ▁ before EOS) to cosine ≥ 0.999. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Owner
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Layer
L1 — tokenizer fix (PR6 of 11)
What
Strips trailing
META_SPACE(▁) characters fromnormalize()output insentencepiece.rs. Two branches: one forescape_whitespaces=true(E5/XLM-RoBERTa path), one forescape_whitespaces=false(ASCII space).Root cause
The
normalize()loop emitted ▁ for every whitespace character including trailing ones. HF's Metaspace pre-tokenizer semantics define ▁ as the space before a word — trailing whitespace has no following word, so no ▁ should be emitted. The stray trailing ▁ was a valid trie entry (token 6, the bare space-piece), so Viterbi encoded it as an extra token before EOS.Why
Discovered by the HF parity regression test (PR10): E5-multilingual-small produced cosine 0.9659 on
" leading whitespace and multiple spaces ".Result
audit_tokenizer_parity.rsStack
Base: #109 (PR5 WP CJK NFD)
Umbrella: #104
🤖 Generated with Claude Code