Skip to content

fix(inference): SentencePiece trailing-whitespace handling matches HF Metaspace#110

Closed
ohdearquant wants to merge 1 commit into
pr-embedperf-05-wp-cjk-nfdfrom
pr-embedperf-06-sp-trailing-ws
Closed

fix(inference): SentencePiece trailing-whitespace handling matches HF Metaspace#110
ohdearquant wants to merge 1 commit into
pr-embedperf-05-wp-cjk-nfdfrom
pr-embedperf-06-sp-trailing-ws

Conversation

@ohdearquant
Copy link
Copy Markdown
Owner

Layer

L1 — tokenizer fix (PR6 of 11)

What

Strips trailing META_SPACE (▁) characters from normalize() output in sentencepiece.rs. Two branches: one for escape_whitespaces=true (E5/XLM-RoBERTa path), one for escape_whitespaces=false (ASCII space).

Root cause

The normalize() loop emitted ▁ for every whitespace character including trailing ones. HF's Metaspace pre-tokenizer semantics define ▁ as the space before a word — trailing whitespace has no following word, so no ▁ should be emitted. The stray trailing ▁ was a valid trie entry (token 6, the bare space-piece), so Viterbi encoded it as an extra token before EOS.

Why

Discovered by the HF parity regression test (PR10): E5-multilingual-small produced cosine 0.9659 on " leading whitespace and multiple spaces ".

Result

  • E5 parity cosine: 0.9659 → 0.9999 (leading-whitespace input)
  • Adds 2 whitespace regression cases to audit_tokenizer_parity.rs

Stack

Base: #109 (PR5 WP CJK NFD)
Umbrella: #104

🤖 Generated with Claude Code

… Metaspace

Root cause: the normalize() loop emitted a META_SPACE (▁) character for
every whitespace character in the input, including trailing whitespace.
HF's Metaspace pre-tokenizer treats ▁ as the space *before* a word —
trailing whitespace has no following word, so no ▁ should be emitted.

Fix: after the normalize() loop, strip any trailing META_SPACE characters
when `escape_whitespaces=true` (the E5/XLM-RoBERTa path), or strip
trailing ASCII spaces when `escape_whitespaces=false`.

Leading whitespace was already handled correctly: `dummy_prefix=true`
sets `prev_was_space=true` before the loop, so the first leading space
character is collapsed (remove_extra_whitespaces) or skipped entirely.

Regression tests added to audit_tokenizer_parity.rs:
  - "   leading whitespace and    multiple    spaces   " → 8-token seq
  - "trailing space " → 4-token seq
Both verified against HF tokenizers==0.23.1 reference.

Closes parity regression: E5-multilingual-small input with trailing
spaces went from cosine 0.9659 (extra ▁ before EOS) to cosine ≥ 0.999.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ohdearquant
Copy link
Copy Markdown
Owner Author

Subsumed by #104 merge (umbrella PR brought all 11 PRs' content to main in one merge commit after stacked-PR base branches collapsed). Codex round-1 findings tracked in #116. Closing as superseded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant