fix(inference): SentencePiece trailing-whitespace handling matches HF Metaspace by ohdearquant · Pull Request #110 · ohdearquant/lattice

ohdearquant · 2026-05-25T20:19:55Z

Layer

L1 — tokenizer fix (PR6 of 11)

What

Strips trailing META_SPACE (▁) characters from normalize() output in sentencepiece.rs. Two branches: one for escape_whitespaces=true (E5/XLM-RoBERTa path), one for escape_whitespaces=false (ASCII space).

Root cause

The normalize() loop emitted ▁ for every whitespace character including trailing ones. HF's Metaspace pre-tokenizer semantics define ▁ as the space before a word — trailing whitespace has no following word, so no ▁ should be emitted. The stray trailing ▁ was a valid trie entry (token 6, the bare space-piece), so Viterbi encoded it as an extra token before EOS.

Why

Discovered by the HF parity regression test (PR10): E5-multilingual-small produced cosine 0.9659 on " leading whitespace and multiple spaces ".

Result

E5 parity cosine: 0.9659 → 0.9999 (leading-whitespace input)
Adds 2 whitespace regression cases to audit_tokenizer_parity.rs

Stack

Base: #109 (PR5 WP CJK NFD)
Umbrella: #104

🤖 Generated with Claude Code

… Metaspace Root cause: the normalize() loop emitted a META_SPACE (▁) character for every whitespace character in the input, including trailing whitespace. HF's Metaspace pre-tokenizer treats ▁ as the space *before* a word — trailing whitespace has no following word, so no ▁ should be emitted. Fix: after the normalize() loop, strip any trailing META_SPACE characters when `escape_whitespaces=true` (the E5/XLM-RoBERTa path), or strip trailing ASCII spaces when `escape_whitespaces=false`. Leading whitespace was already handled correctly: `dummy_prefix=true` sets `prev_was_space=true` before the loop, so the first leading space character is collapsed (remove_extra_whitespaces) or skipped entirely. Regression tests added to audit_tokenizer_parity.rs: - " leading whitespace and multiple spaces " → 8-token seq - "trailing space " → 4-token seq Both verified against HF tokenizers==0.23.1 reference. Closes parity regression: E5-multilingual-small input with trailing spaces went from cosine 0.9659 (extra ▁ before EOS) to cosine ≥ 0.999. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ohdearquant · 2026-05-25T20:49:19Z

Subsumed by #104 merge (umbrella PR brought all 11 PRs' content to main in one merge commit after stacked-PR base branches collapsed). Codex round-1 findings tracked in #116. Closing as superseded.

This was referenced May 25, 2026

test(embed): end-to-end tokenizer parity through embedding service (P0-E1) #111

Closed

embed-perf-quality show (umbrella draft — slice into ordered PRs below) #104

Merged

embed-perf-quality codex review follow-ups (PR #105-#115) #116

Open

ohdearquant closed this May 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(inference): SentencePiece trailing-whitespace handling matches HF Metaspace#110

fix(inference): SentencePiece trailing-whitespace handling matches HF Metaspace#110
ohdearquant wants to merge 1 commit into
pr-embedperf-05-wp-cjk-nfdfrom
pr-embedperf-06-sp-trailing-ws

ohdearquant commented May 25, 2026

Uh oh!

ohdearquant commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ohdearquant commented May 25, 2026

Layer

What

Root cause

Why

Result

Stack

Uh oh!

ohdearquant commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant