fix(inference): WordPiece CJK character splitting via Hiragana/Katakana NFD fold by ohdearquant · Pull Request #109 · ohdearquant/lattice

ohdearquant · 2026-05-25T20:19:40Z

Layer

L1 — tokenizer fix (PR5 of 11)

What

Adds 58 Hiragana/Katakana voicing entries to fold_diacritic in wordpiece.rs. Each entry maps a voiced/semi-voiced syllable (e.g. で U+3067) to its NFD base syllable (e.g. て U+3066).
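
As a rough illustration of the table described above, the sketch below shows what a few of the 58 entries might look like. The names `fold_kana_voicing` and `fold_str` are hypothetical and are not taken from wordpiece.rs. The `Option<&'static str>` return type is also an assumption, based only on the commit message saying the entries return base syllable strings. Only nine entries are shown.

```rust
/// Hypothetical helper (not the actual `fold_diacritic`): maps a voiced or
/// semi-voiced kana to the base syllable left after NFD + mark stripping.
/// Only a small subset of the 58 entries is listed here.
fn fold_kana_voicing(c: char) -> Option<&'static str> {
    match c {
        // Hiragana with dakuten (voiced)
        '\u{304C}' => Some("\u{304B}"), // が -> か
        '\u{3056}' => Some("\u{3055}"), // ざ -> さ
        '\u{3067}' => Some("\u{3066}"), // で -> て
        '\u{3070}' => Some("\u{306F}"), // ば -> は
        // Hiragana with handakuten (semi-voiced)
        '\u{3071}' => Some("\u{306F}"), // ぱ -> は
        // Katakana
        '\u{30AC}' => Some("\u{30AB}"), // ガ -> カ
        '\u{30C7}' => Some("\u{30C6}"), // デ -> テ
        '\u{30D1}' => Some("\u{30CF}"), // パ -> ハ
        '\u{30F4}' => Some("\u{30A6}"), // ヴ -> ウ
        _ => None,
    }
}

/// Applies the fold to every character; unmapped characters pass through.
fn fold_str(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match fold_kana_voicing(c) {
            Some(base) => out.push_str(base),
            None => out.push(c),
        }
    }
    out
}
```

With this sketch, `fold_str("です")` yields `"てす"`, while ASCII and unvoiced kana are returned unchanged.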

Root cause

で Unicode-decomposes under NFD into て + U+3099 (combining voicing mark). HF's BertNormalizer applies NFD then strips Mn-category combining chars. BGE vocab contains ##て (id 30191) but not ##で. Lattice's fold_diacritic only covered Latin diacritics, so で passed through unchanged → UNK (id 100).
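
The strip step can be illustrated in isolation. Rust's standard library has no NFD routine, so this sketch assumes the input has already been decomposed by hand. It also checks only the two combining kana voicing marks rather than the full Mn category, and both function names are hypothetical.

```rust
/// True for the two combining kana voicing marks:
/// U+3099 (dakuten) and U+309A (handakuten).
fn is_kana_voicing_mark(c: char) -> bool {
    matches!(c, '\u{3099}' | '\u{309A}')
}

/// Illustrative strip step. Assumes `decomposed` is already in NFD form;
/// a real normalizer would strip every Mn-category character.
fn strip_kana_voicing_marks(decomposed: &str) -> String {
    decomposed
        .chars()
        .filter(|c| !is_kana_voicing_mark(*c))
        .collect()
}
```

Feeding it the hand-decomposed sequence て + U+3099 returns て alone. A precomputed fold table reaches the same result for precomposed input without needing an NFD dependency.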

Why

Discovered by the HF parity regression test (PR10): BGE-small produced cosine 0.9906 on "短い日本語のテストです。" because position 10 emitted UNK where HF emitted 30191.
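
For readers unfamiliar with the metric, a minimal sketch of cosine similarity between two embedding vectors follows. It is an assumption that the parity test compares output embeddings this way. The function name and the `Option` return for degenerate inputs are illustrative choices, not the test's actual code.

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns None for empty, mismatched, or zero-norm inputs.
fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

A single differing token id changes one input embedding, which is enough to pull the cosine between the two models' output vectors measurably below 1.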

Result

BGE parity cosine: 0.9906 → 0.9999 (CJK input)
Adds 2 CJK regression cases to audit_tokenizer_parity.rs
Zero new dependencies

Stack

Base: #108 (PR4 WP AddedToken)
Umbrella: #104

🤖 Generated with Claude Code

…nizers

Voiced and semi-voiced Hiragana and Katakana (e.g. で U+3067, が U+304C) NFD-decompose to a base syllable plus a combining dakuten (U+3099) or handakuten (U+309A). HF BertNormalizer with strip_accents strips those combining marks, mapping で → て. Lattice's fold_diacritic table only covered Latin diacritics, so these characters passed through unchanged, causing WordPiece to miss the ##て vocab entry and emit [UNK] (id 100) instead.

Fix: extend fold_diacritic with all 58 voiced/semi-voiced Hiragana and Katakana, returning their base syllable strings (matching HF BertNormalizer NFD+strip behaviour). Add 2 CJK regression cases to audit_tokenizer_parity so this never regresses silently.

Closes parity regression: BGE-small "短い日本語のテストです。" cosine 0.9906 → 0.9999 (UNK at position 10 replaced by the correct ##て id 30191).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ohdearquant · 2026-05-25T20:49:18Z

Subsumed by #104 merge (umbrella PR brought all 11 PRs' content to main in one merge commit after stacked-PR base branches collapsed). Codex round-1 findings tracked in #116. Closing as superseded.

This was referenced May 25, 2026

fix(inference): SentencePiece trailing-whitespace handling matches HF Metaspace #110

Closed

embed-perf-quality show (umbrella draft — slice into ordered PRs below) #104

Merged

embed-perf-quality codex review follow-ups (PR #105-#115) #116

Open

ohdearquant closed this May 25, 2026

ohdearquant wants to merge 1 commit into pr-embedperf-04-wp-addedtoken from pr-embedperf-05-wp-cjk-nfd