fix(inference): WordPiece CJK character splitting via Hiragana/Katakana NFD fold #109
Closed
ohdearquant wants to merge 1 commit into
…nizers

Voiced/semi-voiced Hiragana and Katakana (e.g. で U+3067, が U+304C) NFD-decompose to a base syllable plus a combining dakuten (U+3099) or handakuten (U+309A). HF BertNormalizer with strip_accents strips those combining marks, mapping で → て. Lattice's fold_diacritic table only covered Latin diacritics, so these characters passed through unchanged, causing WordPiece to miss the ##て vocab entry and emit [UNK] (id 100) instead.

Fix: extend fold_diacritic with all 58 voiced/semi-voiced Hiragana and Katakana, returning their base syllable strings (matching HF BertNormalizer NFD+strip behaviour). Add 2 CJK regression cases to audit_tokenizer_parity so this never regresses silently.

Closes parity regression: BGE-small "短い日本語のテストです。" cosine 0.9906 → 0.9999 (UNK at position 10 replaced by correct ##て id 30191).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
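The fold described in the commit message can be sketched roughly as below. This is an illustration only, not the code in this PR: the names `fold_kana_voicing` and `fold_text` are hypothetical, and only six of the 58 entries are shown. It assumes the fold returns base syllable strings, as the commit message states.

```rust
/// Hypothetical sketch of a kana voicing fold. This is NOT the actual
/// Lattice `fold_diacritic`; it shows only six of the 58 entries.
/// Each voiced/semi-voiced syllable maps to the base syllable that NFD
/// decomposition followed by combining-mark stripping would leave behind.
fn fold_kana_voicing(c: char) -> Option<&'static str> {
    match c {
        // Hiragana
        '\u{304C}' => Some("\u{304B}"), // が -> か (dakuten)
        '\u{3067}' => Some("\u{3066}"), // で -> て (dakuten)
        '\u{3071}' => Some("\u{306F}"), // ぱ -> は (handakuten)
        // Katakana
        '\u{30AC}' => Some("\u{30AB}"), // ガ -> カ (dakuten)
        '\u{30C7}' => Some("\u{30C6}"), // デ -> テ (dakuten)
        '\u{30D1}' => Some("\u{30CF}"), // パ -> ハ (handakuten)
        _ => None,
    }
}

/// Applies the fold to every character and leaves unmapped characters as-is.
fn fold_text(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match fold_kana_voicing(c) {
            Some(base) => out.push_str(base),
            None => out.push(c),
        }
    }
    out
}
```

With this sketch, `fold_text("です")` yields `"てす"`, so a later WordPiece lookup would see て rather than で.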
Layer
L1 — tokenizer fix (PR5 of 11)
What
Adds 58 Hiragana/Katakana voicing entries to fold_diacritic in wordpiece.rs. Each entry maps a voiced/semi-voiced syllable (e.g. で U+3067) to its NFD base syllable (e.g. て U+3066).
Root cause
で decomposes under Unicode NFD into て + U+3099 (combining voicing mark). HF's BertNormalizer applies NFD, then strips Mn-category combining characters. The BGE vocab contains ##て (id 30191) but not ##で. Lattice's fold_diacritic only covered Latin diacritics, so で passed through unchanged → UNK (id 100).
Why
Discovered by the HF parity regression test (PR10): BGE-small produced cosine 0.9906 on "短い日本語のテストです。" because position 10 emitted UNK where HF emitted 30191.
Result
audit_tokenizer_parity.rs
Stack
Base: #108 (PR4 WP AddedToken)
Umbrella: #104
🤖 Generated with Claude Code