fix(inference): WordPiece CJK character splitting via Hiragana/Katakana NFD fold #109

Closed
ohdearquant wants to merge 1 commit into pr-embedperf-04-wp-addedtoken from pr-embedperf-05-wp-cjk-nfd

Conversation

@ohdearquant
Owner

Layer

L1 — tokenizer fix (PR5 of 11)

What

Adds 58 Hiragana/Katakana voicing entries to fold_diacritic in wordpiece.rs. Each entry maps a voiced/semi-voiced syllable (e.g. で, U+3067) to its NFD base syllable (e.g. て, U+3066).
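
For illustration, a few entries of such a table might look like the sketch below. The names `fold_kana_voicing` and `fold_str` are hypothetical and are not the actual `fold_diacritic` code from wordpiece.rs; only five of the 58 entries are shown, and the string return type is assumed from the commit message ("returning their base syllable strings").

```rust
/// Hypothetical helper, NOT the real `fold_diacritic` from wordpiece.rs.
/// Maps a precomposed voiced/semi-voiced kana to its NFD base syllable.
/// Match arms use escapes so the source stays unambiguous under any
/// editor-side Unicode normalization.
fn fold_kana_voicing(c: char) -> Option<&'static str> {
    match c {
        // Hiragana, dakuten (base + U+3099)
        '\u{304C}' => Some("\u{304B}"), // が -> か
        '\u{3067}' => Some("\u{3066}"), // で -> て
        // Hiragana, handakuten (base + U+309A)
        '\u{3071}' => Some("\u{306F}"), // ぱ -> は
        // Katakana
        '\u{30AC}' => Some("\u{30AB}"), // ガ -> カ
        '\u{30D1}' => Some("\u{30CF}"), // パ -> ハ
        // Everything else is left to the caller (pass through unchanged).
        _ => None,
    }
}

/// Hypothetical wrapper: applies the fold to every char of a string.
fn fold_str(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match fold_kana_voicing(c) {
            Some(base) => out.push_str(base),
            None => out.push(c),
        }
    }
    out
}
```

A static match table like this needs no Unicode normalization crate, which is consistent with the "zero new dependencies" result reported in this PR.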

Root cause

で Unicode-decomposes under NFD into て + U+3099 (combining voicing mark). HF's BertNormalizer applies NFD then strips Mn-category combining chars. BGE vocab contains ##て (id 30191) but not ##で. Lattice's fold_diacritic only covered Latin diacritics, so で passed through unchanged → UNK (id 100).
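
The strip step on the HF side can be sketched as follows. This is a simplified assumption rather than HF's implementation: Rust's standard library has no NFD routine, so the input is written already decomposed, and the hypothetical `strip_kana_voicing_marks` removes only the two combining kana marks instead of the whole Mn category.

```rust
/// Hypothetical sketch: drops the combining dakuten (U+3099) and
/// handakuten (U+309A) from text that is ALREADY in NFD form.
/// A full normalizer would strip every Mn-category character.
fn strip_kana_voicing_marks(decomposed: &str) -> String {
    decomposed
        .chars()
        .filter(|c| !matches!(*c, '\u{3099}' | '\u{309A}'))
        .collect()
}
```

Given the decomposed pair U+3066 U+3099, the function returns the bare base syllable U+3066 (て), which is the form present in the BGE vocab according to the description above.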

Why

Discovered by the HF parity regression test (PR10): BGE-small produced cosine 0.9906 on "短い日本語のテストです。" because position 10 emitted UNK where HF emitted 30191.

Result

  • BGE parity cosine: 0.9906 → 0.9999 (CJK input)
  • Adds 2 CJK regression cases to audit_tokenizer_parity.rs
  • Zero new dependencies
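
The effect on token ids can be reproduced with a toy lookup. The sketch below is not the regression test from audit_tokenizer_parity.rs (its contents are not shown in this PR) and is not a full greedy WordPiece matcher; `toy_vocab` and `lookup_continuation` are hypothetical, and the vocab holds only the two ids quoted in this PR.

```rust
use std::collections::HashMap;

/// Id of [UNK], as quoted in the PR description.
const UNK_ID: u32 = 100;

/// Toy vocab with only the two entries mentioned in this PR.
fn toy_vocab() -> HashMap<&'static str, u32> {
    HashMap::from([("[UNK]", UNK_ID), ("##\u{3066}", 30191)]) // ##て
}

/// Hypothetical single-piece lookup: builds the continuation piece
/// "##<char>" and returns its id, or UNK_ID when it is not in the vocab.
/// When `fold` is true, で (U+3067) is first folded to て (U+3066).
fn lookup_continuation(vocab: &HashMap<&str, u32>, c: char, fold: bool) -> u32 {
    let folded = if fold && c == '\u{3067}' { '\u{3066}' } else { c };
    let piece = format!("##{folded}");
    vocab.get(piece.as_str()).copied().unwrap_or(UNK_ID)
}
```

Without the fold, looking up で misses the vocab and yields 100; with the fold it resolves to 30191, matching the ids reported for position 10.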

Stack

Base: #108 (PR4 WP AddedToken)
Umbrella: #104

🤖 Generated with Claude Code

…nizers

Voiced/semi-voiced Hiragana and Katakana (e.g. で U+3067, が U+304C) NFD-decompose
to a base syllable + combining dakuten or handakuten (U+3099/U+309A). HF BertNormalizer with
strip_accents strips those combining marks, mapping で → て. Lattice's fold_diacritic
table only covered Latin diacritics so these characters passed through unchanged,
causing WordPiece to miss the ##て vocab entry and emit [UNK] (id 100) instead.

Fix: extend fold_diacritic with all 58 voiced/semi-voiced Hiragana and Katakana,
returning their base syllable strings (matching HF BertNormalizer NFD+strip behaviour).
Add 2 CJK regression cases to audit_tokenizer_parity so this never regresses silently.

Closes parity regression: BGE-small "短い日本語のテストです。" cosine 0.9906 → 0.9999
(UNK at position 10 replaced by correct ##て id 30191).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ohdearquant
Owner Author

Subsumed by #104 merge (umbrella PR brought all 11 PRs' content to main in one merge commit after stacked-PR base branches collapsed). Codex round-1 findings tracked in #116. Closing as superseded.
