fix(inference): preserve AddedToken metadata for WordPiece pre-splitting by ohdearquant · Pull Request #108 · ohdearquant/lattice

ohdearquant · 2026-05-25T20:19:26Z

Layer

L1 — tokenizer fix (PR4 of 11)

What

Replaces the WordPiece tokenizer's content-only added-token parsing with a structured AddedToken that carries content, lstrip, rstrip, normalized, and single_word fields
Adds a longest-match lookup pass before BERT normalization: the input is split into segments at added-token matches, and each segment between matches then goes through the normal WordPiece flow
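The pre-splitting pass described above can be sketched roughly as follows. This is a minimal illustration in Python, not the lattice implementation: the `AddedToken` field names come from the PR description, while the defaults, the function name `split_on_added_tokens`, and the `(segment, is_added_token)` return shape are assumptions. The sketch carries the `lstrip`, `rstrip`, `normalized`, and `single_word` flags but does not apply them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AddedToken:
    # Field names follow the PR description; the defaults are assumptions.
    content: str
    lstrip: bool = False
    rstrip: bool = False
    normalized: bool = False
    single_word: bool = False


def split_on_added_tokens(
    text: str, added: List[AddedToken]
) -> List[Tuple[str, bool]]:
    """Split raw text into (segment, is_added_token) pairs.

    At each position the longest matching added token wins. Text between
    matches is returned as ordinary segments, which would then go through
    the normal WordPiece flow.
    """
    # Longest first, so a longer token wins over a token that is its prefix.
    # Empty contents are skipped to guarantee forward progress.
    candidates = sorted(
        (t for t in added if t.content),
        key=lambda t: len(t.content),
        reverse=True,
    )
    segments: List[Tuple[str, bool]] = []
    start = 0  # start of the pending ordinary segment
    i = 0
    while i < len(text):
        match = next(
            (t for t in candidates if text.startswith(t.content, i)), None
        )
        if match is None:
            i += 1
            continue
        if start < i:
            segments.append((text[start:i], False))
        segments.append((match.content, True))
        i += len(match.content)
        start = i
    if start < len(text):
        segments.append((text[start:], False))
    return segments
```

With `[CLS]` and `[SEP]` registered, `split_on_added_tokens("hello [SEP] world", ...)` yields an ordinary segment, the whole `[SEP]` token, and another ordinary segment, so only the ordinary segments would reach the normalizer.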

Why

Audit gap: special tokens ([CLS], [SEP]) appearing literally in input text were being split by BERT normalization instead of matched as whole tokens. Pre-fix lattice produced 8/9 PASS on BGE WordPiece audit cases.
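To see why a literal `[CLS]` does not survive normalization, here is a simplified sketch of BERT-style basic tokenization in Python. It is modeled on the punctuation rule of the reference BERT basic tokenizer (lowercasing, whitespace split, each punctuation character isolated) and omits accent stripping and CJK handling. It is an assumption that lattice's normalizer behaves the same way in this respect; the function names are hypothetical.

```python
import unicodedata
from typing import List


def is_punctuation(ch: str) -> bool:
    # ASCII symbol ranges plus Unicode "P*" categories, following the
    # rule used by the reference BERT basic tokenizer.
    cp = ord(ch)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def basic_split(text: str) -> List[str]:
    """Lowercase, split on whitespace, and isolate each punctuation char."""
    pieces: List[str] = []
    for word in text.lower().split():
        current = ""
        for ch in word:
            if is_punctuation(ch):
                if current:
                    pieces.append(current)
                    current = ""
                pieces.append(ch)
            else:
                current += ch
        if current:
            pieces.append(current)
    return pieces


print(basic_split("[CLS] hello"))  # ['[', 'cls', ']', 'hello']
```

The brackets count as punctuation, so `[CLS]` breaks into `[`, `cls`, and `]` before WordPiece ever sees it. Matching added tokens on the raw text first avoids this.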

Result

BGE audit cases: 8/9 → 9/9 PASS
No changes to BPE/SP: they would benefit from the same pattern, but no failures were measured (noted as a follow-up)

Stack

Base: #107 (PR3 Qwen BPE)
Umbrella: #104

🤖 Generated with Claude Code

Parse added_tokens from tokenizer.json and match them in raw text before BERT normalization. Special tokens like [CLS] and [SEP] appearing literally in input text are now recognized as whole tokens instead of being split character-by-character through the normalizer. BGE (bge-small-en-v1.5) parity moves from 8/9 to 9/9.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
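As an illustration of the parsing step the commit message mentions, the sketch below reads the top-level `added_tokens` array of a Hugging Face style `tokenizer.json` into structured records. It is written in Python for brevity and is not the lattice code: `AddedTokenEntry`, `parse_added_tokens`, and the `SAMPLE` document are hypothetical, and the `False` fallbacks for missing flags are an assumption.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AddedTokenEntry:
    id: int
    content: str
    lstrip: bool
    rstrip: bool
    normalized: bool
    single_word: bool


def parse_added_tokens(tokenizer_json: str) -> List[AddedTokenEntry]:
    """Read the top-level "added_tokens" array, keeping all metadata flags."""
    doc = json.loads(tokenizer_json)
    entries: List[AddedTokenEntry] = []
    for raw in doc.get("added_tokens") or []:
        entries.append(
            AddedTokenEntry(
                id=int(raw["id"]),
                content=raw["content"],
                # Missing flags fall back to False (an assumption).
                lstrip=bool(raw.get("lstrip", False)),
                rstrip=bool(raw.get("rstrip", False)),
                normalized=bool(raw.get("normalized", False)),
                single_word=bool(raw.get("single_word", False)),
            )
        )
    return entries


# Illustrative input only; a real tokenizer.json has many more fields.
SAMPLE = """
{
  "added_tokens": [
    {"id": 101, "content": "[CLS]", "single_word": false, "lstrip": false,
     "rstrip": false, "normalized": false, "special": true},
    {"id": 102, "content": "[SEP]", "single_word": false, "lstrip": false,
     "rstrip": false, "normalized": false, "special": true}
  ]
}
"""

tokens = parse_added_tokens(SAMPLE)
print([t.content for t in tokens])  # ['[CLS]', '[SEP]']
```

Keeping the flags alongside `content`, rather than a bare list of strings, is what lets a later matching pass decide whether a token should be matched against raw or normalized text.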

ohdearquant · 2026-05-25T20:49:16Z

Subsumed by #104 merge (umbrella PR brought all 11 PRs' content to main in one merge commit after stacked-PR base branches collapsed). Codex round-1 findings tracked in #116. Closing as superseded.

This was referenced May 25, 2026

fix(inference): WordPiece CJK character splitting via Hiragana/Katakana NFD fold #109

Closed

embed-perf-quality show (umbrella draft — slice into ordered PRs below) #104

Merged

embed-perf-quality codex review follow-ups (PR #105-#115) #116

Open

ohdearquant closed this May 25, 2026

ohdearquant wants to merge 1 commit into pr-embedperf-03-qwen-bpe-eos from pr-embedperf-04-wp-addedtoken
