
fix(inference): preserve AddedToken metadata for WordPiece pre-splitting #108

Closed
ohdearquant wants to merge 1 commit into pr-embedperf-03-qwen-bpe-eos from pr-embedperf-04-wp-addedtoken

Conversation

@ohdearquant
Owner

Layer

L1 — tokenizer fix (PR4 of 11)

What

  • Replaces the WordPiece tokenizer's content-only added-token parsing with a structured AddedToken carrying the content, lstrip, rstrip, normalized, and single_word fields
  • Adds a longest-match lookup pass before BERT normalization: the input is split into segments at added-token matches, and each remaining segment then goes through the normal WordPiece flow
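
For readers who want a concrete picture of the two bullets above, here is a minimal Python sketch of the idea. It is not the code from this PR: the function names `parse_added_tokens` and `split_on_added_tokens` are hypothetical, the field defaults are assumptions, and the sketch matches on `content` only, leaving out whatever handling the real change applies for lstrip, rstrip, normalized, and single_word.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AddedToken:
    # Field names follow the PR description; the defaults are assumptions.
    content: str
    lstrip: bool = False
    rstrip: bool = False
    normalized: bool = False
    single_word: bool = False


def parse_added_tokens(tokenizer_json: str) -> list[AddedToken]:
    """Read structured entries from the top-level `added_tokens` array."""
    entries = json.loads(tokenizer_json).get("added_tokens", [])
    return [
        AddedToken(
            content=e["content"],
            lstrip=e.get("lstrip", False),
            rstrip=e.get("rstrip", False),
            normalized=e.get("normalized", False),
            single_word=e.get("single_word", False),
        )
        for e in entries
    ]


def split_on_added_tokens(
    text: str, tokens: list[AddedToken]
) -> list[tuple[str, bool]]:
    """Split raw text into (segment, is_added_token) pairs.

    Candidates are tried longest first, so the longest token that matches
    at a given position wins. Segments flagged False would go on to
    normalization and WordPiece; segments flagged True are kept whole.
    """
    ordered = sorted(
        (t.content for t in tokens if t.content), key=len, reverse=True
    )
    out: list[tuple[str, bool]] = []
    seg_start = 0  # start of the plain-text run not yet emitted
    i = 0
    while i < len(text):
        match = next((c for c in ordered if text.startswith(c, i)), None)
        if match is None:
            i += 1
            continue
        if seg_start < i:
            out.append((text[seg_start:i], False))
        out.append((match, True))
        i += len(match)
        seg_start = i
    if seg_start < len(text):
        out.append((text[seg_start:], False))
    return out
```

With `[CLS]` and `[SEP]` registered, the input `"[CLS] hello [SEP] world"` splits into `[CLS]` (added token), `" hello "` (plain), `[SEP]` (added token), and `" world"` (plain). The linear scan over candidates is chosen for readability; a trie or Aho-Corasick matcher would be the usual choice when the added-token list is large.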

Why

Audit gap: special tokens ([CLS], [SEP]) appearing literally in input text were split by BERT normalization instead of being matched as whole tokens. Before the fix, lattice passed 8 of 9 BGE WordPiece audit cases.
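
To illustrate the failure mode, the sketch below approximates BERT-style pre-tokenization (lowercase, split on whitespace, isolate punctuation). The function `basic_split` is a hypothetical simplification written for this example, not the normalizer in this repository, and it treats only Unicode punctuation categories as split points.

```python
import unicodedata


def basic_split(text: str) -> list[str]:
    """Simplified BERT-style splitting: lowercase, then isolate punctuation."""
    pieces: list[str] = []
    for word in text.lower().split():
        current = ""
        for ch in word:
            # Unicode general categories starting with "P" are punctuation.
            if unicodedata.category(ch).startswith("P"):
                if current:
                    pieces.append(current)
                    current = ""
                pieces.append(ch)
            else:
                current += ch
        if current:
            pieces.append(current)
    return pieces
```

Under these assumptions, `basic_split("[CLS] hi")` yields `["[", "cls", "]", "hi"]`: the brackets are isolated as punctuation and the token name is lowercased, so the literal `[CLS]` can no longer be looked up as a single vocabulary entry. Matching added tokens before this step avoids that.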

Result

  • BGE audit cases: 8/9 → 9/9 PASS
  • No changes to BPE/SP: they would benefit from the same pattern, but no failures were measured (noted as follow-up)

Stack

Base: #107 (PR3 Qwen BPE)
Umbrella: #104

🤖 Generated with Claude Code

Parse added_tokens from tokenizer.json and match them in raw text before
BERT normalization. Special tokens like [CLS] and [SEP] appearing
literally in input text are now recognized as whole tokens instead of
being split character-by-character through the normalizer.

BGE (bge-small-en-v1.5) parity moves from 8/9 to 9/9.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ohdearquant
Owner Author

Subsumed by #104 merge (umbrella PR brought all 11 PRs' content to main in one merge commit after stacked-PR base branches collapsed). Codex round-1 findings tracked in #116. Closing as superseded.
