fix(inference): inject SentencePiece BOS/EOS from tokenizer.json post_processor #106

Merged
ohdearquant merged 2 commits into pr-embedperf-01-audit-baseline from pr-embedperf-02-sp-bos-eos on May 25, 2026

@ohdearquant (Owner)

Layer

L1 — tokenizer fix (PR2 of 11 in embed-perf-quality show)

What

Parses the post_processor field of tokenizer.json when its type is TemplateProcessing, for SentencePiece tokenizers (E5 family). Extracts the single template, detects the <s>/</s> IDs, and injects them at the SP encode boundary: after subword tokenization and before returning TokenizedInput.
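
The detect-then-inject step described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `Piece`, `Framing`, `detect_framing`, and `wrap_ids` are hypothetical names, and the template is modeled as an already-parsed list rather than read from tokenizer.json. It assumes the first special token before the sequence placeholder is BOS and the first one after it is EOS.

```rust
/// One element of a `single` template (hypothetical simplified model).
#[derive(Debug, Clone, PartialEq)]
pub enum Piece {
    /// A special token with its resolved vocab ID, e.g. `<s>` -> 0.
    Special { token: String, id: u32 },
    /// Placeholder for the tokenized input sequence.
    Sequence,
}

/// BOS/EOS IDs detected from a template; either may be absent.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub struct Framing {
    pub bos_id: Option<u32>,
    pub eos_id: Option<u32>,
}

/// Assumption: BOS is the first special token before the sequence
/// placeholder, EOS is the first special token after it.
pub fn detect_framing(single: &[Piece]) -> Framing {
    let mut framing = Framing::default();
    let mut seen_sequence = false;
    for piece in single {
        match piece {
            Piece::Sequence => seen_sequence = true,
            Piece::Special { id, .. } if !seen_sequence => {
                if framing.bos_id.is_none() {
                    framing.bos_id = Some(*id);
                }
            }
            Piece::Special { id, .. } => {
                if framing.eos_id.is_none() {
                    framing.eos_id = Some(*id);
                }
            }
        }
    }
    framing
}

/// Wrap already-tokenized payload IDs with whichever of BOS/EOS was detected.
/// The payload itself is left untouched.
pub fn wrap_ids(payload: &[u32], framing: Framing) -> Vec<u32> {
    let mut out = Vec::with_capacity(payload.len() + 2);
    out.extend(framing.bos_id); // Option<u32> yields zero or one item
    out.extend_from_slice(payload);
    out.extend(framing.eos_id);
    out
}
```

With an XLM-RoBERTa-style template (`<s>` = 0, `</s>` = 2), a payload of `[10, 11]` would come back as `[0, 10, 11, 2]`. With no template the payload is returned unchanged.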

Why

Audit gap: before this fix, lattice passed 0/9 E5 SentencePiece cases. The HF reference wraps SP outputs in BOS/EOS via the template processor.

Result

  • E5 audit cases: 0/9 → 9/9 PASS
  • No new dependencies

Stack

Base: #105 (PR1 audit baseline)
Umbrella: #104

🤖 Generated with Claude Code

…_processor

Parse TemplateProcessing post_processor to detect BOS/EOS framing.
E5 (multilingual-e5-small) parity moves from 0/9 to 9/9: payload
tokens were already correct; only <s>/</s> injection was missing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Addresses codex round-1 finding: when a SentencePiece tokenizer has both
`<s>`/`</s>` in vocab AND a TemplateProcessing post_processor that uses
different special tokens (e.g. `[CLS]`/`[SEP]`), the prior fallback order
(`bos_id.or(pp.bos_id)`) silently injected the vocab guesses instead of
the template-supplied IDs. Inverts to `pp.bos_id.or(bos_id)` so the
template IDs win when present, with vocab-name fallback only when the
template omits explicit IDs.
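
The precedence change comes down to the argument order of `Option::or`. A minimal sketch, assuming both candidates are `Option<u32>`; the helper names `resolve_id` and `resolve_id_old` are hypothetical and exist only to contrast the two orders:

```rust
/// Fixed order: the template-supplied ID wins, the vocab-name guess is
/// used only when the template gives no ID.
pub fn resolve_id(template_id: Option<u32>, vocab_id: Option<u32>) -> Option<u32> {
    template_id.or(vocab_id)
}

/// Prior order, kept here only for comparison: the vocab-name guess
/// shadows the template ID whenever `<s>`/`</s>` exist in the vocab.
pub fn resolve_id_old(template_id: Option<u32>, vocab_id: Option<u32>) -> Option<u32> {
    vocab_id.or(template_id)
}
```

For a BERT-style template with `[CLS]` = 101 and a vocab that also contains `<s>` = 0, `resolve_id(Some(101), Some(0))` yields `Some(101)`, whereas the prior order yields `Some(0)`.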

Adds two focused unit tests in `common.rs`:
- BERT-style template ([CLS]/[SEP], IDs 101/102) → verifies template IDs win
- XLM-RoBERTa-style template (<s>/</s>, IDs 0/2) → verifies the E5 path
  still resolves correctly

E5 audit case still passes (9/9 multilingual_e5_small). Workspace clippy
clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>