fix(inference): inject Qwen BPE EOS and align GPT-4 pre-tokenizer regex by ohdearquant · Pull Request #107 · ohdearquant/lattice

ohdearquant · 2026-05-25T20:19:11Z

Layer

L1 — tokenizer fix (PR3 of 11)

What

Parses TemplateProcessing for Qwen BPE; injects EOS id 151643 at encode boundary
Aligns pre-tokenizer regex with GPT-4-style behavior for URL/punctuation cases (leading punctuation attaches to following letter run)
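
For illustration only, the EOS half of this change can be sketched in Python as below. This is not the code from the PR: the JSON fragment is an assumed shape for the `post_processor` entry of a Hugging Face `tokenizer.json`, the token string `<|endoftext|>` is used as a placeholder name for id 151643, and the helper names `trailing_special_ids` and `encode_with_eos` are hypothetical.

```python
import json
from typing import List

# Assumed shape of a TemplateProcessing post_processor entry (illustrative only).
TOKENIZER_JSON = """
{
  "post_processor": {
    "type": "TemplateProcessing",
    "single": [
      {"Sequence": {"id": "A", "type_id": 0}},
      {"SpecialToken": {"id": "<|endoftext|>", "type_id": 0}}
    ],
    "special_tokens": {
      "<|endoftext|>": {
        "id": "<|endoftext|>",
        "ids": [151643],
        "tokens": ["<|endoftext|>"]
      }
    }
  }
}
"""


def trailing_special_ids(config: dict) -> List[int]:
    """Collect ids of special tokens that follow the sequence in the 'single' template."""
    post = config.get("post_processor") or {}
    if post.get("type") != "TemplateProcessing":
        return []
    ids: List[int] = []
    seen_sequence = False
    for piece in post.get("single", []):
        if "Sequence" in piece:
            seen_sequence = True
        elif "SpecialToken" in piece and seen_sequence:
            name = piece["SpecialToken"]["id"]
            ids.extend(post["special_tokens"][name]["ids"])
    return ids


def encode_with_eos(token_ids: List[int], config: dict) -> List[int]:
    """Append the template's trailing special ids (here, EOS) to already-encoded ids."""
    return list(token_ids) + trailing_special_ids(config)


config = json.loads(TOKENIZER_JSON)
print(encode_with_eos([1, 2, 3], config))
```

Reading the id from the template rather than hard-coding it keeps the encoder in step with whatever the tokenizer file declares; a config without a TemplateProcessing entry simply yields no extra ids.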

Why

Audit gap: pre-fix lattice produced 0/10 PASS on Qwen BPE cases (9 EOS-only + 1 URL regex divergence).

Result

Qwen audit cases: 0/10 → 10/10 PASS
Hand-coded regex; a follow-up to consume HF's regex literally is noted in the show notes

Stack

Base: #106 (PR2 SP BOS/EOS)
Umbrella: #104

🤖 Generated with Claude Code

Parse post_processor for EOS injection (id=151643 from TemplateProcessing) and detect regex-based Split pre-tokenizer in tokenizer.json. Implements a hand-coded GPT-4 regex pattern that attaches leading punctuation to following letter runs (e.g. ".com", "/path"), matching HF behavior. Qwen3-Embedding-0.6B parity moves from 0/10 to 10/10.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
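
As a rough sketch of the punctuation-attachment behavior described in the commit message, the following Python approximates a GPT-4-style pre-tokenizer pattern with the standard `re` module. It is an assumption-laden stand-in, not the hand-coded matcher from this PR: the standard library has no `\p{L}` or `\p{N}` classes, so `[^\W\d_]` and `\d` are used as approximations, the digit rule `\d{1,3}` follows the GPT-4 convention and may differ from what Qwen's tokenizer.json specifies, and the name `pre_tokenize` is hypothetical.

```python
import re

# Approximation of a GPT-4-style split pattern using stdlib character classes.
# LETTER approximates \p{L}; NOT_ALNUM approximates "not a letter or a number".
LETTER = r"[^\W\d_]"
NOT_ALNUM = r"(?:[^\w\s]|_)"           # punctuation and symbols, no whitespace
NOT_ALNUM_OR_NEWLINE = r"(?:[^\w\r\n]|_)"  # may be a space, but not a newline

PATTERN = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"            # English contractions
    rf"|{NOT_ALNUM_OR_NEWLINE}?{LETTER}+"      # optional leading char + letter run
    r"|\d{1,3}"                                # short digit groups (assumed rule)
    rf"| ?{NOT_ALNUM}+[\r\n]*"                 # punctuation runs
    r"|\s*[\r\n]+"                             # newlines with leading whitespace
    r"|\s+(?!\S)"                              # trailing whitespace
    r"|\s+"                                    # any remaining whitespace
)


def pre_tokenize(text: str) -> list:
    """Split text into pre-token chunks before BPE merges are applied."""
    return PATTERN.findall(text)


print(pre_tokenize("example.com/path"))
```

The second alternative is the one that matters for the URL case: a single non-letter, non-digit character is allowed to lead a letter run, so "." and "/" travel with the letters that follow them instead of forming separate punctuation chunks.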

ohdearquant · 2026-05-25T20:49:15Z

Subsumed by #104 merge (umbrella PR brought all 11 PRs' content to main in one merge commit after stacked-PR base branches collapsed). Codex round-1 findings tracked in #116. Closing as superseded.

This was referenced May 25, 2026

fix(inference): preserve AddedToken metadata for WordPiece pre-splitting #108

Closed

embed-perf-quality show (umbrella draft — slice into ordered PRs below) #104

Merged

embed-perf-quality codex review follow-ups (PR #105-#115) #116

Open

ohdearquant closed this May 25, 2026

ohdearquant wants to merge 1 commit into pr-embedperf-02-sp-bos-eos from pr-embedperf-03-qwen-bpe-eos
