fix(inference): inject Qwen BPE EOS and align GPT-4 pre-tokenizer regex #107

Closed
ohdearquant wants to merge 1 commit into pr-embedperf-02-sp-bos-eos from pr-embedperf-03-qwen-bpe-eos

Conversation

@ohdearquant
Owner

Layer

L1 — tokenizer fix (PR3 of 11)

What

  • Parses TemplateProcessing for Qwen BPE; injects EOS id 151643 at encode boundary
  • Aligns pre-tokenizer regex with GPT-4-style behavior for URL/punctuation cases (leading punctuation attaches to following letter run)
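The EOS injection described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: it assumes the `post_processor` entry of `tokenizer.json` follows the usual Hugging Face `TemplateProcessing` layout (`single` template plus a `special_tokens` map), and the token name `<|endoftext|>` and the helper `apply_single_template` are hypothetical. Only the id 151643 comes from this PR.

```python
from typing import List

# Assumed shape of the "post_processor" entry in tokenizer.json.
# The token name is illustrative; only the id 151643 is taken from the PR.
POST_PROCESSOR = {
    "type": "TemplateProcessing",
    "single": [
        {"Sequence": {"id": "A", "type_id": 0}},
        {"SpecialToken": {"id": "<|endoftext|>", "type_id": 0}},
    ],
    "special_tokens": {
        "<|endoftext|>": {
            "id": "<|endoftext|>",
            "ids": [151643],
            "tokens": ["<|endoftext|>"],
        }
    },
}


def apply_single_template(post_processor: dict, ids: List[int]) -> List[int]:
    """Expand the single-sequence template around already-encoded ids."""
    if post_processor.get("type") != "TemplateProcessing":
        # Unknown post-processor: leave the encoded ids unchanged.
        return list(ids)
    special = post_processor.get("special_tokens", {})
    out: List[int] = []
    for piece in post_processor.get("single", []):
        if "Sequence" in piece:
            # Placeholder for the encoded input itself.
            out.extend(ids)
        elif "SpecialToken" in piece:
            # Look up the ids registered for this special token.
            name = piece["SpecialToken"]["id"]
            out.extend(special[name]["ids"])
    return out


print(apply_single_template(POST_PROCESSOR, [1, 2, 3]))
```

Reading the id from the template rather than hard-coding it keeps the encoder tied to whatever the model's `tokenizer.json` declares.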

Why

Audit gap: pre-fix lattice produced 0/10 PASS on Qwen BPE cases (9 EOS-only + 1 URL regex divergence).

Result

  • Qwen audit cases: 0/10 → 10/10 PASS
  • The regex is hand-coded; a follow-up to consume HF's regex literally is noted in the show notes
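To illustrate the "leading punctuation attaches to following letter run" behavior, here is a rough sketch using only Python's standard `re` module. It is loosely modeled on the publicly known GPT-4-style split pattern, not on the PR's hand-coded regex, which is not shown here. Because `re` lacks `\p{L}` and `\p{N}`, the letter and number classes are approximated with `[^\W\d_]` and `\d`; the names `SPLIT_PATTERN` and `pre_tokenize` are hypothetical.

```python
import re

# Approximation of a GPT-4-style pre-tokenizer split pattern.
# [^\W\d_] stands in for \p{L}; \d stands in for \p{N} (assumption).
SPLIT_PATTERN = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"   # common English contractions
    r"|(?:[^\r\n\w]|_)?[^\W\d_]+"     # optional leading non-letter + letter run
    r"|\d{1,3}"                       # digits in groups of up to three
    r"| ?(?:[^\s\w]|_)+[\r\n]*"       # punctuation run, optional leading space
    r"|\s*[\r\n]+"                    # newlines with preceding whitespace
    r"|\s+(?!\S)"                     # trailing whitespace
    r"|\s+"                           # any remaining whitespace
)


def pre_tokenize(text: str) -> list:
    """Split text into pre-token strings before BPE merges are applied."""
    return SPLIT_PATTERN.findall(text)


print(pre_tokenize("https://example.com/path"))
# ['https', '://', 'example', '.com', '/path']
```

The second alternative is what produces `.com` and `/path` as single pre-tokens: a single non-letter character is allowed to prefix a letter run, so the punctuation is not split off on its own.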

Stack

Base: #106 (PR2 SP BOS/EOS)
Umbrella: #104

🤖 Generated with Claude Code

Parse post_processor for EOS injection (id=151643 from TemplateProcessing)
and detect regex-based Split pre-tokenizer in tokenizer.json. Implements
a hand-coded GPT-4 regex pattern that attaches leading punctuation to
following letter runs (e.g. ".com", "/path"), matching HF behavior.

Qwen3-Embedding-0.6B parity moves from 0/10 to 10/10.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ohdearquant
Owner Author

Subsumed by #104 merge (umbrella PR brought all 11 PRs' content to main in one merge commit after stacked-PR base branches collapsed). Codex round-1 findings tracked in #116. Closing as superseded.
