test(embed): end-to-end tokenizer parity through embedding service (P0-E1)#111

Closed
ohdearquant wants to merge 1 commit into pr-embedperf-06-sp-trailing-ws from pr-embedperf-07-tokenizer-e2e-test

@ohdearquant
Owner

Layer

L2 — test addition (PR7 of 11)

What

Adds crates/embed/tests/tokenizer_parity_e2e.rs with 3 tests (BGE/WordPiece, E5/SentencePiece, Qwen/BPE). Each test calls load_tokenizer() directly, which is the same code path NativeEmbeddingService uses, with the same model configs, and asserts that the resulting token IDs match HF reference values.
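
For illustration, a token-ID parity check of this kind could be sketched as below. This is a minimal sketch, not the code in tokenizer_parity_e2e.rs: `first_mismatch` and `assert_token_parity` are hypothetical helper names, and the IDs passed in would come from load_tokenizer() output and the HF reference, neither of which is reproduced here.

```rust
/// Hypothetical helper (not from the PR): returns the first index at which
/// two token-ID sequences diverge, or `None` when they are identical.
/// A length difference counts as a divergence at the end of the shorter one.
fn first_mismatch(actual: &[u32], expected: &[u32]) -> Option<usize> {
    let common = actual.len().min(expected.len());
    for i in 0..common {
        if actual[i] != expected[i] {
            return Some(i);
        }
    }
    if actual.len() != expected.len() {
        Some(common)
    } else {
        None
    }
}

/// Hypothetical assertion wrapper: panics with the model name and the
/// diverging position so a failing parity test points at the exact token.
fn assert_token_parity(model: &str, actual: &[u32], expected: &[u32]) {
    if let Some(i) = first_mismatch(actual, expected) {
        panic!(
            "{model}: token IDs diverge at index {i}: got {:?}, expected {:?}",
            actual.get(i),
            expected.get(i)
        );
    }
}
```

Reporting the first diverging index, rather than only comparing whole vectors, tends to make BOS/EOS or trailing-whitespace regressions easier to spot, since those show up at the start or end of the sequence.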

Why

Confirms that the tokenizer fixes from PR2-PR6 flow through the embed service without requiring model weights. Each test skips with an explicit message when the tokenizer JSON is absent; no stubs are used.
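
The skip-when-absent behaviour could look roughly like the following. This is a sketch under assumptions: `tokenizer_or_skip` is a hypothetical helper, and the `tokenizer.json` file name follows the usual HF snapshot layout rather than anything stated in this PR.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper (not from the PR): returns the tokenizer path when the
/// file exists in the snapshot directory; otherwise prints an explicit skip
/// message to stderr and returns `None` so the test can return early.
fn tokenizer_or_skip(test_name: &str, snapshot_dir: &Path) -> Option<PathBuf> {
    // Assumption: the tokenizer definition is stored as `tokenizer.json`.
    let path = snapshot_dir.join("tokenizer.json");
    if path.is_file() {
        Some(path)
    } else {
        eprintln!("SKIP {test_name}: {} not found", path.display());
        None
    }
}
```

A test body would then begin with `let Some(path) = tokenizer_or_skip("bge_parity", dir) else { return; };`, so a missing snapshot produces a visible message instead of a silent pass or a stubbed tokenizer.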

Result

  • All 3 e2e tests pass on this stack
  • No model weights required to exercise (CI-friendly)

Stack

Base: #110 (PR6 SP trailing-ws)
Umbrella: #104

🤖 Generated with Claude Code

…0-E1)

Confirms the tokenizer fixes from impl-tokenizer-fixes (SentencePiece
BOS/EOS, Qwen BPE EOS, AddedToken longest-match) flow through the
load_tokenizer path that NativeEmbeddingService uses. Three tests cover
BGE/WordPiece, E5/SentencePiece, and Qwen/BPE at the token-ID level so
no model weights are needed; tests skip when the HF snapshot is absent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ohdearquant
Owner Author

Subsumed by the #104 merge (the umbrella PR brought the content of all 11 PRs to main in one merge commit after the stacked-PR base branches collapsed). Codex round-1 findings are tracked in #116. Closing as superseded.
