Refresh encoder benchmarks and add per-variant tokenizer parity #27

Merged
gabewillen merged 3 commits into main from feat/encoder-bench-parity-refresh
Mar 2, 2026

@gabewillen

Contributor

Summary

  • add one benchmark source file per text encoder variant under tools/bench/text/encoders
  • compile encoder benchmarks by default and update benchmark domain mappings for new source locations
  • regenerate benchmark outputs/docs via quality gates (docs/benchmarks.md, bench snapshots, timing snapshot)
  • add tokenizer parity coverage in tools/paritychecker by loading model vocab from GGUF and comparing token streams against llama.cpp
  • split tokenizer parity into per-variant source files (spm, bpe, wpm, ugm, rwkv, plamo2, fallback) and wire dispatch by tokenizer model id
  • include tokenizer parity sources in both paritychecker and paritychecker_tests
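The token-stream comparison mentioned above can be pictured with a minimal sketch. The helper name `first_token_mismatch` and the use of `std::int32_t` token ids are assumptions made for illustration; they are not taken from this PR or from llama.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper (not from the PR): returns the index of the first
// position where the two token streams differ, or std::nullopt if they match.
// A length difference counts as a mismatch at the end of the shorter stream.
std::optional<std::size_t> first_token_mismatch(
    const std::vector<std::int32_t>& emel_tokens,
    const std::vector<std::int32_t>& reference_tokens) {
  const std::size_t common =
      std::min(emel_tokens.size(), reference_tokens.size());
  for (std::size_t i = 0; i < common; ++i) {
    if (emel_tokens[i] != reference_tokens[i]) {
      return i;
    }
  }
  if (emel_tokens.size() != reference_tokens.size()) {
    return common;
  }
  return std::nullopt;
}

// Small usage example with made-up token ids.
inline void first_token_mismatch_example() {
  assert(!first_token_mismatch({1, 2, 3}, {1, 2, 3}).has_value());
  assert(first_token_mismatch({1, 2, 3}, {1, 9, 3}).value() == 1);
}
```

Reporting the first differing index, rather than a plain pass/fail, would make a parity failure easier to trace back to the input text that triggered it.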

Validation

  • scripts/quality_gates.sh (pass)
  • ctest --test-dir build/paritychecker -R paritychecker_tests --output-on-failure (pass)

Copilot AI review requested due to automatic review settings March 2, 2026 04:41

Copilot AI left a comment

Pull request overview

This PR refreshes the benchmark harness for text encoders and upgrades tools/paritychecker to perform tokenizer parity by loading vocab data from GGUF and comparing token streams against llama.cpp, with per-tokenizer-variant dispatch.

Changes:

  • Added per-encoder benchmark sources under tools/bench/text/encoders and wired them into the bench runner by default.
  • Implemented tokenizer parity in tools/paritychecker by mapping llama.cpp vocab structures into emel::model::data::vocab, then running per-variant parity checks.
  • Removed the EMEL_ENABLE_TENSOR_PARSER_TEXT_MACHINES build option and made related sources/tests/docs generation unconditional, then regenerated benchmark docs/snapshots.
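As a rough illustration of per-variant dispatch, the sketch below maps a tokenizer model id to the name of a parity variant, with unknown or absent ids routed to the fallback path. The enum `tokenizer_model` and the function `parity_variant_for` are hypothetical names chosen for this example; the actual types and entrypoints in tools/paritychecker may look different.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical tokenizer model id (an assumption; not the PR's actual type).
enum class tokenizer_model { none, spm, bpe, wpm, ugm, rwkv, plamo2, unknown };

// Hypothetical dispatch: picks the parity variant for a tokenizer model id.
// Ids without a dedicated variant fall through to "fallback".
constexpr std::string_view parity_variant_for(tokenizer_model model) {
  switch (model) {
    case tokenizer_model::spm:    return "spm";
    case tokenizer_model::bpe:    return "bpe";
    case tokenizer_model::wpm:    return "wpm";
    case tokenizer_model::ugm:    return "ugm";
    case tokenizer_model::rwkv:   return "rwkv";
    case tokenizer_model::plamo2: return "plamo2";
    case tokenizer_model::none:
    case tokenizer_model::unknown:
      break;
  }
  return "fallback";
}

// Small usage example.
inline void parity_variant_example() {
  assert(parity_variant_for(tokenizer_model::bpe) == "bpe");
  assert(parity_variant_for(tokenizer_model::none) == "fallback");
}
```

Keeping the mapping in a single `switch` without a `default` label lets most compilers warn when a new enumerator is added without a matching variant.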

Reviewed changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tools/paritychecker/tokenizer_parity.hpp | Declares shared tokenizer parity runner and per-variant entrypoints. |
| tools/paritychecker/tokenizer_parity_common.cpp | Implements shared tokenization + token-stream comparison logic for parity runs. |
| tools/paritychecker/tokenizer_spm_parity.cpp | Wires SPM variant parity to the shared runner. |
| tools/paritychecker/tokenizer_bpe_parity.cpp | Wires BPE variant parity to the shared runner. |
| tools/paritychecker/tokenizer_wpm_parity.cpp | Wires WPM variant parity to the shared runner. |
| tools/paritychecker/tokenizer_ugm_parity.cpp | Wires UGM variant parity to the shared runner. |
| tools/paritychecker/tokenizer_rwkv_parity.cpp | Wires RWKV variant parity to the shared runner. |
| tools/paritychecker/tokenizer_plamo2_parity.cpp | Wires PLAMO2 variant parity to the shared runner. |
| tools/paritychecker/tokenizer_fallback_parity.cpp | Adds fallback parity path for unknown/none tokenizer model ids. |
| tools/paritychecker/parity_runner.cpp | Implements vocab-only llama model loading, llama→emel vocab mapping, and variant dispatch for tokenizer parity. |
| tools/paritychecker/paritychecker_tests.cpp | Makes paritychecker tests assert execution rather than skipping. |
| tools/paritychecker/CMakeLists.txt | Adds tokenizer parity sources to both paritychecker and paritychecker_tests targets. |
| tools/bench/text/encoders/bench_common.hpp | Introduces shared helpers for encoder benchmark cases (vocab building, encode runner, measuring). |
| tools/bench/text/encoders/bpe_bench.cpp | Adds BPE encoder benchmark cases. |
| tools/bench/text/encoders/spm_bench.cpp | Adds SPM encoder benchmark cases. |
| tools/bench/text/encoders/wpm_bench.cpp | Adds WPM encoder benchmark cases. |
| tools/bench/text/encoders/ugm_bench.cpp | Adds UGM encoder benchmark cases. |
| tools/bench/text/encoders/rwkv_bench.cpp | Adds RWKV encoder benchmark cases. |
| tools/bench/text/encoders/plamo2_bench.cpp | Adds PLAMO2 encoder benchmark cases. |
| tools/bench/text/encoders/fallback_bench.cpp | Adds fallback encoder benchmark cases. |
| tools/bench/bench_cases.hpp | Declares encoder benchmark append functions for EMEL and reference paths. |
| tools/bench/bench_main.cpp | Registers encoder benchmark cases (and removes conditional compilation gating). |
| tools/bench/CMakeLists.txt | Builds text benchmarks by default (and removes the prior gating option/compile defs). |
| tools/docsgen/CMakeLists.txt | Removes header filtering based on the removed build option. |
| CMakeLists.txt | Removes EMEL_ENABLE_TENSOR_PARSER_TEXT_MACHINES and makes related tests/fuzz targets unconditional. |
| scripts/fuzz_smoke.sh | Stops passing the removed CMake option. |
| tests/text/encoders/test_support.hpp | Adds [[maybe_unused]] to helpers to reduce unused warnings. |
| src/emel/text/encoders/ugm/detail.hpp | Adjusts a local replacement array storage specifier. |
| src/emel/text/encoders/plamo2/detail.hpp | Refactors conditional selection into loop-based assignments. |
| docs/benchmarks.md | Regenerates benchmark documentation output with new benchmark set/locations. |
| snapshots/bench/benchmarks.txt | Updates benchmark snapshot output. |
| snapshots/bench/benchmarks_compare.txt | Updates benchmark compare snapshot output. |
| snapshots/quality_gates/timing.txt | Updates quality gate timing snapshot. |

Comment thread tools/bench/text/encoders/bpe_bench.cpp
@gabewillen gabewillen merged commit 04d6d6d into main Mar 2, 2026

2 participants