Refresh encoder benchmarks and add per-variant tokenizer parity by gabewillen · Pull Request #27 · stateforward/emel.cpp

gabewillen · 2026-03-02T04:41:32Z

Summary

add one benchmark source file per text encoder variant under tools/bench/text/encoders
compile encoder benchmarks by default and update benchmark domain mappings for new source locations
regenerate benchmark outputs/docs via quality gates (docs/benchmarks.md, bench snapshots, timing snapshot)
add tokenizer parity coverage in tools/paritychecker by loading model vocab from GGUF and comparing token streams against llama.cpp
split tokenizer parity into per-variant source files (spm, bpe, wpm, ugm, rwkv, plamo2, fallback) and wire dispatch by tokenizer model id
include tokenizer parity sources in both paritychecker and paritychecker_tests
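
The token-stream comparison described above can be illustrated with a minimal sketch. This is not the code from `tools/paritychecker`; the `token_id` alias and the `first_mismatch` helper are hypothetical names chosen for illustration, and the real runner may report differences in another way.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical token id type; the types used by emel and llama.cpp may differ.
using token_id = std::int32_t;

// Returns the index of the first position where the two token streams diverge,
// or std::nullopt when they are identical (same length and same ids).
std::optional<std::size_t> first_mismatch(const std::vector<token_id>& actual,
                                          const std::vector<token_id>& expected) {
  const std::size_t common = std::min(actual.size(), expected.size());
  for (std::size_t i = 0; i < common; ++i) {
    if (actual[i] != expected[i]) {
      return i;  // ids differ at position i
    }
  }
  if (actual.size() != expected.size()) {
    return common;  // one stream is a strict prefix of the other
  }
  return std::nullopt;  // streams match
}
```

Reporting the first diverging index, rather than a plain boolean, makes a parity failure easier to trace back to the input text that produced it.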

Validation

scripts/quality_gates.sh (pass)
ctest --test-dir build/paritychecker -R paritychecker_tests --output-on-failure (pass)

Copilot

Pull request overview

This PR refreshes the benchmark harness for text encoders and upgrades tools/paritychecker to perform tokenizer parity by loading vocab data from GGUF and comparing token streams against llama.cpp, with per-tokenizer-variant dispatch.

Changes:

Added per-encoder benchmark sources under tools/bench/text/encoders and wired them into the bench runner by default.
Implemented tokenizer parity in tools/paritychecker by mapping llama.cpp vocab structures into emel::model::data::vocab, then running per-variant parity checks.
Removed the EMEL_ENABLE_TENSOR_PARSER_TEXT_MACHINES build option and made related sources/tests/docs generation unconditional, then regenerated benchmark docs/snapshots.
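
As a rough illustration of per-variant dispatch by tokenizer model id, the sketch below maps an id string to one of the variants listed in the PR. The enum, the function name, and the id strings are all assumptions made for this example; the actual dispatch in `parity_runner.cpp` may key on a different type or on different identifiers.

```cpp
#include <string_view>

// Hypothetical variant tags mirroring the per-variant parity sources in the PR.
enum class tokenizer_variant { spm, bpe, wpm, ugm, rwkv, plamo2, fallback };

// Maps a tokenizer model id to a variant. The id strings are illustrative
// assumptions, not the identifiers used by emel or stored in GGUF files.
tokenizer_variant select_variant(std::string_view model_id) {
  if (model_id == "spm") return tokenizer_variant::spm;
  if (model_id == "bpe") return tokenizer_variant::bpe;
  if (model_id == "wpm") return tokenizer_variant::wpm;
  if (model_id == "ugm") return tokenizer_variant::ugm;
  if (model_id == "rwkv") return tokenizer_variant::rwkv;
  if (model_id == "plamo2") return tokenizer_variant::plamo2;
  // Unknown or "none" ids take the fallback path.
  return tokenizer_variant::fallback;
}
```

Routing every unrecognized id to a fallback variant keeps the dispatch total, which matches the PR's description of a fallback parity path for unknown/none tokenizer model ids.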

Reviewed changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 1 comment.

File	Description
tools/paritychecker/tokenizer_parity.hpp	Declares shared tokenizer parity runner and per-variant entrypoints.
tools/paritychecker/tokenizer_parity_common.cpp	Implements shared tokenization + token-stream comparison logic for parity runs.
tools/paritychecker/tokenizer_spm_parity.cpp	Wires SPM variant parity to the shared runner.
tools/paritychecker/tokenizer_bpe_parity.cpp	Wires BPE variant parity to the shared runner.
tools/paritychecker/tokenizer_wpm_parity.cpp	Wires WPM variant parity to the shared runner.
tools/paritychecker/tokenizer_ugm_parity.cpp	Wires UGM variant parity to the shared runner.
tools/paritychecker/tokenizer_rwkv_parity.cpp	Wires RWKV variant parity to the shared runner.
tools/paritychecker/tokenizer_plamo2_parity.cpp	Wires PLAMO2 variant parity to the shared runner.
tools/paritychecker/tokenizer_fallback_parity.cpp	Adds fallback parity path for unknown/none tokenizer model ids.
tools/paritychecker/parity_runner.cpp	Implements vocab-only llama model loading, llama→emel vocab mapping, and variant dispatch for tokenizer parity.
tools/paritychecker/paritychecker_tests.cpp	Makes paritychecker tests assert execution rather than skipping.
tools/paritychecker/CMakeLists.txt	Adds tokenizer parity sources to both `paritychecker` and `paritychecker_tests` targets.
tools/bench/text/encoders/bench_common.hpp	Introduces shared helpers for encoder benchmark cases (vocab building, encode runner, measuring).
tools/bench/text/encoders/bpe_bench.cpp	Adds BPE encoder benchmark cases.
tools/bench/text/encoders/spm_bench.cpp	Adds SPM encoder benchmark cases.
tools/bench/text/encoders/wpm_bench.cpp	Adds WPM encoder benchmark cases.
tools/bench/text/encoders/ugm_bench.cpp	Adds UGM encoder benchmark cases.
tools/bench/text/encoders/rwkv_bench.cpp	Adds RWKV encoder benchmark cases.
tools/bench/text/encoders/plamo2_bench.cpp	Adds PLAMO2 encoder benchmark cases.
tools/bench/text/encoders/fallback_bench.cpp	Adds fallback encoder benchmark cases.
tools/bench/bench_cases.hpp	Declares encoder benchmark append functions for EMEL and reference paths.
tools/bench/bench_main.cpp	Registers encoder benchmark cases (and removes conditional compilation gating).
tools/bench/CMakeLists.txt	Builds text benchmarks by default (and removes the prior gating option/compile defs).
tools/docsgen/CMakeLists.txt	Removes header filtering based on the removed build option.
CMakeLists.txt	Removes `EMEL_ENABLE_TENSOR_PARSER_TEXT_MACHINES` and makes related tests/fuzz targets unconditional.
scripts/fuzz_smoke.sh	Stops passing the removed CMake option.
tests/text/encoders/test_support.hpp	Adds `[[maybe_unused]]` to helpers to reduce unused warnings.
src/emel/text/encoders/ugm/detail.hpp	Adjusts a local replacement array storage specifier.
src/emel/text/encoders/plamo2/detail.hpp	Refactors conditional selection into loop-based assignments.
docs/benchmarks.md	Regenerates benchmark documentation output with new benchmark set/locations.
snapshots/bench/benchmarks.txt	Updates benchmark snapshot output.
snapshots/bench/benchmarks_compare.txt	Updates benchmark compare snapshot output.
snapshots/quality_gates/timing.txt	Updates quality gate timing snapshot.
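
The table mentions a shared measuring helper in `tools/bench/text/encoders/bench_common.hpp`. A minimal sketch of what such a helper might look like follows; `measure_ns_per_iter` is a hypothetical name, and the real helper may use warm-up runs, different clocks, or different statistics.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: calls `fn` exactly `iterations` times and returns the
// mean wall-clock time per call in nanoseconds (0.0 when iterations == 0).
template <class Fn>
double measure_ns_per_iter(Fn&& fn, std::size_t iterations) {
  if (iterations == 0) {
    return 0.0;
  }
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    fn();
  }
  const auto stop = std::chrono::steady_clock::now();
  const auto total =
      std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
  return static_cast<double>(total.count()) / static_cast<double>(iterations);
}
```

`std::chrono::steady_clock` is used here because it is monotonic, so the measured interval is not affected by system clock adjustments.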

bench/parity: refresh encoder benches and split tokenizer parity by variant

c3b219a

Copilot AI review requested due to automatic review settings March 2, 2026 04:41

Copilot started reviewing on behalf of gabewillen March 2, 2026 04:42

Copilot AI reviewed Mar 2, 2026

Comment thread tools/bench/text/encoders/bpe_bench.cpp

gabewillen added 2 commits March 1, 2026 23:12

bench: split machine benchmarks by domain and refresh generated outputs

cc0d300

bench: fix bpe encoder cases and skip unsupported kernel arch

442c197

gabewillen merged commit 04d6d6d into main Mar 2, 2026

gabewillen merged 3 commits into main from feat/encoder-bench-parity-refresh
