perf: regexp2 pretokenizer, pooled buffers, offset-free fast encode path by aeryncaen · Pull Request #80 · sugarme/tokenizer

aeryncaen · 2026-02-23T07:40:27Z

Summary

This PR improves tokenization throughput with three complementary changes:

Switch RegexpPattern from stdlib regexp to dlclark/regexp2 — enables
lookahead/lookbehind syntax required by modern tokenizers (GPT-4, Qwen, Llama 3)
and provides significantly faster matching on complex pretokenizer patterns.
Pooled scratch buffers and ASCII fast-path in the regexp2 matching layer —
reduces per-call allocations for rune→byte mapping and match index collection.
EncodeIDsOnly API and offset-free NormalizedString — a new
Tokenizer.EncodeIDsOnly(string) ([]int, error) method that skips all
per-byte alignment tracking and Encoding struct construction, returning
only token IDs. This is the appropriate choice for any workload that does
not need character-level offset mappings (training data pipelines, inference
preprocessing, token counting, search indexing, etc.).

Additional changes

ByteLevel.UseRegex field + SetUseRegex() + use_regex JSON config —
honors the HuggingFace use_regex option so that when a prior Split
pretokenizer already handles regex splitting, ByteLevel can skip its
redundant GPT-2 split pass.
Sequence.PreTokenizers() accessor — exposes the underlying pretokenizer
slice for pipeline introspection.
invert() pre-allocates its result slice.

New public API

Symbol	Package	Description
`Tokenizer.EncodeIDsOnly(input string) ([]int, error)`	`tokenizer`	Fast token-IDs-only encode; skips offset tracking and Encoding construction. Returns the same IDs as `EncodeSingle(input).Ids` (with `addSpecialTokens=false`, no truncation/padding).
`PreTokenizedString.IntoIDs() ([]int, error)`	`tokenizer`	Collects token IDs from a tokenized `PreTokenizedString` without building an `Encoding`.
`NewPreTokenizedStringFast(s string)`	`tokenizer`	Creates a `PreTokenizedString` backed by offset-free `NormalizedString`.
`NewNormalizedFromFast(s string)`	`normalizer`	Creates a `NormalizedString` that skips alignment array allocation. All mutation operations (Split, Transform, Prepend, etc.) still produce correct normalized text.
`AddedVocabulary.ExtractAndNormalizeFast(...)`	`tokenizer`	Offset-free variant of `ExtractAndNormalize`.
`ByteLevel.UseRegex` / `SetUseRegex(bool)`	`pretokenizer`	Controls whether ByteLevel applies its built-in GPT-2 regex split.
`Sequence.PreTokenizers()`	`pretokenizer`	Returns the underlying pretokenizer slice.

Benchmark

Setup: Single-threaded EncodeSingle / EncodeIDsOnly over 60,000 JSONL
documents on an Apple M4 Max. Median of 3 runs (Qwen3 before: 1 run due to
its 108 s runtime).

Qwen3 pretokenizer (complex GPT-style regex split + ByteLevel)

The Qwen3 tokenizer uses a complex regex pattern with Unicode character classes
(\p{L}, \p{N}) and alternations. The original pattern also contains a
negative lookahead (\s+(?!\S)) which is unsupported by Go's stdlib regexp.
For the baseline comparison, we used a simplified variant with the lookahead
removed (\s+(?!\S)|\s+ → \s+|\s+), since the lookahead branch is
redundant with the following \s+ alternative. This simplified pattern runs on
both the upstream stdlib regexp engine and our regexp2 engine.

Note: the token count differs between before and after (2,732,116 vs 2,462,116)
because regexp2 has more accurate .NET-style semantics for inline
case-insensitive groups ((?i:...)) and Unicode character classes compared to
Go's RE2-based regexp.

Method	tok/s	vs baseline
Before `EncodeSingle` (upstream, stdlib `regexp`)	25,289	1.0×
After `EncodeSingle` (`regexp2` + pools)	668,000	26×
After `EncodeIDsOnly` (`regexp2` + pools + skip offsets)	1,400,000	55×

BERT pretokenizer (simple `BertPreTokenizer`, no regex split)

Included to show the improvement on a lightweight pretokenizer where regex
matching is not the bottleneck — the gains here come from pooled buffers and
the offset-free path.

Method	tok/s	vs baseline
Before `EncodeSingle` (upstream)	593,000	1.0×
After `EncodeSingle`	693,000	1.17×
After `EncodeIDsOnly`	1,719,000	2.9×

Compatibility

Backward-compatible: EncodeSingle and all existing APIs are unchanged
in behavior. The regexp2 engine is a strict superset of stdlib regexp
for the patterns used by HuggingFace tokenizers.
New dependency: github.com/dlclark/regexp2 v1.11.5 (stock, no fork).
No replace directives in go.mod.

aeryncaen · 2026-02-23T07:42:15Z

...Claude is a moron and closed the PR for some reason.

AddTokensWithIds registers tokens with explicit IDs from tokenizer.json instead of recomputing them as model.GetVocabSize()+offset. This fixes compacted vocabularies where added token IDs are non-sequential. - AddedVocabulary.AddTokensWithIds: register with specified IDs - Tokenizer.AddTokensWithIds + GetAddedVocab: public API - CreateAddedTokensWithIds: preserves Id from TokenConfig - FromReader: uses ID-preserving loading path

aeryncaen closed this Feb 23, 2026

aeryncaen reopened this Feb 23, 2026

perf: regexp2 pretokenizer, pooled buffers, offset-free fast encode path

36a1f5b

aeryncaen force-pushed the perf/qwen-pretokenizer-regexp2 branch from 5df3025 to 36a1f5b Compare February 23, 2026 07:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: regexp2 pretokenizer, pooled buffers, offset-free fast encode path#80

perf: regexp2 pretokenizer, pooled buffers, offset-free fast encode path#80
aeryncaen wants to merge 2 commits intosugarme:masterfrom
aeryncaen:perf/qwen-pretokenizer-regexp2

aeryncaen commented Feb 23, 2026

Uh oh!

aeryncaen commented Feb 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aeryncaen commented Feb 23, 2026

Summary

Additional changes

New public API

Benchmark

Qwen3 pretokenizer (complex GPT-style regex split + ByteLevel)

BERT pretokenizer (simple BertPreTokenizer, no regex split)

Compatibility

Uh oh!

aeryncaen commented Feb 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

BERT pretokenizer (simple `BertPreTokenizer`, no regex split)