perf: regexp2 pretokenizer, pooled buffers, offset-free fast encode path#80
Open
aeryncaen wants to merge 2 commits intosugarme:masterfrom
Open
perf: regexp2 pretokenizer, pooled buffers, offset-free fast encode path#80aeryncaen wants to merge 2 commits intosugarme:masterfrom
aeryncaen wants to merge 2 commits intosugarme:masterfrom
Conversation
Author
|
...Claude is a moron and closed the PR for some reason. |
5df3025 to
36a1f5b
Compare
AddTokensWithIds registers tokens with explicit IDs from tokenizer.json instead of recomputing them as model.GetVocabSize()+offset. This fixes compacted vocabularies where added token IDs are non-sequential. - AddedVocabulary.AddTokensWithIds: register with specified IDs - Tokenizer.AddTokensWithIds + GetAddedVocab: public API - CreateAddedTokensWithIds: preserves Id from TokenConfig - FromReader: uses ID-preserving loading path
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR improves tokenization throughput with three complementary changes:
RegexpPatternfrom stdlibregexptodlclark/regexp2— enableslookahead/lookbehind syntax required by modern tokenizers (GPT-4, Qwen, Llama 3)
and provides significantly faster matching on complex pretokenizer patterns.
reduces per-call allocations for rune→byte mapping and match index collection.
EncodeIDsOnlyAPI and offset-freeNormalizedString— a newTokenizer.EncodeIDsOnly(string) ([]int, error)method that skips allper-byte alignment tracking and
Encodingstruct construction, returningonly token IDs. This is the appropriate choice for any workload that does
not need character-level offset mappings (training data pipelines, inference
preprocessing, token counting, search indexing, etc.).
Additional changes
ByteLevel.UseRegexfield +SetUseRegex()+use_regexJSON config —honors the HuggingFace
use_regexoption so that when a priorSplitpretokenizer already handles regex splitting,
ByteLevelcan skip itsredundant GPT-2 split pass.
Sequence.PreTokenizers()accessor — exposes the underlying pretokenizerslice for pipeline introspection.
invert()pre-allocates its result slice.New public API
Tokenizer.EncodeIDsOnly(input string) ([]int, error)tokenizerEncodeSingle(input).Ids(withaddSpecialTokens=false, no truncation/padding).PreTokenizedString.IntoIDs() ([]int, error)tokenizerPreTokenizedStringwithout building anEncoding.NewPreTokenizedStringFast(s string)tokenizerPreTokenizedStringbacked by offset-freeNormalizedString.NewNormalizedFromFast(s string)normalizerNormalizedStringthat skips alignment array allocation. All mutation operations (Split, Transform, Prepend, etc.) still produce correct normalized text.AddedVocabulary.ExtractAndNormalizeFast(...)tokenizerExtractAndNormalize.ByteLevel.UseRegex/SetUseRegex(bool)pretokenizerSequence.PreTokenizers()pretokenizerBenchmark
Setup: Single-threaded
EncodeSingle/EncodeIDsOnlyover 60,000 JSONLdocuments on an Apple M4 Max. Median of 3 runs (Qwen3 before: 1 run due to
its 108 s runtime).
Qwen3 pretokenizer (complex GPT-style regex split + ByteLevel)
The Qwen3 tokenizer uses a complex regex pattern with Unicode character classes
(
\p{L},\p{N}) and alternations. The original pattern also contains anegative lookahead (
\s+(?!\S)) which is unsupported by Go's stdlibregexp.For the baseline comparison, we used a simplified variant with the lookahead
removed (
\s+(?!\S)|\s+→\s+|\s+), since the lookahead branch isredundant with the following
\s+alternative. This simplified pattern runs onboth the upstream stdlib
regexpengine and ourregexp2engine.Note: the token count differs between before and after (2,732,116 vs 2,462,116)
because
regexp2has more accurate.NET-style semantics for inlinecase-insensitive groups (
(?i:...)) and Unicode character classes compared toGo's RE2-based
regexp.EncodeSingle(upstream, stdlibregexp)EncodeSingle(regexp2+ pools)EncodeIDsOnly(regexp2+ pools + skip offsets)BERT pretokenizer (simple
BertPreTokenizer, no regex split)Included to show the improvement on a lightweight pretokenizer where regex
matching is not the bottleneck — the gains here come from pooled buffers and
the offset-free path.
EncodeSingle(upstream)EncodeSingleEncodeIDsOnlyCompatibility
EncodeSingleand all existing APIs are unchangedin behavior. The
regexp2engine is a strict superset of stdlibregexpfor the patterns used by HuggingFace tokenizers.
github.com/dlclark/regexp2 v1.11.5(stock, no fork).replacedirectives ingo.mod.