I'm testing Go alternatives to tiktoken for o200k_base and checking for divergences. For tiktoken-go, the string "\u1C89\\u" produces different results. The root cause seems to be that Go considers U+1C89 lowercase while Python does not, probably because of a mismatch in Unicode versions.
I understand fixing this may well be out of scope for tiktoken-go, but it might be worth a documentation note that Unicode-version differences can be a source of tokenization mismatches between tiktoken and tiktoken-go.
AI-generated repro info
Environment
- Go: go1.25.6 linux/amd64; github.com/tiktoken-go/tokenizer v0.7.0
- Python: tiktoken 0.12.0
Minimal repro
Input string:
This is U+1C89 followed by the two literal characters \ and u.
Go (tiktoken-go)
package main

import (
	"encoding/json"
	"fmt"

	"github.com/tiktoken-go/tokenizer"
)

func main() {
	input := "\u1C89\\u"
	tk, err := tokenizer.Get(tokenizer.O200kBase)
	if err != nil {
		panic(err)
	}
	tokens, _, err := tk.Encode(input)
	if err != nil {
		panic(err)
	}
	b, _ := json.Marshal(tokens)
	fmt.Printf("input=%q\n", input)
	fmt.Printf("tokens=%s\n", b)
}
Output:
input="\u1c89\\u"
tokens=[157,110,231,59,84]
Python (openai/tiktoken)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4.1-mini") # o200k_base
input_text = "\u1C89\\u"
print(f"input={input_text!r}")
print(f"tokens={enc.encode(input_text)}")
Output:
input='\u1c89\\u'
tokens=[157, 110, 231, 7570]
Why this seems wrong in tiktoken-go
Token 7570 corresponds to "\\u" in o200k_base vocab, so the Python output is consistent with splitting this as:
"\u1C89" -> [157,110,231]
"\\u" -> [7570]
But tiktoken-go yields [... ,59,84], which implies it did not keep "\\u" as one pre-BPE piece.
Investigation notes
The o200k_base regex pattern appears the same in both projects:
tiktoken-go: codec/o200k_base.go
tiktoken: tiktoken_ext/openai_public.py (o200k_base())
However, regex matching behavior diverges for the same input:
- Python regex split behaves like: ['\u1C89', '\\u']
- Go (regexp2) behaves like: ['\u1C89\\', 'u']
That split difference alone explains the token divergence.
The likely root cause is Unicode property/classification differences in the regex engine stack (Go regexp2 + Go unicode tables) for U+1C89, which affect \p{L}/\p{Lu} matching and therefore pre-BPE segmentation.