
Pre-BPE regex diverges from tiktoken on some Unicode characters #23

@smola

Description

I'm testing Go alternatives to tiktoken for o200k_base and checking for divergences. For tiktoken-go, the string "\u1C89\\u" produces different results. The root cause seems to be that Go considers \u1C89 lowercase while Python does not, probably because of a mismatch in Unicode versions.

I understand that fixing this may well be out of scope for tiktoken-go, but it might be worth adding a documentation note that this can be a source of tokenization mismatches between tiktoken and tiktoken-go.

AI-generated repro info

Environment

  • Go: go1.25.6 linux/amd64
  • github.com/tiktoken-go/tokenizer: v0.7.0
  • Python tiktoken: 0.12.0

Minimal repro

Input string:

\u1C89\\u

This is U+1C89 followed by the two literal characters \ and u.
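To make the input unambiguous, here is a small stdlib-only Go sketch that dumps the code points of the test string (the program is illustrative, not part of the repro):

```go
package main

import "fmt"

func main() {
	s := "\u1C89\\u"
	// The string is 5 bytes: U+1C89 encodes to 3 bytes in UTF-8,
	// followed by the single-byte characters '\' and 'u'.
	for i, r := range s {
		fmt.Printf("byte offset %d: U+%04X\n", i, r)
	}
	fmt.Println("len(s) =", len(s), "runes =", len([]rune(s)))
}
```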

Go (tiktoken-go)

package main

import (
    "encoding/json"
    "fmt"

    "github.com/tiktoken-go/tokenizer"
)

func main() {
    input := "\u1C89\\u"

    tk, err := tokenizer.Get(tokenizer.O200kBase)
    if err != nil {
        panic(err)
    }

    tokens, _, err := tk.Encode(input)
    if err != nil {
        panic(err)
    }

    b, _ := json.Marshal(tokens)
    fmt.Printf("input=%q\n", input)
    fmt.Printf("tokens=%s\n", b)
}

Output:

input="\u1c89\\u"
tokens=[157,110,231,59,84]

Python (openai/tiktoken)

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4.1-mini")  # o200k_base
input_text = "\u1C89\\u"
print(f"input={input_text!r}")
print(f"tokens={enc.encode(input_text)}")

Output:

input='Ᲊ\\u'
tokens=[157, 110, 231, 7570]

Why this seems wrong in tiktoken-go

Token 7570 corresponds to "\\u" in o200k_base vocab, so the Python output is consistent with splitting this as:

  • "\u1C89" -> [157,110,231]
  • "\\u" -> [7570]

But tiktoken-go yields [..., 59, 84], which implies it did not keep "\\u" as a single pre-BPE piece.

Investigation notes

The o200k_base regex pattern appears the same in both projects:

  • tiktoken-go: codec/o200k_base.go
  • tiktoken: tiktoken_ext/openai_public.py (o200k_base())

However, regex matching behavior diverges for the same input:

  • Python regex split behaves like: ['\u1C89', '\\u']
  • Go (regexp2) behaves like: ['\u1C89\\', 'u']

That split difference alone explains the token divergence.
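The shape of the divergence is easy to reproduce with Go's standard regexp and a simplified stand-in for the pattern (this is not the real o200k_base pattern, and stdlib regexp is not the regexp2 engine tiktoken-go uses; it is only a sketch of the lowercase-run rule). If the engine classifies the character before "\\u" as a letter, the backslash becomes the optional non-letter prefix of the next lowercase run and "\\u" stays one piece; if it classifies it as a non-letter, the backslash glues onto the preceding punctuation run and "u" is left alone:

```go
package main

import (
	"fmt"
	"regexp"
)

// split applies a simplified stand-in for the pre-BPE pattern:
// an optional non-letter prefix followed by a lowercase run,
// a punctuation run, whitespace, or a run of letters.
func split(s string) []string {
	re := regexp.MustCompile(`[^\r\n\p{L}\p{N}]?\p{Ll}+|[^\s\p{L}\p{N}]+|\s+|\p{L}+`)
	return re.FindAllString(s, -1)
}

func main() {
	// 'a' is a letter, so the '\' attaches to the following "u".
	fmt.Printf("%q\n", split("a\\u")) // ["a" "\\u"]
	// '#' is not, so '#' and '\' form one punctuation run and "u" stands alone.
	fmt.Printf("%q\n", split("#\\u")) // ["#\\" "u"]
}
```

The two outputs mirror the observed Python vs Go splits of "\u1C89\\u": whether U+1C89 counts as a letter decides which side of the split the backslash lands on.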

The likely root cause is a Unicode property/classification difference for U+1C89 in the regex engine stack (regexp2 plus Go's unicode tables), which affects \p{L}/\p{Lu} matching and therefore pre-BPE segmentation.
