
Pre-BPE regex diverges from tiktoken on some Unicode characters #23

@smola

Description

I'm testing Go alternatives to tiktoken for o200k_base and checking for divergences. For tiktoken-go, the string "\u1C89\\u" produces different results. The root cause seems to be that Go considers \u1C89 lowercase while Python does not, probably because of a mismatch in Unicode versions.

I understand that fixing this may well be out of scope for tiktoken-go, but it might be worth adding a documentation note that this can be a source of tokenization mismatches between tiktoken and tiktoken-go.

AI-generated repro info

Environment

  • Go: go1.25.6 linux/amd64
  • github.com/tiktoken-go/tokenizer: v0.7.0
  • Python tiktoken: 0.12.0

Minimal repro

Input string:

\u1C89\\u

This is U+1C89 followed by the two literal characters \ and u.
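To make the input unambiguous, here is a small stdlib-only Go sketch that dumps the code points of the test string (the program is illustrative, not part of the repro):

```go
package main

import "fmt"

func main() {
	s := "\u1C89\\u"
	// The string is 5 bytes: U+1C89 encodes to 3 bytes in UTF-8,
	// followed by the single-byte characters '\' and 'u'.
	for i, r := range s {
		fmt.Printf("byte offset %d: U+%04X\n", i, r)
	}
	fmt.Println("len(s) =", len(s), "runes =", len([]rune(s)))
}
```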

Go (tiktoken-go)

package main

import (
    "encoding/json"
    "fmt"

    "github.com/tiktoken-go/tokenizer"
)

func main() {
    input := "\u1C89\\u"

    tk, err := tokenizer.Get(tokenizer.O200kBase)
    if err != nil {
        panic(err)
    }

    tokens, _, err := tk.Encode(input)
    if err != nil {
        panic(err)
    }

    b, _ := json.Marshal(tokens)
    fmt.Printf("input=%q\n", input)
    fmt.Printf("tokens=%s\n", b)
}

Output:

input="\u1c89\\u"
tokens=[157,110,231,59,84]

Python (openai/tiktoken)

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4.1-mini")  # o200k_base
input_text = "\u1C89\\u"
print(f"input={input_text!r}")
print(f"tokens={enc.encode(input_text)}")

Output:

input='Ᲊ\\u'
tokens=[157, 110, 231, 7570]

Why this seems wrong in tiktoken-go

Token 7570 corresponds to "\\u" in o200k_base vocab, so the Python output is consistent with splitting this as:

  • "\u1C89" -> [157,110,231]
  • "\\u" -> [7570]

But tiktoken-go yields [..., 59, 84], which implies it did not keep "\\u" as a single pre-BPE piece.

Investigation notes

The o200k_base regex pattern appears the same in both projects:

  • tiktoken-go: codec/o200k_base.go
  • tiktoken: tiktoken_ext/openai_public.py (o200k_base())

However, regex matching behavior diverges for the same input:

  • Python regex split behaves like: ['\u1C89', '\\u']
  • Go (regexp2) behaves like: ['\u1C89\\', 'u']

That split difference alone explains the token divergence.
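The shape of the divergence is easy to reproduce with Go's standard regexp and a simplified stand-in for the pattern (this is not the real o200k_base pattern, and stdlib regexp is not the regexp2 engine tiktoken-go uses; it is only a sketch of the lowercase-run rule). If the engine classifies the character before "\\u" as a letter, the backslash becomes the optional non-letter prefix of the next lowercase run and "\\u" stays one piece; if it classifies it as a non-letter, the backslash glues onto the preceding punctuation run and "u" is left alone:

```go
package main

import (
	"fmt"
	"regexp"
)

// split applies a simplified stand-in for the pre-BPE pattern:
// an optional non-letter prefix followed by a lowercase run,
// a punctuation run, whitespace, or a run of letters.
func split(s string) []string {
	re := regexp.MustCompile(`[^\r\n\p{L}\p{N}]?\p{Ll}+|[^\s\p{L}\p{N}]+|\s+|\p{L}+`)
	return re.FindAllString(s, -1)
}

func main() {
	// 'a' is a letter, so the '\' attaches to the following "u".
	fmt.Printf("%q\n", split("a\\u")) // ["a" "\\u"]
	// '#' is not, so '#' and '\' form one punctuation run and "u" stands alone.
	fmt.Printf("%q\n", split("#\\u")) // ["#\\" "u"]
}
```

The two outputs mirror the observed Python vs Go splits of "\u1C89\\u": whether U+1C89 counts as a letter decides which side of the split the backslash lands on.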

The likely root cause is a Unicode property/classification difference for U+1C89 in the regex engine stack (regexp2 plus Go's unicode tables), which affects \p{L}/\p{Lu} matching and therefore pre-BPE segmentation.
