Add Anthropic Claude tokenizer support by vwilson · Pull Request #52 · dmitry-brazhenko/SharpToken

vwilson · 2026-03-24T18:00:34Z

Summary

Closes #32

Adds a new "claude" encoding based on Anthropic's official @anthropic-ai/tokenizer BPE vocabulary (~65K tokens)
Adds NFKC text normalization support (the key algorithmic difference between Claude's tokenizer and OpenAI's)
Adds model name mappings for Claude models (claude-instant-1 through claude-4-sonnet) with prefix matching for dated variants
Reuses existing Regex50KBase pattern (identical to Claude's pat_str)

Note: Per Anthropic's documentation, this tokenizer is accurate for pre-Claude 3 models and serves as a rough approximation for Claude 3+.
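
As a rough illustration of the NFKC step described above, the following sketch uses only the .NET base library. The helper name NormalizeForEncoding and its nullable parameter are assumptions made for this example, not the code added to ModelParams or GptEncoding in this PR.

```csharp
using System;
using System.Text;

// Hypothetical helper, not SharpToken's actual code: apply an optional
// normalization form before tokenizing. A null form leaves the text alone,
// which is how encodings without normalization would skip the work.
static string NormalizeForEncoding(string text, NormalizationForm? form)
{
    return form.HasValue ? text.Normalize(form.Value) : text;
}

// Fullwidth "Hello" (U+FF28 U+FF45 U+FF4C U+FF4C U+FF4F), as in the NFKC test case.
string fullwidth = "\uFF28\uFF45\uFF4C\uFF4C\uFF4F";

// NFKC applies compatibility folding, so fullwidth letters should become ASCII.
string normalized = NormalizeForEncoding(fullwidth, NormalizationForm.FormKC);
Console.WriteLine(normalized == "Hello");

// With no normalization configured, the same string instance comes back.
string untouched = NormalizeForEncoding(fullwidth, null);
Console.WriteLine(ReferenceEquals(untouched, fullwidth));
```

Normalizing before the regex split means visually equivalent inputs are tokenized identically, at the cost of Decode not always reproducing the exact original characters.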

Usage

using SharpToken;

// By encoding name
var encoding = GptEncoding.GetEncoding("claude");

// By model name
var encodingForModel = GptEncoding.GetEncodingForModel("claude-3.5-sonnet");

var tokens = encoding.Encode("Hello, world!");
var text = encoding.Decode(tokens);

Test plan

All 889 tests pass (885 existing + 4 new)
Encode/decode roundtrip works for Claude encoding
NFKC normalization verified (fullwidth chars normalize correctly)
Model name resolution works for all Claude variants
Special token encoding/decoding works
Zero performance impact on existing encodings (normalization null-check short-circuits)
Builds on all target frameworks (netstandard2.0, net6.0, net8.0)
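
The model name resolution checked in the test plan can be pictured with a sketch like the one below. It is a minimal illustration under stated assumptions: the table contents, the TryResolveEncoding helper, and the dated model name are all hypothetical, and the lookup code merged in this PR may be structured differently.

```csharp
using System;
using System.Collections.Generic;

// Illustrative table only. The two model names come from the PR description;
// the real mapping in the merged change may contain more entries.
var claudePrefixes = new Dictionary<string, string>(StringComparer.Ordinal)
{
    ["claude-instant-1"] = "claude",
    ["claude-4-sonnet"] = "claude",
};

// Hypothetical resolver: try an exact match first, then fall back to any
// known name that the requested model name starts with (dated variants).
static bool TryResolveEncoding(
    string modelName,
    IReadOnlyDictionary<string, string> table,
    out string encodingName)
{
    if (table.TryGetValue(modelName, out encodingName))
    {
        return true;
    }

    foreach (var entry in table)
    {
        if (modelName.StartsWith(entry.Key, StringComparison.Ordinal))
        {
            encodingName = entry.Value;
            return true;
        }
    }

    encodingName = null;
    return false;
}

// "claude-instant-1-20990101" is a made-up dated variant used only for this demo.
Console.WriteLine(
    TryResolveEncoding("claude-instant-1-20990101", claudePrefixes, out var resolved)
        ? resolved
        : "(unknown)");
```

A plain prefix check is the simplest form of this idea. A stricter variant could require a separator after the prefix so that unrelated names sharing the same leading characters do not match.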

🤖 Generated with Claude Code

Add a new "claude" encoding based on Anthropic's official tokenizer data, enabling token counting and encoding/decoding for Claude models.

- Convert and embed BPE vocabulary from Anthropic's @anthropic-ai/tokenizer package (~65K tokens) as claude.tiktoken
- Add NFKC text normalization support to ModelParams and GptEncoding, matching Anthropic's tokenizer behavior
- Add Claude model name mappings (claude-instant-1 through claude-4-sonnet) with prefix matching for dated variants
- Reuse existing Regex50KBase pattern (identical to Claude's pat_str)
- Add tests for roundtrip encoding, model mappings, NFKC normalization, and special token handling

Note: This tokenizer is accurate for pre-Claude 3 models and serves as a rough approximation for Claude 3+, per Anthropic's documentation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

dmitry-brazhenko approved these changes Mar 25, 2026

dmitry-brazhenko merged commit b4203b9 into dmitry-brazhenko:main Mar 25, 2026
3 checks passed

dmitry-brazhenko merged 1 commit into dmitry-brazhenko:main from vwilson:feature/claude-tokenizer
