
Add Anthropic Claude tokenizer support#52

Merged
dmitry-brazhenko merged 1 commit into dmitry-brazhenko:main from vwilson:feature/claude-tokenizer
Mar 25, 2026

Contributor

@vwilson vwilson commented Mar 24, 2026

Summary

Closes #32

  • Adds a new "claude" encoding based on Anthropic's official @anthropic-ai/tokenizer BPE vocabulary (~65K tokens)
  • Adds NFKC text normalization support (the key algorithmic difference between Claude's tokenizer and OpenAI's)
  • Adds model name mappings for Claude models (claude-instant-1 through claude-4-sonnet) with prefix matching for dated variants
  • Reuses existing Regex50KBase pattern (identical to Claude's pat_str)

Note: Per Anthropic's documentation, this tokenizer is accurate for pre-Claude 3 models and serves as a rough approximation for Claude 3+.
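To illustrate the NFKC step mentioned in the summary, here is a minimal sketch using only the .NET base class library. `NormalizeNfkc` is a hypothetical helper name chosen for this example, not necessarily what the PR adds to `ModelParams` or `GptEncoding`; the point is only to show what NFKC does to text before it reaches the BPE stage.

```csharp
using System;
using System.Text;

// Hypothetical helper (not the library's API): apply NFKC before tokenizing.
static string NormalizeNfkc(string text) => text.Normalize(NormalizationForm.FormKC);

// Fullwidth Latin letters fold to their ASCII counterparts under NFKC.
string fullwidth = "\uFF28\uFF45\uFF4C\uFF4C\uFF4F"; // fullwidth "Hello"
Console.WriteLine(NormalizeNfkc(fullwidth));

// Compatibility ligatures decompose: U+FB01 becomes the two letters "fi".
Console.WriteLine(NormalizeNfkc("\uFB01"));
```

Because normalization runs before encoding, a decode of the resulting tokens returns the normalized text, so a roundtrip is only exact for input that is already in NFKC form.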

Usage

// By encoding name
var encoding = GptEncoding.GetEncoding("claude");

// Or by model name (a separate variable, so both lookups compile in one scope)
var modelEncoding = GptEncoding.GetEncodingForModel("claude-3.5-sonnet");

var tokens = encoding.Encode("Hello, world!");
var text = encoding.Decode(tokens);
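The model-name lookup above depends on prefix matching for dated variants. The sketch below shows one way such matching could work; the prefix list and the `IsClaudeModel` helper are assumptions made for illustration and are not taken from the PR's actual mapping table.

```csharp
using System;

// Hypothetical model-name prefixes, chosen for illustration only.
string[] claudePrefixes = { "claude-instant-1", "claude-2", "claude-3-opus" };

// True when modelName equals a known prefix or extends it with a "-" suffix,
// as a dated variant such as "claude-3-opus-20240229" does.
bool IsClaudeModel(string modelName)
{
    foreach (var prefix in claudePrefixes)
    {
        if (modelName.Equals(prefix, StringComparison.Ordinal) ||
            modelName.StartsWith(prefix + "-", StringComparison.Ordinal))
        {
            return true;
        }
    }
    return false;
}

Console.WriteLine(IsClaudeModel("claude-3-opus-20240229")); // True
Console.WriteLine(IsClaudeModel("gpt-4o"));                 // False
```

Requiring the `-` separator after the prefix is a design choice in this sketch: it keeps an unrelated name that merely begins with the same characters from matching by accident.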

Test plan

  • All 889 tests pass (885 existing + 4 new)
  • Encode/decode roundtrip works for Claude encoding
  • NFKC normalization verified (fullwidth chars normalize correctly)
  • Model name resolution works for all Claude variants
  • Special token encoding/decoding works
  • Zero performance impact on existing encodings (normalization null-check short-circuits)
  • Builds on all target frameworks (netstandard2.0, net6.0, net8.0)
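The "null-check short-circuits" item in the test plan can be pictured with the following sketch. It assumes the normalization setting is an optional `NormalizationForm?`; the `Prepare` helper is hypothetical and the PR's real field and method names may differ.

```csharp
using System;
using System.Text;

// Hypothetical pre-processing step: a null form means "no normalization",
// so existing encodings skip the Normalize call entirely.
static string Prepare(string text, NormalizationForm? form) =>
    form is null ? text : text.Normalize(form.Value);

string input = "\uFF21\uFF22\uFF23"; // fullwidth "ABC"

// With no form configured, the very same string instance comes back.
Console.WriteLine(object.ReferenceEquals(Prepare(input, null), input)); // True

// With NFKC configured, the fullwidth letters fold to ASCII.
Console.WriteLine(Prepare(input, NormalizationForm.FormKC)); // ABC
```

Under this assumption, encodings that leave the setting null pay only for a single null comparison per call, with no allocation and no extra pass over the text.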

🤖 Generated with Claude Code

Add a new "claude" encoding based on Anthropic's official tokenizer data,
enabling token counting and encoding/decoding for Claude models.

- Convert and embed BPE vocabulary from Anthropic's @anthropic-ai/tokenizer
  package (~65K tokens) as claude.tiktoken
- Add NFKC text normalization support to ModelParams and GptEncoding,
  matching Anthropic's tokenizer behavior
- Add Claude model name mappings (claude-instant-1 through claude-4-sonnet)
  with prefix matching for dated variants
- Reuse existing Regex50KBase pattern (identical to Claude's pat_str)
- Add tests for roundtrip encoding, model mappings, NFKC normalization,
  and special token handling

Note: This tokenizer is accurate for pre-Claude 3 models and serves as
a rough approximation for Claude 3+, per Anthropic's documentation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dmitry-brazhenko dmitry-brazhenko merged commit b4203b9 into dmitry-brazhenko:main Mar 25, 2026
3 checks passed

Development

Successfully merging this pull request may close these issues.

Anthropic (claude) support

2 participants