mmustapic/mmbpe
MMBPE, a Byte Pair Encoding Tokenizer in Swift

This is a small command-line tool that performs Byte Pair Encoding (BPE) tokenization, similar to what LLMs do. I wrote it while learning about tokenization for LLMs to better understand some concepts, so it is not intended for production use.

It is based on the Wikipedia article on Byte Pair Encoding, which explains the basic algorithm for training, encoding and decoding. This implementation, like most of the ones used for LLMs, works on the UTF-8 byte representation of a string and uses those bytes as the initial tokens. The initial vocabulary is just a dictionary mapping the first 256 Ints to strings built from a single UTF-8 byte. Half of those single-byte strings are invalid, since a single-byte UTF-8 sequence must use a value below 128, but we need all 256 bytes to be able to encode any UTF-8 string.
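A minimal sketch of building that initial 256-entry vocabulary (the helper name initialVocabulary is illustrative, not taken from the actual source):

```swift
// Build the initial vocabulary: tokenIds 0...255, each mapped to the
// string obtained by decoding that single byte as UTF-8.
func initialVocabulary() -> [Int: String] {
    var vocab: [Int: String] = [:]
    for byte in 0...255 {
        // Decoding a lone byte >= 128 is invalid UTF-8; Swift substitutes
        // U+FFFD, so every tokenId still gets an entry.
        vocab[byte] = String(decoding: [UInt8(byte)], as: UTF8.self)
    }
    return vocab
}
```

Note that the entries for bytes 128...255 all decode to the replacement character; they only become meaningful once merges reassemble them into valid multi-byte sequences.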

Once training is done, the resulting merge rules can be used to encode and decode any further text.

Implementation Details

The Tokenizer class does the training, encoding and decoding. BytePairTool uses ArgumentParser to make Tokenizer accessible from a command-line tool. In the source code, a tokenId (an Int) is the index of a token (a String) in the vocabulary. So wherever you see tokenId, it is an Int, and token is a String.

Training transforms the training text (a String) into a list of bytes based on its UTF-8 encoding, and those bytes become the initial tokenIds. Then the most frequent pair of adjacent tokenIds is merged into a new tokenId, and a merge rule is saved. This process repeats until the maximum vocabulary size is reached or no pair occurs more than once. The resulting list of merge rules can then be used to encode and decode further texts.
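A sketch of one such training step under those rules (Pair, countPairs, mergePair and trainStep are illustrative names, not the actual implementation):

```swift
// A pair of adjacent tokenIds; tuples aren't Hashable, so use a struct.
struct Pair: Hashable {
    let first: Int
    let second: Int
}

// Count how often each adjacent pair occurs in the sequence.
func countPairs(_ tokenIds: [Int]) -> [Pair: Int] {
    var counts: [Pair: Int] = [:]
    guard tokenIds.count >= 2 else { return counts }
    for i in 0..<(tokenIds.count - 1) {
        counts[Pair(first: tokenIds[i], second: tokenIds[i + 1]), default: 0] += 1
    }
    return counts
}

// Replace every occurrence of `pair` with the new tokenId.
func mergePair(_ tokenIds: [Int], _ pair: Pair, into newId: Int) -> [Int] {
    var result: [Int] = []
    var i = 0
    while i < tokenIds.count {
        if i + 1 < tokenIds.count,
           tokenIds[i] == pair.first, tokenIds[i + 1] == pair.second {
            result.append(newId)
            i += 2
        } else {
            result.append(tokenIds[i])
            i += 1
        }
    }
    return result
}

// One training step: merge the most frequent pair and return it as the
// new merge rule, or nil when no pair occurs more than once.
func trainStep(_ tokenIds: inout [Int], nextId: Int) -> Pair? {
    let counts = countPairs(tokenIds)
    guard let best = counts.max(by: { $0.value < $1.value }), best.value > 1 else {
        return nil
    }
    tokenIds = mergePair(tokenIds, best.key, into: nextId)
    return best.key
}
```

Calling trainStep in a loop with nextId = 256, 257, ... until it returns nil reproduces the stopping condition described above.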

Training

Use the train parameter to train a byte pair encoder/decoder on a training text. Once training is done, it writes the merge rules to a file. These merge rules are just pairs of Ints, that is, two tokenIds. The first pair merges into the new tokenId 256, the second pair into tokenId 257, and so on. Together with the initial vocabulary of the first 256 Ints, the full vocabulary can easily be reconstructed.
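That reconstruction can be sketched like this (the in-memory rule format and the function name are assumptions; the real tool reads the rules from a file). Working on byte arrays rather than Strings avoids losing the bytes that are invalid UTF-8 on their own:

```swift
// Rebuild the vocabulary from an ordered list of merge rules.
// Rule i merges (left, right) into tokenId 256 + i, and the bytes of the
// new token are the concatenation of its two parts' bytes.
func reconstructVocabulary(from rules: [(Int, Int)]) -> [Int: [UInt8]] {
    var vocab: [Int: [UInt8]] = [:]
    // Base vocabulary: each of the 256 byte values is its own token.
    for byte in 0...255 {
        vocab[byte] = [UInt8(byte)]
    }
    for (i, rule) in rules.enumerated() {
        vocab[256 + i] = (vocab[rule.0] ?? []) + (vocab[rule.1] ?? [])
    }
    return vocab
}
```

Decoding a token sequence is then just concatenating the byte arrays of each tokenId and decoding the result as UTF-8.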

Regex splitting

The trainer and encoder split the text into categories, like words, numbers and (some) punctuation signs, and then create merge rules (for training) or apply merges (for encoding). This avoids merges across categories and preserves words, contractions (like 'll, 't, etc.), numbers and punctuation. This Tokenizer uses the GPT-2 split pattern.
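A sketch of that pre-tokenization step using the GPT-2 split pattern with Foundation's NSRegularExpression (the pattern is GPT-2's; the function name is illustrative):

```swift
import Foundation

// The GPT-2 pre-tokenization pattern: contractions, optional-space-plus-
// letters, optional-space-plus-digits, punctuation runs, and whitespace.
let gpt2Pattern = #"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"#

// Split a text into chunks; merges are later confined to each chunk.
func splitForTokenization(_ text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: gpt2Pattern) else {
        return [text]
    }
    let range = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: range).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}
```

Because the alternatives cover every character class, the split is lossless: concatenating the chunks gives back the original text.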

GPT-2 compatibility

OpenAI's GPT-2 tokenizer does some initial byte shuffling after UTF-8 encoding, mapping the 256 byte values to printable characters instead of control characters and the like. I don't really know why, maybe so everything is printable. Their merges file, strangely called vocab.bpe, uses strings instead of Ints: the space character is replaced by "Ġ", for example. This tokenizer can parse the GPT-2 merges file with a special format flag and transform it into its internal tokenId-based representation.
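A sketch of that byte-to-character mapping, mirroring GPT-2's bytes_to_unicode helper from the original Python source (the Swift names are mine):

```swift
// Map every byte to a printable character: printable Latin-1 bytes map to
// themselves; the rest are shifted to code points 256, 257, ... in order.
func gpt2ByteEncoder() -> [UInt8: Character] {
    var mapping: [UInt8: Character] = [:]
    var shifted = 0
    for byte in 0...255 {
        let printable = (33...126).contains(byte)     // '!' ... '~'
            || (161...172).contains(byte)             // '¡' ... '¬'
            || (174...255).contains(byte)             // '®' ... 'ÿ'
        if printable {
            mapping[UInt8(byte)] = Character(UnicodeScalar(UInt32(byte))!)
        } else {
            mapping[UInt8(byte)] = Character(UnicodeScalar(UInt32(256 + shifted))!)
            shifted += 1
        }
    }
    return mapping
}
```

Under this scheme the space byte (32) lands on code point 288, which is "Ġ", matching the strings seen in vocab.bpe; inverting the mapping recovers the raw bytes when parsing that file.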

Future work

Special token parsing still needs to be implemented.
