mmustapic/mmbpe
MMBPE, a Byte Pair Encoding Tokenizer in Swift

This is a small command-line tool that performs Byte Pair Encoding (BPE) tokenization, similar to what LLMs do. I wrote it while learning about tokenization for LLMs to better understand some concepts, so it is not intended for production use.

It is based on the Wikipedia article on Byte Pair Encoding, which explains the basic algorithm for training, encoding and decoding. This implementation, like most of the ones used for LLMs, works on the UTF-8 byte representation of a string and uses those bytes as the initial tokens. The initial vocabulary is just a dictionary mapping the first 256 Ints to strings built from a single UTF-8 byte. Half of those single-byte strings are invalid, since a single-byte UTF-8 sequence must use a value below 128, but we need all 256 bytes to be able to encode any UTF-8 string.
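A minimal sketch of building that initial 256-entry vocabulary (the helper name initialVocabulary is illustrative, not taken from the actual source):

```swift
// Build the initial vocabulary: tokenIds 0...255, each mapped to the
// string obtained by decoding that single byte as UTF-8.
func initialVocabulary() -> [Int: String] {
    var vocab: [Int: String] = [:]
    for byte in 0...255 {
        // Decoding a lone byte >= 128 is invalid UTF-8; Swift substitutes
        // U+FFFD, so every tokenId still gets an entry.
        vocab[byte] = String(decoding: [UInt8(byte)], as: UTF8.self)
    }
    return vocab
}
```

Note that the entries for bytes 128...255 all decode to the replacement character; they only become meaningful once merges reassemble them into valid multi-byte sequences.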

Once training is done, the resulting merge rules can be used to encode and decode any further text.

Implementation Details

The Tokenizer class does the training, encoding and decoding. BytePairTool uses ArgumentParser to make Tokenizer accessible from a command-line tool. In the source code, a tokenId (an Int) is the index of a token (a String) in the vocabulary. So wherever you see tokenId, it is an Int, and token is a String.

Training transforms the training text (a String) into a list of bytes based on its UTF-8 encoding, and those bytes become the initial tokenIds. Then the most frequent pair of adjacent tokenIds is merged into a new tokenId, and a merge rule is saved. This process repeats until the maximum vocabulary size is reached or no pair occurs more than once. The resulting list of merge rules can then be used to encode and decode further texts.
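A sketch of one such training step under those rules (Pair, countPairs, mergePair and trainStep are illustrative names, not the actual implementation):

```swift
// A pair of adjacent tokenIds; tuples aren't Hashable, so use a struct.
struct Pair: Hashable {
    let first: Int
    let second: Int
}

// Count how often each adjacent pair occurs in the sequence.
func countPairs(_ tokenIds: [Int]) -> [Pair: Int] {
    var counts: [Pair: Int] = [:]
    guard tokenIds.count >= 2 else { return counts }
    for i in 0..<(tokenIds.count - 1) {
        counts[Pair(first: tokenIds[i], second: tokenIds[i + 1]), default: 0] += 1
    }
    return counts
}

// Replace every occurrence of `pair` with the new tokenId.
func mergePair(_ tokenIds: [Int], _ pair: Pair, into newId: Int) -> [Int] {
    var result: [Int] = []
    var i = 0
    while i < tokenIds.count {
        if i + 1 < tokenIds.count,
           tokenIds[i] == pair.first, tokenIds[i + 1] == pair.second {
            result.append(newId)
            i += 2
        } else {
            result.append(tokenIds[i])
            i += 1
        }
    }
    return result
}

// One training step: merge the most frequent pair and return it as the
// new merge rule, or nil when no pair occurs more than once.
func trainStep(_ tokenIds: inout [Int], nextId: Int) -> Pair? {
    let counts = countPairs(tokenIds)
    guard let best = counts.max(by: { $0.value < $1.value }), best.value > 1 else {
        return nil
    }
    tokenIds = mergePair(tokenIds, best.key, into: nextId)
    return best.key
}
```

Calling trainStep in a loop with nextId = 256, 257, ... until it returns nil reproduces the stopping condition described above.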

Training

Use the train parameter to train a byte pair encoder/decoder on a training text. Once training is done, it writes the merge rules to a file. These merge rules are just pairs of Ints, that is, two tokenIds. The first pair merges into the new tokenId 256, the second pair into tokenId 257, and so on. Together with the initial vocabulary of the first 256 Ints, the full vocabulary can easily be reconstructed.
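That reconstruction can be sketched like this (the in-memory rule format and the function name are assumptions; the real tool reads the rules from a file). Working on byte arrays rather than Strings avoids losing the bytes that are invalid UTF-8 on their own:

```swift
// Rebuild the vocabulary from an ordered list of merge rules.
// Rule i merges (left, right) into tokenId 256 + i, and the bytes of the
// new token are the concatenation of its two parts' bytes.
func reconstructVocabulary(from rules: [(Int, Int)]) -> [Int: [UInt8]] {
    var vocab: [Int: [UInt8]] = [:]
    // Base vocabulary: each of the 256 byte values is its own token.
    for byte in 0...255 {
        vocab[byte] = [UInt8(byte)]
    }
    for (i, rule) in rules.enumerated() {
        vocab[256 + i] = (vocab[rule.0] ?? []) + (vocab[rule.1] ?? [])
    }
    return vocab
}
```

Decoding a token sequence is then just concatenating the byte arrays of each tokenId and decoding the result as UTF-8.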

Regex splitting

The trainer and encoder split the text into categories, like words, numbers and (some) punctuation signs, and then create merge rules (for training) or apply merges (for encoding). This avoids merges across categories and preserves words, contractions (like 'll, 't, etc.), numbers and punctuation. This Tokenizer uses the GPT-2 split pattern.
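A sketch of that pre-tokenization step using the GPT-2 split pattern with Foundation's NSRegularExpression (the pattern is GPT-2's; the function name is illustrative):

```swift
import Foundation

// The GPT-2 pre-tokenization pattern: contractions, optional-space-plus-
// letters, optional-space-plus-digits, punctuation runs, and whitespace.
let gpt2Pattern = #"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"#

// Split a text into chunks; merges are later confined to each chunk.
func splitForTokenization(_ text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: gpt2Pattern) else {
        return [text]
    }
    let range = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: range).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}
```

Because the alternatives cover every character class, the split is lossless: concatenating the chunks gives back the original text.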

GPT-2 compatibility

OpenAI's GPT-2 tokenizer does some initial byte shuffling after UTF-8 encoding, mapping the 256 byte values to printable characters instead of control characters and the like. I don't really know why, maybe so everything is printable. Their merges file, strangely called vocab.bpe, uses strings instead of Ints: the space character is replaced by "Ġ", for example. This tokenizer can parse the GPT-2 merges file with a special format flag and transform it into its internal tokenId-based representation.
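A sketch of that byte-to-character mapping, mirroring GPT-2's bytes_to_unicode helper from the original Python source (the Swift names are mine):

```swift
// Map every byte to a printable character: printable Latin-1 bytes map to
// themselves; the rest are shifted to code points 256, 257, ... in order.
func gpt2ByteEncoder() -> [UInt8: Character] {
    var mapping: [UInt8: Character] = [:]
    var shifted = 0
    for byte in 0...255 {
        let printable = (33...126).contains(byte)     // '!' ... '~'
            || (161...172).contains(byte)             // '¡' ... '¬'
            || (174...255).contains(byte)             // '®' ... 'ÿ'
        if printable {
            mapping[UInt8(byte)] = Character(UnicodeScalar(UInt32(byte))!)
        } else {
            mapping[UInt8(byte)] = Character(UnicodeScalar(UInt32(256 + shifted))!)
            shifted += 1
        }
    }
    return mapping
}
```

Under this scheme the space byte (32) lands on code point 288, which is "Ġ", matching the strings seen in vocab.bpe; inverting the mapping recovers the raw bytes when parsing that file.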

Future work

Special token parsing still needs to be implemented.
