A parallelized BPE tokenizer built from scratch as part of Stanford's CS336 assignment.
No HuggingFace. No SentencePiece. Just raw Python and a lot of profiling.
- `train.py` - BPE training with multiprocessing for pre-tokenization
- `tokenizer.py` - CLI for BPE encoding and decoding
- `trained-tokenizers/` - Trained vocabulary and merge files for TinyStories (10K) and OpenWebText (32K)
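The training flow in `train.py` follows the standard BPE recipe: pre-tokenize the corpus in parallel, count adjacent byte pairs, and repeatedly merge the most frequent pair until the vocabulary is full. The sketch below illustrates that loop under assumed names (`pretokenize_chunk`, `train_bpe`, a GPT-2-style regex pattern); it is not the actual code in this repo.

```python
# Illustrative BPE training sketch (names and pattern are assumptions,
# not the functions in train.py).
import regex as re  # the `regex` package supports \p{L} / \p{N} classes
from collections import Counter
from multiprocessing import Pool

# GPT-2-style pre-tokenization pattern.
PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

def pretokenize_chunk(text: str) -> Counter:
    """Count pre-tokens (as tuples of byte values) in one chunk of the corpus."""
    counts = Counter()
    for match in re.finditer(PAT, text):
        counts[tuple(match.group().encode("utf-8"))] += 1
    return counts

def train_bpe(chunks: list[str], num_merges: int) -> list[tuple]:
    # Parallel pre-tokenization: one Counter per chunk, merged into a global count.
    with Pool() as pool:
        word_counts = Counter()
        for partial in pool.map(pretokenize_chunk, chunks):
            word_counts.update(partial)

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all pre-tokens.
        pair_counts = Counter()
        for word, freq in word_counts.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite every pre-token, fusing occurrences of the best pair.
        new_counts = Counter()
        for word, freq in word_counts.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(best)  # merged symbol represented by the pair itself
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_counts[tuple(merged)] += freq
        word_counts = new_counts
    return merges

if __name__ == "__main__":  # guard needed for multiprocessing on spawn platforms
    print(train_bpe(["a tiny example corpus", "another chunk of text"], num_merges=10))
```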
```bash
# Train a tokenizer
python train.py --input sample-data/TinyStoriesV2-GPT4-valid.txt --vocab-size 10000

# Encode text
python tokenizer.py --encode "Hello world" --vocab trained-tokenizers/TinyStories/vocab.json --merges trained-tokenizers/TinyStories/merges.txt

# Decode tokens
python tokenizer.py --decode "15496 995" --vocab trained-tokenizers/TinyStories/vocab.json --merges trained-tokenizers/TinyStories/merges.txt
```

Profiled with Scalene.
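`tokenizer.py` applies the learned merges to new text. At its core, BPE encoding walks the merge list in training order and greedily fuses the earliest-learned pair still present in the byte sequence. A minimal sketch, assuming the merges have already been parsed into an ordered list of byte-pair tuples and `vocab` maps each byte string to its id (the actual `vocab.json`/`merges.txt` parsing in this repo is omitted):

```python
# Sketch of BPE encoding: apply learned merges in training order.
# `merges` and `vocab` formats here are assumptions, not this repo's file formats.
def bpe_encode(text: str,
               merges: list[tuple[bytes, bytes]],
               vocab: dict[bytes, int]) -> list[int]:
    symbols = [bytes([b]) for b in text.encode("utf-8")]
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        # Find the adjacent pair with the lowest (earliest-learned) rank.
        pairs = [(ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no applicable merges remain
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return [vocab[s] for s in symbols]
```

Decoding is the inverse: look up each id's byte string in the vocabulary and UTF-8-decode the concatenation.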
Evaluated on validation sets:
- OpenWebText (32K vocab): 4.37
- TinyStories (10K vocab): 4.12
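Assuming these figures are compression ratios in bytes per token (total UTF-8 bytes of the validation text divided by the number of tokens produced), a number like this can be reproduced in a few lines; `encode` here stands for any function mapping text to token ids:

```python
# Hypothetical sketch: compression ratio (bytes per token) on a held-out file,
# assuming the metric above is bytes/token and given some `encode` function.
def compression_ratio(path: str, encode) -> float:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    num_bytes = len(text.encode("utf-8"))
    num_tokens = len(encode(text))
    return num_bytes / num_tokens
```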
Wrote about the whole process here: Building a BPE Tokenizer from Scratch