A parallelized BPE tokenizer built from scratch as part of Stanford's CS336 assignment.
No HuggingFace. No SentencePiece. Just raw Python and a lot of profiling.
- `train.py` - BPE training with multiprocessing for pre-tokenization
- `tokenizer.py` - CLI for BPE encoding and decoding
- `trained-tokenizers/` - Trained vocabulary and merge files for TinyStories (10K) and OpenWebText (32K)
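The training flow in `train.py` follows the standard BPE recipe: pre-tokenize the corpus in parallel, count adjacent byte pairs, and repeatedly merge the most frequent pair until the vocabulary is full. The sketch below illustrates that loop under assumed names (`pretokenize_chunk`, `train_bpe`, a GPT-2-style regex pattern); it is not the actual code in this repo.

```python
# Illustrative BPE training sketch (names and pattern are assumptions,
# not the functions in train.py).
import regex as re  # the `regex` package supports \p{L} / \p{N} classes
from collections import Counter
from multiprocessing import Pool

# GPT-2-style pre-tokenization pattern.
PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

def pretokenize_chunk(text: str) -> Counter:
    """Count pre-tokens (as tuples of byte values) in one chunk of the corpus."""
    counts = Counter()
    for match in re.finditer(PAT, text):
        counts[tuple(match.group().encode("utf-8"))] += 1
    return counts

def train_bpe(chunks: list[str], num_merges: int) -> list[tuple]:
    # Parallel pre-tokenization: one Counter per chunk, merged into a global count.
    with Pool() as pool:
        word_counts = Counter()
        for partial in pool.map(pretokenize_chunk, chunks):
            word_counts.update(partial)

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all pre-tokens.
        pair_counts = Counter()
        for word, freq in word_counts.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite every pre-token, fusing occurrences of the best pair.
        new_counts = Counter()
        for word, freq in word_counts.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(best)  # merged symbol represented by the pair itself
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_counts[tuple(merged)] += freq
        word_counts = new_counts
    return merges

if __name__ == "__main__":  # guard needed for multiprocessing on spawn platforms
    print(train_bpe(["a tiny example corpus", "another chunk of text"], num_merges=10))
```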
```bash
# Train a tokenizer
python train.py --input sample-data/TinyStoriesV2-GPT4-valid.txt --vocab-size 10000

# Encode text
python tokenizer.py --encode "Hello world" --vocab trained-tokenizers/TinyStories/vocab.json --merges trained-tokenizers/TinyStories/merges.txt

# Decode tokens
python tokenizer.py --decode "15496 995" --vocab trained-tokenizers/TinyStories/vocab.json --merges trained-tokenizers/TinyStories/merges.txt
```

Profiled with Scalene.
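`tokenizer.py` applies the learned merges to new text. At its core, BPE encoding walks the merge list in training order and greedily fuses the earliest-learned pair still present in the byte sequence. A minimal sketch, assuming the merges have already been parsed into an ordered list of byte-pair tuples and `vocab` maps each byte string to its id (the actual `vocab.json`/`merges.txt` parsing in this repo is omitted):

```python
# Sketch of BPE encoding: apply learned merges in training order.
# `merges` and `vocab` formats here are assumptions, not this repo's file formats.
def bpe_encode(text: str,
               merges: list[tuple[bytes, bytes]],
               vocab: dict[bytes, int]) -> list[int]:
    symbols = [bytes([b]) for b in text.encode("utf-8")]
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        # Find the adjacent pair with the lowest (earliest-learned) rank.
        pairs = [(ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no applicable merges remain
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return [vocab[s] for s in symbols]
```

Decoding is the inverse: look up each id's byte string in the vocabulary and UTF-8-decode the concatenation.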
Evaluated on validation sets:
- OpenWebText (32K vocab): 4.37
- TinyStories (10K vocab): 4.12
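Assuming these figures are compression ratios in bytes per token (total UTF-8 bytes of the validation text divided by the number of tokens produced), a number like this can be reproduced in a few lines; `encode` here stands for any function mapping text to token ids:

```python
# Hypothetical sketch: compression ratio (bytes per token) on a held-out file,
# assuming the metric above is bytes/token and given some `encode` function.
def compression_ratio(path: str, encode) -> float:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    num_bytes = len(text.encode("utf-8"))
    num_tokens = len(encode(text))
    return num_bytes / num_tokens
```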
Wrote about the whole process here: Building a BPE Tokenizer from Scratch