Commands

General code structure heavily inspired by Andrej Karpathy's nanochat and nanoGPT repositories.

Switched from FlashAttention-2 (FA2) to torch varlen-attn (same backend) to simplify dependencies; however, this means torch 2.10 is required.

Build rustbpe with: uv run maturin develop --release --manifest-path rustbpe/Cargo.toml

Download a 100M-word plain-text sample of ClimbMix with: python -m scripts.sample_climbmix

Train tokenizer with: python -m scripts.train_tokenizer --dataset climbmix100Mwords.txt

Tokenize data with: python -m scripts.tokenize_data --dataset climbmix100Mwords.txt --tokenizer climbmix100Mwords_tokenizer.pkl

Subdivide tokenized data into 10%, 25%, and 50% subsets with: python -m scripts.subdivide_dataset --dataset climbmix100Mwords.parquet

Train model with: python -m scripts.train_model --dataset climbmix100Mwords.parquet

Convert to Hugging Face format (requires the uv group 'hf') with: python -m scripts.convert_hf model_name.pth --tokenizer tokenizer_name.pkl
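
The steps above can be chained in order. The following is a minimal sketch, not part of the repository: the names run, pipeline, DRY_RUN, and NAME are hypothetical, and it assumes each script writes its output under the file names used in the commands above. It defaults to a dry run that only prints the commands. The conversion step is left out because its file names are placeholders.

```shell
#!/bin/sh
# Sketch only: chains the documented commands in the order listed above.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
set -eu

DRY_RUN="${DRY_RUN:-1}"
NAME="climbmix100Mwords"  # base file name taken from the commands above

# Print the command in dry-run mode; execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

pipeline() {
    run uv run maturin develop --release --manifest-path rustbpe/Cargo.toml
    run python -m scripts.sample_climbmix
    run python -m scripts.train_tokenizer --dataset "$NAME.txt"
    run python -m scripts.tokenize_data --dataset "$NAME.txt" --tokenizer "${NAME}_tokenizer.pkl"
    run python -m scripts.subdivide_dataset --dataset "$NAME.parquet"
    run python -m scripts.train_model --dataset "$NAME.parquet"
}

pipeline
```

Run it with DRY_RUN=0 to execute the commands instead of printing them; set -e stops the chain at the first failing step.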
