A compact Transformer language model with utilities for tokenizer management, single-GPU and multi-GPU training, and text generation. The codebase targets fast experimentation on modest hardware while keeping close to modern PyTorch best practices.
- Rotary-aware decoder-only Transformer with tied embeddings and configurable depth/width (`model.py`); a sketch of the rotary-embedding idea follows this list.
- Tokenizer tooling for training or reusing byte-level BPE vocabularies (`data.py`).
- Training loops for both single GPU (`train.py`) and Distributed Data Parallel training (`data_parallel_train.py`) with AMP, gradient clipping, and cosine warmup scheduling.
- Structured configuration via dataclasses and YAML (`utils.py`, `configs.yaml`).
- Logging & monitoring through TensorBoard and `tqdm` progress bars.
- Inference utilities for prompt-based generation (`infer.py`).
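The rotary-aware attention rotates the query and key vectors by position-dependent angles instead of adding positional embeddings. The snippet below is only a minimal sketch of that idea, with illustrative names and shapes; the actual code in `model.py` may differ.

```python
# Minimal sketch of rotary position embeddings (RoPE) applied to queries/keys.
# Names and shapes are illustrative; model.py's implementation may differ.
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x by position-dependent angles.
    x: (batch, num_heads, seq_len, head_dim) with even head_dim."""
    b, h, t, d = x.shape
    # One frequency per channel pair.
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device, dtype=torch.float32) / d))
    angles = torch.arange(t, device=x.device, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()          # (seq_len, head_dim / 2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Standard 2D rotation applied to each (x1, x2) channel pair.
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

q = torch.randn(1, 4, 16, 64)                      # (batch, heads, seq, head_dim)
k = torch.randn(1, 4, 16, 64)
q, k = rotary_embedding(q), rotary_embedding(k)    # applied before attention scores
```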
The repository is laid out as follows:

```
SLM/
├── utils.py                  # Config dataclasses and YAML loader
├── data.py                   # Tokenizer helpers and Dataset definition
├── model.py                  # Transformer decoder implementation
├── infer.py                  # Prompt-based text generation script
├── train.py                  # Single-GPU training loop with evaluation
├── data_parallel_train.py    # DDP training entry point
├── configs.yaml              # YAML configuration consumed by Config dataclasses
├── requirements.txt          # Project dependencies
└── tokenizer/                # Stores pretrained or newly trained tokenizers
```
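`utils.py` holds the Config dataclasses that `configs.yaml` populates. A minimal sketch of that pattern is shown below; the class and field names are assumptions, and the real schema lives in `utils.py`.

```python
# Sketch of dataclass-backed configuration loaded from YAML.
# Class and field names here are assumptions; see utils.py for the real schema.
from dataclasses import dataclass

import yaml


@dataclass
class ModelConfig:
    vocab_size: int
    embed_dim: int
    num_heads: int
    num_blocks: int


@dataclass
class Config:
    model: ModelConfig
    # The data and training sections would be further dataclasses in the same style.


def load_config(path: str) -> Config:
    with open(path) as f:
        raw = yaml.safe_load(f)
    return Config(model=ModelConfig(**raw["model"]))
```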
- Install dependencies:

```bash
python -m venv slm_env && source slm_env/bin/activate
pip install -r requirements.txt
```
- Prepare a tokenizer (skip if `tokenizer/bpe_4096.json` is already available); a sketch of the underlying byte-level BPE training appears after this list:

```python
# quick one-off script
from data import train_tokenizer

train_tokenizer(["data/100-0.txt"], "tokenizer/bpe_4096.json", vocab_size=4096)
```
- Edit `configs.yaml` to point to your dataset, tokenizer, and desired hyperparameters.
- Single GPU / CPU (a sketch of the AMP training step appears after this list):

```bash
python train.py configs.yaml
```

Checkpoints land in `training.checkpoint_path`; the best model (by validation loss) is updated automatically.
- Multi-GPU Distributed (uses NCCL + DDP; a sketch of the DDP setup appears after this list):

```bash
torchrun --standalone --nproc_per_node=NUM_GPUS data_parallel_train.py configs.yaml
```

Snapshots and TensorBoard logs are controlled by the training config (`save_every`, `eval_every`, `logdir`).
- Resume Training: set `training.checkpoint_path` (single GPU) or `training.snapshot_path` (DDP) to an existing file. The trainer detects and reloads state automatically.
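For reference, the tokenizer-preparation step above typically boils down to a byte-level BPE trainer from the Hugging Face `tokenizers` package. The snippet below is only a sketch of that pattern, not the exact body of `data.train_tokenizer`; the special tokens and settings are assumptions.

```python
# Sketch of byte-level BPE training with the `tokenizers` package.
# data.train_tokenizer may differ in special tokens and pre-tokenization settings.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer


def train_bpe(files: list[str], out_path: str, vocab_size: int = 4096) -> None:
    tokenizer = Tokenizer(BPE(unk_token="<unk>"))
    tokenizer.pre_tokenizer = ByteLevel()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["<unk>", "<pad>"])
    tokenizer.train(files, trainer)
    tokenizer.save(out_path)
```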
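The single-GPU loop combines AMP, gradient clipping, warmup/cosine scheduling, and best-model checkpointing. The sketch below shows how those pieces usually fit together; it is not copied from `train.py`, and the checkpoint keys are assumptions.

```python
# Sketch of one AMP training step with gradient clipping and LR scheduling,
# plus best-model checkpointing. Names and checkpoint keys are illustrative.
import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()


def training_step(model, batch, optimizer, scheduler, clip_norm=1.0):
    inputs, targets = batch
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(inputs)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                     # so clipping sees real gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()                               # e.g. linear warmup + cosine decay
    return loss.item()


def save_if_best(model, val_loss, best_loss, path="checkpoints/best_model.pth"):
    if val_loss < best_loss:
        torch.save({"model": model.state_dict(), "val_loss": val_loss}, path)
        return val_loss
    return best_loss
```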
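`data_parallel_train.py` is driven by `torchrun`, which provides the rank environment variables. Below is a minimal sketch of the usual NCCL/DDP setup and snapshot resume; the snapshot keys are assumptions and the real script may organize this differently.

```python
# Sketch of NCCL/DDP initialization and snapshot resume under torchrun.
# Snapshot keys are assumptions; data_parallel_train.py may differ.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp(model, snapshot_path):
    dist.init_process_group(backend="nccl")        # torchrun sets RANK/WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    start_epoch = 0
    if os.path.exists(snapshot_path):              # resume if a snapshot exists
        snapshot = torch.load(snapshot_path, map_location=f"cuda:{local_rank}")
        model.load_state_dict(snapshot["model"])
        start_epoch = snapshot.get("epoch", 0)
    return DDP(model, device_ids=[local_rank]), start_epoch
```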
Generate text from a trained checkpoint:

```bash
python infer.py \
  --config configs.yaml \
  --ckpt_path checkpoints/best_model.pth \
  --prompt "Once upon a time" \
  --max-length 100
```

Temperature scaling and top-k filtering are supported inside `infer.py`.
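For reference, temperature scaling and top-k filtering usually look like the following. This is a sketch, not the exact sampling loop inside `infer.py`.

```python
# Sketch of temperature scaling and top-k filtering for next-token sampling.
# infer.py's actual generation loop may differ.
import torch


def sample_next_token(logits: torch.Tensor, temperature: float = 1.0, top_k: int = 50):
    """logits: (batch, vocab_size) for the last position."""
    logits = logits / max(temperature, 1e-8)
    if top_k is not None:
        kth_value = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (batch, 1) token ids
```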
- Launch TensorBoard to inspect training curves:

```bash
tensorboard --logdir runs
```

- `tqdm` progress bars show per-device status during training.
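The training scripts write the scalars that TensorBoard displays; a minimal sketch of that logging pattern follows (the tag names are illustrative, not the ones used by the scripts).

```python
# Minimal sketch of TensorBoard scalar logging; tag names are illustrative.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs")
writer.add_scalar("loss/train", 2.31, global_step=100)
writer.add_scalar("lr", 3e-4, global_step=100)
writer.close()
```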
`configs.yaml` is organized into three sections:

- `model`: vocabulary size, embedding dimension, head count, feedforward size, number of decoder blocks.
- `data`: tokenizer path, raw text path, context length, batch size, validation split ratio, dataloader workers.
- `training`: epochs, warmup steps, base learning rate, gradient clipping threshold, gradient accumulation, logging/checkpoint cadence, and output paths.
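An illustrative excerpt covering those three sections is shown below; the key names and values are assumptions for orientation only, not the exact schema shipped in `configs.yaml`.

```yaml
# Illustrative only; key names and values are assumptions, not the real schema.
model:
  vocab_size: 4096
  embed_dim: 384
  num_heads: 6
  ff_dim: 1536
  num_blocks: 6
data:
  tokenizer_path: tokenizer/bpe_4096.json
  text_path: data/100-0.txt
  context_length: 256
  batch_size: 32
  val_split: 0.1
  num_workers: 2
training:
  epochs: 10
  warmup_steps: 500
  lr: 3.0e-4
  grad_clip: 1.0
  grad_accum_steps: 1
  eval_every: 500
  save_every: 1000
  checkpoint_path: checkpoints/best_model.pth
  logdir: runs
```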
This project is released under the terms of the LICENSE file included in the repository.