A compact Transformer language model with utilities for tokenizer management, single-GPU and multi-GPU training, and text generation. The codebase targets fast experimentation on modest hardware while keeping close to modern PyTorch best practices.
- Rotary-aware decoder-only Transformer with tied embeddings and configurable depth/width (`model.py`); a sketch of the rotary-embedding idea follows this list.
- Tokenizer tooling for training or reusing byte-level BPE vocabularies (`data.py`).
- Training loops for both single GPU (`train.py`) and Distributed Data Parallel training (`data_parallel_train.py`) with AMP, gradient clipping, and cosine warmup scheduling.
- Structured configuration via dataclasses and YAML (`utils.py`, `configs.yaml`).
- Logging & monitoring through TensorBoard and `tqdm` progress bars.
- Inference utilities for prompt-based generation (`infer.py`).
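The rotary-aware attention rotates the query and key vectors by position-dependent angles instead of adding positional embeddings. The snippet below is only a minimal sketch of that idea, with illustrative names and shapes; the actual code in `model.py` may differ.

```python
# Minimal sketch of rotary position embeddings (RoPE) applied to queries/keys.
# Names and shapes are illustrative; model.py's implementation may differ.
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x by position-dependent angles.
    x: (batch, num_heads, seq_len, head_dim) with even head_dim."""
    b, h, t, d = x.shape
    # One frequency per channel pair.
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device, dtype=torch.float32) / d))
    angles = torch.arange(t, device=x.device, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()          # (seq_len, head_dim / 2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Standard 2D rotation applied to each (x1, x2) channel pair.
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

q = torch.randn(1, 4, 16, 64)                      # (batch, heads, seq, head_dim)
k = torch.randn(1, 4, 16, 64)
q, k = rotary_embedding(q), rotary_embedding(k)    # applied before attention scores
```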
The repository is laid out as follows:

```
SLM/
├── utils.py                  # Config dataclasses and YAML loader
├── data.py                   # Tokenizer helpers and Dataset definition
├── model.py                  # Transformer decoder implementation
├── infer.py                  # Prompt-based text generation script
├── train.py                  # Single-GPU training loop with evaluation
├── data_parallel_train.py    # DDP training entry point
├── configs.yaml              # YAML configuration consumed by Config dataclasses
├── requirements.txt          # Project dependencies
└── tokenizer/                # Stores pretrained or newly trained tokenizers
```
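`utils.py` holds the Config dataclasses that `configs.yaml` populates. A minimal sketch of that pattern is shown below; the class and field names are assumptions, and the real schema lives in `utils.py`.

```python
# Sketch of dataclass-backed configuration loaded from YAML.
# Class and field names here are assumptions; see utils.py for the real schema.
from dataclasses import dataclass

import yaml


@dataclass
class ModelConfig:
    vocab_size: int
    embed_dim: int
    num_heads: int
    num_blocks: int


@dataclass
class Config:
    model: ModelConfig
    # The data and training sections would be further dataclasses in the same style.


def load_config(path: str) -> Config:
    with open(path) as f:
        raw = yaml.safe_load(f)
    return Config(model=ModelConfig(**raw["model"]))
```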
- Install dependencies:

```bash
python -m venv slm_env && source slm_env/bin/activate
pip install -r requirements.txt
```
- Prepare a tokenizer (skip if `tokenizer/bpe_4096.json` is already available); a sketch of the underlying byte-level BPE training appears after this list:

```python
# quick one-off script
from data import train_tokenizer

train_tokenizer(["data/100-0.txt"], "tokenizer/bpe_4096.json", vocab_size=4096)
```
- Edit `configs.yaml` to point to your dataset, tokenizer, and desired hyperparameters.
- Single GPU / CPU (a sketch of the AMP training step appears after this list):

```bash
python train.py configs.yaml
```

Checkpoints land in `training.checkpoint_path`; the best model (by validation loss) is updated automatically.
- Multi-GPU Distributed (uses NCCL + DDP; a sketch of the DDP setup appears after this list):

```bash
torchrun --standalone --nproc_per_node=NUM_GPUS data_parallel_train.py configs.yaml
```

Snapshots and TensorBoard logs are controlled by the training config (`save_every`, `eval_every`, `logdir`).
- Resume Training: set `training.checkpoint_path` (single GPU) or `training.snapshot_path` (DDP) to an existing file. The trainer detects and reloads state automatically.
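For reference, the tokenizer-preparation step above typically boils down to a byte-level BPE trainer from the Hugging Face `tokenizers` package. The snippet below is only a sketch of that pattern, not the exact body of `data.train_tokenizer`; the special tokens and settings are assumptions.

```python
# Sketch of byte-level BPE training with the `tokenizers` package.
# data.train_tokenizer may differ in special tokens and pre-tokenization settings.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer


def train_bpe(files: list[str], out_path: str, vocab_size: int = 4096) -> None:
    tokenizer = Tokenizer(BPE(unk_token="<unk>"))
    tokenizer.pre_tokenizer = ByteLevel()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["<unk>", "<pad>"])
    tokenizer.train(files, trainer)
    tokenizer.save(out_path)
```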
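The single-GPU loop combines AMP, gradient clipping, warmup/cosine scheduling, and best-model checkpointing. The sketch below shows how those pieces usually fit together; it is not copied from `train.py`, and the checkpoint keys are assumptions.

```python
# Sketch of one AMP training step with gradient clipping and LR scheduling,
# plus best-model checkpointing. Names and checkpoint keys are illustrative.
import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()


def training_step(model, batch, optimizer, scheduler, clip_norm=1.0):
    inputs, targets = batch
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(inputs)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                     # so clipping sees real gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()                               # e.g. linear warmup + cosine decay
    return loss.item()


def save_if_best(model, val_loss, best_loss, path="checkpoints/best_model.pth"):
    if val_loss < best_loss:
        torch.save({"model": model.state_dict(), "val_loss": val_loss}, path)
        return val_loss
    return best_loss
```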
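`data_parallel_train.py` is driven by `torchrun`, which provides the rank environment variables. Below is a minimal sketch of the usual NCCL/DDP setup and snapshot resume; the snapshot keys are assumptions and the real script may organize this differently.

```python
# Sketch of NCCL/DDP initialization and snapshot resume under torchrun.
# Snapshot keys are assumptions; data_parallel_train.py may differ.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp(model, snapshot_path):
    dist.init_process_group(backend="nccl")        # torchrun sets RANK/WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    start_epoch = 0
    if os.path.exists(snapshot_path):              # resume if a snapshot exists
        snapshot = torch.load(snapshot_path, map_location=f"cuda:{local_rank}")
        model.load_state_dict(snapshot["model"])
        start_epoch = snapshot.get("epoch", 0)
    return DDP(model, device_ids=[local_rank]), start_epoch
```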
Generate text from a trained checkpoint:

```bash
python infer.py \
  --config configs.yaml \
  --ckpt_path checkpoints/best_model.pth \
  --prompt "Once upon a time" \
  --max-length 100
```

Temperature scaling and top-k filtering are supported inside `infer.py`.
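For reference, temperature scaling and top-k filtering usually look like the following. This is a sketch, not the exact sampling loop inside `infer.py`.

```python
# Sketch of temperature scaling and top-k filtering for next-token sampling.
# infer.py's actual generation loop may differ.
import torch


def sample_next_token(logits: torch.Tensor, temperature: float = 1.0, top_k: int = 50):
    """logits: (batch, vocab_size) for the last position."""
    logits = logits / max(temperature, 1e-8)
    if top_k is not None:
        kth_value = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (batch, 1) token ids
```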
- Launch TensorBoard to inspect training curves:

```bash
tensorboard --logdir runs
```

- `tqdm` progress bars show per-device status during training.
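The training scripts write the scalars that TensorBoard displays; a minimal sketch of that logging pattern follows (the tag names are illustrative, not the ones used by the scripts).

```python
# Minimal sketch of TensorBoard scalar logging; tag names are illustrative.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs")
writer.add_scalar("loss/train", 2.31, global_step=100)
writer.add_scalar("lr", 3e-4, global_step=100)
writer.close()
```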
`configs.yaml` is organized into three sections:

- `model`: vocabulary size, embedding dimension, head count, feedforward size, number of decoder blocks.
- `data`: tokenizer path, raw text path, context length, batch size, validation split ratio, dataloader workers.
- `training`: epochs, warmup steps, base learning rate, gradient clipping threshold, gradient accumulation, logging/checkpoint cadence, and output paths.
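An illustrative excerpt covering those three sections is shown below; the key names and values are assumptions for orientation only, not the exact schema shipped in `configs.yaml`.

```yaml
# Illustrative only; key names and values are assumptions, not the real schema.
model:
  vocab_size: 4096
  embed_dim: 384
  num_heads: 6
  ff_dim: 1536
  num_blocks: 6
data:
  tokenizer_path: tokenizer/bpe_4096.json
  text_path: data/100-0.txt
  context_length: 256
  batch_size: 32
  val_split: 0.1
  num_workers: 2
training:
  epochs: 10
  warmup_steps: 500
  lr: 3.0e-4
  grad_clip: 1.0
  grad_accum_steps: 1
  eval_every: 500
  save_every: 1000
  checkpoint_path: checkpoints/best_model.pth
  logdir: runs
```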
This project is released under the terms of the LICENSE file included in the repository.