This repository contains a custom implementation of a decoder-only GPT model trained on the Tiny Shakespeare dataset, using the MLX library.
The architecture follows the Transformer decoder from *Attention Is All You Need*:
- Embedding & Positional Encoding: Standard token embeddings with sinusoidal positional encodings.
- Transformer Blocks: 6 layers of decoder blocks, each with:
  - Multi-Head Self-Attention (6 heads, model dimension 384) with causal masking
  - Feed-Forward Network (hidden dimension 4× model dimension)
  - Layer Normalization and residual connections
- Output: Linear projection to the vocabulary size for next-token prediction.
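The sinusoidal positional encodings mentioned above can be sketched as follows (a minimal NumPy illustration of the formulation from *Attention Is All You Need*; the repository's `model.py` may construct them differently):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal position matrix: sin on even dims, cos on odd dims."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even indices: sine
    pe[:, 1::2] = np.cos(angles)  # odd indices: cosine
    return pe

# Encodings for this model's configuration: 256 positions, dimension 384
pe = sinusoidal_positional_encoding(256, 384)
print(pe.shape)  # (256, 384)
```

Because the encodings are fixed functions of position, they add no trainable parameters and are simply summed with the token embeddings.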
- Sequence length: 256
- Batch size: 32
- Dropout: 0.1 throughout the model
- Loss function: Cross-Entropy
- Optimizer: AdamW with weight decay = 0.1
- Learning rate schedule: hybrid schedule similar to OneCycle:
  - Linear warmup for 10% of total steps
  - Cosine decay to the final LR for the remaining steps
- Max LR: 3e-4
- Total steps: 10,000
- The best model is saved based on the lowest validation loss during training.
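The warmup-plus-cosine schedule above can be sketched as a plain Python function (the `final_lr` floor of 3e-5 is an assumption for illustration; `train.py` may use a different value):

```python
import math

def lr_schedule(step: int, total_steps: int = 10_000,
                max_lr: float = 3e-4, final_lr: float = 3e-5) -> float:
    """Linear warmup for the first 10% of steps, then cosine decay to final_lr.

    final_lr is a hypothetical floor chosen for this sketch.
    """
    warmup_steps = int(0.1 * total_steps)
    if step < warmup_steps:
        # Linear warmup from ~0 up to max_lr
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to final_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return final_lr + 0.5 * (max_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

The two phases join continuously: the warmup reaches `max_lr` exactly at step 1,000, where the cosine term starts at its peak.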
- `modules/` – core modules for the project:
  - `dataloader.py` – responsible for creating training batches
  - `model.py` – contains all GPT building blocks and the full model implementation
  - `tokenizer.py` – implements a character-level tokenizer with encoding and decoding
- `analysis.ipynb` – notebook with training and inference analysis
- `train.py` – full training script
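A character-level tokenizer like the one in `tokenizer.py` can be sketched as follows (a minimal illustration; the repository's class name and interface may differ):

```python
class CharTokenizer:
    """Minimal character-level tokenizer: one integer id per unique character."""

    def __init__(self, text: str):
        chars = sorted(set(text))  # vocabulary = unique characters in the corpus
        self.stoi = {ch: i for i, ch in enumerate(chars)}  # string -> id
        self.itos = {i: ch for i, ch in enumerate(chars)}  # id -> string
        self.vocab_size = len(chars)

    def encode(self, s: str) -> list:
        return [self.stoi[ch] for ch in s]

    def decode(self, ids: list) -> str:
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("to be or not to be")
ids = tok.encode("to be")
print(tok.decode(ids))  # "to be"
```

Character-level tokenization keeps the vocabulary tiny (for Tiny Shakespeare, a few dozen symbols), which is why the output projection in the model is small, at the cost of longer token sequences than subword schemes.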
Created by Denys Bondarchuk. Feel free to reach out or contribute to the project.