This repository is a hands-on guide to building a ChatGPT-like LLM in PyTorch. It breaks the architecture into simple parts and explains each one step by step.
Let's take a bird's-eye view of the architecture of a Generative Pretrained Transformer (GPT)-like LLM.
Example: Every moment is a beginning
LLMs generate text iteratively, predicting one token at a time. Each predicted token is appended to the previous input to form the context for the next prediction.
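The append-and-repeat loop can be sketched in a few lines of plain Python. Here `toy_next_token` and the `NEXT_TOKEN` table are hypothetical stand-ins for a trained model, so only the structure of the loop is meant to carry over.

```python
# Hypothetical stand-in for a trained LLM: maps the last token to a "predicted" next token.
NEXT_TOKEN = {"Every": "moment", "moment": "is", "is": "a", "a": "beginning"}


def toy_next_token(context):
    """Return the next token for the given context, or None if nothing follows."""
    return NEXT_TOKEN.get(context[-1])


def generate(prompt, max_new_tokens):
    """Repeatedly predict one token and append it to the context."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = toy_next_token(context)
        if token is None:
            break
        context.append(token)  # the prediction becomes part of the next input
    return context


generated = generate(["Every"], max_new_tokens=10)
print(" ".join(generated))  # Every moment is a beginning
```

A real model would return a probability distribution over the whole vocabulary at each step, and the next token would be picked or sampled from it.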
- Tokenization
- Token Embeddings
- Positional Embeddings
- Self Attention Mechanism
- Masked Multi-Head Attention
- Feedforward Neural Networks
- Residual Connections
- Layer Normalization
- Transformer Block
- MiniGPT
Dive into the hands-on examples for each LLM component using interactive Jupyter notebooks.
| Topic | Code |
|---|---|
| Tokenization | 01_tokenization.ipynb |
| Token Embeddings | 02_token_embeddings.ipynb |
| Positional Embeddings | 03_positional_embeddings.ipynb |
| Self Attention Mechanism | 04_self_attention_mechanism.ipynb |
| Masked Multi-Head Attention | 05_masked_multi_head_attention.ipynb |
| Feedforward Neural Networks | 06_feedforward_neural_networks.ipynb |
| Residual Connections | 07_residual_connections.ipynb |
| Layer Normalization | 08_layer_normalization.ipynb |
| Transformer Block | 09_transformer_block.ipynb |
| MiniGPT | 10_mini_gpt.ipynb |
pip install -r requirements.txt
If you're installing torch with CUDA support, make sure to use the correct installation command from PyTorch's official website, as some versions require a specific installation method.
Tokenization is the process of splitting a text into smaller units called tokens. These tokens are the fundamental building blocks an LLM works with.
Input Sentence: “Every moment is a beginning”
Tokens: [“Every”, “moment”, “is”, “a”, “beginning”]
This shows how a tokenizer can split a sentence into tokens. After tokenization, each unique token is assigned a unique numerical ID.
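As a rough illustration, the sketch below uses a whitespace tokenizer and a vocabulary built from the sentence itself. Both `tokenize` and `build_vocab` are hypothetical helpers; production tokenizers typically use subword schemes such as byte pair encoding, so their tokens and IDs will differ from the ones printed here.

```python
def tokenize(text):
    """Split on whitespace (an assumption for illustration, not a subword tokenizer)."""
    return text.split()


def build_vocab(tokens):
    """Assign each unique token an integer ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


tokens = tokenize("Every moment is a beginning")
vocab = build_vocab(tokens)
token_ids = [vocab[token] for token in tokens]

print(tokens)     # ['Every', 'moment', 'is', 'a', 'beginning']
print(token_ids)  # [0, 1, 2, 3, 4]
```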
Here’s a simple visual showing tokenization:
Now we have a list of numbers, but these numbers alone don’t carry any meaning. The ID “15745” for “Every” does not contain information about how the token is used in language. This is where embeddings help.
Token embeddings are numerical representations of tokens: each token is mapped to a long list of numbers (a vector) that describes its characteristics.
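In PyTorch, this lookup is usually done with `nn.Embedding`. The sketch below assumes a toy vocabulary of 5 tokens and an embedding size of 8; both sizes and the token IDs are illustrative choices, not values taken from the notebooks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size = 5  # assumed toy vocabulary: one entry per word of the example sentence
embed_dim = 8   # assumed embedding size; real models use far larger vectors

# A learnable table with one row (vector) per token ID
token_embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([0, 1, 2, 3, 4])  # hypothetical IDs for the example sentence
token_vectors = token_embedding(token_ids)

print(token_vectors.shape)  # torch.Size([5, 8])
```

The vectors start out random and are adjusted during training, which is how they come to reflect how each token is used.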
Imagine the sentences:
- The dog jumps on the cat.
- The cat jumps on the dog.
The words are the same, but the meaning is entirely different because their positions are different. Our numerical token IDs and token embeddings, by themselves, don’t tell the LLM anything about the order of words.
This is solved with Positional Embeddings.
Positional embeddings are another list of numbers (a vector) added to the token embeddings. These vectors help the model understand the absolute or relative position of each token in the sequence.
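One common choice is a second learned embedding table indexed by position. The sketch below assumes that approach and toy sizes; whether the notebooks use learned or fixed positional embeddings is not stated here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, block_size, embed_dim = 5, 16, 8  # assumed toy sizes

token_embedding = nn.Embedding(vocab_size, embed_dim)
position_embedding = nn.Embedding(block_size, embed_dim)  # one learned vector per position

token_ids = torch.tensor([[0, 1, 2, 3, 4]])   # shape (batch=1, seq_len=5)
positions = torch.arange(token_ids.shape[1])  # tensor([0, 1, 2, 3, 4])

# Position vectors have shape (seq_len, embed_dim) and broadcast over the batch
x = token_embedding(token_ids) + position_embedding(positions)

print(x.shape)  # torch.Size([1, 5, 8])
```

Because the position vector differs at each index, the same token ends up with a different input vector depending on where it appears in the sequence.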
The Self Attention Mechanism allows the model to understand how words relate to each other. Instead of reading each word in isolation, every token looks at the other tokens and decides which ones matter most.
Take this sentence:
“Every moment is a beginning.”
To understand the word “beginning”, the model pays attention to words like “moment” and “Every”. This gives context and helps the model capture the idea that each moment can represent a fresh start.
Step 1: To achieve this mathematically, the model uses three vectors derived from the input embeddings: Queries (Q), Keys (K), and Values (V).
Step 2: We compute the dot product between all Queries and Keys to measure how well they match.
Step 3: The result is divided by the square root of the key dimension, `d_k`, to keep values stable during training.
Step 4: Apply softmax to obtain attention weights.
Step 5: Calculate the context vectors as the attention-weighted sum of the Values.
Complete Self Attention
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
After attention, each token now contains information gathered from other tokens in the sequence. This is the core idea behind transformers.
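The five steps map almost line for line onto PyTorch. This is a minimal single-head sketch with assumed toy sizes, using random inputs in place of real embeddings.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, embed_dim = 5, 8            # assumed toy sizes
x = torch.randn(seq_len, embed_dim)  # stand-in for token + positional embeddings

# Step 1: project the inputs into Queries, Keys, and Values
W_q = nn.Linear(embed_dim, embed_dim, bias=False)
W_k = nn.Linear(embed_dim, embed_dim, bias=False)
W_v = nn.Linear(embed_dim, embed_dim, bias=False)
Q, K, V = W_q(x), W_k(x), W_v(x)

# Step 2: dot product between every Query and every Key -> (seq_len, seq_len)
scores = Q @ K.T

# Step 3: divide by the square root of the key dimension
scores = scores / math.sqrt(K.shape[-1])

# Step 4: softmax turns each row of scores into weights that sum to 1
weights = F.softmax(scores, dim=-1)

# Step 5: each context vector is a weighted sum of the Values
context = weights @ V

print(weights.shape, context.shape)
```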
In standard self attention, each token can attend to all other tokens in the sequence. But in language models, future tokens should not be visible during prediction.
Causal self attention solves this using a mask that blocks access to future tokens.
The mask looks like this:
- `1` means attention is allowed
- `0` means attention is blocked
This is implemented by masking the blocked positions and replacing their attention scores with negative infinity before applying the softmax function. After softmax, these positions receive a probability of 0, preventing the model from attending to future tokens.
This ensures:
- token 1 sees only itself
- token 2 sees tokens 1 and 2
- token 3 sees tokens 1, 2, and 3
Now each token can only attend to itself and previous tokens. This is the mechanism used in decoder-only transformer models like GPT.
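For a 3-token sequence, the mask and its effect on the attention weights can be sketched as follows. The scores are random placeholders; only the masking pattern matters.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len = 3
scores = torch.randn(seq_len, seq_len)  # stand-in for scaled attention scores

# Lower-triangular mask: row i has ones for positions 0..i
mask = torch.tril(torch.ones(seq_len, seq_len))
print(mask)
# tensor([[1., 0., 0.],
#         [1., 1., 0.],
#         [1., 1., 1.]])

# Blocked positions get -inf, so softmax assigns them a weight of exactly 0
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)
print(weights)
```

The first row of `weights` puts all of its weight on token 1, and every entry above the diagonal is zero.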
A single attention head tends to capture one type of relationship at a time. To capture grammar, meaning, long-range dependencies, and subject-object relationships simultaneously, the model uses Masked Multi-Head Attention. Multiple attention heads run in parallel, each looking at the same sentence in a different way.
For “Every moment is a beginning,” different heads might focus on different nuances:
Attention Head 1 (Meaning): Connects “moment” ←→ “beginning” to understand the concept of renewal.
Attention Head 2 (Grammar): Connects “Every” ←→ “moment” to understand that “Every” is describing “moment”.
Attention Head 3 (Structure): Connects “is” ←→ “beginning” to anchor the main statement of the sentence.
How it works:
- The query, key, and value projections of the input embeddings are split into smaller parts, one per head.
- Each head performs attention independently.
- The outputs from all heads are concatenated together.
- A final linear layer combines the information into one representation.
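The four steps above can be sketched as a small module. The class name `MaskedMultiHeadAttention`, its arguments, and the fused `qkv` projection are illustrative choices rather than the notebook's actual code.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedMultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, block_size):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)  # Q, K, V in one projection
        self.out_proj = nn.Linear(embed_dim, embed_dim)  # final linear layer
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)  # each (B, T, C)
        # Split into heads: (B, num_heads, T, head_dim)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        # Each head performs causal attention independently
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        out = weights @ v  # (B, num_heads, T, head_dim)
        # Concatenate the heads back into (B, T, C)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.out_proj(out)


torch.manual_seed(0)
mha = MaskedMultiHeadAttention(embed_dim=8, num_heads=2, block_size=16)
x = torch.randn(1, 5, 8)
y = mha(x)
print(y.shape)  # torch.Size([1, 5, 8])
```

With `embed_dim=8` and `num_heads=2`, each head works on 4 features, and the output has the same shape as the input.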
Attention allows tokens to communicate with each other and exchange information across the sequence.
For example, in the sentence:
"Every moment is a beginning."
Attention helps the token "beginning" gather context from words like "moment" and "Every".
But after this information is mixed together, each token still needs additional processing to learn more complex patterns. This is the role of the Feed Forward Network, often called the FFN or MLP block.
A Feedforward Neural Network typically consists of two linear layers with an activation function (like GELU) in between, temporarily expanding the hidden dimension (often by 4x) to help the model learn more complex patterns.
- Linear layer
- Activation function
- Second linear layer
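A minimal sketch of this block, assuming GELU and a 4x expansion as described above (the class name `FeedForward` and the toy sizes are illustrative):

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),  # expand
            nn.GELU(),                         # non-linearity
            nn.Linear(hidden_dim, embed_dim),  # project back
        )

    def forward(self, x):
        return self.net(x)


torch.manual_seed(0)
embed_dim = 8
ffn = FeedForward(embed_dim, hidden_dim=4 * embed_dim)

x = torch.randn(1, 5, embed_dim)  # (batch, seq_len, embed_dim)
y = ffn(x)
print(y.shape)  # torch.Size([1, 5, 8])
```

The same weights are applied to every token position independently, so unlike attention, the FFN does not mix information between tokens.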
Deep neural networks are difficult to train.
As networks become deeper, gradients can become extremely small or extremely large during backpropagation. This is known as the vanishing gradient or exploding gradient problem.
When gradients vanish, earlier layers learn very slowly because the training signal fades as it moves backward through many layers.
Residual connections, also called skip connections, help solve this problem. Instead of learning a completely new transformation, the model learns how to modify the input relative to its original value.
The original input is added back to the output of a layer: `output = x + Layer(x)`.
Transformers use residual connections around both:
- Masked multi-head attention
- Feedforward neural networks
Residual connections help transformers:
- train deeper networks
- stabilize gradients
- preserve information
They are one of the core building blocks of modern deep learning architectures.
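As a small illustration, the hypothetical `Residual` wrapper below adds a sublayer's input back to its output. Zeroing the sublayer's weights shows the "preserve information" property: the block then passes its input through unchanged.

```python
import torch
import torch.nn as nn


class Residual(nn.Module):
    """Wraps a sublayer so that its input is added back to its output."""

    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(x)


torch.manual_seed(0)

layer = nn.Linear(8, 8)
nn.init.zeros_(layer.weight)  # make the sublayer output all zeros
nn.init.zeros_(layer.bias)

block = Residual(layer)
x = torch.randn(5, 8)

# With a sublayer that contributes nothing, the block is the identity
print(torch.equal(block(x), x))  # True
```

The addition also gives gradients a direct path back to earlier layers, which is why the skip connection helps with vanishing gradients.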
Neural network activations can become unstable during training. As data passes through many layers, the values can grow too large or become too small. This makes optimization difficult and can slow down learning.
Layer Normalization helps stabilize these activations. It normalizes the features of each token independently so that the values maintain:
- mean ≈ 0
- standard deviation ≈ 1
This makes training faster, more stable, and more reliable.
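Suppose a token embedding is a vector $x = (x_1, \dots, x_d)$ with $d$ features.
LayerNorm computes:
- The mean: $\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$
- The variance: $\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$
- The normalized output: $\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$, where $\epsilon$ is a small constant that prevents division by zero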
This transforms the features so they have approximately zero mean and unit variance.
Layer normalization is applied multiple times inside each transformer block.
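The computation can be checked by hand against `nn.LayerNorm`. The feature values below are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import torch
import torch.nn as nn

x = torch.tensor([[2.0, 4.0, 6.0, 8.0]])  # hypothetical token features
eps = 1e-5

# Statistics are computed over the feature dimension of each token
mean = x.mean(dim=-1, keepdim=True)                # 5.0
var = x.var(dim=-1, keepdim=True, unbiased=False)  # 5.0 (population variance)
x_hat = (x - mean) / torch.sqrt(var + eps)

# nn.LayerNorm also has a learnable scale and shift, initialised to 1 and 0
layer_norm = nn.LayerNorm(4, eps=eps)

print(x_hat)
print(torch.allclose(x_hat, layer_norm(x), atol=1e-6))
```

Note that `nn.LayerNorm` divides by the number of features (the biased variance), which is why `unbiased=False` is used in the manual version.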
A transformer block combines:
- Masked multi-head attention
- Feedforward neural network
- Residual connections
- Layer normalization
The flow through a Transformer block is:
- Layer Normalization: Normalizes the input representations to improve training stability.
- Masked Multi-Head Attention: Allows each token to gather information from itself and previous tokens while preventing access to future tokens.
- Residual Connection (Add): The original input is added back to the attention output, helping preserve information and improve gradient flow.
- Layer Normalization: Re-normalizes the updated representations before further processing.
- Feedforward Neural Network (FFN): Applies non-linear transformations to learn more complex patterns and relationships.
- Residual Connection (Add): The input from before the second Layer Normalization is added to the FFN output, preserving information while incorporating the new transformations.
Note: Dropout is often applied after the attention and feedforward operations during training. This helps reduce overfitting and improves the model’s ability to generalize.
Modern GPT models stack many transformer blocks on top of each other. Each block refines the token representations.
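A compact sketch of one such block follows. To keep it short, it uses PyTorch's built-in `nn.MultiheadAttention` with a causal mask as a stand-in for a hand-written attention layer; the class name `TransformerBlock`, the dropout rate, and the toy sizes are all assumptions.

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, hidden_dim, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(
            embed_dim, num_heads, dropout=dropout, batch_first=True
        )
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        T = x.shape[1]
        # For a boolean attn_mask, True marks positions that must NOT be attended to
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)                              # layer normalization
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + self.dropout(attn_out)               # residual connection
        x = x + self.dropout(self.ffn(self.ln2(x)))  # norm, FFN, residual connection
        return x


torch.manual_seed(0)
block = TransformerBlock(embed_dim=8, num_heads=2, hidden_dim=32)
block.eval()  # disable dropout for a deterministic forward pass

x = torch.randn(1, 5, 8)
y = block(x)
print(y.shape)  # torch.Size([1, 5, 8])
```

Because the input and output shapes match, blocks like this can be stacked one after another.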
MiniGPT is a small GPT-style language model built using transformer blocks.
It combines:
- Token embeddings
- Positional embeddings
- Transformer blocks
- Layer normalization
- Output layer
The model processes input tokens and predicts the next token in the sequence.
| Parameter | Description |
|---|---|
| `vocab_size` | Total number of tokens in the vocabulary |
| `block_size` | Maximum sequence length |
| `embed_dim` | Size of token embeddings |
| `num_heads` | Number of attention heads |
| `hidden_dim` | Hidden size of the feedforward neural network |
| `num_layers` | Number of transformer blocks |
Input Tokens
↓
Token Embeddings
↓
Positional Embeddings
↓
Transformer Blocks
↓
LayerNorm
↓
Output Layer
MiniGPT is trained autoregressively. It predicts the next token using previous tokens. This is the core idea behind GPT-style language models.
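The sketch below puts the pieces together and computes a next-token training loss. It reuses the parameter names from the table, but the `Block` and `MiniGPT` classes, the use of `nn.MultiheadAttention`, and the toy sizes are assumptions made for illustration, not the notebook's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """Compact pre-LayerNorm transformer block (dropout omitted for brevity)."""

    def __init__(self, embed_dim, num_heads, hidden_dim):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, embed_dim)
        )

    def forward(self, x):
        T = x.shape[1]
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=mask, need_weights=False)[0]
        return x + self.ffn(self.ln2(x))


class MiniGPT(nn.Module):
    def __init__(self, vocab_size, block_size, embed_dim, num_heads, hidden_dim, num_layers):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.position_embedding = nn.Embedding(block_size, embed_dim)
        self.blocks = nn.ModuleList(
            [Block(embed_dim, num_heads, hidden_dim) for _ in range(num_layers)]
        )
        self.ln_f = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, vocab_size)  # output layer

    def forward(self, token_ids):
        positions = torch.arange(token_ids.shape[1], device=token_ids.device)
        x = self.token_embedding(token_ids) + self.position_embedding(positions)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))  # logits: (batch, seq_len, vocab_size)


torch.manual_seed(0)
model = MiniGPT(vocab_size=5, block_size=16, embed_dim=8,
                num_heads=2, hidden_dim=32, num_layers=2)

token_ids = torch.tensor([[0, 1, 2, 3, 4]])  # hypothetical IDs for the example sentence
inputs, targets = token_ids[:, :-1], token_ids[:, 1:]  # targets are inputs shifted by one

logits = model(inputs)
loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]), targets.reshape(-1))
print(logits.shape, loss.item())
```

Each position's logits are scored against the token that actually comes next, so a single forward pass yields one training signal per position.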
Read the full breakdown and insights in the accompanying blogs.
- A Visual Guide to LLMs (Part 1): Text to Numbers: Tokenization and Embeddings
- A Visual Guide to LLMs (Part 2): Inside the Transformer Architecture
✅ Learn AI for FREE with visuals, easy-to-follow insights.
✅ Get cutting-edge topics like GenAI, RAGs, and LLMs in your inbox every week.
We welcome contributions! If you have improvements, new notebooks, or fixes to suggest:
- Fork the repository.
- Create a feature branch: `git checkout -b feature/YourTopic`.
- Add or update notebooks in the `notebooks/` folder.
- Commit your changes: `git commit -m 'Add or update YourTopic notebook'`.
- Push your branch: `git push origin feature/YourTopic`.
- Open a pull request for review.
This project is licensed under the MIT License.
⭐️ If you find this repository helpful, please consider giving it a star!
Keywords: AI, Machine Learning, Deep Learning, PyTorch, Generative AI, LLMs, Transformers