# Build an LLM from scratch with MAX

A guided tour of a complete GPT-2 implementation using the MAX framework. Each section walks through the code in `gpt2.py` and explains what it does and why, from model configuration through streaming text generation.

## What you'll learn

- **Transformer architecture**: every component of GPT-2, explained through working code
- **MAX Python API**: how MAX's `experimental.nn` builds and compiles neural networks
- **Inference patterns**: weight loading, lazy initialization, model compilation, and autoregressive generation
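As a taste of the first topic, here is an illustrative sketch (not the book's code, which builds the network with MAX ops) of the GELU activation used in GPT-2's feed-forward layers, in its tanh approximation:

```python
import math

def gelu(x: float) -> float:
    """tanh approximation of GELU, the activation in GPT-2's MLP.

    Illustrative sketch only; the tutorial implements this with MAX ops.
    """
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    ))
```

Unlike ReLU, GELU is smooth and lets small negative values pass through attenuated; section 2 of the book covers the full two-layer MLP around it.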

## Quick start

### Prerequisites

[pixi](https://pixi.sh) installed, for environment and task management.

### Installation

```sh
git clone https://github.com/modular/max-llm-book
cd max-llm-book
pixi install
```

### Run the model

Serve GPT-2 via an OpenAI-compatible HTTP endpoint:

```sh
pixi run serve
```

Then query it:

```sh
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt2","prompt":"In the beginning","max_tokens":30,"temperature":0}'
```
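If you prefer Python to curl, here is a minimal client sketch using only the standard library. It assumes the server started by `pixi run serve` is listening on `localhost:8000`; `build_payload` and `complete` are illustrative names, not part of the project:

```python
import json
import urllib.request

def build_payload(prompt: str, max_tokens: int = 30,
                  temperature: float = 0.0) -> dict:
    """Assemble the OpenAI-style completion request body."""
    return {"model": "gpt2", "prompt": prompt,
            "max_tokens": max_tokens, "temperature": temperature}

def complete(prompt: str,
             url: str = "http://localhost:8000/v1/completions") -> str:
    """POST the prompt and return the first completion's text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

if __name__ == "__main__":
    print(complete("In the beginning"))
```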

Or run the model directly in your terminal:

```sh
pixi run gpt2
```

This downloads the pretrained GPT-2 weights from Hugging Face, compiles the model, and starts an interactive prompt where you can enter text and see generated completions.
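Generation is autoregressive: the model predicts one token, appends it, and feeds the extended sequence back in. The sketch below shows the shape of that loop with hypothetical names and greedy decoding for simplicity; the real implementation lives in `gpt2.py` and runs a MAX-compiled model:

```python
def generate(model, tokens, max_new_tokens):
    """Greedy autoregressive decoding sketch.

    `model` is a stand-in callable mapping a token list to one row of
    logits per position; only the final row predicts the next token.
    """
    for _ in range(max_new_tokens):
        logits = model(tokens)
        last = logits[-1]                               # next-token logits
        next_token = max(range(len(last)), key=last.__getitem__)
        tokens = tokens + [next_token]                  # append and repeat
    return tokens
```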

Additional modes:

```sh
pixi run gpt2 -- --prompt "In the beginning"  # single generation, then exit
pixi run chat                                 # streaming multi-turn chat
pixi run gpt2 -- --benchmark                  # tokens/sec benchmark
```

## Read the book

```sh
pixi run book
```

Or read it online at [llm.modular.com](https://llm.modular.com).

## What the book covers

The tutorial walks through `gpt2.py` section by section:

| Section | Topic | What you'll learn |
|---|---|---|
| | Run the model | Serve GPT-2 with `pixi run serve` before diving into code |
| 1 | Model configuration | Architecture hyperparameters and Hugging Face compatibility |
| 2 | Feed-forward network | Two-layer MLP with GELU activation |
| 3 | Causal masking | Preventing attention to future tokens |
| 4 | Multi-head attention | Parallel attention across 12 heads |
| 5 | Layer normalization | Pre-norm pattern for stable activations |
| 6 | Transformer block | Residual connections and component wiring |
| 7 | Stacking transformer blocks | Embeddings and the 12-layer model body |
| 8 | Language model head | Projecting hidden states to vocabulary logits |
| 9 | Tokens, weights, and sampling | BPE tokenization, weight loading, and Gumbel-max sampling |
| 10 | Serving GPT-2 with MAX | Connecting the model to `max serve`; the custom architecture pattern |
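To make two of the less familiar entries concrete, here is a pure-Python sketch, illustrative only (the book builds these with MAX ops), of a causal mask (section 3) and Gumbel-max sampling (section 9):

```python
import math
import random

def causal_mask(seq_len: int) -> list[list[bool]]:
    # mask[i][j] is True when key position j lies in the future of query
    # position i, i.e. that attention score must be blocked.
    return [[j > i for j in range(seq_len)] for i in range(seq_len)]

def gumbel_max_sample(logits: list[float], rng=random) -> int:
    # argmax(logit + Gumbel noise) returns index i with probability
    # softmax(logits)[i], so sampling needs no explicit normalization.
    noisy = [l - math.log(-math.log(rng.random())) for l in logits]
    return max(range(len(logits)), key=noisy.__getitem__)
```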

## Project structure

```
max-llm-book/
├── book/                  # mdBook tutorial documentation
│   └── src/
│       ├── introduction.md
│       ├── serve_first.md
│       ├── step_01.md ... step_10.md
│       └── SUMMARY.md
├── gpt2.py                # Complete GPT-2 implementation
├── gpt2_arch/             # Custom architecture package for `max serve`
├── tests/                 # Tests for gpt2.py and gpt2_arch/
├── pixi.toml              # Project dependencies and tasks
└── README.md              # This file
```

## Learning resources

## Contributing

Found an issue or want to improve the tutorial? Contributions welcome:

  1. File issues for bugs or unclear explanations
  2. Suggest improvements to code examples or visualizations
  3. Open a pull request with fixes or additions
