I built every core component of a large language model (LLM) architecture from scratch, following Sebastian Raschka's book: data preparation, multi-head self-attention modules, and classification and instruction fine-tuning of open-source models, and deployed the result on AWS SageMaker!
Without a KV cache:
- For each token t₁, t₂, ..., tₜ, you recompute the query qᵢ, key kᵢ, and value vᵢ
- You build the full Q, K, and V matrices fresh every time
- You compute all attention scores, softmax weights, and context vectors [c₁, c₂, ..., cₜ] in parallel over the full sequence (because all tokens are known ahead of time)
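The full-sequence computation above can be sketched in NumPy. This is a minimal single-head sketch with toy dimensions; the function and weight names (`full_attention`, `W_q`, `W_k`, `W_v`) are illustrative, not taken from the book's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(X, W_q, W_k, W_v):
    """Project the whole sequence into Q, K, V and attend in parallel."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # each (T, d)
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                # (T, T) attention scores
    # causal mask: token i may only attend to tokens 1..i
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    weights = softmax(scores, axis=-1)           # softmax over the keys
    return weights @ V                           # (T, d) context vectors c_1..c_T
```

Because every row of `scores` is computed at once, this form suits training, where the entire sequence is available up front.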
At inference time:
- We generate one token at a time (we don't know the next tokens yet)
- So we don't recompute all of K and V at each step; we cache (store) the previous ones to avoid redundant computation
Sentence generation: "The cat sat on"
Now you're generating the next token: "the"
Without the cache, at token t₅ = "the":
- Recompute: k₁, k₂, k₃, k₄, k₅; v₁, v₂, v₃, v₄, v₅; and q₅
- Compute: q₅ × [k₁, k₂, k₃, k₄, k₅]ᵀ → attention scores
- Softmax → context vector c₅

❌ Inefficient: recomputing everything at each step!
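A quick back-of-the-envelope count shows why this hurts. The snippet below tallies Q/K/V projection computations over a generation of T tokens; the variable names are illustrative.

```python
# Rough count of per-token Q/K/V projections over a T-token generation.
# Without a cache, step t recomputes k and v for all t tokens plus q_t;
# with a cache, each step computes only one new q, k, and v.
T = 1000
naive  = sum(2 * t + 1 for t in range(1, T + 1))  # k_1..k_t, v_1..v_t, q_t
cached = 3 * T                                    # q_t, k_t, v_t per step
print(naive, cached)  # → 1002000 3000
```

The naive count grows quadratically in T, while the cached count is linear.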
With the KV cache, at token t₅ = "the":
- Already stored: k₁..₄, v₁..₄
- Compute: q₅, k₅, v₅
- Update caches:
  K_cache = [k₁, ..., k₄] → [k₁, ..., k₅]
  V_cache = [v₁, ..., v₄] → [v₁, ..., v₅]
- Compute attention scores → attention weights → a single row of the context vector → the rest of the network → softmax → next-token probabilities

✅ Efficient: only the new token's projections are computed at each step.
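The cached decoding step above can be sketched as follows. This is a minimal single-head NumPy sketch with toy dimensions; names like `cached_step`, `K_cache`, and `V_cache` are illustrative, not from the book's code.

```python
import numpy as np

d = 8                                # toy embedding / head dimension
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cached_step(x_t, K_cache, V_cache):
    """One decoding step: project only the new token, append k_t and v_t
    to the caches, and attend q_t against every cached key/value."""
    q_t = x_t @ W_q                              # query for the new token only
    K_cache = np.vstack([K_cache, x_t @ W_k])    # [k_1..k_{t-1}] -> [k_1..k_t]
    V_cache = np.vstack([V_cache, x_t @ W_v])    # [v_1..v_{t-1}] -> [v_1..v_t]
    scores = K_cache @ q_t / np.sqrt(d)          # q_t · k_i for i = 1..t
    weights = softmax(scores)                    # one row of attention weights
    c_t = weights @ V_cache                      # single context vector c_t
    return c_t, K_cache, V_cache

# Generate 5 tokens: caches start empty and grow one row per step.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
tokens = rng.standard_normal((5, d))             # stand-in token embeddings
for x_t in tokens:
    c_t, K_cache, V_cache = cached_step(x_t, K_cache, V_cache)
```

Each step touches only the new token's projections; the cached keys and values are reused, which is exactly the redundancy the full-recompute version pays for.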
