I built every core component of a large language model (LLM) architecture from scratch, following Sebastian Raschka's book: data preparation, multi-head self-attention modules, and classification and instruction fine-tuning of open-source models, and deployed the result on AWS SageMaker!
Without a KV cache:
- For each token t₁, t₂, ..., tₜ, you recompute the query qᵢ, key kᵢ, and value vᵢ
- You build the full Q, K, and V matrices fresh every time
- You compute all attention scores, softmax weights, and context vectors [c₁, c₂, ..., cₜ] in parallel over the full sequence (because all tokens are known ahead of time)
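The full-sequence computation above can be sketched in NumPy. This is a minimal single-head sketch with toy dimensions; the function and weight names (`full_attention`, `W_q`, `W_k`, `W_v`) are illustrative, not taken from the book's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(X, W_q, W_k, W_v):
    """Project the whole sequence into Q, K, V and attend in parallel."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # each (T, d)
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                # (T, T) attention scores
    # causal mask: token i may only attend to tokens 1..i
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    weights = softmax(scores, axis=-1)           # softmax over the keys
    return weights @ V                           # (T, d) context vectors c_1..c_T
```

Because every row of `scores` is computed at once, this form suits training, where the entire sequence is available up front.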
At inference time:
- We generate one token at a time (we don't know the next tokens yet)
- So we don't recompute all of K and V at each step; we cache (store) the previous ones to avoid redundant computation
Sentence generation: "The cat sat on"
Now you're generating the next token: "the"
Without the cache, at token t₅ = "the":
- Recompute: k₁, k₂, k₃, k₄, k₅; v₁, v₂, v₃, v₄, v₅; and q₅
- Compute: q₅ × [k₁, k₂, k₃, k₄, k₅]ᵀ → attention scores
- Softmax → context vector c₅

❌ Inefficient: recomputing everything at each step!
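A quick back-of-the-envelope count shows why this hurts. The snippet below tallies Q/K/V projection computations over a generation of T tokens; the variable names are illustrative.

```python
# Rough count of per-token Q/K/V projections over a T-token generation.
# Without a cache, step t recomputes k and v for all t tokens plus q_t;
# with a cache, each step computes only one new q, k, and v.
T = 1000
naive  = sum(2 * t + 1 for t in range(1, T + 1))  # k_1..k_t, v_1..v_t, q_t
cached = 3 * T                                    # q_t, k_t, v_t per step
print(naive, cached)  # → 1002000 3000
```

The naive count grows quadratically in T, while the cached count is linear.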
With the KV cache, at token t₅ = "the":
- Already stored: k₁..₄, v₁..₄
- Compute: q₅, k₅, v₅
- Update caches:
  K_cache = [k₁, ..., k₄] → [k₁, ..., k₅]
  V_cache = [v₁, ..., v₄] → [v₁, ..., v₅]
- Compute attention scores → attention weights → a single row of the context vector → the rest of the network → softmax → next-token probabilities

✅ Efficient: only the new token's projections are computed at each step.
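The cached decoding step above can be sketched as follows. This is a minimal single-head NumPy sketch with toy dimensions; names like `cached_step`, `K_cache`, and `V_cache` are illustrative, not from the book's code.

```python
import numpy as np

d = 8                                # toy embedding / head dimension
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cached_step(x_t, K_cache, V_cache):
    """One decoding step: project only the new token, append k_t and v_t
    to the caches, and attend q_t against every cached key/value."""
    q_t = x_t @ W_q                              # query for the new token only
    K_cache = np.vstack([K_cache, x_t @ W_k])    # [k_1..k_{t-1}] -> [k_1..k_t]
    V_cache = np.vstack([V_cache, x_t @ W_v])    # [v_1..v_{t-1}] -> [v_1..v_t]
    scores = K_cache @ q_t / np.sqrt(d)          # q_t · k_i for i = 1..t
    weights = softmax(scores)                    # one row of attention weights
    c_t = weights @ V_cache                      # single context vector c_t
    return c_t, K_cache, V_cache

# Generate 5 tokens: caches start empty and grow one row per step.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
tokens = rng.standard_normal((5, d))             # stand-in token embeddings
for x_t in tokens:
    c_t, K_cache, V_cache = cached_step(x_t, K_cache, V_cache)
```

Each step touches only the new token's projections; the cached keys and values are reused, which is exactly the redundancy the full-recompute version pays for.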
