Last week, I explored KV caching from scratch while working on a GPT-2 model. I ran my experiments on a Colab T4 GPU to better understand how caching improves inference speed in large language models.
In autoregressive generation, LLMs generate one token at a time, and each new token has to attend to all previous tokens. So if your model has already generated "White → Fluffy → Cat", the attention block still recomputes the Keys and Values for "White" and "Fluffy" on every single step.
That’s a lot of unnecessary computation, especially as the output grows longer.
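To make the waste concrete, here is a minimal PyTorch sketch of a no-cache decode step for a toy single-head attention layer (hypothetical weights and dimensions, not my GPT-2 code). Every step rebuilds K and V for the entire sequence:

```python
import torch

# Toy single-head attention (illustrative dimensions, untrained weights).
d_model = 64
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

tokens = []  # hidden states of all tokens generated so far, each (d_model,)

def decode_step_no_cache(new_hidden):
    tokens.append(new_hidden)
    x = torch.stack(tokens)              # (seq_len, d_model)
    # Without a cache, K and V are recomputed for *every* previous token
    # on every decoding step, even though they never change.
    q = x @ W_q                          # (seq_len, d_model)
    k = x @ W_k                          # recomputed each step
    v = x @ W_v                          # recomputed each step
    scores = q @ k.T / d_model ** 0.5    # (seq_len, seq_len)
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    attn = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
    return (attn @ v)[-1]                # attention output for the newest token
```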
I implemented a caching mechanism where:
• The model caches the Keys & Values for the input tokens during prefill.
• For each new token, it only computes the K/V for that token.
• Previous tokens just pull from the cache, no recompute needed.
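Here is the same toy attention step rewritten with a KV cache, again as a sketch rather than my actual implementation: only the newest token's K/V get computed, everything else is read back from the cache. In a real run, prefill would push the whole prompt through once to seed `k_cache` and `v_cache`.

```python
import torch

d_model = 64
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

k_cache, v_cache = [], []  # grow by one row per generated token

def decode_step_with_cache(new_hidden):
    # Only the newest token's K/V are computed...
    q_new = new_hidden @ W_q                 # (d_model,)
    k_cache.append(new_hidden @ W_k)
    v_cache.append(new_hidden @ W_v)
    # ...previous tokens' K/V come straight from the cache.
    K = torch.stack(k_cache)                 # (seq_len, d_model)
    V = torch.stack(v_cache)                 # (seq_len, d_model)
    scores = (K @ q_new) / d_model ** 0.5    # (seq_len,)
    attn = torch.softmax(scores, dim=-1)     # newest token attends to all so far
    return attn @ V                          # attention output for the newest token

# Usage: feed a few "token hidden states" through the cached step.
for _ in range(3):
    out = decode_step_with_cache(torch.randn(d_model))
```

The per-step cost drops from recomputing K/V for the whole sequence to a single projection per token, at the price of keeping the cache in GPU memory.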
I tested this on a Colab T4 GPU with GPT-2 across different output lengths.
For shorter outputs, KV caching doesn't always help. In my tests, device communication overhead on CUDA sometimes outweighed the gains for small models like GPT-2.
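If you want to reproduce this kind of comparison without writing the cache yourself, the Hugging Face transformers library exposes a `use_cache` flag on `generate`. The rough timing sketch below (arbitrary prompt and token counts, not the exact script from my experiments) is one way to measure it:

```python
import time
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
inputs = tok("The white fluffy cat", return_tensors="pt").to(device)

def timed_generate(use_cache, max_new_tokens):
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=max_new_tokens,
                       do_sample=False, use_cache=use_cache,
                       pad_token_id=tok.eos_token_id)
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

for n in (20, 100, 400):
    print(f"{n} tokens | no cache: {timed_generate(False, n):.2f}s"
          f" | cache: {timed_generate(True, n):.2f}s")
```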
Shoutout to Sebastian Raschka, PhD, for his amazing blog post on KV caching. Resources I found helpful:
- Attention in Transformers, Step-by-Step | Deep Learning Chapter 6
- Understanding and Coding the KV Cache in LLMs from Scratch
- Mastering Tensor Dimensions in Transformers
- The Illustrated Transformer | Jay Alammar
- Transformers KV Caching Explained | João Lages
- LLM Inference Series: 3. KV Caching Explained | Pierre Lienhart
- tanishqkumar/beyond-nanogpt: Minimal, annotated implementations

