---
layout: post
title: "The Beginner's Guide to Understanding NVIDIA GPUs"
date: 2025-12-15 12:00:00
description: how GPUs work to speed up your code
tags: gpu
categories: tutorials
featured: true
giscus_comments: true
---

For the record, this blog is a summary of what I learnt from the [GPU Glossary](https://modal.com/gpu-glossary).

# From Abstraction to Reality: How Your Code Runs on NVIDIA GPUs

If you've ever looked at NVIDIA's GPU documentation, you might feel like you're reading a foreign language. There are "SMs", "Warps", "Grids", "Blocks", "Tensor Cores"... and keeping them all straight is a nightmare.

This guide is designed to bridge the gap between the **code you write** (the software abstraction) and the **metal that runs it** (the hardware reality). We're going to ignore the marketing fluff and focus on exactly how these pieces fit together.

## The Mental Shift: CPU vs. GPU

Before we dive into the jargon, let's establish the fundamental difference between the processor in your laptop (CPU) and the graphics card (GPU).

* **The CPU is a Ferrari.** It's designed to take a small number of passengers (threads) from Point A to Point B as fast as humanly possible. It has giant caches and complex logic to make sure *one single task* finishes quickly. This is **Latency** optimization.
* **The GPU is a Bus Service.** It's not trying to get one person across town in record time. It's trying to move *thousands* of people across town at once. It might take a bit longer for the bus to start and stop, but the sheer volume of people moved per minute is massive. This is **Throughput** optimization.

{% include figure.liquid loading="eager" path="assets/img/gpu_basics/cpu-gpu.svg" title="cpu-vs-gpu" class="img-fluid rounded z-depth-1" %}

Because of this difference, GPUs have a completely different architecture. They don't have a few powerful cores; they have thousands of tiny, simple ones.

## 1. The Core vs. The Thread

Let's start at the absolute bottom.

**The Hardware Reality: The Core**
Deep inside the silicon, the most basic unit of computation is the **Core** (specifically, the [CUDA Core](https://modal.com/gpu-glossary/device-hardware/cuda-core)). This is the worker bee. It can do basic math (add, multiply) on one piece of data at a time. Newer GPUs also have **[Tensor Cores](https://modal.com/gpu-glossary/device-hardware/tensor-core)**, specialized workers that do one specific job: multiply small matrices together really, really fast (essential for AI and scientific computing).

**The Software Abstraction: The Thread**
When you write code for a GPU, you don't talk to Cores directly. You write a function (called a **[Kernel](https://modal.com/gpu-glossary/device-software/kernel)**) and tell the GPU: "Run this function 10,000 times."
Each individual execution of that function is called a **[Thread](https://modal.com/gpu-glossary/device-software/thread)**.

> **The Bridge**: A **Thread** is a set of instructions. A **Core** is the physical spot where those instructions get executed.

## 2. The Streaming Multiprocessor (SM) vs. The Thread Block

One core isn't very useful on its own. So NVIDIA groups them together.

**The Hardware Reality: The Streaming Multiprocessor (SM)**
This is the *real* heart of the GPU. An **[SM](https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor)** is like a department in a factory. It contains:

* A bunch of CUDA Cores (usually 64 or 128).
* Some Tensor Cores.
* Some fast memory (we'll get to that).
* Schedulers to hand out work.

The full H100 die, for example, has 144 of these SMs (the shipping SXM5 part has 132 of them enabled).

{% include figure.liquid loading="eager" path="assets/img/gpu_basics/H100_FULLGPU_144SMs.png" title="H100_FULLGPU_144SMs" class="img-fluid rounded z-depth-1" %}

**The Software Abstraction: The Thread Block**
Managing 10,000 individual threads would be chaos. So, in your code, you group threads into **[Thread Blocks](https://modal.com/gpu-glossary/device-software/thread-block)**. You might say, "I want 10,000 threads, grouped into blocks of 256."

> **The Bridge**: When you launch your program, the GPU assigns an entire **Thread Block** to a single **SM**. The SM is responsible for running all the threads in that block. Once assigned, that block stays on that SM until it's finished.

## 3. The Grid vs. The GPU

**The Hardware Reality: The Device**
The entire GPU itself (the "Device") is just a collection of all those SMs we just talked about, connected to some big memory banks (VRAM).

**The Software Abstraction: The Grid**
The collection of *all* your Thread Blocks is called the **[Grid](https://modal.com/gpu-glossary/device-software/thread-block-grid)**.

> **The Bridge**: The **Grid** covers the entire problem you are solving. The GPU's hardware scheduler breaks up this Grid and feeds the Blocks to the available SMs. If you have a huge Grid and a small GPU, the hardware just queues up the Blocks and runs them as fast as it can. This is why CUDA code scales automatically: a better GPU just executes more Blocks at once.

| Software Concept | Hardware Home |
| :--- | :--- |
| **Thread** | **Core** |
| **Thread Block** | **Streaming Multiprocessor (SM)** |
| **Grid** | **Entire Device (GPU)** |

{% include figure.liquid loading="eager" path="assets/img/gpu_basics/cuda-programming-model.svg" title="cuda-programming-model" class="img-fluid rounded z-depth-1" %}

(Figure from the GPU Glossary)

## 4. Memory: Where Data Lives

Understanding where your data lives is the single most important part of GPU performance.

1. **[Global Memory](https://modal.com/gpu-glossary/device-software/global-memory) (GPU RAM)**:
    * **Analogy**: This is the warehouse down the street. It's huge (80 GB+), but it takes a long time to travel there to pick up a package (data).
    * **Reality**: This is the VRAM on the card.

2. **[Shared Memory](https://modal.com/gpu-glossary/device-software/shared-memory) (L1 Cache)**:
    * **Analogy**: This is a communal workbench shared by all the workers (threads) in the same room (Block). It's really fast, but small.
    * **Reality**: Physically located inside the SM. You, the programmer, have to manually move data here if you want to use it efficiently.

3. **[Registers](https://modal.com/gpu-glossary/device-software/registers)**:
    * **Analogy**: This is the pocket of the individual worker. It's instant to reach, but you only have so many pockets.
    * **Reality**: Private memory for each thread to store its local variables.

## 5. The Secret Sauce: Warps and Latency Hiding

Here is the "gotcha" that catches every beginner.

You might think that if you have 32 threads, they all run independently. They don't.
The hardware groups threads into bundles of 32 called **[Warps](https://modal.com/gpu-glossary/device-software/warp)**.

**The Drill Sergeant (SIMT)**
A Warp executes in lock-step. It's like a drill sergeant commanding a platoon: "Everyone take a step forward!" If one soldier needs to tie their shoe (an `if` statement that takes a different path), *everyone else has to wait*. This is why "branching" (divergent) code is slow on GPUs.

**Running the Bus Service (Latency Hiding)**
Remember the "Bus Service" analogy?
When a Warp needs to load data from Global Memory (the warehouse), it takes a long time (hundreds of clock cycles). The SM doesn't just sit there and wait. It says, "Okay, Warp 1 is waiting for memory. Warp 2, you're up!"

It instantly switches to another Warp that is ready to calculate. By the time Warp 2 needs memory, Warp 3 is ready. Eventually, Warp 1's data arrives, and it jumps back in line.

{% include figure.liquid loading="eager" path="assets/img/gpu_basics/wave-scheduling.png" title="wave-scheduling" class="img-fluid rounded z-depth-1" %}

This is **Latency Hiding**, and it is the key to GPU performance. You need to launch *way more threads* than you have cores, just to keep the hardware busy while it waits for memory.

## Summary

* **Software**: You write a **Kernel**. You launch a **Grid** of **Thread Blocks**, each containing hundreds of **Threads**.
* **Hardware**: The GPU assigns Blocks to **SMs**. The SMs group threads into **Warps** of 32. These Warps execute instructions on **Cores**.
* **Performance**: If you align your software structure (Blocks/Threads) to respect the hardware reality (SMs/Warps), you get incredible speed. If you fight the hardware, you get a very expensive space heater.