---
layout: post
title: "The Beginner's Guide to Understanding NVIDIA GPUs"
date: 2025-12-15 12:00:00
description: how a GPU works to speed up your code
tags: gpu
categories: tutorials
featured: true
giscus_comments: true
---

For the record, this post is a summary of what I learnt from Modal's [GPU Glossary](https://modal.com/gpu-glossary).

# From Abstraction to Reality: How Your Code Runs on NVIDIA GPUs

If you've ever looked at NVIDIA's GPU documentation, you might feel like you're reading a foreign language. There are "SMs", "Warps", "Grids", "Blocks", "Tensor Cores"... and keeping them all straight is a nightmare.

This guide is designed to bridge the gap between the **code you write** (the software abstraction) and the **metal that runs it** (the hardware reality). We're going to ignore the marketing fluff and focus on exactly how these pieces fit together.

## The Mental Shift: CPU vs. GPU

Before we dive into the jargon, let's establish the fundamental difference between the processor in your laptop (CPU) and the graphics card (GPU).

* **The CPU is a Ferrari.** It's designed to take a small number of passengers (threads) from Point A to Point B as fast as humanly possible. It has giant caches and complex logic to make sure *one single task* finishes quickly. This is **Latency** optimization.
* **The GPU is a Bus Service.** It's not trying to get one person across town in record time. It's trying to move *thousands* of people across town at once. It might take a bit longer for the bus to start and stop, but the sheer volume of people moved per minute is massive. This is **Throughput** optimization.

{% include figure.liquid loading="eager" path="assets/img/gpu_basics/cpu-gpu.svg" title="cpu-vs-gpu" class="img-fluid rounded z-depth-1" %}

Because of this difference, GPUs have a completely different architecture. They don't have a few powerful cores; they have thousands of tiny, simple ones.

## 1. The Core vs. The Thread

Let's start at the absolute bottom.

**The Hardware Reality: The Core**
Deep inside the silicon, the most basic unit of computation is the **Core** (specifically, the [CUDA Core](https://modal.com/gpu-glossary/device-hardware/cuda-core)). This is the worker bee. It can do basic math (add, multiply) on one piece of data at a time. Newer GPUs also have **[Tensor Cores](https://modal.com/gpu-glossary/device-hardware/tensor-core)**, which are specialized workers that do one specific job: multiply small matrices together really, really fast (essential for AI and scientific computing).

**The Software Abstraction: The Thread**
When you write code for a GPU, you don't talk to Cores directly. You write a function (called a **[Kernel](https://modal.com/gpu-glossary/device-software/kernel)**) and tell the GPU: "Run this function 10,000 times."
Each individual execution of that function is called a **[Thread](https://modal.com/gpu-glossary/device-software/thread)**.

> **The Bridge**: A **Thread** is a set of instructions. A **Core** is the physical spot where those instructions get executed.

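To make "run this function 10,000 times" concrete, here is a minimal sketch of a CUDA kernel (the name `add_one` is made up for illustration). Every thread executes the same function body; the only thing that distinguishes one thread from another is its index:

```cuda
// __global__ marks a kernel: GPU code that the CPU can launch.
__global__ void add_one(float* data, int n) {
    int i = threadIdx.x;         // which thread am I?
    if (i < n) data[i] += 1.0f;  // each thread handles one element
}
```

Launching `add_one<<<1, 256>>>(d_data, 256);` spawns 256 threads, each one a separate execution of this function. (The `<<<1, ...>>>` part is the block configuration, which is exactly what the next section covers.)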
## 2. The Streaming Multiprocessor (SM) vs. The Thread Block

One core isn't very useful. So NVIDIA groups them together.

**The Hardware Reality: The Streaming Multiprocessor (SM)**
This is the *real* heart of the GPU. An **[SM](https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor)** is like a department in a factory. It contains:
* A bunch of CUDA Cores (usually 64 or 128).
* Some Tensor Cores.
* Some fast memory (we'll get to that).
* Schedulers to hand out work.

The full GH100 die behind the H100, for example, has 144 of these SMs (shipping H100s enable 132 of them).

{% include figure.liquid loading="eager" path="assets/img/gpu_basics/H100_FULLGPU_144SMs.png" title="H100_FULLGPU_144SMs" class="img-fluid rounded z-depth-1" %}

**The Software Abstraction: The Thread Block**
Managing 10,000 individual threads would be chaos. So, in your code, you group threads into **[Thread Blocks](https://modal.com/gpu-glossary/device-software/thread-block)**. You might say, "I want 10,000 threads, grouped into blocks of 256."

> **The Bridge**: When you launch your program, the GPU assigns an entire **Thread Block** to a single **SM**. The SM is responsible for running all the threads in that block. Once assigned, that block stays on that SM until it's finished.

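The "10,000 threads, grouped into blocks of 256" example looks like this in launch code (a sketch; the kernel name `process` is made up). Note the rounding-up division so the final, partially-full block still covers the tail of the data:

```cuda
__global__ void process(float* data, int n) {
    // Combine the block's position in the grid with this thread's
    // position in the block to get a unique global index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // guard: the last block overhangs n
}

void launch(float* d_data) {
    // 10,000 threads, grouped into blocks of 256.
    // (10000 + 255) / 256 = 40 blocks; 40 * 256 = 10,240 threads,
    // 240 of which fail the i < n guard and simply do nothing.
    int n = 10000;
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    process<<<numBlocks, threadsPerBlock>>>(d_data, n);
}
```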
## 3. The Grid vs. The GPU

**The Hardware Reality: The Device**
The entire GPU itself (the "Device") is just a collection of all those SMs we just talked about, connected to some big memory banks (VRAM).

**The Software Abstraction: The Grid**
The collection of *all* your Thread Blocks is called the **[Grid](https://modal.com/gpu-glossary/device-software/thread-block-grid)**.

> **The Bridge**: The **Grid** covers the entire problem you are solving. The GPU's hardware scheduler breaks up this Grid and feeds the Blocks to the available SMs. If you have a huge Grid and a small GPU, the hardware just queues up the Blocks and runs them as fast as it can. This is why CUDA code scales automatically: a better GPU just executes more Blocks at once.

| Software Concept | Hardware Home |
| :--- | :--- |
| **Thread** | **Core** |
| **Thread Block** | **Streaming Multiprocessor (SM)** |
| **Grid** | **Entire Device (GPU)** |

{% include figure.liquid loading="eager" path="assets/img/gpu_basics/cuda-programming-model.svg" title="cuda-programming-model" class="img-fluid rounded z-depth-1" %}

(Figure from GPU Glossary)

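One common idiom that leans directly on this automatic scaling is the grid-stride loop (a sketch; the kernel name `scale` is illustrative). Instead of sizing the grid exactly to the data, each thread strides through the array in steps of the total thread count, so the same launch covers any data size on any GPU:

```cuda
__global__ void scale(float* data, int n, float factor) {
    // Total number of threads in the whole grid.
    int stride = gridDim.x * blockDim.x;
    // Each thread starts at its global index and hops forward by the
    // grid size, so any grid shape eventually covers all n elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}
```

On a bigger GPU the hardware simply runs more of these blocks concurrently; the kernel itself never changes.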
## 4. Memory: Where Data Lives

Understanding where your data lives is the single most important part of GPU performance.

1. **[Global Memory](https://modal.com/gpu-glossary/device-software/global-memory) (GPU RAM)**:
    * **Analogy**: This is the warehouse down the street. It's huge (80GB+), but it takes a long time to travel there to pick up a package (data).
    * **Reality**: This is the VRAM on the card.

2. **[Shared Memory](https://modal.com/gpu-glossary/device-software/shared-memory)**:
    * **Analogy**: This is a communal workbench shared by all the workers (threads) in the same room (Block). It's really fast, but small.
    * **Reality**: On-chip SRAM physically located inside the SM (carved from the same pool as the L1 cache). You, the programmer, have to manually move data here if you want to use it efficiently.

3. **[Registers](https://modal.com/gpu-glossary/device-software/registers)**:
    * **Analogy**: This is the pocket of the individual worker. It's instant to reach, but you only have so many pockets.
    * **Reality**: Private memory for each thread to store its local variables.

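Here is a sketch of that "manually move data to the workbench" step (the kernel `blur` is made up for illustration, and block edges are handled crudely for brevity). Each block stages a tile of Global Memory into Shared Memory, synchronizes, then works out of the fast copy:

```cuda
#define TILE 256

__global__ void blur(const float* in, float* out, int n) {
    // The communal workbench: one tile per block, visible only to
    // the threads of this block.
    __shared__ float tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];  // one trip to the warehouse

    __syncthreads();  // wait until every thread has stocked the bench

    // Neighbours are now read from fast shared memory instead of
    // taking two more trips back to global memory.
    if (i < n) {
        float mid   = tile[threadIdx.x];
        float left  = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : mid;
        float right = (threadIdx.x < TILE - 1 && i + 1 < n) ? tile[threadIdx.x + 1] : mid;
        out[i] = (left + mid + right) / 3.0f;
    }
}
```

Without the tile, each element would be fetched from the warehouse three times (once as itself, once as each neighbour's neighbour); with it, exactly once.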
## 5. The Secret Sauce: Warps and Latency Hiding

Here is the "gotcha" that catches every beginner.

You might think that if you have 32 threads, they all run independently. They don't.
The hardware groups threads into bundles of 32 called **[Warps](https://modal.com/gpu-glossary/device-software/warp)**.

**The Drill Sergeant (SIMT)**
A Warp executes in lockstep. It's like a drill sergeant commanding a platoon: "Everyone take a step forward!" If one soldier needs to tie their shoe (an `if` statement that takes a different path), *everyone else has to wait*. This is why divergent branching, where threads in the same warp take different paths, is bad on GPUs.

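A sketch of the difference (kernel names made up): the first branch splits every warp in half, so the hardware must run both paths one after the other; the second branches on the block index, so all 32 threads of any warp agree and nobody waits.

```cuda
// Bad: even and odd threads sit next to each other in the same warp,
// so every warp executes BOTH branches, serially.
__global__ void divergent(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) data[i] *= 2.0f;
    else            data[i] += 1.0f;
}

// Better: the condition is uniform within each warp (all threads in a
// block share blockIdx), so every warp executes only one branch.
__global__ void uniform(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0) data[i] *= 2.0f;
    else                     data[i] += 1.0f;
}
```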
**Running the Bus Service (Latency Hiding)**
Remember the "Bus Service" analogy?
When a Warp needs to load data from Global Memory (the warehouse), it takes a long time (hundreds of clock cycles). The SM doesn't just sit there and wait. It says, "Okay, Warp 1 is waiting for memory. Warp 2, you're up!"

It instantly switches to another Warp that is ready to calculate. By the time Warp 2 needs memory, Warp 3 is ready. Eventually, Warp 1's data arrives, and it jumps back in line.

{% include figure.liquid loading="eager" path="assets/img/gpu_basics/wave-scheduling.png" title="wave-scheduling" class="img-fluid rounded z-depth-1" %}

This is **Latency Hiding**, and it is the key to GPU performance. You need to launch *way more threads* than you have cores, just to keep the hardware busy while it waits for memory.

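Some back-of-the-envelope arithmetic makes "way more threads than cores" concrete. The numbers below are illustrative, roughly H100-class, and the helper function is made up:

```cuda
// Roughly H100-class numbers (illustrative):
//   132 SMs x 128 FP32 cores         ~  16,896 cores
//   132 SMs x 64 warps x 32 threads  ~ 270,336 resident threads
// A global memory load costs hundreds of cycles, so the hardware keeps
// ~16x more threads resident than it has cores: while some warps wait
// on the warehouse, others keep the cores fed.
void pick_launch_size(int n, int* numBlocks, int* threadsPerBlock) {
    *threadsPerBlock = 256;  // 8 warps per block
    *numBlocks = (n + *threadsPerBlock - 1) / *threadsPerBlock;
    // For n = 1 << 24 (16M elements) this is 65,536 blocks: vastly
    // more threads than cores, which is exactly what latency hiding
    // wants. "Too many" threads is usually the right amount.
}
```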
## Summary

* **Software**: You write a **Kernel**. You launch a **Grid** of **Thread Blocks**, each containing hundreds of **Threads**.
* **Hardware**: The GPU assigns Blocks to **SMs**. The SMs group threads into **Warps** of 32. These Warps execute instructions on **Cores**.
* **Performance**: If you align your software structure (Blocks/Threads) to respect the hardware reality (SMs/Warps), you get incredible speed. If you fight the hardware, you get a very expensive space heater.