A hands-on cookbook for GPU kernel programming: CUDA, Triton, and CuTe DSL implementations of common deep learning operators, with detailed notes and benchmarks.
A systematic GPU kernel learning project: implement common deep learning operators and compare multiple GPU programming frameworks.
gpu-kernel-lab/
├── README.md
├── docs/                  # Detailed operator design docs
│   ├── vector_add.md      # CUDA thread model, coalesced access
│   ├── transpose.md       # Memory coalescing, shared memory, bank conflicts
│   ├── softmax.md         # Reduction, online softmax, warp shuffle
│   ├── layernorm.md       # Welford's algorithm, two-level reduction
│   ├── matmul.md          # Shared memory tiling, Roofline analysis
│   ├── attention.md       # Flash Attention, IO-aware algorithms
│   └── rms_norm.md        # RMS normalization, float4 vectorization, fused Add+Norm
│
├── common/
│   ├── utils.py           # Benchmark helpers, performance-metric computation
│   ├── tensor_utils.py    # Tensor generation helpers
│   └── check.py           # Correctness verification
│
├── benchmarks/
│   └── benchmark.py       # Unified benchmark entry point
│
├── operators/
│   ├── vector_add/        # ⭐ GPU thread model
│   ├── transpose/         # ⭐⭐ Memory coalescing
│   ├── softmax/           # ⭐⭐⭐ Reduction
│   ├── layernorm/         # ⭐⭐⭐ Warp reduction
│   ├── matmul/            # ⭐⭐⭐⭐ Shared memory tiling
│   ├── attention/         # ⭐⭐⭐⭐⭐ Fused kernel (Flash Attention)
│   ├── rms_norm/          # ⭐⭐⭐ float4 vectorization, fused Add+Norm
│   └── rope/              # ⭐⭐⭐ Rotary Position Embedding
│
└── scripts/
    ├── build_all.sh
    └── run_all_tests.sh
| Operator | Difficulty | Core Techniques | Docs |
|---|---|---|---|
| Vector Add | ⭐ | CUDA thread model, float4 vectorization | vector_add.md |
| Transpose | ⭐⭐ | Shared memory, bank conflicts | transpose.md |
| Softmax | ⭐⭐⭐ | Reduction, online softmax | softmax.md |
| LayerNorm | ⭐⭐⭐ | Welford's algorithm, warp reduction | layernorm.md |
| Matmul | ⭐⭐⭐⭐ | Shared memory tiling, Roofline | matmul.md |
| Attention | ⭐⭐⭐⭐⭐ | Flash Attention, IO-aware | attention.md |
| RMSNorm | ⭐⭐⭐ | float4 vectorization, fused Add+Norm | rms_norm.md |
| RoPE | ⭐⭐⭐ | Rotary Position Embedding | rope.md |
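The "online softmax" entry above refers to the single-pass algorithm that keeps a running maximum and a running sum of exponentials, rescaling the sum whenever the maximum grows; this is the same trick Flash Attention builds on. A minimal scalar Python sketch of the idea (illustrative only, not the CUDA kernel in this repo):

```python
import math

def online_softmax(xs):
    """Single-pass softmax: maintain a running max m and a running
    sum d of exp(x - m), rescaling d whenever m increases."""
    m = float("-inf")  # running max
    d = 0.0            # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # rescale the old sum to the new max, then add the new term
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```

In a GPU kernel the same (m, d) state is carried per thread and merged across the warp with shuffle instructions; here a plain loop suffices to show the recurrence.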
Each operator includes:
- CUDA: hand-written kernels, from naive to optimized (with detailed comments)
- Triton: Python DSL implementation
- CuTe DSL: CUTLASS's Python interface (selected operators)
- PyTorch: baseline (for correctness verification and performance comparison)
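Because floating-point reduction order differs between implementations, correctness checks like the one in common/check.py typically compare against the PyTorch baseline with a tolerance rather than exact equality. A hypothetical sketch of such a check (the function name and signature are illustrative, not the repo's actual API):

```python
def check_close(out, ref, rtol=1e-3, atol=1e-5):
    """Element-wise tolerance check in the style of torch.allclose /
    numpy.allclose: pass if |out - ref| <= atol + rtol * |ref|.
    Returns (ok, max_abs_err) for reporting."""
    max_err = 0.0
    ok = True
    for o, r in zip(out, ref):
        err = abs(o - r)
        max_err = max(max_err, err)
        if err > atol + rtol * abs(r):
            ok = False
    return ok, max_err
```

Loosening rtol for fp16/bf16 kernels (e.g. 1e-2) is common, since half-precision accumulation drifts further from the fp32 reference.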
CUDA >= 11.8
PyTorch >= 2.0
Triton >= 2.1
cutlass (optional, for the CuTe DSL)

# Build everything (defaults to sm_80 = A100/RTX 3090)
CUDA_ARCH=sm_80 bash scripts/build_all.sh

# Build a single operator
cd operators/matmul/cuda && bash build.sh

Common CUDA architectures:
| GPU | Architecture | CUDA_ARCH |
|---|---|---|
| RTX 5090 | Blackwell | sm_120 |
| H200 | Hopper | sm_90 |
| H20 | Hopper | sm_90 |
| L20 | Ada Lovelace | sm_89 |
| A100 | Ampere | sm_80 |
| RTX 30xx | Ampere | sm_86 |
Measured benchmark data is available for all of the following GPUs:
| GPU | Architecture | Memory | Peak FP32 | Memory Bandwidth |
|---|---|---|---|---|
| RTX 5090 | Blackwell (sm_120) | 32 GB | ~109 TFLOPS | ~1.79 TB/s |
| H200 SXM | Hopper (sm_90) | 141 GB | ~67 TFLOPS | ~4.8 TB/s |
| H20 | Hopper (sm_90) | 96 GB | ~44 TFLOPS | ~4.0 TB/s |
| L20 | Ada Lovelace (sm_89) | 48 GB | ~59.8 TFLOPS | ~864 GB/s |
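The peak-FLOPS and bandwidth columns together determine each GPU's roofline ridge point: the arithmetic intensity (FLOP/byte) at which a kernel transitions from memory-bound to compute-bound. A quick sketch using the figures from the table above:

```python
def ridge_point(peak_tflops, bw_tb_s):
    """Arithmetic intensity (FLOP/byte) where peak compute equals peak
    memory throughput: AI* = peak FLOP/s divided by bytes/s."""
    return (peak_tflops * 1e12) / (bw_tb_s * 1e12)

# FP32 figures from the table above (TFLOPS, TB/s)
for name, tflops, bw in [("RTX 5090", 109, 1.79),
                         ("H200 SXM", 67, 4.8),
                         ("H20", 44, 4.0),
                         ("L20", 59.8, 0.864)]:
    print(f"{name}: ridge at ~{ridge_point(tflops, bw):.1f} FLOP/byte")
```

Note how different the ridge points are: kernels that are compute-bound on an H200 (ridge ≈ 14 FLOP/byte) can still be memory-bound on an L20 (ridge ≈ 69), which is why the docs analyze each kernel's roofline per GPU.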
# Run all operator tests
bash scripts/run_all_tests.sh

# Test a single operator
python operators/matmul/test.py
python operators/attention/test.py

# Benchmark all operators
python benchmarks/benchmark.py

# Benchmark a specific operator
python benchmarks/benchmark.py --op matmul

# Save results
python benchmarks/benchmark.py --save

Recommended learning order:
1. vector_add → understand the CUDA thread hierarchy (grid/block/thread/warp)
2. transpose → understand memory access patterns (coalescing, shared memory, bank conflicts)
3. softmax → understand GPU reductions (tree reduce, warp shuffle)
4. layernorm → compound reductions (Welford's algorithm, two-level reduction)
5. matmul → understand compute-bound optimization (tiling, data reuse)
6. attention → understand IO-aware algorithms (Flash Attention)
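Step 4's Welford algorithm computes mean and variance in a single pass with good numerical stability, which is why it shows up in the LayerNorm kernels; the two-level reduction then merges per-warp partial states. A scalar Python sketch of the update rule:

```python
def welford(xs):
    """One-pass mean/variance (Welford): for each x, update
    mean += delta / n and M2 += delta * (x - new_mean)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean           # deviation from the old mean
        mean += delta / n
        m2 += delta * (x - mean)   # uses old and new mean together
    var = m2 / n if n else 0.0     # population variance, as LayerNorm uses
    return mean, var
```

Unlike the naive E[x²] − E[x]² formula, this never subtracts two large nearly-equal quantities, so it stays accurate even in fp32 over long rows.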
Framework learning path:
CUDA (hand-written) → Triton (Python DSL) → CuTe DSL → CUTLASS
Each doc covers:
- Mathematical definition of the operator
- GPU parallelization strategy
- Design and analysis of each kernel version
- Roofline / performance analysis
- Key concepts explained in depth (with code examples)
- Reference benchmark data
- Summary of key takeaways