A hands-on cookbook for GPU kernel programming: CUDA, Triton, and CuTe DSL implementations of common deep learning operators, with detailed notes and benchmarks.
A systematic GPU kernel learning project: implement common deep learning operators and compare multiple GPU programming frameworks.
gpu-kernel-lab/
├── README.md
├── docs/                  # Detailed operator design docs
│   ├── vector_add.md      # CUDA thread model, coalesced access
│   ├── transpose.md       # Memory coalescing, shared memory, bank conflicts
│   ├── softmax.md         # Reduction, online softmax, warp shuffle
│   ├── layernorm.md       # Welford's algorithm, two-level reduction
│   ├── matmul.md          # Shared memory tiling, Roofline analysis
│   ├── attention.md       # Flash Attention, IO-aware algorithms
│   └── rms_norm.md        # RMS normalization, float4 vectorization, fused Add+Norm
│
├── common/
│   ├── utils.py           # Benchmark helpers, performance-metric computation
│   ├── tensor_utils.py    # Tensor generation helpers
│   └── check.py           # Correctness verification
│
├── benchmarks/
│   └── benchmark.py       # Unified benchmark entry point
│
├── operators/
│   ├── vector_add/        # ⭐ GPU thread model
│   ├── transpose/         # ⭐⭐ Memory coalescing
│   ├── softmax/           # ⭐⭐⭐ Reduction
│   ├── layernorm/         # ⭐⭐⭐ Warp reduction
│   ├── matmul/            # ⭐⭐⭐⭐ Shared memory tiling
│   ├── attention/         # ⭐⭐⭐⭐⭐ Fused kernel (Flash Attention)
│   ├── rms_norm/          # ⭐⭐⭐ float4 vectorization, fused Add+Norm
│   └── rope/              # ⭐⭐⭐ Rotary Position Embedding
│
└── scripts/
    ├── build_all.sh
    └── run_all_tests.sh
| Operator | Difficulty | Core Techniques | Docs |
|---|---|---|---|
| Vector Add | ⭐ | CUDA thread model, float4 vectorization | vector_add.md |
| Transpose | ⭐⭐ | Shared memory, bank conflicts | transpose.md |
| Softmax | ⭐⭐⭐ | Reduction, online softmax | softmax.md |
| LayerNorm | ⭐⭐⭐ | Welford's algorithm, warp reduction | layernorm.md |
| Matmul | ⭐⭐⭐⭐ | Shared memory tiling, Roofline | matmul.md |
| Attention | ⭐⭐⭐⭐⭐ | Flash Attention, IO-aware | attention.md |
| RMSNorm | ⭐⭐⭐ | float4 vectorization, fused Add+Norm | rms_norm.md |
| RoPE | ⭐⭐⭐ | Rotary Position Embedding | rope.md |
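The "online softmax" entry above refers to the single-pass algorithm that keeps a running maximum and a running sum of exponentials, rescaling the sum whenever the maximum grows; this is the same trick Flash Attention builds on. A minimal scalar Python sketch of the idea (illustrative only, not the CUDA kernel in this repo):

```python
import math

def online_softmax(xs):
    """Single-pass softmax: maintain a running max m and a running
    sum d of exp(x - m), rescaling d whenever m increases."""
    m = float("-inf")  # running max
    d = 0.0            # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # rescale the old sum to the new max, then add the new term
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```

In a GPU kernel the same (m, d) state is carried per thread and merged across the warp with shuffle instructions; here a plain loop suffices to show the recurrence.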
Each operator includes:
- CUDA: hand-written kernels, from naive to optimized (with detailed comments)
- Triton: Python DSL implementation
- CuTe DSL: CUTLASS's Python interface (selected operators)
- PyTorch: baseline (for correctness verification and performance comparison)
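Because floating-point reduction order differs between implementations, correctness checks like the one in common/check.py typically compare against the PyTorch baseline with a tolerance rather than exact equality. A hypothetical sketch of such a check (the function name and signature are illustrative, not the repo's actual API):

```python
def check_close(out, ref, rtol=1e-3, atol=1e-5):
    """Element-wise tolerance check in the style of torch.allclose /
    numpy.allclose: pass if |out - ref| <= atol + rtol * |ref|.
    Returns (ok, max_abs_err) for reporting."""
    max_err = 0.0
    ok = True
    for o, r in zip(out, ref):
        err = abs(o - r)
        max_err = max(max_err, err)
        if err > atol + rtol * abs(r):
            ok = False
    return ok, max_err
```

Loosening rtol for fp16/bf16 kernels (e.g. 1e-2) is common, since half-precision accumulation drifts further from the fp32 reference.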
CUDA >= 11.8
PyTorch >= 2.0
Triton >= 2.1
cutlass (optional, for the CuTe DSL)

# Build everything (defaults to sm_80 = A100/RTX 3090)
CUDA_ARCH=sm_80 bash scripts/build_all.sh

# Build a single operator
cd operators/matmul/cuda && bash build.sh

Common CUDA architectures:
| GPU | Architecture | CUDA_ARCH |
|---|---|---|
| RTX 5090 | Blackwell | sm_120 |
| H200 | Hopper | sm_90 |
| H20 | Hopper | sm_90 |
| L20 | Ada Lovelace | sm_89 |
| A100 | Ampere | sm_80 |
| RTX 30xx | Ampere | sm_86 |
Measured benchmark data is available for all of the following GPUs:
| GPU | Architecture | Memory | Peak FP32 | Memory Bandwidth |
|---|---|---|---|---|
| RTX 5090 | Blackwell (sm_120) | 32 GB | ~109 TFLOPS | ~1.79 TB/s |
| H200 SXM | Hopper (sm_90) | 141 GB | ~67 TFLOPS | ~4.8 TB/s |
| H20 | Hopper (sm_90) | 96 GB | ~44 TFLOPS | ~4.0 TB/s |
| L20 | Ada Lovelace (sm_89) | 48 GB | ~59.8 TFLOPS | ~864 GB/s |
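The peak-FLOPS and bandwidth columns together determine each GPU's roofline ridge point: the arithmetic intensity (FLOP/byte) at which a kernel transitions from memory-bound to compute-bound. A quick sketch using the figures from the table above:

```python
def ridge_point(peak_tflops, bw_tb_s):
    """Arithmetic intensity (FLOP/byte) where peak compute equals peak
    memory throughput: AI* = peak FLOP/s divided by bytes/s."""
    return (peak_tflops * 1e12) / (bw_tb_s * 1e12)

# FP32 figures from the table above (TFLOPS, TB/s)
for name, tflops, bw in [("RTX 5090", 109, 1.79),
                         ("H200 SXM", 67, 4.8),
                         ("H20", 44, 4.0),
                         ("L20", 59.8, 0.864)]:
    print(f"{name}: ridge at ~{ridge_point(tflops, bw):.1f} FLOP/byte")
```

Note how different the ridge points are: kernels that are compute-bound on an H200 (ridge ≈ 14 FLOP/byte) can still be memory-bound on an L20 (ridge ≈ 69), which is why the docs analyze each kernel's roofline per GPU.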
# Run all operator tests
bash scripts/run_all_tests.sh

# Test a single operator
python operators/matmul/test.py
python operators/attention/test.py

# Benchmark all operators
python benchmarks/benchmark.py

# Benchmark a specific operator
python benchmarks/benchmark.py --op matmul

# Save results
python benchmarks/benchmark.py --save

Recommended learning order:
1. vector_add → understand the CUDA thread hierarchy (grid/block/thread/warp)
2. transpose → understand memory access patterns (coalescing, shared memory, bank conflicts)
3. softmax → understand GPU reductions (tree reduce, warp shuffle)
4. layernorm → compound reductions (Welford's algorithm, two-level reduction)
5. matmul → understand compute-bound optimization (tiling, data reuse)
6. attention → understand IO-aware algorithms (Flash Attention)
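Step 4's Welford algorithm computes mean and variance in a single pass with good numerical stability, which is why it shows up in the LayerNorm kernels; the two-level reduction then merges per-warp partial states. A scalar Python sketch of the update rule:

```python
def welford(xs):
    """One-pass mean/variance (Welford): for each x, update
    mean += delta / n and M2 += delta * (x - new_mean)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean           # deviation from the old mean
        mean += delta / n
        m2 += delta * (x - mean)   # uses old and new mean together
    var = m2 / n if n else 0.0     # population variance, as LayerNorm uses
    return mean, var
```

Unlike the naive E[x²] − E[x]² formula, this never subtracts two large nearly-equal quantities, so it stays accurate even in fp32 over long rows.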
Framework learning path:
CUDA (hand-written) → Triton (Python DSL) → CuTe DSL → CUTLASS
Each doc covers:
- Mathematical definition of the operator
- GPU parallelization strategy
- Design and analysis of each kernel version
- Roofline / performance analysis
- Key concepts explained in depth (with code examples)
- Reference benchmark data
- Summary of key takeaways