We already have a large set of Triton kernels (e.g. 14_Gemm_*, 81_Gemm_*, 95_Matmul_*, FlashAttention, etc.), but they are currently:
- flat (no structure)
- undocumented
- hard to navigate for new users
- not clearly categorized by use case or optimization pattern
Instead of adding new examples, we should convert a subset of existing kernels into curated, documented examples.
Something like:
GEMM
- 14_Gemm_Divide_Sum_Scaling
- 39_Gemm_Scale_BatchNorm
Fused / complex pipelines
- 81_Gemm_Swish_Divide_Clamp_Tanh_Clamp
- 95_Matmul_Add_Swish_Tanh_GELU_Hardtanh
Reduction / normalization
- 84_Gemm_BatchNorm_Scaling_Softmax
Attention
- 1_FlashAttention_Fwd
Mixed ops
- 55_Matmul_MaxPool_Sum_Scale
- 68_Matmul_Min_Subtract
Plus, add an EXAMPLES.md file.
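
One possible way to keep EXAMPLES.md in sync with the grouping is to generate it from a single mapping. The sketch below is only an illustration: the `CATEGORIES` dict and the `render_examples_md` helper are hypothetical names, not existing code in the repo, and the output layout (one heading per category, one bullet per kernel) is an assumption about how the file could look.

```python
# Hypothetical category -> kernel mapping, copied from the grouping proposed
# in this issue. Neither this dict nor the helper below exists in the repo.
CATEGORIES = {
    "GEMM": [
        "14_Gemm_Divide_Sum_Scaling",
        "39_Gemm_Scale_BatchNorm",
    ],
    "Fused / complex pipelines": [
        "81_Gemm_Swish_Divide_Clamp_Tanh_Clamp",
        "95_Matmul_Add_Swish_Tanh_GELU_Hardtanh",
    ],
    "Reduction / normalization": [
        "84_Gemm_BatchNorm_Scaling_Softmax",
    ],
    "Attention": [
        "1_FlashAttention_Fwd",
    ],
    "Mixed ops": [
        "55_Matmul_MaxPool_Sum_Scale",
        "68_Matmul_Min_Subtract",
    ],
}


def render_examples_md(categories):
    """Render a markdown index: one section per category, one bullet per kernel."""
    lines = ["# Examples", ""]
    for category, kernels in categories.items():
        lines.append(f"## {category}")
        lines.append("")
        for name in kernels:
            lines.append(f"- `{name}`")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the index; redirect to EXAMPLES.md if the layout looks right.
    print(render_examples_md(CATEGORIES))
```

A short description per kernel (use case, optimization pattern) could be added to the mapping later; the sketch only covers the index itself.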