Challenging myself to learn CUDA (Basics ⇾ Intermediate) over these 100 days.
> **Tip:** View these notes as a beautifully rendered webpage at cuda.firojpaudel.com.np
My learning resources:
- Books:
  - *CUDA by Example: An Introduction to General-Purpose GPU Programming* — Jason Sanders, Edward Kandrot
  - *Programming Massively Parallel Processors* (PMPP), 4th Edition — Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj
| Day | Learnt Topics | Links |
|---|---|---|
| Day 01 | History, applications, setup, and first Hello World CUDA program. Covers initial CUDA installation and running a basic kernel. | index.md |
| Day 02 | Parameter passing, device queries, vector addition on kernel, and PMPP Chapter 2 exercises. Explores kernel arguments and device properties. | index.md |
| Day 03 | Multidimensional grids, mapping threads to multidimensional data, and image color conversion. Practical thread mapping strategies. | index.md |
| Day 04 | Image blurring, matrix multiplication, and solutions to exercises. Focus on convolution and matrix operations in CUDA. | index.md |
| Day 05 | Modern GPU architecture, block scheduling, barrier synchronization, and use of `__syncthreads()`. | index.md |
| Day 06 | Warps, SIMD hardware, GPU architecture, and introduction to control divergence. | index.md |
| Day 07 | Impact of divergence on performance, types of divergence, identification, and performance analysis. | index.md |
| Day 08 | Warp scheduling, latency tolerance, resource partitioning, and occupancy. | index.md |
| Day 09 | Memory access efficiency, roofline model, and matrix multiplication code optimization. | index.md |
| Day 10 | CUDA memory types: global, constant, local, registers, and shared memory. | index.md |
| Day 11 | Tiling concept and memory tradeoffs in CUDA matrix multiplication. | index.md |
| Day 12 | Explanation for tiled matrix multiplication, impact of memory usage on occupancy, and dynamic tiling. | index.md |
| Day 13 | Memory coalescing, row-major vs. column-major storage, and DRAM burst access in CUDA. | index.md |
| Day 14 | Corner turning in matrix multiplication, memory coalescing analogies, and latency hiding. | index.md |
| Day 15 | Thread coarsening and exercises from PMPP Chapter 6. | index.md |
| Day 16 | Start of convolutions: 1D and 2D convolution with boundary conditions. | index.md |
| Day 17 | Parallel 2D convolution with edge handling and normalization. | index.md |
| Day 18 | Convolution on 2D images: preprocessing, CUDA kernel, and post-processing. | index.md |
| Day 19 | Filter array properties, constant memory, caching, tiled convolution with halo cells, and thread strategies. | index.md |
| Day 20 | Tiled convolution using caches for halo cells and exercises from Chapter 7. | index.md |
| Day 21 | Stencil vs. convolution, parallel stencil algorithms, and code implementations. | index.md |
| Day 22 | Thread coarsening and optimization for 3D stencil computations. | index.md |
| Day 23 | Exercises from Chapter 8 and chapter completion. | index.md |
| Day 24 | Introduction to parallel histogram and code implementation. | index.md |
| Day 25 | Atomic operations, privatization, coarsening, and aggregation in CUDA. | index.md |
| Day 26 | Reduction: max and sum reduction, and exercises from Chapter 10. | index.md |
| Day 27 | Simple sum reduction kernel and convergent sum reduction. | index.md |
| Day 28 | Shared memory for reduction, hierarchical reduction, and thread coarsening. | index.md |
| Day 29 | Exercises from Chapter 10. | index.md |
| Day 30 | Parallel prefix scan and Kogge-Stone algorithm. | index.md |
| Day 31 | Kogge-Stone continued, complexity analysis, exclusive and inclusive scans. | index.md |
| Day 32 | Brent-Kung parallel inclusive scan algorithm. | index.md |
| Day 33 | Thread coarsening in detail and its impact on performance. | index.md |
| Day 34 | Coarsening complexity analysis and hierarchical scan. | index.md |
| Day 35 | Exercises from Chapter 11. | index.md |
| Day 36 | Sequential merge and introduction to parallel merge algorithms. | index.md |
| Day 37 | Parallel merge kernels, co-ranks, and divide and conquer strategies. | index.md |
| Day 38 | Tiled merge kernels and their performance benefits. | index.md |
| Day 39 | Exercises from Chapter 12. | index.md |
| Day 40 | Parallel radix sort and its CUDA implementation. | index.md |
| Day 41 | Choice of radix, multi-bit radix, memory coalescing, and parallel merge sort. | index.md |
| Day 42 | Exercises from Chapter 13. | index.md |
| Day 43 | SpMV with COO format and code implementation. | index.md |
| Day 44 | CSR and ELL formats for sparse matrices in CUDA. | index.md |
| Day 45 | Hybrid ELL-COO format, JDS format, and parallelization strategies. | index.md |
| Day 46 | Exercises from Chapter 14. | index.md |
| Day 47 | Normal BFS and introduction to graph traversal in CUDA. | index.md |
| Day 48 | Vertex-centric parallelization: pull and push methods. | index.md |
| Day 49 | Edge-centric parallelization and frontier-based graph processing. | index.md |
| Day 50 | Privatization and exercises from Chapter 15. | index.md |
| Day 51 | CNNs: basic ML concepts and CNN architecture. | index.md |
| Day 52 | Vector addition and matrix multiplication in PyCUDA. | index.md |
| Day 53 | CNN forward pass: CUDA implementation and performance. | index.md |
| Day 54 | Backpropagation in CUDA: implementation and explanation. | index.md |
| Day 55 | Complete backpropagation for CNN in CUDA. | index.md |
| Day 56 | ReLU activation function in PyCUDA: implementation and testing. | index.md |
| Day 57 | Matrix inversion kernel in PyCUDA and its applications. | index.md |
| Day 58 | Batch normalization in PyCUDA: implementation and usage. | index.md |
| Day 59 | Layer normalization in PyCUDA: theory and code. | index.md |
| Day 60 | Multi-Head Self-Attention in Triton, initial implementation and notes. | index.md |
| Day 61 | Fixed and explained MHA Triton implementation, detailed kernel parameter breakdown. | index.md |
| Day 62 | CUDA CNN inference kernel design, thread organization, grid mapping. | index.md |
| Day 63 | Explored cuDNN for DNN acceleration, convolution parameterization. | index.md |
| Day 64 | Implemented Batch Norm with cuDNN, shared initial approach. | index.md |
| Day 65 | Pooling forward pass (LeNet-5), memory layout discussion. | index.md |
| Day 66 | MRI image reconstruction, k-space, FFT, and scan strategies. | index.md |
| Day 67 | Iterative MRI reconstruction, quasi-Bayesian estimation, large matrix challenges. | index.md |
| Day 68 | Step-by-step optimization of F^H D kernel for MRI, parallelization, atomic ops. | index.md |
| Day 69 | Dynamic Parallelism in CUDA, device kernel launches, recursion. | index.md |
| Day 70 | Tensara competition: Leaky ReLU and L1 Norm kernel submissions. | index.md |
| Day 71 | Tanh, Softmax, and Vector Addition (loop unrolling, shared memory) kernels. | index.md |
| Day 72 | Matrix Scalar Multiplication and Matrix Vector Multiplication, performance notes. | index.md |
| Day 73 | GEMM with bias and ReLU activation: `C = ReLU(A · Wᵀ + b)`. | index.md |
| Day 74 | Prefix Sum (Inclusive Scan), Diagonal Matrix Multiplication, ELU kernel. | index.md |
| Day 75 | Cumulative product kernels: naive, multi-kernel, performance analysis. | index.md |
| Day 76 | Fixed cumulative product with thrust, 4D/3D tensor matmul, cosine similarity. | index.md |
| Day 77 | Hinge Loss, Hard Sigmoid, Huber Loss, SELU kernels; reached Tensara global rank 1. | index.md |
| Day 78 | Swish activation function: multiple kernel approaches, benchmarking. | index.md |
| Day 79 | RMS Normalization kernel and performance benchmarking. | index.md |
| Day 80 | Optimized Frobenius Norm kernel and Mat-Mul kernel for high GFLOPs. | index.md |
| Day 81 | Frobenius Normalization implementation. | index.md |
| Day 82 | Softplus kernel and Min Over Dimension kernel, performance notes. | index.md |
| Day 83 | 1D Convolution kernel for Tensara competition. | index.md |
| Day 84 | KL-Divergence kernel and benchmarking on Tensara. | index.md |
| Day 85 | Improved vector addition and ReLU kernel for higher GFLOPs. | index.md |
| Day 86 | Layer Normalization kernel on 4D Tensor, performance benchmarking. | index.md |
| Day 87 | Improved Leaky ReLU and Lower Triangular Matrix Multiplication kernels. | index.md |
| Day 88 | Upper Triangular Matrix Multiplication kernel, performance notes. | index.md |
| Day 89 | L2 Normalization and optimized KL divergence kernels. | index.md |
| Day 90 | Symmetric Matrix Multiplication and GEMM with bias and ReLU kernels. | index.md |
| Day 91 | Triplet Margin Loss and optimized Softplus kernel. | index.md |
| Day 92 | GELU kernel and performance benchmarking. | index.md |
| Day 93 | Product Over a Dimension kernel: two implementations, performance notes. | index.md |
| Day 94 | 2D Convolution kernel: naive, optimized, performance comparison. | index.md |
| Day 95 | MSE Loss kernel and performance on H100 and L40S GPUs. | index.md |
| Day 96 | MSE Loss kernel and performance on H100 and L40S GPUs. | index.md |
| Day 97 | Sigmoid Activation Function kernel and performance notes. | index.md |
| Day 98 | Matrix Multiplication with Swish Activation and optimized L1 Norm kernel. | index.md |
| Day 99 | 2D Average Pooling and optimized MatMul kernel with `half2` and `__hfma2`. | index.md |
| Day 100 | 2D Max Pooling kernel and challenge completion reflection. | index.md |
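As a taste of where the challenge begins, the Day 01–02 material (first kernel, vector addition) can be sketched roughly as below. This is a minimal illustrative sketch, not the exact code from the notes — the notes may use explicit `cudaMalloc`/`cudaMemcpy` instead of unified memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Vector addition: each thread computes one element of c = a + b.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard out-of-range threads
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the sketch short; host and device share the pointers.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;
    int grid = (n + block - 1) / block;  // ceil(n / block) blocks cover all elements
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();             // kernel launches are asynchronous

    printf("c[0] = %.1f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The boundary check `if (i < n)` matters because the grid is rounded up to a whole number of blocks, so the last block may contain threads past the end of the arrays — a recurring pattern throughout the later days' kernels.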
- Comprehensive Coverage: From CUDA basics to advanced deep learning and transformer architectures.
- Hands-on Code: Every day features real CUDA code, with a focus on practical, high-performance GPU programming.
- Modern Deep Learning: Includes CNNs, attention mechanisms (Triton MHA), normalization layers, and more.
- Performance Optimization: Profiling, memory coalescing, occupancy, and kernel-level tuning strategies.
Stay tuned for more advanced CUDA explorations, real-world projects, and deep dives into GPU-powered AI!
Follow this repository for future updates and bonus content.