
100_days_of_CUDA

Challenging myself to learn CUDA (basics → intermediate) over these 100 days.

> **Tip:** View these notes as a beautifully rendered webpage at cuda.firojpaudel.com.np

My learning resources:

  1. Books:
    • *CUDA by Example: An Introduction to General-Purpose GPU Programming* — Jason Sanders, Edward Kandrot
    • *Programming Massively Parallel Processors* (PMPP), 4th Edition — Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj


| Day | Topics | Notes |
| --- | --- | --- |
| Day 01 | History, applications, setup, and first Hello World CUDA program. Covers initial CUDA installation and running a basic kernel. | index.md |
| Day 02 | Parameter passing, device queries, vector addition kernel, and PMPP Chapter 2 exercises. Explores kernel arguments and device properties. | index.md |
| Day 03 | Multidimensional grids, mapping threads to multidimensional data, and image color conversion. Practical thread mapping strategies. | index.md |
| Day 04 | Image blurring, matrix multiplication, and solutions to exercises. Focus on convolution and matrix operations in CUDA. | index.md |
| Day 05 | Modern GPU architecture, block scheduling, barrier synchronization, and use of `__syncthreads()`. | index.md |
| Day 06 | Warps, SIMD hardware, GPU architecture, and introduction to control divergence. | index.md |
| Day 07 | Impact of divergence on performance, types of divergence, identification, and performance analysis. | index.md |
| Day 08 | Warp scheduling, latency tolerance, resource partitioning, and occupancy. | index.md |
| Day 09 | Memory access efficiency, roofline model, and matrix multiplication code optimization. | index.md |
| Day 10 | CUDA memory types: global, constant, local, registers, and shared memory. | index.md |
| Day 11 | Tiling concept and memory tradeoffs in CUDA matrix multiplication. | index.md |
| Day 12 | Explanation of tiled matrix multiplication, impact of memory usage on occupancy, and dynamic tiling. | index.md |
| Day 13 | Memory coalescing, row-major vs. column-major storage, and DRAM burst access in CUDA. | index.md |
| Day 14 | Corner turning in matrix multiplication, memory coalescing analogies, and latency hiding. | index.md |
| Day 15 | Thread coarsening and exercises from PMPP Chapter 6. | index.md |
| Day 16 | Start of convolutions: 1D and 2D convolution with boundary conditions. | index.md |
| Day 17 | Parallel 2D convolution with edge handling and normalization. | index.md |
| Day 18 | Convolution on 2D images: preprocessing, CUDA kernel, and post-processing. | index.md |
| Day 19 | Filter array properties, constant memory, caching, tiled convolution with halo cells, and thread strategies. | index.md |
| Day 20 | Tiled convolution using caches for halo cells and exercises from Chapter 7. | index.md |
| Day 21 | Stencil vs. convolution, parallel stencil algorithms, and code implementations. | index.md |
| Day 22 | Thread coarsening and optimization for 3D stencil computations. | index.md |
| Day 23 | Exercises from Chapter 8 and chapter completion. | index.md |
| Day 24 | Introduction to parallel histogram and code implementation. | index.md |
| Day 25 | Atomic operations, privatization, coarsening, and aggregation in CUDA. | index.md |
| Day 26 | Reduction: max and sum reduction, and exercises from Chapter 10. | index.md |
| Day 27 | Simple sum reduction kernel and convergent sum reduction. | index.md |
| Day 28 | Shared memory for reduction, hierarchical reduction, and thread coarsening. | index.md |
| Day 29 | Exercises from Chapter 10. | index.md |
| Day 30 | Parallel prefix scan and Kogge-Stone algorithm. | index.md |
| Day 31 | Kogge-Stone continued, complexity analysis, exclusive and inclusive scans. | index.md |
| Day 32 | Brent-Kung parallel inclusive scan algorithm. | index.md |
| Day 33 | Thread coarsening in detail and its impact on performance. | index.md |
| Day 34 | Coarsening complexity analysis and hierarchical scan. | index.md |
| Day 35 | Exercises from Chapter 11. | index.md |
| Day 36 | Sequential merge and introduction to parallel merge algorithms. | index.md |
| Day 37 | Parallel merge kernels, co-ranks, and divide-and-conquer strategies. | index.md |
| Day 38 | Tiled merge kernels and their performance benefits. | index.md |
| Day 39 | Exercises from Chapter 12. | index.md |
| Day 40 | Parallel radix sort and its CUDA implementation. | index.md |
| Day 41 | Choice of radix, multi-bit radix, memory coalescing, and parallel merge sort. | index.md |
| Day 42 | Exercises from Chapter 13. | index.md |
| Day 43 | SpMV with COO format and code implementation. | index.md |
| Day 44 | CSR and ELL formats for sparse matrices in CUDA. | index.md |
| Day 45 | Hybrid ELL-COO format, JDS format, and parallelization strategies. | index.md |
| Day 46 | Exercises from Chapter 14. | index.md |
| Day 47 | Normal BFS and introduction to graph traversal in CUDA. | index.md |
| Day 48 | Vertex-centric parallelization: pull and push methods. | index.md |
| Day 49 | Edge-centric parallelization and frontier-based graph processing. | index.md |
| Day 50 | Privatization and exercises from Chapter 15. | index.md |
| Day 51 | CNNs: basic ML concepts and CNN architecture. | index.md |
| Day 52 | Vector addition and matrix multiplication in PyCUDA. | index.md |
| Day 53 | CNN forward pass: CUDA implementation and performance. | index.md |
| Day 54 | Backpropagation in CUDA: implementation and explanation. | index.md |
| Day 55 | Complete backpropagation for CNN in CUDA. | index.md |
| Day 56 | ReLU activation function in PyCUDA: implementation and testing. | index.md |
| Day 57 | Matrix inversion kernel in PyCUDA and its applications. | index.md |
| Day 58 | Batch normalization in PyCUDA: implementation and usage. | index.md |
| Day 59 | Layer normalization in PyCUDA: theory and code. | index.md |
| Day 60 | Multi-Head Self-Attention in Triton: initial implementation and notes. | index.md |
| Day 61 | Fixed and explained MHA Triton implementation, detailed kernel parameter breakdown. | index.md |
| Day 62 | CUDA CNN inference kernel design, thread organization, grid mapping. | index.md |
| Day 63 | Explored cuDNN for DNN acceleration, convolution parameterization. | index.md |
| Day 64 | Implemented Batch Norm with cuDNN, shared initial approach. | index.md |
| Day 65 | Pooling forward pass (LeNet-5), memory layout discussion. | index.md |
| Day 66 | MRI image reconstruction, k-space, FFT, and scan strategies. | index.md |
| Day 67 | Iterative MRI reconstruction, quasi-Bayesian estimation, large matrix challenges. | index.md |
| Day 68 | Step-by-step optimization of F^H D kernel for MRI, parallelization, atomic ops. | index.md |
| Day 69 | Dynamic Parallelism in CUDA, device kernel launches, recursion. | index.md |
| Day 70 | Tensara competition: Leaky ReLU and L1 Norm kernel submissions. | index.md |
| Day 71 | Tanh, Softmax, and Vector Addition (loop unrolling, shared memory) kernels. | index.md |
| Day 72 | Matrix Scalar Multiplication and Matrix Vector Multiplication, performance notes. | index.md |
| Day 73 | GEMM with bias and ReLU activation: C = ReLU(A · W^T + b). | index.md |
| Day 74 | Prefix Sum (Inclusive Scan), Diagonal Matrix Multiplication, ELU kernel. | index.md |
| Day 75 | Cumulative product kernels: naive, multi-kernel, performance analysis. | index.md |
| Day 76 | Fixed cumulative product with Thrust, 4D/3D tensor matmul, cosine similarity. | index.md |
| Day 77 | Hinge Loss, Hard Sigmoid, Huber Loss, SELU kernels; reached Tensara global rank 1. | index.md |
| Day 78 | Swish activation function: multiple kernel approaches, benchmarking. | index.md |
| Day 79 | RMS Normalization kernel and performance benchmarking. | index.md |
| Day 80 | Optimized Frobenius Norm kernel and Mat-Mul kernel for high GFLOPs. | index.md |
| Day 81 | Frobenius Normalization implementation. | index.md |
| Day 82 | Softplus kernel and Min Over Dimension kernel, performance notes. | index.md |
| Day 83 | 1D Convolution kernel for Tensara competition. | index.md |
| Day 84 | KL-Divergence kernel and benchmarking on Tensara. | index.md |
| Day 85 | Improved vector addition and ReLU kernel for higher GFLOPs. | index.md |
| Day 86 | Layer Normalization kernel on 4D Tensor, performance benchmarking. | index.md |
| Day 87 | Improved Leaky ReLU and Lower Triangular Matrix Multiplication kernels. | index.md |
| Day 88 | Upper Triangular Matrix Multiplication kernel, performance notes. | index.md |
| Day 89 | L2 Normalization and optimized KL divergence kernels. | index.md |
| Day 90 | Symmetric Matrix Multiplication and GEMM with bias and ReLU kernels. | index.md |
| Day 91 | Triplet Margin Loss and optimized Softplus kernel. | index.md |
| Day 92 | GELU kernel and performance benchmarking. | index.md |
| Day 93 | Product Over a Dimension kernel: two implementations, performance notes. | index.md |
| Day 94 | 2D Convolution kernel: naive, optimized, performance comparison. | index.md |
| Day 95 | MSE Loss kernel and performance on H100 and L40S GPUs. | index.md |
| Day 96 | MSE Loss kernel and performance on H100 and L40S GPUs. | index.md |
| Day 97 | Sigmoid Activation Function kernel and performance notes. | index.md |
| Day 98 | Matrix Multiplication with Swish Activation and optimized L1 Norm kernel. | index.md |
| Day 99 | 2D Average Pooling and optimized MatMul kernel with half2 and `__hfma2`. | index.md |
| Day 100 | 2D Max Pooling kernel and challenge completion reflection. | index.md |
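
As a taste of where the challenge starts, here is a minimal sketch of the kind of "Hello World" kernel Day 01 covers. This is a generic illustrative example, not the exact code from the notes:

```cuda
#include <cstdio>

// A trivial kernel: each GPU thread prints its own thread and block index.
__global__ void hello_kernel() {
    printf("Hello from thread %d of block %d\n", threadIdx.x, blockIdx.x);
}

int main() {
    // Launch 1 block of 4 threads on the device.
    hello_kernel<<<1, 4>>>();
    // Block the host until the kernel (and its printf output) completes.
    cudaDeviceSynchronize();
    return 0;
}
```

Compile with `nvcc hello.cu -o hello` and run on any CUDA-capable GPU.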

✨ Project Highlights

  • Comprehensive Coverage: From CUDA basics to advanced deep learning and transformer architectures.
  • Hands-on Code: Every day features real CUDA code, with a focus on practical, high-performance GPU programming.
  • Modern Deep Learning: Includes CNNs, RNNs, attention mechanisms, normalization, and more.
  • Performance Optimization: Profiling, memory management, and multi-GPU strategies.

🌟 What's Next?

Stay tuned for more advanced CUDA explorations, real-world projects, and deep dives into GPU-powered AI!
Follow this repository for future updates and bonus content.

