Challenging myself to learn CUDA (Basics ⇾ Intermediate) over these 100 days.
> **Tip:** View these notes as a beautifully rendered webpage at cuda.firojpaudel.com.np
My learning resources:
- Books:
  - *CUDA by Example: An Introduction to General-Purpose GPU Programming* — Jason Sanders, Edward Kandrot
  - *Programming Massively Parallel Processors* (PMPP), 4th Edition — Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj
| Day | Learnt Topics | Links |
|---|---|---|
| Day 01 | History, applications, setup, and first Hello World CUDA program. Covers initial CUDA installation and running a basic kernel. | index.md |
| Day 02 | Parameter passing, device queries, vector addition on kernel, and PMPP Chapter 2 exercises. Explores kernel arguments and device properties. | index.md |
| Day 03 | Multidimensional grids, mapping threads to multidimensional data, and image color conversion. Practical thread mapping strategies. | index.md |
| Day 04 | Image blurring, matrix multiplication, and solutions to exercises. Focus on convolution and matrix operations in CUDA. | index.md |
| Day 05 | Modern GPU architecture, block scheduling, barrier synchronization, and use of `__syncthreads()`. | index.md |
| Day 06 | Warps, SIMD hardware, GPU architecture, and introduction to control divergence. | index.md |
| Day 07 | Impact of divergence on performance, types of divergence, identification, and performance analysis. | index.md |
| Day 08 | Warp scheduling, latency tolerance, resource partitioning, and occupancy. | index.md |
| Day 09 | Memory access efficiency, roofline model, and matrix multiplication code optimization. | index.md |
| Day 10 | CUDA memory types: global, constant, local, registers, and shared memory. | index.md |
| Day 11 | Tiling concept and memory tradeoffs in CUDA matrix multiplication. | index.md |
| Day 12 | Explanation for tiled matrix multiplication, impact of memory usage on occupancy, and dynamic tiling. | index.md |
| Day 13 | Memory coalescing, row-major vs. column-major storage, and DRAM burst access in CUDA. | index.md |
| Day 14 | Corner turning in matrix multiplication, memory coalescing analogies, and latency hiding. | index.md |
| Day 15 | Thread coarsening and exercises from PMPP Chapter 6. | index.md |
| Day 16 | Start of convolutions: 1D and 2D convolution with boundary conditions. | index.md |
| Day 17 | Parallel 2D convolution with edge handling and normalization. | index.md |
| Day 18 | Convolution on 2D images: preprocessing, CUDA kernel, and post-processing. | index.md |
| Day 19 | Filter array properties, constant memory, caching, tiled convolution with halo cells, and thread strategies. | index.md |
| Day 20 | Tiled convolution using caches for halo cells and exercises from Chapter 7. | index.md |
| Day 21 | Stencil vs. convolution, parallel stencil algorithms, and code implementations. | index.md |
| Day 22 | Thread coarsening and optimization for 3D stencil computations. | index.md |
| Day 23 | Exercises from Chapter 8 and chapter completion. | index.md |
| Day 24 | Introduction to parallel histogram and code implementation. | index.md |
| Day 25 | Atomic operations, privatization, coarsening, and aggregation in CUDA. | index.md |
| Day 26 | Reduction: max and sum reduction, and exercises from Chapter 10. | index.md |
| Day 27 | Simple sum reduction kernel and convergent sum reduction. | index.md |
| Day 28 | Shared memory for reduction, hierarchical reduction, and thread coarsening. | index.md |
| Day 29 | Exercises from Chapter 10. | index.md |
| Day 30 | Parallel prefix scan and Kogge-Stone algorithm. | index.md |
| Day 31 | Kogge-Stone continued, complexity analysis, exclusive and inclusive scans. | index.md |
| Day 32 | Brent-Kung parallel inclusive scan algorithm. | index.md |
| Day 33 | Thread coarsening in detail and its impact on performance. | index.md |
| Day 34 | Coarsening complexity analysis and hierarchical scan. | index.md |
| Day 35 | Exercises from Chapter 11. | index.md |
| Day 36 | Sequential merge and introduction to parallel merge algorithms. | index.md |
| Day 37 | Parallel merge kernels, co-ranks, and divide and conquer strategies. | index.md |
| Day 38 | Tiled merge kernels and their performance benefits. | index.md |
| Day 39 | Exercises from Chapter 12. | index.md |
| Day 40 | Parallel radix sort and its CUDA implementation. | index.md |
| Day 41 | Choice of radix, multi-bit radix, memory coalescing, and parallel merge sort. | index.md |
| Day 42 | Exercises from Chapter 13. | index.md |
| Day 43 | SpMV with COO format and code implementation. | index.md |
| Day 44 | CSR and ELL formats for sparse matrices in CUDA. | index.md |
| Day 45 | Hybrid ELL-COO format, JDS format, and parallelization strategies. | index.md |
| Day 46 | Exercises from Chapter 14. | index.md |
| Day 47 | Normal BFS and introduction to graph traversal in CUDA. | index.md |
| Day 48 | Vertex-centric parallelization: pull and push methods. | index.md |
| Day 49 | Edge-centric parallelization and frontier-based graph processing. | index.md |
| Day 50 | Privatization and exercises from Chapter 15. | index.md |
| Day 51 | CNNs: basic ML concepts and CNN architecture. | index.md |
| Day 52 | Vector addition and matrix multiplication in PyCUDA. | index.md |
| Day 53 | CNN forward pass: CUDA implementation and performance. | index.md |
| Day 54 | Backpropagation in CUDA: implementation and explanation. | index.md |
| Day 55 | Complete backpropagation for CNN in CUDA. | index.md |
| Day 56 | ReLU activation function in PyCUDA: implementation and testing. | index.md |
| Day 57 | Matrix inversion kernel in PyCUDA and its applications. | index.md |
| Day 58 | Batch normalization in PyCUDA: implementation and usage. | index.md |
| Day 59 | Layer normalization in PyCUDA: theory and code. | index.md |
| Day 60 | Multi-Head Self-Attention in Triton, initial implementation and notes. | index.md |
| Day 61 | Fixed and explained MHA Triton implementation, detailed kernel parameter breakdown. | index.md |
| Day 62 | CUDA CNN inference kernel design, thread organization, grid mapping. | index.md |
| Day 63 | Explored cuDNN for DNN acceleration, convolution parameterization. | index.md |
| Day 64 | Implemented Batch Norm with cuDNN, shared initial approach. | index.md |
| Day 65 | Pooling forward pass (LeNet-5), memory layout discussion. | index.md |
| Day 66 | MRI image reconstruction, k-space, FFT, and scan strategies. | index.md |
| Day 67 | Iterative MRI reconstruction, quasi-Bayesian estimation, large matrix challenges. | index.md |
| Day 68 | Step-by-step optimization of F^H D kernel for MRI, parallelization, atomic ops. | index.md |
| Day 69 | Dynamic Parallelism in CUDA, device kernel launches, recursion. | index.md |
| Day 70 | Tensara competition: Leaky ReLU and L1 Norm kernel submissions. | index.md |
| Day 71 | Tanh, Softmax, and Vector Addition (loop unrolling, shared memory) kernels. | index.md |
| Day 72 | Matrix Scalar Multiplication and Matrix Vector Multiplication, performance notes. | index.md |
| Day 73 | GEMM with bias and ReLU activation: `C = ReLU(A · Wᵀ + b)`. | index.md |
| Day 74 | Prefix Sum (Inclusive Scan), Diagonal Matrix Multiplication, ELU kernel. | index.md |
| Day 75 | Cumulative product kernels: naive, multi-kernel, performance analysis. | index.md |
| Day 76 | Fixed cumulative product with thrust, 4D/3D tensor matmul, cosine similarity. | index.md |
| Day 77 | Hinge Loss, Hard Sigmoid, Huber Loss, SELU kernels; reached Tensara global rank 1. | index.md |
| Day 78 | Swish activation function: multiple kernel approaches, benchmarking. | index.md |
| Day 79 | RMS Normalization kernel and performance benchmarking. | index.md |
| Day 80 | Optimized Frobenius Norm kernel and Mat-Mul kernel for high GFLOPs. | index.md |
| Day 81 | Frobenius Normalization implementation. | index.md |
| Day 82 | Softplus kernel and Min Over Dimension kernel, performance notes. | index.md |
| Day 83 | 1D Convolution kernel for Tensara competition. | index.md |
| Day 84 | KL-Divergence kernel and benchmarking on Tensara. | index.md |
| Day 85 | Improved vector addition and ReLU kernel for higher GFLOPs. | index.md |
| Day 86 | Layer Normalization kernel on 4D Tensor, performance benchmarking. | index.md |
| Day 87 | Improved Leaky ReLU and Lower Triangular Matrix Multiplication kernels. | index.md |
| Day 88 | Upper Triangular Matrix Multiplication kernel, performance notes. | index.md |
| Day 89 | L2 Normalization and optimized KL divergence kernels. | index.md |
| Day 90 | Symmetric Matrix Multiplication and GEMM with bias and ReLU kernels. | index.md |
| Day 91 | Triplet Margin Loss and optimized Softplus kernel. | index.md |
| Day 92 | GELU kernel and performance benchmarking. | index.md |
| Day 93 | Product Over a Dimension kernel: two implementations, performance notes. | index.md |
| Day 94 | 2D Convolution kernel: naive, optimized, performance comparison. | index.md |
| Day 95 | MSE Loss kernel and performance on H100 and L40S GPUs. | index.md |
| Day 96 | MSE Loss kernel and performance on H100 and L40S GPUs. | index.md |
| Day 97 | Sigmoid Activation Function kernel and performance notes. | index.md |
| Day 98 | Matrix Multiplication with Swish Activation and optimized L1 Norm kernel. | index.md |
| Day 99 | 2D Average Pooling and optimized MatMul kernel with `half2` and `__hfma2`. | index.md |
| Day 100 | 2D Max Pooling kernel and challenge completion reflection. | index.md |
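As a taste of where the challenge begins, the Day 01–02 material (first kernel, vector addition) can be sketched roughly as below. This is a minimal illustrative sketch, not the exact code from the notes — the notes may use explicit `cudaMalloc`/`cudaMemcpy` instead of unified memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Vector addition: each thread computes one element of c = a + b.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard out-of-range threads
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the sketch short; host and device share the pointers.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;
    int grid = (n + block - 1) / block;  // ceil(n / block) blocks cover all elements
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();             // kernel launches are asynchronous

    printf("c[0] = %.1f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The boundary check `if (i < n)` matters because the grid is rounded up to a whole number of blocks, so the last block may contain threads past the end of the arrays — a recurring pattern throughout the later days' kernels.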
- Comprehensive Coverage: From CUDA basics to advanced deep learning and transformer architectures.
- Hands-on Code: Every day features real CUDA code, with a focus on practical, high-performance GPU programming.
- Modern Deep Learning: Includes CNNs, attention mechanisms (Triton MHA), normalization layers, and more.
- Performance Optimization: Profiling, memory coalescing, occupancy, and kernel-level tuning strategies.
Stay tuned for more advanced CUDA explorations, real-world projects, and deep dives into GPU-powered AI!
Follow this repository for future updates and bonus content.