CUDA kernel reference built over 100 days of deep study. Organized by optimization technique — not by day. Covers memory coalescing, warp divergence, tensor cores, async copy, and attention kernels.

Instructions: https://github.com/hkproj/100-days-of-cuda

See TOPIC_INDEX.md to navigate by technique rather than chronologically.

Day	Link	Notes
1	Vector Addition Kernel	Learned basic CUDA syntax and kernel execution: vector addition and printing "Hello CUDA".
2	Benchmarking Vector Add	Explored benchmarking in CUDA using vector addition.
3	Cuda Streams	A CUDA stream is a sequence of operations (memory transfers, kernel launches, etc.) that execute in order within the stream, while operations in different streams can run concurrently.
4	Unified Mem VectorAdd	Unified Memory simplifies memory management by allowing the CPU and GPU to share the same memory space.
5	Tiled MatMul	Matrix Multiplication in CUDA using shared memory to optimize performance. Tiling improves memory access efficiency by reducing global memory accesses and leveraging shared memory for faster computation.
6	Matrix Transpose	Coalesced memory access refers to a pattern where multiple threads in a warp access consecutive memory locations, leading to efficient memory transactions.
7	Basic GEMM with Optimizations	Utilizes shared memory tiling, loop unrolling, and parallel execution for high performance.
8	WMMA (Tensor Core with Double buffering)	WMMA leverages specialized Tensor Cores on NVIDIA GPUs to accelerate matrix multiplications.
9	Speeds Comparisons Matmul	Naive vs Tiled vs Thread Tiling vs WMMA/Tensor Core
10	Advance Profiling	Importance of CUDA profiling; using Nsight Systems.
11	Cuda Basic Softmax	Understanding the softmax algorithm and implementing it in CUDA.
12	Better Softmax	Optimizing the softmax kernel and benchmarking it.
13	SoftMax FP16 Acceleration	Achieved a higher speedup using FP16 Tensor Core optimization.
14	Tensor MatMul	Naive vs Tensor core Matmul
15	CUDA Graphs	Reduced overhead, improved performance, simplified code.
16	SoftMax SuperFast	Implemented a CUDA softmax that uses cuDNN and CUDA streams with FP16 acceleration.
17	cuBLAS VectorAdd	Used cuBLAS to perform vector addition and benchmarked it.
18	cuBLAS MatrixMultiplication	cuBLAS matmul with cuRAND for random number generation, plus benchmarking.
19	Sum Reduction	Performs a parallel reduction of the input array in blocks. Each thread adds elements in a range, and shared memory is used for efficient intra-block communication.
20	1D/2D Convolution	1D convolution is used primarily in signal processing. 2D convolution is used primarily in image processing
21	Triton	Worked with Triton; used the tutorials from the Triton documentation to run vector add, matmul, and softmax kernels.
22	Fused Softmax in Triton	Triton fused softmax implementation provides a highly efficient way to compute the softmax function on GPUs. By fusing multiple operations into a single kernel, it achieves better performance compared to traditional implementations.
23	LayerNorm and Flash Attention	Basic LayerNorm and FlashAttention implementations in CUDA.
24	Profiling Errors Solving	Solved questions related to profiling. Wrote up strategies, before-and-after examples using command-line debugging tools, and optimization techniques for GPU performance tuning.
25	Blelloch Prefix Scan	Blelloch prefix scan using shared memory for efficiency. Solved more questions related to design and GPU architecture.
26	FFT with Profiling	Fast Fourier Transform (FFT) Using Shared Memory + Profiling
27	Matmul_naive	Hit a learning block, so just rewrote naive matmul on LEETGPU.com.
28	CUTLASS	Tried CUTLASS, added CUDA events, and modified the basic code to match the previous day's naive matmul. Also generated a profile report using ncu.
29	Shared Matmul Competitive	Wrote shared Matmul for competitive coding, optimizing performance with tiling and CUDA streams.
30	Vectorized Tiled Matmul	Vectorized, tiled shared-memory matmul. Improved on my earlier naive matmul from ~450 to 1500 GFLOPS on the Tensara website.
31	Faster Float2 Vectorization	float2 Vectorization for faster memory coalescing
32	FP16 Vector Addition	Optimized FP16 Vector Addition using half2 for better memory efficiency
33	Competitive Float2 Vector Addition	Optimized CUDA kernel for element-wise vector addition using float2 for memory coalescing and efficiency.
34	Cmake SGEMM	Implemented SGEMM on square matrices, tested on RTX 3060 (CMake) and Nvidia T4 (test website). Optimizations include shared memory tiling, float2 vectorized operations, and efficient memory access for better GPU performance.
35	ReLU	Simple ReLU (Rectified Linear Unit) activation function in CUDA
36	Leaky ReLU	Leaky ReLU (Leaky Rectified Linear Unit) activation function in CUDA
37	Alphatensor	Ran Google DeepMind's AlphaTensor matmul locally on my RTX 3060.
38	Basic PTX	Learned how to run PTX code and studied its advantages across various metrics. Also analyzed compiler-generated PTX, but struggled with installation; will complete this tomorrow.
39	Inline PTX	More PTX testing, command lines, and cubin files. Hit CUDA API errors, probably from a bad installation. Used the Compiler Explorer website to explore more PTX and compilation. Locally, also tested inline PTX assembly that loads integers from global memory, adds a constant value, and stores the results back to global memory.
40	More Inline PTX	Wrote several separate inline PTX snippets covering popc, membar, rcp, and shfl. Struggled with errors, both locally and on Compiler Explorer, only for shfl, so switched to the CUDA intrinsic __shfl_sync().
41	MLIR - 1	Worked on integrating MLIR with CUDA and successfully executed matrix addition. Initially faced issues with the gpu.launch method throwing numerous errors that even GPT couldn't resolve. Dropped that approach and integrated directly with the CUDA runtime.
42	MLIR - 2	More MLIR work: long installation time and figuring out deprecated commands, such as the rename from cpu-runner to runner.
43	INT8 Matmul	Wrote INT8 matmul and compared it with FP32. The INT8 version was faster, but I messed up scaling and stream usage, causing high errors. Will try to reduce this MAE.
44	cuSolver - 1	Busy with college assignments, so pushed one program a day. Solved a linear system using cuSOLVER (LU decomposition).
45	cuSolver - 2	QR Factorization
46	cuSolver - 3	Cholesky Decomposition and Eigenvalue & Eigenvector
47	cuSolver - 4	Singular Value Decomposition
48	cuSPARSE - 1	Sparse matmul with cuSPARSE (CSR)
49	cuSPARSE - 2	Compression with grids: cuSOLVER [dense] vs cuSPARSE [sparse]
50	Q-learning - 1	Used Q-learning with sparse matrices (CSR format) to make it efficient.
51	Q-learning - 2	Started learning tabular RL. Having tried Q-learning the previous day, today I improved how the agent explores, using Boltzmann exploration and epsilon-greedy.
52	Multi-Armed Bandits - 1	Moved from Q-learning to Multi-Armed Bandits to learn action selection strategies.
53	Markov Decision Process	Worked with an MDP simulation in which the agent learns from rewards on a basic grid.
54	Q-Learning -3	Q-learning algorithm; the resulting Q-values were unstable. Will improve.
55	Q-Learning -4	Q-values are more controlled (around 50), making the agent more stable. The extreme values (99.999, 88.999) from Day 54 are gone.
56	SARSA	Implemented SARSA today. It is much more stable and less aggressive than Q-learning, because its update uses the action the current policy actually takes next instead of always assuming the greedy action.
57	Expected SARSA	Completed Expected SARSA, with lower Q-values and even more stability than the previous day's SARSA.
58	Double Q-Learning	Double Q-learning implementation. Initially it didn't quite work and produced low Q-values, so I adjusted the learning rate, normalized rewards, and reduced epsilon decay to get better Q-value progression and stability.
59	Dynamic Programming	Today, I worked on Policy Iteration using the Bellman Expectation Equation, which is better suited for smaller action spaces. I also implemented Value Iteration using the Bellman Optimality Equation, which works well when dealing with a larger number of states and actions.
60	Monte Carlo & Temporal Difference Learning	I first implemented Monte Carlo policy evaluation, which estimates values based on complete episodes. Then, I explored Temporal Difference (TD) learning, starting with TD(0), which updates values step by step. Finally, I extended it to TD(λ), introducing eligibility traces to interpolate between TD(0) and Monte Carlo updates.
61	DQN Test/Check	Started working on DQN today, just the basics. Faced some big installation issues with the supported PyTorch version but fixed them later by setting up a virtual environment. Finally, ran a test to check that DQN worked properly with my installed CUDA Toolkit 12.4.
62	DQN - Frozen Lake	Implemented DQN using LibTorch and CUDA for the Frozen Lake environment, focusing on optimizing deep RL with GPU acceleration. Used Python for initial testing and leveraged LLMs for code assistance.
63	Benchmarking - DQN and PPO [Cartpole]	Benchmarked DQN (CartPole) and PPO (Atari EnvPool) from the Learn RL repo by PyTorch Labs. The PPO (Atari EnvPool) implementation used torch.compile, which had CUDA Graphs enabled by default to reduce CPU overhead. Encountered the warning "Not enough SMs to use max_autotune_gemm mode." in the PPO file, but the models still ran successfully, with performance and training metrics logged to Weights & Biases (wandb.ai).
64	DQN - Atari Model	Worked on forward propagation in a one-layer network using CUDA. Also read the DQN paper (https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) and, out of curiosity, built a simplified DQN model with just 100 training frames using Claude.
65	DQN - Cartpole	Implemented a native PyTorch DQN for the CartPole environment, based on this PyTorch tutorial: https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html. Optimized it using CUDA features such as mixed precision training and gradient scaling for better performance. Tested it on a Google Colab T4 GPU.
66	Revision - DEEP RL	Revised all the RL algorithms from scratch and read Chapter 1 of this book: https://www.google.co.in/books/edition/Deep_Reinforcement_Learning_Hands_On/xKdhDwAAQBAJ?hl=en&gbpv=1&printsec=frontcover
67	Simple RL	Wrote a complete, basic RL example from scratch in Google Colab.
68	PPO	Full PPO on a grid environment. Wrote a cheatsheet to better memorize it.
69	RLHF	Read Chapter 2 of Deep RL Hands-On. Also read an article tutorial on RLHF and executed its kernels on Colab. Mostly reading and summarizing. Used this resource: https://arena-chapter2-rl.streamlit.app/[2.4]_RLHF.
70	PPO - Benchmarked	Simplified and upgraded my Day 68 PPO implementation by ensuring all tensors run on the CUDA device. Added separate training functions and buffers for CPU and CUDA to prevent device mismatch errors. Performance benchmarking and reward plots highlight CUDA's smoother, more stable learning curve. While training times were similar in this small task, the CUDA version scales significantly better for this Grid RL environment.
71	PPO - SB3 - Cartpole - Baseline	I worked on implementing a basic PPO agent using Stable-Baselines3 on the CartPole-v1 environment. I set up TensorBoard to visualize training progress and spent time playing around with various hyperparameters to get a better feel for how they affect learning. This was mainly a hands-on session to get comfortable with SB3 and see the training dynamics in action.
72	PPO - SB3 - Cartpole - Parallel	Today, I extended the setup by adding a vectorized PPO implementation using DummyVecEnv to step 4 environments together in a single process. I also wrote a script to plot and compare the training performance of the baseline vs parallel versions. The parallel setup gave a slight FPS boost and showed higher rewards over 20k timesteps. Overall, a good improvement in sample efficiency and speed!
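
The Day 19 entry describes a block-wise sum reduction in which each thread first adds up its own range of elements and shared memory then handles communication inside the block. Below is a minimal sketch of that pattern, not the code from the Day 19 folder. The names `block_sum`, `gpu_sum`, and `kBlockSize`, the choice of 64 blocks, and the final summation of per-block partials on the host are all assumptions made for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed block size; the tree reduction below requires a power of two.
constexpr int kBlockSize = 256;

// Each block reduces its share of `in` and writes one partial sum
// to block_sums[blockIdx.x]. Must be launched with kBlockSize threads.
__global__ void block_sum(const float* in, float* block_sums, int n) {
    __shared__ float tile[kBlockSize];

    // Grid-stride loop: every thread accumulates its own range of elements.
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        acc += in[i];
    }
    tile[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        block_sums[blockIdx.x] = tile[0];
    }
}

// Hypothetical host helper: launches the kernel and adds the per-block
// partial sums on the CPU. Error checking is omitted for brevity.
float gpu_sum(const std::vector<float>& host_in) {
    const int n = static_cast<int>(host_in.size());
    const int blocks = 64;  // assumed grid size for this sketch

    float* d_in = nullptr;
    float* d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, kBlockSize>>>(d_in, d_partial, n);

    // This blocking copy also waits for the kernel to finish.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}

int main() {
    // 2^20 ones: every partial sum is exactly representable in float.
    std::vector<float> data(1 << 20, 1.0f);
    const float total = gpu_sum(data);
    printf("sum = %.1f\n", total);
    return total == 1048576.0f ? 0 : 1;
}
```

Compile with `nvcc` and run on any CUDA-capable GPU. Finishing the reduction on the host keeps the sketch short; a second kernel launch or an atomic add on the partial sums would be the usual alternatives when the result has to stay on the device.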

