In Lab 2, you implemented GEMM kernels using shared memory tiling and loop unrolling with FP32 arithmetic. In this lab, you will go a step further by leveraging Tensor Cores via NVIDIA's WMMA (Warp Matrix Multiply-Accumulate) API and asynchronous memory copies via cuda::pipeline to build high-performance GEMM kernels that operate on FP16 inputs with FP32 accumulation.
Modern GPU architectures (Volta and newer) include dedicated Tensor Core hardware that performs a 16x16x16 matrix multiply-accumulate as a single warp-wide operation. Combined with asynchronous memory copy (`cuda::memcpy_async`), which allows data movement from global to shared memory without occupying CUDA cores, these features are the building blocks of production-grade GEMM implementations such as those in cuBLAS and CUTLASS.
In traditional CUDA kernels, loading data from global memory into shared memory is a two-step process: threads first load data into registers (LDG), then store from registers into shared memory (STS). This occupies CUDA cores for the entire transfer and stalls warps while waiting on memory latency.
Starting with Ampere (sm_80), NVIDIA introduced the cp.async instruction (also called LDGSTS — Load Global, Store Shared). This is a single hardware instruction that copies data directly from global memory to shared memory, bypassing registers entirely. The key benefits are:
- Non-blocking: The copy is issued asynchronously — CUDA cores are free to execute other instructions (e.g., Tensor Core math) while the data transfer happens in the background.
- Bypasses L1 cache: Data goes directly to shared memory, avoiding L1 pollution and reducing cache pressure.
- Enables pipelining: Combined with `cuda::pipeline`, you can overlap the loading of the next tile with computation on the current tile (double buffering), hiding memory latency behind useful work.
On Volta (sm_70), cuda::memcpy_async compiles to the traditional two-step LDG+STS path as a software fallback — it is functionally correct but does not provide the performance benefits of hardware cp.async.
Compute C = A * B where:
- A is an M x K matrix (row-major, FP16)
- B is a K x N matrix (row-major, FP16)
- C is an M x N matrix (row-major, FP32)
- M = K = N = 1024
All inputs are stored in half precision (__half). Accumulation is performed in single precision (float).
The starter code in lab3.cu contains three GEMM kernels:
| Kernel | Description | Status |
|---|---|---|
| `gemm_tiled_smem` | Shared-memory tiled GEMM (FP16 input, FP32 accumulate). Each thread computes one output element using tile-by-tile loading into shared memory. | Given |
| `gemm_wmma_smem` | Tensor Core GEMM using WMMA API with shared memory staging. Regular loads from global to shared memory, then WMMA fragments load from shared memory. | You implement |
| `gemm_wmma_async` | Tensor Core GEMM using WMMA API with `cuda::memcpy_async` and `cuda::pipeline` for asynchronous global-to-shared memory transfers. | You implement |
The given gemm_tiled_smem kernel serves as the correctness reference. Your implementations will be verified against it using a relative error tolerance of 1e-2.
Implement the gemm_wmma_smem kernel. This kernel should:
- Use regular loads (`shared[i] = global[i]`) to copy tiles of A and B from global memory into shared memory
- Use `__syncthreads()` to ensure all threads have finished loading before proceeding
- Use WMMA fragments (`wmma::fragment`) to load data from shared memory into register-level fragments
- Use `wmma::mma_sync` to perform 16x16x16 matrix multiply-accumulate on Tensor Cores
- Store results back to global memory using `wmma::store_matrix_sync`
Launch configuration: 128 threads per block (4 warps), each warp handles one 16x16 output tile. The block covers 16 rows x 64 columns.
Shared memory layout: Single buffer with space for one A tile (WMMA_M x WMMA_K) and one B tile (WMMA_K x (WMMA_N * 4)) that covers all 4 warps' columns.
Implement the gemm_wmma_async kernel. This kernel builds on Part A by replacing regular loads with asynchronous memory copies and using double buffering to overlap data loading with Tensor Core computation:
- Allocate two buffers in shared memory for both A and B tiles (`As[0]`, `As[1]`, `Bs[0]`, `Bs[1]`)
- Prefetch the first tile into buffer 0 before entering the main loop using `cuda::memcpy_async`
- Use `cuda::pipeline` to manage the async copy operations (`producer_acquire`, `producer_commit`, `consumer_wait`, `consumer_release`)
- In the main loop: while computing on the current buffer with WMMA, asynchronously load the next tile into the alternate buffer
- Use `wmma::mma_sync` to perform Tensor Core computation
- Store results back to global memory using `wmma::store_matrix_sync`
Launch configuration: 128 threads per block (4 warps). Each warp handles a 32x16 output tile (2 WMMA ops per k-step). The block covers 32 rows x 64 columns.
Shared memory layout: Double-buffered: 2 * (ASYNC_TILE_M * WMMA_K + WMMA_K * WMMA_N * 4) * sizeof(half) bytes, where ASYNC_TILE_M = WMMA_M * 2 = 32.
```
ECE60827-CUDA3/
  Makefile    # Build configuration (do not modify)
  main.cu     # Test harness & reference kernel (do not modify)
  lab3.cu     # YOUR CODE GOES HERE — implement both kernels
  README.md   # This file
  report.md   # Your report (submit this)
```
Read these carefully before you begin:
- WMMA API: The NVIDIA blog post Programming Tensor Cores in CUDA 9 provides an excellent walkthrough of the `nvcuda::wmma` API with a complete GEMM example. Read this post thoroughly — the GEMM kernel shown in the blog is very similar to `gemm_wmma_smem`, except that the blog example loads directly from global memory without using shared memory. You can use the blog's example as a starting point and extend it with shared memory staging.
- Async Memory Copy: The NVIDIA documentation on Asynchronous Data Copies using cuda::pipeline explains how `cuda::memcpy_async` and `cuda::pipeline` work to bypass the L1 cache and copy data directly from global memory to shared memory without occupying CUDA cores. Read the pipeline usage examples to understand `producer_acquire`, `producer_commit`, `consumer_wait`, and `consumer_release`.
Important: You must implement your kernels using the nvcuda::wmma API. You may not use cuBLAS or any other library to perform the matrix multiplication.
```cuda
#include <mma.h>
using namespace nvcuda;

// Declare fragments
wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
wmma::fragment<wmma::accumulator, M, N, K, float> c_frag;

// Initialize accumulator to zero
wmma::fill_fragment(c_frag, 0.0f);

// Load from shared memory into fragments
wmma::load_matrix_sync(a_frag, shared_ptr, leading_dim);
wmma::load_matrix_sync(b_frag, shared_ptr, leading_dim);

// Tensor Core multiply-accumulate: c_frag += a_frag * b_frag
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

// Store result to global memory
wmma::store_matrix_sync(global_ptr, c_frag, leading_dim, wmma::mem_row_major);
```

```cuda
#include <cuda/pipeline>

cuda::pipeline<cuda::thread_scope_thread> pipe = cuda::make_pipeline();

pipe.producer_acquire();
cuda::memcpy_async(dst_ptr, src_ptr, sizeof(half), pipe);
pipe.producer_commit();

pipe.consumer_wait();
// ... use the data ...
pipe.consumer_release();
```

Build and run via Slurm:

```shell
module load gcc/11.4.1 cuda
make        # builds locally
make test   # runs both parts via Slurm
```

If Slurm is unavailable, run locally (e.g. on a GPU node via ssh):

```shell
module load gcc/11.4.1 cuda
make
make test-local    # runs both parts locally
make test-a-local  # runs Part A only
make test-b-local  # runs Part B only
```

Note: On Volta (sm_70) GPUs, `cuda::memcpy_async` lacks hardware LDGSTS support and falls back to software emulation, so Part B may run slower than Part A. This is expected — only correctness is graded, not performance.
When both kernels are correctly implemented, you should see output similar to:
```
Shared-mem GEMM (1024x1024)*(1024x1024): X.XXX ms
WMMA+smem  GEMM (1024x1024)*(1024x1024): X.XXX ms
WMMA+async GEMM (1024x1024)*(1024x1024): X.XXX ms

Verification (WMMA+smem vs shared-mem golden):
  Errors     : 0 / 1048576
  Max rel err: 0.XXXXXX

Verification (WMMA+async vs shared-mem golden):
  Errors     : 0 / 1048576
  Max rel err: 0.XXXXXX
```
Both kernels must produce 0 errors against the shared-memory reference.
| Component | Points |
|---|---|
| Part A: `gemm_wmma_smem` | 35 |
| Part B: `gemm_wmma_async` | 35 |
| Report (`report.md`) — see questions inside | 30 |
| Total | 100 |
Submit the following files:
- `lab3.cu` — Your completed CUDA source
- `report.md` — Your written report (see `report.md` for the template)
- Programming Tensor Cores in CUDA 9 — Start here. Read the "Programmatic Access to Tensor Cores" section carefully. The WMMA GEMM example is your starting point for `gemm_wmma_smem`.
- NVIDIA WMMA Documentation
- Asynchronous Data Copies using cuda::pipeline — Read this for understanding `cuda::memcpy_async` and `cuda::pipeline`.
- NVIDIA CUTLASS — Production GEMM templates
All code must be your own work. The use of AI tools (e.g., ChatGPT, Copilot, Claude) to generate code is prohibited. You may use AI tools to help understand concepts, but all submitted code must be written by you.