This repository contains basic performance tests that compare optimized CUDA kernel implementations against naive ones. All tests were performed with CUDA Toolkit 12.6, and the reported timings were obtained on the Perlmutter HPC system.
Working device and host compilers (nvcc and g++) are all that is needed to run the examples.
The repository can be cloned with the following command:
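As a quick sanity check before building, the sketch below reports whether the two compilers are on the PATH. The helper name `check_tool` is hypothetical and not part of the repository; it only tests that a command can be found, not that it is a compatible version.

```shell
# Hypothetical helper (not part of the repository): report whether a
# command is available on PATH. Uses the POSIX `command -v` builtin.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# The two compilers the examples rely on.
check_tool nvcc
check_tool g++
```

If either line reports `missing`, the toolchain likely needs to be installed or loaded into the environment before the examples will compile.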
git clone https://github.com/AMLattanzi/cuda_perf_tests.git
The code tree is shown below. Each subdirectory of src contains one test, whose timings are documented in README, along with a make.sh shell script for compilation.
cuda_perf_tests/
└── src
    ├── kernel_concur
    ├── lambda_kernel
    ├── matrix_add
    ├── matrix_mult
    └── mpi_host_device
To compile and run a given example, execute the following commands from the repository root:
cd src/<case_name>
source make.sh
./<case_name>.exe
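To build and run every case in one pass, the steps above can be wrapped in a loop. This is a minimal sketch, not something the repository provides: the function name `run_all_cases` is hypothetical, and it assumes every case directory contains a make.sh that produces `<case_name>.exe`, as described above. The mpi_host_device case may need an MPI launcher rather than direct execution; that is a guess based on its name and is not handled here.

```shell
# Hypothetical helper (not part of the repository): build and run every
# case found under the given src directory.
# Assumes each case directory holds make.sh and yields <case_name>.exe.
run_all_cases() {
    src_dir="$1"
    for case_dir in "$src_dir"/*/; do
        # Skip the unexpanded glob pattern when no subdirectories exist.
        [ -d "$case_dir" ] || continue
        case_name=$(basename "$case_dir")
        # A subshell keeps the cd, and anything make.sh sets, from
        # leaking into the calling shell.
        (
            cd "$case_dir" || exit 1
            . ./make.sh || exit 1
            "./$case_name.exe"
        ) || echo "FAILED: $case_name"
    done
}
```

From the repository root this would be called as `run_all_cases src`. Sourcing make.sh inside a subshell mirrors the `source make.sh` step while keeping each case isolated from the next.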