A comprehensive performance analysis project demonstrating the critical impact of GPU memory access patterns on deep learning inference performance. This project implements custom CUDA kernels for Conv2D operations in VGG16 and compares three execution paths:
- CPU Baseline - Standard PyTorch/TorchScript implementation
- GPU Optimized - Custom memory-coalesced CUDA kernel (3-5× faster than the uncoalesced kernel)
- GPU Uncoalesced - Intentionally inefficient baseline kernel
Key Finding: Proper memory coalescing yields a 3-5× speedup over the intentionally uncoalesced GPU kernel, indicating that memory access patterns, rather than raw compute power, dominate the performance of these Conv2D kernels.
- ✅ Custom CUDA Conv2D Kernels - Written from scratch, not using PyTorch's GPU operations
- ✅ Memory Coalescing Optimization - 2D tiling with sequential access patterns
- ✅ Constant Memory Usage - Automatic weight caching for small filters (<16KB)
- ✅ Strategic Buffer Reuse - Pre-allocated GPU memory across all 13 Conv2D layers
- ✅ Hybrid Execution Model - Custom kernels for Conv2D, TorchScript for other layers
- ✅ Comprehensive Validation - Layer-wise L2 norm error calculation
(Refer to the report attached for details)
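To make the constant-memory threshold in the feature list concrete, here is a minimal host-side sketch of the size check. It assumes float32 weights stored as [K][C][R][S]. The names `kConstMemLimitBytes`, `conv_weight_bytes`, and `fits_constant_memory` are illustrative and are not taken from the project sources.

```cpp
#include <cstddef>

// Assumed threshold, taken from the "<16KB" figure in the feature list.
const std::size_t kConstMemLimitBytes = 16 * 1024;

// Bytes occupied by a Conv2D weight tensor of shape [K][C][R][S] in float32.
std::size_t conv_weight_bytes(std::size_t out_channels, std::size_t in_channels,
                              std::size_t kernel_h, std::size_t kernel_w) {
    return out_channels * in_channels * kernel_h * kernel_w * sizeof(float);
}

// True when the weights are small enough for the constant-memory path.
bool fits_constant_memory(std::size_t out_channels, std::size_t in_channels,
                          std::size_t kernel_h, std::size_t kernel_w) {
    return conv_weight_bytes(out_channels, in_channels, kernel_h, kernel_w)
           < kConstMemLimitBytes;
}
```

Under these assumptions, the first VGG16 layer (3 input channels, 64 filters, 3×3) needs 6,912 bytes and qualifies, while the second layer (64 input channels, 64 filters, 3×3) needs 147,456 bytes and would use global memory.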
- End-to-end image classification with detailed performance metrics
- Layer-wise timing for all 13 Conv2D operations
- Accuracy validation through L2 norm error analysis
- Production-ready patterns (async streams, buffer reuse, constant memory)
- Educational value for understanding GPU memory hierarchy
    ./HPA_Project
    ├── build
    ├── CMakeLists.txt
    ├── images
    │   ├── cat_1.png
    │   ├── cat.png
    │   ├── dog_1.png
    │   ├── dog.png
    │   ├── elephant.png
    │   ├── labels.txt
    │   ├── pegion.png
    │   ├── rose.png
    │   └── tiger.png
    ├── include
    │   ├── gpu_common.cuh
    │   ├── gpu.h
    │   └── utils.h
    ├── models
    │   ├── dummy_scripted.pt
    │   └── vgg16_scripted.pt
    ├── Readme.md
    ├── scripts
    │   ├── build.sh
    │   └── export_vgg16.py
    └── src
        ├── gpu.cpp
        ├── gpu.cu
        ├── main.cpp
        └── utils.cpp
| Requirement | Version | Installation |
|---|---|---|
| CUDA Toolkit | ≥ 11.0 | Download |
| LibTorch | ≥ 1.13 | Download |
| OpenCV | ≥ 4.5 | sudo apt install libopencv-dev |
| CMake | ≥ 3.18 | sudo apt install cmake |
| GCC/G++ | ≥ 9.0 | sudo apt install g++ |
| Python | 3.8+ | For model export script |
GPU Requirements: NVIDIA GPU with Compute Capability ≥ 7.5 (Turing or newer)
1. Clone the repository

        git clone https://github.com/Jkdxbns/VGG16-Memory-Optimization.git
        cd VGG16-Memory-Optimization

2. Install Python dependencies (for model export)

        pip install -r requirements.txt

3. Download and extract LibTorch

        # Download from https://pytorch.org/get-started/locally/
        # Select: C++/CUDA 11.x/LibTorch
        wget https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.1.0%2Bcu118.zip
        unzip libtorch-*.zip -d ~/

4. Update CMakeLists.txt

   Edit CMakeLists.txt and set the correct LibTorch path:

        set(Torch_DIR "/path/to/your/libtorch/share/cmake/Torch")

   Update CUDA architecture for your GPU:

        # For RTX 40xx (Ada):
        set(TORCH_CUDA_ARCH_LIST "8.9")
        # For RTX 30xx (Ampere):
        set(TORCH_CUDA_ARCH_LIST "8.6")
        # For RTX 20xx (Turing):
        set(TORCH_CUDA_ARCH_LIST "7.5")

5. Export the VGG16 model

        cd scripts
        python export_vgg16.py
        # This creates models/vgg16_scripted.pt

6. Build the project

        ./build.sh
        # Or manually:
        # cd ../build
        # cmake ..
        # make -j$(nproc)

    cd build
    ./conv_test ../images/cat.png

Expected Output:
(Refer to report attached)

Try other images:

    ./conv_test ../images/dog.png
    ./conv_test ../images/elephant.png

Unlike typical deep learning projects that use PyTorch's built-in GPU operations (.cuda()), this project:
1. Implements Conv2D from scratch in CUDA C++
   - No high-level framework abstractions
   - Direct control over memory access patterns
   - Manual thread block and grid configuration
2. Demonstrates Memory Coalescing

        // GOOD (coalesced): consecutive threads read consecutive memory
        Thread 0: reads address[0], Thread 1: reads address[1], ...
        → Combined into ONE 128-byte memory transaction (32 threads × 4-byte floats)

        // BAD (uncoalesced): threads read scattered memory
        Thread 0: reads address[0], Thread 1: reads address[1000], ...
        → Requires up to 32 separate memory transactions (up to 32× more memory traffic)

3. Strategic Optimizations
   - Constant Memory: Small weights (<16KB) cached for broadcast reads
   - Buffer Reuse: Pre-allocated GPU memory across all layers
   - 2D Tiling: Thread blocks mapped to spatial dimensions
   - Loop Unrolling: Manual unrolling for 3×3 filters
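The coalesced and uncoalesced cases above can be approximated on the host with a simplified model. The sketch below counts how many distinct 128-byte segments a 32-thread warp touches when thread t reads the float at index t × stride. It assumes a 128-byte-aligned base address and ignores caches and 32-byte sectors, so it illustrates the idea and does not predict measured bandwidth. The function name `segments_touched` is hypothetical.

```cpp
#include <cstddef>
#include <set>

// Number of distinct 128-byte segments touched by one warp when thread t
// reads the 4-byte float at element index t * stride_elems.
std::size_t segments_touched(std::size_t stride_elems) {
    const std::size_t kWarpSize = 32;       // threads per warp
    const std::size_t kSegmentBytes = 128;  // assumed transaction granularity
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::size_t byte_addr = t * stride_elems * sizeof(float);
        segments.insert(byte_addr / kSegmentBytes);
    }
    return segments.size();
}
```

In this model a stride of 1 touches a single segment, a stride of 2 touches two, and a stride of 1000 touches 32, one per thread.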
    ┌─────────────────────────────────────────────────────────┐
    │                        main.cpp                         │
    │  ┌───────────────────────────────────────────────────┐  │
    │  │  For each Conv2D layer:                           │  │
    │  │  1. Extract weights from TorchScript model        │  │
    │  │  2. Launch three parallel executions:             │  │
    │  │                                                   │  │
    │  │     CPU Path → TorchScript forward                │  │
    │  │                                                   │  │
    │  │     GPU Opt  → run_conv_gpu()                     │  │
    │  │                ↓ gpu.cpp                          │  │
    │  │                ↓ launch_conv2d_naive()            │  │
    │  │                ↓ gpu.cu                           │  │
    │  │                ↓ conv2d_naive_kernel<<<>>>        │  │
    │  │                                                   │  │
    │  │     GPU NC   → run_convnc_gpu()                   │  │
    │  │                ↓ gpu.cpp                          │  │
    │  │                ↓ launch_conv2d_uncoalesced()      │  │
    │  │                ↓ gpu.cu                           │  │
    │  │                ↓ conv2d_uncoalesced_kernel        │  │
    │  │                                                   │  │
    │  │  3. Measure execution time for each               │  │
    │  │  4. Calculate L2 error for validation             │  │
    │  └───────────────────────────────────────────────────┘  │
    └─────────────────────────────────────────────────────────┘
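Step 4 of the flow above compares each GPU output against the CPU result. A minimal sketch of such a check is shown below, assuming both outputs have been copied into flat float vectors of equal length. It computes a relative L2 error; the project may instead report an absolute norm, and `relative_l2_error` is an illustrative name, not the function used in utils.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Relative L2 error: ||candidate - reference||_2 / ||reference||_2.
// Falls back to the absolute error when the reference is all zeros.
double relative_l2_error(const std::vector<float>& reference,
                         const std::vector<float>& candidate) {
    assert(reference.size() == candidate.size());
    double diff_sq = 0.0;  // accumulated squared difference
    double ref_sq = 0.0;   // accumulated squared reference magnitude
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double ref = static_cast<double>(reference[i]);
        const double diff = static_cast<double>(candidate[i]) - ref;
        diff_sq += diff * diff;
        ref_sq += ref * ref;
    }
    if (ref_sq == 0.0) {
        return std::sqrt(diff_sq);
    }
    return std::sqrt(diff_sq / ref_sq);
}
```

Accumulating in double keeps the check itself from adding float rounding noise when a layer has millions of output elements.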
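Memory-Coalesced Kernel (Optimized; named conv2d_naive_kernel in the source):

    __global__ void conv2d_naive_kernel(...) {
        // 2D thread mapping: one thread per output pixel
        int w_out = blockIdx.x * blockDim.x + threadIdx.x;
        int h_out = blockIdx.y * blockDim.y + threadIdx.y;
        if (w_out >= W || h_out >= H) return;  // threads past the image edge do nothing

        float acc = 0.0f;
        // Sequential memory access (coalesced!)
        for (int c = 0; c < C; ++c) {
            for (int r = 0; r < 3; ++r) {
                for (int s = 0; s < 3; ++s) {
                    int h_in = h_out + r - 1;
                    int w_in = w_out + s - 1;
                    // Zero padding: skip taps that fall outside the input
                    if (h_in < 0 || h_in >= H || w_in < 0 || w_in >= W) continue;
                    // Access pattern: [...][h][w] ← consecutive threads, consecutive w
                    acc += input[...][h_in][w_in] * weight[...];
                }
            }
        }
        // (acc is then written to this thread's output element)
    }

Grid Configuration:
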
    dim3 blockDim(16, 16);  // 256 threads per block
    dim3 gridDim(
        (W + 15) / 16,  // Width tiles
        (H + 15) / 16,  // Height tiles
        N * K           // Batch × Output channels
    );

(Refer to the report attached for results)
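The grid arithmetic above is a ceiling division of the output size by the 16×16 tile. The host-side sketch below reproduces it so the launch shape for a given layer can be checked without a GPU. `GridDims`, `ceil_div`, and `conv_grid` are illustrative names and do not appear in gpu.cpp.

```cpp
// Plain stand-in for CUDA's dim3 so the sketch compiles without the toolkit.
struct GridDims {
    unsigned x;
    unsigned y;
    unsigned z;
};

// Smallest number of tiles of size `divisor` that covers `value` elements.
unsigned ceil_div(unsigned value, unsigned divisor) {
    return (value + divisor - 1) / divisor;
}

// Grid shape for one Conv2D launch: width tiles, height tiles, batch × filters.
GridDims conv_grid(unsigned width, unsigned height, unsigned batch,
                   unsigned out_channels, unsigned tile = 16) {
    GridDims grid;
    grid.x = ceil_div(width, tile);
    grid.y = ceil_div(height, tile);
    grid.z = batch * out_channels;
    return grid;
}
```

For a 224×224 output with N = 1 and K = 64 this gives a 14×14×64 grid. For a 14×14 output, each channel needs only one block, and 60 of its 256 threads fall outside the image, which is why the kernel needs a bounds check. Note that CUDA limits the grid's z dimension to 65,535, so very large N × K products would need a different mapping.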
1. GPU Memory Hierarchy
   - Global memory vs constant memory vs shared memory
   - Memory transaction coalescing requirements
   - Bandwidth vs latency trade-offs
2. CUDA Programming
   - Kernel launch configuration (grid/block dimensions)
   - Thread indexing and data partitioning
   - Synchronization and stream management
   - Performance profiling and optimization
3. Deep Learning Systems
   - How Conv2D actually executes on GPU hardware
   - Why PyTorch/TensorFlow are fast (they use these tricks!)
   - Production optimization strategies
4. Performance Engineering
   - Identifying bottlenecks (memory vs compute)
   - Measuring and validating optimizations
   - Trade-offs in system design
1. Profile with Nsight Compute:

        ncu --set full -o profile ./conv_test ../images/cat.png

2. Try different block sizes:

        // In gpu.cu, modify:
        #define BLOCK_DIM_X (32)  // Try 8, 16, 32
        #define BLOCK_DIM_Y (32)

3. Add shared memory tiling:
   - Load input tiles into shared memory
   - Reuse across multiple output pixels
   - Expected: 10-20% additional speedup
4. Compare with cuDNN:
   - Replace custom kernel with cudnnConvolutionForward
   - Benchmark against production-grade implementation
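For the shared memory tiling step, it helps to estimate the tile footprint before writing the kernel. The sketch below assumes one float32 input tile per channel, with a halo of (kernel − 1) elements so that every output pixel in the tile can read its full neighbourhood. The function names are hypothetical, and the figures describe memory traffic only, not the resulting speedup.

```cpp
#include <cstddef>

// Shared-memory bytes for one input-channel tile, including the halo.
std::size_t tile_bytes(std::size_t tile_w, std::size_t tile_h, std::size_t kernel) {
    return (tile_w + kernel - 1) * (tile_h + kernel - 1) * sizeof(float);
}

// Average input elements loaded per output pixel (per channel) when the
// block loads the tile once and all threads reuse it.
double tiled_loads_per_output(std::size_t tile_w, std::size_t tile_h,
                              std::size_t kernel) {
    const double loaded = static_cast<double>((tile_w + kernel - 1) *
                                              (tile_h + kernel - 1));
    return loaded / static_cast<double>(tile_w * tile_h);
}
```

With a 16×16 tile and a 3×3 filter, the tile is 18×18 floats (1,296 bytes) and each output pixel accounts for about 1.27 loads per channel, compared with 9 loads issued per pixel by the untiled loop. Hardware caches already absorb part of that difference, so the measured gain is typically smaller than the ratio suggests.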
1. No such file or directory: libtorch

        # Solution: Update Torch_DIR in CMakeLists.txt
        set(Torch_DIR "/correct/path/to/libtorch/share/cmake/Torch")

2. undefined reference to cv::imread

        # Solution: Install OpenCV development headers
        sudo apt install libopencv-dev

3. nvcc fatal: Unsupported gpu architecture 'compute_89'

        # Solution: Change TORCH_CUDA_ARCH_LIST to match your GPU
        # (compute_89 itself requires CUDA 11.8 or newer)
        # Check your GPU architecture:
        nvidia-smi --query-gpu=compute_cap --format=csv

4. CUDA out of memory

        # Solution: Use dummy model for testing
        # In scripts/export_vgg16.py, uncomment:
        # model = Dummy().eval()

5. Slow performance / no speedup

        # Check GPU is being used:
        nvidia-smi  # Should show conv_test process
        # Profile to identify bottlenecks
        # (nvprof does not support compute capability 8.0 and higher;
        #  on Ampere or newer GPUs use Nsight Compute instead):
        nvprof ./conv_test ../images/cat.png
        ncu ./conv_test ../images/cat.png

This project is intended for educational purposes as part of CMPE 755 - High Performance Architecture at Rochester Institute of Technology (RIT).
- NVIDIA for CUDA toolkit and documentation
- PyTorch team for LibTorch C++ API
- VGG16 authors for the model architecture
- RIT faculty for course guidance
⭐ If you found this project helpful, please consider starring the repository!