VGG16 Conv2D CUDA Performance Benchmark

A comprehensive performance analysis project demonstrating the critical impact of GPU memory access patterns on deep learning inference performance. This project implements custom CUDA kernels for Conv2D operations in VGG16 and compares three execution paths:

CPU Baseline - Standard PyTorch/TorchScript implementation
GPU Optimized - Custom memory-coalesced CUDA kernel (3-5× faster than the uncoalesced kernel)
GPU Uncoalesced - Intentionally inefficient baseline kernel

Key Finding: Memory coalescing alone yields a 3-5× speedup over the intentionally uncoalesced GPU kernel, which shows that for these Conv2D kernels memory access patterns can matter more than raw compute power.

Project Highlights

Unique Implementations

✅ Custom CUDA Conv2D Kernels - Written from scratch, not using PyTorch's GPU operations
✅ Memory Coalescing Optimization - 2D tiling with sequential access patterns
✅ Constant Memory Usage - Automatic weight caching for small filters (<16KB)
✅ Strategic Buffer Reuse - Pre-allocated GPU memory across all 13 Conv2D layers
✅ Hybrid Execution Model - Custom kernels for Conv2D, TorchScript for other layers
✅ Comprehensive Validation - Layer-wise L2 norm error calculation
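
To make the 16KB constant-memory threshold concrete, the sketch below computes the size of a float32 filter bank and compares it with that limit. The helper names and the strict `<` comparison are assumptions for illustration; the check in the project's own source may differ.

```cpp
#include <cassert>
#include <cstddef>

// Threshold quoted in this README for caching weights in constant memory.
const std::size_t kConstThresholdBytes = 16 * 1024;

// Hypothetical helper: bytes needed for K output channels, C input channels,
// and an R x S float32 filter (3x3 by default, as in VGG16).
std::size_t weight_bytes(std::size_t K, std::size_t C,
                         std::size_t R = 3, std::size_t S = 3) {
    assert(K > 0 && C > 0 && R > 0 && S > 0);
    return K * C * R * S * sizeof(float);
}

// Hypothetical helper: true when the filter bank is below the threshold.
bool fits_constant_memory(std::size_t K, std::size_t C) {
    return weight_bytes(K, C) < kConstThresholdBytes;
}
```

Under these assumptions only VGG16's first Conv2D layer (64 filters over 3 input channels, 6,912 bytes) falls below the threshold; the second layer (64 filters over 64 channels, 147,456 bytes) already exceeds it.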

Performance Results (Typical)

(Refer to the attached report, PROJECT_REPORT.txt, for details)

Features

End-to-end image classification with detailed performance metrics
Layer-wise timing for all 13 Conv2D operations
Accuracy validation through L2 norm error analysis
Production-ready patterns (async streams, buffer reuse, constant memory)
Educational value for understanding GPU memory hierarchy
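
Layer-wise timing can be sketched with a small host-side helper like the one below. It is not taken from the project: `median_time_ms` is a hypothetical name, and the median-of-several-runs policy is an assumption chosen to reduce noise.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical helper: runs `fn` `repeats` times and returns the median
// wall-clock time in milliseconds (upper middle sample for even counts).
template <typename Fn>
double median_time_ms(Fn&& fn, int repeats = 5) {
    assert(repeats > 0);
    std::vector<double> samples;
    samples.reserve(repeats);
    for (int i = 0; i < repeats; ++i) {
        auto start = std::chrono::steady_clock::now();
        fn();
        auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

Because CUDA kernel launches return before the kernel finishes, a GPU path timed this way has to synchronize (for example with `cudaDeviceSynchronize()`) inside `fn`; otherwise only the launch overhead is measured.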

Folder Structure

./HPA_Project
├── build
├── CMakeLists.txt
├── images
│   ├── cat_1.png
│   ├── cat.png
│   ├── dog_1.png
│   ├── dog.png
│   ├── elephant.png
│   ├── labels.txt
│   ├── pegion.png
│   ├── rose.png
│   └── tiger.png
├── include
│   ├── gpu_common.cuh
│   ├── gpu.h
│   └── utils.h
├── models
│   ├── dummy_scripted.pt
│   └── vgg16_scripted.pt
├── Readme.md
├── scripts
│   ├── build.sh
│   └── export_vgg16.py
└── src
    ├── gpu.cpp
    ├── gpu.cu
    ├── main.cpp
    └── utils.cpp

Quick Start

Prerequisites

| Requirement | Version | Installation |
|---|---|---|
| CUDA Toolkit | ≥ 11.0 | Download |
| LibTorch | ≥ 1.13 | Download |
| OpenCV | ≥ 4.5 | `sudo apt install libopencv-dev` |
| CMake | ≥ 3.18 | `sudo apt install cmake` |
| GCC/G++ | ≥ 9.0 | `sudo apt install g++` |
| Python | 3.8+ | For model export script |

GPU Requirements: NVIDIA GPU with Compute Capability ≥ 7.5 (Turing or newer)

Installation

Clone the repository

git clone https://github.com/Jkdxbns/VGG16-Memory-Optimization.git
cd VGG16-Memory-Optimization

Install Python dependencies (for model export)
```
pip install -r requirements.txt
```

Download and extract LibTorch

# Download from https://pytorch.org/get-started/locally/
# Select: C++/CUDA 11.x/LibTorch
wget https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.1.0%2Bcu118.zip
unzip libtorch-*.zip -d ~/

Update CMakeLists.txt

Edit CMakeLists.txt and set the correct LibTorch path:

set(Torch_DIR "/path/to/your/libtorch/share/cmake/Torch")

Update CUDA architecture for your GPU:

# For RTX 40xx (Ada):  set(TORCH_CUDA_ARCH_LIST "8.9")
# For RTX 30xx (Ampere): set(TORCH_CUDA_ARCH_LIST "8.6")
# For RTX 20xx (Turing): set(TORCH_CUDA_ARCH_LIST "7.5")

Export the VGG16 model

cd scripts
python export_vgg16.py
# This creates models/vgg16_scripted.pt

Build the project

./build.sh
# Or manually:
# cd ../build
# cmake ..
# make -j$(nproc)

Running Inference

# From the repository root:
cd build
./conv_test ../images/cat.png

Expected Output: (Refer to report attached)

Try other images:

./conv_test ../images/dog.png
./conv_test ../images/elephant.png

Technical Deep Dive

What Makes This Project Unique?

Unlike typical deep learning projects that use PyTorch's built-in GPU operations (.cuda()), this project:

Implements Conv2D from scratch in CUDA C++
- No high-level framework abstractions
- Direct control over memory access patterns
- Manual thread block and grid configuration

Demonstrates Memory Coalescing

// GOOD (coalesced): consecutive threads in a warp read consecutive addresses
Thread 0: reads address[0], Thread 1: reads address[1], ...
→ The warp's 32 four-byte reads combine into ONE 128-byte memory transaction

// BAD (uncoalesced): threads in a warp read scattered addresses
Thread 0: reads address[0], Thread 1: reads address[1000], ...
→ Requires up to 32 separate memory transactions per warp (up to 32× the memory traffic)
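
The warp-level picture above can be checked with a small host-side model. The sketch below counts how many distinct 128-byte segments one 32-thread warp touches when each thread reads a single float at a given stride. It is a deliberate simplification: `segments_per_warp` is a hypothetical helper, and real hardware behavior also depends on the GPU architecture and its caches.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical helper: number of distinct 128-byte segments touched when
// lane i of a 32-thread warp reads the float at base_index + i * stride_elems.
std::size_t segments_per_warp(std::size_t base_index, std::size_t stride_elems) {
    assert(stride_elems > 0);
    const std::size_t kWarpSize = 32;
    const std::size_t kSegmentBytes = 128;
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        std::size_t byte_addr = (base_index + lane * stride_elems) * sizeof(float);
        segments.insert(byte_addr / kSegmentBytes);
    }
    return segments.size();
}
```

In this model a unit stride from an aligned base touches one segment, a stride of 1000 elements touches 32, and a unit stride from a misaligned base (for example index 16) touches two, which is why alignment matters as well as stride.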

Strategic Optimizations
- Constant Memory: Small weights (<16KB) cached for broadcast reads
- Buffer Reuse: Pre-allocated GPU memory across all layers
- 2D Tiling: Thread blocks mapped to spatial dimensions
- Loop Unrolling: Manual unrolling for 3×3 filters
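
For buffer reuse, the pre-allocated buffer has to be large enough for the biggest activation any layer produces. The sketch below works that size out for the 13 Conv2D layers of the standard VGG16 configuration, assuming a single 224×224 input image. The struct and function names are hypothetical, and how the project actually sizes its buffers is not shown in this README.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

struct ConvShape {
    std::size_t out_channels;
    std::size_t height;
    std::size_t width;
};

// Output shapes of the 13 Conv2D layers in standard VGG16,
// assuming a 224x224 input (pooling halves H and W between stages).
const std::array<ConvShape, 13> kVgg16Convs = {{
    {64, 224, 224}, {64, 224, 224},
    {128, 112, 112}, {128, 112, 112},
    {256, 56, 56}, {256, 56, 56}, {256, 56, 56},
    {512, 28, 28}, {512, 28, 28}, {512, 28, 28},
    {512, 14, 14}, {512, 14, 14}, {512, 14, 14},
}};

// Hypothetical helper: largest per-image activation, in float elements.
std::size_t max_activation_elems() {
    std::size_t best = 0;
    for (const ConvShape& s : kVgg16Convs) {
        best = std::max(best, s.out_channels * s.height * s.width);
    }
    assert(best > 0);
    return best;
}
```

The result is 3,211,264 floats (12.25 MiB), set by the two 64×224×224 layers, so a buffer of that size allocated once can hold the output of every later layer.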

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                     main.cpp                            │
│  ┌─────────────────────────────────────────────────┐    │
│  │ For each Conv2D layer:                          │    │
│  │  1. Extract weights from TorchScript model      │    │
│  │  2. Launch three parallel executions:           │    │
│  │                                                 │    │
│  │     CPU Path → TorchScript forward              │    │
│  │                                                 │    │
│  │     GPU Opt  → run_conv_gpu()                   │    │
│  │                  ↓ gpu.cpp                      │    │
│  │                  ↓ launch_conv2d_naive()        │    │
│  │                  ↓ gpu.cu                       │    │
│  │                  ↓ conv2d_naive_kernel<<<>>>    │    │
│  │                                                 │    │
│  │     GPU NC   → run_convnc_gpu()                 │    │
│  │                  ↓ gpu.cpp                      │    │
│  │                  ↓ launch_conv2d_uncoalesced()  │    │
│  │                  ↓ gpu.cu                       │    │
│  │                  ↓ conv2d_uncoalesced_kernel    │    │
│  │                                                 │    │
│  │  3. Measure execution time for each             │    │
│  │  4. Calculate L2 error for validation           │    │
│  └─────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────┘
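
Step 4 of the diagram can be sketched as follows. The README does not say whether the project reports an absolute or a relative L2 error, so this illustration provides both; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: L2 norm of the element-wise difference ref - test.
double l2_error(const std::vector<float>& ref, const std::vector<float>& test) {
    assert(ref.size() == test.size());
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        double d = static_cast<double>(ref[i]) - static_cast<double>(test[i]);
        sum_sq += d * d;
    }
    return std::sqrt(sum_sq);
}

// Hypothetical helper: error scaled by the norm of the reference output.
// Falls back to the absolute error when the reference is all zeros.
double relative_l2_error(const std::vector<float>& ref,
                         const std::vector<float>& test) {
    const std::vector<float> zeros(ref.size(), 0.0f);
    double ref_norm = l2_error(ref, zeros);
    double err = l2_error(ref, test);
    return ref_norm > 0.0 ? err / ref_norm : err;
}
```

Accumulating in `double` keeps the comparison itself from adding rounding error on layers with millions of outputs. A relative error is easier to compare across layers because activation magnitudes differ from layer to layer.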

Key Algorithms

Memory-Coalesced Kernel (Optimized):

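__global__ void conv2d_naive_kernel(...) {
    // 2D thread mapping: threadIdx.x varies fastest along the output width
    int w_out = blockIdx.x * blockDim.x + threadIdx.x;
    int h_out = blockIdx.y * blockDim.y + threadIdx.y;
    if (w_out >= W || h_out >= H) return;  // guard partial tiles at the edges

    float acc = 0.0f;

    // Sequential memory access (coalesced!)
    for (int c = 0; c < C; ++c) {
        for (int r = 0; r < 3; ++r) {
            for (int s = 0; s < 3; ++s) {
                int h_in = h_out + r - 1;  // 3×3 filter with padding 1
                int w_in = w_out + s - 1;
                if (h_in < 0 || h_in >= H || w_in < 0 || w_in >= W) continue;  // zero padding
                // Access pattern: [...][h][w] ← consecutive threads, consecutive w
                acc += input[...][h_in][w_in] * weight[...];
            }
        }
    }
    output[...][h_out][w_out] = acc;  // store the result (indexing elided, as above)
}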
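
The kernel's loop nest can be written out as a CPU reference with every index spelled out. This is a sketch under stated assumptions, not the project's code: NCHW input, weights laid out as [K][C][3][3], stride 1, zero padding 1, and no bias term. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: 3x3 convolution, stride 1, zero padding 1.
// input: [N][C][H][W], weight: [K][C][3][3], output: [N][K][H][W].
std::vector<float> conv3x3_reference(const std::vector<float>& input,
                                     const std::vector<float>& weight,
                                     int N, int C, int H, int W, int K) {
    typedef std::size_t sz;
    assert(input.size() == sz(N) * C * H * W);
    assert(weight.size() == sz(K) * C * 9);
    std::vector<float> output(sz(N) * K * H * W, 0.0f);
    for (int n = 0; n < N; ++n)
    for (int k = 0; k < K; ++k)
    for (int h_out = 0; h_out < H; ++h_out)
    for (int w_out = 0; w_out < W; ++w_out) {
        float acc = 0.0f;
        for (int c = 0; c < C; ++c) {
            for (int r = 0; r < 3; ++r) {
                for (int s = 0; s < 3; ++s) {
                    int h_in = h_out + r - 1;
                    int w_in = w_out + s - 1;
                    // Taps that fall outside the image contribute zero.
                    if (h_in < 0 || h_in >= H || w_in < 0 || w_in >= W) continue;
                    sz in_idx = ((sz(n) * C + c) * H + h_in) * W + w_in;
                    sz w_idx = ((sz(k) * C + c) * 3 + r) * 3 + s;
                    acc += input[in_idx] * weight[w_idx];
                }
            }
        }
        output[((sz(n) * K + k) * H + h_out) * W + w_out] = acc;
    }
    return output;
}
```

In the flattened input index, `w_in` is the fastest-varying term. When consecutive GPU threads handle consecutive `w_out` values they therefore read adjacent floats, which is the coalesced pattern the kernel relies on.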

Grid Configuration:

dim3 blockDim(16, 16);  // 256 threads per block
dim3 gridDim(
    (W + 15) / 16,      // Width tiles
    (H + 15) / 16,      // Height tiles
    N * K               // Batch × Output channels
);
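
The grid arithmetic above is a ceiling division per spatial axis. The host-only sketch below reproduces it so the numbers can be checked without a GPU. `Dim3` is a stand-in for CUDA's `dim3`, and `conv_grid` is a hypothetical helper.

```cpp
#include <cassert>

struct Dim3 { unsigned x, y, z; };  // stand-in for CUDA's dim3

// Smallest number of blocks of size b that covers a elements.
unsigned ceil_div(unsigned a, unsigned b) { return (a + b - 1) / b; }

// Hypothetical helper: grid for an H x W output, N images, K output channels.
Dim3 conv_grid(unsigned W, unsigned H, unsigned N, unsigned K,
               unsigned block_x = 16, unsigned block_y = 16) {
    assert(block_x > 0 && block_y > 0);
    Dim3 grid = {ceil_div(W, block_x), ceil_div(H, block_y), N * K};
    // CUDA limits gridDim.y and gridDim.z to 65535 blocks.
    assert(grid.y <= 65535u && grid.z <= 65535u);
    return grid;
}
```

For a single 224×224 image and 64 output channels this gives a 14×14×64 grid. Because `N * K` sits in the z dimension, large batches can hit the 65535 limit: with 512 output channels, the largest batch that fits is 127 images.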

Performance Analysis

(Refer to the report attached for results)

Educational Value

What You'll Learn

GPU Memory Hierarchy
- Global memory vs constant memory vs shared memory
- Memory transaction coalescing requirements
- Bandwidth vs latency trade-offs

CUDA Programming
- Kernel launch configuration (grid/block dimensions)
- Thread indexing and data partitioning
- Synchronization and stream management
- Performance profiling and optimization

Deep Learning Systems
- How Conv2D actually executes on GPU hardware
- Why PyTorch/TensorFlow are fast (they use these tricks!)
- Production optimization strategies

Performance Engineering
- Identifying bottlenecks (memory vs compute)
- Measuring and validating optimizations
- Trade-offs in system design

Recommended Experiments

Profile with Nsight Compute:

ncu --set full -o profile ./conv_test ../images/cat.png

Try different block sizes:

// In gpu.cu, modify:
#define BLOCK_DIM_X (32)  // Try 8, 16, 32
#define BLOCK_DIM_Y (32)  // X × Y must not exceed 1024 threads per block

Add shared memory tiling:
- Load input tiles into shared memory
- Reuse across multiple output pixels
- Expected: 10-20% additional speedup

Compare with cuDNN:
- Replace custom kernel with cudnnConvolutionForward
- Benchmark against production-grade implementation
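
Before running the block-size experiment, it helps to estimate how many launched threads do useful work for each candidate. The sketch below computes that fraction for a given output size; `thread_utilization` is a hypothetical helper, and the model ignores occupancy and memory effects.

```cpp
#include <cassert>

// Hypothetical helper: fraction of launched threads that map to a real
// output pixel when an H x W output is tiled by block_x x block_y blocks.
double thread_utilization(unsigned W, unsigned H,
                          unsigned block_x, unsigned block_y) {
    assert(block_x > 0 && block_y > 0);
    assert(block_x * block_y <= 1024u);  // CUDA's per-block thread limit
    unsigned tiles_x = (W + block_x - 1) / block_x;
    unsigned tiles_y = (H + block_y - 1) / block_y;
    double launched = static_cast<double>(tiles_x) * block_x *
                      static_cast<double>(tiles_y) * block_y;
    return static_cast<double>(W) * H / launched;
}
```

On a 224×224 layer all three candidates (8, 16, 32) tile the output exactly. On the 14×14 layers a 32×32 block keeps only about 19% of its threads busy, against about 77% for 16×16 or 8×8, so the best block size may differ between early and late layers.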

Troubleshooting

Common Issues

1. No such file or directory: libtorch

# Solution: Update Torch_DIR in CMakeLists.txt
set(Torch_DIR "/correct/path/to/libtorch/share/cmake/Torch")

2. undefined reference to cv::imread

# Solution: Install OpenCV development headers
sudo apt install libopencv-dev

3. nvcc fatal: Unsupported gpu architecture 'compute_89'

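# Cause: the installed CUDA toolkit does not know this architecture
# (compute_89 requires CUDA 11.8 or newer).
# Solution: Upgrade the toolkit, or change TORCH_CUDA_ARCH_LIST to match your GPU
# Check your GPU architecture:
nvidia-smi --query-gpu=compute_cap --format=csv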

4. CUDA out of memory

# Solution: Use dummy model for testing
# In scripts/export_vgg16.py, uncomment:
# model = Dummy().eval()

5. Slow performance / no speedup

# Check GPU is being used:
nvidia-smi  # Should show conv_test process

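# Profile to identify bottlenecks
# (nvprof does not support GPUs with compute capability 8.0 or higher; use Nsight Systems instead):
nsys profile ./conv_test ../images/cat.png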

License

This project is intended for educational purposes as part of CMPE 755 - High Performance Architecture at Rochester Institute of Technology (RIT).

Acknowledgments

NVIDIA for CUDA toolkit and documentation
PyTorch team for LibTorch C++ API
VGG16 authors for the model architecture
RIT faculty for course guidance

⭐ If you found this project helpful, please consider starring the repository!
