This project demonstrates converting RGB images to grayscale using GPU acceleration via OpenCL. The primary goals are to get familiar with GPU architecture and execution models and to get started with GPU-based parallel processing.
Environment & Technologies:
- GPU: Intel UHD Graphics 620
- API: OpenCL 3.0
- Language: C/C++
- 24 Execution Units (EUs):
- 3 subslices, each containing 8 EUs.
- Shared components include:
- L3 Cache: Shared data cache across all EUs, used to reduce memory latency and improve performance for large datasets.
- L2 Cache: Each subslice (8 EUs) has its own dedicated L2 cache, which helps in reducing access times to frequently used data.
- L1 Cache: Smaller, faster cache located closer to each individual EU, used for immediate access to data. Helps in improving the speed of thread execution.
- Thread Dispatcher: Manages and schedules threads.
- Sampler Units: Handles texture fetch and sampling.
- Shared Resources: Registers, counters, memory units used across threads.
- Each EU supports up to 7 independent hardware threads.
  👉 Theoretical maximum concurrent threads: 24 EUs × 7 threads = 168 threads.
- Each thread can execute instructions using SIMD8/SIMD16/SIMD32 (Single Instruction, Multiple Data: 8, 16, or 32 data elements per instruction), depending on the width the compiler chooses based on the complexity of the kernel.
  👉 Data processing lanes: 168 threads × 16 = 2688 parallel data lanes with SIMD16 (168 × 32 = 5376 with SIMD32).
- No dedicated DRAM: shares system RAM with the CPU (Unified Memory).
(No 8, The Compute Architecture of Intel Processor Graphics Gen9)
- Platform: A vendor (e.g., Intel / AMD / NVIDIA).
- Device: Specific GPU/CPU device.
- Context: A shared workspace between CPU and GPU.
- Buffers: Used to store input and output image data.
- Program: Contains the compiled OpenCL kernel code.
- Kernel: The GPU function executed in parallel.
- Command Queue: Handles task execution order and sync.
- Global Memory: Accessible by all workgroups.
- Local Memory: Shared within a single workgroup.
- Private Memory: Specific to each work-item.
- Work-item: The smallest unit of execution (similar to a thread).
- Work-group: A group of work-items that run concurrently.
- The total number of work-items typically equals the number of pixels (one per pixel).
- GPUs excel at massively parallel workloads—perfect for image processing where each pixel can be processed independently.
- CPUs are optimized for logic-heavy, control-intensive tasks, while GPUs shine in data-parallel computation.
- Intel UHD 620: 24 EUs × 7 threads = 168 threads; at SIMD16 that is 2688 data lanes (5376 at SIMD32).
- Work group size = 64 is a practical choice:
- It is a power of 2 and a multiple of typical warp/wavefront sizes (32 or 64) and of the SIMD8/16/32 widths.
- Easy to split image blocks into 8×8 or 16×4 chunks.
- Load image and extract RGB pixel data.
- Initialize OpenCL context and select device.
- Compile and build kernel to convert RGB → Grayscale.
- Enqueue kernel execution.
- Read output buffer and save the resulting grayscale image.
gray = 0.299 * R + 0.587 * G + 0.114 * B;
- GPU (Intel UHD Graphics 620):
- Contains 24 Execution Units (EUs) × 7 threads per EU = 168 hardware threads; at SIMD16 that is 2688 parallel lanes (5376 at SIMD32).
- Can process a 1920×1080 image (~2.07 million pixels) in approximately 15–25 ms, depending on kernel efficiency.
- ⇒ Achieves ~40 to 66 images per second with an optimized pipeline (e.g., asynchronous memory transfer and compute overlap).
- CPU (Intel Core i5-8250U @ 1.60–1.80GHz):
- With 4 cores / 8 threads, sequential grayscale conversion takes around 150–300 ms per image.
- ⇒ Processes ~3 to 6 images per second, roughly 10× slower than the GPU.
⇒ By these estimates, OpenCL acceleration offers roughly a 10–12× speedup over CPU-based processing for grayscale conversion.
| Category | Intel UHD Graphics 620 | NVIDIA RTX 3090 |
|---|---|---|
| Architecture | Gen9 (Integrated GPU) | Ampere (GA102, Dedicated GPU) |
| Execution Units | 24 EUs | 10,496 CUDA Cores |
| Thread Model | 7 threads/EU, SIMD8/16/32 | Warp (32 threads), SIMT |
| Optimal Workgroup Size | 64–128 | 128–1024 (multiple of 32, often 256 or 512) |
| Memory Architecture | Shared system RAM | 24 GB GDDR6X, L1/L2 Cache + Shared Memory |
| Scheduler Type | Static (CPU-side enqueue) | Dynamic warp-level scheduling |
| Peak Theoretical Parallelism | 168 hardware threads; 1,344 lanes at SIMD8, 2,688 at SIMD16, 5,376 at SIMD32 | 82 SMs × up to 1,536 resident threads = ~126,000 threads in flight |
| Execution Unit Details | 24 EUs × 7 threads/EU = 168 threads | 82 SMs × 128 FP32 CUDA cores = 10,496 cores |
| Image Processing Example | ~40–66 images/sec (1920×1080) in 15–25 ms each (optimized OpenCL pipeline) | ~200–1000 images/sec (1920×1080) in 1–5 ms each (optimized CUDA pipeline) |
- Set-up guide: https://github.com/KhronosGroup/OpenCL-Guide?tab=readme-ov-file#the-opencl-sdk
- Command to run (on x64 Native Tools Command Prompt):
cl.exe /nologo /W4 /DCL_TARGET_OPENCL_VERSION=100 /I"<PATH_TO_OPENCL_INCLUDE>" <YOUR_SOURCE_FILE.cpp> /Fe:<OUTPUT_EXE_NAME>.exe /link /LIBPATH:"<PATH_TO_OPENCL_LIB>" OpenCL.lib
Example for my device:
cl.exe /nologo /W4 /DCL_TARGET_OPENCL_VERSION=100 /ID:\vcpkg\vcpkg\packages\opencl_x64-windows\include main.cpp /Fe:grayscale.exe /link /LIBPATH:D:\vcpkg\vcpkg\packages\opencl_x64-windows\lib OpenCL.lib
Then run the resulting executable:
grayscale.exe
- Live Demo: https://youtu.be/vEB8gr6dpK4
- 📄 Intel Gen9 GPU Architecture (PDF)
- 📘 ENCCS GPU Programming Tutorial
- 📚 OpenCL Guide – KhronosGroup
- 📦 stb_image – Public domain image libs
- 📜 GrayScale - Wikipedia
© 2025 Duc Dao. Licensed under the MIT License.
🤟 Leave a star if you find something interesting 🤟
