Feature/cuda clusterfinder#306
Draft
kferjaoui wants to merge 19 commits into
Draft
Conversation
Implements a GPU version of the sequential ClusterFinder for single-frame cluster reconstrcution. Kernel (ClusterFinderCUDA.cuh): - Shared memory tiling with generalized halo loading for arbitrary cluster sizes (3x3, 5x5, ...) - Zero-initialization of shared memory to handle image boundary and partial edge-block cases. - Pedestal subtraction during shared memory loading. - Compile-time cluster geometry enabling full loop unrolling of the stencil reduction - Atomic global counter for lock-free cluster output across blocks. - RAII host wrapper; `ClusterFinderCUDA` struct.
- Non-photon pixels now update pedestal (push_fast equivalent) directly in the kernel, no atomics needed - Commented out quadrant significance test (c2): absent from sequential CPU code, was producing GPU-only clusters. - Added d_pd_sum to device allocations and host upload Build (sm_89): 46 registers, 0 spills, 100% occupancy. Verified on 256x256 Jungfrau data, 5000 frames, nSigma=5.0: CPU 8428 vs GPU 8471 clusters, 99.8% match 0.63 ms/frame CPU vs 0.04 ms/frame GPU (~16x)
- Wrap per-stream CUDA resources (device buffers, stream handle) in StreamContext struct; ClusterFinderCUDA owns a vector of n_streams contexts with independent pedestal arrays - Split ClusterFinderCUDA.cuh into clusterfinder_kernel.cuh (device kernel) and ClusterFinderCUDA.hpp (host RAII wrapper) - Add find_clusters_batched(): processes N frames round-robin across streams, returns per-frame cluster vectors. - Update ClusterFinderCUDA.test.cu - Update Makefile for new file layout.
- Add bind_ClusterFinderCUDA.hpp with pybind11 bindings for ClusterFinderCUDA - Build CUDA bindings as separate _aare_cuda.so to avoid segfaults from mixing nvcc and gcc compiled code in the same shared object - Re-export CUDA classes onto _aare in __init__.py so user code uses `from aare import ClusterFinderCUDA` regardless of which .so hosts the class - Factory in ClusterFinder.py selects backend; RuntimeError if GPU requested on CPU-only build - Update python/CMakeLists.txt: _aare_cuda module gated behind AARE_CUDA and AARE_PYTHON_BINDINGS - Add validation notebook: ~20x speedup vs sequential ClusterFinder
- After upgrading to pybind11 3, duplicate registration of cluster-related types across `_aare` and `_aare_cuda` started failing. - Mark the `Cluster` and `ClusterVector` bindings as `py::module_local()` so each extension owns its local registration. Note: cluster objects from CPU and CUDA bindings are now distinct Python types.
- Stencil arithmetic and shared memory use float (COMPUTE_TYPE alias). - Pedestal accumulation stays double to preserve variance accuracy. Notes: - On RTX 4090, FP32 throughput is ~64× higher than FP64, so moving stencil math to float improves performance. - Using float also avoids shared memory bank conflicts: stride-18 maps to distinct banks for 32-bit values, but caused conflicts with 64-bit.
- Use per-stream pinned host staging buffers for truly async CUDA transfers. - Avoid reserving full device capacity per result frame. - Reduce kernel work by delaying cluster payload construction. - Use squared comparisons and removing per-pixel sqrtf() ops.
lrlunin
reviewed
May 7, 2026
Contributor
There was a problem hiding this comment.
Should not be included in git, I guess
Contributor
Author
There was a problem hiding this comment.
Yes will be deleted (only for quick tests while developing)
lrlunin
reviewed
May 7, 2026
Co-authored-by: Leonid Lunin <lunin.leonid@gmail.com>
lrlunin
reviewed
May 13, 2026
Comment on lines
+43
to
+45
| - name: Install conda-build in base | ||
| run: conda install -n base conda-build -y | ||
|
|
Contributor
There was a problem hiding this comment.
Suggested change
| - name: Install conda-build in base | |
| run: conda install -n base conda-build -y |
fixed in #309
Comment on lines
+38
to
+40
| - name: Install conda-build in base | ||
| run: conda install -n base conda-build -y | ||
|
|
Contributor
There was a problem hiding this comment.
Suggested change
| - name: Install conda-build in base | |
| run: conda install -n base conda-build -y |
fixed in #309
Rework the multi-stream pipeline to eliminate per-frame sync barriers and fix the D2H staging architecture. Sync reduction: - Replace one cudaStreamSynchronize per frame with one per stream per batch, cutting synchronisation calls from O(n_frames x n_streams) to O(n_streams) - Introduce a unified per-frame D2H output layout [uint32_t count | clusters[max]] stored in a single class-level lazy-allocated pinned pool (h_output_pinned), replacing the per-stream separate cluster/count device buffers - Move CUDA event pool from per-stream fixed-size to per-frame-slot lazy-allocated, enabling correct kernel timing across any batch size Pinned H2D without CPU-side copy: - Add register_input_buffer(ptr, bytes) / unregister_input_buffer() wrapping cudaHostRegister so callers can pin their existing batch buffer once; all find_clusters_batched() slices then transfer at DMA speed (~22 GB/s) instead of ~15 GB/s for pageable, with no extra memcpy or WC-memory penalty Result (RTX 4090, 400x400 uint16, 3x3 clusters, batch=2000, 5 streams): Before: ~34 µs/frame -> After: ~28 µs/frame (−18 %)
- Device pedestal arrays (mean/sum/sum2) are now float instead of double: halves global-memory bandwidth for pedestal reads/writes and eliminates FP64 arithmetic in the kernel (3.3x kernel speedup, 15µs -> 4.6µs). - Replace the per-cluster push_back loop in the D2H drain with a single resize()+memcpy().
- Eliminate the ~200–300 µs inter-batch idle gap by allowing two batches to be in-flight simultaneously: - submit_batch() enqueues H2D+kernel+D2H without blocking - collect() syncs via cudaEventSynchronize (not cudaStreamSynchronize) so a queued second batch runs uninterrupted. - Two ping-pong output slots (NUM_SLOTS=2) with per-slot pinned buffers and cudaEventDisableTiming sync events. - find_clusters_batched() keeps its direct implementation. * Measured: 0.026 -> 0.022 ms/frame (~18%).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
CUDA Cluster Finder
Kernel
sm_89(43 registers/thread, 0 spills)Precision
COMPUTE_TYPEalias)Host Wrapper & Transfers
StreamContextinstances, each owning device buffers, a stream handle, and a pedestal arrayfind_clusters_batched()distributes frames round-robin across streamssqrtf()Python Integration
_aare_cuda.soto avoid nvcc/gcc ABI conflicts_aareso user imports remainfrom aare import ClusterFinderCUDAClusterandClusterVectormarkedpy::module_local()for pybind11 3 compatibilityValidation