Perf/shared memory prefetch #14

Open
Functionhx wants to merge 2 commits into camenduru:main from Functionhx:perf/shared-memory-prefetch

@Functionhx

Summary

Optimize the boxMeanDist kernel by cooperatively loading each spatial box's points into shared memory before processing the box. This cuts
redundant global-memory reads of point data by up to ~1024× per box.

Problem

In the boxMeanDist kernel, all 1024 threads in a block read the same set of points from global memory for each box. For a box with 1024 points,
each point is therefore requested 1024 times, once per thread: 1,048,576 global loads to fetch 1024 distinct points, a 1024× redundancy.

Solution

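Before processing each box, the threads of the block cooperatively copy the box's points into __shared__ memory, then read from shared memory
during the KNN search. The buffer takes 12 KB per block (1024 points × 3 floats × 4 bytes, assuming float32 xyz coordinates).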
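
The read counts and the 12 KB figure can be checked with a small pure-Python model. This is only a sketch of the access pattern, not the kernel itself: the function names are hypothetical, the strided copy is one common way to split a cooperative load across threads (the PR does not show which scheme it uses), and the 3 × float32 point layout is an assumption inferred from the 12 KB buffer size.

```python
BLOCK_THREADS = 1024      # threads per block, as stated in the PR
FLOATS_PER_POINT = 3      # assumption: each point is (x, y, z) in float32
BYTES_PER_FLOAT = 4


def naive_global_reads(num_points, num_threads=BLOCK_THREADS):
    """Every thread scans every point of the box straight from global memory."""
    reads = 0
    for _thread in range(num_threads):
        reads += num_points
    return reads


def cooperative_global_reads(num_points, num_threads=BLOCK_THREADS):
    """Threads stage the box once, striding by the block size, then scan the copy."""
    global_reads = 0
    shared = [None] * num_points           # stands in for the __shared__ buffer
    for thread in range(num_threads):
        # Thread t copies points t, t + num_threads, t + 2 * num_threads, ...
        for i in range(thread, num_points, num_threads):
            shared[i] = i                  # one global read per staged point
            global_reads += 1
    # In CUDA a barrier such as __syncthreads() would sit here, so that no
    # thread starts scanning before the whole box has been staged.
    assert all(slot is not None for slot in shared)
    return global_reads


def shared_bytes(num_points):
    """Size of the staging buffer under the assumed point layout."""
    return num_points * FLOATS_PER_POINT * BYTES_PER_FLOAT


if __name__ == "__main__":
    naive = naive_global_reads(1024)
    staged = cooperative_global_reads(1024)
    print(naive, staged, naive // staged, shared_bytes(1024))
```

For a 1024-point box the model gives 1,048,576 requested loads without staging and 1024 with it, and a 12,288-byte buffer. The model counts requested loads only; on real hardware, caching and same-address broadcast within a warp likely absorb part of the redundant traffic, so the end-to-end speedup should be expected to be far smaller than the load-count ratio.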

Benchmark (RTX 4070 Ti Super, CUDA 11.8, PyTorch 2.7.1)

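| Points    | Before (ms) | After (ms) | Speedup |
|-----------|-------------|------------|---------|
| 1,000     | 0.245       | 0.220      | 1.1×    |
| 5,000     | 0.716       | 0.542      | 1.3×    |
| 10,000    | 1.463       | 0.805      | 1.8×    |
| 50,000    | 5.432       | 1.994      | 2.7×    |
| 100,000   | 8.649       | 3.211      | 2.7×    |
| 500,000   | 34.108      | 14.495     | 2.4×    |
| 1,000,000 | 68.707      | 33.426     | 2.1×    |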

Test plan

  • All 14 pytest tests pass — correctness unchanged
  • Verified against brute-force KNN (rtol=1e-4)
  • Tested on CUDA 11.8 + PyTorch 2.7.1
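
A brute-force comparison of the kind listed in the test plan might look like the sketch below. It is not the PR's test code: the function names are hypothetical, and both `k=3` and the use of squared distances are assumptions about what boxMeanDist returns. Note also that `math.isclose` applies the relative tolerance symmetrically, which differs slightly from NumPy- or PyTorch-style `rtol` checks.

```python
import math


def brute_force_mean_knn_dist(points, k=3):
    """Mean squared distance from each point to its k nearest neighbours, O(n^2)."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    means = []
    for i, p in enumerate(points):
        # Squared Euclidean distance from p to every other point, ascending.
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(p, q))
            for j, q in enumerate(points)
            if j != i
        )
        nearest = dists[:k]   # fewer than k values if the cloud is tiny
        means.append(sum(nearest) / len(nearest))
    return means


def matches_reference(kernel_out, points, k=3, rtol=1e-4):
    """True if every kernel value is within rtol of the brute-force value."""
    reference = brute_force_mean_knn_dist(points, k)
    return len(kernel_out) == len(reference) and all(
        math.isclose(got, want, rel_tol=rtol)
        for got, want in zip(kernel_out, reference)
    )
```

In a pytest case, `kernel_out` would be the kernel's output copied back to the host as a list of floats. Because the reference is quadratic in the number of points, it is only practical for small clouds.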

Yuchen Fan added 2 commits April 27, 2026 01:36
- Add TORCH_CHECK guards for CUDA tensor, shape, dtype, and contiguity
- Add CUDA_CHECK macro for cudaMalloc/Memcpy/Free error handling
- Remove manual #define __CUDACC__ (set automatically by nvcc)
- Improve README with prerequisites and --no-build-isolation guidance
- Add pyproject.toml for build metadata
- Add pytest test suite and benchmark script
- Improve .gitignore

Cooperatively load box points into __shared__ memory before the KNN
search loop, reducing global memory reads by ~1024x per box.

Benchmark (RTX 4070 Ti Super):
- 10K points: 1.46ms → 0.81ms (1.8x)
- 100K points: 8.6ms → 3.2ms (2.7x)
- 1M points: 68.7ms → 33.4ms (2.1x)