Perf/shared memory prefetch by Functionhx · Pull Request #14 · camenduru/simple-knn

Functionhx · 2026-04-26T17:38:36Z

Summary

Optimize the boxMeanDist kernel by cooperatively loading each spatial box's points into shared memory before processing it, reducing global
memory reads of point data by up to ~1024× per box.

Problem

In the boxMeanDist kernel, all 1024 threads in a block read the same set of points from global memory for each box. For a box with 1024 points,
each point is read 1024 times — a 1024× bandwidth waste.
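
To put numbers on the redundancy described above, here is a small back-of-the-envelope calculation. It assumes a point is three 32-bit floats (12 bytes), which matches the 12 KB buffer quoted for 1024 points, and it ignores L1/L2 caching, so it is an upper bound on traffic rather than a measurement.

```python
# Back-of-the-envelope model of point reads for one box.
# Assumption: a point is three float32 values (12 bytes); caches are ignored.
THREADS_PER_BLOCK = 1024
POINTS_PER_BOX = 1024
BYTES_PER_POINT = 3 * 4

# Every thread walks the whole box and fetches each point itself.
naive_reads = THREADS_PER_BLOCK * POINTS_PER_BOX
# Each point is fetched from global memory once, by a single thread.
staged_reads = POINTS_PER_BOX

print(f"naive : {naive_reads:,} point reads ({naive_reads * BYTES_PER_POINT / 2**20:.0f} MiB)")
print(f"staged: {staged_reads:,} point reads ({staged_reads * BYTES_PER_POINT / 2**10:.0f} KiB)")
print(f"ratio : {naive_reads // staged_reads}x")
```

The ratio equals the number of threads sharing the box, so smaller or partially filled boxes see a proportionally smaller reduction.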

Solution

Before processing each box, the block cooperatively loads that box's points into __shared__ memory (12 KB per block, enough for 1024 points at
12 bytes each); all threads then read from shared memory during the KNN search.
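
The access pattern can be illustrated without a GPU. The following pure-Python emulation is a sketch of the general cooperative-load idea, not the PR's kernel code: `emulate_block`, `BLOCK_DIM`, and the read counter are hypothetical names, the "threads" run one after another, and the gap between the two phases stands in for a block-wide barrier such as `__syncthreads()`.

```python
# Pure-Python emulation of a cooperative shared-memory load (illustrative only).
BLOCK_DIM = 8  # hypothetical tiny block; the PR describes 1024 threads


def sq_dist(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def emulate_block(queries, box_points):
    """Return (nearest squared distance per query, number of global reads)."""
    global_reads = 0
    shared = [None] * len(box_points)  # stands in for the __shared__ buffer

    # Phase 1: thread t loads elements t, t + BLOCK_DIM, t + 2 * BLOCK_DIM, ...
    for tid in range(BLOCK_DIM):
        for i in range(tid, len(box_points), BLOCK_DIM):
            shared[i] = box_points[i]
            global_reads += 1
    # A real kernel needs a block-wide barrier here before anyone reads `shared`.

    # Phase 2: every thread scans the staged copy instead of global memory.
    best = [min(sq_dist(q, p) for p in shared) for q in queries]
    return best, global_reads


box = [(float(i), 0.0, 0.0) for i in range(20)]
queries = [(i + 0.25, 0.0, 0.0) for i in range(BLOCK_DIM)]
best, reads = emulate_block(queries, box)
print(reads)  # one global read per point in the box
```

The strided indexing in phase 1 means neighboring threads touch neighboring elements, which is the layout that lets a GPU coalesce the loads. Without staging, the same search would issue `BLOCK_DIM * len(box)` global reads.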

Benchmark (RTX 4070 Ti Super, CUDA 11.8, PyTorch 2.7.1)

Points	Before (ms)	After (ms)	Speedup
1,000	0.245	0.220	1.1×
5,000	0.716	0.542	1.3×
10,000	1.463	0.805	1.8×
50,000	5.432	1.994	2.7×
100,000	8.649	3.211	2.7×
500,000	34.108	14.495	2.4×
1,000,000	68.707	33.426	2.1×
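
The Speedup column follows directly from the two timing columns. The snippet below recomputes it from the figures in the table as a quick consistency check.

```python
# Timings copied from the benchmark table: (points, before_ms, after_ms).
rows = [
    (1_000, 0.245, 0.220),
    (5_000, 0.716, 0.542),
    (10_000, 1.463, 0.805),
    (50_000, 5.432, 1.994),
    (100_000, 8.649, 3.211),
    (500_000, 34.108, 14.495),
    (1_000_000, 68.707, 33.426),
]

# Speedup is simply the ratio of the "before" time to the "after" time.
speedups = {n: before / after for n, before, after in rows}
for n, s in speedups.items():
    print(f"{n:>9,}  {s:.1f}x")
```

The ratio peaks at 50,000 to 100,000 points and tapers off at larger sizes.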

Test plan

All 14 pytest tests pass — correctness unchanged
Verified against brute-force KNN (rtol=1e-4)
Tested on CUDA 11.8 + PyTorch 2.7.1
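
A brute-force reference for this kind of check can be very small. The sketch below assumes the quantity being verified is each point's mean squared distance to its three nearest neighbors. That definition, the names `brute_force_mean_dist` and `all_close`, and the use of `math.isclose` for the rtol=1e-4 comparison are assumptions for illustration, not taken from the PR's test suite.

```python
import math


def brute_force_mean_dist(points, k=3):
    """O(n^2) reference: mean squared distance from each point to its k nearest others."""
    out = []
    for i, p in enumerate(points):
        d2 = sorted(
            sum((a - b) ** 2 for a, b in zip(p, q))
            for j, q in enumerate(points)
            if j != i
        )
        out.append(sum(d2[:k]) / k)
    return out


def all_close(actual, expected, rtol=1e-4):
    """Element-wise relative comparison (symmetric, per math.isclose)."""
    return all(math.isclose(a, e, rel_tol=rtol) for a, e in zip(actual, expected))


# Four collinear points: the first point's neighbors lie at squared distances 1, 4, 9.
pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
ref = brute_force_mean_dist(pts)
print(ref[0])
```

In a real test the `actual` values would come from the CUDA extension. Note that `math.isclose` scales the tolerance by the larger of the two magnitudes, which differs slightly from the `allclose`-style formula that scales by the expected value only.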

- Add TORCH_CHECK guards for CUDA tensor, shape, dtype, and contiguity
- Add CUDA_CHECK macro for cudaMalloc/Memcpy/Free error handling
- Remove manual #define __CUDACC__ (set automatically by nvcc)
- Improve README with prerequisites and --no-build-isolation guidance
- Add pyproject.toml for build metadata
- Add pytest test suite and benchmark script
- Improve .gitignore

Cooperatively load box points into __shared__ memory before the KNN search loop, reducing global memory reads by ~1024x per box.

Benchmark (RTX 4070 Ti Super):
- 10K points: 1.46ms → 0.81ms (1.8x)
- 100K points: 8.6ms → 3.2ms (2.7x)
- 1M points: 68.7ms → 33.4ms (2.1x)

Yuchen Fan added 2 commits April 27, 2026 01:36

Functionhx wants to merge 2 commits into camenduru:main from Functionhx:perf/shared-memory-prefetch
