Confidence: medium · Effort: trivial (<1 h)
Problem
The autotune loop resets d_keys_tested, d_best_packed, and d_stop_requested between every tpb/bpsm/chunk_mult combination, using three separate cudaMemsetAsync calls followed by a blocking cudaStreamSynchronize. For resets this small, the per-call launch overhead outweighs the cost of the bytes actually written.
Files: src/bruteforce.cu:2150-2153
Suggested fix
Either:
- Allocate a single 24-byte device counter struct and reset it in one cudaMemsetAsync, or
- Batch the resets into a small 1-thread kernel that zeroes all three.
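A minimal sketch of both options, not the actual code in src/bruteforce.cu. The struct name AutotuneCounters, its field types (three 64-bit values, chosen only to match the 24-byte figure), and the helpers reset_counters, reset_counters_kernel, and reset_counters_demo are all hypothetical. It also assumes every counter resets to zero. If one of them resets to a different value, the single memset would not apply as written and the kernel variant would be the one to adapt.

```cuda
#include <cuda_runtime.h>

// Hypothetical layout: the three autotune counters packed contiguously.
// Field types are an assumption; the real declarations may differ.
struct AutotuneCounters {
    unsigned long long keys_tested;
    unsigned long long best_packed;
    unsigned long long stop_requested;
};
static_assert(sizeof(AutotuneCounters) == 24, "expected a 24-byte counter block");

// Option 1: one async memset covers all three counters.
inline cudaError_t reset_counters(AutotuneCounters* d_counters, cudaStream_t stream) {
    return cudaMemsetAsync(d_counters, 0, sizeof(AutotuneCounters), stream);
}

// Option 2: a 1-thread kernel that zeroes all three fields.
// Launch as: reset_counters_kernel<<<1, 1, 0, stream>>>(d_counters);
__global__ void reset_counters_kernel(AutotuneCounters* c) {
    c->keys_tested = 0;
    c->best_packed = 0;
    c->stop_requested = 0;
}

// Small self-check: fill the block with non-zero values, reset it with
// option 1, copy it back, and confirm every field reads as zero.
inline bool reset_counters_demo() {
    AutotuneCounters* d_counters = nullptr;
    cudaStream_t stream = nullptr;
    AutotuneCounters h = {1, 2, 3};
    bool ok = cudaMalloc(&d_counters, sizeof(h)) == cudaSuccess
           && cudaStreamCreate(&stream) == cudaSuccess
           && cudaMemcpyAsync(d_counters, &h, sizeof(h),
                              cudaMemcpyHostToDevice, stream) == cudaSuccess
           && reset_counters(d_counters, stream) == cudaSuccess
           && cudaMemcpyAsync(&h, d_counters, sizeof(h),
                              cudaMemcpyDeviceToHost, stream) == cudaSuccess
           && cudaStreamSynchronize(stream) == cudaSuccess
           && h.keys_tested == 0 && h.best_packed == 0 && h.stop_requested == 0;
    if (stream) cudaStreamDestroy(stream);
    cudaFree(d_counters);  // safe when d_counters is still nullptr
    return ok;
}
```

Operations issued to the same stream run in order, so if the next autotune kernel launches on that stream it will see the zeroed counters without a host-side wait. Whether the existing cudaStreamSynchronize can also be dropped depends on what the host reads between combinations, which this sketch does not cover.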
Why it matters
Cuts ~2 ms × N_combinations off autotune startup. Not huge but a clean, low-risk win.