
Perf: autotune emits three cudaMemsetAsync per chunk-mult combination #9

@felixx-sp

Description


Confidence: medium · Effort: trivial (<1 h)

Problem

The autotune loop resets d_keys_tested, d_best_packed, and d_stop_requested between every tpb/bpsm/chunk_mult combination, using three separate cudaMemsetAsync calls followed by a blocking cudaStreamSynchronize. Each reset moves only a few bytes, so the cost is dominated by per-call launch overhead rather than by the data written.

Files: src/bruteforce.cu:2150-2153

Suggested fix

Either:

  1. Allocate a single 24-byte device counter struct and reset it in one cudaMemsetAsync, or
  2. Batch into a small 1-thread kernel that zeroes all three.
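A minimal sketch of both options, for discussion only. The struct name, field types, and helper names are hypothetical: the issue does not show how the three counters are declared, so this assumes three 64-bit fields (which matches the 24-byte figure) and assumes all three reset to zero. `reset_leaves_zero` is just a self-check for the sketch, not something that exists in src/bruteforce.cu.

```cuda
#include <cuda_runtime.h>

// Hypothetical layout: three 64-bit counters packed into one allocation.
// The real field types in bruteforce.cu may differ.
struct AutotuneCounters {
    unsigned long long keys_tested;
    unsigned long long best_packed;
    unsigned long long stop_requested;
};
static_assert(sizeof(AutotuneCounters) == 24, "expected 3 x 8 bytes");

// Option 1: a single cudaMemsetAsync zeroes all three fields.
inline cudaError_t reset_with_memset(AutotuneCounters* d_counters, cudaStream_t stream) {
    return cudaMemsetAsync(d_counters, 0, sizeof(AutotuneCounters), stream);
}

// Option 2: a single-thread kernel writes each field explicitly.
// Useful if a field ever needs an initial value that is not a repeated byte.
__global__ void reset_counters_kernel(AutotuneCounters* c) {
    c->keys_tested = 0ULL;
    c->best_packed = 0ULL;
    c->stop_requested = 0ULL;
}

inline cudaError_t reset_with_kernel(AutotuneCounters* d_counters, cudaStream_t stream) {
    reset_counters_kernel<<<1, 1, 0, stream>>>(d_counters);
    return cudaGetLastError();  // reports launch-time errors only
}

// Self-check for this sketch: poison the struct with 0xFF bytes, reset it,
// copy it back, and confirm every field reads zero.
bool reset_leaves_zero(bool use_kernel) {
    AutotuneCounters* d_counters = nullptr;
    if (cudaMalloc(&d_counters, sizeof(AutotuneCounters)) != cudaSuccess) return false;

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        cudaFree(d_counters);
        return false;
    }

    AutotuneCounters h{};
    bool ok =
        cudaMemsetAsync(d_counters, 0xFF, sizeof(AutotuneCounters), stream) == cudaSuccess &&
        (use_kernel ? reset_with_kernel(d_counters, stream)
                    : reset_with_memset(d_counters, stream)) == cudaSuccess &&
        cudaMemcpyAsync(&h, d_counters, sizeof(h), cudaMemcpyDeviceToHost, stream) == cudaSuccess &&
        cudaStreamSynchronize(stream) == cudaSuccess;

    cudaStreamDestroy(stream);
    cudaFree(d_counters);
    return ok && h.keys_tested == 0 && h.best_packed == 0 && h.stop_requested == 0;
}
```

Either helper replaces the three existing resets with one enqueue per combination. Work submitted to the same stream afterwards is ordered after the reset, so whether the blocking cudaStreamSynchronize is still needed depends on whether the host reads the counters at that point; that would have to be checked against the actual loop.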

Why it matters

Cuts an estimated ~2 ms × N_combinations off autotune startup. Not huge, but a clean, low-risk win.

Metadata

Labels: enhancement
