Remove inter-trial device syncs during autotuning. #116

Merged
romerojosh merged 2 commits into main from reduce_autotuning_syncs on Mar 19, 2026
@romerojosh
Collaborator

Currently, the cuDecomp autotuning implementation enforces a device synchronization between trials so that it can query timings from recorded cudaEvents. These synchronizations are not strictly necessary, and they can pollute the timings: each sync drains the GPU work queue, so the following trial starts with more exposed kernel launch latency.

This PR updates the autotuning implementation to remove inter-trial device syncs where possible.
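
As a rough illustration of the idea (not the actual cuDecomp change), the sketch below enqueues every trial back to back with its own pair of cudaEvents and synchronizes only once, after the last trial, before reading the elapsed times. The kernel `trial_kernel`, the trial count, and the buffer size are hypothetical placeholders; only the CUDA runtime calls are real API.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Abort main() with a message if a CUDA runtime call fails.
#define CHECK_CUDA(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      std::fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,     \
                   __LINE__, cudaGetErrorString(err_));               \
      return 1;                                                       \
    }                                                                 \
  } while (0)

// Hypothetical stand-in for the work timed in one autotuning trial.
__global__ void trial_kernel(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = data[i] * 1.0001f + 1.0f;
}

int main() {
  const int n = 1 << 20;    // assumed problem size
  const int ntrials = 5;    // assumed number of trials

  float* d_data = nullptr;
  CHECK_CUDA(cudaMalloc(&d_data, n * sizeof(float)));
  CHECK_CUDA(cudaMemset(d_data, 0, n * sizeof(float)));

  cudaStream_t stream;
  CHECK_CUDA(cudaStreamCreate(&stream));

  // One start/stop event pair per trial, so no pair has to be read
  // (and therefore synchronized on) before the next trial is enqueued.
  std::vector<cudaEvent_t> start(ntrials), stop(ntrials);
  for (int t = 0; t < ntrials; ++t) {
    CHECK_CUDA(cudaEventCreate(&start[t]));
    CHECK_CUDA(cudaEventCreate(&stop[t]));
  }

  // Enqueue all trials without any host-side sync in between.
  for (int t = 0; t < ntrials; ++t) {
    CHECK_CUDA(cudaEventRecord(start[t], stream));
    trial_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    CHECK_CUDA(cudaGetLastError());
    CHECK_CUDA(cudaEventRecord(stop[t], stream));
  }

  // Single synchronization after the final trial.
  CHECK_CUDA(cudaStreamSynchronize(stream));

  // All events have completed, so every elapsed time can be queried now.
  for (int t = 0; t < ntrials; ++t) {
    float ms = 0.0f;
    CHECK_CUDA(cudaEventElapsedTime(&ms, start[t], stop[t]));
    std::printf("trial %d: %.3f ms\n", t, ms);
  }

  for (int t = 0; t < ntrials; ++t) {
    CHECK_CUDA(cudaEventDestroy(start[t]));
    CHECK_CUDA(cudaEventDestroy(stop[t]));
  }
  CHECK_CUDA(cudaStreamDestroy(stream));
  CHECK_CUDA(cudaFree(d_data));
  return 0;
}
```

Because the GPU queue is never drained between trials, the launch of trial t+1 can overlap with the execution of trial t, which is the effect the description attributes to dropping the inter-trial syncs. The cost is one event pair per trial instead of a single reused pair.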

Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh
Collaborator Author

/build

@github-actions

🚀 Build workflow triggered! View run

@github-actions

✅ Build workflow passed! View run

@romerojosh romerojosh merged commit 2d8e1cb into main Mar 19, 2026
4 checks passed
@romerojosh romerojosh deleted the reduce_autotuning_syncs branch March 23, 2026 21:50