Run submodel inference in parallel using CUDA streams #18

Merged
kmaziarz merged 3 commits into main from kmaziarz/parallelize-ensemble-model-inference on Jun 10, 2026

Conversation

@kmaziarz

@kmaziarz kmaziarz commented Jun 9, 2026

Contributor

So far, we've been running RetroChimera's submodels sequentially. Since they are independent, we can run them in parallel, which can speed things up when the submodels do not fully utilize the GPU, as is the case for the ones currently used in RetroChimera. This PR adds an option to run the submodels in parallel using CUDA streams. The option is controlled by a flag that is True by default; the main reason to turn it off would be a very tight memory budget, as parallelization slightly increases memory use.
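For readers unfamiliar with the technique, here is a minimal sketch of running independent submodels on separate CUDA streams. It assumes PyTorch, and it is not the code from this PR: the function name `run_submodels` and the `parallel` argument are hypothetical stand-ins for the actual implementation and flag. Without a GPU it falls back to the sequential path.

```python
import torch
from torch import nn


def run_submodels(submodels, batch, parallel=True):
    """Run independent submodels on the same batch; return outputs in order.

    `parallel` is a hypothetical stand-in for the flag described above.
    """
    if not (parallel and batch.is_cuda):
        # Sequential path: also used whenever the input is not on a GPU.
        with torch.no_grad():
            return [model(batch) for model in submodels]

    current = torch.cuda.current_stream(batch.device)
    streams = [torch.cuda.Stream(device=batch.device) for _ in submodels]
    outputs = []
    with torch.no_grad():
        for model, stream in zip(submodels, streams):
            # The side stream must not start before the input is ready.
            stream.wait_stream(current)
            with torch.cuda.stream(stream):
                out = model(batch)
            # Tell the caching allocator which other streams use each tensor,
            # so memory is not recycled while still in use.
            batch.record_stream(stream)
            out.record_stream(current)
            outputs.append(out)

    # Make the original stream wait until every submodel has finished.
    for stream in streams:
        current.wait_stream(stream)
    return outputs
```

Each side stream holds its own activations at the same time as the others, which is consistent with the slightly higher memory use mentioned above.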

The speedup is small but detectable, and model output is unchanged. For context, I also include a comparison with the state of main before #15 (that is, before we started looking at model speed at all).

| Setting | Before #15 | Before #18 | After #18 | Speedup #18 | Speedup combined |
| --- | --- | --- | --- | --- | --- |
| batch_size=1 | 3052.616 | 2335.829 | 2285.757 | 1.02x | 1.34x |
| batch_size=16 | 1197.197 | 658.432 | 584.615 | 1.13x | 2.05x |
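The speedup columns can be reproduced from the raw numbers with the short calculation below. It assumes the measurements are lower-is-better (time-like) values; the table does not state their unit, so none is given here. The helper name `speedups` is illustrative only.

```python
# Measurements copied from the table: (before #15, before #18, after #18).
timings = {
    "batch_size=1": (3052.616, 2335.829, 2285.757),
    "batch_size=16": (1197.197, 658.432, 584.615),
}


def speedups(before_15, before_18, after_18):
    """Return (speedup from #18 alone, combined speedup since before #15)."""
    return round(before_18 / after_18, 2), round(before_15 / after_18, 2)


for setting, row in timings.items():
    from_18, combined = speedups(*row)
    print(f"{setting}: {from_18:.2f}x from #18, {combined:.2f}x combined")
```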

@jla-gardner jla-gardner left a comment

Nice 🚀 LGTM

@kmaziarz kmaziarz merged commit a4ae7c7 into main Jun 10, 2026
8 checks passed
@kmaziarz kmaziarz deleted the kmaziarz/parallelize-ensemble-model-inference branch June 10, 2026 11:15
