[Perf] Streams 4: add stream pool #410
hughperkins wants to merge 5 commits into hp/streams-quadrantsic-3-stream-parallel
…adrantsic-3-stream-parallel
Replace per-launch stream_create/stream_destroy with acquire_stream/release_stream on CUDAContext and AMDGPUContext. Streams are cached in a pool and reused across invocations, which avoids paying the driver-level cost of stream creation (~5-50 us) on every kernel launch in hot loops.
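A minimal sketch of the acquire/release pattern described in this commit, for illustration only. It is not the code from this PR: `StreamPool`, `StreamHandle`, `CreateFn` and `idle_count` are hypothetical names, the stream handle is an opaque pointer standing in for a driver stream, and the mutex is an assumption about how a context might guard the cache.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a driver stream handle; not a real driver type.
using StreamHandle = void *;

class StreamPool {
 public:
  // Callback that creates a new stream (assumed to wrap the driver call).
  using CreateFn = std::function<StreamHandle()>;

  explicit StreamPool(CreateFn create) : create_(std::move(create)) {}

  // Hand out a cached stream if one is idle; otherwise create a new one.
  StreamHandle acquire_stream() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!idle_.empty()) {
      StreamHandle stream = idle_.back();
      idle_.pop_back();
      return stream;
    }
    return create_();
  }

  // Put the stream back in the cache instead of destroying it.
  void release_stream(StreamHandle stream) {
    assert(stream != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    idle_.push_back(stream);
  }

  // Number of streams currently cached and not in use.
  std::size_t idle_count() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return idle_.size();
  }

 private:
  CreateFn create_;
  mutable std::mutex mutex_;
  std::vector<StreamHandle> idle_;
};
```

With this shape, a loop that acquires two streams per iteration and releases them afterwards triggers only two creations in total, however many iterations run.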
…ms-quadrantsic-4-stream-pool
Calls a stream_parallel kernel 5 times in a loop to verify that pooled streams are correctly reused with correct results each iteration.
Review from Opus 4.6: What it does: Introduces a stream pool in both CUDAContext and AMDGPUContext so that GPU streams are reused across kernel launches rather than created and destroyed every time.
Concerns / suggestions:
Overall, it's a solid, focused change. The main actionable item is making sure pooled streams get destroyed when the context shuts down.
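One way the shutdown concern could be addressed is to have the pool own its idle streams and destroy them in its destructor. The sketch below is an assumption about the shape of such a fix, not the PR's implementation: `OwningStreamPool`, `StreamHandle`, `CreateFn` and `DestroyFn` are hypothetical names, and the callbacks stand in for whatever driver calls the context actually uses.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a driver stream handle; not a real driver type.
using StreamHandle = void *;

class OwningStreamPool {
 public:
  using CreateFn = std::function<StreamHandle()>;
  using DestroyFn = std::function<void(StreamHandle)>;

  OwningStreamPool(CreateFn create, DestroyFn destroy)
      : create_(std::move(create)), destroy_(std::move(destroy)) {}

  // The pool owns driver resources, so copying would destroy them twice.
  OwningStreamPool(const OwningStreamPool &) = delete;
  OwningStreamPool &operator=(const OwningStreamPool &) = delete;

  // Destroy every cached stream when the owning context goes away.
  ~OwningStreamPool() {
    for (StreamHandle stream : idle_) {
      destroy_(stream);
    }
  }

  // Reuse an idle stream if available; otherwise create one.
  StreamHandle acquire_stream() {
    if (!idle_.empty()) {
      StreamHandle stream = idle_.back();
      idle_.pop_back();
      return stream;
    }
    return create_();
  }

  // Cache the stream for later reuse.
  void release_stream(StreamHandle stream) {
    assert(stream != nullptr);
    idle_.push_back(stream);
  }

 private:
  CreateFn create_;
  DestroyFn destroy_;
  std::vector<StreamHandle> idle_;
};
```

Note that this sketch only destroys streams that have been released back to the pool; a stream still checked out at shutdown would need separate handling.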
Response from the original Opus that wrote this PR:
I think I might get some opinions from a fresh Opus.
The reviewer Opus said the following when I asked "what are your thoughts on concern 1?": Honestly, on reflection it's probably overstated. The number of streams acquired per launch is determined by the number of distinct stream_parallel groups in a kernel, that's
So both Opuses concur that the only concern is the destructor, concern 2.