[Perf] Streams 4: add stream pool #410

Draft

hughperkins wants to merge 5 commits into hp/streams-quadrantsic-3-stream-parallel from
hp/streams-quadrantsic-4-stream-pool
Conversation

@hughperkins
Collaborator

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Replace per-launch stream_create/stream_destroy with acquire_stream/
release_stream on CUDAContext and AMDGPUContext. Streams are cached in
a pool and reused across invocations, avoiding the driver-level overhead
of stream creation (~5-50us) on every kernel launch in hot loops.
@hughperkins hughperkins changed the base branch from main to hp/streams-quadrantsic-3-stream-parallel March 12, 2026 01:35
Calls a stream_parallel kernel 5 times in a loop to verify that pooled
streams are correctly reused with correct results each iteration.
@hughperkins
Collaborator Author

Review from Opus 4.6:

What it does: Introduces a stream pool in both CUDAContext and AMDGPUContext so that GPU streams are reused across kernel launches rather than created and destroyed every time.
The kernel launchers now call acquire_stream() / release_stream() instead of stream_create() / stream_destroy().
What's good:

  1. Clear motivation. The previous branch even had a comment saying "a stream pool could reduce overhead for hot loops" — this PR implements exactly that and removes the TODO
    comment. Nice follow-through.
  2. Symmetry. The CUDA and AMDGPU implementations are perfectly mirrored, which is easy to review and maintain.
  3. Thread safety. Both acquire_stream and release_stream are guarded by lock_, which is correct since the context is a singleton shared across threads.
  4. Test coverage. The test runs 5 iterations of a kernel with two stream_parallel blocks, which exercises the pool reuse path.

Concerns / suggestions:

  1. Unbounded pool growth. Streams are never destroyed — release_stream just pushes back into the vector forever. If a spike causes many concurrent streams to be allocated,
    they'll remain in the pool indefinitely. Consider adding a high-water mark or trimming idle streams, or at minimum destroying them in the context destructor.
  2. No cleanup in destructor. The stream_pool_ vector will leak GPU streams when the context is destroyed. You should iterate and destroy all pooled streams in the destructor.
  3. Lock contention. Every acquire_stream / release_stream takes the global lock_ mutex. In the hot loop case this PR is optimizing for, the acquire and release happen close
    together per launch. This is probably fine for now, but if you ever acquire/release streams from many threads concurrently, a lock-free pool or thread-local free-lists would
    be better. Not needed now, just worth noting.
  4. std::vector as a stack is fine. pop_back() + push_back() gives LIFO reuse, which is actually good: recently used streams are more likely to be warm in driver caches.
  5. Test could be stronger. The test validates correctness but doesn't actually verify that streams are being reused (e.g., that no new streams are created on the 2nd+
    iteration). A unit test that directly calls acquire_stream / release_stream and checks pool size would give more confidence the pooling logic itself works. Though I
    understand that may be harder to do from Python.

Overall, it's a solid, focused change. The main actionable item is making sure pooled streams get destroyed when the context shuts down.

@hughperkins
Collaborator Author

Response from the original Opus that wrote this PR:

  1. Unbounded pool growth -- In practice this is bounded by the number of unique stream_parallel group IDs per kernel, which is typically 2-4. The pool won't grow beyond that
    peak. Not a real concern.
  2. No cleanup in destructor -- Technically valid, but both CUDAContext and AMDGPUContext are singletons allocated with new and intentionally never deleted (the CUDA destructor
    is even commented out). The OS reclaims GPU resources on process exit. That said, adding cleanup is trivially cheap and good hygiene -- I could add it.
  3. Lock contention -- Reviewer already acknowledges it's fine for now. Agreed.
  4. LIFO reuse -- Just an observation, and a positive one.
  5. Stronger test -- Fair point but acquire_stream/release_stream are C++ methods not exposed via pybind, so testing pool internals from Python isn't practical. The correctness
    test (5 iterations with correct results) is the right level of testing from this side.

I think I might get some opinions from a fresh Opus.

@hughperkins
Collaborator Author

The reviewer Opus said the following when I asked "what are your thoughts on concern 1?":

Honestly, on reflection it's probably overstated. The number of streams acquired per launch is determined by the number of distinct stream_parallel groups in a kernel — that's
a compile-time property, typically small (single digits). The pool will quickly stabilize at the high-water mark and stay there. There's no mechanism that would cause a
"spike" of stream allocation that then subsides, since the acquire/release pattern is synchronous within each launch (acquire all, launch, sync, release all).
GPU streams are also lightweight when idle — they're essentially just driver-side command queues, not large resource allocations.
So the "unbounded growth" framing is misleading. The real actionable concern is just concern 2 — destructor cleanup. You should still destroy the pooled streams when the
context shuts down to avoid leaking driver handles. But worrying about pool size management or high-water trimming is over-engineering for this use case.

@hughperkins
Collaborator Author

So both opuses concur that the only concern is the destructor, concern 2.

@hughperkins hughperkins marked this pull request as ready for review March 12, 2026 01:51
@hughperkins hughperkins marked this pull request as draft March 12, 2026 04:59