[Perf] Streams 3: Add qd.stream_parallel() context manager #409
hughperkins wants to merge 6 commits into hp/streams-quadrantsic-2-amdgpu-cpu from
Conversation
Introduces stream_parallel() for running top-level for-loop blocks on separate GPU streams. The AST transformer maps 'with qd.stream_parallel()' blocks to stream-parallel group IDs, which propagate through IR lowering and offloading to the CUDA/AMDGPU kernel launchers. Each unique group ID gets its own stream at launch time. Includes validation that all top-level kernel statements must be stream_parallel blocks (no mixing), and offline cache key support.
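As a rough sketch of the transformer's group-ID mapping described above (the helper names and the standalone use of Python's `ast` module here are illustrative, not the PR's actual code), each top-level `with qd.stream_parallel():` block in the kernel body can be detected and assigned a unique group ID:

```python
import ast

SRC = """
with qd.stream_parallel():
    pass
with qd.stream_parallel():
    pass
"""

def is_stream_parallel_with(stmt):
    # A 'with' statement whose single context manager is a call of the
    # form qd.stream_parallel() (per the PR description).
    if not isinstance(stmt, ast.With) or len(stmt.items) != 1:
        return False
    ctx = stmt.items[0].context_expr
    return (isinstance(ctx, ast.Call)
            and isinstance(ctx.func, ast.Attribute)
            and ctx.func.attr == "stream_parallel")

def assign_group_ids(body):
    # Map each qualifying top-level statement index to a group ID;
    # each unique group ID gets its own stream at launch time.
    return {i: gid for gid, i in enumerate(
        (i for i, s in enumerate(body) if is_stream_parallel_with(s)),
        start=1)}

print(assign_group_ids(ast.parse(SRC).body))  # {0: 1, 1: 2}
```

In the real PR these IDs then propagate through IR lowering and offloading to the CUDA/AMDGPU launchers.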
…adrantsic-3-stream-parallel # Conflicts: # python/quadrants/lang/stream.py
Prevents stale group IDs from leaking if insert_for is called after a path that set a non-zero stream_parallel_group_id, matching the reset pattern of all other ForLoopConfig fields.
Add an error check in begin_stream_parallel() to prevent nesting, which would otherwise produce undefined group-ID semantics.
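A minimal sketch of such a nesting guard (the `StreamParallelState` class and its method names are hypothetical, standing in for the PR's begin_stream_parallel() state):

```python
class StreamParallelState:
    """Tracks whether a stream_parallel block is currently open."""

    def __init__(self):
        self.active = False
        self.next_group_id = 1

    def begin(self):
        # Reject nesting: a second begin() while a block is open would
        # leave the group-ID semantics undefined.
        if self.active:
            raise RuntimeError("qd.stream_parallel() blocks cannot be nested")
        self.active = True
        gid = self.next_group_id
        self.next_group_id += 1
        return gid

    def end(self):
        self.active = False
```

Each sibling block gets a fresh group ID; only overlap (nesting) is rejected.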
…context safety Add comments explaining that streams are created/destroyed per launch (stream pooling as future optimization), and that RuntimeContext sharing across concurrent streams is safe because kernels only read from it.
…adrantsic-3-stream-parallel
Review from Opus (predates the last 5 commits above):
| Priority | Item |
|---|---|
| High | Add config.stream_parallel_group_id = 0; to ForLoopDecoratorRecorder::reset() (#3) |
| High | Verify RuntimeContext is safe to share across concurrent streams (#5) |
| Medium | Reject nested stream_parallel blocks explicitly (#7) |
| Medium | Consider rejecting stream_parallel in @qd.func (#8) |
| Low | Extract shared stream-dispatch logic from CUDA/AMDGPU launchers (#2) |
| Low | Consider stream pooling for repeated kernel launches (#1) |
For the stream-pool concern, a fourth PR (#410) adds a stream pool.
```python
if len(stmt.items) != 1:
    return False
item = stmt.items[0]
```
What is `items`? Could you document here, or somewhere else, why the length can be 1 or more, and what it means in this context?
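For context, `items` here is the standard `ast.With.items` field: the list of context managers on a single `with` line, one per comma-separated manager. A quick demonstration:

```python
import ast

# 'with a:' has one item; 'with a, b:' has two.
single = ast.parse("with open('f') as fh:\n    pass").body[0]
multi = ast.parse("with a, b:\n    pass").body[0]
print(len(single.items), len(multi.items))  # 1 2
```

Since `qd.stream_parallel()` is meant to be the sole context manager, the length check rejects anything else.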
```python
        "When using qd.stream_parallel(), all top-level statements "
        "in the kernel must be 'with qd.stream_parallel():' blocks. "
        "Move non-parallel code to a separate kernel."
    )
```
I still don't understand why you wrap to the next line before you have to. This reads oddly to me, but I don't care much.
```python
has_sp = any(FunctionDefTransformer._is_stream_parallel_with(s, global_vars) for s in body)
if not has_sp:
```
I would rather do

```python
# <Insert fancy comment explaining what this check is doing>
if not any(FunctionDefTransformer._is_stream_parallel_with(s, global_vars) for s in body):
    return
```

```python
if len(node.items) != 1:
    raise QuadrantsSyntaxError("'with' in Quadrants kernels only supports a single context manager")
item = node.items[0]
```
Same. Not clear what `items` is.
All this "code duplication" (or at least duplicated logic) is annoying, but if there is no better choice and it matches what we have been doing so far, then it is OK.
Lines changed: +377 −161 (net +216)
Issue: #