
[GPU][BSE-5357] Make sure GPU buffers are ready before calling MPI #1075

Merged
scott-routledge2 merged 11 commits into main from ehsan/mpi_sync
Apr 1, 2026

Conversation

Collaborator

@ehsantn ehsantn commented Mar 19, 2026

Changes included in this PR

Resolves a segfault inside UCX for multi-node / multi-GPU TPCH Q5.

Reference: https://github.com/rapidsai/rapidsmpf/blob/a1580b7f67619e7dcafd32d4a1b933f1fce61a84/cpp/include/rapidsmpf/memory/buffer.hpp#L250
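The fix follows the same idea as the rapidsmpf reference above: block on the buffer's CUDA stream before handing the device pointer to MPI. A minimal sketch of the pattern (the function, buffer, and destination names here are hypothetical placeholders, not code from this PR; the repo's actual change lives in `bodo/libs/gpu_utils.cpp`):

```cuda
#include <cuda_runtime.h>
#include <mpi.h>

// Hypothetical sketch: ensure all pending kernels/copies that write to
// `buf` on `stream` have completed before CUDA-aware MPI reads the
// device pointer. Skipping this can let MPI/UCX read an in-flight
// buffer, which surfaced here as a segfault inside UCX.
void send_gpu_buffer(void* buf, int count, int dest, cudaStream_t stream,
                     MPI_Comm comm, MPI_Request* req) {
    // Make sure GPU buffers are ready before passing to MPI
    cudaStreamSynchronize(stream);
    MPI_Isend(buf, count, MPI_BYTE, dest, /*tag=*/0, comm, req);
}
```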

Testing strategy

Multi-GPU TPCH with two ranks assigned to the same GPU reliably reproduces the issue, though it can also occur with multiple GPUs and one rank per GPU. Both cases should now pass.

User facing changes

Checklist

  • PR title contains "[GPU]" if changes target Bodo DataFrames GPU acceleration.
  • Pipelines passed before requesting review. To run CI you must include [run CI] in your commit message.
  • I am familiar with the Contributing Guide
  • I have installed and run pre-commit hooks.

@scott-routledge2 scott-routledge2 changed the title from "[GPU] Make sure GPU buffers are ready before calling MPI" to "[GPU][BSE-5357] Make sure GPU buffers are ready before calling MPI" on Mar 27, 2026
Collaborator

@DrTodd13 DrTodd13 left a comment


thanks.

Comment thread bodo/libs/gpu_utils.cpp

// Make sure GPU buffers are ready before passing to MPI
// TODO(BSE-5359): Make this check async
CHECK_CUDA(cudaStreamSynchronize(stream));
Collaborator


Can we just check if the stream is done and, if not, skip this function or report that we have nothing to send, rather than just waiting here? Is there a reason that waiting is required here?

Contributor


I don't think there's a reason we can't. I created a follow-up issue for that (BSE-5359), which can be done next.
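The non-blocking variant discussed above (tracked as BSE-5359) could poll the stream instead of blocking on it. A hypothetical sketch of that idea, not code from this PR:

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch for BSE-5359: report whether all work previously
// enqueued on `stream` has finished, without blocking the caller.
// The caller could skip the MPI send (or report "nothing to send yet")
// when this returns false, and retry on a later iteration.
bool gpu_buffers_ready(cudaStream_t stream) {
    cudaError_t err = cudaStreamQuery(stream);
    if (err == cudaSuccess) {
        return true;   // all prior work on the stream has completed
    }
    if (err == cudaErrorNotReady) {
        return false;  // work still in flight; check again later
    }
    // Any other code is a real CUDA error; surface it to the caller's
    // error handling (e.g. the repo's CHECK_CUDA macro).
    return false;
}
```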


codecov Bot commented Mar 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 62.13%. Comparing base (9c48231) to head (b03532e).
⚠️ Report is 27 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1075      +/-   ##
==========================================
- Coverage   68.18%   62.13%   -6.06%     
==========================================
  Files         195      195              
  Lines       68080    68104      +24     
  Branches     9708     9713       +5     
==========================================
- Hits        46423    42318    -4105     
- Misses      18841    22860    +4019     
- Partials     2816     2926     +110     

@scott-routledge2 scott-routledge2 merged commit ec80a51 into main Apr 1, 2026
22 of 31 checks passed
@scott-routledge2 scott-routledge2 deleted the ehsan/mpi_sync branch April 1, 2026 21:21

4 participants