
MPI run on GPU fails #296

@dineshadepu

Description


Hi all,

I am looking into using MPI with Cabana for some simulations that need MPI on the Cabana side, and for further expansion.

I followed the same approach to particle communication that CabanaPD uses, and ran into a runtime error.

So I tried running PowderFill with similar commands. CabanaPD also fails for an MPI run on a single GPU.

Here is the log:


dineshadepu@dwi199a (main) /home/dineshadepu/life/softwares/ecp/CabanaPD/build $  
|  lab desktop => mpirun -n 2 ./examples/dem/PowderFill ../examples/dem/inputs/powder_fill.json 
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false

MPI detected: For OpenMP binding to work as intended, MPI ranks must be bound to exclusive CPU sets.

Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false

MPI detected: For OpenMP binding to work as intended, MPI ranks must be bound to exclusive CPU sets.

Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 10.0 on device with compute capability 12.0 , this will likely reduce potential performance.
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 10.0 on device with compute capability 12.0 , this will likely reduce potential performance.
[dwi199a:44403:0:44403] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f72cedecc80)
[dwi199a:44402:0:44402] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fb6eeded880)
==== backtrace (tid:  44403) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x304) [0x7f74fa3c35b4]
 1  /lib64/libucs.so.0(+0x2c691) [0x7f74fa3c6691]
 2  /lib64/libucs.so.0(+0x2c85a) [0x7f74fa3c685a]
 3  /lib64/libc.so.6(+0x41000) [0x7f74f9469000]
 4  /lib64/libc.so.6(+0x16e80d) [0x7f74f959680d]
 5  /usr/lib64/openmpi/lib/libopen-pal.so.80(+0xa5b47) [0x7f74f99a6b47]
 6  /usr/lib64/openmpi/lib/libmpi.so.40(mca_pml_ob1_send_request_schedule_once+0x2d1) [0x7f74fa2620c1]
 7  /usr/lib64/openmpi/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_ack+0x115) [0x7f74fa26a4d5]
 8  /usr/lib64/openmpi/lib/libopen-pal.so.80(mca_btl_sm_poll_handle_frag+0x9a) [0x7f74f999df3a]
 9  /usr/lib64/openmpi/lib/libopen-pal.so.80(+0x9fb1c) [0x7f74f99a0b1c]
10  /usr/lib64/openmpi/lib/libopen-pal.so.80(opal_progress+0x34) [0x7f74f992c194]
11  /usr/lib64/openmpi/lib/libmpi.so.40(mca_pml_ob1_send+0x490) [0x7f74fa25d9a0]
12  /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Send+0x21a) [0x7f74fa0e153a]
13  ./examples/dem/PowderFill() [0x41bc82]
14  ./examples/dem/PowderFill() [0x4bbee6]
15  ./examples/dem/PowderFill() [0x4c1167]
16  ./examples/dem/PowderFill() [0x4c2c23]
17  ./examples/dem/PowderFill() [0x428760]
18  ./examples/dem/PowderFill() [0x40e4f8]
19  /lib64/libc.so.6(+0x2a58e) [0x7f74f945258e]
20  /lib64/libc.so.6(__libc_start_main+0x89) [0x7f74f9452649]
21  ./examples/dem/PowderFill() [0x40fa75]
=================================
==== backtrace (tid:  44402) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x304) [0x7fb91fcf35b4]
 1  /lib64/libucs.so.0(+0x2c691) [0x7fb91fcf6691]
 2  /lib64/libucs.so.0(+0x2c85a) [0x7fb91fcf685a]
 3  /lib64/libc.so.6(+0x41000) [0x7fb918a69000]
 4  /lib64/libc.so.6(+0x16e80d) [0x7fb918b9680d]
 5  /usr/lib64/openmpi/lib/libopen-pal.so.80(+0xa5b47) [0x7fb919da6b47]
 6  /usr/lib64/openmpi/lib/libmpi.so.40(mca_pml_ob1_send_request_schedule_once+0x2d1) [0x7fb9198620c1]
 7  /usr/lib64/openmpi/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_ack+0x115) [0x7fb91986a4d5]
 8  /usr/lib64/openmpi/lib/libopen-pal.so.80(mca_btl_sm_poll_handle_frag+0x9a) [0x7fb919d9df3a]
 9  /usr/lib64/openmpi/lib/libopen-pal.so.80(+0x9fb1c) [0x7fb919da0b1c]
10  /usr/lib64/openmpi/lib/libopen-pal.so.80(opal_progress+0x34) [0x7fb919d2c194]
11  /usr/lib64/openmpi/lib/libmpi.so.40(mca_pml_ob1_send+0x490) [0x7fb91985d9a0]
12  /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Send+0x21a) [0x7fb9196e153a]
13  ./examples/dem/PowderFill() [0x41bc82]
14  ./examples/dem/PowderFill() [0x4bbee6]
15  ./examples/dem/PowderFill() [0x4c1167]
16  ./examples/dem/PowderFill() [0x4c2c23]
17  ./examples/dem/PowderFill() [0x428760]
18  ./examples/dem/PowderFill() [0x40e4f8]
19  /lib64/libc.so.6(+0x2a58e) [0x7fb918a5258e]
20  /lib64/libc.so.6(__libc_start_main+0x89) [0x7fb918a52649]
21  ./examples/dem/PowderFill() [0x40fa75]
=================================

It works fine on a single-rank run, though:

dineshadepu@dwi199a (main) /home/dineshadepu/life/softwares/ecp/CabanaPD/build $  
|  lab desktop => ./examples/dem/PowderFill ../examples/dem/inputs/powder_fill.json 
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false

Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 10.0 on device with compute capability 12.0 , this will likely reduce potential performance.
Local particles: 19141, Maximum neighbors: 21
#Timestep/Total-steps Simulation-time
1000/250000 2.00e-03
2000/250000 4.00e-03
3000/250000 6.00e-03
4000/250000 8.00e-03
5000/250000 1.00e-02
6000/250000 1.20e-02
7000/250000 1.40e-02
8000/250000 1.60e-02
9000/250000 1.80e-02
10000/250000 2.00e-02
11000/250000 2.20e-02
12000/250000 2.40e-02
13000/250000 2.60e-02
14000/250000 2.80e-02

Is this something only I am facing, or are you able to get it working?

I am using a Rocky Linux machine with the following GPU (nvidia-smi excerpt):

|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5080        On  |   00000000:01:00.0  On |                  N/A |
|  0%   49C    P0             48W /  360W |    1601MiB /  16303MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Any idea where I am going wrong, or is there a specific way of installing it to make this work?
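My guess (unconfirmed) is that the crash inside MPI_Send could come from passing GPU device buffers to an Open MPI build that lacks CUDA-aware support; "invalid permissions for mapped object" is consistent with that. One way I could check on my side (assuming Open MPI's `ompi_info` tool is on the PATH) would be:

```shell
# Query Open MPI's build flags for CUDA-aware support.
# "mpi_built_with_cuda_support:value:true" means device pointers can be
# passed directly to MPI calls; "false" (or no output) means they cannot.
ompi_info --parsable --all 2>/dev/null | grep -i cuda_support \
  || echo "cuda_support flag not reported (ompi_info missing or non-CUDA build)"
```

If the build turns out not to be CUDA-aware, the usual options would be staging communication buffers in host memory, or rebuilding Open MPI with the `--with-cuda` configure flag.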

Thank you very much.
