Add hybrid MPI+OpenMP support #826
rupertnash
left a comment
Thanks for the work! I will have to review your dissertation to get the full details of the change in performance, but there are a few minor code problems to fix before this can be considered for a merge. I notice that you haven't touched the initialisation code (in `lb::InitialCondition`), which can make a huge difference in performance when NUMA effects are in play (due to the typical first-touch page allocation strategy).
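The first-touch point can be illustrated with a minimal sketch, assuming the common default policy in which a page is placed on the NUMA node of the thread that first writes to it. `SiteField`, `MakeFirstTouched` and `Relax` are hypothetical names for illustration, not HemeLB code. The idea is only that the initialisation loop uses the same static schedule as the later compute loop, so each thread first touches the sites it will go on to update.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a per-site field; not HemeLB's actual data structure.
struct SiteField {
    std::size_t size;
    std::unique_ptr<double[]> data;
};

// Allocate without initialising, then initialise in parallel so that each
// thread performs the first write to the pages it will use later.
SiteField MakeFirstTouched(std::size_t n, double initial) {
    // new double[n] default-initialises, i.e. leaves the elements unwritten,
    // so no page is touched by the allocating thread here.
    SiteField field{n, std::unique_ptr<double[]>(new double[n])};
    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(n);
    double* const values = field.data.get();
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < count; ++i) {
        values[i] = initial;
    }
    return field;
}

// A later compute loop with the same static schedule and iteration count,
// so threads revisit the index ranges they initialised.
void Relax(SiteField& field, double factor) {
    assert(field.data != nullptr);
    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(field.size);
    double* const values = field.data.get();
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < count; ++i) {
        values[i] *= factor;
    }
}
```

By contrast, constructing a `std::vector<double>(n)` value-initialises every element on the constructing thread, which would place all pages on that thread's NUMA node under a first-touch policy.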
Need to use `find_package(OpenMP)` and then `target_link_libraries(... OpenMP::OpenMP_CXX)`.
Unacceptable use of `ifdef` in new code. Should refactor to minimise the code that is different when OpenMP is enabled. If different code is required, use `if constexpr`.
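One way this refactoring could look is sketched below, assuming C++17. The constant `kUseOpenMP` and the helpers `ForEachSite` and `ScaleAll` are hypothetical; in practice the constant would presumably come from a header generated by the build system. The per-site kernel is written once, and only the loop wrapper differs between the threaded and serial paths.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: a build-generated config header would define this constant
// from the OpenMP build option. It is hard-coded here for illustration.
inline constexpr bool kUseOpenMP = false;

// Apply kernel(i) for every i in [begin, end). The kernel body is shared;
// only the loop wrapper depends on the compile-time flag.
template <bool UseOpenMP, typename Kernel>
void ForEachSite(std::ptrdiff_t begin, std::ptrdiff_t end, Kernel kernel) {
    assert(begin <= end);
    if constexpr (UseOpenMP) {
        // The pragma is ignored when the compiler is run without OpenMP
        // support, so this branch still compiles in that case.
        #pragma omp parallel for schedule(static)
        for (std::ptrdiff_t i = begin; i < end; ++i) {
            kernel(i);
        }
    } else {
        for (std::ptrdiff_t i = begin; i < end; ++i) {
            kernel(i);
        }
    }
}

// Example caller: no preprocessor conditionals at the call site.
void ScaleAll(std::vector<double>& values, double factor) {
    ForEachSite<kUseOpenMP>(0, static_cast<std::ptrdiff_t>(values.size()),
                            [&](std::ptrdiff_t i) {
                                values[static_cast<std::size_t>(i)] *= factor;
                            });
}
```

The kernel must be safe to call concurrently for different indices, which holds here because each call writes a distinct element.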
Overview
This PR introduces hybrid parallelism by integrating OpenMP into the existing MPI-based code. The computationally intensive collision and streaming steps are parallelised with OpenMP loops to better exploit shared-memory parallelism within a node.
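As a rough illustration of what an OpenMP-parallel collide-and-stream loop looks like, here is a toy sketch on a one-dimensional, three-velocity (D1Q3) periodic lattice with a BGK collision. The lattice, the collision operator and all names (`Distributions`, `CollideAndStream`, `TotalMass`) are assumptions chosen for brevity and do not reflect HemeLB's kernels or data layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy D1Q3 lattice: velocities 0, +1, -1 with the standard weights.
constexpr int kQ = 3;
constexpr std::array<int, kQ> kVelocity = {0, 1, -1};
constexpr std::array<double, kQ> kWeight = {2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0};

using Distributions = std::vector<std::array<double, kQ>>;

// One BGK collide-and-stream step (push scheme, periodic boundaries).
// Reads fOld and writes fNew; every site must have non-zero density.
void CollideAndStream(const Distributions& fOld, Distributions& fNew, double omega) {
    assert(fNew.size() == fOld.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(fOld.size());
    // For a fixed direction q the map site -> dest is one-to-one, so no two
    // iterations write the same element and the loop is race-free.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t site = 0; site < n; ++site) {
        const auto& f = fOld[static_cast<std::size_t>(site)];
        const double rho = f[0] + f[1] + f[2];
        const double u = (f[1] - f[2]) / rho;
        for (int q = 0; q < kQ; ++q) {
            const double cu = kVelocity[q] * u;
            const double feq =
                kWeight[q] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * u * u);
            const double post = f[q] - omega * (f[q] - feq);
            const std::ptrdiff_t dest = (site + kVelocity[q] + n) % n;
            fNew[static_cast<std::size_t>(dest)][q] = post;
        }
    }
}

// Sum of all distributions; conserved by the step above up to rounding.
double TotalMass(const Distributions& f) {
    double mass = 0.0;
    for (const auto& site : f) {
        for (double value : site) {
            mass += value;
        }
    }
    return mass;
}
```

Writing into a separate output array keeps the iterations independent, which is what allows a plain `parallel for` without atomics or locks.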
Enabling OpenMP is configurable via the `-DHEMELB_USE_OPENMP=ON/OFF` build option. OpenMP is disabled by default.
Results
The pure MPI reference implementation delivers the best overall performance and scalability across the tested configurations, compilers and platforms. At low node counts, however, the hybrid OpenMP version shows promising results, slightly outperforming pure MPI. This suggests that a larger input geometry, with more lattice sites per rank and therefore more iterations per OpenMP loop, could still benefit from OpenMP.
For the full performance comparison, see the plots below.
ARCHER2
Figure 1: Hybrid parallelism: speedup for the retina dataset (40,000 time steps) on ARCHER2 using GNU compilers, 128 execution units per node.
Figure 2: Hybrid parallelism: speedup for the retina dataset (40,000 time steps) on ARCHER2 using Cray compilers, 128 execution units per node.
Figure 3: Hybrid parallelism: simulation time on 4 nodes on ARCHER2 using GNU compilers, 128 execution units per node.
Cirrus
Figure 4: Hybrid parallelism: speedup for the retina dataset (40,000 time steps) on Cirrus using GNU compilers, 128 execution units per node.