Minimal viable custom MPI for mesh decomposition#148

Merged
jacob-moore22 merged 52 commits into main from mesh_decomp on Dec 11, 2025

@jacob-moore22 commented on Nov 25, 2025

End-to-end mesh decomposition example plus MPI communication plan overhaul


Context / Why

  • We needed a reproducible, data-oriented example that demonstrates how MATAR can build, partition, and export a realistic mesh using PT-Scotch, so developers can use MPI without heavy third-party libraries.
  • The existing MPI support required building Trilinos, which is not fun.

Highlights

  • New mesh decomposition example (examples/mesh_decomp/)

    • Adds a complete pipeline (mesh_decomp.cpp, decomp_utils.h, mesh.h, state.h, mesh_io.h) that:
      • Builds or reads a global mesh on rank 0
      • Performs a naive scatter of elements/nodes, then re-partitions with PT-Scotch and populates ghost layers.
      • Writes per-rank VTU outputs with nodal/element diagnostic fields for visual debugging.
    • Ships a user-friendly bootstrapper (install_ptscotch.sh) so the example can clone/build Scotch/PT-Scotch in-place under examples/mesh_decomp/lib/.
    • Hooks the target into examples/CMakeLists.txt (behind MPI + KOKKOS) and introduces a local CMakeLists.txt that wires in the freshly built PT-Scotch archives with whole-archive linking to guarantee symbol resolution.
    • Documents and codifies mesh input selections via mesh_inputs.h, enabling future CLI/JSON integration.
    • .gitignore now drops examples/mesh_decomp/lib/* to keep the vendored PT-Scotch artifacts out of version control.
  • Communication plan rewrite (src/include/communication_plan.h)

    • Encapsulates all MPI distributed-graph setup within a CommunicationPlan struct: communicator handles, neighbor rank vectors, ragged send/recv index tables, counts, and displacements.
    • Provides verification helpers (verify_graph_communicator, verify_send_recv) to sanity-check topology and send/recv metadata—critical while chasing nodal comm bugs.
    • Leverages MATAR dual-view containers (DCArrayKokkos, DRaggedRightArrayKokkos) so the plan stays consistent across host/device.
    • Keeps the old implementation around (communication_plan_old.h) for staged migration.
  • MPI array container refactor (src/include/mpi_types.h)

    • Rebuilds MPICArrayKokkos on top of the new communication plan:
      • Type-trait driven MPI datatype mapping (mpi_type_map<>) replaces ad-hoc specialization calls.
      • Explicit send/recv buffers plus stride-aware packing/unpacking remove undefined behavior when tensor ranks > 1.
      • Adds hooks to initialize comm plans, populate buffers, and perform neighbor collectives (MPI_Neighbor_alltoallv).
    • Preserves the prior implementation as mpi_types_old.h for comparison/testing.
    • Cleans up includes (mapped_mpi_types.h no longer exposed) and clarifies the data-oriented ownership of host/device views.
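
To make the counts, displacements, and stride-aware packing described above concrete, here is a minimal sketch in plain C++. It is not the code from this PR: `SendPlanSketch`, `build_send_plan`, `pack`, and `unpack` are hypothetical names, and `std::vector` stands in for MATAR's DCArrayKokkos and DRaggedRightArrayKokkos. It assumes a row-major field with `stride` values per mesh item, and shows how a ragged per-neighbor index table could yield the per-neighbor counts and displacements that MPI_Neighbor_alltoallv expects.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the send-side metadata of a communication plan.
// This is an illustration only, not MATAR's CommunicationPlan.
struct SendPlanSketch {
    std::vector<std::vector<int>> send_indices; // ragged: local item ids per neighbor
    std::vector<int> counts;                    // number of values sent to each neighbor
    std::vector<int> displs;                    // start of each neighbor's segment in the buffer
};

// Derive counts and displacements (in units of values, not items) from the
// ragged index table. `stride` is the number of values stored per item,
// e.g. 3 for a nodal vector field.
SendPlanSketch build_send_plan(const std::vector<std::vector<int>>& send_indices, int stride)
{
    assert(stride > 0);
    SendPlanSketch plan;
    plan.send_indices = send_indices;
    int offset = 0;
    for (const auto& ids : send_indices) {
        const int count = static_cast<int>(ids.size()) * stride;
        plan.counts.push_back(count);
        plan.displs.push_back(offset);
        offset += count;
    }
    return plan;
}

// Pack a row-major field (num_items x stride) into one contiguous send buffer,
// neighbor by neighbor, copying every component of each listed item.
std::vector<double> pack(const SendPlanSketch& plan, const std::vector<double>& field, int stride)
{
    std::vector<double> buffer;
    for (const auto& ids : plan.send_indices) {
        for (int id : ids) {
            for (int c = 0; c < stride; ++c) {
                buffer.push_back(field[static_cast<std::size_t>(id) * stride + c]);
            }
        }
    }
    return buffer;
}

// Inverse of pack: copy a contiguous receive buffer into the listed (ghost)
// item slots of a row-major field, in the same neighbor-by-neighbor order.
void unpack(const std::vector<std::vector<int>>& recv_indices,
            const std::vector<double>& buffer,
            std::vector<double>& field, int stride)
{
    std::size_t pos = 0;
    for (const auto& ids : recv_indices) {
        for (int id : ids) {
            for (int c = 0; c < stride; ++c) {
                field[static_cast<std::size_t>(id) * stride + c] = buffer[pos++];
            }
        }
    }
}
```

In a real exchange the packed buffer, `counts`, and `displs` would be passed to MPI_Neighbor_alltoallv on the distributed-graph communicator, and the receive side would unpack with its own index table. Multiplying by the stride when computing counts is what keeps multi-component fields from being truncated to one value per item.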

Testing

| Scope | Status | Notes |
| --- | --- | --- |
| cmake --build <build_dir> --target mesh_decomp | ⚪️ not run here | Requires PT-Scotch; run after invoking examples/mesh_decomp/install_ptscotch.sh or letting CMake drive it. |
| mpirun -n <ranks> mesh_decomp | ⚪️ not run here | Validate both generated box meshes and external VTK inputs; inspect vtk/Fierro.*.vtu outputs per rank. |
| Existing unit/example targets | ⚪️ not run here | No functional changes expected outside MPI layers, but please kick test_kokkos_for, mtestkokkos, etc., if those touch MPICArrayKokkos. |

Follow-ups / Open Questions

  • Hook mesh_inputs.h into a true input parser so we can flip between generated and file-based meshes via CLI.
  • Consider persisting CommunicationPlan buffers with MPI_Neighbor_alltoallv_init to avoid per-step setup once the nodal exchange stabilizes.
  • Revisit GPU-aware MPI paths (HAVE_GPU_AWARE_MPI) now that the communication plan explicitly stages host buffers—there may be opportunities to skip host staging entirely.
  • Evaluate whether install_ptscotch.sh should pin Scotch revisions and add checksum verification for reproducibility.

Reference

  • Branch: mesh_decomp
  • Diverges from: origin/main
  • Key modified paths:
    • examples/mesh_decomp/{CMakeLists.txt,decomp_utils.h,mesh.h,mesh_io.h,mesh_decomp.cpp,mesh_inputs.h,state.h,install_ptscotch.sh}
    • examples/CMakeLists.txt
    • src/include/{communication_plan.h,communication_plan_old.h,mpi_types.h,mpi_types_old.h,mapped_mpi_types.h}
    • .gitignore, scripts/build-matar.sh

@jacob-moore22 self-assigned this on Nov 25, 2025
@jacob-moore22 added the enhancement (New feature or request) label on Nov 25, 2025
@jacob-moore22 mentioned this pull request on Nov 25, 2025

@nathanielmorgan left a comment

Great job!

@jacob-moore22 merged commit 4e9e562 into main on Dec 11, 2025
8 checks passed