fix: Infinite hang when MPI process crashes #4001

Draft
arng40 wants to merge 2 commits into develop from bugfix/dudes/endless-wait-crash-mpi


arng40 (Contributor) commented Mar 18, 2026

The bug occurs when a simulation is launched with multiple ranks and one of the ranks calls GEOS_THROW. This results in an infinite hang.
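
To make the failure mode concrete, here is a toy sketch that uses threads in place of MPI ranks. It assumes, without confirmation from this PR, that the hang comes from the surviving ranks waiting in a collective operation that the failing rank never reaches. `TimedBarrier` and `countStuckRanks` are hypothetical names, and the timeout exists only so the demonstration terminates; a real blocking collective would wait forever.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <stdexcept>
#include <thread>
#include <vector>

// Toy stand-in for a blocking collective (hypothetical, not an MPI or GEOS API).
// Returns true once every participant has arrived, false if the wait timed out.
class TimedBarrier
{
public:
  explicit TimedBarrier( int expected ) : m_expected( expected ) {}

  bool arriveAndWait( std::chrono::milliseconds timeout )
  {
    std::unique_lock< std::mutex > lock( m_mutex );
    ++m_arrived;
    if( m_arrived == m_expected )
    {
      m_cv.notify_all();
      return true;
    }
    return m_cv.wait_for( lock, timeout, [this] { return m_arrived == m_expected; } );
  }

private:
  std::mutex m_mutex;
  std::condition_variable m_cv;
  int m_expected;
  int m_arrived = 0;
};

// Runs numRanks threads as pretend ranks. The rank equal to failingRank throws
// before reaching the barrier (pass -1 for "no failure").
// Returns how many ranks were left waiting until the timeout.
int countStuckRanks( int numRanks, int failingRank )
{
  TimedBarrier barrier( numRanks );
  std::atomic< int > stuck{ 0 };
  std::vector< std::thread > ranks;

  for( int r = 0; r < numRanks; ++r )
  {
    ranks.emplace_back( [&, r]
    {
      try
      {
        if( r == failingRank )
        {
          throw std::runtime_error( "simulated failure on one rank" );
        }
        if( !barrier.arriveAndWait( std::chrono::milliseconds( 500 ) ) )
        {
          ++stuck;
        }
      }
      catch( std::exception const & )
      {
        // The failing rank exits here and never joins the barrier.
      }
    } );
  }

  for( auto & t : ranks )
  {
    t.join();
  }
  return stuck.load();
}
```

With four pretend ranks and one failure, `countStuckRanks( 4, 1 )` reports three ranks left waiting; with no failure, `countStuckRanks( 4, -1 )` reports none.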

arng40 self-assigned this Mar 18, 2026
arng40 added the labels "type: bug" and "flag: no rebaseline" Mar 18, 2026
arng40 changed the title from "bugfix: Infinite hang when MPI process crashes" to "fix: Infinite hang when MPI process crashes" Mar 18, 2026
jhuang2601 added the labels "ci: run CUDA builds", "ci: run integrated tests" and "ci: run code coverage" Mar 18, 2026

rrsettgast (Member) commented Mar 18, 2026

The second call to basicCleanup() was done intentionally in #3332 to avoid cleanup issues. Are you sure this doesn't reintroduce the issues that it fixed?

arng40 marked this pull request as draft March 19, 2026 08:33

MelReyCG (Contributor) left a comment

Ok, so let's change the strategy:

  • separate anything that is not about cleaning up from basicCleanup() into a new function (finalizeRun()? something else?),
  • call basicCleanup() in a finally clause to ensure it is always called, but any logging of a potential error must be done before that.
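
A minimal sketch of how the two points above could fit together. C++ has no `finally` clause, so this assumes a small scope guard plays that role. Every name here is a hypothetical stand-in (`runDriver`, `ScopeExit`, `g_events`, and the stub bodies of `finalizeRun()` and `basicCleanup()`); none of it is the actual GEOS code. The stubs only record the call order so the ordering can be checked.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Records the order of calls; illustration only.
std::vector< std::string > g_events;

// Hypothetical stubs. In this sketch finalizeRun() holds everything that is
// not teardown, and basicCleanup() holds teardown only.
void finalizeRun()  { g_events.push_back( "finalizeRun" ); }
void basicCleanup() { g_events.push_back( "basicCleanup" ); }

// Minimal scope guard: runs the callable when the scope is left,
// whether by a normal return or by an exception. Plays the role of "finally".
class ScopeExit
{
public:
  explicit ScopeExit( std::function< void() > fn ) : m_fn( std::move( fn ) )
  {
    assert( m_fn && "ScopeExit needs a non-empty callable" );
  }
  ~ScopeExit() { m_fn(); }
  ScopeExit( ScopeExit const & ) = delete;
  ScopeExit & operator=( ScopeExit const & ) = delete;

private:
  std::function< void() > m_fn;
};

// Hypothetical driver: returns 0 on success, 1 if the simulation threw.
int runDriver( std::function< void() > const & simulation )
{
  // Declared first, so it is destroyed last: cleanup runs after the
  // try/catch below has finished, on both the success and the error path.
  ScopeExit cleanupGuard( [] { basicCleanup(); } );

  try
  {
    simulation();
    finalizeRun();   // only reached when the run succeeded
    return 0;
  }
  catch( std::exception const & e )
  {
    // The error is logged here, before the guard triggers basicCleanup().
    g_events.push_back( std::string( "error: " ) + e.what() );
    return 1;
  }
}
```

With this shape, a run that throws records the error first and the cleanup second, and a clean run records `finalizeRun` followed by `basicCleanup`. The sketch does not model how the other ranks learn about the failure; in a real MPI program the error path would also need something like `MPI_Abort`, and whether that belongs in this PR is an open question here.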


Labels

  • ci: run code coverage (enables running of the code coverage CI jobs)
  • ci: run CUDA builds (allows triggering the costly CUDA jobs)
  • ci: run integrated tests (allows running the integrated tests in GEOS CI)
  • flag: no rebaseline (does not require rebaseline)
  • type: bug (something isn't working)

Projects

None yet


4 participants