
Bring in climo file hanging during creation fix #453

Merged

justin-richling merged 1 commit into NCAR:main from justin-richling:climo-hang-fix, May 14, 2026

Conversation

@justin-richling
Collaborator

justin-richling commented May 11, 2026

The current code can cause climo file creation to hang. The main improvement here is that the new implementation isolates worker processes much more cleanly, preventing the nested-parallelism and shared-memory issues that can cause hangs or deadlocks.

Major Changes:


  1. Switched from the default multiprocessing context to 'spawn', which creates a completely fresh Python interpreter for each worker instead of cloning the parent's memory state.

That avoids inherited corrupted state and is much safer for scientific I/O workflows.

  2. Reduced what gets passed into workers

The original code passed the entire ADF object into every worker process.

That can be problematic because:

  • large Python objects must be pickled
  • custom objects sometimes fail serialization
  • the object may contain:
    • open file handles
    • locks
    • loggers
    • multiprocessing-unsafe state
    • dask objects

This can create hangs during worker initialization.

The new version passes only a simple string for adf_user.
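A hedged sketch of the idea (argument names other than adf_user are hypothetical): the worker signature accepts only small, trivially picklable values instead of the whole ADF object:

```python
# Instead of shipping the entire ADF object (open file handles, locks,
# loggers, dask state) into each worker, pass only plain strings.
def make_climo(adf_user: str, input_path: str, output_path: str) -> str:
    # Plain strings pickle trivially and carry no
    # multiprocessing-unsafe state across the process boundary.
    return f"{adf_user}: {input_path} -> {output_path}"

# With a pool this would be pool.starmap(make_climo, task_args);
# shown serially here for illustration.
task_args = [("someuser", "ts/case1.nc", "climo/case1.nc")]
results = [make_climo(*args) for args in task_args]
```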

  3. Imports moved inside worker function

With multiprocessing, especially 'spawn', importing libraries inside workers can help avoid:

  • inherited thread state
  • inherited dask schedulers
  • inherited HDF5 state
  4. Explicitly disabled dask multithreading

Previously, dask's internal multithreading nested inside the pool's process-level parallelism, which can lead to:

  • oversubscription
  • thread contention
  • filesystem stalls
  • hangs

Now each worker process computes serially internally.
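As a configuration fragment (the exact dask settings used in the real code aren't shown here, so treat this as an assumption), forcing serial computation inside a worker looks like:

```python
import dask

# Force dask to compute serially inside each worker so the process
# pool's parallelism does not nest with dask's own thread pool.
dask.config.set(scheduler="synchronous")
```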

  5. Switched fully to open_mfdataset(..., chunks=...)

The new version forces dask-backed lazy arrays.
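A hedged sketch of the call (the file paths, chunk sizes, and combine argument are illustrative, not taken from the PR diff); it is a fragment that needs real NetCDF files to run:

```python
import xarray as xr

# chunks=... makes xarray return dask-backed lazy arrays instead of
# eagerly loading everything into memory.
ds = xr.open_mfdataset(
    ["ts/case1.000101.nc", "ts/case1.000201.nc"],  # hypothetical paths
    chunks={"time": 12},
    combine="by_coords",
)
```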

  6. Proper dataset cleanup using a context manager

This guarantees:

  • file handles close correctly
  • NetCDF locks are released
  • HDF5 resources are freed

Without explicit closure, workers can accumulate open file descriptors and eventually hang.
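A sketch of the pattern (the variable names and the climatology computation are assumptions; this fragment needs real input files to run):

```python
import xarray as xr

# The 'with' block guarantees the dataset is closed on exit, even if
# an exception is raised: file handles close, NetCDF locks are
# released, and HDF5 resources are freed.
with xr.open_mfdataset("ts/case1.*.nc", chunks={"time": 12}) as ds:
    climo = ds.groupby("time.month").mean("time")
    climo.to_netcdf("climo/case1_climo.nc")
```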

  7. Added explicit garbage collection

Scientific Python stacks sometimes retain:

  • large arrays
  • dask graphs
  • HDF5 references

Explicit garbage collection helps workers release memory sooner between tasks.

  8. Added exception handling inside workers

Previously a worker crash could:

  • silently terminate
  • freeze starmap
  • leave pool state inconsistent

Now failures are caught and reported cleanly.

@brianpm I tried to capture the core concepts of the changes and where they help fix the code; please let me know if I'm off or if I messed something up!

EDIT: These changes were supplied by @brianpm!

Sometimes the climo file generation hangs; this seems to consistently fix the issue
@justin-richling justin-richling added bug Something isn't working analysis Related to data analysis and statistics high priority This needs to be done ASAP labels May 11, 2026
@justin-richling justin-richling requested a review from brianpm May 12, 2026 21:47
Collaborator

brianpm left a comment


This looks like a clean implementation of what I was doing!

@justin-richling
Collaborator Author

@brianpm These were your changes! I forgot to tag you for credit.

@justin-richling justin-richling merged commit baf471b into NCAR:main May 14, 2026
7 checks passed
@justin-richling justin-richling deleted the climo-hang-fix branch May 14, 2026 15:56