
Bring in climo file hanging during creation fix #453

Merged

justin-richling merged 1 commit into NCAR:main from justin-richling:climo-hang-fix, May 14, 2026

Conversation

@justin-richling
Collaborator

justin-richling commented May 11, 2026

The current code can cause climo file creation to hang. The main improvement here is that the new implementation isolates worker processes much more cleanly, preventing the nested-parallelism and shared-memory issues that can cause hangs or deadlocks.

Major Changes:


  1. Switched from the default multiprocessing context to 'spawn', which creates a completely fresh Python interpreter for each worker instead of cloning the parent's memory state.

That avoids inherited corrupted state and is much safer for scientific I/O workflows.

  2. Reduced what gets passed into workers

The original code passed the entire ADF object into every worker process.

That can be problematic because:

  • large Python objects must be pickled
  • custom objects sometimes fail serialization
  • the object may contain:
    • open file handles
    • locks
    • loggers
    • multiprocessing-unsafe state
    • dask objects

This can create hangs during worker initialization.

The new version passes only a simple string for adf_user.
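A hedged sketch of the idea (argument names other than adf_user are hypothetical): the worker signature accepts only small, trivially picklable values instead of the whole ADF object:

```python
# Instead of shipping the entire ADF object (open file handles, locks,
# loggers, dask state) into each worker, pass only plain strings.
def make_climo(adf_user: str, input_path: str, output_path: str) -> str:
    # Plain strings pickle trivially and carry no
    # multiprocessing-unsafe state across the process boundary.
    return f"{adf_user}: {input_path} -> {output_path}"

# With a pool this would be pool.starmap(make_climo, task_args);
# shown serially here for illustration.
task_args = [("someuser", "ts/case1.nc", "climo/case1.nc")]
results = [make_climo(*args) for args in task_args]
```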

  3. Imports moved inside worker function

With multiprocessing, especially 'spawn', importing libraries inside workers can help avoid:

  • inherited thread state
  • inherited dask schedulers
  • inherited HDF5 state
  4. Explicitly disabled dask multithreading

Previously, dask's internal multithreading nested inside the pool's process-level parallelism, which can lead to:

  • oversubscription
  • thread contention
  • filesystem stalls
  • hangs

Now each worker process computes serially internally.
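As a configuration fragment (the exact dask settings used in the real code aren't shown here, so treat this as an assumption), forcing serial computation inside a worker looks like:

```python
import dask

# Force dask to compute serially inside each worker so the process
# pool's parallelism does not nest with dask's own thread pool.
dask.config.set(scheduler="synchronous")
```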

  5. Switched fully to open_mfdataset(..., chunks=...)

The new version forces dask-backed lazy arrays.
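A hedged sketch of the call (the file paths, chunk sizes, and combine argument are illustrative, not taken from the PR diff); it is a fragment that needs real NetCDF files to run:

```python
import xarray as xr

# chunks=... makes xarray return dask-backed lazy arrays instead of
# eagerly loading everything into memory.
ds = xr.open_mfdataset(
    ["ts/case1.000101.nc", "ts/case1.000201.nc"],  # hypothetical paths
    chunks={"time": 12},
    combine="by_coords",
)
```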

  6. Proper dataset cleanup using a context manager

This guarantees:

  • file handles close correctly
  • NetCDF locks are released
  • HDF5 resources are freed

Without explicit closure, workers can accumulate open file descriptors and eventually hang.
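A sketch of the pattern (the variable names and the climatology computation are assumptions; this fragment needs real input files to run):

```python
import xarray as xr

# The 'with' block guarantees the dataset is closed on exit, even if
# an exception is raised: file handles close, NetCDF locks are
# released, and HDF5 resources are freed.
with xr.open_mfdataset("ts/case1.*.nc", chunks={"time": 12}) as ds:
    climo = ds.groupby("time.month").mean("time")
    climo.to_netcdf("climo/case1_climo.nc")
```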

  7. Added explicit garbage collection

Scientific Python stacks sometimes retain:

  • large arrays
  • dask graphs
  • HDF5 references

Explicit garbage collection helps workers release memory sooner between tasks.

  8. Added exception handling inside workers

Previously a worker crash could:

  • silently terminate
  • freeze starmap
  • leave pool state inconsistent

Now failures are caught and reported cleanly.

@brianpm I tried to capture the core concepts of the changes and where they help fix the code; please let me know if I'm off or if I messed something up!

EDIT: These changes were supplied by @brianpm!

Sometimes the climo file generation hangs; this seems to consistently fix the issue
@justin-richling justin-richling added bug Something isn't working analysis Related to data analysis and statistics high priority This needs to be done ASAP labels May 11, 2026
@justin-richling justin-richling requested a review from brianpm May 12, 2026 21:47
Collaborator

brianpm left a comment


This looks like a clean implementation of what I was doing!

@justin-richling
Collaborator Author

@brianpm These were your changes! I forgot to tag you for credit.

@justin-richling justin-richling merged commit baf471b into NCAR:main May 14, 2026
7 checks passed
@justin-richling justin-richling deleted the climo-hang-fix branch May 14, 2026 15:56