Memory optimization NOMAD nexus parser #750

Open
mkuehbach wants to merge 26 commits into master from mem_optimization_parsing

Conversation

@mkuehbach
Collaborator

@mkuehbach mkuehbach commented Mar 17, 2026

Planned for v0.14.0

Addressing one aspect of issue #737:

Key changes for non-scalar iuf HDF5 datasets (integer, unsigned, float):

  • Using float64 precision throughout for the computation of the naive arithmetic mean
  • Dropping the standard deviation field because i) for a chunked storage layout it would require a second pass through the data, and ii) several Area B colleagues reported that it was never really used
  • Taking advantage of the chunked storage layout where datasets use it:
    loading chunk by chunk instead of entire datasets at once, thus completely flattening memory spikes
  • Added an example of how to compute a numerically more precise (non-naive) arithmetic mean and std (Welford's algorithm) for chunked storage, but deactivated it by default as it would incur a substantial performance hit.
  • Tested with a 10 GB mock NXem dataset: with the previous contiguous storage layout, the memory allocated for loading the dataset during parsing to compute the mean spiked at approx. 40-45 GB;
    with the chunked storage layout (with or without compression, both now supported), memory stays flat, with only a few MB held while iterating over chunks.
  • Additional benefit: with this, parsing runs about 2.5x faster than before when no compression is used; with compression there is some overhead.
  • Aim achieved: no memory spikes, and NeXus files larger than main memory are supported, provided they use chunked storage.
  • Choices for users with different storage layouts are considered at the dataset level, with multiple options: to chunk or not to chunk, to compress or not, gzip (sequential, current and future default) or blosc (multithreaded compression; see "Optimization related to compression, allowing multithreaded blosc2 to be used as a compression library" #747)
  • Unit tests added
  • An extension for complex data would be possible, but I would rather delegate this to another PR on "How to support data with complex numbers in pynxtools"; see "nomad/parser.py report when facing datatypes with precision that NOMAD does currently not support" #770
  • Fixed get_field handling of string types to distinguish them properly from plain objects/structs that Metainfo does not support; also deactivated parsing of such content.

Currently, most datasets written by pynxtools plugins use the simple contiguous storage layout. During parsing, datasets of dtype kind iufc are loaded fully into main memory, unpacked at once via hdf_node[...].
Even if a chunked storage layout is used, iterating over chunks (hyperslabs) is not taken advantage of.
Consequently, the old implementation unnecessarily loads entire datasets into main memory instead of processing them chunk by chunk. For large datasets, e.g. image and spectra stacks, the impact is significant and for laptops potentially a deal breaker: a flat RAM usage profile of a few MiB per chunk is preferable to GiB spikes that may exceed even the system's maximum RAM. Such spikes are particularly nasty on hosts shared by multiple users.
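
For illustration, here is a minimal sketch (not the actual implementation in this PR) of what such a chunk-wise pass over an h5py dataset can look like; the function name chunked_stats and the returned dictionary are hypothetical:

import h5py
import numpy as np

def chunked_stats(hdf_node: h5py.Dataset) -> dict:
    """Accumulate sum, min, max chunk by chunk; float64 accumulator for the mean."""
    if hdf_node.chunks is None:
        # contiguous layout: no chunk iterator available, fall back to a full read
        arr = hdf_node[...]
        return {"mean": float(np.mean(arr, dtype=np.float64)),
                "min": float(arr.min()), "max": float(arr.max())}
    total, count = np.float64(0.0), 0
    minimum, maximum = np.inf, -np.inf
    for slc in hdf_node.iter_chunks():  # yields a slice tuple per stored chunk
        chunk = hdf_node[slc]           # only this chunk is held in memory
        total += chunk.sum(dtype=np.float64)
        minimum = min(minimum, float(chunk.min()))
        maximum = max(maximum, float(chunk.max()))
        count += chunk.size
    return {"mean": total / count, "min": minimum, "max": maximum}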

Pitfalls: np.mean has optimized numerics (not only for speed but also to compensate precision loss); with chunking one may need one of the https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance (a sketch follows below).
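
For reference, a minimal sketch of what such an incremental computation could look like: welford_update is the classical per-value recurrence, while combine_chunks merges exact per-chunk statistics (Chan et al.'s pairwise formula). Neither is the code of this PR; both are illustrations only.

import numpy as np

def welford_update(count, mean, m2, new_value):
    # classical Welford: one scalar value at a time, numerically robust but sequential
    count += 1
    delta = new_value - mean
    mean += delta / count
    m2 += delta * (new_value - mean)  # uses the already updated mean
    return count, mean, m2

def combine_chunks(n_a, mean_a, m2_a, chunk):
    # merge the exact statistics of one freshly read chunk into the running totals
    chunk = np.asarray(chunk, dtype=np.float64)
    n_b = chunk.size
    mean_b = chunk.mean()
    m2_b = np.sum((chunk - mean_b) ** 2)
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

# after all chunks: population variance = m2 / n, i.e. std with ddof=0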

  • Implementation
  • Testing individual parts, specifically Welford part
  • Testing in production

…istics, for chunked need to compute by hand, take care for complex values
@mkuehbach mkuehbach changed the title from "Memory optimization parsing" to "Memory optimization NOMAD nexus parser" Mar 17, 2026
…e Welford needs incremental updates which won't end up getting vectorized. The only reason this comes up at all is that we wish to report the mean value (and show the population standard deviation). We should consider keeping only min and max and just picking the first value; that would avoid Welford and the like and ought to speed up parsing significantly. The mean value of a coordinate axis triplet 1, 0, 0 (i.e. 0.333) and its stdev is of limited value anyway. I understood @rettigl uses summary stats for navigation; this might need discussion to find a better compromise. Mind that np.mean and np.std, even on a contiguous array, are not free of imprecision and in edge cases may even be weaker than Welford. I am supportive, though, that for something that is just a value to show in the NOMAD GUI we should not invoke a potentially very costly algorithm. I kept both the incremental (non-vectorized) and the numpy batch implementation to support the discussion. My preference is to remove mean and stdev altogether; if people are interested in them, they should compute them in the pynxtools plugin, where they have the dataset in main memory at some point anyway. We should not abuse the parser for these computations, as currently, with a contiguous storage layout, they create memory consumption spikes.
@mkuehbach
Collaborator Author

Current practice is that non-scalar iuf h5py.Datasets in NOMAD get the mean value as the field value. This convention goes back to a suggestion by @sanbrock on how to reduce the number of Metainfo instances per NeXus concept when registering NeXus h5py.Dataset instances in Metainfo.

Also min, max, and the population standard deviation were computed (np.std is parameterized with ddof=0 by default). I would like to discuss whether we could live without mean and stdev and instead keep, of course, min, max, size, and ndim, and arbitrarily set the first value of the array in Metainfo.
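
For clarity, ddof=0 means the population standard deviation; a short numerical illustration (independent of the parser code) using the 1, 0, 0 triplet mentioned elsewhere in this thread:

import numpy as np

a = np.array([1.0, 0.0, 0.0])
np.mean(a)         # 0.3333...
np.std(a)          # 0.4714..., population std (ddof=0, divides by N)
np.std(a, ddof=1)  # 0.5773..., sample std (divides by N - 1)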

For complex non-scalar datasets we anyway have no statistics at the moment.

The motivation is that there is a tradeoff when reading efficiently from chunks: mean and stdev then need to be computed incrementally, which is not trivially vectorizable, in particular if we wish to stick with Python code.

I suggest that we rather offload all the summary statistics to the plugin and attach these as arrays if required.

That would let those who want a highly numerically accurate mean and std compute it in the plugin, with the parser then only performing a lookup; computing it in the parser was already very costly for the contiguous storage layout and does not get cheaper with chunked storage.

I would also like to encourage everybody to export their non-scalar arrays using the chunked storage layout.

Mind that this does not mean you need to use any compression: chunking is necessary for using HDF5 compression filters, but chunking alone does not imply compression.
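
As an illustration of that point, a minimal h5py sketch for exporting a chunked dataset with and without a compression filter (file and dataset names are made up; blosc via the optional hdf5plugin package is mentioned only as a possibility):

import h5py
import numpy as np

data = np.random.default_rng(1).random((100, 256, 256)).astype(np.float32)

with h5py.File("example.nxs", "w") as h5w:
    # chunked but uncompressed: enables chunk-wise reading without filter overhead
    h5w.create_dataset("chunked_only", data=data, chunks=(10, 256, 256))
    # chunked + gzip: chunking is a prerequisite for any HDF5 compression filter
    h5w.create_dataset("chunked_gzip", data=data, chunks=(10, 256, 256),
                       compression="gzip", compression_opts=4)
    # multithreaded blosc(2) would need an extra filter package such as hdf5plugin,
    # e.g. compression=hdf5plugin.Blosc2() (not part of h5py itself)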

Using chunking would enable us to reduce memory consumption peaks, especially if we agree to avoid computing mean and std during parsing.

Not urgent but thoughts would be appreciated @sherjeelshabih @rettigl @lukaspie @RubelMozumder @sanbrock

@mkuehbach mkuehbach marked this pull request as ready for review March 25, 2026 15:47
mkuehbach and others added 2 commits March 26, 2026 10:11
…#761)

* alternative approach, drop std and go with naive summation by default

* keep std deactivated

---------

Co-authored-by: mkuehbach <markus.kuehbach@physik.hu-berlin.de>
Collaborator

@RubelMozumder RubelMozumder left a comment


This PR needs atomic tests to verify that all changes still reflect the previous stats. I found some bugs while reading the code; probably there are a few left.

Comment thread src/pynxtools/nomad/parsers/parser.py Outdated
Comment thread src/pynxtools/nomad/parsers/parser.py Outdated
Comment thread src/pynxtools/nomad/parsers/parser.py Outdated
Comment thread src/pynxtools/nomad/parsers/parser.py Outdated
Comment thread src/pynxtools/nomad/parsers/parser.py Outdated
Comment thread src/pynxtools/nomad/parsers/parser.py
Comment thread src/pynxtools/nomad/parsers/parser.py
Comment thread src/pynxtools/nomad/schema_packages/schema.py
Collaborator

@rettigl rettigl left a comment


There are a lot of changes, many of which I don't completely understand. This certainly needs thorough testing of the different branches.
Removing std as a field statistic is okay for me, as I never used it in an app or the like. But I think that, under the assumption of statistical independence of the different chunks, we could also cheaply provide an estimate.

Comment thread src/pynxtools/nomad/parsers/parser.py Outdated
Comment thread src/pynxtools/nomad/parsers/parser.py
Comment thread src/pynxtools/nomad/parsers/parser.py
Comment thread src/pynxtools/nomad/parsers/parser.py
Comment thread src/pynxtools/nomad/parsers/parser.py Outdated
Comment thread src/pynxtools/nomad/parsers/parser.py
Comment thread src/pynxtools/nomad/parsers/parser.py Outdated
Comment thread tests/nomad/welford_test.ipynb Outdated
@mkuehbach
Collaborator Author

mkuehbach commented Apr 1, 2026

Hi @rettigl, as you are currently reviewing (thanks for this): would you mind if we replaced the mean value that is used as a representative value for array datasets with 0.5 * (min + max)? Using the mean for arrays was always a crude workaround anyway, and showing just the first value of a multidimensional array is equally questionable. Dropping array datasets altogether is, I think, too hard a step, one I had the impression we do not want to pursue at this point, right?

I will work through your feedback later, currently still on #752

@rettigl
Collaborator

rettigl commented Apr 1, 2026

Hi @rettigl, as you are currently reviewing (thanks for this): would you mind if we replaced the mean value that is used as a representative value for array datasets with 0.5 * (min + max)? Using the mean for arrays was always a crude workaround anyway, and showing just the first value of a multidimensional array is equally questionable. Dropping array datasets altogether is, I think, too hard a step, one I had the impression we do not want to pursue at this point, right?

I will work through your feedback later, currently still on #752

I think we should discuss what this might be useful for in the first place. If anything, I think a mean value is still the most universal 1D metric, even though I admittedly also never used it.

@mkuehbach
Collaborator Author

I think we should discuss what this might be useful for in the first place. If anything, I think a mean value is still the most universal 1D metric, even though I admittedly also never used it.

Thanks for clarifying that, I thought that you had been the most frequent user of these attributes but maybe using the axes string vector for filtering in the mpes app was already sufficient for distinguishing different types of datasets for you. Dunno why that thought of you being a strong advocate for mean and std remained in my mind, anyway that makes the implementation easier.

@rettigl
Collaborator

rettigl commented Apr 1, 2026

Dunno why that thought of you being a strong advocate for mean and std remained in my mind, anyway that makes the implementation easier.

The original motivation came from me, but mostly about the min/max/ndim fields, which I use in the mpes app.

@mkuehbach
Collaborator Author

Dunno why that thought of you being a strong advocate for mean and std remained in my mind, anyway that makes the implementation easier.

The original motivation came from me, but mostly about the min/max/ndim fields, which I use in the mpes app.

Exactly that is what I recall; I was unsure about mean and std. Technically though, setting a value is required for a quantity, as min, max, ... are attributes of the quantity, so probably that's why Sandor then thought: well, why not use the mean.

…d value, ii) recover tests for complexfloating, iii) check that ci runs fine
Collaborator

@rettigl rettigl left a comment


I did not go through everything again, nor did I test it personally, but it looks okay to me now, apart from some comments on the tests.

Comment thread tests/nomad/test_parsing_opt.py
Comment thread tests/nomad/test_parsing_opt.py Outdated
dat = prng.random(size=n_values).astype(data_type).reshape(-1, 50, 50)
mean = np.asarray(
    np.sum(dat, dtype=np.float128) / np.float128(dat.size), dtype=data_type
).item()
Collaborator


This can also be moved outside the if/elif

Collaborator Author


Yes

Collaborator


I suggest adding tests for all statistics to verify proper functioning. Assessing the impact on memory consumption via automated testing is probably not so simple.

Collaborator Author


On the first point, I agree that it would be good to add.

On the second point, it is possible but involved; the tricky part is to really capture the peak memory consumption, which requires sampling. I think adding this as well in this PR is disproportionate. I agree though that it is a useful professionalization topic for production code, maybe an aspect to consider in FAIRmat II.

Collaborator


I think having the first test that Laurenz suggested would be needed for this PR.

Collaborator

@lukaspie lukaspie left a comment


I think it's fine, maybe another test should be added.

Comment on lines +162 to +184
# using a chunked storage layout does not necessarily demand the use of compression
# computing stats chunk by chunk enables an incremental updating of the stats
# while excellent for reducing memory consumption, a clear disadvantage is that
# measures typically implemented under the hood of np.mean and np.std cannot be used
# out of the box and be expected to yield the best possible numerical robustness
# and accuracy, given that how chunks stream in and how large they are can affect
# numerical precision
# an alternative in any case is using Welford's algorithm to prevent such
# catastrophic cancellation errors; this algorithm is inherently sequential though
# and thus, even if vectorized (a non-trivial implementation), eventually an order of
# magnitude more costly than np.mean and np.std, which are vectorized

# passing the mean as the representative of an array for NOMAD Metainfo is a choice
# that is also not without debate; it is questionable though whether this warrants
# using one of the most costly algorithms despite it being theoretically the
# most precise one

# the implementation here shows two approaches:
# i) classical Welford
# ii) vectorized implementation of the naive formula, summing over chunks
#     for computing mean and std, using np.float64 in the accumulator
# iii) note: using np.float128 precision would result in software emulation on the
#     hardware, making the computation substantially more costly than np.float64
Collaborator


These are way too many inline comments. I am generally not a fan of large inline comments; mostly they become outdated as the actual code changes. If this is something to be documented, it should go into the docs.

Collaborator


I think having the first test that Laurenz suggested would be needed for this PR.
