Conversation
…istics, for chunked need to compute by hand, take care for complex values
…e Welford needs incremental updates which won't end up getting vectorized. The only reason this comes up at all is that we wish to report the mean value (and show the population standard deviation). We should consider keeping only min and max and just picking the first value; that would avoid Welford and the like and ought to speed up the parsing significantly. The mean of a coordinate-axis value triplet 1, 0, 0 being 0.333 plus a stdev is of limited value anyway. I understood @rettigl such that he uses the summary stats for navigation, so a better compromise might need discussion. Mind also that np.mean and np.std, even on a contiguous array, are not without imprecision and in edge cases may even be weaker than Welford. I am supportive, though, that for something that is just a value to show in the GUI in NOMAD we should not invoke a potentially very costly algorithm. I kept both the incremental (non-vectorized) and the numpy batch implementation to support the discussion. My preference is to remove mean and stdev altogether; if people are interested in them, they should compute them in a pynxtools plugin, where they have that dataset in main memory at some point anyway. We should not abuse the parser for these computations, as currently, with a contiguous storage layout, they create memory consumption spikes.
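For reference, a minimal sketch of Welford's update step (illustrative only, names made up), which shows why the loop is inherently sequential and not trivially vectorizable:

```python
import numpy as np

def welford_update(count: int, mean: float, m2: float, new_value: float):
    """One Welford step; population std is sqrt(m2 / count) after the last update."""
    count += 1
    delta = new_value - mean
    mean += delta / count
    m2 += delta * (new_value - mean)
    return count, mean, m2

# every update depends on the previous state, so the loop cannot simply be vectorized
count, mean, m2 = 0, 0.0, 0.0
for value in np.random.default_rng(42).random(1_000):
    count, mean, m2 = welford_update(count, mean, m2, float(value))
print(mean, np.sqrt(m2 / count))
```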
Current practice is that non-scalar iuf h5py.Dataset instances in NOMAD get the mean value as the field value. This is a convention going back to a suggestion from @sanbrock on how to reduce Metainfo instances per NeXus concept when registering NeXus h5py.Dataset instances in Metainfo. In addition, min, max, and the population standard deviation were computed (np.std defaults to ddof=0). I would like to discuss whether we could live without mean and stdev, keep of course min, max, size, and ndim, and arbitrarily set the first value of the array in Metainfo. For complex non-scalar datasets we have no statistics at the moment anyway. The motivation is a trade-off when reading efficiently from chunks: mean and stdev need to be computed incrementally, and that is not trivially vectorizable, in particular not if we wish to stick with Python code. I suggest that we rather offload all the summary statistics to the plugin and attach these as arrays if required. That would give those who wish to have a highly numerically accurate mean and std exactly that, with the parser ending up doing just a lookup; so far the computation was quite costly for the contiguous storage layout and does not get cheaper with chunked storage. I also would like to motivate everybody to export their non-scalar arrays using the chunked storage layout. Mind that this does not mean you need to use any compression: chunking is necessary for using HDF5 compression filters, but it is not on its own a sufficient criterion. Using chunking would enable us to reduce memory consumption peaks, especially if we agree to avoid computing mean and std in the parsing stage. Not urgent, but thoughts would be appreciated @sherjeelshabih @rettigl @lukaspie @RubelMozumder @sanbrock
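For reference, a minimal sketch (file name, group path, and chunk shape are made up) of writing a dataset with a chunked storage layout but without any compression filter:

```python
import h5py
import numpy as np

stack = np.random.default_rng(1).random((10, 256, 256))  # stand-in for an image stack

with h5py.File("example.nxs", "w") as h5w:
    # chunks=True lets h5py pick a chunk shape; no compression filter is involved
    h5w.create_dataset("entry/data/auto_chunked", data=stack, chunks=True)
    # or choose the chunk shape explicitly, e.g. one image per chunk
    h5w.create_dataset("entry/data/per_image", data=stack, chunks=(1, 256, 256))
```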
…#761) * alternative approach, drop std and go with naive summation by default * keep std deactivated --------- Co-authored-by: mkuehbach <markus.kuehbach@physik.hu-berlin.de>
RubelMozumder
left a comment
This PR needs tests to verify that all changes reproduce the previous stats. I found some bugs while reading the code; probably a few are left.
rettigl
left a comment
There are a lot of changes, and I don't completely understand many of them. This certainly needs thorough testing of the different branches.
Removing std as a field statistic is okay for me, as I never used it in an app or the like. But under the assumption of statistical independence of the different chunks, we could also cheaply provide an estimate.
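For illustration, a minimal sketch (names made up) of merging per-chunk count/mean/variance into global values via the pairwise formula from Chan et al.; under the independence assumption mentioned above, an even simpler pooled estimate would do as well:

```python
import numpy as np

def merge_stats(n_a, mean_a, var_a, n_b, mean_b, var_b):
    """Merge (count, mean, population variance) of two chunks into one."""
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = var_a * n_a + var_b * n_b + delta**2 * n_a * n_b / n
    return n, mean, m2 / n

# fold per-chunk statistics into a running global estimate
n, mean, var = 0, 0.0, 0.0
for chunk in np.array_split(np.random.default_rng(7).random(10_000), 13):
    n, mean, var = merge_stats(n, mean, var, chunk.size, chunk.mean(), chunk.var())
```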
Hi @rettigl, as you are currently reviewing (thanks for this): would you mind if we replace the … I will work through your feedback later; currently still on #752.
I think we should discuss what this might be useful for in the first place. If anything, I think a mean value is still the most universal 1D metric, even though I admittedly also never used it.
Thanks for clarifying that. I thought you had been the most frequent user of these attributes, but maybe using the axes string vector for filtering in the mpes app was already sufficient for distinguishing different types of datasets for you. Not sure why I thought you were a strong advocate for …
The original motivation came from me, but mostly about the min/max/ndim fields, which I use in the mpes app.
Exactly, that is what I recall; I was unsure about mean and std though. Technically, setting a value is required for a quantity, as min, max, ... are attributes of the quantity, so probably that's why Sandor then thought: well, why not use the …
…d value, ii) recover tests for complexfloating, iii) check that ci runs fine
…ontent as it is anyway not supported in either NeXus or Metainfo
rettigl
left a comment
I did not go through everything again, nor did I test it personally, but it looks okay to me now, apart from some comments on the tests.
dat = prng.random(size=n_values).astype(data_type).reshape(-1, 50, 50)
mean = np.asarray(
    np.sum(dat, dtype=np.float128) / np.float128(dat.size), dtype=data_type
).item()
This can also be moved outside the if/elif
I suggest adding tests for all statistics to verify they work properly. Assessing the impact on memory consumption via automated testing is probably not so simple.
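A possible shape for such a test, as a sketch only (dataset shape, chunking, and the inline chunk loop are placeholders for whatever helper the parser actually exposes), comparing chunk-wise statistics against the plain NumPy reference:

```python
import h5py
import numpy as np
import pytest


@pytest.mark.parametrize("dtype", [np.uint16, np.int32, np.float32, np.float64])
def test_chunked_stats_match_numpy(tmp_path, dtype):
    """Chunk-wise min/max/mean should agree with the plain NumPy reference."""
    data = (np.random.default_rng(0).random((20, 64, 64)) * 100).astype(dtype)
    path = tmp_path / "stats.h5"
    with h5py.File(path, "w") as h5w:
        h5w.create_dataset("data", data=data, chunks=(1, 64, 64))

    with h5py.File(path, "r") as h5r:
        dset = h5r["data"]
        total, minimum, maximum = 0.0, None, None
        for slc in dset.iter_chunks():  # one chunk in memory at a time
            chunk = dset[slc]
            total += chunk.sum(dtype=np.float64)
            minimum = chunk.min() if minimum is None else min(minimum, chunk.min())
            maximum = chunk.max() if maximum is None else max(maximum, chunk.max())

    assert minimum == data.min()
    assert maximum == data.max()
    assert np.isclose(total / data.size, data.mean(dtype=np.float64))
```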
On the first point, I agree that it would be good to add.
On the second point, it is possible but involved; the tricky part is really catching the peak memory consumption, which requires sampling. I think adding this as well in this PR is disproportionate. I agree, though, that it is a useful professionalization topic to include in production code, maybe an aspect to consider in FAIRmat II.
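To make the sampling point concrete, a sketch of one way peak RSS could be polled during a workload (assumes psutil would be acceptable as a test dependency; short-lived spikes between samples would still be missed):

```python
import threading
import time

import psutil  # assumption: psutil is available in the test environment


def sample_peak_rss(stop: threading.Event, peak: list, interval: float = 0.01) -> None:
    """Poll the resident set size until stop is set, keeping the maximum seen."""
    proc = psutil.Process()
    while not stop.is_set():
        peak[0] = max(peak[0], proc.memory_info().rss)
        time.sleep(interval)


stop, peak = threading.Event(), [0]
sampler = threading.Thread(target=sample_peak_rss, args=(stop, peak))
sampler.start()
# ... run the parsing workload under test here ...
stop.set()
sampler.join()
print(f"peak RSS during workload: {peak[0] / 2**20:.1f} MiB")
```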
I think having the first test that Laurenz suggested would be needed for this PR.
lukaspie
left a comment
I think it's fine, maybe another test should be added.
# using a chunked storage layout does not necessarily demand usage of compression
# computing stats using chunks enables an incremental updating of stats
# while excellent for reducing the memory consumption a clear disadvantage is that
# measures typically implemented under the hood of np.mean and np.std cannot be used
# out of the box and expected to yield the best possible numerical robustness
# and accuracy, given that how chunks stream in and how large they are can affect
# numerical precision
# an alternative in every case is using Welford's algorithm to prevent such
# catastrophic cancellation errors; this algorithm is inherently sequential though
# and thus, even if vectorized (non-trivial implementation), eventually an order of
# magnitude more costly than np.mean and np.std which are vectorized

# passing the mean as the representative of an array for NOMAD Metainfo is a choice
# that is also not without debate; it is questionable though whether this warrants
# using one of the most costly algorithms despite it being theoretically the
# most precise one

# the implementation here shows two approaches:
# i) classical Welford
# ii) vectorized implementation of the naive formula summing over chunks
#     for computing mean and std, using np.float64 in the accumulator though
# iii) using np.float128 precision would result in software emulation on the hardware
#      making the computation substantially more costly than for np.float64
These are way too many inline comments. I am generally not a fan of large inline comments; mostly, they can become outdated if the actual code changes. If this is something to be documented, it should go into the docs.
I think having the first test that Laurenz suggested would be needed for this PR.
Planned for v0.14.0
Addressing one aspect of issue #737:
Key changes for non-scalar iuf HDF5 datasets (integer, unsigned, float):
Loading chunk by chunk instead of loading datasets completely, thus flattening memory spikes.
Using a chunked storage layout (with or without compression, both are now supported), memory stays flat: only a few MB are kept while iterating over chunks.
nomad/parser.py: report when facing datatypes with precision that NOMAD does currently not support (#770)
Currently most datasets written by pynxtools plugins use the simple contiguous storage layout. Upon parsing, these are loaded fully into main memory, unpacked via hdf_node[...] at once if of dtype kind iufc.
Even if a chunked storage layout is used, iterating over chunks, respectively hyperslabs, is not taken advantage of.
Consequently, the old implementation unnecessarily loads entire datasets into main memory instead of processing them chunk by chunk. For large datasets, e.g. image and spectra stacks, the impact is significant and for laptops potentially a deal breaker: better keep a flat RAM usage profile of a few MiB per chunk than provoke GiB spikes that may exceed even the system's maximum RAM. These spikes are particularly nasty if hosts are shared by multiple users.
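As a sketch (file path, dataset path, and the process() callback are hypothetical), the chunk-aware reading pattern could look like this, falling back to hyperslab reads for contiguous datasets:

```python
import h5py


def process(block):
    """Hypothetical callback, e.g. updating min/max or picking the first value."""
    ...


with h5py.File("large_measurement.nxs", "r") as h5r:
    dset = h5r["entry/data/counts"]  # hypothetical path to a non-scalar iuf dataset
    if dset.chunks is not None:
        # chunked layout: stream chunk by chunk, only one chunk resident at a time
        for slc in dset.iter_chunks():
            process(dset[slc])
    else:
        # contiguous layout: still avoid dset[...] by reading slab by slab
        for start in range(0, dset.shape[0], 16):
            process(dset[start : start + 16])
```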
Pitfalls: np.mean has optimized numerics under the hood (not only for speed but also to compensate precision loss); with chunking one may need one of the approaches described in https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
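A small illustration (numbers made up) of why the naive single-pass formula accumulated over chunks can fail, which is the catastrophic cancellation the linked article describes:

```python
import numpy as np

# data with a large offset relative to its spread provokes cancellation in float32
data = np.float32(1.0e4) + np.random.default_rng(3).random(100_000).astype(np.float32)

total, total_sq = np.float32(0.0), np.float32(0.0)
for chunk in np.array_split(data, 100):
    total += chunk.sum(dtype=np.float32)
    total_sq += np.square(chunk).sum(dtype=np.float32)

naive_var = total_sq / data.size - (total / data.size) ** 2  # may even come out negative
print(naive_var, data.var(dtype=np.float64))  # compare against a float64 reference
```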