Conversation
…istics, for chunked need to compute by hand, take care for complex values
…e Welford needs incremental updates which won't end up getting vectorized. The only reason this comes up at all is that we wish to report the mean value (and show the population standard deviation). We should consider keeping only min and max and just picking the first value; that would avoid Welford and the like and ought to speed up the parsing significantly. The mean of a coordinate-axis value triplet 1, 0, 0 being 0.333 plus a stdev is of limited value anyway. I understood @rettigl such that he uses the summary stats for navigation, so a better compromise might need discussion. Mind also that np.mean and np.std, even on a contiguous array, are not without imprecision and in edge cases may even be weaker than Welford. I am supportive, though, that for something that is just a value to show in the GUI in NOMAD we should not invoke a potentially very costly algorithm. I kept both the incremental (non-vectorized) and the numpy batch implementation to support the discussion. My preference is to remove mean and stdev altogether; if people are interested in them, they should compute them in a pynxtools plugin, where they have that dataset in main memory at some point anyway. We should not abuse the parser for these computations, as currently, with a contiguous storage layout, they create memory consumption spikes.
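For reference, a minimal sketch of Welford's update step (illustrative only, names made up), which shows why the loop is inherently sequential and not trivially vectorizable:

```python
import numpy as np

def welford_update(count: int, mean: float, m2: float, new_value: float):
    """One Welford step; population std is sqrt(m2 / count) after the last update."""
    count += 1
    delta = new_value - mean
    mean += delta / count
    m2 += delta * (new_value - mean)
    return count, mean, m2

# every update depends on the previous state, so the loop cannot simply be vectorized
count, mean, m2 = 0, 0.0, 0.0
for value in np.random.default_rng(42).random(1_000):
    count, mean, m2 = welford_update(count, mean, m2, float(value))
print(mean, np.sqrt(m2 / count))
```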
Current practice is that non-scalar iuf h5py.Dataset instances in NOMAD get the mean value as the field value. This is a convention going back to a suggestion from @sanbrock on how to reduce Metainfo instances per NeXus concept when registering NeXus h5py.Dataset instances in Metainfo. In addition, min, max, and the population standard deviation were computed (np.std defaults to ddof=0). I would like to discuss whether we could live without mean and stdev, keep of course min, max, size, and ndim, and arbitrarily set the first value of the array in Metainfo. For complex non-scalar datasets we have no statistics at the moment anyway. The motivation is a trade-off when reading efficiently from chunks: mean and stdev need to be computed incrementally, and that is not trivially vectorizable, in particular not if we wish to stick with Python code. I suggest that we rather offload all the summary statistics to the plugin and attach these as arrays if required. That would give those who wish to have a highly numerically accurate mean and std exactly that, with the parser ending up doing just a lookup; so far the computation was quite costly for the contiguous storage layout and does not get cheaper with chunked storage. I also would like to motivate everybody to export their non-scalar arrays using the chunked storage layout. Mind that this does not mean you need to use any compression: chunking is necessary for using HDF5 compression filters, but it is not on its own a sufficient criterion. Using chunking would enable us to reduce memory consumption peaks, especially if we agree to avoid computing mean and std in the parsing stage. Not urgent, but thoughts would be appreciated @sherjeelshabih @rettigl @lukaspie @RubelMozumder @sanbrock
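For reference, a minimal sketch (file name, group path, and chunk shape are made up) of writing a dataset with a chunked storage layout but without any compression filter:

```python
import h5py
import numpy as np

stack = np.random.default_rng(1).random((10, 256, 256))  # stand-in for an image stack

with h5py.File("example.nxs", "w") as h5w:
    # chunks=True lets h5py pick a chunk shape; no compression filter is involved
    h5w.create_dataset("entry/data/auto_chunked", data=stack, chunks=True)
    # or choose the chunk shape explicitly, e.g. one image per chunk
    h5w.create_dataset("entry/data/per_image", data=stack, chunks=(1, 256, 256))
```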
…#761) * alternative approach, drop std and go with naive summation by default * keep std deactivated --------- Co-authored-by: mkuehbach <markus.kuehbach@physik.hu-berlin.de>
RubelMozumder
left a comment
This PR needs tests to verify that all changes reproduce the previous stats. I found some bugs while reading the code; probably a few are left.
rettigl
left a comment
There are a lot of changes, and I don't completely understand many of them. This certainly needs thorough testing of the different branches.
Removing std as a field statistic is okay for me, as I never used it in an app or the like. But under the assumption of statistical independence of the different chunks, we could also cheaply provide an estimate.
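For illustration, a minimal sketch (names made up) of merging per-chunk count/mean/variance into global values via the pairwise formula from Chan et al.; under the independence assumption mentioned above, an even simpler pooled estimate would do as well:

```python
import numpy as np

def merge_stats(n_a, mean_a, var_a, n_b, mean_b, var_b):
    """Merge (count, mean, population variance) of two chunks into one."""
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = var_a * n_a + var_b * n_b + delta**2 * n_a * n_b / n
    return n, mean, m2 / n

# fold per-chunk statistics into a running global estimate
n, mean, var = 0, 0.0, 0.0
for chunk in np.array_split(np.random.default_rng(7).random(10_000), 13):
    n, mean, var = merge_stats(n, mean, var, chunk.size, chunk.mean(), chunk.var())
```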
Hi @rettigl, as you are currently reviewing (thanks for this): would you mind if we replace the … I will work through your feedback later; currently still on #752.
I think we should discuss what this might be useful for in the first place. If anything, I think a mean value is still the most universal 1D metric, even though I admittedly also never used it.
Thanks for clarifying that. I thought you had been the most frequent user of these attributes, but maybe using the axes string vector for filtering in the mpes app was already sufficient for distinguishing different types of datasets for you. Not sure why I thought you were a strong advocate for …
The original motivation came from me, but mostly about the min/max/ndim fields, which I use in the mpes app.
Exactly, that is what I recall; I was unsure about mean and std though. Technically, setting a value is required for a quantity, as min, max, ... are attributes of the quantity, so probably that's why Sandor then thought: well, why not use the …
…d value, ii) recover tests for complexfloating, iii) check that ci runs fine
…ontent as it is anyway not supported in either NeXus or Metainfo
rettigl
left a comment
I did not go through everything again, nor did I test it personally, but it looks okay to me now, apart from some comments on the tests.
dat = prng.random(size=n_values).astype(data_type).reshape(-1, 50, 50)
mean = np.asarray(
    np.sum(dat, dtype=np.float128) / np.float128(dat.size), dtype=data_type
).item()
This can also be moved outside the if/elif
I suggest adding tests for all statistics to verify they work properly. Assessing the impact on memory consumption via automated testing is probably not so simple.
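A possible shape for such a test, as a sketch only (dataset shape, chunking, and the inline chunk loop are placeholders for whatever helper the parser actually exposes), comparing chunk-wise statistics against the plain NumPy reference:

```python
import h5py
import numpy as np
import pytest


@pytest.mark.parametrize("dtype", [np.uint16, np.int32, np.float32, np.float64])
def test_chunked_stats_match_numpy(tmp_path, dtype):
    """Chunk-wise min/max/mean should agree with the plain NumPy reference."""
    data = (np.random.default_rng(0).random((20, 64, 64)) * 100).astype(dtype)
    path = tmp_path / "stats.h5"
    with h5py.File(path, "w") as h5w:
        h5w.create_dataset("data", data=data, chunks=(1, 64, 64))

    with h5py.File(path, "r") as h5r:
        dset = h5r["data"]
        total, minimum, maximum = 0.0, None, None
        for slc in dset.iter_chunks():  # one chunk in memory at a time
            chunk = dset[slc]
            total += chunk.sum(dtype=np.float64)
            minimum = chunk.min() if minimum is None else min(minimum, chunk.min())
            maximum = chunk.max() if maximum is None else max(maximum, chunk.max())

    assert minimum == data.min()
    assert maximum == data.max()
    assert np.isclose(total / data.size, data.mean(dtype=np.float64))
```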
On the first point, I agree that it would be good to add.
On the second point, it is possible but involved; the tricky part is really catching the peak memory consumption, which requires sampling. I think adding this as well in this PR is disproportionate. I agree, though, that it is a useful professionalization topic to include in production code, maybe an aspect to consider in FAIRmat II.
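To make the sampling point concrete, a sketch of one way peak RSS could be polled during a workload (assumes psutil would be acceptable as a test dependency; short-lived spikes between samples would still be missed):

```python
import threading
import time

import psutil  # assumption: psutil is available in the test environment


def sample_peak_rss(stop: threading.Event, peak: list, interval: float = 0.01) -> None:
    """Poll the resident set size until stop is set, keeping the maximum seen."""
    proc = psutil.Process()
    while not stop.is_set():
        peak[0] = max(peak[0], proc.memory_info().rss)
        time.sleep(interval)


stop, peak = threading.Event(), [0]
sampler = threading.Thread(target=sample_peak_rss, args=(stop, peak))
sampler.start()
# ... run the parsing workload under test here ...
stop.set()
sampler.join()
print(f"peak RSS during workload: {peak[0] / 2**20:.1f} MiB")
```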
I think having the first test that Laurenz suggested would be needed for this PR.
lukaspie
left a comment
I think it's fine, maybe another test should be added.
# using a chunked storage layout does not necessarily demand usage of compression
# computing stats using chunks enables an incremental updating of stats
# while excellent for reducing the memory consumption a clear disadvantage is that
# measures typically implemented under the hood of np.mean and np.std cannot be used
# out of the box and expected to yield the best possible numerical robustness
# and accuracy, given that how chunks stream in and how large they are can affect
# numerical precision
# an alternative in every case is using Welford's algorithm to prevent such
# catastrophic cancellation errors; this algorithm is inherently sequential though
# and thus, even if vectorized (non-trivial implementation), eventually an order of
# magnitude more costly than np.mean and np.std which are vectorized

# passing the mean as the representative of an array for NOMAD Metainfo is a choice
# that is also not without debate; it is questionable though whether this warrants
# using one of the most costly algorithms despite it being theoretically the
# most precise one

# the implementation here shows two approaches:
# i) classical Welford
# ii) vectorized implementation of the naive formula summing over chunks
#     for computing mean and std, using np.float64 in the accumulator though
# iii) using np.float128 precision would result in software emulation on the hardware
#      making the computation substantially more costly than for np.float64
These are way too many inline comments. I am generally not a fan of large inline comments; mostly, they can become outdated if the actual code changes. If this is something to be documented, it should go into the docs.
I think having the first test that Laurenz suggested would be needed for this PR.
Planned for v0.14.0
Addressing one aspect of issue #737:
Key changes for non-scalar iuf HDF5 datasets (integer, unsigned, float):
Loading chunk by chunk instead of loading datasets completely, thus flattening memory spikes.
Using a chunked storage layout (with or without compression, both are now supported), memory stays flat: only a few MB are kept while iterating over chunks.
nomad/parser.py: report when facing datatypes with precision that NOMAD does currently not support (#770)
Currently most datasets written by pynxtools plugins use the simple contiguous storage layout. Upon parsing, these are loaded fully into main memory, unpacked via hdf_node[...] at once if of dtype kind iufc.
Even if a chunked storage layout is used, iterating over chunks, respectively hyperslabs, is not taken advantage of.
Consequently, the old implementation unnecessarily loads entire datasets into main memory instead of processing them chunk by chunk. For large datasets, e.g. image and spectra stacks, the impact is significant and for laptops potentially a deal breaker: better keep a flat RAM usage profile of a few MiB per chunk than provoke GiB spikes that may exceed even the system's maximum RAM. These spikes are particularly nasty if hosts are shared by multiple users.
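As a sketch (file path, dataset path, and the process() callback are hypothetical), the chunk-aware reading pattern could look like this, falling back to hyperslab reads for contiguous datasets:

```python
import h5py


def process(block):
    """Hypothetical callback, e.g. updating min/max or picking the first value."""
    ...


with h5py.File("large_measurement.nxs", "r") as h5r:
    dset = h5r["entry/data/counts"]  # hypothetical path to a non-scalar iuf dataset
    if dset.chunks is not None:
        # chunked layout: stream chunk by chunk, only one chunk resident at a time
        for slc in dset.iter_chunks():
            process(dset[slc])
    else:
        # contiguous layout: still avoid dset[...] by reading slab by slab
        for start in range(0, dset.shape[0], 16):
            process(dset[start : start + 16])
```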
Pitfalls: np.mean has optimized numerics under the hood (not only for speed but also to compensate precision loss); with chunking one may need one of the approaches described in https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
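A small illustration (numbers made up) of why the naive single-pass formula accumulated over chunks can fail, which is the catastrophic cancellation the linked article describes:

```python
import numpy as np

# data with a large offset relative to its spread provokes cancellation in float32
data = np.float32(1.0e4) + np.random.default_rng(3).random(100_000).astype(np.float32)

total, total_sq = np.float32(0.0), np.float32(0.0)
for chunk in np.array_split(data, 100):
    total += chunk.sum(dtype=np.float32)
    total_sq += np.square(chunk).sum(dtype=np.float32)

naive_var = total_sq / data.size - (total / data.size) ** 2  # may even come out negative
print(naive_var, data.var(dtype=np.float64))  # compare against a float64 reference
```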