
Memory optimization PYNXTOOLS validate#752

Open
mkuehbach wants to merge 20 commits into master from
mem_optimization_validation

Conversation

@mkuehbach
Collaborator

@mkuehbach mkuehbach commented Mar 19, 2026

Planned for v0.14.0

The other aspect of #737:

For HDF5-file-based validation we wish to reduce the number of cases in which a large h5py.Dataset payload needs to be loaded into main memory, especially when a chunked storage layout would allow iterating over chunks.

The motivation was running the standalone HDF validator on a 32 GiB file from CENEM, which contains essentially only one dataset (a large image stack of approx. 32 GiB uncompressed payload), on a 128 GiB RAM workstation. Although the image stack is stored chunked, all data gets loaded at some point, eventually generating multiple copies in main memory along the way, causing the validation process to terminate after trying to allocate more memory than is available on the workstation.

  • Adds a vscode launch configuration to debug the validation
  • Applies two key optimization steps: i) chunk-by-chunk reading while validating datasets and enums, and ii) a reorganization of the execution flow while validating enums, to remove unnecessary or inefficient loading and unpacking of entire datasets into main memory.
  • Drafts a further optimization for streamed reading in the positivity test for cases where reading a dataset in full is still required, as with a contiguous storage layout. The fact that contiguous storage can still take advantage of hyperslab-based reading allows streaming in approximately I/O-sized blocks; a simple heuristic sets the block_size, which could profit from further tuning for specific hardware and I/O patterns.
  • Evaluates situations where the assumption breaks that there is always either exactly one concept with the highest score or none. A score_board is now used, the old implementation is carried along, and the user is warned if best_namefit matches multiple concepts with the same highest score. Inspecting the matter revealed an imprecision in the constraining of NXem_* base classes, which was fixed in a separate feature branch of nexus_definitions, pulled in here for testing purposes and to show the full picture. Note that, when uncorrected, this was observed to lead to different execution paths for the same file and code (inspected via Hunt nondeterminism #768).
  • Tightens Iterator to Sequence in validate
  • Fixes tests
  • Adds new tests evaluating NX_POSINT and other elementary NeXus datatypes for contiguous, chunked_uncompressed, and chunked_compressed storage
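The chunk-by-chunk reading described above can be sketched as follows; this is a minimal illustration of the technique, not the actual pynxtools implementation, and the function name `all_positive_chunked` is hypothetical:

```python
import h5py
import numpy as np

def all_positive_chunked(dataset: h5py.Dataset) -> bool:
    """Validate NX_POSINT-style positivity without loading the full payload.

    For a chunked dataset, h5py's iter_chunks() yields slice tuples that
    address one storage chunk at a time, so peak memory stays at roughly
    one chunk instead of the whole (possibly huge) array.
    """
    if dataset.chunks is None:
        # Contiguous layout: fall back to a full read here.
        return bool(np.all(dataset[...] > 0))
    for chunk_slice in dataset.iter_chunks():
        if not np.all(dataset[chunk_slice] > 0):
            return False  # short-circuit on the first offending chunk
    return True
```

Short-circuiting on the first offending chunk also means invalid files fail fast instead of after a full scan.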

Key benefits:

  • Validation is substantially faster, as unnecessary I/O time and the time for prepping the payload as an np.array are removed
  • Validation has a substantially smaller and smoother memory footprint for arbitrarily large HDF5 files, provided a chunk-based storage layout is used.
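The drafted streamed positivity test for contiguous layouts can be sketched like this; the block-size heuristic and the name `all_positive_streamed` are illustrative assumptions, not the actual implementation:

```python
import h5py
import numpy as np

TARGET_BLOCK_BYTES = 8 * 1024 * 1024  # illustrative heuristic, tune per hardware

def all_positive_streamed(dataset: h5py.Dataset) -> bool:
    """Stream a contiguous dataset in hyperslab blocks along the first axis.

    Even without a chunked layout, HDF5 hyperslab selection lets us read
    a bounded number of rows at a time, keeping memory roughly constant.
    """
    if dataset.ndim == 0:
        return bool(dataset[()] > 0)
    row_bytes = dataset.dtype.itemsize * int(np.prod(dataset.shape[1:], dtype=np.int64))
    block_rows = max(1, TARGET_BLOCK_BYTES // max(1, row_bytes))
    for start in range(0, dataset.shape[0], block_rows):
        if not np.all(dataset[start : start + block_rows] > 0):
            return False
    return True
```

The 8 MiB target is only a placeholder for the heuristic mentioned above; real tuning would depend on the I/O pattern and hardware.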

@mkuehbach mkuehbach changed the title mem_optimization_validation Memory optimization PYNXTOOLS validate Mar 20, 2026
This was referenced Mar 30, 2026
…ersion of is_valid_enum which defers the loading of data to occur only when required; most h5py.Datasets do not qualify as an enum. This change could also be added to a future version of the template-based validation, though the performance improvement there is not expected to be as large: in template-based validation, prior to NeXus file writing, all data are already in main memory, whereas for HDF5 they first need to be loaded from disk. So far, clean_str_attr(dataset[()]) did that ALWAYS and for all datasets irrespective of their size, only to discard most of these data in the next two lines once the code figures out the dataset is not an instance of an enum concept.
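The reordering described in this commit, checking enum-ness first and loading data only if needed, can be sketched as follows; the function name is hypothetical, and a `load` callable stands in for the expensive `clean_str_attr(dataset[()])` read:

```python
from typing import Any, Callable, Optional, Sequence

def is_valid_enum_deferred(
    allowed: Optional[Sequence[Any]],
    load: Callable[[], Any],
) -> bool:
    """Validate an enum concept while deferring the (possibly huge) read.

    `allowed` is the enumeration from the NXDL concept, or None if the
    concept is not an enum. `load` performs the actual dataset read; it
    is only invoked when there is an enumeration to check against, so
    most datasets are never pulled into main memory at all.
    """
    if not allowed:   # most h5py.Datasets do not qualify as an enum
        return True   # nothing to check, and crucially nothing loaded
    value = load()    # only now do we touch the payload
    values = value if isinstance(value, (list, tuple)) else [value]
    return all(v in allowed for v in values)
```

The key point is that the non-enum branch returns before `load` is ever called, which is where the speedup comes from.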
…initial round of tests with float32; validation is now blazing fast for all three cases (contiguous, chunked uncompressed, chunked compressed). The substantial speedup is not a signature of failure or of bypassing certain function calls but expected: in the contiguous float32 test case, the concept entry1/measurement/event1/image1/real is float32 with 10 GB of uncompressed payload. Keep in mind that h5py and HDF5 always store datasets of elementary datatypes as type-homogeneous arrays, which means a test of whether the field is valid reduces to a simple check of the dtype attribute against the accepted ones; there is no need to load the data at all, unless the datatype is NX_POSINT. A similar speedup shows up compared to is_valid_enum in v0.13.2: in that version the data were always fully loaded first, only to be immediately discarded inside the function call in most cases, because the node did not encode an enum-type concept. The feature branch and commit tested here instead check FIRST whether the node is an enum, so the loading can be skipped completely, which essentially brings the speedup. We should test again with a concept of type NX_POSINT and our three storage schemes (contiguous, chunked compressed, chunked uncompressed); there we would expect a costly load of all data when contiguous, and a chunk-by-chunk integer positivity test during the is-valid-field HDF evaluation, but the baseline should remain flat. Next up: i) implement this test with posint and run again, ii) fix pytest from pynxtools, iii) move all testing code to proper tests and mark this feature branch as ready for review, i.e. move from draft to ready
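The dtype-only fast path described here can be illustrated without any I/O at all; the mapping below is an illustrative subset, not pynxtools' actual acceptance table:

```python
import numpy as np

# Illustrative subset of NeXus elementary types -> accepted numpy dtype kinds
ACCEPTED_KINDS = {
    "NX_FLOAT": ("f",),
    "NX_INT": ("i",),
    "NX_UINT": ("u",),
    "NX_NUMBER": ("f", "i", "u"),
    "NX_POSINT": ("i", "u"),  # dtype check alone is NOT sufficient here
}

def dtype_matches(dtype: np.dtype, nxdl_type: str) -> bool:
    """HDF5 datasets are type-homogeneous, so for most elementary NeXus
    types validity reduces to inspecting the dtype attribute; no data
    needs to be read. Only NX_POSINT additionally requires a value scan
    (e.g. the chunk-wise positivity test)."""
    return dtype.kind in ACCEPTED_KINDS.get(nxdl_type, ())
```

For example, a 10 GB float32 stack passes or fails NX_FLOAT purely on its metadata, which is why the contiguous case is just as fast here as the chunked ones.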
…sed in the code to work on datasets and non-string payloads, in which case the type annotations were imprecise
@mkuehbach mkuehbach marked this pull request as ready for review April 1, 2026 19:00
@mkuehbach mkuehbach requested review from sanbrock and removed request for sanbrock April 1, 2026 19:20
Comment thread .vscode/launch.json
"cwd": "${workspaceFolder}/..",
"justMyCode": false,
"args": [
"chunked_uncompressed.nxs",
Collaborator Author


reduce to one example

@mkuehbach mkuehbach requested review from lukaspie and rettigl April 4, 2026 01:20
Collaborator

@rettigl rettigl left a comment


I appreciate the effort, but for me this PR includes too many changes to properly review. The connection of many of these changes to the issue at hand (large memory consumption) is not clear to me. I suggest to split it up into several PRs.
Also, it changes the behavior of the validator by removing the acceptance of ints as floats, which we explicitly decided to allow at some point, as far as I remember.

isinstance(v.decode() if isinstance(v, bytes) else v, tuple(accepted_types))
for v in np.asarray(hdf_node[...]).flat
)

Collaborator


Not sure if this will work for any type of object data.

Collaborator Author


We should not support parsing of arbitrary object data; to be more precise, not any type of object data, which would include C-struct byte blobs.

I support allowing string data even though these already involve a dozen formatting combinations: h5py reports variable-length null-terminated UTF-8, but other combinations, such as fixed-length space-padded ASCII, are also valid strings. I agree that there should be examples for this, but if the PR is already perceived as too large, then that is certainly material for another PR.
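The bytes-vs-str normalization under discussion can be sketched as follows; `clean_str` is illustrative and not the actual clean_str_attr, and the padding-stripping behavior is an assumption:

```python
import numpy as np

def clean_str(value):
    """Normalize HDF5 string payloads to Python str.

    h5py may hand back bytes (fixed-length ASCII, possibly NUL- or
    space-padded) or str (variable-length UTF-8), scalar or array-valued;
    downstream validation should see plain str either way.
    """
    if isinstance(value, bytes):
        return value.decode("utf-8").rstrip("\x00 ")  # strip NUL/space padding
    if isinstance(value, np.ndarray):
        return [clean_str(v) for v in value.flat]
    return value
```

Note that np.bytes_ subclasses bytes, so elements of fixed-length string arrays take the decode path automatically.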

else:
return False
"""
return True
Collaborator


Would move this into the first if clause for better readability

Collaborator


agreed

Comment on lines +953 to +954
def is_valid_data_field_hdf(hdf_node: h5py.Dataset, nxdl_type: str, path: str):
"""Checks whether value of hdf_node is valid according to the type defined in the NXDL."""
Collaborator


Why is this needed now, was not there before, and how is it related to the memory optimization?

Collaborator Author


It is true that the PR started from memory optimization, but adding a quick fix focused only on that would surely have faced backlash as too sloppy. Hence the thorough revision of the function calls and the reorganization.
Before, we had is_valid_data_field, which immediately makes calls that fully unpack all data into memory.

The validation implementation so far, rightly, reused most functions that are also used in validate_dict, i.e. functions for which memory optimization was never an issue.

I agree there is some functional duplication between is_valid_data_field and is_valid_enum and their *_hdf variants; maybe making value respectively hdf_node a parameter that accepts several types, with internal type distinction, would be an alternative.

)


def is_valid_enum_hdf(
Collaborator


Same as above, how is this connected to the mem optimization?

Collaborator Author


Same comment as above; it is a matter of what we want: large, complex functions that deal with a multitude of use cases, or specialized functions that still keep the amount of code duplication as small as possible. Luckily we are using Python, so there is no need for data-type-specific template programming.


with File(file, "r") as h5file:
for entry, nxdl in def_map.items():
for entry, nxdl in sorted(def_map.items()):
Collaborator


Why this change? Is it related to the issue?

Collaborator Author

@mkuehbach mkuehbach Apr 7, 2026


It is related to issues I observed with non-deterministic behavior, so I went hard through the code to inspect each case where an implicit assumption that a certain order just works automagically is unsubstantiated. Even for HDF5, the sequence in which attributes are iterated over can be version dependent, given that both the low-level and high-level backends (hdf5, h5py) are involved. So for this change, and for changes in some other places, we should inspect hard whether the order is really always correct and the way we wish it to be. It may be that the sorted call here is redundant; I take it more as a note for us to check.

The fact is that the old implementation is indeed vulnerable to ordering effects if best namefit finds more than one concept with the same highest score. So something that started as a pragmatic PR for memory optimization indeed demanded closer inspection.
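The point about not relying on implicit ordering can be made concrete with a minimal illustration (the entry names are made up):

```python
# Dict iteration follows insertion order, which here depends on how the
# HDF5 backend happened to enumerate entries, so two runs (or two h5py
# versions) may visit entries differently. Sorting pins the order down.
def_map_run_a = {"entry2": "NXem", "entry1": "NXmpes"}
def_map_run_b = {"entry1": "NXmpes", "entry2": "NXem"}

# Raw iteration differs between the two enumeration orders...
assert list(def_map_run_a) != list(def_map_run_b)

# ...while sorted() yields the same deterministic sequence for both.
assert sorted(def_map_run_a.items()) == sorted(def_map_run_b.items())
```

If downstream logic (such as namefit tie-breaking) depends on visit order, this sorting is what makes validation reproducible across runs.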

pytest.param(
["src/pynxtools/data/201805_WSe2_arpes.nxs"],
[
"The value at /entry/collection_time should be one of the following Python types: (<class 'float'>, <class 'numpy.floating'>), as defined in the NXDL as NX_FLOAT.",
Collaborator


See above, should be auto converted

"Field /entry/instrument/analyser/projection has no documentation.",
"Field /entry/instrument/analyser/sensor_count has no documentation.",
"Field /entry/instrument/analyser/working_distance has no documentation.",
"The value at /entry/instrument/beam_probe_0/distance should be one of the following Python types: (<class 'float'>, <class 'numpy.floating'>), as defined in the NXDL as NX_FLOAT.",
Collaborator


see above

# See the License for the specific language governing permissions and
# limitations under the License.
#
"""Test cases for the Writer class used by the DataConverter"""
Collaborator


What is being tested here for?

Collaborator Author


All code in test_validation_opt can eventually be moved to test_validation.

def test_validate_file_storage_layouts(
storage_layout, tmp_path, caplog, expected_warnings
):
"""Test validation of a NeXus/HDF5 with the same content but different storage layout."""
Collaborator


Why not integrate this in the general validation tests?

Collaborator


Agree, should go to test_validation.py, but probably needs to stay its own function

with caplog.at_level(logging.INFO):
observed_infos = [
r.getMessage() for r in caplog.records if r.levelno == logging.INFO
rec.getMessage() for rec in caplog.records if rec.levelno == logging.INFO
Collaborator


Why this change?

Collaborator Author


We use r and rec inconsistently throughout our code base; yes, that could go into its own PR, or just be fixed here as we go.

@mkuehbach
Collaborator Author

mkuehbach commented Apr 7, 2026

I appreciate the effort, but for me this PR includes too many changes to properly review. The connection of many of these changes to the issue at hand (large memory consumption) is not clear to me. I suggest to split it up into several PRs. Also, it changes the behavior of the validator by removing the acceptance of ints as floats, which we explicitly decided to allow at some point, as far as I remember.

Thank you @rettigl

With regard to the last point: there is a good reason why NeXus distinguishes NX_NUMBER, NX_(U,POS)INT, and NX_FLOAT. I do not agree that this past decision was a good one. IMHO a validator should check and report where there are inconsistencies; it should not perform type conversions for convenience. Feeding pynxtools the right types straightaway should be the pynxtools readers' responsibility. I can see that people may be fine with just using the default precision of data types, like Python's built-in float and int, but I disagree with cross-accepting what are different elementary data types, not least because our implementation would then also need to check whether every precision of an int is exactly mappable onto a float of a certain precision.
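The precision concern is easy to demonstrate: not every 64-bit integer is exactly representable as a float64.

```python
# float64 has a 53-bit significand, so integers above 2**53 start to
# collapse onto the same float value; a silent int-to-float "conversion"
# can therefore change the stored number.
exact = 2**53          # 9007199254740992, still exactly representable
inexact = 2**53 + 1    # no float64 can hold this value

assert float(exact) == exact
assert float(inexact) == float(exact)   # rounds to the representable neighbor
assert int(float(inexact)) != inexact   # the round-trip loses the value
```
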

Collaborator

@lukaspie lukaspie left a comment


As it stands, I would not merge this PR.

  • I agree that we need separate functions for validity checks (HDF5 vs. template), since we don't always want to load full HDF5 datasets.
  • Probably, we should split up the HDF5 and template validation in separate modules eventually. helpers.py is already rather overloaded. But let's do this in a separate PR.
  • Regarding namefitting, I left a longer comment. I don't agree with the idea that sorting by optionality is really better than what we had before.
  • Changes to the definitions do not belong here.
  • vscode settings are too specific: as long as the NeXus files that are mentioned there are not part of the repo, we cannot use the launch.json. I am not suggesting to add these files, rather the settings should be kept simpler.
  • Regarding "validation, not conversion" -> on a technical design level, I would agree with this. However, my rationale for allowing simple conversions (or other workarounds, like adding the target attribute automatically) was always to allow for easier reader development. Maybe one compromise could be that we allow such conversions in the template validation, whereas for the HDF5 validation (where also non-pynxtools-generated files can be used), we are more strict. Curious about your opinion @rettigl.

Comment thread src/pynxtools/definitions
Collaborator


I agree

else:
return False
"""
return True
Collaborator


agreed

@@ -188,15 +194,15 @@ def validate_hdf_group_against(

def best_namefit_of(
Collaborator


Without going into the specific implementation of the function here (which could certainly be improved), why do you introduce a tie-breaking policy based on optionality? This is not how we handle namefitting elsewhere in the package (read_nexus, NOMAD parsers).

I don't think this case (i.e., an instance name that could feasibly fit two concepts with the same score) should come up often in the first place (as instance names often partially match the concept names, think entry1 for an unnamed NXentry). But okay, that's not the best argument, as indeed there may be cases where an instance name could have the same score for two distinct concepts.

Now the question is what to do in this (again, fairly rare) case. My argument was always that in that case, it is fully arbitrary. We cannot match to either of the best-fitting concepts in a deterministic way. So the simplest solution is of course to use the first best match. I don't think the NeXus standard says that required fields are somehow more important in name resolution than recommended/optional ones.

In summary, I suggest we revert these changes, unless we can come up with a concrete example where this solution enables better validation.

Collaborator


The one alternative that would be okay for me is to keep collecting all scores and if there are two concepts with the same score, we log a warning, but still take the first one.
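The proposed alternative, collecting all scores, warning on ties, but still taking the first best match, can be sketched as follows; the function name is hypothetical, and the scores dict stands in for the actual namefit results:

```python
import logging

logger = logging.getLogger(__name__)

def best_namefit_with_tie_warning(scores: dict) -> "str | None":
    """Pick the first concept with the highest namefit score.

    `scores` maps concept name -> namefit score, in the order the concepts
    were encountered. If several concepts share the top score, the choice
    is arbitrary, so we log a warning but keep the deterministic
    'first best match' behavior.
    """
    if not scores:
        return None
    best_score = max(scores.values())
    if best_score < 0:
        return None  # nothing fits at all
    best = [name for name, score in scores.items() if score == best_score]
    if len(best) > 1:
        logger.warning(
            "Multiple concepts share the highest namefit score %d: %s; "
            "taking the first one.", best_score, best,
        )
    return best[0]
```

Combined with deterministic (e.g. sorted) iteration over the candidates, "first best match" then also becomes reproducible across runs.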

old_best_match = None
old_best_score = -1

# use score as key for the outer dict and inner dict as value with
Collaborator


Again, I get why you did this now (limiting review capabilities at the moment, need for a working solution), but this has happened now quite a bit, also in other PRs. We should really strive for single-feature PRs (in this case, memory optimization), of which we can have of course as many as needed, rather than PR scopes that grow over time.

full_path,
)
is_valid_enum(
clean_str_attr(dataset[()]),
Collaborator


I think it's reasonable to not load them; indeed, we then need separate functions to check the validity of a given dataset in HDF5 or in the template.

"/ENTRY[my_entry]/NXODD_name[nxodd_name]/float_value",
0,
),
[],
Collaborator


I agree

def test_validate_file_storage_layouts(
storage_layout, tmp_path, caplog, expected_warnings
):
"""Test validation of a NeXus/HDF5 with the same content but different storage layout."""
Collaborator


Agree, should go to test_validation.py, but probably needs to stay its own function

Comment thread .pre-commit-config.yaml
@@ -1,6 +1,6 @@
repos:
Collaborator


This was a separate PR, should go away if you rebase

Comment thread .vscode/launch.json
@@ -0,0 +1,24 @@
{
Collaborator


We cannot have this in the code as it stands. Since none of these files are available to the normal developer, it is unusable at the moment.


def is_valid_data_field_hdf(hdf_node: h5py.Dataset, nxdl_type: str, path: str):
"""Checks whether value of hdf_node is valid according to the type defined in the NXDL."""
# validating i.e. reading only not converting !
Collaborator


I don't agree with this. The validation already accepts/adjusts smaller issues, especially with the template (namely int-to-float conversion, adding a missing target attribute, etc.). I see no harm in having the validator be more permissive. The goal is to create a correct template in the end and to make the developer's life easier, not to enforce things that we can easily fix.

For the example of automated int-to-float conversion, I want to see the concrete use case where this is breaking anything.

