Draft
Conversation
added 20 commits
May 8, 2026 12:51
Phase 1-3 of the Unified NeXus File Handler refactor: - Move NexusNode tree to nexus/nexus_tree.py; dataconverter/nexus_tree.py becomes a re-export shim for backward compatibility - Add nexus/annotation.py: Annotator class with NexusTree-based schema resolution replacing get_inherited_hdf_nodes / get_nxdl_doc chains; structured inheritance display with per-element concept paths (NXinstrument::/energy, NXarpes::/NXentry/NXinstrument/analyser/lens_mode) - Add nexus/handler.py: NexusFileHandler + NexusVisitor protocol stubs (foundation for Phase 4 full handler) - Refactor dataconverter/validation.py: FileValidator class with instance-level node cache replaces visititems closures; validate_hdf_group_against is a 1-line wrapper - Add NexusNode helpers: concept_path, get_inheritance_docs, get_inheritance_enums, get_inheritance_concept_paths, has_nxcollection_parent, best_child_for, find_node_at_path - Add _xml_path_in_nxdl module helper for precise within-NXDL element paths - nomad/parser.py: NomadParser thin wrapper with TODO skeleton for future NomadVisitor migration - 329 tests passing
NomadParser.process() now delegates directly to NexusFileHandler + Annotator(parser=callback) instead of going through HandleNexus, which was already a trivial wrapper doing the same thing. This eliminates the extra indirection while keeping the legacy _nexus_populate callback interface intact. The nxdl_path=list[ET.Element] coupling in _nexus_populate and _to_section/_populate_data remains as a future TODO once the NOMAD metainfo navigation is decoupled from XML elements.
…call sites - nexus.py main(): use NexusFileHandler + Annotator directly - testing/nexus_conversion.py get_log_file(): same - tests/nexus/test_nexus.py: migrate all HandleNexus calls to the new API - HandleNexus.__init__: emit DeprecationWarning so external callers are notified - NomadParser docstring: update to reference NexusFileHandler HandleNexus is kept as a deprecated shim for external backward compatibility. All first-party call sites now use NexusFileHandler + Annotator directly.
…to nexus.nexus_tree Update eln_mapper, validation, and convert to import from the canonical nexus.nexus_tree location. dataconverter/nexus_tree.py remains as a backward-compatibility re-export shim for external packages.
…s.nxdata Move pure utility functions out of the 800-line nexus.py into proper sub-modules with minimal dependencies: - nexus/utils.py: decode_if_string (decodes HDF5 bytes/arrays to str) - nexus/nxdata.py: chk_nxdata_axis*, logger_auxiliary_signal, print_default_plottable_header, entry_helper, nxdata_helper, signal_helper, find_attrib_axis_actual_dim_num, get_single_or_multiple_axes, axis_helper nexus.py re-exports these via imports for backward compatibility. annotation.py now imports decode_if_string, chk_nxdata_axis, and get_default_plottable at module level (no more deferred local imports for these three symbols). Tests updated accordingly. Also fixes a pre-existing bug in chk_nxdata_axis_v2: `for i in len(axes)` was corrected to `for i in range(len(axes))`.
…ect suppression Two bugs fixed: 1. _find_nexus_node now passes node_type="field" for h5py.Dataset nodes and node_type="group" for h5py.Group nodes. Previously no node_type was passed, causing fields like 'amplifier_type' to incorrectly match variadic group nodes (e.g. TRANSFORMATIONS with name_any=True, score=2) instead of returning <NOT IN SCHEMA>. 2. get_inheritance_concept_paths() previously suppressed ALL NXobject entries. Now only the bare NXobject root is suppressed; named entries within NXobject (e.g. NXobject::/identifierNAME) are included since they carry meaningful provenance and documentation.
…/NXentry/title' Drop the '::' separator in concept_path and get_inheritance_concept_paths(). Since the within-NXDL path already starts with '/', the file stem is joined directly: 'NXarpes' + '/NXentry/title' = 'NXarpes/NXentry/title'. Bare base-class roots with no path (e.g. 'NXentry') are unchanged. Updated docstrings and inline examples consistently.
- Add module-level `_logger = logging.getLogger("pynxtools")` to nexus.py and
nxdata.py; remove `logger` parameter from all internal functions
(`get_nxdl_doc`, `get_nxdl_attr_doc`, `check_deprecation_enum_axis`,
`process_node`, `get_default_plottable`, all nxdata helpers)
- Delete dead code: `hdf_node_to_self_concept_path` (zero callers)
- Make `nexus_file` CLI argument optional (bare `read_nexus` uses built-in sample)
- Fix `Annotator` call in `main()`: use `documentation=` / `concept=` kwargs
- Update tests to use renamed `documentation=` / `concept=` kwargs
nexus_tree.py: - find_node_at_path: force node_type="group" for all intermediate path segments to prevent schema fields from being selected as parent nodes (which would make all children appear as NOT IN SCHEMA) annotation.py: - _appdef_for: strip() the definition string so trailing whitespace/null bytes from h5py don't silently break tree generation nexus.py: - Rename truly-internal helpers with _ prefix: _check_deprecation_enum_axis, _get_nxdl_attr_doc, _helper_get_inherited_nodes, _get_hdf_path, _get_inherited_hdf_nodes, _process_node - Add get_inherited_hdf_nodes = _get_inherited_hdf_nodes alias for backward compat - Add .. deprecated:: docstrings to get_nxdl_doc, get_default_plottable, get_all_is_a_rel_from_hdf_node documenting their legacy status handler.py: - Update cache_clear() call to use renamed _get_inherited_hdf_nodes
Replace the single `find_node_at_path(hdf_path, ...)` call with a segment-by-segment resolution that reads each intermediate HDF5 group's actual `NX_class` attribute and passes it to `best_child_for`. This eliminates non-deterministic behaviour when multiple variadic schema groups (NXdetector, NXdata, NXcollection, NXparameters…) all score identically under `name_any=True` — the winner previously depended on Python `set` iteration order (hash-randomized), causing the same path to resolve to different schema nodes, `<NOT IN SCHEMA>`, or the wrong concept across repeated runs. With the fix, the NX_class constraint restricts the candidate pool to exactly one matching class, making resolution stable and correct. A fallback to the unconstrained search is retained for groups that carry no `NX_class` attribute.
Moving nexus_tree from dataconverter/ to nexus/ introduced a cycle: nexus.nexus_tree imports dataconverter.helpers, which triggers dataconverter/__init__.py, which eagerly imported validation, which imports nexus.nexus_tree (partially initialised). Fix by defining validate_data_dict directly in helpers.py as a lazy forwarder to validation.validate_data_dict. The __init__.py monkey-patch is preserved as _apply_monkey_patch() for backwards compatibility, but no longer runs at import time.
Replaces the plain base class with an ABC so that the contract is enforced by the type system. All four hooks (on_group, on_field, on_attribute, on_complete) are abstract; hooks that are not meaningful for a given visitor should be implemented as pass. Existing visitors (Annotator, test visitors) already implement all four hooks and are unaffected. The test for the base no-op behavior is updated to use a concrete _NullVisitor.
Replaces the plain base class with an ABC so that the contract is enforced by the type system. All four hooks (on_group, on_field, on_attribute, on_complete) are abstract; hooks that are not meaningful for a given visitor should be implemented as pass. Existing visitors (Annotator, test visitors) already implement all four hooks and are unaffected. The test for the base no-op behavior is updated to use a concrete _NullVisitor.
d79df9e to
a047870
Compare
a047870 to
5dfb5f9
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
The previous
HandleNexusclass mixed three unrelated responsibilities in a single monolithic object: HDF5 file I/O, depth-first traversal, and application-specific output logic (annotation, NOMAD archive population). This made every processing mode tightly coupled to every other, and prevented the traversal from being reused independently.In addition, the code for
pynx read(the oldread_nexus) was very hard to read and heavily dependent on thenxdl_utilsfrom the `definitions .This PR introduces:
What changes
New architecture
NexusNodepynxtools.nexus.nexus_treeNexusFileHandlerpynxtools.nexus.handlerNexusVisitor(ABC)pynxtools.nexus.handlerNexusVisitoris an abstract base class with four lifecycle hooks:NexusFileHandlerowns file I/O and traversal. It knows nothing about what to do with each node; that is entirely the visitor's responsibility:Annotator(pynxtools.nexus.annotation)Replaces the annotation logic previously embedded in
HandleNexus. UsesNexusNodefor schema resolution instead of the legacynxdl_utils/get_inherited_hdf_nodespath. Supports all three operating modes ofread_nexus(-ddocumentation,-cconcept query, default full annotation).NexusNodeenhancements (pynxtools.nexus.nexus_tree)New methods and properties on
NexusNodethat theAnnotatorand downstream visitors rely on:get_inheritance_docs(): most-specific-first list of(nxdl_stem, doc_text)tuples across the full inheritance chainget_inheritance_enums(): most-specific-first list of(nxdl_stem, [values])tuplesget_inheritance_concept_paths(): list of(concept_label, path)tuples with full within-NXDL paths (e.g.NXinstrument/energy)best_child_for(name, node_type, nx_class): name-fit child selectionfind_node_at_path(path, ..., _cache): full-path traversal with optional dict cachehas_nxcollection_parent(): checks for NXcollection ancestor (exempts required-field checks)concept_pathproperty:: "NXarpes/ENTRY/INSTRUMENT"display stringHandleNexusdeprecation (pynxtools.nexus.nexus)HandleNexusis kept for backward compatibility and emits aDeprecationWarning. All internal call sites have been migrated toNexusFileHandler+Annotator. External code can continue to useHandleNexuswithout changes.pynx read: completely redesigned outputThis is the most user-visible change in the PR. The
read_nexus/pynx readcommand produces fundamentally different output compared to before.Before: schema information came from
nxdl_utils.get_inherited_hdf_nodes, which resolved inheritance lazily, inconsistently, and without type awareness. Output was a flat log of field values interleaved with NXDL lookups; schema matches were oftenmissing or wrong for namefitted groups and fields.
After: schema resolution goes through
NexusNode, which builds the full inheritance chain from NXDL XML at tree-construction time and exposes it via typed methods. For every group, field, and attribute the annotator now emits:NXarpes/ENTRY/INSTRUMENT) with its optionality (REQUIRED/RECOMMENDED/OPTIONAL)NXarpes/NXentry/NXinstrument,NXentry/NXinstrument,NXinstrument) and the most-specific documentation inlineThe node resolution is now type-aware: groups, fields, and attributes are resolved through separate lookup paths, eliminating the confusion between nodes that share a name at the same level but have different XML tag types (field vs. attribute). Namefit matching (e.g.
data_1→DATA) is applied consistently across groups and fields at every depth.Module restructuring
pynxtools.dataconverter.nexus_treepynxtools.nexus.nexus_treenexus.pypynxtools.nexus.utils.decode_if_stringnexus.pypynxtools.nexus.nxdatapynxtools.dataconverter.nexus_treeis kept as a re-export shim for backward compatibility.Circular import fix
Moving
nexus_treeout ofdataconverter/created a circular import:nexus_tree → dataconverter.helpers → dataconverter.__init__ → validation → nexus_tree. Fixed by definingvalidate_data_dictas a lazy forwarder inhelpers.py, making the eager import indataconverter/__init__.pyunnecessary. The monkey-patch pattern is preserved as_apply_monkey_patch()for code that needsthe real function object.
NOMAD parser (
pynxtools.nomad.parsers.parser)NomadParseris now a thin wrapper: it constructs anAnnotator(parser=callback)and passes it toNexusFileHandler. The legacy_nexus_populatecallback contract is fully preserved.Documentation
docs/learn/pynxtools/architecture.md: three-layer architecture overviewdocs/how-tos/pynxtools/implement-a-visitor.md: how-to guide with two worked examplesdocs/reference/cli-api.md: improvedread_nexusdocumentationdocs/index.mdandmkdocs.yaml: navigation for new pagesWhat this PR does NOT change
_to_section,_populate_dataand related functions inparser.pyare unchanged.How to review
The logical reading order matches the dependency graph:
nexus/nexus_tree.py: newNexusNodemethods (pure data model, no side effects)nexus/handler.py:NexusVisitorABC andNexusFileHandlernexus/annotation.py:Annotatorimplementationnexus/nexus.py:HandleNexusdeprecation and CLI cleanupnomad/parsers/parser.py: thinNomadParserwrapperdataconverter/__init__.py+dataconverter/helpers.py: circular import fixdocs/: architecture and how-to docsOpen issues addressed
The following open issues under the
read_nexuslabel and the umbrella tracking issue #526 are affected by this PR:best_child_foris type-aware (node_typeparameter)NxdlAttributeErrorcrash resolving@unitsat deep nesting depthsget_node_at_nxdl_pathis no longer used;@unitsmay still show as NOT IN SCHEMA at some depthsAXISNAME_indices) not recognized_find_attr_nodenow usesbest_child_for(node_type="attribute")which appliesget_nx_namefitto variadic attribute namesKeyError) onAXISNAME_indicespointing to missing fieldaxis_helperis unchanged-c) broken for inherited appdef paths_handle_concept_querystill delegates toget_all_is_a_rel_from_hdf_nodeFixes #59
Fixes #139
Fixes #422