Skip to content

NexusVisitor and Annotator architecture #777

Draft
lukaspie wants to merge 21 commits intomasterfrom
nexus-inheritance-concept-paths
Draft

NexusVisitor and Annotator architecture #777
lukaspie wants to merge 21 commits intomasterfrom
nexus-inheritance-concept-paths

Conversation

@lukaspie
Copy link
Copy Markdown
Collaborator

@lukaspie lukaspie commented May 8, 2026

Motivation

The previous HandleNexus class mixed three unrelated responsibilities in a single monolithic object: HDF5 file I/O, depth-first traversal, and application-specific output logic (annotation, NOMAD archive population). This made every processing mode tightly coupled to every other, and prevented the traversal from being reused independently.

In addition, the code for pynx read (the old read_nexus) was very hard to read and heavily dependent on the nxdl_utils from the `definitions .

This PR introduces:

  • a clean separation of concerns based on the visitor pattern.
  • a refactoring of the annotation tool, which becomes self-contained and produces more sensible output.

What changes

New architecture

Layer Key type Location
Schema NexusNode pynxtools.nexus.nexus_tree
Traversal NexusFileHandler pynxtools.nexus.handler
Processing NexusVisitor (ABC) pynxtools.nexus.handler

NexusVisitor is an abstract base class with four lifecycle hooks:

class NexusVisitor(ABC):
    @abstractmethod
    def on_group(self, hdf_path: str, hdf_node: h5py.Group) -> None: ...
    @abstractmethod
    def on_field(self, hdf_path: str, hdf_node: h5py.Dataset) -> None: ...
    @abstractmethod
    def on_attribute(self, hdf_path: str, attr_name: str, attr_value, parent) -> None: ...
    @abstractmethod
    def on_complete(self, root: h5py.File) -> None: ...

NexusFileHandler owns file I/O and traversal. It knows nothing about what to do with each node; that is entirely the visitor's responsibility:

NexusFileHandler("file.nxs").process(my_visitor)

Annotator (pynxtools.nexus.annotation)

Replaces the annotation logic previously embedded in HandleNexus. Uses NexusNode for schema resolution instead of the legacy nxdl_utils /get_inherited_hdf_nodes path. Supports all three operating modes of read_nexus (-d documentation, -c concept query, default full annotation).

NexusNode enhancements (pynxtools.nexus.nexus_tree)

New methods and properties on NexusNode that the Annotator and downstream visitors rely on:

  • get_inheritance_docs(): most-specific-first list of (nxdl_stem, doc_text) tuples across the full inheritance chain
  • get_inheritance_enums(): most-specific-first list of (nxdl_stem, [values]) tuples
  • get_inheritance_concept_paths(): list of (concept_label, path) tuples with full within-NXDL paths (e.g. NXinstrument/energy)
  • best_child_for(name, node_type, nx_class): name-fit child selection
  • find_node_at_path(path, ..., _cache): full-path traversal with optional dict cache
  • has_nxcollection_parent(): checks for NXcollection ancestor (exempts required-field checks)
  • concept_path property:: "NXarpes/ENTRY/INSTRUMENT" display string

HandleNexus deprecation (pynxtools.nexus.nexus)

HandleNexus is kept for backward compatibility and emits a DeprecationWarning. All internal call sites have been migrated to
NexusFileHandler + Annotator. External code can continue to use HandleNexus without changes.

pynx read: completely redesigned output

This is the most user-visible change in the PR. The read_nexus / pynx read command produces fundamentally different output compared to before.

Before: schema information came from nxdl_utils.get_inherited_hdf_nodes, which resolved inheritance lazily, inconsistently, and without type awareness. Output was a flat log of field values interleaved with NXDL lookups; schema matches were often
missing or wrong for namefitted groups and fields.

After: schema resolution goes through NexusNode, which builds the full inheritance chain from NXDL XML at tree-construction time and exposes it via typed methods. For every group, field, and attribute the annotator now emits:

  • Concept — the exact NXDL path from the active application definition (e.g. NXarpes/ENTRY/INSTRUMENT) with its optionality (REQUIRED / RECOMMENDED / OPTIONAL)
  • Inheritance — the ordered chain of NXDL base classes that define this concept, with full within-NXDL paths for each level (e.g. NXarpes/NXentry/NXinstrument, NXentry/NXinstrument, NXinstrument) and the most-specific documentation inline
  • Value / shape / dtype — for fields, as before

The node resolution is now type-aware: groups, fields, and attributes are resolved through separate lookup paths, eliminating the confusion between nodes that share a name at the same level but have different XML tag types (field vs. attribute). Namefit matching (e.g. data_1DATA) is applied consistently across groups and fields at every depth.

Module restructuring

What Before After
NeXus schema tree pynxtools.dataconverter.nexus_tree pynxtools.nexus.nexus_tree
String decoding helper inline in nexus.py pynxtools.nexus.utils.decode_if_string
NXdata axis/signal helpers inline in nexus.py pynxtools.nexus.nxdata

pynxtools.dataconverter.nexus_tree is kept as a re-export shim for backward compatibility.

Circular import fix

Moving nexus_tree out of dataconverter/ created a circular import: nexus_tree → dataconverter.helpers → dataconverter.__init__ → validation → nexus_tree. Fixed by defining validate_data_dict as a lazy forwarder in helpers.py, making the eager import in dataconverter/__init__.py unnecessary. The monkey-patch pattern is preserved as _apply_monkey_patch() for code that needs
the real function object.

NOMAD parser (pynxtools.nomad.parsers.parser)

NomadParser is now a thin wrapper: it constructs an Annotator(parser=callback) and passes it to NexusFileHandler. The legacy _nexus_populate callback contract is fully preserved.

Documentation

  • New: docs/learn/pynxtools/architecture.md: three-layer architecture overview
  • New: docs/how-tos/pynxtools/implement-a-visitor.md: how-to guide with two worked examples
  • Updated: docs/reference/cli-api.md: improved read_nexus documentation
  • Updated: docs/index.md and mkdocs.yaml: navigation for new pages

What this PR does NOT change

  • NOMAD archive population logic: _to_section, _populate_data and related functions in parser.py are unchanged.

How to review

The logical reading order matches the dependency graph:

  1. nexus/nexus_tree.py: new NexusNode methods (pure data model, no side effects)
  2. nexus/handler.py: NexusVisitor ABC and NexusFileHandler
  3. nexus/annotation.py: Annotator implementation
  4. nexus/nexus.py: HandleNexus deprecation and CLI cleanup
  5. nomad/parsers/parser.py: thin NomadParser wrapper
  6. dataconverter/__init__.py + dataconverter/helpers.py: circular import fix
  7. docs/: architecture and how-to docs

Open issues addressed

The following open issues under the read_nexus label and the umbrella tracking issue #526 are affected by this PR:

Issue Description Status
#59 Child resolution conflates fields and attributes sharing a name Fixedbest_child_for is type-aware (node_type parameter)
#139 NxdlAttributeError crash resolving @units at deep nesting depths Crash eliminatedget_node_at_nxdl_path is no longer used; @units may still show as NOT IN SCHEMA at some depths
#422 Namefitted attributes (e.g. AXISNAME_indices) not recognized Fixed_find_attr_node now uses best_child_for(node_type="attribute") which applies get_nx_namefit to variadic attribute names
#436 Crash (KeyError) on AXISNAME_indices pointing to missing field Not fixed — crash site in axis_helper is unchanged
#128 Concept query (-c) broken for inherited appdef paths Not fixed_handle_concept_query still delegates to get_all_is_a_rel_from_hdf_node

Fixes #59
Fixes #139
Fixes #422

lukaspie added 20 commits May 8, 2026 12:51
Phase 1-3 of the Unified NeXus File Handler refactor:

- Move NexusNode tree to nexus/nexus_tree.py; dataconverter/nexus_tree.py
  becomes a re-export shim for backward compatibility
- Add nexus/annotation.py: Annotator class with NexusTree-based schema
  resolution replacing get_inherited_hdf_nodes / get_nxdl_doc chains;
  structured inheritance display with per-element concept paths
  (NXinstrument::/energy, NXarpes::/NXentry/NXinstrument/analyser/lens_mode)
- Add nexus/handler.py: NexusFileHandler + NexusVisitor protocol stubs
  (foundation for Phase 4 full handler)
- Refactor dataconverter/validation.py: FileValidator class with instance-level
  node cache replaces visititems closures; validate_hdf_group_against is a
  1-line wrapper
- Add NexusNode helpers: concept_path, get_inheritance_docs,
  get_inheritance_enums, get_inheritance_concept_paths,
  has_nxcollection_parent, best_child_for, find_node_at_path
- Add _xml_path_in_nxdl module helper for precise within-NXDL element paths
- nomad/parser.py: NomadParser thin wrapper with TODO skeleton for
  future NomadVisitor migration
- 329 tests passing
NomadParser.process() now delegates directly to
NexusFileHandler + Annotator(parser=callback) instead of going
through HandleNexus, which was already a trivial wrapper doing
the same thing.  This eliminates the extra indirection while
keeping the legacy _nexus_populate callback interface intact.

The nxdl_path=list[ET.Element] coupling in _nexus_populate and
_to_section/_populate_data remains as a future TODO once the
NOMAD metainfo navigation is decoupled from XML elements.
…call sites

- nexus.py main(): use NexusFileHandler + Annotator directly
- testing/nexus_conversion.py get_log_file(): same
- tests/nexus/test_nexus.py: migrate all HandleNexus calls to the new API
- HandleNexus.__init__: emit DeprecationWarning so external callers are notified
- NomadParser docstring: update to reference NexusFileHandler

HandleNexus is kept as a deprecated shim for external backward compatibility.
All first-party call sites now use NexusFileHandler + Annotator directly.
…to nexus.nexus_tree

Update eln_mapper, validation, and convert to import from the canonical
nexus.nexus_tree location. dataconverter/nexus_tree.py remains as a
backward-compatibility re-export shim for external packages.
…s.nxdata

Move pure utility functions out of the 800-line nexus.py into proper
sub-modules with minimal dependencies:

- nexus/utils.py: decode_if_string (decodes HDF5 bytes/arrays to str)
- nexus/nxdata.py: chk_nxdata_axis*, logger_auxiliary_signal,
  print_default_plottable_header, entry_helper, nxdata_helper,
  signal_helper, find_attrib_axis_actual_dim_num,
  get_single_or_multiple_axes, axis_helper

nexus.py re-exports these via imports for backward compatibility.
annotation.py now imports decode_if_string, chk_nxdata_axis, and
get_default_plottable at module level (no more deferred local imports
for these three symbols). Tests updated accordingly.

Also fixes a pre-existing bug in chk_nxdata_axis_v2: `for i in len(axes)`
was corrected to `for i in range(len(axes))`.
…ect suppression

Two bugs fixed:

1. _find_nexus_node now passes node_type="field" for h5py.Dataset nodes and
   node_type="group" for h5py.Group nodes. Previously no node_type was passed,
   causing fields like 'amplifier_type' to incorrectly match variadic group nodes
   (e.g. TRANSFORMATIONS with name_any=True, score=2) instead of returning
   <NOT IN SCHEMA>.

2. get_inheritance_concept_paths() previously suppressed ALL NXobject entries.
   Now only the bare NXobject root is suppressed; named entries within NXobject
   (e.g. NXobject::/identifierNAME) are included since they carry meaningful
   provenance and documentation.
…/NXentry/title'

Drop the '::' separator in concept_path and get_inheritance_concept_paths().
Since the within-NXDL path already starts with '/', the file stem is
joined directly: 'NXarpes' + '/NXentry/title' = 'NXarpes/NXentry/title'.
Bare base-class roots with no path (e.g. 'NXentry') are unchanged.

Updated docstrings and inline examples consistently.
- Add module-level `_logger = logging.getLogger("pynxtools")` to nexus.py and
  nxdata.py; remove `logger` parameter from all internal functions
  (`get_nxdl_doc`, `get_nxdl_attr_doc`, `check_deprecation_enum_axis`,
  `process_node`, `get_default_plottable`, all nxdata helpers)
- Delete dead code: `hdf_node_to_self_concept_path` (zero callers)
- Make `nexus_file` CLI argument optional (bare `read_nexus` uses built-in sample)
- Fix `Annotator` call in `main()`: use `documentation=` / `concept=` kwargs
- Update tests to use renamed `documentation=` / `concept=` kwargs
nexus_tree.py:
- find_node_at_path: force node_type="group" for all intermediate path segments
  to prevent schema fields from being selected as parent nodes (which would make
  all children appear as NOT IN SCHEMA)

annotation.py:
- _appdef_for: strip() the definition string so trailing whitespace/null bytes
  from h5py don't silently break tree generation

nexus.py:
- Rename truly-internal helpers with _ prefix: _check_deprecation_enum_axis,
  _get_nxdl_attr_doc, _helper_get_inherited_nodes, _get_hdf_path,
  _get_inherited_hdf_nodes, _process_node
- Add get_inherited_hdf_nodes = _get_inherited_hdf_nodes alias for backward compat
- Add .. deprecated:: docstrings to get_nxdl_doc, get_default_plottable,
  get_all_is_a_rel_from_hdf_node documenting their legacy status

handler.py:
- Update cache_clear() call to use renamed _get_inherited_hdf_nodes
Replace the single `find_node_at_path(hdf_path, ...)` call with a
segment-by-segment resolution that reads each intermediate HDF5 group's
actual `NX_class` attribute and passes it to `best_child_for`.

This eliminates non-deterministic behaviour when multiple variadic schema
groups (NXdetector, NXdata, NXcollection, NXparameters…) all score
identically under `name_any=True` — the winner previously depended on
Python `set` iteration order (hash-randomized), causing the same path to
resolve to different schema nodes, `<NOT IN SCHEMA>`, or the wrong
concept across repeated runs.

With the fix, the NX_class constraint restricts the candidate pool to
exactly one matching class, making resolution stable and correct.  A
fallback to the unconstrained search is retained for groups that carry
no `NX_class` attribute.
Moving nexus_tree from dataconverter/ to nexus/ introduced a cycle:
nexus.nexus_tree imports dataconverter.helpers, which triggers
dataconverter/__init__.py, which eagerly imported validation, which
imports nexus.nexus_tree (partially initialised).

Fix by defining validate_data_dict directly in helpers.py as a lazy
forwarder to validation.validate_data_dict. The __init__.py monkey-patch
is preserved as _apply_monkey_patch() for backwards compatibility, but
no longer runs at import time.
Replaces the plain base class with an ABC so that the contract is enforced
by the type system. All four hooks (on_group, on_field, on_attribute,
on_complete) are abstract; hooks that are not meaningful for a given visitor
should be implemented as pass. Existing visitors (Annotator, test visitors)
already implement all four hooks and are unaffected. The test for the base
no-op behavior is updated to use a concrete _NullVisitor.
Replaces the plain base class with an ABC so that the contract is enforced
by the type system. All four hooks (on_group, on_field, on_attribute,
on_complete) are abstract; hooks that are not meaningful for a given visitor
should be implemented as pass. Existing visitors (Annotator, test visitors)
already implement all four hooks and are unaffected. The test for the base
no-op behavior is updated to use a concrete _NullVisitor.
@lukaspie lukaspie force-pushed the nexus-inheritance-concept-paths branch from d79df9e to a047870 Compare May 8, 2026 14:18
@lukaspie lukaspie force-pushed the nexus-inheritance-concept-paths branch from a047870 to 5dfb5f9 Compare May 10, 2026 14:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Namefitted attributes are not recognized Resolving units attribute does not always work nexusutils/nexus/nexus.py child resolution bug

1 participant