
Optimization related to compression, allowing multithreaded blosc2 to be used as a compression library#747

Open
mkuehbach wants to merge 18 commits into master from add_blosc_but_keep_deflate_the_default

Conversation

@mkuehbach
Collaborator

@mkuehbach mkuehbach commented Mar 17, 2026

  • Adds what is required to use blosc2, essentially making explicit what including hdf5plugin and pandas already provided.
  • Writes all non-scalar, non-string datasets using the chunked storage layout, with h5py's autochunker active by default. The main motivation is that this lets everybody take advantage of the chunk-by-chunk processing from "Memory optimization NOMAD nexus parser" #750 and "Memory optimization PYNXTOOLS validate" #752.
  • deflate, i.e. "gzip", is kept as the default compression algorithm.
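
A minimal sketch of what these defaults look like at the h5py level, not the actual pynxtools writer code. The file name, dataset names, array shape, and compression levels are assumptions chosen for illustration. It writes a chunked, gzip-compressed dataset with the autochunker, reads it back chunk by chunk, and, only if hdf5plugin is installed, writes a second copy with the Blosc2 filter.

```python
import os
import tempfile

import h5py
import numpy as np

# Illustrative data; shape and dtype are arbitrary choices for this sketch.
data = np.arange(200 * 300, dtype=np.float32).reshape(200, 300)
path = os.path.join(tempfile.mkdtemp(), "example.h5")

with h5py.File(path, "w") as h5:
    # chunks=True asks h5py to guess a chunk shape (the autochunker);
    # compression="gzip" selects deflate, the default kept by this PR.
    dset = h5.create_dataset(
        "signal", data=data, chunks=True, compression="gzip", compression_opts=4
    )
    chunk_shape = dset.chunks

with h5py.File(path, "r") as h5:
    dset = h5["signal"]
    stored_compression = dset.compression
    total = 0.0
    # iter_chunks() yields one tuple of slices per stored chunk, so only
    # a single chunk has to be held in memory at a time.
    for chunk_slices in dset.iter_chunks():
        total += float(dset[chunk_slices].sum(dtype=np.float64))

# Optional: Blosc2 is a custom HDF5 filter made available by hdf5plugin.
try:
    import hdf5plugin
except ImportError:
    hdf5plugin = None

if hdf5plugin is not None:
    with h5py.File(path, "a") as h5:
        # The filter object unpacks into the compression keyword arguments.
        h5.create_dataset(
            "signal_blosc2",
            data=data,
            chunks=True,
            **hdf5plugin.Blosc2(
                cname="zstd", clevel=5, filters=hdf5plugin.Blosc2.SHUFFLE
            ),
        )
```

Readers of a Blosc2-compressed file also need hdf5plugin imported so that the filter is registered, whereas gzip-compressed files can be read by any standard HDF5 build. That difference is one reason to keep deflate as the default.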

@mkuehbach mkuehbach changed the title carried over from NXapm run-through Adding explicit support for blosc but keeping deflate the default Mar 17, 2026
Comment thread src/pynxtools/dataconverter/writer.py Outdated
Comment thread src/pynxtools/dataconverter/writer.py Outdated
Comment thread src/pynxtools/dataconverter/writer.py
Comment thread src/pynxtools/dataconverter/chunk.py Outdated
@sherjeelshabih
Collaborator

Can you also add a short text to this PR, even if only provisional, describing what changes this introduces to the way the user interacts with the writer? I believe writer.py now expects a "filter" key in the Template object. It would be nice to have the "user interface" changes documented in the PR.

Thanks for introducing this. I hope it makes it easier for the large datasets we run into.

@mkuehbach
Collaborator Author

> Can you also add a small text in this PR even for now of what changes this introduces to the way the user interacts with this? I believe the writer.py now expects a "filter" key in the Template object. It will be nice to know the "user interface" changes in the PR.
>
> Thanks for introducing this. I hope it makes it easier for the large datasets we run into.

I need to check. I thought that "filter" is not required per se and that a default kicks in, but yes, I should document this.
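
The "default kicks in" behaviour could look roughly like the sketch below. This is a hypothetical illustration, not the code in writer.py: the helper name `resolve_filter`, the constant `DEFAULT_FILTER`, the `"values"` key, and the shape of the entry dictionary are all assumptions. Only the "filter" key and gzip being the default come from this conversation.

```python
from typing import Any, Mapping

# Assumption: deflate ("gzip") is the fallback, as stated in the PR description.
DEFAULT_FILTER = "gzip"


def resolve_filter(entry: Mapping[str, Any]) -> str:
    """Return the requested filter name, or the default if none is given.

    `entry` stands in for one Template value; its real structure may differ.
    """
    return entry.get("filter", DEFAULT_FILTER)


# An entry without a "filter" key falls back to the default ...
without_filter = resolve_filter({"values": [1, 2, 3]})
# ... while an explicit "filter" key is passed through unchanged.
with_filter = resolve_filter({"values": [1, 2, 3], "filter": "blosc2"})
```

With a fallback like this, existing templates that never mention "filter" would keep working unchanged.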

@mkuehbach mkuehbach changed the title Adding explicit support for blosc but keeping deflate the default Refactoring compression, adding support for blosc but keeping deflate the default Mar 26, 2026
… with h5py autochunking active by default, add documentation for blosc in learning section
@mkuehbach mkuehbach changed the title Refactoring compression, adding support for blosc but keeping deflate the default Refactoring default storage layout to use chunked layout and adding support for optional blosc custom compression filter Mar 26, 2026
@mkuehbach mkuehbach changed the title Refactoring default storage layout to use chunked layout and adding support for optional blosc custom compression filter Optimization related to compression, allowing multithreaded blosc2 to be used as a compression library May 8, 2026
Collaborator

@lukaspie lukaspie left a comment


Looks good, but I am not an expert

Comment thread docs/learn/pynxtools/compression.md Outdated
Collaborator

@RubelMozumder RubelMozumder left a comment


LGTM!
I do have a general question, though: how effectively will this multithreading work in NOMAD, where the NeXus parser is launched in an asynchronous thread? I do not have any idea about it.

Comment thread src/pynxtools/dataconverter/writer.py Outdated
Comment thread src/pynxtools/dataconverter/writer.py
5 participants