Skip to content

Feature/zarr#675

Merged
sandorkertesz merged 15 commits into
developfrom
feature/zarr
Jun 3, 2025
Merged

Feature/zarr#675
sandorkertesz merged 15 commits into
developfrom
feature/zarr

Conversation

@Oisin-M
Copy link
Copy Markdown
Contributor

@Oisin-M Oisin-M commented Apr 9, 2025

Implements a simple zarr source for reading zarr files (only version 2 at the moment)

@Oisin-M
Copy link
Copy Markdown
Contributor Author

Oisin-M commented Apr 9, 2025

Tests are failing because zarr version 3 requires Python version 3.11 onward

@sandorkertesz
Copy link
Copy Markdown
Collaborator

Hi @Oisin-M, thank you very much for this development. I had a quick look at the code and it seems to me that the "zarr" source is not really needed.

@chpolste
Copy link
Copy Markdown
Member

For datasets on the local filesystem, yes. But the source can also access URLs and other locations supported through xarray.

The url source doesn't handle zarrs properly at the moment because they are directories and can't be cached easily. We have to look into the dataset first to know which files are on the server. And I'd argue that most of the time we wouldn't want to eagerly download and cache the dataset anyway, because the on-demand, chunk-based access that xarray supports for remote zarrs is really useful. We can create an exception in the url source, but other sources then need that too.

The question "is this location a zarr dataset?" is not easy to answer by the zarr reader's detection and will depend on the source. Starting with os.isdir works with a file source but not for a URL, etc. I think we can take a hint here from xarray's choice to have a separate open_zarr in addition to open_dataset (though that also supports engine="zarr").

In some aspects, zarr is an API (see, e.g., kerchunk and virtualizarr, as far as I understand these) and so matches the source paradigm of ek-data as well as the reader.

@sandorkertesz
Copy link
Copy Markdown
Collaborator

For datasets on the local filesystem, yes. But the source can also access URLs and other locations supported through xarray.

The url source doesn't handle zarrs properly at the moment because they are directories and can't be cached easily. We have to look into the dataset first to know which files are on the server. And I'd argue that most of the time we wouldn't want to eagerly download and cache the dataset anyway, because the on-demand, chunk-based access that xarray supports for remote zarrs is really useful. We can create an exception in the url source, but other sources then need that too.

The question "is this location a zarr dataset?" is not easy to answer by the zarr reader's detection and will depend on the source. Starting with os.isdir works with a file source but not for a URL, etc. I think we can take a hint here from xarray's choice to have a separate open_zarr in addition to open_dataset (though that also supports engine="zarr").

In some aspects, zarr is an API (see, e.g., kerchunk and virtualizarr, as far as I understand these) and so matches the source paradigm of ek-data as well as the reader.

@chpolste, thank you for the explanation. I agree, this should be a source, just like e.g. "opendap" .

Copy link
Copy Markdown
Collaborator

@sandorkertesz sandorkertesz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this new source! Once it is in sync with the latest develop it can be merged.

@Oisin-M
Copy link
Copy Markdown
Contributor Author

Oisin-M commented May 26, 2025

@sandorkertesz synced with develop now. Just to flag

  • zarr 3 needs python >= 3.11
  • I added zarr to the pip install .[all] install, so it will complain about not finding a version if using an older python version
  • downstream-ci is failing because it uses python 3.10

If you think it's better, can remove from the .[all] for now

@sandorkertesz
Copy link
Copy Markdown
Collaborator

sandorkertesz commented Jun 2, 2025

@Oisin-M unfortunately we cannot afford disabling all the tests due to this dependency. So I suggest you remove the zarr dependency from environment-unit-tests.yml and from "all" in pyproject.toml. It will be re-enabled once the CI python version is updated.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jun 2, 2025

Codecov Report

❌ Patch coverage is 50.00000% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.92%. Comparing base (75f086e) to head (bc17b5e).
⚠️ Report is 328 commits behind head on develop.

Files with missing lines Patch % Lines
tests/sources/test_zarr.py 50.00% 7 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #675      +/-   ##
===========================================
- Coverage    90.98%   90.92%   -0.06%     
===========================================
  Files          167      168       +1     
  Lines        12663    12679      +16     
  Branches       613      614       +1     
===========================================
+ Hits         11521    11529       +8     
- Misses         960      967       +7     
- Partials       182      183       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Oisin-M
Copy link
Copy Markdown
Contributor Author

Oisin-M commented Jun 2, 2025

Hi @sandorkertesz, thanks for taking a look. I have now made zarr optional and also made the zarr test automatically skip if it can't find the zarr package, so tests are all passing

@sandorkertesz sandorkertesz merged commit 1eae98b into develop Jun 3, 2025
245 of 246 checks passed
@sandorkertesz
Copy link
Copy Markdown
Collaborator

sandorkertesz commented Jun 10, 2025

@Oisin-M , I have a question about the source name. What was the reason of choosing "xarray_zarr" and not simply "zarr"? E.g. the "opendap" source is also using xarray to open the data but it does not have "xarray" in the name.

@Oisin-M
Copy link
Copy Markdown
Contributor Author

Oisin-M commented Jun 11, 2025

Hi @sandorkertesz, this was to leave room for having a reader based on the zarr API directly

@sandorkertesz
Copy link
Copy Markdown
Collaborator

@Oisin-M, for simplicity, I suggest that the name should be just "zarr". If you want to add a zarr API based reader in the future it could be controlled by adding a key e.g. "reader" to "from_source". E.g.

from_source("zarr", ..., reader="xarray")
from_source("zarr", ..., reader="zarr")

What do you think?

@Oisin-M
Copy link
Copy Markdown
Contributor Author

Oisin-M commented Jun 13, 2025

Hi @sandorkertesz, I'd be happy to move towards that as long as we return objects with the same methods for both readers

@sandorkertesz
Copy link
Copy Markdown
Collaborator

Well that is the catch. Not sure what from_source("zarr", ..., reader="zarr") would return. But I still think that the first argument in from_source() should describe the source and not the way we read it.

@Oisin-M Oisin-M deleted the feature/zarr branch November 12, 2025 12:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants