Feature/zarr by Oisin-M · Pull Request #675 · ecmwf/earthkit-data

Oisin-M · 2025-04-09T14:42:48Z

Implements a simple zarr source for reading zarr files (only version 2 at the moment)

Oisin-M · 2025-04-09T15:14:38Z

Tests are failing because zarr version 3 requires Python version 3.11 onward

sandorkertesz · 2025-04-11T06:21:16Z

Hi @Oisin-M, thank you very much for this development. I had a quick look at the code and it seems to me that the "zarr" source is not really needed.

chpolste · 2025-04-11T09:28:54Z

For datasets on the local filesystem, yes. But the source can also access URLs and other locations supported through xarray.

The url source doesn't handle zarrs properly at the moment because they are directories and can't be cached easily. We have to look into the dataset first to know which files are on the server. And I'd argue that most of the time we wouldn't want to eagerly download and cache the dataset anyway, because the on-demand, chunk-based access that xarray supports for remote zarrs is really useful. We can create an exception in the url source, but other sources then need that too.

The question "is this location a zarr dataset?" is not easy to answer by the zarr reader's detection and will depend on the source. Starting with os.isdir works with a file source but not for a URL, etc. I think we can take a hint here from xarray's choice to have a separate open_zarr in addition to open_dataset (though that also supports engine="zarr").

In some aspects, zarr is an API (see, e.g., kerchunk and virtualizarr, as far as I understand these) and so matches the source paradigm of ek-data as well as the reader.

sandorkertesz · 2025-04-22T08:59:19Z

For datasets on the local filesystem, yes. But the source can also access URLs and other locations supported through xarray.

The url source doesn't handle zarrs properly at the moment because they are directories and can't be cached easily. We have to look into the dataset first to know which files are on the server. And I'd argue that most of the time we wouldn't want to eagerly download and cache the dataset anyway, because the on-demand, chunk-based access that xarray supports for remote zarrs is really useful. We can create an exception in the url source, but other sources then need that too.

The question "is this location a zarr dataset?" is not easy to answer by the zarr reader's detection and will depend on the source. Starting with os.isdir works with a file source but not for a URL, etc. I think we can take a hint here from xarray's choice to have a separate open_zarr in addition to open_dataset (though that also supports engine="zarr").

In some aspects, zarr is an API (see, e.g., kerchunk and virtualizarr, as far as I understand these) and so matches the source paradigm of ek-data as well as the reader.

@chpolste, thank you for the explanation. I agree, this should be a source, just like e.g. "opendap" .

sandorkertesz

Thank you for adding this new source! Once it is in sync with the latest develop it can be merged.

Oisin-M · 2025-05-26T14:42:32Z

@sandorkertesz synced with develop now. Just to flag

zarr 3 needs python >= 3.11
I added zarr to the pip install .[all] install, so it will complain about not finding a version if using an older python version
downstream-ci is failing because it uses python 3.10

If you think it's better, can remove from the .[all] for now

sandorkertesz · 2025-06-02T14:43:26Z

@Oisin-M unfortunately we cannot afford disabling all the tests due to this dependency. So I suggest you remove the zarr dependency from environment-unit-tests.yml and from "all" in pyproject.toml. It will be re-enabled once the CI python version is updated.

…into feature/zarr

codecov-commenter · 2025-06-02T16:42:59Z

Codecov Report

❌ Patch coverage is 50.00000% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.92%. Comparing base (75f086e) to head (bc17b5e).
⚠️ Report is 328 commits behind head on develop.

Files with missing lines	Patch %	Lines
tests/sources/test_zarr.py	50.00%	7 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #675      +/-   ##
===========================================
- Coverage    90.98%   90.92%   -0.06%     
===========================================
  Files          167      168       +1     
  Lines        12663    12679      +16     
  Branches       613      614       +1     
===========================================
+ Hits         11521    11529       +8     
- Misses         960      967       +7     
- Partials       182      183       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Oisin-M · 2025-06-02T16:48:31Z

Hi @sandorkertesz, thanks for taking a look. I have now made zarr optional and also made the zarr test automatically skip if it can't find the zarr package, so tests are all passing

sandorkertesz · 2025-06-10T14:10:58Z

@Oisin-M , I have a question about the source name. What was the reason of choosing "xarray_zarr" and not simply "zarr"? E.g. the "opendap" source is also using xarray to open the data but it does not have "xarray" in the name.

Oisin-M · 2025-06-11T08:49:40Z

Hi @sandorkertesz, this was to leave room for having a reader based on the zarr API directly

sandorkertesz · 2025-06-11T11:29:57Z

@Oisin-M, for simplicity, I suggest that the name should be just "zarr". If you want to add a zarr API based reader in the future it could be controlled by adding a key e.g. "reader" to "from_source". E.g.

from_source("zarr", ..., reader="xarray")
from_source("zarr", ..., reader="zarr")

What do you think?

Oisin-M · 2025-06-13T08:08:38Z

Hi @sandorkertesz, I'd be happy to move towards that as long as we return objects with the same methods for both readers

sandorkertesz · 2025-06-13T08:44:24Z

Well that is the catch. Not sure what from_source("zarr", ..., reader="zarr") would return. But I still think that the first argument in from_source() should describe the source and not the way we read it.

chpolste and others added 6 commits April 9, 2025 14:43

Add minimal zarr source and reader

2c17b7c

small basic test for zarr source

acff17e

Access via XArrayFieldList instead of XArrayDatasetWrapper

8cd6857

clean up magic checking for directories

fae6dbb

pass qa

6e8ce0f

update pyproject and envs

1bc9f77

sandorkertesz approved these changes Apr 23, 2025

View reviewed changes

Oisin-M added 5 commits May 26, 2025 12:50

Merge branch 'develop' into feature/zarr

bc2136f

rename source zarr to xarray-zarr and support zarr2 and 3

1d905fa

bugfix default func arg

5f633f6

bugfix zarr test

f5e30d3

add zarr to pyproject

b81feec

Merge branch 'develop' into feature/zarr

67e7bd6

Oisin-M added 3 commits June 2, 2025 18:27

make zarr optional

3dfaded

skip zarr test if zarr not installed

4982957

Merge branch 'feature/zarr' of https://github.com/ecmwf/earthkit-data …

bc17b5e

…into feature/zarr

sandorkertesz merged commit 1eae98b into develop Jun 3, 2025
245 of 246 checks passed

Oisin-M deleted the feature/zarr branch November 12, 2025 12:53

Conversation

Oisin-M commented Apr 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Oisin-M commented Apr 9, 2025

Uh oh!

sandorkertesz commented Apr 11, 2025

Uh oh!

chpolste commented Apr 11, 2025

Uh oh!

sandorkertesz commented Apr 22, 2025

Uh oh!

sandorkertesz left a comment

Choose a reason for hiding this comment

Uh oh!

Oisin-M commented May 26, 2025

Uh oh!

sandorkertesz commented Jun 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Jun 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Oisin-M commented Jun 2, 2025

Uh oh!

Uh oh!

sandorkertesz commented Jun 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Oisin-M commented Jun 11, 2025

Uh oh!

sandorkertesz commented Jun 11, 2025

Uh oh!

Oisin-M commented Jun 13, 2025

Uh oh!

sandorkertesz commented Jun 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Oisin-M commented Apr 9, 2025 •

edited

Loading

sandorkertesz commented Jun 2, 2025 •

edited

Loading

codecov-commenter commented Jun 2, 2025 •

edited

Loading

sandorkertesz commented Jun 10, 2025 •

edited

Loading