Skip to content

feature: experimental gribjump source#689

Merged
sandorkertesz merged 58 commits into
ecmwf:developfrom
andreas-grafberger:feature/gribjump
Sep 24, 2025
Merged

feature: experimental gribjump source#689
sandorkertesz merged 58 commits into
ecmwf:developfrom
andreas-grafberger:feature/gribjump

Conversation

@andreas-grafberger
Copy link
Copy Markdown
Contributor

@andreas-grafberger andreas-grafberger commented May 6, 2025

Add experimental gribjump source

Introduction

This PR adds an experimental gribjump source to earthkit-data, allowing fast extraction of grid cells from GRIB messages in FDB using GribJump.


High-Level Code Approach

  • New Source: Implements src/earthkit/data/sources/gribjump.py with a gribjump source.
    • Accepts one of ranges, indices, or mask (enforced: exactly one) to define the extracted points.
    • Optional coords_from_fdb flag fetches coordinates by loading a single reference field from FDB.
    • Wraps pygribjump extraction logic and returns a SimpleFieldList supporting .to_numpy(), .to_xarray(), and limited selection/grouping.
  • Documentation:
    • Adds/updates sections in docs/guide/sources.rst and docs/install.rst.
    • New example notebook in docs/examples/gribjump.ipynb.
  • Testing:
    • Tests in tests/sources/test_gribjump.py

Still Missing / To-Do

  • Dependency: Add pygribjump as a dependency, currently still blocked
  • CI Integration: Ensure CI runners can run GribJump tests
  • Performance: Optimize indices/mask-to-range conversion for large/multi-field extractions.

Potential Future Improvements

  • ⚠️ Grid Validation: Optional checking that indices match field grids (currently user’s responsibility; see warnings).
  • Performance: Optimize indices/mask-to-range conversion for large/multi-field extractions.
  • Lazy Loading & Caching: Allow efficient lazy-loading and caching of selected subsets (source.sel(...).to_xarray() still triggers all fields to be loaded)
  • Metadata: Enrich the limited user-supplied metadata with more information (e.g. by converting parameter ids to their short names automatically). pymetkit / covjsonkit could be useful here.
  • Verbose: Hide verbose gribjump extraction progress output
  • API: More robust .sel()/.isel()/grouping.
  • Thread safety: Current implementation is not thread-safe
  • Error Handling: Friendlier messages for misconfigurations (e.g. missing data in FDB)
  • Tests: Add cases for pickling, parallel queries, unusual grids.
  • Examples: Add more real-world/multi-field/coordinate extraction examples.

Caveats & Warnings

  • Experimental: API may change.
  • No Grid Validation – USER RESPONSIBILITY: There is no built-in validation that the specified indices/ranges/masks actually match the grid of the selected GRIB fields. It is the user's responsibility to ensure the indices align with the data grid. If they do not, they may silently obtain incorrect or nonsensical results with no error or warning from the library.**
    To bypass grid checks (needed for this feature), one must set GRIBJUMP_IGNORE_GRID.
  • Selection Limitation: Selection/grouping only works if criteria match request dict types.
  • Dependencies: Requires both pygribjump and pyfdb, plus a working FDB and GribJump setup.

@FussyDuck
Copy link
Copy Markdown

FussyDuck commented May 6, 2025

CLA assistant check
All committers have signed the CLA.

Support for to_xarray, robust selection, efficient loading of subsets, tests, etc. still missing.

This approach is inspired by FieldlistFromDicts and GribFieldListInMemory and I suspect
this approach is most in line with how similar problems have been dealt with in earthkit-data
by other contributors.

The main issues I encountered are the following:

The extracted output of GribJump is not an actual Field but an array with raw,
extracted values from different grid cells in the field. Therefore, most of the
existing abstractions and utilities do not fit this scenario.  After trying out
a few different things myself, I found the "list-of-dicts" source which
theoretically also supports non-field values (numpy arrays with arbitrary
metadata in dictionary form). I am still a bit concerned that breaking
abstractions here in our case could be more confusing and bug-prone in the
future.

Inspired by GribFieldListInMemory, the FieldExtractList wraps the methods of
its superclass SimpleFieldList that make it an iterator to lazily load the data
using gribjump once it's actually needed. However, this might be a bit brittle
as currently implemented and could be improved.
I might remove the type hints as I noticed that they aren't used anywhere else in the codebase.
@ChrisspyB
Copy link
Copy Markdown
Member

Out of curiosity, does earthkit current support a pyfdb datasource? I suspect not, but if so it will likely have a lot in common with this.

@andreas-grafberger
Copy link
Copy Markdown
Contributor Author

Out of curiosity, does earthkit current support a pyfdb datasource? I suspect not, but if so it will likely have a lot in common with this.

It does actually :) Earthkit-data already implements an fdb source which allows easy loading of fields from an fdb using ekd.from_source("fdb", request). In its implementation, it can just easily open a file stream for the request using pyfdb.FDB.retrieve, mutate into a StreamSource, and all the Grib-specific logic like the to_xarray/to_numpy/sel methods can then be handled by grib-specific readers in earthkit.data.readers.grib.

Just for future reference and some context:
From what I see, we cannot easily reuse most of that code for our use-case. Also, my current understanding is that the actual metadata that is required to create an xarray dataset comes from each grib message itself. So unlike the FDB source, we either have to make use of whatever the user gives us (e.g. the request dictionary itself or possibly some additional, optional metadata), or we might need to consider loading this manually ourselves from the FDB. The latter approach seems like it could become quite messy though and possibly problematic performance-wise (depending on how easy it is to efficiently read metadata from grib messages without loading the entire message into memory). In offline discussions we decided against going down this road for now but maybe this will change at some point.

@sandorkertesz
Copy link
Copy Markdown
Collaborator

The fdb source can now be loaded lazily, see #677. It retrieves the first field as a template and with it it is able to generate most of the metadata for all the fields without retrieving them. It is still experimental.

@sandorkertesz
Copy link
Copy Markdown
Collaborator

@andreas-grafberger

I think the current implementation makes perfect sense. The most important aspect is to get the high level interface via from_source() right, the rest can be refactored later without changing any existing user code.

Representing the resulting data as a “list-of-dicts” fieldlist is the best option at the moment even if the geography is not available. I am quite happy to accept and merge this PR. Improvements can be done in separate issues/PRs.

Further thoughts:

  • all the missing metadata (bar the geography) can be added to the fields using a similar technique as in the lazy loaded FDB source
  • this fieldlist should be convertible into Xarray (we can do it for spectral data where no real geopgraphy is available, so it should work for the GribJump fieldlist too). It would require changes in the UserMetadata and Xarray engine code.
  • it would be very useful to improve order_by() with parameter type specification (e.g. use numbers or str), but it is a generic issue and not specific to GribJump
  • would be nice to see how you would like to use the GribJump fieldlists, i.e. what are the concrete use-cases

Before, we immediately casted all values in the users'
request dictionary to strings since GribJump mandates that for its requests.
However, this meant that also each field's metadata would solely have string values,
making any filtering only work with strings. This commit only does this
casting for the actual dictionaries passed to GribJump.

Example:

request = {"a": [1, 2]}

from_source("gribjump", request).sel(a=1) # -> empty

from_source("gribjump", request).sel(a=1) # -> fields for {"a": 1}
@corentincarton
Copy link
Copy Markdown
Contributor

@sandorkertesz, thanks for your help and comments!

As discussed with @andreas-grafberger, it's probably best to start working on a small notebook and see how it looks like in terms of outputs.
It would be interesting to see for a resulting fieldlists "fields":

  • The output of fields.ls()
  • The structure (dim, coords, metadata) of the xarray dataset/array from fields.to_xarray()
    As use cases, I would consider the following:
  • A list of points provided by the user (representing stations or gauges) -> list of indices GribJump interface
  • A mask defined by the user (subregion, catchment, etc.) -> numpy mask input GribJump interface
    We can then work on how to extend the metadata with other information: geographic (latitude/longitude, grid), user provided coordinates of the output field, etc.

Recently, SimpleFieldList.to_xarray was extended to allow also user-supplied
metadata dictionaries to be used for xarray dataset construction.
This commit removes the marking of FieldExtractList.to_xarray as
not implemented, allowing that class to use the implementation from
SimpleFieldList.
@andreas-grafberger
Copy link
Copy Markdown
Contributor Author

Thanks a lot @sandorkertesz for the hints and merging !701, which makes the xarray conversion much easier for us!
@corentincarton I've added a very short notebook showing the usage and output of .ls(), .sel(), and .to_xarray() in this gist: https://gist.github.com/andreas-grafberger/3d5a92d2991bf770d64de6b38394693c.

I think this is looking quite promising now. There's some small tweaks I want to look into, which we discussed, like adding a coordinate dimension for different points, and possibly adding some support so users can check if the underlying field has the expected grid.

@sandorkertesz One problem I ran into (and this is loosely related to our plans to extend the metadata with more precise geo information and allowing users to further pass on additional metadata) is that e.g. re-forecast datasets which use an hdate to represent the base date of the forecast reference time instead of the date keyword cause problems.
Since the UserMetadata uses the date keyword by default and ignores the hdate, the base_datetime and forecast_reference_time are wrong, causing the source.ls() output to be wrong and the source.to_xarray() call to fail:

request = {
    "class": "ce",
    "expver": "0001",
    "date": "20230101",
    "step": [6],
    "time": ["0000", "0600"],
    "hdate": ["20200101", "20200102"],
    ...
}
source = ekd.from_source('gribjump', request, mask=station_mask)

image

source.to_xarray()

File [/.../earthkit-data/src/earthkit/data/utils/xarray/check.py:156](.../earthkit-data/src/earthkit/data/utils/xarray/check.py#line=155), in CubeChecker.check(self, details)
    153 for k, v in md.items():
    154     text_meta += f"{k}:\n {v}\n"
--> 156 raise ValueError(f"{text_num}{text_dims}{text_ns}{text_meta}")

ValueError: Input does not form a full hypercube. Expected number of fields based on the dimensions does not match actual number of fields: 2 != 4.
Dimensions: 
 forecast_reference_time (2) ['2023-01-01T00:00:00', '2023-01-01T06:00:00']
First difference from the expected dimensions occurs at field[1] for dimension "forecast_reference_time". Expected 2023-01-01T06:00:00 got 2023-01-01T00:00:00.
Comparing field[1] with field[0], the following metadata difference was found:

Field[1] metadata:
dims:
 {'forecast_reference_time': '2023-01-01T00:00:00'}
namespace=mars:
 {}
namespace=ls:
 {}
namespace=time:
 {}
namespace=vertical:
 {}
namespace=geography:
 {}

@sandorkertesz I'l look into how we could solve this on our e.g. by allowing users to rename certain metadata keys, but since this could come up also for others who use the UserMetadata class, I'll try to see if there's a way to handle this more generally in the UserMetadata class directly.

@sandorkertesz
Copy link
Copy Markdown
Collaborator

sandorkertesz commented May 22, 2025

@andreas-grafberger, I think this is a generic problem and would even be the same for a GRIB fieldlist data.

However, you can build your own time dimensions in the Xarray engine and you should be able to use hdate. See this example:

https://earthkit-data.readthedocs.io/en/latest/examples/xarray_engine_seasonal.html

Please also check the to_xarray() docs: https://earthkit-data.readthedocs.io/en/latest/_api/data/readers/grib/index/index.html#data.readers.grib.index.GribFieldList.to_xarray

I wonder if it helps. If it cannot be used please share the full request with me so that I can investigate the problem in detail.

@sandorkertesz
Copy link
Copy Markdown
Collaborator

sandorkertesz commented May 22, 2025

Most possibly date/time for GRIB data with hdate is not handles correctly in earthkiy-data. If it is the case it has to be fixed in a separate issue.

@andreas-grafberger
Copy link
Copy Markdown
Contributor Author

Thanks @sandorkertesz for the pointers! I can confirm that passing source.to_xarray(dim_roles={"date": "hdate"}) fixes the issue for the xarray creation itself. So the following code snippet here runs through for me now.

request = {
    "class": "ce",
    "date": "20230101",
    "time": ["0000", "0600"],
    "hdate": ["20250101", "20250102"],
    ...
}
source = ekd.from_source('gribjump', request, mask=mask)
ds = source.to_xarray(dim_roles={"date": "hdate"})
ds

<xarray.Dataset> Size: 7GB
Dimensions:     (hdate_time: 2, values: 7446075)
Coordinates:
  * hdate_time  (hdate_time) <U19 9kB '2025-01-01T00:00:00' ... '2025-02-02T0...
Dimensions without coordinates: values
Data variables:
    240023      (hdate_time, values) float64 7GB ...
Attributes: (12/13)
    param:        240023
    class:        ce
    hdate:        20250101
    time:         0000
...

As you also mentioned, this doesn't fix the issue for source.ls() but I'm happy to deal with this in a separate issue as a follow up item. I'll still need to do a bit of reading to understand how the hdate is actually used in different contexts (if there is a difference).

Behavior for MARS source
Just FYI though, I can confirm that the MARS source deals with the hdate/date difference in my dataset correctly.

request = {
    "class": "ce",
    "date": "20230101",
    "step": 6,
    "time": ["0000", "0600"],
    "hdate": ["0000", "0600"],
    ...
}
mars_source = ekd.from_source('mars', request)

As can be seen in this screenshot, the "dataDate" column of the ls() output shows the hdate, as expected by users. And also the forecast_reference_time dimension is created correctly in the to_xarray() method based on the hdate + time + step.

image

@andreas-grafberger
Copy link
Copy Markdown
Contributor Author

andreas-grafberger commented May 23, 2025

Quick follow up on my previous statement here:

I'll still need to do a bit of reading to understand how the hdate is actually used in different contexts (if there is a difference).

I'm e.g. not sure if the following here would be simplistic and break other use-cases:

diff --git a/src/earthkit/data/utils/metadata/dict.py b/src/earthkit/data/utils/metadata/dict.py
index a598d8c..36af84d 100644
--- a/src/earthkit/data/utils/metadata/dict.py
+++ b/src/earthkit/data/utils/metadata/dict.py
@@ -416,6 +416,11 @@ class UserMetadata(Metadata):
             v = to_datetime(v)
             return v

+
+        v = self._datetime("hdate", "time")
+        if v is not None:
+            return v
+
         v = self._datetime("date", "time")
         if v is not None:
             return v

@sandorkertesz
Copy link
Copy Markdown
Collaborator

@andreas-grafberger, in the case of the MARS request you have GRIB files and ecCodes generates the right dataDate, dataTime, etc. for you.

In UserMetadata we should replicate all this logic. UserMetadata is very simple at the moment only providing support for some basic use-cases. Ideally it should be based on format independent metadata, which does not exist yet, but there is an ongoing effort to formulate it. So it is a much wider problem.

Addressing it in a separate issue is a very good idea (since it not strictly related to your PR).

When users create an xarray dataset for the values retrieved via GribJump, we
think it is important for users to explicitly attach the indices from the
original grid to each returned value.

Unfortunately, this is currently not easy to do directly via the UserMetadata /
UserGeomery, which is why we manually add this information in
FieldExtractList.to_xarray for now. This should be cleaned up and likely should
be natively supported in earthkit-data. Also, although returning the index
in the flattened, original grid works for now, we likely want to rather
include the origin (x, y) index and ultimately (lat, lon) coordinates instead
(or additionaly).

The change in pygribjump that enables this lives in [this
PR](ecmwf/gribjump#60).
…p.ExtractionRequest and original fdb request dict

This makes the code a bit easier to read since we don't have to track both as separate lists
@andreas-grafberger andreas-grafberger self-assigned this Jul 17, 2025
Comment thread src/earthkit/data/sources/gribjump.py Outdated
Comment thread src/earthkit/data/sources/gribjump.py
Comment thread src/earthkit/data/sources/gribjump.py
Comment thread src/earthkit/data/sources/gribjump.py Outdated
Comment thread tests/sources/test_gribjump.py Outdated
Comment thread tests/sources/test_gribjump.py Outdated
Comment thread tests/sources/test_gribjump.py Outdated
Comment thread tests/sources/test_gribjump.py Outdated
Comment thread src/earthkit/data/sources/gribjump.py Outdated
Comment thread src/earthkit/data/sources/gribjump.py Outdated
At the time of this commit, there is only one available version for
pygribjump available on pypi, which is a pre-release version
(0.10.3.dev20250827). Before merging, it should be discussed whether
this is acceptable or if we want to wait until a stable release of
pygribjump is available.
@andreas-grafberger
Copy link
Copy Markdown
Contributor Author

With 8203df0, I now also added pygribjump as an optional dependency. At this current point in time though, there is only a pre-release version of pygribjump available on PyPi (https://pypi.org/project/pygribjump/#history). When I tested this, I saw that pip's current behavior is that it will install this pre-release package even if --pre is not set if there is no other "stable" wheel available.

According to PEP 440, this should also be the desired behavior:

Pre-releases of any kind, including developmental releases, are implicitly excluded from all version specifiers, unless they are already present on the system, explicitly requested by the user, or if the only available version that satisfies the version specifier is a pre-release.

FYI @sandorkertesz; If this behavior is fine for you, the only missing part of this PR would now be making sure that the added tests are run automatically as part of the CI. I briefly searched around a bit and the logs of https://github.com/ecmwf/earthkit-data/actions/runs/17317447618/job/49162998643 make me think that this might not be trivial since the FDB tests are also not run in the CI? I'll take a look at this next.

Comment thread docs/guide/sources.rst
@sandorkertesz
Copy link
Copy Markdown
Collaborator

@andreas-grafberger, thank you for this impressive development! Once the branch is synced with develop it can be merged.

@andreas-grafberger
Copy link
Copy Markdown
Contributor Author

andreas-grafberger commented Sep 23, 2025

@andreas-grafberger, thank you for this impressive development! Once the branch is synced with develop it can be merged.

@sandorkertesz Thank you! I just merged develop into this branch. When it comes to testing, I manually ran the tests I added in this PR but not the entire testsuite. Should adding the approved-for-ci label be enough to run the entire test-suite including downstream tests even when this PR's source branch lives in a fork?

@sandorkertesz sandorkertesz added the approved-for-ci Approved to run CI on ECMWF machines label Sep 24, 2025
@sandorkertesz
Copy link
Copy Markdown
Collaborator

Yes, the approved-for-ci label triggers the downstream CI. I am not sure if you are able to add it to a PR from a fork.

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 29.37853% with 125 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.18%. Comparing base (a312ba5) to head (c5fa528).

Files with missing lines Patch % Lines
tests/sources/test_gribjump.py 29.37% 124 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #689      +/-   ##
===========================================
- Coverage    85.91%   85.18%   -0.73%     
===========================================
  Files          176      177       +1     
  Lines        13534    13711     +177     
  Branches       678      680       +2     
===========================================
+ Hits         11628    11680      +52     
- Misses        1716     1840     +124     
- Partials       190      191       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sandorkertesz sandorkertesz merged commit fc7f702 into ecmwf:develop Sep 24, 2025
132 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved-for-ci Approved to run CI on ECMWF machines contributor

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants