feature: experimental gribjump source by andreas-grafberger · Pull Request #689 · ecmwf/earthkit-data

andreas-grafberger · 2025-05-06T07:13:53Z

Add experimental `gribjump` source

Introduction

This PR adds an experimental gribjump source to earthkit-data, allowing fast extraction of grid cells from GRIB messages in FDB using GribJump.

High-Level Code Approach

New Source: Implements src/earthkit/data/sources/gribjump.py with a gribjump source.
- Accepts one of ranges, indices, or mask (enforced: exactly one) to define the extracted points.
- Optional coords_from_fdb flag fetches coordinates by loading a single reference field from FDB.
- Wraps pygribjump extraction logic and returns a SimpleFieldList supporting .to_numpy(), .to_xarray(), and limited selection/grouping.
Documentation:
- Adds/updates sections in docs/guide/sources.rst and docs/install.rst.
- New example notebook in docs/examples/gribjump.ipynb.
Testing:
- Tests in tests/sources/test_gribjump.py

Still Missing / To-Do

Dependency: Add pygribjump as a dependency, currently still blocked
CI Integration: Ensure CI runners can run GribJump tests
Performance: Optimize indices/mask-to-range conversion for large/multi-field extractions.

Potential Future Improvements

⚠️ Grid Validation: Optional checking that indices match field grids (currently user’s responsibility; see warnings).
Performance: Optimize indices/mask-to-range conversion for large/multi-field extractions.
Lazy Loading & Caching: Allow efficient lazy-loading and caching of selected subsets (source.sel(...).to_xarray() still triggers all fields to be loaded)
Metadata: Enrich the limited user-supplied metadata with more information (e.g. by converting parameter ids to their short names automatically). pymetkit / covjsonkit could be useful here.
Verbose: Hide verbose gribjump extraction progress output
API: More robust .sel()/.isel()/grouping.
Thread safety: Current implementation is not thread-safe
Error Handling: Friendlier messages for misconfigurations (e.g. missing data in FDB)
Tests: Add cases for pickling, parallel queries, unusual grids.
Examples: Add more real-world/multi-field/coordinate extraction examples.

Caveats & Warnings

Experimental: API may change.
No Grid Validation – USER RESPONSIBILITY: There is no built-in validation that the specified indices/ranges/masks actually match the grid of the selected GRIB fields. It is the user's responsibility to ensure the indices align with the data grid. If they do not, they may silently obtain incorrect or nonsensical results with no error or warning from the library.**
To bypass grid checks (needed for this feature), one must set GRIBJUMP_IGNORE_GRID.
Selection Limitation: Selection/grouping only works if criteria match request dict types.
Dependencies: Requires both pygribjump and pyfdb, plus a working FDB and GribJump setup.

FussyDuck · 2025-05-06T07:13:59Z

All committers have signed the CLA.

Support for to_xarray, robust selection, efficient loading of subsets, tests, etc. still missing. This approach is inspired by FieldlistFromDicts and GribFieldListInMemory and I suspect this approach is most in line with how similar problems have been dealt with in earthkit-data by other contributors. The main issues I encountered are the following: The extracted output of GribJump is not an actual Field but an array with raw, extracted values from different grid cells in the field. Therefore, most of the existing abstractions and utilities do not fit this scenario. After trying out a few different things myself, I found the "list-of-dicts" source which theoretically also supports non-field values (numpy arrays with arbitrary metadata in dictionary form). I am still a bit concerned that breaking abstractions here in our case could be more confusing and bug-prone in the future. Inspired by GribFieldListInMemory, the FieldExtractList wraps the methods of its superclass SimpleFieldList that make it an iterator to lazily load the data using gribjump once it's actually needed. However, this might be a bit brittle as currently implemented and could be improved.

I might remove the type hints as I noticed that they aren't used anywhere else in the codebase.

ChrisspyB · 2025-05-06T10:44:14Z

Out of curiosity, does earthkit current support a pyfdb datasource? I suspect not, but if so it will likely have a lot in common with this.

andreas-grafberger · 2025-05-06T11:34:11Z

Out of curiosity, does earthkit current support a pyfdb datasource? I suspect not, but if so it will likely have a lot in common with this.

It does actually :) Earthkit-data already implements an fdb source which allows easy loading of fields from an fdb using ekd.from_source("fdb", request). In its implementation, it can just easily open a file stream for the request using pyfdb.FDB.retrieve, mutate into a StreamSource, and all the Grib-specific logic like the to_xarray/to_numpy/sel methods can then be handled by grib-specific readers in earthkit.data.readers.grib.

Just for future reference and some context:
From what I see, we cannot easily reuse most of that code for our use-case. Also, my current understanding is that the actual metadata that is required to create an xarray dataset comes from each grib message itself. So unlike the FDB source, we either have to make use of whatever the user gives us (e.g. the request dictionary itself or possibly some additional, optional metadata), or we might need to consider loading this manually ourselves from the FDB. The latter approach seems like it could become quite messy though and possibly problematic performance-wise (depending on how easy it is to efficiently read metadata from grib messages without loading the entire message into memory). In offline discussions we decided against going down this road for now but maybe this will change at some point.

sandorkertesz · 2025-05-14T15:44:47Z

The fdb source can now be loaded lazily, see #677. It retrieves the first field as a template and with it it is able to generate most of the metadata for all the fields without retrieving them. It is still experimental.

sandorkertesz · 2025-05-15T08:30:03Z

@andreas-grafberger

I think the current implementation makes perfect sense. The most important aspect is to get the high level interface via from_source() right, the rest can be refactored later without changing any existing user code.

Representing the resulting data as a “list-of-dicts” fieldlist is the best option at the moment even if the geography is not available. I am quite happy to accept and merge this PR. Improvements can be done in separate issues/PRs.

Further thoughts:

all the missing metadata (bar the geography) can be added to the fields using a similar technique as in the lazy loaded FDB source
this fieldlist should be convertible into Xarray (we can do it for spectral data where no real geopgraphy is available, so it should work for the GribJump fieldlist too). It would require changes in the UserMetadata and Xarray engine code.
it would be very useful to improve order_by() with parameter type specification (e.g. use numbers or str), but it is a generic issue and not specific to GribJump
would be nice to see how you would like to use the GribJump fieldlists, i.e. what are the concrete use-cases

Before, we immediately casted all values in the users' request dictionary to strings since GribJump mandates that for its requests. However, this meant that also each field's metadata would solely have string values, making any filtering only work with strings. This commit only does this casting for the actual dictionaries passed to GribJump. Example: request = {"a": [1, 2]} from_source("gribjump", request).sel(a=1) # -> empty from_source("gribjump", request).sel(a=1) # -> fields for {"a": 1}

corentincarton · 2025-05-15T15:39:20Z

@sandorkertesz, thanks for your help and comments!

As discussed with @andreas-grafberger, it's probably best to start working on a small notebook and see how it looks like in terms of outputs.
It would be interesting to see for a resulting fieldlists "fields":

The output of fields.ls()
The structure (dim, coords, metadata) of the xarray dataset/array from fields.to_xarray()
As use cases, I would consider the following:
A list of points provided by the user (representing stations or gauges) -> list of indices GribJump interface
A mask defined by the user (subregion, catchment, etc.) -> numpy mask input GribJump interface
We can then work on how to extend the metadata with other information: geographic (latitude/longitude, grid), user provided coordinates of the output field, etc.

Recently, SimpleFieldList.to_xarray was extended to allow also user-supplied metadata dictionaries to be used for xarray dataset construction. This commit removes the marking of FieldExtractList.to_xarray as not implemented, allowing that class to use the implementation from SimpleFieldList.

andreas-grafberger · 2025-05-20T10:16:19Z

Thanks a lot @sandorkertesz for the hints and merging !701, which makes the xarray conversion much easier for us!
@corentincarton I've added a very short notebook showing the usage and output of .ls(), .sel(), and .to_xarray() in this gist: https://gist.github.com/andreas-grafberger/3d5a92d2991bf770d64de6b38394693c.

I think this is looking quite promising now. There's some small tweaks I want to look into, which we discussed, like adding a coordinate dimension for different points, and possibly adding some support so users can check if the underlying field has the expected grid.

@sandorkertesz One problem I ran into (and this is loosely related to our plans to extend the metadata with more precise geo information and allowing users to further pass on additional metadata) is that e.g. re-forecast datasets which use an hdate to represent the base date of the forecast reference time instead of the date keyword cause problems.
Since the UserMetadata uses the date keyword by default and ignores the hdate, the base_datetime and forecast_reference_time are wrong, causing the source.ls() output to be wrong and the source.to_xarray() call to fail:

request = {
    "class": "ce",
    "expver": "0001",
    "date": "20230101",
    "step": [6],
    "time": ["0000", "0600"],
    "hdate": ["20200101", "20200102"],
    ...
}
source = ekd.from_source('gribjump', request, mask=station_mask)

source.to_xarray()

File [/.../earthkit-data/src/earthkit/data/utils/xarray/check.py:156](.../earthkit-data/src/earthkit/data/utils/xarray/check.py#line=155), in CubeChecker.check(self, details)
    153 for k, v in md.items():
    154     text_meta += f"{k}:\n {v}\n"
--> 156 raise ValueError(f"{text_num}{text_dims}{text_ns}{text_meta}")

ValueError: Input does not form a full hypercube. Expected number of fields based on the dimensions does not match actual number of fields: 2 != 4.
Dimensions: 
 forecast_reference_time (2) ['2023-01-01T00:00:00', '2023-01-01T06:00:00']
First difference from the expected dimensions occurs at field[1] for dimension "forecast_reference_time". Expected 2023-01-01T06:00:00 got 2023-01-01T00:00:00.
Comparing field[1] with field[0], the following metadata difference was found:

Field[1] metadata:
dims:
 {'forecast_reference_time': '2023-01-01T00:00:00'}
namespace=mars:
 {}
namespace=ls:
 {}
namespace=time:
 {}
namespace=vertical:
 {}
namespace=geography:
 {}

@sandorkertesz I'l look into how we could solve this on our e.g. by allowing users to rename certain metadata keys, but since this could come up also for others who use the UserMetadata class, I'll try to see if there's a way to handle this more generally in the UserMetadata class directly.

sandorkertesz · 2025-05-22T10:34:30Z

@andreas-grafberger, I think this is a generic problem and would even be the same for a GRIB fieldlist data.

However, you can build your own time dimensions in the Xarray engine and you should be able to use hdate. See this example:

https://earthkit-data.readthedocs.io/en/latest/examples/xarray_engine_seasonal.html

Please also check the to_xarray() docs: https://earthkit-data.readthedocs.io/en/latest/_api/data/readers/grib/index/index.html#data.readers.grib.index.GribFieldList.to_xarray

I wonder if it helps. If it cannot be used please share the full request with me so that I can investigate the problem in detail.

sandorkertesz · 2025-05-22T10:54:15Z

Most possibly date/time for GRIB data with hdate is not handles correctly in earthkiy-data. If it is the case it has to be fixed in a separate issue.

andreas-grafberger · 2025-05-23T11:46:51Z

Thanks @sandorkertesz for the pointers! I can confirm that passing source.to_xarray(dim_roles={"date": "hdate"}) fixes the issue for the xarray creation itself. So the following code snippet here runs through for me now.

request = {
    "class": "ce",
    "date": "20230101",
    "time": ["0000", "0600"],
    "hdate": ["20250101", "20250102"],
    ...
}
source = ekd.from_source('gribjump', request, mask=mask)
ds = source.to_xarray(dim_roles={"date": "hdate"})
ds

<xarray.Dataset> Size: 7GB
Dimensions:     (hdate_time: 2, values: 7446075)
Coordinates:
  * hdate_time  (hdate_time) <U19 9kB '2025-01-01T00:00:00' ... '2025-02-02T0...
Dimensions without coordinates: values
Data variables:
    240023      (hdate_time, values) float64 7GB ...
Attributes: (12/13)
    param:        240023
    class:        ce
    hdate:        20250101
    time:         0000
...

As you also mentioned, this doesn't fix the issue for source.ls() but I'm happy to deal with this in a separate issue as a follow up item. I'll still need to do a bit of reading to understand how the hdate is actually used in different contexts (if there is a difference).

Behavior for MARS source
Just FYI though, I can confirm that the MARS source deals with the hdate/date difference in my dataset correctly.

request = {
    "class": "ce",
    "date": "20230101",
    "step": 6,
    "time": ["0000", "0600"],
    "hdate": ["0000", "0600"],
    ...
}
mars_source = ekd.from_source('mars', request)

As can be seen in this screenshot, the "dataDate" column of the ls() output shows the hdate, as expected by users. And also the forecast_reference_time dimension is created correctly in the to_xarray() method based on the hdate + time + step.

andreas-grafberger · 2025-05-23T12:01:38Z

Quick follow up on my previous statement here:

I'll still need to do a bit of reading to understand how the hdate is actually used in different contexts (if there is a difference).

I'm e.g. not sure if the following here would be simplistic and break other use-cases:

diff --git a/src/earthkit/data/utils/metadata/dict.py b/src/earthkit/data/utils/metadata/dict.py
index a598d8c..36af84d 100644
--- a/src/earthkit/data/utils/metadata/dict.py
+++ b/src/earthkit/data/utils/metadata/dict.py
@@ -416,6 +416,11 @@ class UserMetadata(Metadata):
             v = to_datetime(v)
             return v

+
+        v = self._datetime("hdate", "time")
+        if v is not None:
+            return v
+
         v = self._datetime("date", "time")
         if v is not None:
             return v

sandorkertesz · 2025-05-27T12:50:01Z

@andreas-grafberger, in the case of the MARS request you have GRIB files and ecCodes generates the right dataDate, dataTime, etc. for you.

In UserMetadata we should replicate all this logic. UserMetadata is very simple at the moment only providing support for some basic use-cases. Ideally it should be based on format independent metadata, which does not exist yet, but there is an ongoing effort to formulate it. So it is a much wider problem.

Addressing it in a separate issue is a very good idea (since it not strictly related to your PR).

When users create an xarray dataset for the values retrieved via GribJump, we think it is important for users to explicitly attach the indices from the original grid to each returned value. Unfortunately, this is currently not easy to do directly via the UserMetadata / UserGeomery, which is why we manually add this information in FieldExtractList.to_xarray for now. This should be cleaned up and likely should be natively supported in earthkit-data. Also, although returning the index in the flattened, original grid works for now, we likely want to rather include the origin (x, y) index and ultimately (lat, lon) coordinates instead (or additionaly). The change in pygribjump that enables this lives in [this PR](ecmwf/gribjump#60).

…p.ExtractionRequest and original fdb request dict This makes the code a bit easier to read since we don't have to track both as separate lists

…e information

…JUMP_HOME

At the time of this commit, there is only one available version for pygribjump available on pypi, which is a pre-release version (0.10.3.dev20250827). Before merging, it should be discussed whether this is acceptable or if we want to wait until a stable release of pygribjump is available.

andreas-grafberger · 2025-08-29T09:16:27Z

With 8203df0, I now also added pygribjump as an optional dependency. At this current point in time though, there is only a pre-release version of pygribjump available on PyPi (https://pypi.org/project/pygribjump/#history). When I tested this, I saw that pip's current behavior is that it will install this pre-release package even if --pre is not set if there is no other "stable" wheel available.

According to PEP 440, this should also be the desired behavior:

Pre-releases of any kind, including developmental releases, are implicitly excluded from all version specifiers, unless they are already present on the system, explicitly requested by the user, or if the only available version that satisfies the version specifier is a pre-release.

FYI @sandorkertesz; If this behavior is fine for you, the only missing part of this PR would now be making sure that the added tests are run automatically as part of the CI. I briefly searched around a bit and the logs of https://github.com/ecmwf/earthkit-data/actions/runs/17317447618/job/49162998643 make me think that this might not be trivial since the FDB tests are also not run in the CI? I'll take a look at this next.

Implements PR suggestions from ecmwf#689 (comment).

sandorkertesz · 2025-09-23T09:21:18Z

@andreas-grafberger, thank you for this impressive development! Once the branch is synced with develop it can be merged.

andreas-grafberger · 2025-09-23T16:11:48Z

@andreas-grafberger, thank you for this impressive development! Once the branch is synced with develop it can be merged.

@sandorkertesz Thank you! I just merged develop into this branch. When it comes to testing, I manually ran the tests I added in this PR but not the entire testsuite. Should adding the approved-for-ci label be enough to run the entire test-suite including downstream tests even when this PR's source branch lives in a fork?

sandorkertesz · 2025-09-24T08:27:49Z

Yes, the approved-for-ci label triggers the downstream CI. I am not sure if you are able to add it to a PR from a fork.

codecov-commenter · 2025-09-24T08:37:10Z

Codecov Report

❌ Patch coverage is 29.37853% with 125 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.18%. Comparing base (a312ba5) to head (c5fa528).

Files with missing lines	Patch %	Lines
tests/sources/test_gribjump.py	29.37%	124 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #689      +/-   ##
===========================================
- Coverage    85.91%   85.18%   -0.73%     
===========================================
  Files          176      177       +1     
  Lines        13534    13711     +177     
  Branches       678      680       +2     
===========================================
+ Hits         11628    11680      +52     
- Misses        1716     1840     +124     
- Partials       190      191       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions Bot added the contributor label May 6, 2025

andreas-grafberger added 6 commits May 6, 2025 09:32

wip: first rough draft of a GribJumpSource

545e65d

wip: experimental tests for easier development

134086f

format changes using pre-commit hooks and fix small bug

bb467a2

tests: add a few more simple tests and add NO_GRIBJUMP flag for pytest

824421c

tidy: small cleanup, improve variable naming and fix type hints

5d2bb85

I might remove the type hints as I noticed that they aren't used anywhere else in the codebase.

andreas-grafberger force-pushed the feature/gribjump branch from 93f4889 to 5d2bb85 Compare May 6, 2025 07:32

Merge branch 'develop' into feature/gribjump

6fab49d

andreas-grafberger added 2 commits May 16, 2025 12:41

Merge remote-tracking branch 'origin/develop' into feature/gribjump

ee15824

andreas-grafberger mentioned this pull request May 28, 2025

Recognize reforecast / hindcast metadata correctly in UserMetadata #714

Closed

andreas-grafberger added 5 commits June 3, 2025 09:35

Merge branch 'develop' into feature/gribjump

0f34de5

tidy: add some more error handling and improve tests

e274ae7

refactor: introduce ExtractionRequest wrapper that combines pygribjum…

711da54

…p.ExtractionRequest and original fdb request dict This makes the code a bit easier to read since we don't have to track both as separate lists

feat(test): modify (now failing) test to expect latitude and longitud…

9c31181

…e information

andreas-grafberger self-assigned this Jul 17, 2025

andreas-grafberger added 2 commits July 17, 2025 12:55

docs: improve wording of warning

7eb1dac

Merge remote-tracking branch 'origin/develop' into feature/gribjump

1d543c8

ChrisspyB reviewed Jul 25, 2025

View reviewed changes

andreas-grafberger added 7 commits July 25, 2025 11:42

fix: allow fdb and gribjump to be configured via FDB5_CONFIG and GRIB…

e373e14

…JUMP_HOME

chore: update docstring

90b1f95

feat: pass log context to gribjump

d6b506a

refactor: simplify gribjump log context

a806a71

test: add t_gribjump.grib test data with expver xxxx

23afbfc

Merge remote-tracking branch 'origin/develop' into feature/gribjump

91f3f3e

andreas-grafberger added 8 commits September 16, 2025 21:12

Merge remote-tracking branch 'origin/develop' into feature/gribjump

e158e2b

docs: clarify gribjump install instructions and dependency handling

85d9513

docs: move warning before parameters section

e3a68d4

docs: clarify parameter description and types

c03c3ee

add pyfdb as a gribjump group dependency and update docs

972b97b

last docs and typo fixes

8d031d3

Merge remote-tracking branch 'origin/develop' into feature/gribjump

be5269f

docs: change notebook to also set FDB_HOME

f2c5f0e

sandorkertesz reviewed Sep 19, 2025

View reviewed changes

Comment thread docs/guide/sources.rst

docs: reference gribjump example notebook in missing locations

88bca26

Implements PR suggestions from ecmwf#689 (comment).

Merge remote-tracking branch 'origin/develop' into feature/gribjump

c5fa528

sandorkertesz added the approved-for-ci Approved to run CI on ECMWF machines label Sep 24, 2025

sandorkertesz merged commit fc7f702 into ecmwf:develop Sep 24, 2025
132 checks passed

Oisin-M mentioned this pull request Sep 30, 2025

faster extraction using gribjump ecmwf/hat#67

Closed

Conversation

andreas-grafberger commented May 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Add experimental gribjump source

Introduction

High-Level Code Approach

Still Missing / To-Do

Potential Future Improvements

Caveats & Warnings

Uh oh!

FussyDuck commented May 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ChrisspyB commented May 6, 2025

Uh oh!

andreas-grafberger commented May 6, 2025

Uh oh!

sandorkertesz commented May 14, 2025

Uh oh!

sandorkertesz commented May 15, 2025

Uh oh!

corentincarton commented May 15, 2025

Uh oh!

andreas-grafberger commented May 20, 2025

Uh oh!

sandorkertesz commented May 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sandorkertesz commented May 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andreas-grafberger commented May 23, 2025

Uh oh!

andreas-grafberger commented May 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sandorkertesz commented May 27, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

andreas-grafberger commented Aug 29, 2025

Uh oh!

Uh oh!

sandorkertesz commented Sep 23, 2025

Uh oh!

andreas-grafberger commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sandorkertesz commented Sep 24, 2025

Uh oh!

codecov-commenter commented Sep 24, 2025

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

andreas-grafberger commented May 6, 2025 •

edited

Loading

Add experimental `gribjump` source

FussyDuck commented May 6, 2025 •

edited

Loading

sandorkertesz commented May 22, 2025 •

edited

Loading

sandorkertesz commented May 22, 2025 •

edited

Loading

andreas-grafberger commented May 23, 2025 •

edited

Loading

andreas-grafberger commented Sep 23, 2025 •

edited

Loading