feature: experimental gribjump source#689
Conversation
Support for to_xarray, robust selection, efficient loading of subsets, tests, etc. still missing. This approach is inspired by FieldlistFromDicts and GribFieldListInMemory and I suspect this approach is most in line with how similar problems have been dealt with in earthkit-data by other contributors. The main issues I encountered are the following: The extracted output of GribJump is not an actual Field but an array with raw, extracted values from different grid cells in the field. Therefore, most of the existing abstractions and utilities do not fit this scenario. After trying out a few different things myself, I found the "list-of-dicts" source which theoretically also supports non-field values (numpy arrays with arbitrary metadata in dictionary form). I am still a bit concerned that breaking abstractions here in our case could be more confusing and bug-prone in the future. Inspired by GribFieldListInMemory, the FieldExtractList wraps the methods of its superclass SimpleFieldList that make it an iterator to lazily load the data using gribjump once it's actually needed. However, this might be a bit brittle as currently implemented and could be improved.
I might remove the type hints as I noticed that they aren't used anywhere else in the codebase.
93f4889 to
5d2bb85
Compare
|
Out of curiosity, does earthkit current support a pyfdb datasource? I suspect not, but if so it will likely have a lot in common with this. |
It does actually :) Earthkit-data already implements an fdb source which allows easy loading of fields from an fdb using Just for future reference and some context: |
|
The |
|
I think the current implementation makes perfect sense. The most important aspect is to get the high level interface via Representing the resulting data as a “list-of-dicts” fieldlist is the best option at the moment even if the geography is not available. I am quite happy to accept and merge this PR. Improvements can be done in separate issues/PRs. Further thoughts:
|
Before, we immediately casted all values in the users'
request dictionary to strings since GribJump mandates that for its requests.
However, this meant that also each field's metadata would solely have string values,
making any filtering only work with strings. This commit only does this
casting for the actual dictionaries passed to GribJump.
Example:
request = {"a": [1, 2]}
from_source("gribjump", request).sel(a=1) # -> empty
from_source("gribjump", request).sel(a=1) # -> fields for {"a": 1}
|
@sandorkertesz, thanks for your help and comments! As discussed with @andreas-grafberger, it's probably best to start working on a small notebook and see how it looks like in terms of outputs.
|
Recently, SimpleFieldList.to_xarray was extended to allow also user-supplied metadata dictionaries to be used for xarray dataset construction. This commit removes the marking of FieldExtractList.to_xarray as not implemented, allowing that class to use the implementation from SimpleFieldList.
|
Thanks a lot @sandorkertesz for the hints and merging !701, which makes the xarray conversion much easier for us! I think this is looking quite promising now. There's some small tweaks I want to look into, which we discussed, like adding a coordinate dimension for different points, and possibly adding some support so users can check if the underlying field has the expected grid. @sandorkertesz One problem I ran into (and this is loosely related to our plans to extend the metadata with more precise geo information and allowing users to further pass on additional metadata) is that e.g. re-forecast datasets which use an request = {
"class": "ce",
"expver": "0001",
"date": "20230101",
"step": [6],
"time": ["0000", "0600"],
"hdate": ["20200101", "20200102"],
...
}
source = ekd.from_source('gribjump', request, mask=station_mask)@sandorkertesz I'l look into how we could solve this on our e.g. by allowing users to rename certain metadata keys, but since this could come up also for others who use the |
|
@andreas-grafberger, I think this is a generic problem and would even be the same for a GRIB fieldlist data. However, you can build your own time dimensions in the Xarray engine and you should be able to use hdate. See this example: https://earthkit-data.readthedocs.io/en/latest/examples/xarray_engine_seasonal.html Please also check the to_xarray() docs: https://earthkit-data.readthedocs.io/en/latest/_api/data/readers/grib/index/index.html#data.readers.grib.index.GribFieldList.to_xarray I wonder if it helps. If it cannot be used please share the full request with me so that I can investigate the problem in detail. |
|
Most possibly date/time for GRIB data with hdate is not handles correctly in earthkiy-data. If it is the case it has to be fixed in a separate issue. |
|
Thanks @sandorkertesz for the pointers! I can confirm that passing request = {
"class": "ce",
"date": "20230101",
"time": ["0000", "0600"],
"hdate": ["20250101", "20250102"],
...
}
source = ekd.from_source('gribjump', request, mask=mask)
ds = source.to_xarray(dim_roles={"date": "hdate"})
ds
<xarray.Dataset> Size: 7GB
Dimensions: (hdate_time: 2, values: 7446075)
Coordinates:
* hdate_time (hdate_time) <U19 9kB '2025-01-01T00:00:00' ... '2025-02-02T0...
Dimensions without coordinates: values
Data variables:
240023 (hdate_time, values) float64 7GB ...
Attributes: (12/13)
param: 240023
class: ce
hdate: 20250101
time: 0000
...As you also mentioned, this doesn't fix the issue for Behavior for MARS source request = {
"class": "ce",
"date": "20230101",
"step": 6,
"time": ["0000", "0600"],
"hdate": ["0000", "0600"],
...
}
mars_source = ekd.from_source('mars', request)As can be seen in this screenshot, the "dataDate" column of the |
|
Quick follow up on my previous statement here:
I'm e.g. not sure if the following here would be simplistic and break other use-cases: |
|
@andreas-grafberger, in the case of the MARS request you have GRIB files and ecCodes generates the right dataDate, dataTime, etc. for you. In UserMetadata we should replicate all this logic. UserMetadata is very simple at the moment only providing support for some basic use-cases. Ideally it should be based on format independent metadata, which does not exist yet, but there is an ongoing effort to formulate it. So it is a much wider problem. Addressing it in a separate issue is a very good idea (since it not strictly related to your PR). |
When users create an xarray dataset for the values retrieved via GribJump, we think it is important for users to explicitly attach the indices from the original grid to each returned value. Unfortunately, this is currently not easy to do directly via the UserMetadata / UserGeomery, which is why we manually add this information in FieldExtractList.to_xarray for now. This should be cleaned up and likely should be natively supported in earthkit-data. Also, although returning the index in the flattened, original grid works for now, we likely want to rather include the origin (x, y) index and ultimately (lat, lon) coordinates instead (or additionaly). The change in pygribjump that enables this lives in [this PR](ecmwf/gribjump#60).
…p.ExtractionRequest and original fdb request dict This makes the code a bit easier to read since we don't have to track both as separate lists
At the time of this commit, there is only one available version for pygribjump available on pypi, which is a pre-release version (0.10.3.dev20250827). Before merging, it should be discussed whether this is acceptable or if we want to wait until a stable release of pygribjump is available.
|
With 8203df0, I now also added According to PEP 440, this should also be the desired behavior: FYI @sandorkertesz; If this behavior is fine for you, the only missing part of this PR would now be making sure that the added tests are run automatically as part of the CI. I briefly searched around a bit and the logs of https://github.com/ecmwf/earthkit-data/actions/runs/17317447618/job/49162998643 make me think that this might not be trivial since the FDB tests are also not run in the CI? I'll take a look at this next. |
Implements PR suggestions from ecmwf#689 (comment).
|
@andreas-grafberger, thank you for this impressive development! Once the branch is synced with develop it can be merged. |
@sandorkertesz Thank you! I just merged |
|
Yes, the |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #689 +/- ##
===========================================
- Coverage 85.91% 85.18% -0.73%
===========================================
Files 176 177 +1
Lines 13534 13711 +177
Branches 678 680 +2
===========================================
+ Hits 11628 11680 +52
- Misses 1716 1840 +124
- Partials 190 191 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|


Add experimental
gribjumpsourceIntroduction
This PR adds an experimental
gribjumpsource to earthkit-data, allowing fast extraction of grid cells from GRIB messages in FDB using GribJump.High-Level Code Approach
src/earthkit/data/sources/gribjump.pywith agribjumpsource.ranges,indices, ormask(enforced: exactly one) to define the extracted points.coords_from_fdbflag fetches coordinates by loading a single reference field from FDB.SimpleFieldListsupporting.to_numpy(),.to_xarray(), and limited selection/grouping.docs/guide/sources.rstanddocs/install.rst.docs/examples/gribjump.ipynb.tests/sources/test_gribjump.pyStill Missing / To-Do
Potential Future Improvements
source.sel(...).to_xarray()still triggers all fields to be loaded).sel()/.isel()/grouping.Caveats & Warnings
indices/ranges/masksactually match the grid of the selected GRIB fields. It is the user's responsibility to ensure the indices align with the data grid. If they do not, they may silently obtain incorrect or nonsensical results with no error or warning from the library.**To bypass grid checks (needed for this feature), one must set
GRIBJUMP_IGNORE_GRID.pygribjumpandpyfdb, plus a working FDB and GribJump setup.