Conversation
Implements the ECMWF AIFS (Artificial Intelligence Forecasting System) deterministic forecast product as a new dataset with 16 variables, 0.25° global grid, 6-hourly inits, 0-360h lead times. Handles the S3 path change (aifs/ -> aifs-single/) on 2025-02-26 and GRIB master table version differences for precipitation variables between early and current data.
aldenks
left a comment
There was a problem hiding this comment.
lol got nerd sniped into reviewing this. I do think we should do AIFS ENS first but after these comments its basically there
- Rename ecmwf/aifs to ecmwf/aifs_deterministic, update dataset_id - Use shared ecmwf_grib_index.py instead of custom byte-range parser - Support deterministic (non-ensemble) data in shared GRIB index parser - Set max_vars_per_download_group=10, handle multi-band GRIB reads - Use group_by() from common.iterating for source_groups - Remove skin_temperature_surface, rename to precipitable_water_atmosphere - Add replica storage config, update k8s resources to match GFS - Add comprehensive region_job unit tests and snapshot value assertions - Add cross-dataset consistency exceptions for ECMWF naming
Add guidance on: finding a reference dataset before coding, cross-dataset variable naming consistency, running the chunk/shard tool instead of estimating, shared provider utilities, minimum RegionJob test coverage, copying k8s resource values from similar datasets, replica storage setup, snapshot value assertions, and pre-submission cross-dataset checks.
|
@aldenks robot addressed your suggestions. lmk how you want to proceed? I'm not precious about the PR if you want to redo or start with ENS only. I'd lean towards doing the deterministic one in addition to ENS |
| # ECMWF AIFS uses ECMWF parameter names per CLAUDE.md conventions, | ||
| # while NOAA datasets use NOAA parameter names. | ||
| ( | ||
| "precipitable_water_atmosphere", | ||
| "short_name", | ||
| "ecmwf-aifs-deterministic-forecast-15-day-0-25-degree", | ||
| ), | ||
| ( | ||
| "precipitable_water_atmosphere", | ||
| "long_name", | ||
| "ecmwf-aifs-deterministic-forecast-15-day-0-25-degree", | ||
| ), |
There was a problem hiding this comment.
we do want these consistent across datasets, not excluded from the check
| "scripts/**/*.py" = [ | ||
| "INP001", # not a package | ||
| "T201", # print is expected in CLI scripts | ||
| "S108", # /tmp usage is fine for staging | ||
| ] |
There was a problem hiding this comment.
don't think we need this addition, this PR isn't modifying scripts
| "boto3>=1.42.49", | ||
| "cf-xarray>=0.10.10", | ||
| "ecmwf-api-client>=1.6.5", |
There was a problem hiding this comment.
remove, no need to add permanently
|
|
||
| t2m_updated = updated_ds.sel(latitude=0, longitude=0).temperature_2m | ||
| assert t2m_updated.shape == (2, 2) # 2 init_times x 2 lead_times | ||
| assert not np.all(np.isnan(t2m_updated.values)) |
There was a problem hiding this comment.
also snapshot complete temperature and precip values at the test point after the update. we should see the initial values from the backfill unchanged along with new values
| assert float(point_6h["precipitation_rate_surface"]) == pytest.approx( | ||
| 0.00164794921875 | ||
| ) |
There was a problem hiding this comment.
use numpy testing assert_allclose following conventions in other tests
src/reformatters/ecmwf/aifs_deterministic/forecast/template_config.py
Outdated
Show resolved
Hide resolved
src/reformatters/ecmwf/aifs_deterministic/forecast/template_config.py
Outdated
Show resolved
Hide resolved
src/reformatters/ecmwf/aifs_deterministic/forecast/template_config.py
Outdated
Show resolved
Hide resolved
| # ~4MB uncompressed, ~0.8MB compressed | ||
| var_chunks: dict[Dim, int] = { | ||
| "init_time": 4, # 1 day of 6-hourly data | ||
| "lead_time": 61, # All lead times | ||
| "latitude": 64, # 12 chunks over 721 pixels | ||
| "longitude": 64, # 23 chunks over 1440 pixels | ||
| } |
There was a problem hiding this comment.
this is too small and has too many init times. read the docs/chunk_shard_size_tool.md and use that tool to find a chunk+shard size set that meets the documented targets
- make spatial chunks larger
- exactly 1 init time per chunk and per shard
If stuck, see GFS forecast template config if you need examples, it has the same resolution and init frequency
|
@mrshll to your larger question -- i think we should do this before aifs because it's basically there! |
…nfig.py Co-authored-by: Alden Keefe Sampson <alden@dynamical.org>
Co-authored-by: Alden Keefe Sampson <alden@dynamical.org>
…nfig.py Co-authored-by: Alden Keefe Sampson <alden@dynamical.org>
…nfig.py Co-authored-by: Alden Keefe Sampson <alden@dynamical.org>
- Fix chunk/shard sizes: 1 init_time per chunk/shard, larger spatial chunks (240x240) per chunk_shard_size tool output - Make precipitable_water_atmosphere metadata consistent across datasets (short_name=pwat, long_name=Precipitable water) and remove stale CF compliance exceptions - Restructure dataset_integration_guide.md: remove ECMWF-specific examples, revert chunk/shard docs to simple wording, simplify storage config section, remove pre-submission checks (duplicate of AGENTS.md) - Use np.testing.assert_allclose in integration test and add post-update snapshot values for temperature_2m and precipitation_rate_surface - Remove ecmwf-api-client dev dependency and scripts ruff ignore rules - Regenerate zarr template https://claude.ai/code/session_01C5o4hepUqwYHeEeYk5rqjy
Summary
Review feedback addressed
ecmwf/aifstoecmwf/aifs_deterministic, update dataset_id toecmwf-aifs-deterministic-forecast-15-day-0-25-degreeecmwf_grib_index.pyinstead of custom byte-range parser; generalize it to support deterministic (non-ensemble) datamax_vars_per_download_group=10and handle multi-band GRIB reads via rasterio metadata matchinggroup_by()fromcommon.iteratingforsource_groupsskin_temperature_surfacevariable, renametotal_column_water_vapour_atmospheretoprecipitable_water_atmosphereEcmwfAifsIcechunkAwsOpenDataDatasetStorageConfig)25 */6 * * *with 30-minute deadlineTest plan
uv run pytest tests/ecmwf/aifs_deterministic/— all passuv run pytest tests/ecmwf/test_ecmwf_grib_index.py— all passuv run pytest tests/common/datasets_cf_compliance_test.py— all pass (excluding pre-existing NASA SMAP failure)uv run pytest tests/common/common_template_config_subclasses_test.py— all pass (excluding pre-existing NASA SMAP failure)uv run pytest -m slow tests/ecmwf/aifs_deterministic/— integration test passes with snapshot valuesuv run ruff format && uv run ruff check— cleanuv run ty check— all checks passed