
Add NOAA MRMS CONUS hourly precipitation analysis dataset #472

Open
aldenks wants to merge 16 commits into main from
claude/implement-noaa-mrms-conus-oYYud

Conversation

@aldenks aldenks commented Feb 27, 2026

Summary

This PR adds support for the NOAA Multi-Radar Multi-Sensor (MRMS) CONUS hourly precipitation analysis dataset. The implementation includes data ingestion, reformatting, and operational update capabilities for multiple precipitation-related variables from MRMS.

Towards #461, closes #473

Key Changes

  • Template Configuration (template_config.py): Defines the dataset structure with 4 data variables (precipitation_surface, precipitation_pass_1_surface, precipitation_radar_only_surface, and categorical_precipitation_type_surface) covering the Continental US at 0.01° resolution with hourly frequency from October 2014 onwards.

  • Region Job Implementation (region_job.py): Implements data processing logic including:

    • Source file coordinate generation with support for multiple MRMS product versions
    • Handling of MRMS v12.0 transition (October 2020) with fallback to pre-v12 products (GaugeCorr_QPE_01H) for historical data
    • Download support from three sources: AWS S3, Iowa Mesonet archive, and NCEP
    • GRIB2 decompression and rasterio-based data extraction
    • Deaccumulation of hourly precipitation accumulations to rates
    • Processing region buffering for proper deaccumulation without gaps
  • Dataset Class (dynamical_dataset.py): Provides the main dataset interface with:

    • Operational update scheduling (every 3 hours via Kubernetes CronJob)
    • Data validation pipeline
    • Support for both backfill and incremental updates
  • Zarr Templates: Complete Zarr v3 metadata templates for all coordinates and data variables with optimized chunking and compression (zstd + blosc).

  • Tests: Comprehensive test coverage including:

    • Source file coordinate URL generation for different sources and time periods
    • Product version selection logic (v12 vs pre-v12)
    • Processing region buffering behavior
    • End-to-end backfill and operational update workflows
    • Data validation
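The deaccumulation step listed above could be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the function name is hypothetical, and the exact differencing scheme is assumed from the PR's statements that hourly accumulations are converted to rates and that the first output timestep is NaN because it has no prior timestep to difference against.

```python
import numpy as np

SECONDS_PER_HOUR = 3600.0

def deaccumulate_to_rate(accum: np.ndarray) -> np.ndarray:
    # Difference successive accumulations along time (axis 0) and convert
    # mm per hour to mm/s (equivalent to kg m-2 s-1 for water). The first
    # timestep has no prior value to difference against, so it stays NaN.
    rates = np.full_like(accum, np.nan, dtype=float)
    rates[1:] = np.diff(accum, axis=0) / SECONDS_PER_HOUR
    return rates

# Toy series: 3 hourly timesteps over a 2x2 grid of accumulated precip (mm).
accum = np.array([
    [[0.0, 0.0], [0.0, 0.0]],
    [[3.6, 0.0], [7.2, 0.0]],
    [[7.2, 0.0], [7.2, 3.6]],
])
rates = deaccumulate_to_rate(accum)
assert np.all(np.isnan(rates[0]))  # first timestep is NaN by construction
```

This also shows why a buffered processing region matters: computing the rate at the first hour of a region requires the accumulation from the hour before it.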

Notable Implementation Details

  • Version-aware product selection: Automatically selects appropriate MRMS product names based on data timestamp, with pre-v12 products only available via Iowa Mesonet archive
  • Pass 1 availability: Pass 1 precipitation data only available from October 2020 onwards; earlier requests are skipped
  • Deaccumulation handling: First timestep in output is NaN due to deaccumulation requiring a prior timestep
  • Spatial reference: Uses IAU 1965 spheroid (native to MRMS) with sub-pixel difference from WGS84 at 0.01° resolution
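The version-aware product selection described above could look something like this sketch. The function name and exact cutoff date are assumptions (the PR only says "October 2020"); the product and source names come from the PR text.

```python
import pandas as pd

# Approximate v12.0 transition; the PR states only "October 2020".
MRMS_V12_START = pd.Timestamp("2020-10-01")

def select_product(time: pd.Timestamp) -> tuple[str, str]:
    # Pre-v12 products (GaugeCorr_QPE_01H) are only available via the
    # Iowa Mesonet archive; v12+ MultiSensor products come from AWS S3.
    if time < MRMS_V12_START:
        return "GaugeCorr_QPE_01H", "iowa_mesonet"
    return "MultiSensor_QPE_01H_Pass2", "aws_s3"

print(select_product(pd.Timestamp("2015-06-01")))
```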

https://claude.ai/code/session_01JD7KMBFUaoUjEYcNa6VhtK


TODOs

  • make sure integration tests aren't too slow / don't OOM
  • do a small run to:
    • check actual compressed size and update chunks/shards if needed
    • check resource requirements and adjust


data = region_job.read_data(updated_coord, radar_var)
assert data.shape == (3500, 7000)
assert not np.all(np.isnan(data))

change all the not np.all(np.isnan(...)) checks in all tests in this PR to assert that the values are all finite. the only NaNs should be at the very first timestep of the entire dataset.
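The distinction this review asks for can be seen on a synthetic array (the array names are illustrative, not from the PR):

```python
import numpy as np

# Array that is almost entirely NaN -- the weak check still passes.
mostly_nan = np.full((4, 5), np.nan)
mostly_nan[0, 0] = 2.0
assert not np.all(np.isnan(mostly_nan))  # weak: passes despite 19/20 NaNs

# The stricter all-finite check fails on that array...
assert not np.all(np.isfinite(mostly_nan))
# ...and passes only when every value is a real, finite number.
clean = np.full((4, 5), 1.5)
assert np.all(np.isfinite(clean))
```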

Comment on lines +229 to +230
common_keys = set(template_attrs) & set(file_attrs)
for key in common_keys:

assert "spatial_ref" in common_key and "crs_wkt" in common_keys so we know its not empty



@pytest.mark.slow
def test_single_file_integration(tmp_path: Path) -> None:

we have download_file and read_data tests in region_job_test.py. remove this test and move just the download + crs/spatial coords test part of it to template_config_test.py

@aldenks aldenks marked this pull request as ready for review February 27, 2026 18:11
claude and others added 13 commits March 1, 2026 21:17
Implement the noaa-mrms-conus-analysis-hourly dataset with:
- Three data sources: Iowa Mesonet (pre-v12), AWS S3 (primary), NCEP (fallback)
- Four variables: precipitation_surface, precipitation_pass_1_surface,
  precipitation_radar_only_surface, categorical_precipitation_type_surface
- Deaccumulation of QPE accumulations to precipitation rates
- MRMS v12.0 product discontinuity handling (GaugeCorr_QPE → MultiSensor_QPE)
- Gzip-compressed GRIB2 source file support
- Template, tests, and dataset registration

https://claude.ai/code/session_01JD7KMBFUaoUjEYcNa6VhtK
Add test_single_file_integration that downloads a real MRMS file from S3,
reads all template variables, and verifies GRIB lat/lon and CRS attributes
match template dimension_coordinates and spatial_ref.

Also update attribution to include NOAA NCEP as a source.

https://claude.ai/code/session_01JD7KMBFUaoUjEYcNa6VhtK
- Change time chunks from 72 to 720 (30 days), shards from 2160 to 720
- Change lat/lon shards from 4x to 10x chunk size
- Remove early return for existing decompressed files to handle retry
  of corrupt downloads

https://claude.ai/code/session_01JD7KMBFUaoUjEYcNa6VhtK
- Change `not np.all(np.isnan(...))` to `np.all(np.isfinite(...))` in region_job_test.py
- Assert spatial_ref and crs_wkt keys exist in common_keys before comparing
- Move CRS/spatial coordinate validation from dynamical_dataset_test.py to template_config_test.py
- Remove test_single_file_integration (download/read coverage already in region_job_test.py)

https://claude.ai/code/session_01JD7KMBFUaoUjEYcNa6VhtK
Monkeypatches _get_template to .sel() on the time dimension, reducing
the template size for the integration test. Print statements for
snapshot capture still present - will be replaced with assertions.

https://claude.ai/code/session_01JD7KMBFUaoUjEYcNa6VhtK
Process only 2 hours for backfill + 1 for update instead of 3+1,
and replace print statements with assert_allclose/assert_array_equal
snapshot checks at a point with meaningful data (snow, non-zero precip).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rd boundary test

- Override time shard encoding to size 2 so the operational update naturally
  crosses a shard boundary, testing deaccumulation buffering without a separate test
- Replace write_shards to only write the first spatial shard (1 of 8), cutting
  write time by ~87%
- Combined test: 69s+14s(failing) → 10s(passing), full suite: 86s → 13s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Iowa Mesonet filenames don't use the MRMS_ prefix that S3 and NCEP use.
e.g. GaugeCorr_QPE_01H_00.00_... not MRMS_GaugeCorr_QPE_01H_00.00_...

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Some pre-v12 Iowa Mesonet MRMS files contain a duplicate GRIB message
encoded with standard meteorological discipline (0) alongside the
MRMS-specific discipline (209). Read band 1 in this case after asserting
band 2 has the expected discipline, keeping the original assertion for
all other multi-band cases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Shared memory was 65.7GB (over limit). New config:
- time chunk/shard: 720h→648h (30→27 days), shared memory 59.1GB
- spatial chunk: 175×175→100×100 (1.75°→1°), ~1.2MB compressed at 5%
- spatial shard: 1750×1750→700×1400 (5×5 shards, geographically square)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After reducing lat/lon shard sizes from 1750×1750 to 700×1400 in
c5a7c4f, the test's first-shard assertion was reading into unwritten
shards, causing assert_no_nulls to fail on NaN fill values.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@aldenks aldenks force-pushed the claude/implement-noaa-mrms-conus-oYYud branch from df700e8 to b4754ab on March 2, 2026 02:18
    data_var_group: Sequence[NoaaMrmsDataVar],
) -> Sequence[NoaaMrmsSourceFileCoord]:
    times = pd.to_datetime(processing_region_ds["time"].values)
    data_var = data_var_group[0]

assert len(data_var_group) == 1

class NoaaMrmsSourceFileCoord(SourceFileCoord):
    time: Timestamp
    product: str
    level: str = "00.00"

remove default value

Comment on lines +96 to +98
    product = internal.mrms_product_pre_v12
else:
    product = internal.mrms_product

add mrms_fallback_products_pre_v12 and mrms_fallback_products. these should be empty tuples for all variables in the template config except precipitation_surface. add a new fallback_products: tuple[str, ...] on the source file coord and fill it in here.

for precipitation surface:
mrms_fallback_products = ("MultiSensor_QPE_01H_Pass1", "RadarOnly_QPE_01H")
mrms_fallback_products_pre_v12 = ("RadarOnly_QPE_01H",)

except FileNotFoundError:
    if coord.time > (pd.Timestamp.now() - pd.Timedelta(hours=12)):
        return self._download_from_source(coord, source="ncep")
    raise

update this logic to the following after adding the fallback product attributes to the MRMS data var internal attrs and the MRMS source file coord.

from reformatters.common.pydantic import replace

is_pre_v12 = coord.time < MRMS_V12_START
is_recent = coord.time > (pd.Timestamp.now() - pd.Timedelta(hours=12))

if is_pre_v12:
    sources = ["iowa"]
elif is_recent:
    sources = ["s3", "nomads"]
else:
    sources = ["s3"]

products = [coord.product, *coord.fallback_products]

last_exception: Exception | None = None
for product in products:
    for source in sources:
        try:
            return self._download_from_source(replace(coord, product=product), source=source)
        except FileNotFoundError as e:
            last_exception = e
            continue

assert last_exception is not None
raise last_exception

Comment on lines +133 to +139
# Some pre-v12 Iowa Mesonet files have a duplicate GRIB message with
# standard meteorological discipline (0) alongside the MRMS-specific one (209).
# Band 1 (discipline 209) is always the authoritative MRMS data.
band2_discipline = reader.tags(2).get("GRIB_DISCIPLINE", "")
assert band2_discipline == "0(Meteorological)", (
    f"Expected band 2 GRIB_DISCIPLINE '0(Meteorological)', found '{band2_discipline}' in {coord.downloaded_path}"
)

rather than expect a specific order do this:

if reader.count == 2 and coord.time < MRMS_V12_START: find the band with grib discipline 209 and use that. assert that a band with 209 exists if we are in the count == 2 and pre v12 case. set a rasterio_band = n variable in both if/else branches and then use that in the reader.read call.
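The band-selection logic suggested here could be sketched as below. The reader interface mimics rasterio's `count`/`tags(band)`; the `FakeReader` class, the exact `MRMS_V12_START` date, and the discipline tag string for 209 are illustrative assumptions, not values from the PR.

```python
import pandas as pd

MRMS_V12_START = pd.Timestamp("2020-10-01")  # approximate; PR says October 2020

def select_rasterio_band(reader, time: pd.Timestamp) -> int:
    # Pre-v12 Iowa Mesonet files can contain a duplicate GRIB message with
    # standard discipline 0 alongside the MRMS-specific 209. Pick the 209
    # band by inspecting tags rather than assuming a fixed band order.
    if reader.count == 2 and time < MRMS_V12_START:
        mrms_bands = [
            b for b in range(1, reader.count + 1)
            if reader.tags(b).get("GRIB_DISCIPLINE", "").startswith("209")
        ]
        assert mrms_bands, "expected a band with GRIB_DISCIPLINE 209 in pre-v12 two-band file"
        return mrms_bands[0]
    assert reader.count == 1, f"unexpected band count {reader.count}"
    return 1

# Stand-in for a rasterio DatasetReader over a two-band pre-v12 file,
# deliberately ordered with the MRMS message second.
class FakeReader:
    count = 2
    def tags(self, band: int) -> dict:
        return {"GRIB_DISCIPLINE": "209(Reserved)" if band == 2 else "0(Meteorological)"}

print(select_rasterio_band(FakeReader(), pd.Timestamp("2015-06-01")))  # -> 2
```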

mrms_product: str
# Pre-v12 product name on Iowa Mesonet (e.g. GaugeCorr_QPE_01H for precipitation_surface)
mrms_product_pre_v12: str | None = None
mrms_level: str = "00.00"

remove the default value and explicitly set this in all data variables in this template

long_name="Precipitation rate",
units="kg m-2 s-1",
step_type="avg",
comment="Average precipitation rate over the previous hour. Derived from MultiSensor_QPE_01H_Pass2 from October 2020, GaugeCorr_QPE_01H before. Units equivalent to mm/s.",

Suggested change
comment="Average precipitation rate over the previous hour. Derived from MultiSensor_QPE_01H_Pass2 from October 2020, GaugeCorr_QPE_01H before. Units equivalent to mm/s.",
comment="Average precipitation rate over the previous hour. Derived from MultiSensor_QPE_01H_Pass2 from October 2020, GaugeCorr_QPE_01H before. If primary product is unavailable, falls back to MultiSensor_QPE_01H_Pass1 and then RadarOnly_QPE_01H. Units equivalent to mm/s.",

Copilot AI added a commit that referenced this pull request Mar 2, 2026
Co-authored-by: aldenks <463484+aldenks@users.noreply.github.com>
Copilot AI and others added 2 commits March 2, 2026 10:44
…lbacks, and robust pre-v12 band selection (#480)

* Initial plan

* Implement PR #472 feedback for MRMS fallback products and band selection

Co-authored-by: aldenks <463484+aldenks@users.noreply.github.com>

* Default MRMS fallback tuples and remove explicit empty assignments

Co-authored-by: aldenks <463484+aldenks@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: aldenks <463484+aldenks@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

MRMS implementation, backfill, validation, and updating

3 participants