Add NOAA MRMS CONUS hourly precipitation analysis dataset (#472)
Conversation
    data = region_job.read_data(updated_coord, radar_var)
    assert data.shape == (3500, 7000)
    assert not np.all(np.isnan(data))
Change all the `not np.all(np.isnan(...))` checks in all tests in this PR to assert that the values are all finite. The only NaNs should be at the very first timestep of the entire dataset.
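The suggested check can be sketched as a small helper; the helper name and the first-timestep flag are illustrative, not part of the PR:

```python
import numpy as np


def assert_all_finite(data: np.ndarray, is_first_timestep: bool = False) -> None:
    """Per the review: values should be all finite. NaNs are acceptable only
    at the very first timestep of the dataset (hypothetical helper name)."""
    if is_first_timestep:
        return  # deaccumulation has no prior hour here, so NaNs are expected
    assert np.all(np.isfinite(data)), "found NaN or inf outside the first timestep"
```

Unlike `not np.all(np.isnan(...))`, which passes as long as a single value is non-NaN, this fails on any NaN or inf.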
    common_keys = set(template_attrs) & set(file_attrs)
    for key in common_keys:
assert "spatial_ref" in common_key and "crs_wkt" in common_keys so we know its not empty
    @pytest.mark.slow
    def test_single_file_integration(tmp_path: Path) -> None:
We have download_file and read_data tests in region_job_test.py. Remove this test and move just the download + CRS/spatial coords part of it to template_config_test.py.
Implement the noaa-mrms-conus-analysis-hourly dataset with:
- Three data sources: Iowa Mesonet (pre-v12), AWS S3 (primary), NCEP (fallback)
- Four variables: precipitation_surface, precipitation_pass_1_surface, precipitation_radar_only_surface, categorical_precipitation_type_surface
- Deaccumulation of QPE accumulations to precipitation rates
- MRMS v12.0 product discontinuity handling (GaugeCorr_QPE → MultiSensor_QPE)
- Gzip-compressed GRIB2 source file support
- Template, tests, and dataset registration

https://claude.ai/code/session_01JD7KMBFUaoUjEYcNa6VhtK
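The accumulation-to-rate conversion mentioned above can be sketched as follows. This assumes each 01H QPE field is a one-hour accumulation in mm (equivalent to kg m-2), which the variable comments in this PR imply; the function name is illustrative:

```python
import numpy as np

SECONDS_PER_HOUR = 3600.0


def qpe_accumulation_to_rate(qpe_01h_mm: np.ndarray) -> np.ndarray:
    """Convert a 1-hour precipitation accumulation (mm, i.e. kg m-2)
    to the average rate over that hour (kg m-2 s-1, i.e. mm/s)."""
    return qpe_01h_mm / SECONDS_PER_HOUR
```

This matches the `units="kg m-2 s-1"` and "Units equivalent to mm/s" metadata used for the precipitation variables in this PR.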
Add test_single_file_integration that downloads a real MRMS file from S3, reads all template variables, and verifies GRIB lat/lon and CRS attributes match template dimension_coordinates and spatial_ref. Also update attribution to include NOAA NCEP as a source. https://claude.ai/code/session_01JD7KMBFUaoUjEYcNa6VhtK
- Change time chunks from 72 to 720 (30 days), shards from 2160 to 720
- Change lat/lon shards from 4x to 10x chunk size
- Remove early return for existing decompressed files to handle retry of corrupt downloads

https://claude.ai/code/session_01JD7KMBFUaoUjEYcNa6VhtK
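The retry-safe decompression described above can be sketched like this (function and path names are illustrative; the point is that an existing output file is overwritten rather than short-circuited):

```python
import gzip
import shutil
from pathlib import Path


def decompress_grib2_gz(src: Path, dest: Path) -> Path:
    """Decompress a gzip-compressed GRIB2 file. Deliberately no
    'dest already exists' early return, so a retry after a corrupt
    partial download overwrites the bad file instead of reusing it."""
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest
```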
- Change `not np.all(np.isnan(...))` to `np.all(np.isfinite(...))` in region_job_test.py
- Assert spatial_ref and crs_wkt keys exist in common_keys before comparing
- Move CRS/spatial coordinate validation from dynamical_dataset_test.py to template_config_test.py
- Remove test_single_file_integration (download/read coverage already in region_job_test.py)

https://claude.ai/code/session_01JD7KMBFUaoUjEYcNa6VhtK
Monkeypatches _get_template to .sel() on the time dimension, reducing the template size for the integration test. Print statements for snapshot capture still present - will be replaced with assertions. https://claude.ai/code/session_01JD7KMBFUaoUjEYcNa6VhtK
Process only 2 hours for backfill + 1 for update instead of 3+1, and replace print statements with assert_allclose/assert_array_equal snapshot checks at a point with meaningful data (snow, non-zero precip). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rd boundary test

- Override time shard encoding to size 2 so the operational update naturally crosses a shard boundary, testing deaccumulation buffering without a separate test
- Replace write_shards to only write the first spatial shard (1 of 8), cutting write time by ~87%
- Combined test: 69s+14s (failing) → 10s (passing), full suite: 86s → 13s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Iowa Mesonet filenames don't use the MRMS_ prefix that S3 and NCEP use. e.g. GaugeCorr_QPE_01H_00.00_... not MRMS_GaugeCorr_QPE_01H_00.00_... Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Some pre-v12 Iowa Mesonet MRMS files contain a duplicate GRIB message encoded with standard meteorological discipline (0) alongside the MRMS-specific discipline (209). Read band 1 in this case after asserting band 2 has the expected discipline, keeping the original assertion for all other multi-band cases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Shared memory was 65.7GB (over limit). New config:
- time chunk/shard: 720h→648h (30→27 days), shared memory 59.1GB
- spatial chunk: 175×175→100×100 (1.75°→1°), ~1.2MB compressed at 5%
- spatial shard: 1750×1750→700×1400 (5×5 shards, geographically square)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
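The compressed-chunk estimate above can be roughly reproduced, assuming float32 values and the 648-hour time chunk (the 5% compression ratio is the figure observed in the commit message):

```python
# Per-chunk size for the new spatial chunking (assumptions: float32, 648h time chunk)
time_chunk, lat_chunk, lon_chunk = 648, 100, 100
bytes_per_value = 4  # float32

uncompressed_bytes = time_chunk * lat_chunk * lon_chunk * bytes_per_value
compressed_bytes = uncompressed_bytes * 0.05  # ~5% compression ratio

print(f"{compressed_bytes / 1e6:.2f} MB")  # ~1.3 MB, in line with the ~1.2MB above
```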
After reducing lat/lon shard sizes from 1750×1750 to 700×1400 in c5a7c4f, the test's first-shard assertion was reading into unwritten shards, causing assert_no_nulls to fail on NaN fill values. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
        data_var_group: Sequence[NoaaMrmsDataVar],
    ) -> Sequence[NoaaMrmsSourceFileCoord]:
        times = pd.to_datetime(processing_region_ds["time"].values)
        data_var = data_var_group[0]
    class NoaaMrmsSourceFileCoord(SourceFileCoord):
        time: Timestamp
        product: str
        level: str = "00.00"
        product = internal.mrms_product_pre_v12
    else:
        product = internal.mrms_product
Add `mrms_fallback_products_pre_v12` and `mrms_fallback_products`. These should be empty tuples for all variables in the template config except precipitation_surface. Add a new `fallback_products: tuple[str, ...]` on the source file coord and fill it in here.

For precipitation_surface:

    mrms_fallback_products = ("MultiSensor_QPE_01H_Pass1", "RadarOnly_QPE_01H")
    mrms_fallback_products_pre_v12 = ("RadarOnly_QPE_01H",)
    except FileNotFoundError:
        if coord.time > (pd.Timestamp.now() - pd.Timedelta(hours=12)):
            return self._download_from_source(coord, source="ncep")
        raise
Update this logic to the following after making the changes to add the fallback product attributes to the MRMS data var internal attrs and the MRMS source file coord.
    from reformatters.common.pydantic import replace

    is_pre_v12 = coord.time < MRMS_V12_START
    is_recent = coord.time > (pd.Timestamp.now() - pd.Timedelta(hours=12))
    if is_pre_v12:
        sources = ["iowa"]
    elif is_recent:
        sources = ["s3", "nomads"]
    else:
        sources = ["s3"]

    products = [coord.product, *coord.fallback_products]
    last_exception: Exception | None = None
    for product in products:
        for source in sources:
            try:
                return self._download_from_source(
                    replace(coord, product=product), source=source
                )
            except FileNotFoundError as e:
                last_exception = e
                continue
    assert last_exception is not None
    raise last_exception
    # Some pre-v12 Iowa Mesonet files have a duplicate GRIB message with
    # standard meteorological discipline (0) alongside the MRMS-specific one (209).
    # Band 1 (discipline 209) is always the authoritative MRMS data.
    band2_discipline = reader.tags(2).get("GRIB_DISCIPLINE", "")
    assert band2_discipline == "0(Meteorological)", (
        f"Expected band 2 GRIB_DISCIPLINE '0(Meteorological)', found '{band2_discipline}' in {coord.downloaded_path}"
    )
Rather than expect a specific order, do this: if `reader.count == 2 and coord.time < MRMS_V12_START`, find the band with GRIB discipline 209 and use that. Assert that a band with 209 exists if we are in the count == 2 and pre-v12 case. Set a `rasterio_band = n` variable in both if/else branches and then use that in the `reader.read` call.
    mrms_product: str
    # Pre-v12 product name on Iowa Mesonet (e.g. GaugeCorr_QPE_01H for precipitation_surface)
    mrms_product_pre_v12: str | None = None
    mrms_level: str = "00.00"
Remove the default value and explicitly set this in all data variables in this template.
    long_name="Precipitation rate",
    units="kg m-2 s-1",
    step_type="avg",
    comment="Average precipitation rate over the previous hour. Derived from MultiSensor_QPE_01H_Pass2 from October 2020, GaugeCorr_QPE_01H before. Units equivalent to mm/s.",

Suggested change:

    comment="Average precipitation rate over the previous hour. Derived from MultiSensor_QPE_01H_Pass2 from October 2020, GaugeCorr_QPE_01H before. If primary product is unavailable, falls back to MultiSensor_QPE_01H_Pass1 and then RadarOnly_QPE_01H. Units equivalent to mm/s.",
Co-authored-by: aldenks <463484+aldenks@users.noreply.github.com>
…lbacks, and robust pre-v12 band selection (#480)

* Initial plan
* Implement PR #472 feedback for MRMS fallback products and band selection
* Default MRMS fallback tuples and remove explicit empty assignments

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: aldenks <463484+aldenks@users.noreply.github.com>
Summary
This PR adds support for the NOAA Multi-Radar Multi-Sensor (MRMS) CONUS hourly precipitation analysis dataset. The implementation includes data ingestion, reformatting, and operational update capabilities for multiple precipitation-related variables from MRMS.
Towards #461, closes #473
Key Changes
Template Configuration (template_config.py): Defines the dataset structure with 4 data variables (precipitation_surface, precipitation_pass_1_surface, precipitation_radar_only_surface, and categorical_precipitation_type_surface) covering the Continental US at 0.01° resolution with hourly frequency from October 2014 onwards.

Region Job Implementation (region_job.py): Implements data processing logic including:

Dataset Class (dynamical_dataset.py): Provides the main dataset interface with:

Zarr Templates: Complete Zarr v3 metadata templates for all coordinates and data variables with optimized chunking and compression (zstd + blosc).
Tests: Comprehensive test coverage including:
Notable Implementation Details
https://claude.ai/code/session_01JD7KMBFUaoUjEYcNa6VhtK
TODOs
small run to