
File-pattern with hive partitioning #659

Merged
sandorkertesz merged 17 commits into develop from feature/defer-file-pattern-scan
Apr 30, 2025

@sandorkertesz (Collaborator) commented Mar 21, 2025

This PR implements #637.

Adds the hive_partitioning option to the "file-pattern" source.

The examples below use these GRIB data files from the tests:

ls tests/data/pattern/1
r_2020-09-22T12:00:00_0.grib    t_2020-09-22T12:00:00_0.grib    z_2020-09-22T12:00:00_0.grib
r_2020-09-22T12:00:00_12.grib   t_2020-09-22T12:00:00_12.grib   z_2020-09-22T12:00:00_12.grib
r_2020-09-22T12:00:00_24.grib   t_2020-09-22T12:00:00_24.grib   z_2020-09-22T12:00:00_24.grib
r_2020-09-22T12:00:00_6.grib    t_2020-09-22T12:00:00_6.grib    z_2020-09-22T12:00:00_6.grib

hive_partitioning=False

This is the default; the "file-pattern" source behaves as before. Namely:

  • values must be specified for each pattern item
  • all possible file names are constructed from the pattern and the values
  • the files are handled as a "multi" source. For GRIB data the result will be a single FieldList.
  • when a file does not exist an exception is raised.

import datetime
from earthkit.data import from_source  # assuming the earthkit-data package

pattern = "tests/data/pattern/1/{shortName}_{date:date(%Y-%m-%dT%H:%M)}_{step}.grib"
ds = from_source("file-pattern", pattern,
                 {"shortName": ["t", "r", "z"],
                  "date": datetime.datetime(2020, 9, 22, 12),
                  "step": [0, 6]})

# ds is a FieldList merged from 6 different GRIB files

# this call loads every GRIB message (one at a time) from the 6 files to check metadata
r = ds.sel(shortName="t", step=12)
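
The expansion described in the list above (every combination of the given values substituted into the pattern) can be sketched with itertools.product. This is only an illustration, not the library's implementation: expand_pattern is a hypothetical helper, and it uses plain str.format placeholders instead of the {date:date(...)} syntax of the real pattern.

```python
import itertools


def expand_pattern(pattern, values):
    """Return every file name obtained by substituting all combinations
    of `values` into `pattern` (hypothetical helper, plain str.format)."""
    # Wrap scalars so that every key maps to a list of candidate values
    lists = {k: v if isinstance(v, (list, tuple)) else [v] for k, v in values.items()}
    keys = list(lists)
    return [
        pattern.format(**dict(zip(keys, combo)))
        for combo in itertools.product(*(lists[k] for k in keys))
    ]


names = expand_pattern(
    "{shortName}_{date}_{step}.grib",
    {"shortName": ["t", "r", "z"], "date": "2020-09-22T12:00:00", "step": [0, 6]},
)
print(len(names))  # 3 shortNames x 1 date x 2 steps
print(names[0])
```

With three values for shortName, one date and two steps, the product yields the six file names that are then opened as a "multi" source.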

hive_partitioning=True

  • the pattern values are now optional
  • each pattern item is interpreted as a metadata key
  • from_source() returns an object that only supports the sel() method
  • sel() substitutes its args and kwargs into the pattern and collects the matching files. During the scan it only enters directories matching the pattern. The collected file paths are handled as a "multi" source. For GRIB data the result will be a single FieldList.
  • if some keys are not part of the pattern, an extra sel() is run on the resulting FieldList.

pattern = "tests/data/pattern/1/{shortName}_{date:date(%Y-%m-%dT%H:%M)}_{step}.grib"
ds = from_source("file-pattern", pattern, hive_partitioning=True)

# ds is an object that only offers the sel() method

# using hive partitioning keys. 
# This call does not scan any GRIB files, no GRIB messages are loaded
r = ds.sel(shortName="t", step=12)

# using hive partitioning keys + extra keys from GRIB header. 
# This call only reads the GRIB messages from 1 file
r = ds.sel(shortName="t", step=12, levtype="pl")
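
The two-stage selection shown above (pattern keys resolved from file names, remaining keys left for a second sel() on the loaded data) can be sketched without any GRIB library. The helper names pattern_to_regex and select are hypothetical, the placeholders are simplified to bare {key} items, and the file list mimics the test directory; the actual code in this PR may work differently.

```python
import re

# File names mimicking tests/data/pattern/1
FILES = [f"{p}_2020-09-22T12:00:00_{s}.grib" for p in "rtz" for s in (0, 6, 12, 24)]
PATTERN = "{shortName}_{date}_{step}.grib"


def pattern_to_regex(pattern):
    """Turn '{key}' placeholders into named regex groups (hypothetical helper)."""
    # re.split with a capturing group alternates: literal, key, literal, key, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    out = []
    for i, part in enumerate(parts):
        out.append(re.escape(part) if i % 2 == 0 else f"(?P<{part}>[^_/]+)")
    return re.compile("".join(out) + r"\Z")


def select(files, pattern, **request):
    """Return (matching file names, keys that still need a metadata-based sel)."""
    rx = pattern_to_regex(pattern)
    pattern_keys = set(rx.groupindex)
    # Keys present in the pattern are resolved from the file names alone
    path_request = {k: str(v) for k, v in request.items() if k in pattern_keys}
    # Any other key would require reading the messages of the matched files
    extra_request = {k: v for k, v in request.items() if k not in pattern_keys}
    matched = []
    for name in files:
        m = rx.match(name)
        if m and all(m.group(k) == v for k, v in path_request.items()):
            matched.append(name)
    return matched, extra_request


matched, extra = select(FILES, PATTERN, shortName="t", step=12, levtype="pl")
print(matched)
print(extra)
```

In this sketch only one file name survives the path-based filter, and levtype is the only key left for the metadata-based pass, which mirrors the "only reads the GRIB messages from 1 file" comment above.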

Remarks:

  • paths/file names collected during the filesystem scan are not cached
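
Since nothing is cached, every sel() call repeats the scan, so the pruning mentioned earlier (entering only directories that match the pattern) is what keeps repeated calls affordable. A minimal sketch of such a pruned walk follows; it assumes a hypothetical layout with one directory level per key, and scan is an illustrative helper, not code from this PR.

```python
import os
import tempfile


def scan(root, levels, request):
    """Yield files under `root` for a layout with one directory level per key
    in `levels`. Directories whose name conflicts with `request` are never
    entered. Hypothetical helper for illustration only."""
    if not levels:
        for entry in os.scandir(root):
            if entry.is_file():
                yield entry.path
        return
    key, rest = levels[0], levels[1:]
    wanted = request.get(key)
    for entry in os.scandir(root):
        if not entry.is_dir():
            continue
        # Prune: skip directories that cannot contain a match
        if wanted is not None and entry.name != str(wanted):
            continue
        yield from scan(entry.path, rest, request)


# Build a small throw-away tree: <root>/<year>/<param>/data.grib
with tempfile.TemporaryDirectory() as root:
    for year in (2020, 2021):
        for param in ("t", "r"):
            d = os.path.join(root, str(year), param)
            os.makedirs(d)
            open(os.path.join(d, "data.grib"), "w").close()

    found = sorted(
        os.path.relpath(p, root)
        for p in scan(root, ["year", "param"], {"param": "t"})
    )
    print(found)
```

Only the "t" directories are entered here; the "r" directories are skipped without listing their contents.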

@codecov-commenter commented Mar 21, 2025

Codecov Report

Attention: Patch coverage is 98.66667% with 3 lines in your changes missing coverage. Please review.

Project coverage is 90.99%. Comparing base (31caead) to head (f4326ce).

Files with missing lines | Patch % | Lines
tests/patterns/test_patterns.py | 91.89% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #659      +/-   ##
===========================================
+ Coverage    90.86%   90.99%   +0.13%     
===========================================
  Files          160      162       +2     
  Lines        12145    12363     +218     
  Branches       593      605      +12     
===========================================
+ Hits         11035    11250     +215     
- Misses         930      932       +2     
- Partials       180      181       +1     


@sandorkertesz sandorkertesz marked this pull request as ready for review April 30, 2025 09:04
@sandorkertesz sandorkertesz changed the title WIP: file-pattern with hive partitioning File-pattern with hive partitioning Apr 30, 2025
@sandorkertesz sandorkertesz merged commit c80c8db into develop Apr 30, 2025
122 of 124 checks passed
@sandorkertesz sandorkertesz deleted the feature/defer-file-pattern-scan branch April 30, 2025 12:34