[New feature] Add GeoParquet support for Sentinel2 from PC #669
Draft
robmarkcole wants to merge 3 commits into
- Implemented functions to create mock GeoParquet rows for Sentinel2.
- Added tests for batch preparation of window metadata and item retrieval by name.
- Updated `uv.lock` to include duckdb version 1.5.3 and incremented the revision number.
- Added duckdb as an extra dependency for the project.
Adds a GeoParquet metadata backend to the Planetary Computer data sources. Previously, scene discovery during dataset prepare always used the Planetary Computer STAC API, making one search request per window, which can hit API rate limits on large jobs.
The new backend queries the collection-level GeoParquet item table (hosted on Azure Blob Storage) using DuckDB. It downloads only the relevant date-range partitions, runs a spatial and temporal filter locally, and returns matching items in a single bulk operation per `get_items` call. Because rslearn already batches all windows into one `get_items` call during prepare, this effectively replaces N per-window STAC requests with one bulk query.

The feature is opt-in via a new `metadata_backend` parameter (default `"stac"`, new option `"geoparquet"`) on the `PlanetaryComputer` base class and all its subclasses, including `Sentinel2`. An optional `metadata_cache_dir` can be set to cache the partition file list and query results between runs. DuckDB is required and is included in the existing `extra` optional dependencies.

Updated `uv.lock` to include duckdb version 1.5.3 and incremented the revision number.

Note: I need to apply this to a large-scale dataset to see if there is a significant speedup.
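To make the "one bulk query instead of N requests" idea concrete, here is a minimal pure-Python sketch of the pattern described above: pick only the partitions that overlap the combined date range of all windows, then apply the spatial and temporal filter locally for every window in one pass. This is not the code in this PR. The `Item` and `Window` classes, the `bulk_match` function, and the monthly `(year, month)` partition keys are all hypothetical, and plain Python lists stand in for the DuckDB query over GeoParquet files.

```python
from dataclasses import dataclass
from datetime import date

BBox = tuple[float, float, float, float]  # (west, south, east, north)


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for one row of the item table."""
    name: str
    bbox: BBox
    acquired: date


@dataclass(frozen=True)
class Window:
    """Hypothetical stand-in for a window's spatial and temporal extent."""
    name: str
    bbox: BBox
    start: date
    end: date


def bboxes_intersect(a: BBox, b: BBox) -> bool:
    """Return True if two axis-aligned bounding boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def bulk_match(
    windows: list[Window],
    partitions: dict[tuple[int, int], list[Item]],
) -> dict[str, list[str]]:
    """Match all windows against the item table in one pass.

    Assumes partitions are keyed by (year, month); the real layout may differ.
    """
    if not windows:
        return {}
    # Combined date range across every window in the batch.
    start = min(w.start for w in windows)
    end = max(w.end for w in windows)
    lo, hi = (start.year, start.month), (end.year, end.month)
    # Read only the partitions that can contain matching items.
    candidates = [
        item
        for key in sorted(partitions)
        if lo <= key <= hi
        for item in partitions[key]
    ]
    # Local spatial and temporal filter, applied per window.
    return {
        w.name: [
            item.name
            for item in candidates
            if w.start <= item.acquired <= w.end
            and bboxes_intersect(w.bbox, item.bbox)
        ]
        for w in windows
    }


partitions = {
    (2024, 1): [Item("S2A_jan", (0, 0, 1, 1), date(2024, 1, 15))],
    (2024, 2): [
        Item("S2B_feb", (0.5, 0.5, 1.5, 1.5), date(2024, 2, 10)),
        Item("S2A_far", (10, 10, 11, 11), date(2024, 2, 12)),
    ],
    (2024, 6): [Item("S2A_jun", (0, 0, 1, 1), date(2024, 6, 1))],
}
windows = [
    Window("w1", (0.2, 0.2, 0.8, 0.8), date(2024, 1, 1), date(2024, 1, 31)),
    Window("w2", (1.2, 1.2, 1.4, 1.4), date(2024, 2, 1), date(2024, 2, 28)),
]
matches = bulk_match(windows, partitions)
print(matches)
```

In this sketch the June partition is never read, because it falls outside the combined January to February range of the two windows. That is the same saving the GeoParquet backend aims for when it downloads only the relevant date-range partitions.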