
Add incremental update with pyarrow (no backend abstraction)#90

Closed
Alex Iannicelli (atiannicelli) wants to merge 1 commit into main from feature/update

Conversation

@atiannicelli
Collaborator

PR 4 of 4: Incremental Update

This PR adds the core incremental update functionality using pyarrow, following jwass's feedback to avoid per-format backend classes.

New files:

  • overturemaps/update.py - Core update logic using pyarrow:
    • apply_update() - Main update function: read → join → filter → write
    • fetch_features_pyarrow() - Fetch features from S3 by ID using pyarrow.dataset
    • read_local_file() / write_local_file() - Simple I/O (geoparquet only for now)

Modified files:

  • overturemaps/cli.py - Added:
    • update run -o <file> - Apply incremental update
    • update status -o <file> - Show current state
    • download command now saves .state file when -o is used
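For the `.state` sidecar, something like the following JSON round-trip would suffice. The field names (`release`, `bbox`, `type`) are guesses based on the description above, not the PR's actual state format:

```python
import json

def write_state(output_path, release, bbox, feature_type):
    # Hypothetical sidecar written next to the output file,
    # e.g. buildings.parquet -> buildings.parquet.state
    state = {"release": release, "bbox": bbox, "type": feature_type}
    with open(output_path + ".state", "w") as f:
        json.dump(state, f)

def read_state(output_path):
    with open(output_path + ".state") as f:
        return json.load(f)
```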

Update Algorithm (following jwass's guidance):

  1. Read existing local file as pyarrow Table
  2. Query changelog for (added_ids, modified_ids, removed_ids)
  3. Fetch new/modified features from S3 using pyarrow.dataset
  4. Filter out removed+modified from existing using pc.is_in
  5. pa.concat_tables([kept, new_features])
  6. Write back using pq.write_table()

CLI Examples:

```shell
# Initial download with state tracking
overturemaps download --bbox=-97.8,30.2,-97.6,30.4 --type=building -f geoparquet -o buildings.parquet
# Creates: buildings.parquet and buildings.parquet.state

# Apply incremental update
overturemaps update run -o buildings.parquet
# Reads state, queries changelog, applies changes

# Check status
overturemaps update status -o buildings.parquet
```

Implementation:

  • ✅ Uses pyarrow tables directly (no geopandas for core logic)
  • ✅ Uses pyarrow.dataset to fetch from S3 (NOT DuckDB)
  • ✅ Uses pyarrow.compute.is_in for filtering
  • NO BaseBackend abstraction - direct pyarrow operations
  • NO DuckDB dependency
  • ✅ Currently supports geoparquet only (GeoJSON support deferred)

Tests:

  • tests/test_update.py - Unit and integration tests for update logic

Dependencies:

Related PRs:

Key differences from PR #85:

  • No BaseBackend / GeoParquetBackend / PostGISBackend abstraction
  • No fetch.py with DuckDB - uses pyarrow.dataset directly in update.py
  • Simpler, more direct implementation as per jwass's feedback
  • Focus on geoparquet first (most common use case)

@Rachmanin0xFF
Collaborator

Really cool, but maybe you already know what I'm about to say...

I think this will break on some schema updates. What will happen if we drop or rename a column in a new release? What if we narrow a data type?

We either need a way to say "oops, won't work" and fail gracefully, or hold off on this until we have some sort of reliable schema migration tool. I vote for the latter.

