
Add incremental update with pyarrow (no backend abstraction)#90

Closed
Alex Iannicelli (atiannicelli) wants to merge 1 commit into main from feature/update

Conversation

@atiannicelli
Collaborator

PR 4 of 4: Incremental Update

This PR adds the core incremental update functionality using pyarrow, following jwass's feedback to avoid per-format backend classes.

New files:

  • overturemaps/update.py - Core update logic using pyarrow:
    • apply_update() - Main update function: read → join → filter → write
    • fetch_features_pyarrow() - Fetch features from S3 by ID using pyarrow.dataset
    • read_local_file() / write_local_file() - Simple I/O (geoparquet only for now)

Modified files:

  • overturemaps/cli.py - Added:
    • update run -o <file> - Apply incremental update
    • update status -o <file> - Show current state
    • download command now saves .state file when -o is used
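For the `.state` sidecar, something like the following JSON round-trip would suffice. The field names (`release`, `bbox`, `type`) are guesses based on the description above, not the PR's actual state format:

```python
import json

def write_state(output_path, release, bbox, feature_type):
    # Hypothetical sidecar written next to the output file,
    # e.g. buildings.parquet -> buildings.parquet.state
    state = {"release": release, "bbox": bbox, "type": feature_type}
    with open(output_path + ".state", "w") as f:
        json.dump(state, f)

def read_state(output_path):
    with open(output_path + ".state") as f:
        return json.load(f)
```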

Update Algorithm (following jwass's guidance):

  1. Read existing local file as pyarrow Table
  2. Query changelog for (added_ids, modified_ids, removed_ids)
  3. Fetch new/modified features from S3 using pyarrow.dataset
  4. Filter out removed+modified from existing using pc.is_in
  5. pa.concat_tables([kept, new_features])
  6. Write back using pq.write_table()

CLI Examples:

```shell
# Initial download with state tracking
overturemaps download --bbox=-97.8,30.2,-97.6,30.4 --type=building -f geoparquet -o buildings.parquet
# Creates: buildings.parquet and buildings.parquet.state

# Apply incremental update
overturemaps update run -o buildings.parquet
# Reads state, queries changelog, applies changes

# Check status
overturemaps update status -o buildings.parquet
```

Implementation:

  • ✅ Uses pyarrow tables directly (no geopandas for core logic)
  • ✅ Uses pyarrow.dataset to fetch from S3 (NOT DuckDB)
  • ✅ Uses pyarrow.compute.is_in for filtering
  • NO BaseBackend abstraction - direct pyarrow operations
  • NO DuckDB dependency
  • ✅ Currently supports geoparquet only (GeoJSON support deferred)

Tests:

  • tests/test_update.py - Unit and integration tests for update logic

Dependencies:

Related PRs:

Key differences from PR #85:

  • No BaseBackend / GeoParquetBackend / PostGISBackend abstraction
  • No fetch.py with DuckDB - uses pyarrow.dataset directly in update.py
  • Simpler, more direct implementation as per jwass's feedback
  • Focus on geoparquet first (most common use case)

@Rachmanin0xFF
Collaborator

Really cool, but maybe you already know what I'm about to say...

I think this will break on some schema updates. What will happen if we drop or rename a column in a new release? What if we narrow a data type?

We either need a way to say "oops, won't work" and fail gracefully, or hold off on this until we have some sort of reliable schema migration tool. I vote for the latter.

