Add incremental update with pyarrow (no backend abstraction) #90
Closed
Alex Iannicelli (atiannicelli) wants to merge 1 commit into main from
Collaborator
Really cool, but maybe you already know what I'm about to say... I think this will break on some schema updates. What will happen if we drop or rename a column in a new release? What if we narrow a data type? We either need a way to say "oops, won't work" and fail gracefully, or hold off on this until we have some sort of reliable schema migration tool. I vote for the latter.
PR 4 of 4: Incremental Update
This PR adds the core incremental update functionality using pyarrow, following jwass's feedback to avoid per-format backend classes.
New files:
overturemaps/update.py - Core update logic using pyarrow:
    apply_update() - Main update function: read → join → filter → write
    fetch_features_pyarrow() - Fetch features from S3 by ID using pyarrow.dataset
    read_local_file() / write_local_file() - Simple I/O (geoparquet only for now)
Modified files:
overturemaps/cli.py - Added:
    update run -o <file> - Apply incremental update
    update status -o <file> - Show current state
    download command now saves .state file when -o is used
Update Algorithm (following jwass's guidance):
    pyarrow.dataset
    pc.is_in
    pa.concat_tables([kept, new_features])
    pq.write_table()
CLI Examples:
Implementation:
Tests:
tests/test_update.py - Unit and integration tests for update logic
Dependencies:
Related PRs:
Key differences from PR #85:
No BaseBackend/GeoParquetBackend/PostGISBackend abstraction
No fetch.py with DuckDB - uses pyarrow.dataset directly in update.py