fix: align column order before incremental merges#17
Conversation
There was a problem hiding this comment.
Pull request overview
Aligns existing and incoming Polars DataFrames to a shared schema/column order before incremental merges so vertical concatenation into parquet-backed stores is stable even when providers reorder columns or introduce new ones.
Changes:
- Introduces
align_frames_for_concat(and helpers) to align column sets/order and optionally fill missing columns. - Applies alignment across shared update/merge paths (update manager, hive/chunked storage, storage manager, ETF/futures downloaders).
- Adds regression tests covering provider column reordering and schema alignment/fill behavior.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/test_update_manager.py | Adds regression for APPEND_ONLY updates when incoming columns are reordered. |
| tests/test_etf_downloader.py | Adds regression ensuring ETF updates succeed and persist canonical column order with reordered provider output. |
| tests/core/test_schemas.py | Adds unit tests for align_frames_for_concat (reorder, new columns, fill-values). |
| src/ml4t/data/core/schemas.py | Implements align_frames_for_concat + helpers to align columns for concat-safe merges. |
| src/ml4t/data/update_manager.py | Uses alignment before concat in multiple incremental strategies. |
| src/ml4t/data/storage/hive.py | Uses alignment in update_combined_file prior to concat. |
| src/ml4t/data/storage/chunked.py | Uses alignment before concatenating chunk merges. |
| src/ml4t/data/managers/storage_manager.py | Replaces bespoke “match columns + fill defaults” logic with shared alignment helper. |
| src/ml4t/data/futures/downloader.py | Uses alignment before merging newly downloaded data into existing parquet. |
| src/ml4t/data/futures/book_downloader.py | Uses alignment before merging year partitions into existing parquet. |
| src/ml4t/data/etfs/downloader.py | Uses alignment before append/unique; ensures load_ohlcv returns canonical column ordering. |
Comments suppressed due to low confidence (1)
src/ml4t/data/storage/hive.py:519
update_combined_filecomputesrows_beforeafter concatenating existing+new and then returnsrows_after - rows_before, which will always be<= 0and can be negative when new data overlaps existing timestamps. This return value is used asrecords_addedbyProviderUpdater, so it can under-report or even report negative additions. Consider calculatingrows_beforeas the pre-update row count (e.g., existing unique row count) and returningrows_after - rows_before(or simplyrows_after - len(existing_df)after dedup) so the metric reflects net new records.
existing_df, data = align_frames_for_concat(existing_df, data)
combined = pl.concat([existing_df, data])
else:
combined = data
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| existing, new = align_frames_for_concat( | ||
| existing, | ||
| new, | ||
| right_fill_values={"dividends": 0.0, "splits": 1.0}, |
There was a problem hiding this comment.
_merge_data now preserves columns that exist only in new, but the current call only supplies right_fill_values. If new introduces optional equity columns like dividends/splits that were absent from existing (e.g., older parquet written without them), align_frames_for_concat will add those columns to existing filled with nulls. Downstream logic typically treats missing dividends as 0.0 and splits as 1.0 (see prior behavior in this method), so consider also passing the same defaults via left_fill_values to avoid nulls in historical rows when these columns are introduced later.
| existing, new = align_frames_for_concat( | |
| existing, | |
| new, | |
| right_fill_values={"dividends": 0.0, "splits": 1.0}, | |
| fill_values = {"dividends": 0.0, "splits": 1.0} | |
| existing, new = align_frames_for_concat( | |
| existing, | |
| new, | |
| left_fill_values=fill_values, | |
| right_fill_values=fill_values, |
Summary
Testing