Skip to content

fix: align column order before incremental merges#17

Merged
stefan-jansen merged 3 commits into
mainfrom
fix/align-update-column-order
Apr 29, 2026
Merged

fix: align column order before incremental merges#17
stefan-jansen merged 3 commits into
mainfrom
fix/align-update-column-order

Conversation

@stefan-jansen

Copy link
Copy Markdown
Contributor

Summary

  • align existing and incoming DataFrames before vertical concatenation in shared update/storage paths
  • reuse the same alignment in ETF and futures downloaders that merge into existing parquet files
  • add regressions for reordered provider columns and schema alignment behavior

Testing

  • uv run ruff check src tests
  • uv run ty check
  • uv run pytest -q

Copilot AI review requested due to automatic review settings April 29, 2026 15:26

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Aligns existing and incoming Polars DataFrames to a shared schema/column order before incremental merges so vertical concatenation into parquet-backed stores is stable even when providers reorder columns or introduce new ones.

Changes:

  • Introduces align_frames_for_concat (and helpers) to align column sets/order and optionally fill missing columns.
  • Applies alignment across shared update/merge paths (update manager, hive/chunked storage, storage manager, ETF/futures downloaders).
  • Adds regression tests covering provider column reordering and schema alignment/fill behavior.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/test_update_manager.py Adds regression for APPEND_ONLY updates when incoming columns are reordered.
tests/test_etf_downloader.py Adds regression ensuring ETF updates succeed and persist canonical column order with reordered provider output.
tests/core/test_schemas.py Adds unit tests for align_frames_for_concat (reorder, new columns, fill-values).
src/ml4t/data/core/schemas.py Implements align_frames_for_concat + helpers to align columns for concat-safe merges.
src/ml4t/data/update_manager.py Uses alignment before concat in multiple incremental strategies.
src/ml4t/data/storage/hive.py Uses alignment in update_combined_file prior to concat.
src/ml4t/data/storage/chunked.py Uses alignment before concatenating chunk merges.
src/ml4t/data/managers/storage_manager.py Replaces bespoke “match columns + fill defaults” logic with shared alignment helper.
src/ml4t/data/futures/downloader.py Uses alignment before merging newly downloaded data into existing parquet.
src/ml4t/data/futures/book_downloader.py Uses alignment before merging year partitions into existing parquet.
src/ml4t/data/etfs/downloader.py Uses alignment before append/unique; ensures load_ohlcv returns canonical column ordering.
Comments suppressed due to low confidence (1)

src/ml4t/data/storage/hive.py:519

  • update_combined_file computes rows_before after concatenating existing+new and then returns rows_after - rows_before, which will always be <= 0 and can be negative when new data overlaps existing timestamps. This return value is used as records_added by ProviderUpdater, so it can under-report or even report negative additions. Consider calculating rows_before as the pre-update row count (e.g., existing unique row count) and returning rows_after - rows_before (or simply rows_after - len(existing_df) after dedup) so the metric reflects net new records.
            existing_df, data = align_frames_for_concat(existing_df, data)
            combined = pl.concat([existing_df, data])
        else:
            combined = data


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +156 to +159
existing, new = align_frames_for_concat(
existing,
new,
right_fill_values={"dividends": 0.0, "splits": 1.0},

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_merge_data now preserves columns that exist only in new, but the current call only supplies right_fill_values. If new introduces optional equity columns like dividends/splits that were absent from existing (e.g., older parquet written without them), align_frames_for_concat will add those columns to existing filled with nulls. Downstream logic typically treats missing dividends as 0.0 and splits as 1.0 (see prior behavior in this method), so consider also passing the same defaults via left_fill_values to avoid nulls in historical rows when these columns are introduced later.

Suggested change
existing, new = align_frames_for_concat(
existing,
new,
right_fill_values={"dividends": 0.0, "splits": 1.0},
fill_values = {"dividends": 0.0, "splits": 1.0}
existing, new = align_frames_for_concat(
existing,
new,
left_fill_values=fill_values,
right_fill_values=fill_values,

Copilot uses AI. Check for mistakes.
@stefan-jansen stefan-jansen merged commit 8575cab into main Apr 29, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants