fix: align column order before incremental merges by stefan-jansen · Pull Request #17 · ml4t/data

stefan-jansen · 2026-04-29T15:26:16Z

Summary

align existing and incoming DataFrames before vertical concatenation in shared update/storage paths
reuse the same alignment in ETF and futures downloaders that merge into existing parquet files
add regressions for reordered provider columns and schema alignment behavior

Testing

uv run ruff check src tests
uv run ty check
uv run pytest -q

Copilot

Pull request overview

Aligns existing and incoming Polars DataFrames to a shared schema/column order before incremental merges so vertical concatenation into parquet-backed stores is stable even when providers reorder columns or introduce new ones.

Changes:

Introduces align_frames_for_concat (and helpers) to align column sets/order and optionally fill missing columns.
Applies alignment across shared update/merge paths (update manager, hive/chunked storage, storage manager, ETF/futures downloaders).
Adds regression tests covering provider column reordering and schema alignment/fill behavior.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
tests/test_update_manager.py	Adds regression for APPEND_ONLY updates when incoming columns are reordered.
tests/test_etf_downloader.py	Adds regression ensuring ETF updates succeed and persist canonical column order with reordered provider output.
tests/core/test_schemas.py	Adds unit tests for `align_frames_for_concat` (reorder, new columns, fill-values).
src/ml4t/data/core/schemas.py	Implements `align_frames_for_concat` + helpers to align columns for concat-safe merges.
src/ml4t/data/update_manager.py	Uses alignment before concat in multiple incremental strategies.
src/ml4t/data/storage/hive.py	Uses alignment in `update_combined_file` prior to concat.
src/ml4t/data/storage/chunked.py	Uses alignment before concatenating chunk merges.
src/ml4t/data/managers/storage_manager.py	Replaces bespoke “match columns + fill defaults” logic with shared alignment helper.
src/ml4t/data/futures/downloader.py	Uses alignment before merging newly downloaded data into existing parquet.
src/ml4t/data/futures/book_downloader.py	Uses alignment before merging year partitions into existing parquet.
src/ml4t/data/etfs/downloader.py	Uses alignment before append/unique; ensures `load_ohlcv` returns canonical column ordering.

Comments suppressed due to low confidence (1)

src/ml4t/data/storage/hive.py:519

update_combined_file computes rows_before after concatenating existing+new and then returns rows_after - rows_before, which will always be <= 0 and can be negative when new data overlaps existing timestamps. This return value is used as records_added by ProviderUpdater, so it can under-report or even report negative additions. Consider calculating rows_before as the pre-update row count (e.g., existing unique row count) and returning rows_after - rows_before (or simply rows_after - len(existing_df) after dedup) so the metric reflects net new records.

            existing_df, data = align_frames_for_concat(existing_df, data)
            combined = pl.concat([existing_df, data])
        else:
            combined = data

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-29T15:32:43Z

+        existing, new = align_frames_for_concat(
+            existing,
+            new,
+            right_fill_values={"dividends": 0.0, "splits": 1.0},


_merge_data now preserves columns that exist only in new, but the current call only supplies right_fill_values. If new introduces optional equity columns like dividends/splits that were absent from existing (e.g., older parquet written without them), align_frames_for_concat will add those columns to existing filled with nulls. Downstream logic typically treats missing dividends as 0.0 and splits as 1.0 (see prior behavior in this method), so consider also passing the same defaults via left_fill_values to avoid nulls in historical rows when these columns are introduced later.

Suggested change

existing, new = align_frames_for_concat(

existing,

new,

right_fill_values={"dividends": 0.0, "splits": 1.0},

fill_values = {"dividends": 0.0, "splits": 1.0}

existing, new = align_frames_for_concat(

existing,

new,

left_fill_values=fill_values,

right_fill_values=fill_values,

fix: align column order before incremental merges

ada0223

Copilot AI review requested due to automatic review settings April 29, 2026 15:26

Copilot started reviewing on behalf of stefan-jansen April 29, 2026 15:26 View session

Stefan Jansen added 2 commits April 29, 2026 11:27

ci: pin setup-uv action to v7

5407b01

style: format schema alignment helper

827b479

Copilot AI reviewed Apr 29, 2026

View reviewed changes

stefan-jansen merged commit 8575cab into main Apr 29, 2026
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: align column order before incremental merges#17

fix: align column order before incremental merges#17
stefan-jansen merged 3 commits into
mainfrom
fix/align-update-column-order

stefan-jansen commented Apr 29, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

stefan-jansen commented Apr 29, 2026

Summary

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants