refactor(reconcile): consolidate data compare (join, mismatch, output) — #745 by BesikiML · Pull Request #2509 · databrickslabs/lakebridge

BesikiML · 2026-06-16T14:20:20Z

Summary

Refactors compare.py so hash reconcile, aggregate reconcile, and capture (inner-join) flows share core building blocks where appropriate.

Related: #745

What changed

Added shared join/persist helpers and reused them across hash + aggregate paths.
Added shared mismatch filtering/output assembly helpers to reduce duplication.
Kept capture flow semantics, but reused join primitive and split projections into helpers.
Added module-level flow documentation (hash / aggregate / capture).
Added compatibility follow-up for aggregate join import usage and JDBC option fixture names.

API / migration

Prefer prepare_persisted_aggregate_join over join_aggregate_data.
Aggregate mismatch logic now uses module-private helper _mismatch_rows_for_aggregate_mappin.

Testing

Reconcile compare unit/integration checks were run during development.

…cile output assembly; route capture mismatch through _inner_join_for_capture_mismatch; add module docstring for hash vs aggregate vs capture flows.

Restore join_aggregate_data compatibility alias for existing imports and align integration JdbcReaderOptions fixture keys with current dataclass fields.

codecov · 2026-06-16T14:26:36Z

Codecov Report

❌ Patch coverage is 29.16667% with 51 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.01%. Comparing base (7ebd945) to head (4ce9358).

Files with missing lines	Patch %	Lines
...rc/databricks/labs/lakebridge/reconcile/compare.py	29.16%	51 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2509      +/-   ##
==========================================
- Coverage   69.10%   69.01%   -0.10%     
==========================================
  Files         105      105              
  Lines        9482     9521      +39     
  Branches     1050     1051       +1     
==========================================
+ Hits         6553     6571      +18     
- Misses       2735     2756      +21     
  Partials      194      194

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2026-06-16T14:30:05Z

✅ 169/169 passed, 7 flaky, 2 skipped, 1h14m43s total

Flaky tests:

🤪 test_installs_and_runs_pypi_bladebridge (25.945s)
🤪 test_recon_for_report_type_is_data (1m1.326s)
🤪 test_transpiles_informatica_to_sparksql_non_interactive[True] (23.338s)
🤪 test_transpiles_informatica_to_sparksql_non_interactive[False] (23.442s)
🤪 test_transpile_teradata_sql_non_interactive[False] (6.178s)
🤪 test_transpile_teradata_sql_non_interactive[True] (6.765s)
🤪 test_transpile_teradata_sql (6.66s)

_{Running from acceptance #4924}

m-abulazm

Verdict: Behavior-safe, but works against its own goal. The refactor is genuinely behavior-preserving across all three flows (I verified this exhaustively — see below). But for a PR whose stated purpose is
to reduce duplication, it nets +143 lines and ~14 new helpers (only ~6 of which are actually shared), and it bundles in dead code and an unmigrated rename. I'd request changes on the cheap cleanups (dead
constant, rename churn) and ask the author to reconsider the abstraction depth before merge. Nothing here is a correctness blocker.

✅ Behavior preservation — clean (the important part for a refactor)

All three flows trace byte-for-byte equivalent to main. Specifically verified:

Hash reconcile (reconcile_data/_get_mismatch_data): same full-outer join + alias order, same selectExpr prefixed projection, same persist checkpoint, correct missing-side mapping, same mismatch pipeline.
Projection lists are built from the original df.columns (not the post-withColumn frame), so the added hash_match/agg_data_match columns never leak into a startswith projection — no column-set drift.
Aggregate reconcile: full/cross branches identical; the cross branch never forwards on=None; the col→c loop-variable rename is a verified no-op (it only shadowed the imported col inside a comprehension
that never calls it).
Capture: f"{source_alias}." with source_alias="base" produces a byte-identical column reference to the base 'base.' literal; select_expr order and the final sorted(compare_columns) reorder are unchanged;
backtick/quoted-name normalization path is untouched.
The one semantic-looking change — if mismatch: → if mismatch_df is not None in _data_reconcile_output — is equivalent: PySpark DataFrame defines neither bool nor len, so a non-None frame is always
truthy. (Marginally safer, even.)

⚠️ One caveat: there are no fast unit tests for compare.py — correctness is guarded only by the live-Spark integration tests in tests/integration/reconcile/test_compare.py. The equivalence above rests on
static tracing, so those integration tests should be run before merge.

🔧 Findings to address

Dead, unrelated constant in conftest.py (request change — easy)
tests/integration/reconcile/conftest.py:75 adds DIAMONDS_JDBC_READER_OPTIONS plus a JdbcReaderOptions import to support it. It's referenced nowhere in the repo (git grep finds only the definition). It's also
orthogonal scope for a compare.py refactor. → Remove the constant and the now-unused import; land any real parallel-read fixture in the PR that actually consumes it.
Rename + backward-compat alias = pure churn (request change — easy)
join_aggregate_data → prepare_persisted_aggregate_join, then a module-level alias join_aggregate_data = prepare_persisted_aggregate_join "for existing imports/callers." But the only caller
(reconciliation.py:14, :282) was never migrated and still uses the old name — so the new name has zero call sites, and the docstring "Replaces the former entry point" is contradicted by the unchanged caller.
These symbols aren't a public API (no all, not re-exported). → Pick one: migrate the single call site and delete the alias, or drop the rename entirely. Don't ship two names.
Net +143 lines for a "consolidation"; ~half the new helpers are single-use (the core altitude issue)
~6 helpers genuinely dedupe (≥2 callers) and should stay: _aliased_join, _join_prepare_persist, _filter_to_value_mismatches, _mismatch_projection_for_prefixed_columns, _joined_rows_missing_on_side,
_data_reconcile_output. The other ~8 are single-use wrappers that add indirection without removing duplication: _persist_reconcile_dataframe (a one-line forward), _inner_join_for_capture_mismatch,
_select_prefixed_columns_for_hash_reconcile, _select_aggregate_joined_columns, _value_mismatch_where_both_present, _mismatch_rows_for_prefixed_compare_column, _capture_mismatch_base_compare_projections,
_capture_mismatch_per_column_match_exprs. Per CLAUDE.md "no single-use abstractions," inlining these would likely flip the PR to a net line reduction — matching its stated goal.
_join_prepare_persist + prepare-callback is heavier than needed (low)
8 params (5 keyword-only) + a Callable callback for 2 call sites, each forced to wrap its select in a closure that re-passes source/target it already holds. Consider: each flow calls the shared _aliased_join
directly, then does its own select + persist. That keeps the one genuinely-shared piece (the cross/full branch) without the pipeline indirection.

📝 Minor / nits

Naming: prefixed_compare_column / compare_basename are module-novel coinages generalizing over a constant (_HASH_COLUMN_NAME) that has exactly one value at the call site. Prefer existing module vocabulary.
PR description: references _mismatch_rows_for_aggregate_mappin — the real symbol is _mismatch_rows_for_aggregate_mappings. And it frames the conftest change as "aligning JDBC option fixture keys," but the
diff aligns nothing — it adds one brand-new unused constant.
Dead defensive branch: _aliased_join raises ValueError("join condition 'on' is required…") for how != "cross" and on is None, which no caller can trigger.

👍 Worth keeping

_data_reconcile_output, _joined_rows_missing_on_side, _aliased_join, _filter_to_value_mismatches, _mismatch_projection_for_prefixed_columns, and the col→c shadowing fix are real, behavior-preserving dedup.

BesikiML added 2 commits June 16, 2026 10:14

Extract shared join/prepare/persist, value-mismatch filter, and recon…

e6b9286

…cile output assembly; route capture mismatch through _inner_join_for_capture_mismatch; add module docstring for hash vs aggregate vs capture flows.

Fix reconcile compare compatibility and JDBC option fixture names.

c0807a8

Restore join_aggregate_data compatibility alias for existing imports and align integration JdbcReaderOptions fixture keys with current dataclass fields.

BesikiML requested a review from a team as a code owner June 16, 2026 14:20

style(reconcile): format compare and reconcile conftest

7775d11

BesikiML requested review from bishwajit-db and m-abulazm June 17, 2026 12:23

BesikiML self-assigned this Jun 17, 2026

Merge branch 'main' into 745-cla-clean

d2939b5

m-abulazm requested changes Jun 23, 2026

View reviewed changes

m-abulazm added the do-not-merge label Jun 23, 2026

Merge branch 'main' into 745-cla-clean

4ce9358

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor(reconcile): consolidate data compare (join, mismatch, output) — #745#2509

refactor(reconcile): consolidate data compare (join, mismatch, output) — #745#2509
BesikiML wants to merge 5 commits into
mainfrom
745-cla-clean

BesikiML commented Jun 16, 2026

Uh oh!

codecov Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

m-abulazm left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

BesikiML commented Jun 16, 2026

Summary

What changed

API / migration

Testing

Uh oh!

codecov Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

m-abulazm left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Jun 16, 2026 •

edited

Loading

github-actions Bot commented Jun 16, 2026 •

edited

Loading