Add BigQuery as a reconcile source #2527

Open
take60 wants to merge 7 commits into main from feature/reconcile-bigquery

@take60 take60 commented Jun 24, 2026

Contributor

What

Adds BigQuery as a reconcile source, at parity with the other sources (schema / row / data / all / aggregate). BigQuery was already supported by the transpiler and profiler; this closes the gap for reconcile.

How it works

BigQuery uses the same Lakehouse Federation remote_query path as Snowflake/Oracle/etc. — no new dependencies. BigQueryDataSource uses backtick-quoted 3-part project.dataset.table names and reads metadata from INFORMATION_SCHEMA.COLUMNS. sqlglot already ships a BigQuery dialect, so query generation and schema comparison are reused unchanged.
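The backtick quoting described above can be illustrated with a minimal sketch. The helper name `quote_bigquery_table` is hypothetical and is not taken from the connector; it only shows how a multi-part name can be wrapped in a single pair of backticks, which GoogleSQL accepts for a full table path.

```python
def quote_bigquery_table(*parts: str) -> str:
    """Join name parts into one backtick-quoted BigQuery table path.

    Hypothetical helper for illustration; the real connector may build
    names differently.
    """
    if not parts or any(not part for part in parts):
        raise ValueError("every name part must be a non-empty string")
    if any("`" in part for part in parts):
        raise ValueError("backticks are not allowed inside a name part")
    return "`" + ".".join(parts) + "`"


# Three-part form: project.dataset.table
print(quote_bigquery_table("my_project", "sales", "orders"))
# Two-part form: dataset.table
print(quote_bigquery_table("sales", "orders"))
```

Taking the parts as a variable-length argument keeps the same helper usable for both two-part and three-part names.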

Changes

  • Connector reconcile/connectors/bigquery.py (BigQueryDataSource); registered in ReconSourceType and source_adapter.create_adapter; install prompt + recon_capture display name.
  • Schema type handling: the connector's INFORMATION_SCHEMA query canonicalizes the few BigQuery types sqlglot can't bridge to Databricks (BIGNUMERIC → string, bare NUMERIC → decimal(38,9), TIME → string, JSON → variant, RANGE<T> → struct<…>); everything else is left to sqlglot.
  • Row hashing: adds a BigQuery hash algorithm (TO_HEX(SHA256()), matching Databricks sha2(...,256)) and a scale-aware decimal transform (FORMAT('%.<scale>f', col)) so BigQuery's trailing-zero-stripped numeric strings match Spark's scale-padded DECIMAL strings.
  • Docs + tests: supported-sources row, a BigQuery config tab (with a compute note), connector tests, a type-coverage guardrail test, and an adapter test.
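As a rough illustration of the schema type handling listed above, the special cases can be written as a lookup table. This is a sketch in Python, whereas the PR performs the canonicalization inside the INFORMATION_SCHEMA query. The function name `canonicalize_bigquery_type` is hypothetical, RANGE<T> is omitted because the description elides the struct fields it maps to, and treating parameterized forms such as NUMERIC(10,2) as "not a special case" is an assumption of this sketch.

```python
from typing import Optional

# Special cases named in the PR description. RANGE<T> is intentionally
# left out: the description does not spell out the target struct fields.
_SPECIAL_CASES = {
    "BIGNUMERIC": "string",
    "NUMERIC": "decimal(38,9)",  # bare NUMERIC only
    "TIME": "string",
    "JSON": "variant",
}


def canonicalize_bigquery_type(bq_type: str) -> Optional[str]:
    """Return the Databricks-side type for a special case, else None.

    None means "no override here", i.e. the type is left to the normal
    dialect translation. Matching is on the exact bare type name
    (an assumption of this sketch).
    """
    return _SPECIAL_CASES.get(bq_type.strip().upper())


for name in ("BIGNUMERIC", "numeric", "NUMERIC(10,2)", "INT64"):
    print(name, "->", canonicalize_bigquery_type(name))
```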

Testing

  • make fmt / make lint (pylint 10.0/10.0) and make test — green (1319 passed).
  • End-to-end on a real workspace: a BigQuery source (UC Federation connection) reconciled against an identical Databricks copy via the deployed reconcile job; schema, row, and data all matched (0 mismatches / 0 missing).

Notes / limitations

  • Compute: BigQuery reads use remote_query, which requires Databricks Runtime 17.3+ or serverless compute (the reconcile job's default cluster may run an older runtime — point it at a DBR 17.3+ cluster via job_overrides.existing_cluster_id, or run serverless). This applies to all Lakehouse Federation reconcile sources. Documented in the config tab.
  • INTERVAL maps to two Databricks columns, which the 1:1 schema comparison can't represent — surfaces as a visible mismatch (documented).

@codecov

codecov Bot commented Jun 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.37%. Comparing base (7ebd945) to head (ac18df1).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2527      +/-   ##
==========================================
+ Coverage   69.10%   69.37%   +0.26%     
==========================================
  Files         105      106       +1     
  Lines        9482     9565      +83     
  Branches     1050     1056       +6     
==========================================
+ Hits         6553     6636      +83     
  Misses       2735     2735              
  Partials      194      194              

@github-actions

github-actions Bot commented Jun 24, 2026

✅ 173/173 passed, 4 flaky, 2 skipped, 1h23m4s total

Flaky tests:

  • 🤪 test_installs_and_runs_local_bladebridge (12.588s)
  • 🤪 test_transpiles_informatica_to_sparksql (25.649s)
  • 🤪 test_transpile_teradata_sql (27.105s)
  • 🤪 test_transpile_teradata_sql_non_interactive[False] (6.12s)

Running from acceptance #4927

Adds BigQuery as a Lakehouse Federation reconcile source (schema/row/data/all/aggregate),
reusing the existing remote_query path like the other federation connectors.

- New BigQueryDataSource: remote_query reads, backtick 3-part `project.dataset.table` names,
  INFORMATION_SCHEMA schema query with scale/precision canonicalization
- Register BIGQUERY in ReconSourceType and source_adapter; install prompts + result display name
- Row hashing for BigQuery: TO_HEX(SHA256()) (matches Databricks sha2) and scale-aware decimal
  FORMAT so cross-engine hashes match Databricks DECIMAL string output
- Docs (supported sources + config tab incl. DBR 17.3+/serverless compute note) and unit tests
  incl. a type-coverage guardrail
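
Why the scale-aware decimal formatting matters for row hashing can be shown with a small Python sketch. It does not reproduce the reconcile query generation: `pad_decimal` and `row_hash` are hypothetical helpers, and joining column strings with no separator is an assumption made only for the demonstration. The point is that SHA-256 over "1.5" and over "1.50" differs, so both sides must render decimals with the same scale before hashing.

```python
import hashlib
from decimal import Decimal
from typing import List


def pad_decimal(value: str, scale: int) -> str:
    """Render a numeric string with exactly `scale` fractional digits."""
    return format(Decimal(value), f".{scale}f")


def row_hash(values: List[str]) -> str:
    """Lowercase hex SHA-256 over the concatenated column strings."""
    return hashlib.sha256("".join(values).encode("utf-8")).hexdigest()


source_row = ["42", "1.5"]    # trailing zero stripped
target_row = ["42", "1.50"]   # padded to scale 2

print(row_hash(source_row) == row_hash(target_row))  # False

# Pad the source value to the target column's scale before hashing.
normalized = [source_row[0], pad_decimal(source_row[1], 2)]
print(row_hash(normalized) == row_hash(target_row))  # True
```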
@take60 take60 force-pushed the feature/reconcile-bigquery branch from d73cf77 to 13141ee Compare June 24, 2026 06:50
@take60 take60 marked this pull request as ready for review June 24, 2026 07:26
@take60 take60 requested a review from a team as a code owner June 24, 2026 07:26

@m-abulazm m-abulazm left a comment

Contributor

Added a UC connection named bigquery_sandbox for the e2e test.

take60 added 4 commits June 24, 2026 22:02
…ion tests

- bigquery.py: reference tables and INFORMATION_SCHEMA two-part (dataset.table);
  the project is abstracted by the UC connection, matching the other federated
  connectors. list_schemas uses bare SCHEMATA via the connection's default project.
- install: drop the BigQuery project prompt; catalog is empty for the bigquery dialect.
- unit tests: update assertions from three-part to two-part naming.
- integration: add bigquery e2e (report_type=schema) plus read_schema/list tests
  against the bigquery_sandbox UC connection.

main removed profiler_dashboard from LakebridgeConfiguration (#2512); update the
BigQuery reconcile install test to match.

catalog="" is dropped by blueprint serde and reloads as None, breaking the required
str field (e2e SerdeError). BigQuery has no separate catalog, so mirror the dataset
into catalog (non-empty, round-trips); the connector ignores it (two-part naming).
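
The round-trip problem in the last commit message can be reproduced with a toy serializer. This is not blueprint's implementation: `SketchNames`, `dump`, and `load` are hypothetical stand-ins, and `TypeError` stands in for the SerdeError seen in the e2e run. The sketch only assumes what the commit message states, namely that an empty string is dropped on save and comes back as a missing value.

```python
from dataclasses import asdict, dataclass


@dataclass
class SketchNames:
    """Hypothetical config with two required string fields."""
    catalog: str
    schema: str


def dump(cfg: SketchNames) -> dict:
    # Stand-in for a serializer that omits empty values.
    return {key: value for key, value in asdict(cfg).items() if value != ""}


def load(raw: dict) -> SketchNames:
    # Stand-in for a strict loader: required str fields must be present.
    for name in ("catalog", "schema"):
        if not isinstance(raw.get(name), str):
            raise TypeError(f"{name} must be str, got {raw.get(name)!r}")
    return SketchNames(catalog=raw["catalog"], schema=raw["schema"])


# An empty catalog is dropped on save, so the reload fails.
try:
    load(dump(SketchNames(catalog="", schema="sales")))
except TypeError as err:
    print(f"round-trip failed: {err}")

# Mirroring the dataset into catalog keeps the value non-empty.
print(load(dump(SketchNames(catalog="sales", schema="sales"))))
```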
@take60

take60 commented Jun 24, 2026

Contributor Author

@m-abulazm thanks for setting up the connection.

All 4 BigQuery acceptance tests now fail for the same reason, a missing connection grant rather than a code defect:

PERMISSION_DENIED: User does not have USE CONNECTION on Connection 'bigquery_sandbox'

(The deployed-job path shows the name masked as bigquery_TEST_CATALOG, but it is the same connection.)

Two things from your side:

  1. Grant USE CONNECTION on the BigQuery connection to the acceptance test principal
  2. Heads-up for the next step: once the grant is in, remote_query will try to materialize
    results into the read dataset (public) — so please also make sure that dataset is writable by
    the connection's service account, otherwise we'll hit that right after. (Not failing on this
    yet — execution stops at the grant first.)

Thanks

@bishwajit-db

Contributor

Materialization dataset: undocumented write requirement + no config knob

In production the BigQuery materialization target always defaults to the read dataset (_mat_dataset falls back to schema; create_adapter never sets materialization_dataset). Two gaps follow from that:

  1. Docs: the source dataset must be writable by the connection's service account (remote_query materializes results there), but only the DBR 17.3+ requirement is documented. Suggest a line in the BigQuery config notes, e.g.: "ensure the source dataset, or a dedicated materialization dataset, is writable by the connection's service account."

  2. Config: materialization_dataset is a constructor-only arg with no path from config (SourceConnectionConfig has no field, create_adapter never passes it), so it's only settable in test code. Adding it to SourceConnectionConfig and plumbing it through create_adapter would let users keep the source dataset read-only. It also fixes list_schemas, which passes _mat_dataset("") and resolves to an empty materializationDataset when none is configured.

The doc line unblocks the PR; the config change can be a follow-up.
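
A minimal sketch of the suggested config plumbing, assuming a simplified config object. `SketchConnectionConfig` and `resolve_materialization_dataset` are hypothetical names, not the existing SourceConnectionConfig or _mat_dataset, and raising on an empty result is one possible design choice rather than current behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchConnectionConfig:
    """Hypothetical config carrying an optional materialization dataset."""
    connection_name: str
    materialization_dataset: Optional[str] = None


def resolve_materialization_dataset(cfg: SketchConnectionConfig, schema: str) -> str:
    """Prefer the configured dataset, fall back to the read dataset.

    Raises instead of returning an empty name, which is the case a
    schema-less call (such as listing schemas) would otherwise hit.
    """
    target = cfg.materialization_dataset or schema
    if not target:
        raise ValueError("no materialization dataset configured and no schema given")
    return target


default_cfg = SketchConnectionConfig(connection_name="bigquery_sandbox")
scratch_cfg = SketchConnectionConfig(
    connection_name="bigquery_sandbox",
    materialization_dataset="recon_scratch",  # hypothetical dataset name
)

print(resolve_materialization_dataset(default_cfg, "public"))  # public
print(resolve_materialization_dataset(scratch_cfg, "public"))  # recon_scratch
print(resolve_materialization_dataset(scratch_cfg, ""))        # recon_scratch
```

With a configured dataset, the source dataset can stay read-only and the schema-less call still resolves to a usable target.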

take60 added 2 commits June 25, 2026 21:23
…ataset

Address review feedback: document that the source dataset (or a dedicated materialization
dataset) must be writable by the connection's service account, since remote_query
materializes results there. Also correct the catalog description for two-part naming —
the project is taken from the UC connection; catalog mirrors the dataset.
@take60

take60 commented Jun 25, 2026

Contributor Author

Thanks @bishwajit-db for the review!

(1) Docs — done here: added a "Writable dataset required" note (source or a dedicated materialization dataset must be writable by the connection's SA). Also fixed the naming text for two-part — project comes from the UC connection, schema is the dataset, catalog mirrors it.

(2) Config — agreed, taking as a follow-up: add materialization_dataset to SourceConnectionConfig and thread it through create_adapter (also fixes list_schemas' empty materializationDataset). Separate PR so this can land on the doc fix. #2529

@m-abulazm thanks for setting up the UC connection and granting the permission. I added the tests and verified that they pass.

@take60 take60 enabled auto-merge June 25, 2026 13:16
@take60 take60 requested a review from m-abulazm June 25, 2026 13:16

3 participants