Add BigQuery as a reconcile source #2527
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #2527 +/- ##
==========================================
+ Coverage 69.10% 69.37% +0.26%
==========================================
Files 105 106 +1
Lines 9482 9565 +83
Branches 1050 1056 +6
==========================================
+ Hits 6553 6636 +83
Misses 2735 2735
Partials 194 194
✅ 173/173 passed, 4 flaky, 2 skipped, 1h23m4s total
Flaky tests:
Running from acceptance #4927
Adds BigQuery as a Lakehouse Federation reconcile source (schema/row/data/all/aggregate), reusing the existing remote_query path like the other federation connectors.
- New BigQueryDataSource: remote_query reads, backtick 3-part `project.dataset.table` names, INFORMATION_SCHEMA schema query with scale/precision canonicalization
- Register BIGQUERY in ReconSourceType and source_adapter; install prompts + result display name
- Row hashing for BigQuery: TO_HEX(SHA256()) (matches Databricks sha2) and scale-aware decimal FORMAT so cross-engine hashes match Databricks DECIMAL string output
- Docs (supported sources + config tab incl. DBR 17.3+/serverless compute note) and unit tests incl. a type-coverage guardrail
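To illustrate why the scale-aware decimal format matters for row hashing, here is a minimal Python sketch. It is not the connector's code: `sha256_hex` and `pad_to_scale` are hypothetical helper names, and the sketch assumes both engines hash the UTF-8 string form of each value. Under that assumption, a decimal rendered as `12.5` on one side and `12.5000` on the other yields different digests until both sides are padded to the column's scale.

```python
import hashlib
from decimal import Decimal


def sha256_hex(text: str) -> str:
    """Hex-encoded SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pad_to_scale(value: Decimal, scale: int) -> str:
    """Render a decimal with exactly `scale` fractional digits."""
    return f"{value:.{scale}f}"


# Illustrative values only: the same number rendered two ways.
stripped = "12.5"                           # trailing zeros stripped
padded = pad_to_scale(Decimal("12.5"), 4)   # padded to scale 4

# Without normalization the string forms differ, so the hashes differ.
hashes_differ = sha256_hex(stripped) != sha256_hex(padded)

# After padding the stripped form to the same scale, the hashes agree.
normalized = pad_to_scale(Decimal(stripped), 4)
hashes_match = sha256_hex(normalized) == sha256_hex(padded)
```

The design point is that the normalization happens before hashing, since a hash cannot be compared "approximately" once the string forms diverge.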
d73cf77 to 13141ee
m-abulazm
left a comment
Added a UC connection named bigquery_sandbox for the e2e test.
…ion tests
- bigquery.py: reference tables and INFORMATION_SCHEMA two-part (dataset.table); the project is abstracted by the UC connection, matching the other federated connectors. list_schemas uses bare SCHEMATA via the connection's default project.
- install: drop the BigQuery project prompt; catalog is empty for the bigquery dialect.
- unit tests: update assertions from three-part to two-part naming.
- integration: add bigquery e2e (report_type=schema) plus read_schema/list tests against the bigquery_sandbox UC connection.
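The move from three-part to two-part naming can be sketched as follows. This is an illustration under stated assumptions, not the code in bigquery.py: `quote_two_part` is a hypothetical helper, and wrapping the whole `dataset.table` path in one pair of backticks is assumed from the earlier commit's description of backtick-quoted names. The project no longer appears because the UC connection supplies it.

```python
def quote_two_part(dataset: str, table: str) -> str:
    """Return a backtick-quoted `dataset.table` reference (no project part)."""
    for part in (dataset, table):
        # Reject empty parts and embedded backticks rather than trying to escape them.
        if not part or "`" in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return f"`{dataset}.{table}`"


# Hypothetical dataset and table names, for illustration only.
reference = quote_two_part("sales", "orders")
```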
main removed profiler_dashboard from LakebridgeConfiguration (#2512); update the BigQuery reconcile install test to match.
catalog="" is dropped by blueprint serde and reloads as None, breaking the required str field (e2e SerdeError). BigQuery has no separate catalog, so mirror the dataset into catalog (non-empty, round-trips); the connector ignores it (two-part naming).
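The round-trip failure described above can be reproduced in miniature. The sketch below does not use blueprint; `SourceConfig`, `dump`, and `load` are hypothetical stand-ins that only mimic the behavior the commit message reports, namely that empty values are dropped on save and a required `str` field then fails on reload.

```python
from dataclasses import asdict, dataclass, fields


@dataclass
class SourceConfig:
    """Hypothetical config with two required string fields."""
    catalog: str
    schema: str


def dump(cfg: SourceConfig) -> dict:
    # Assumption: the serializer omits empty strings and None values.
    return {k: v for k, v in asdict(cfg).items() if v not in ("", None)}


def load(raw: dict) -> SourceConfig:
    # A required str field that is absent after the round trip is an error.
    missing = [f.name for f in fields(SourceConfig) if raw.get(f.name) is None]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return SourceConfig(**raw)


# Mirroring the dataset into catalog keeps the field non-empty, so it survives.
mirrored = load(dump(SourceConfig(catalog="sales", schema="sales")))
```

With `catalog=""` the same round trip raises, which is the failure mode that motivated mirroring the dataset into `catalog`.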
@m-abulazm thanks for setting up the connection. All 4 BigQuery acceptance tests now fail on the same single cause — a connection grant, not PERMISSION_DENIED: User does not have USE CONNECTION on Connection 'bigquery_sandbox' Two things from your side:
Thanks
Materialization dataset: undocumented write requirement + no config knob
In production the BigQuery materialization target always defaults to the read dataset (
The doc line unblocks the PR; the config change can be a follow-up.
…ataset Address review feedback: document that the source dataset (or a dedicated materialization dataset) must be writable by the connection's service account, since remote_query materializes results there. Also correct the catalog description for two-part naming — the project is taken from the UC connection; catalog mirrors the dataset.
Thanks @bishwajit-db for the review!
(1) Docs: done here. Added a "Writable dataset required" note (the source dataset or a dedicated materialization dataset must be writable by the connection's SA). Also fixed the naming text for two-part names: the project comes from the UC connection, schema is the dataset, and catalog mirrors it.
(2) Config: agreed, taking it as a follow-up. Add materialization_dataset to SourceConnectionConfig and thread it through create_adapter (this also fixes list_schemas' empty materializationDataset). Separate PR so this one can land on the doc fix. #2529
@m-abulazm thanks for setting up the UC connection and granting the permission. Added testing and verified.
What
Adds BigQuery as a reconcile source, at parity with the other sources (`schema`/`row`/`data`/`all`/`aggregate`). BigQuery was already supported by the transpiler and profiler; this closes the gap for reconcile.
How it works
BigQuery uses the same Lakehouse Federation `remote_query` path as Snowflake/Oracle/etc., with no new dependencies. `BigQueryDataSource` uses backtick-quoted 3-part `project.dataset.table` names and reads metadata from `INFORMATION_SCHEMA.COLUMNS`. sqlglot already ships a BigQuery dialect, so query generation and schema comparison are reused unchanged.
Changes
- `reconcile/connectors/bigquery.py` (`BigQueryDataSource`); registered in `ReconSourceType` and `source_adapter.create_adapter`; install prompt + `recon_capture` display name.
- `INFORMATION_SCHEMA` query canonicalizes the few BigQuery types sqlglot can't bridge to Databricks (`BIGNUMERIC`→`string`, bare `NUMERIC`→`decimal(38,9)`, `TIME`→`string`, `JSON`→`variant`, `RANGE<T>`→`struct<…>`); everything else is left to sqlglot.
- Row hashing (`TO_HEX(SHA256())`, matching Databricks `sha2(...,256)`) and a scale-aware decimal transform (`FORMAT('%.<scale>f', col)`) so BigQuery's trailing-zero-stripped numeric strings match Spark's scale-padded `DECIMAL` strings.
Testing
- `make fmt` / `make lint` (pylint 10.0/10.0) and `make test`: green (1319 passed).
- `schema`, `row`, and `data` all matched (0 mismatches / 0 missing).
Notes / limitations
- `remote_query` requires Databricks Runtime 17.3+ or serverless compute (the reconcile job's default cluster may run an older runtime; point it at a DBR 17.3+ cluster via `job_overrides.existing_cluster_id`, or run serverless). This applies to all Lakehouse Federation reconcile sources. Documented in the config tab.
- `INTERVAL` maps to two Databricks columns, which the 1:1 schema comparison can't represent; this surfaces as a visible mismatch (documented).
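The type canonicalization listed under Changes could look roughly like the lookup below. This is a sketch, not the query in bigquery.py: `canonicalize_bigquery_type` and `_CANONICAL_TYPES` are hypothetical names, only the four fully specified mappings from the description are included, and `RANGE<T>` is left out because the description elides the target struct layout. Any other type, including a parameterized `NUMERIC(p, s)`, is returned unchanged, standing in for "left to sqlglot".

```python
# Mappings taken from the PR description; everything else passes through.
_CANONICAL_TYPES = {
    "BIGNUMERIC": "string",
    "NUMERIC": "decimal(38,9)",  # bare NUMERIC only
    "TIME": "string",
    "JSON": "variant",
}


def canonicalize_bigquery_type(bq_type: str) -> str:
    """Map a BigQuery type name to a Databricks-comparable type where needed."""
    key = bq_type.strip().upper()
    # Exact match only: parameterized types such as NUMERIC(10, 2) fall through.
    return _CANONICAL_TYPES.get(key, bq_type)


examples = {t: canonicalize_bigquery_type(t) for t in ("NUMERIC", "NUMERIC(10, 2)", "JSON", "INT64")}
```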