Refactor: bug fixes, resource safety, shared utilities, logging, validation, and performance by vjsingh1984 · Pull Request #4 · gregwood-db/databricks-dr-examples

vjsingh1984 · 2026-02-19T05:17:18Z

Fixes #3

Summary

This PR is a comprehensive refactoring of the DR sync scripts, addressing 6 confirmed bugs, warehouse resource leaks, massive code duplication, zero observability, no safety nets, and performance bottlenecks — all while preserving full backward compatibility.

Existing users editing common.py and running python sync_*.py will not be disrupted.

The changes are organized into 8 atomic commits (reviewable independently):

#	Commit	What it does
1	Fix 6 confirmed bugs	Fixes crash bug in `sync_grs_ext.py`, data corruption in `sync_views.py`, operator precedence bug, missing f-strings, example list bug
2	Fix warehouse resource leaks	`managed_warehouse()` context manager guarantees cleanup; `sys.exit()` replaced with `RuntimeError`
3	Add `dr_sync/` shared utilities	Extracts ~400 lines of duplicated code into reusable modules (`sql_utils`, `csv_mapping`, `thread_utils`, `workspace`, `exceptions`)
4	Add `DRSyncConfig`	Dataclass with `from_common_module()` (backward compat) and `from_env()` (secure tokens via env vars)
5	Add structured logging	Replaces all 76 `print()` calls with `logger.info/warning/error` across 9 scripts
6	Add validation and `--dry-run`	CSV validation, `--dry-run` flag, `--log-level` CLI support for all scripts
7	Improve performance	O(1) dict lookups, Spark SQL filter pushdown, parallel schema creation
8	Apply code formatting	black formatting, ruff config for Databricks builtins, import order fixes

Why Each Change Matters

Bug Fixes (Commit 1)

The most critical bug is in sync_grs_ext.py: loaded_table_types is initialized as an empty list but never populated in the loop body, causing a DataFrame column length mismatch that crashes the script when writing the status table. The operator precedence bug in sync_views.py can cause system tables to be synced or user tables to be skipped.

Warehouse Leaks (Commit 2)

Every run of sync_tables.py leaks 2 serverless warehouses. Three other scripts leak 1 each. Over repeated DR runs, this accumulates orphaned warehouses that incur cost and clutter. The new managed_warehouse() context manager wraps creation in try/finally — even if the script crashes, the warehouse is deleted.

Shared Utilities (Commit 3)

The SQL polling loop was copied 8 times across 4 files. When the sync_grs_ext.py copy drifted (missing f-string), only that copy broke — a classic duplication hazard. Consolidating into dr_sync/sql_utils.py means one implementation to maintain, test, and improve (e.g., the shared version adds timeout protection and statement cancellation that none of the originals had).

Configuration (Commit 4)

DRSyncConfig.from_common_module() reads the existing common.py — zero disruption. But from_env() provides a secure alternative: DR_SYNC_SOURCE_TOKEN as an env var instead of a checked-in file. Scripts auto-detect which to use.

Logging (Commit 5)

DR operations are high-stakes. When something fails at 3 AM, operators need timestamps, log levels, and the ability to grep for ERROR — not undifferentiated print() output. All 76 calls are now properly leveled (info for progress, warning for skips, error for failures).

Dry-Run (Commit 6)

python sync_tables.py --dry-run shows every table that would be cloned, every warehouse that would be created, every external table that would be dropped — without executing anything. This lets operators validate their configuration before committing to a DR sync. CSV validation catches mapping file errors (duplicates, missing columns, empty values) before execution begins.

Performance (Commit 7)

For workspaces with many credentials/locations, the O(n)→O(1) lookup improvement is meaningful. Pushing SQL filters down to Spark eliminates transferring the entire system.information_schema.tables to Python. Parallel schema creation speeds up the catalog/schema sync phase.

New Files

File	Purpose
`dr_sync/__init__.py`	Package init with public API re-exports
`dr_sync/sql_utils.py`	`execute_statement_sync()`, `managed_warehouse()`, `drop_table_if_exists()`
`dr_sync/config.py`	`DRSyncConfig` dataclass
`dr_sync/csv_mapping.py`	CSV loading, validation, safe lookups
`dr_sync/thread_utils.py`	`parallel_map()` with `as_completed()`, `ProgressCounter`
`dr_sync/workspace.py`	`create_client()` factory
`dr_sync/exceptions.py`	`DRSyncError` hierarchy
`dr_sync/log.py`	`setup_logging()`
`.env.example`	Environment variable template
`ruff.toml`	Ruff config for Databricks builtins

How to Verify

# Pre-commit checks
pre-commit run --all-files  # black, ruff, mypy all pass

# Dry-run any script
python sync_tables.py --dry-run
python sync_creds_and_locs.py --dry-run --log-level DEBUG

# Existing workflow (backward compatible)
# Edit common.py as before, then:
python sync_catalogs_and_schemas.py

# New env-var workflow
export DR_SYNC_SOURCE_HOST=...
export DR_SYNC_SOURCE_TOKEN=...
python sync_tables.py --dry-run

Test Plan

pre-commit run --all-files passes (black, ruff, mypy)
Each script runs with --dry-run and logs planned operations
sync_creds_and_locs.py creates credentials/locations in a test workspace
sync_catalogs_and_schemas.py creates catalogs/schemas in a test workspace
sync_tables.py deep clones tables end-to-end
No orphaned warehouses remain after script completion
common.py workflow: edit common.py, run script, verify it reads config
Env-var workflow: set DR_SYNC_* vars, run script, verify it uses them

- sync_views.py: Fix incorrect header (said sync_tables.py) - sync_views.py: Fix f-string on dict key ("status" field) - sync_views.py: Fix operator precedence in filter (missing parens) - sync_grs_ext.py: Fix missing f-string prefix in error status - sync_grs_ext.py: Remove unpopulated loaded_table_types column that caused DataFrame column length mismatch - examples/clone_to_secondary.py: Fix catalog list (was single string instead of two list elements) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- sync_tables.py: Wrap both source and target warehouse usage in try/finally blocks to guarantee cleanup on success or failure - sync_shared_tables.py: Add try/finally for target warehouse cleanup - sync_grs_ext.py: Add try/finally for target warehouse cleanup - sync_views.py: Add try/finally for target warehouse cleanup - sync_shared_tables.py: Replace sys.exit() with raise RuntimeError() to avoid killing notebook kernels; remove unused sys import Each leaked warehouse persists in the workspace (even auto-stopped) and can be accidentally restarted, incurring cost. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Create dr_sync/ package with reusable utilities: - sql_utils.py: execute_statement_sync() with timeout/cancellation, managed_warehouse() context manager, drop_table_if_exists() - workspace.py: create_client() factory with env var support - csv_mapping.py: load_mapping() with validation, lookup_value() - thread_utils.py: parallel_map() with error isolation, ProgressCounter - exceptions.py: DRSyncError hierarchy (ConfigurationError, MappingError, StatementError, WarehouseError, SyncError) Refactor sync scripts to use shared utilities: - Replace 8 identical polling loops with execute_statement_sync() - Replace 5 warehouse creation blocks with managed_warehouse() - Replace 2 duplicated drop_table functions with drop_table_if_exists() - Replace raw pd.read_csv calls with load_mapping/lookup_value Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add dr_sync/config.py with DRSyncConfig dataclass supporting: - from_common_module(): backward-compatible import from common.py - from_env(): secure configuration via DR_SYNC_* environment variables - validate(): returns list of configuration errors - Add .env.example template with all DR_SYNC_* variable names - Update all 9 sync scripts to auto-detect config source: env vars used when DR_SYNC_SOURCE_HOST is set, else common.py Existing users editing common.py are not disrupted. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add dr_sync/log.py with setup_logging() for console + optional file output, compatible with Databricks notebooks (stdout) - Replace all 76 print() calls across 9 sync scripts with logger calls: - logger.info() for normal progress messages - logger.warning() for skip conditions (GCP, already exists) - logger.error() for failures (missing mappings, creation errors) - Update dr_sync/sql_utils.py to use logging in managed_warehouse() and drop_table_if_exists() - All log messages use %s lazy formatting instead of f-strings Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add CSV validation functions to dr_sync/csv_mapping.py: - validate_catalog_mapping(): checks for duplicates - validate_cred_mapping(): checks cloud-specific required columns - validate_ext_location_mapping(): checks for empty URLs - Add --dry-run flag to all 9 sync scripts: - SQL scripts: skip warehouse creation entirely, log planned SQL ops - SDK scripts: skip create/update API calls, log what would change - Supported via config.dry_run and DR_SYNC_DRY_RUN env var - Add --log-level flag (DEBUG/INFO/WARNING/ERROR) to all scripts - CLI args only activate in __name__ == "__main__" blocks, so notebook execution is not affected Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace O(n) DataFrame scans with O(1) dict lookups in sync_creds_and_locs.py and sync_catalogs_and_schemas.py. Push Spark SQL filters down from Python to the database engine in sync_grs_ext.py and sync_tables.py. Parallelize schema creation across catalogs.

Run black formatter across all Python files. Fix E402 import order in dr_sync/sql_utils.py. Add ruff.toml with Databricks notebook builtins (spark, sql, display). Add .gitignore for Python, IDE, and env files.

vjsingh1984 and others added 8 commits February 18, 2026 22:34

Apply black formatting, fix import order, add ruff and gitignore config

c9425c6

Run black formatter across all Python files. Fix E402 import order in dr_sync/sql_utils.py. Add ruff.toml with Databricks notebook builtins (spark, sql, display). Add .gitignore for Python, IDE, and env files.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor: bug fixes, resource safety, shared utilities, logging, validation, and performance#4

Refactor: bug fixes, resource safety, shared utilities, logging, validation, and performance#4
vjsingh1984 wants to merge 8 commits into
gregwood-db:mainfrom
vjsingh1984:feature/performance

vjsingh1984 commented Feb 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

vjsingh1984 commented Feb 19, 2026

Summary

Why Each Change Matters

Bug Fixes (Commit 1)

Warehouse Leaks (Commit 2)

Shared Utilities (Commit 3)

Configuration (Commit 4)

Logging (Commit 5)

Dry-Run (Commit 6)

Performance (Commit 7)

New Files

How to Verify

Test Plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant