Comprehensive refactoring: packaging, tests, new sync modules, AWS support by vjsingh1984 · Pull Request #5 · gregwood-db/databricks-dr-examples

vjsingh1984 · 2026-02-19T10:20:43Z

Fixes #3

Summary

This is a comprehensive refactoring of the DR sync tools with 10 commits organized across 4 major phases, building on the existing stacked PR approach. All changes maintain backward compatibility while adding significant production-ready features for AWS + Databricks customers.

What Changed

Core Improvements (Commits from base)

| Area | Before | After |
| --- | --- | --- |
| Bugs | 6 confirmed bugs (crash, data corruption, operator precedence) | ✅ All fixed |
| Resource leaks | 4 scripts leaked serverless warehouses | ✅ All fixed with context managers |
| Code duplication | ~400 lines duplicated 8 times | ✅ Consolidated into `dr_sync/` package |
| Logging | 76 `print()` calls, no structure | ✅ Structured logging with levels |
| Config | Hardcoded in `common.py` only | ✅ `DRSyncConfig` with env var support |
| Safety | No dry-run, no validation | ✅ `--dry-run` flag + CSV validation |

New Features

Phase 1: Packaging, CI/CD, Tests (Commit `a298d96`)

- pyproject.toml: Proper Python package with CLI entry point
- Makefile: install, install-dev, test, lint, format, clean commands
- .github/workflows/ci.yml: CI/CD for Python 3.9-3.12 with linting and testing
- 40 unit tests: 53% coverage for dr_sync/ package

Phase 2: Core Infrastructure (Commit `ebd5f02`)

- dr_sync/retry.py: Exponential backoff decorator for SDK operations
- dr_sync/checkpoint.py: Resumable sync tracking (state in .dr_sync_state/)
- dr_sync/filter.py: Include/exclude glob patterns for selective sync
- dr_sync/registry.py: Plugin-based sync module registration with dependency resolution
- dr_sync/cli.py: Unified CLI: dr-sync run --all, checkpoint management
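As an illustration of the exponential backoff decorator listed for dr_sync/retry.py, here is a minimal sketch. The decorator name, its parameters (`max_attempts`, `base_delay`, `factor`, `exceptions`, `sleep`), and the defaults are assumptions made for this example and are not taken from the PR's code.

```python
import functools
import time


def retry(max_attempts=3, base_delay=0.5, factor=2.0,
          exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped call, multiplying the delay by `factor` after each failure.

    All parameter names and defaults are hypothetical, chosen for illustration.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(delay)
                    delay *= factor

        return wrapper

    return decorator
```

Passing `sleep` as a parameter is a design choice for testability: a test can record the delays instead of actually waiting.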

Phase 3: AWS-Focused Sync Modules (Commit `ebd5f02`)

- sync_jobs.py: Jobs/Workflows with checkpointing and filtering
- sync_cluster_policies.py: Cluster policy definitions (parallel sync)
- sync_instance_pools.py: Instance pools with AWS instance type remapping
- sync_instance_profiles.py: AWS IAM profile registration
- sync_secret_scopes.py: Secret scope metadata and ACLs (NOT values by design)
- sync_notebooks.py: Notebooks with folder structure preservation

Phase 4: Documentation (Commit `034fb13`)

- Updated README.md with CLI usage, installation, and all sync modules
- New docs/aws_usage_guide.md for AWS customers (IAM, S3 CRR, Secrets Manager, CloudWatch, VPC)

Unified CLI

# Run all sync in dependency order
dr-sync run --all

# Dry-run to preview
dr-sync run --all --dry-run

# Selective filtering
dr-sync run --all --include "prod.*.*" --exclude "*.staging.*"

# Resume from checkpoint
dr-sync run --all --resume

# Checkpoint management
dr-sync checkpoint list
dr-sync checkpoint clear jobs
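To make the `--include` / `--exclude` semantics concrete, the following sketch applies glob patterns with the standard library's `fnmatch`. The function name `is_selected` and the rules that exclude wins over include and that an empty include list selects everything are assumptions for illustration; the actual behavior of dr_sync/filter.py may differ.

```python
from fnmatch import fnmatchcase


def is_selected(name, include=(), exclude=()):
    """Return True if `name` matches any include pattern and no exclude pattern.

    Assumptions: exclude takes priority, and an empty include list
    means "include everything".
    """
    if any(fnmatchcase(name, pat) for pat in exclude):
        return False
    return not include or any(fnmatchcase(name, pat) for pat in include)


# Hypothetical three-level resource names (catalog.schema.table).
names = ["prod.sales.orders", "prod.staging.tmp", "dev.sales.orders"]
selected = [n for n in names if is_selected(n, ["prod.*.*"], ["*.staging.*"])]
print(selected)  # ['prod.sales.orders']
```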

Backward Compatibility

Fully preserved. All existing workflows continue to work:

# Traditional workflow still works
# Edit common.py, then:
python sync_tables.py

# New env var workflow
export DR_SYNC_SOURCE_HOST=...
export DR_SYNC_CATALOGS_TO_COPY=cat1,cat2
python sync_tables.py --dry-run

Installation

pip install databricks-dr-sync

This installs the dr-sync CLI command.

Test Coverage

- 40 unit tests passing
- 53% code coverage for dr_sync/ package
- All linting passing (black, ruff, mypy)
- CI/CD via GitHub Actions on Python 3.9-3.12

Next Steps for Maintainer

- Review the documentation in README.md and docs/aws_usage_guide.md
- Test in a dev environment with sample workspaces
- Run tests: pytest
- Check CI/CD: https://github.com/gregwood-db/databricks-dr-examples/actions

Related Issues

This PR addresses the issues identified in the gap analysis:

- Warehouse resource leaks → Fixed
- No safety nets → Dry-run, validation added
- No observability → Structured logging added
- No CI/CD → GitHub Actions workflow added
- No tests → 40 unit tests added
- AWS-specific gaps → 6 new sync modules added

- sync_views.py: Fix incorrect header (said sync_tables.py)
- sync_views.py: Fix f-string on dict key ("status" field)
- sync_views.py: Fix operator precedence in filter (missing parens)
- sync_grs_ext.py: Fix missing f-string prefix in error status
- sync_grs_ext.py: Remove unpopulated loaded_table_types column that caused DataFrame column length mismatch
- examples/clone_to_secondary.py: Fix catalog list (was single string instead of two list elements)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- sync_tables.py: Wrap both source and target warehouse usage in try/finally blocks to guarantee cleanup on success or failure
- sync_shared_tables.py: Add try/finally for target warehouse cleanup
- sync_grs_ext.py: Add try/finally for target warehouse cleanup
- sync_views.py: Add try/finally for target warehouse cleanup
- sync_shared_tables.py: Replace sys.exit() with raise RuntimeError() to avoid killing notebook kernels; remove unused sys import

Each leaked warehouse persists in the workspace (even auto-stopped) and can be accidentally restarted, incurring cost.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
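The try/finally cleanup pattern described in this commit can be sketched as a small context manager. This is a generic illustration: `managed_resource`, `create`, and `delete` are hypothetical names, and the callables stand in for whatever SDK calls create and delete a warehouse in the real scripts.

```python
from contextlib import contextmanager


@contextmanager
def managed_resource(create, delete):
    """Create a temporary resource and always delete it on exit.

    `create` returns a resource id and `delete` takes that id. Both are
    caller-supplied callables (hypothetical stand-ins for SDK calls).
    """
    resource_id = create()
    try:
        yield resource_id
    finally:
        delete(resource_id)  # runs on success and on exception


# Demonstration with fake callables: cleanup runs even though the body fails.
deleted = []
try:
    with managed_resource(lambda: "wh-123", deleted.append) as wh_id:
        raise RuntimeError("sync failed")
except RuntimeError:
    pass
print(deleted)  # ['wh-123']
```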

Create dr_sync/ package with reusable utilities:

- sql_utils.py: execute_statement_sync() with timeout/cancellation, managed_warehouse() context manager, drop_table_if_exists()
- workspace.py: create_client() factory with env var support
- csv_mapping.py: load_mapping() with validation, lookup_value()
- thread_utils.py: parallel_map() with error isolation, ProgressCounter
- exceptions.py: DRSyncError hierarchy (ConfigurationError, MappingError, StatementError, WarehouseError, SyncError)

Refactor sync scripts to use shared utilities:

- Replace 8 identical polling loops with execute_statement_sync()
- Replace 5 warehouse creation blocks with managed_warehouse()
- Replace 2 duplicated drop_table functions with drop_table_if_exists()
- Replace raw pd.read_csv calls with load_mapping/lookup_value

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
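One way to read "parallel_map() with error isolation" is that a failure on one item is recorded without aborting the rest. The sketch below shows that idea with `ThreadPoolExecutor`; the name `parallel_map_sketch`, its signature, and the `(results, errors)` return shape are assumptions and may not match thread_utils.py.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parallel_map_sketch(func, items, max_workers=4):
    """Apply `func` to each item in a thread pool.

    Returns (results, errors): two dicts keyed by item, so one failing
    item does not abort the others. Items must be hashable.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(func, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:  # isolate the failure to this item
                errors[item] = exc
    return results, errors
```

A caller can then log every entry in `errors` and decide whether a partial sync is acceptable, rather than losing all progress to the first exception.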

- Add dr_sync/config.py with DRSyncConfig dataclass supporting:
  - from_common_module(): backward-compatible import from common.py
  - from_env(): secure configuration via DR_SYNC_* environment variables
  - validate(): returns list of configuration errors
- Add .env.example template with all DR_SYNC_* variable names
- Update all 9 sync scripts to auto-detect config source: env vars used when DR_SYNC_SOURCE_HOST is set, else common.py

Existing users editing common.py are not disrupted.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
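The auto-detection rule (env vars when DR_SYNC_SOURCE_HOST is set, otherwise common.py) and the `from_env()` / `validate()` pair can be sketched as follows. Only the three DR_SYNC_* variable names come from this PR; the class name `SketchConfig`, its fields, the accepted truthy strings, and the error messages are assumptions.

```python
import os
from dataclasses import dataclass, field


@dataclass
class SketchConfig:
    """Hypothetical stand-in for DRSyncConfig with a reduced field set."""

    source_host: str = ""
    catalogs_to_copy: list = field(default_factory=list)
    dry_run: bool = False

    @classmethod
    def from_env(cls, env=None):
        env = os.environ if env is None else env
        raw = env.get("DR_SYNC_CATALOGS_TO_COPY", "")
        return cls(
            source_host=env.get("DR_SYNC_SOURCE_HOST", ""),
            catalogs_to_copy=[c.strip() for c in raw.split(",") if c.strip()],
            # Accepted truthy strings are an assumption.
            dry_run=env.get("DR_SYNC_DRY_RUN", "").lower() in ("1", "true", "yes"),
        )

    def validate(self):
        """Return a list of error strings; an empty list means valid."""
        errors = []
        if not self.source_host:
            errors.append("source_host is not set")
        if not self.catalogs_to_copy:
            errors.append("catalogs_to_copy is empty")
        return errors


def config_source(env=None):
    """Env vars win when DR_SYNC_SOURCE_HOST is set, else fall back to common.py."""
    env = os.environ if env is None else env
    return "env" if env.get("DR_SYNC_SOURCE_HOST") else "common.py"
```

Returning a list from `validate()` instead of raising on the first problem lets a script report every configuration error in one run.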

- Add dr_sync/log.py with setup_logging() for console + optional file output, compatible with Databricks notebooks (stdout)
- Replace all 76 print() calls across 9 sync scripts with logger calls:
  - logger.info() for normal progress messages
  - logger.warning() for skip conditions (GCP, already exists)
  - logger.error() for failures (missing mappings, creation errors)
- Update dr_sync/sql_utils.py to use logging in managed_warehouse() and drop_table_if_exists()
- All log messages use %s lazy formatting instead of f-strings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
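For readers unfamiliar with the lazy `%s` style, this sketch shows a stdout logger with an optional file handler and two lazily formatted calls. The function name `setup_logging_sketch`, the logger name, and the format string are assumptions; the real setup_logging() signature is not shown in this PR description.

```python
import logging
import sys


def setup_logging_sketch(level="INFO", log_file=None):
    """Configure a logger that writes to stdout and, optionally, a file."""
    logger = logging.getLogger("dr_sync_sketch")
    logger.setLevel(getattr(logging, level.upper()))
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    handlers = [logging.StreamHandler(sys.stdout)]
    if log_file:
        handlers.append(logging.FileHandler(log_file))
    for handler in handlers:
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger


logger = setup_logging_sketch("WARNING")
# Lazy %s formatting: the message string is only built if the record is emitted.
logger.info("Synced catalog %s", "cat1")  # suppressed at WARNING level
logger.warning("Skipping %s: already exists", "cat2")
```

With an f-string the interpolation would run before the level check, so the lazy form avoids that work for suppressed messages.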

- Add CSV validation functions to dr_sync/csv_mapping.py:
  - validate_catalog_mapping(): checks for duplicates
  - validate_cred_mapping(): checks cloud-specific required columns
  - validate_ext_location_mapping(): checks for empty URLs
- Add --dry-run flag to all 9 sync scripts:
  - SQL scripts: skip warehouse creation entirely, log planned SQL ops
  - SDK scripts: skip create/update API calls, log what would change
  - Supported via config.dry_run and DR_SYNC_DRY_RUN env var
- Add --log-level flag (DEBUG/INFO/WARNING/ERROR) to all scripts
- CLI args only activate in __name__ == "__main__" blocks, so notebook execution is not affected

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
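A duplicate check like the one attributed to validate_catalog_mapping() might look like the sketch below. It uses the standard `csv` module instead of pandas to stay self-contained, and the function name, the column names `source_catalog` / `target_catalog`, and the message wording are all hypothetical.

```python
import csv
import io
from collections import Counter


def find_duplicate_sources(csv_text, key_column="source_catalog"):
    """Return a list of error strings for values repeated in `key_column`."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = Counter(row[key_column] for row in rows)
    return [
        f"duplicate {key_column} {value!r} appears {n} times"
        for value, n in counts.items()
        if n > 1
    ]


# Hypothetical mapping file with one repeated source catalog.
sample = (
    "source_catalog,target_catalog\n"
    "cat1,cat1_dr\n"
    "cat2,cat2_dr\n"
    "cat1,cat1_other\n"
)
print(find_duplicate_sources(sample))
# ["duplicate source_catalog 'cat1' appears 2 times"]
```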

Replace O(n) DataFrame scans with O(1) dict lookups in sync_creds_and_locs.py and sync_catalogs_and_schemas.py. Push Spark SQL filters down from Python to the database engine in sync_grs_ext.py and sync_tables.py. Parallelize schema creation across catalogs.
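The scan-versus-lookup change can be illustrated without pandas: build a dict index once, then each lookup is an average O(1) dict access instead of a pass over every row. The row layout, the keys `source` and `target`, and the bucket URLs are made up for this example.

```python
def build_lookup(rows, key, value):
    """Index mapping rows once so each later lookup is a single dict access."""
    return {row[key]: row[value] for row in rows}


def scan(rows, src):
    """Linear scan: walks the rows on every call (the 'before' pattern)."""
    return next((r["target"] for r in rows if r["source"] == src), None)


# Hypothetical external-location mapping rows.
rows = [
    {"source": "s3://bucket-east/a", "target": "s3://bucket-west/a"},
    {"source": "s3://bucket-east/b", "target": "s3://bucket-west/b"},
]

lookup = build_lookup(rows, "source", "target")
# Both approaches return the same answer; only the cost per call differs.
assert scan(rows, "s3://bucket-east/b") == lookup.get("s3://bucket-east/b")
```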

Run black formatter across all Python files. Fix E402 import order in dr_sync/sql_utils.py. Add ruff.toml with Databricks notebook builtins (spark, sql, display). Add .gitignore for Python, IDE, and env files.

Phase 1 - Foundation:

- pyproject.toml: Proper Python package with metadata, dependencies, CLI entry point
- Makefile: Developer convenience commands (install, test, lint, format, clean)
- .github/workflows/ci.yml: GitHub Actions CI/CD for Python 3.9-3.12
- tests/conftest.py: Pytest fixtures for mock WorkspaceClient and configs
- tests/test_*.py: 40 unit tests for all dr_sync modules (53% coverage)

Phase 2 - Core Infrastructure:

- dr_sync/retry.py: Exponential backoff decorator for SDK calls
- dr_sync/checkpoint.py: Resumable sync tracking with state files
- dr_sync/filter.py: Include/exclude glob pattern filtering for resources
- dr_sync/registry.py: Plugin-based sync module registration with dependency resolution
- dr_sync/cli.py: Unified CLI with run, list, and checkpoint commands
- dr_sync/__init__.py: Export all new public APIs
- .gitignore: Add .dr_sync_state/ for checkpoint files

All tests pass with 53% code coverage.
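Registration with dependency resolution amounts to a topological sort over the registered modules. The sketch below uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+, which matches the CI matrix). The `register` decorator, the `_REGISTRY` dict, and the dependency edges between modules are assumptions for illustration and are not taken from dr_sync/registry.py.

```python
from graphlib import TopologicalSorter

_REGISTRY = {}  # name -> (function, dependency names)


def register(name, depends_on=()):
    """Decorator that records a sync function and the modules it depends on."""

    def decorator(func):
        _REGISTRY[name] = (func, tuple(depends_on))
        return func

    return decorator


def run_order():
    """Return module names so that every dependency precedes its dependents."""
    graph = {name: set(deps) for name, (_, deps) in _REGISTRY.items()}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical dependency edges between sync modules.
@register("instance_profiles")
def sync_instance_profiles():
    pass


@register("cluster_policies")
def sync_cluster_policies():
    pass


@register("instance_pools", depends_on=["instance_profiles"])
def sync_instance_pools():
    pass


@register("jobs", depends_on=["instance_pools", "cluster_policies"])
def sync_jobs():
    pass


order = run_order()
```

`TopologicalSorter` raises `graphlib.CycleError` if modules depend on each other in a loop, which gives a clear failure at startup.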

Phase 3 - New Sync Modules (AWS + Databricks customers):

- sync_jobs.py: Sync Databricks Jobs/Workflows definitions
  * Sync job tasks, clusters, schedules, triggers
  * Supports checkpointing, resume, and filtering
  * Handles ResourceAlreadyExists gracefully
- sync_cluster_policies.py: Sync cluster policies
  * Creates policies with same JSON in target
  * Parallel sync with ThreadPoolExecutor
- sync_instance_pools.py: Sync instance pools
  * Supports AWS instance type remapping between regions
  * Preserves min/max size, idle timeout settings
- sync_instance_profiles.py: Sync AWS instance profile registrations
  * Registers instance profiles in target workspace
  * IAM roles must exist in target AWS account
  * Skips validation to allow cross-account use
- sync_secret_scopes.py: Sync secret scope metadata and ACLs
  * Syncs scope definitions and permission grants
  * Does NOT sync secret values (by design - documented)
  * Supports Databricks and AWS Secrets Manager backends
- sync_notebooks.py: Export/import notebooks preserving folder structure
  * Supports SOURCE, JUPYTER, DBC formats
  * Creates directory structure in target
  * Skips system paths (/Workspace/Shared, /Users/)

All modules include:

- --dry-run flag for preview mode
- --log-level flag for configurable logging
- Integration with DRSyncConfig from common.py or env vars
- Structured logging via dr_sync.log
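The checkpoint and resume support mentioned for sync_jobs.py can be pictured as a per-module file of completed item names that a later run consults before doing work. This is a sketch under assumptions: the class name `CheckpointSketch`, the one-JSON-file-per-module layout, and the method names are hypothetical, and a temporary directory stands in for `.dr_sync_state/`.

```python
import json
import tempfile
from pathlib import Path


class CheckpointSketch:
    """Track completed item names per module in a JSON file."""

    def __init__(self, state_dir, module):
        self.path = Path(state_dir) / f"{module}.json"
        self.done = set()
        if self.path.exists():
            self.done = set(json.loads(self.path.read_text()))

    def mark_done(self, item):
        self.done.add(item)
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(sorted(self.done)))

    def pending(self, items):
        """Return the items that still need syncing, in their original order."""
        return [i for i in items if i not in self.done]

    def clear(self):
        self.done.clear()
        self.path.unlink(missing_ok=True)


# Simulate a run that finishes one job, then a resumed run.
state_dir = tempfile.mkdtemp()  # stand-in for .dr_sync_state/
first_run = CheckpointSketch(state_dir, "jobs")
first_run.mark_done("job_a")

resumed = CheckpointSketch(state_dir, "jobs")
remaining = resumed.pending(["job_a", "job_b"])
print(remaining)  # ['job_b']
```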

Phase 4 - Documentation:

- README.md updates:
  * Add overview section listing all capabilities
  * Document new unified CLI with dr-sync command
  * Document checkpoint/resume feature
  * Document structured logging and env var configuration
  * Add installation instructions (pip install -e .)
  * Add development section (tests, linting, code quality tools)
  * Document all 6 new sync modules (jobs, cluster_policies, instance_pools, instance_profiles, secret_scopes, notebooks)
  * Add See Also section linking to AWS guide and implementation plan
- docs/aws_usage_guide.md:
  * IAM role setup for cross-account access
  * S3 cross-region replication (CRR) configuration
  * S3 landing zone structure and lifecycle policies
  * AWS Secrets Manager integration for secret scopes
  * EC2 instance type remapping for instance pools
  * CloudWatch monitoring and logging setup
  * VPC networking and PrivateLink configuration
  * Cost optimization tips
  * Full example DR run sequence
  * Troubleshooting common AWS issues

vjsingh1984 and others added 11 commits February 18, 2026 22:34

Apply black formatting, fix import order, add ruff and gitignore config

c9425c6


vjsingh1984 wants to merge 11 commits into gregwood-db:main from vjsingh1984:feature/aws-enhancements
