Comprehensive refactoring: packaging, tests, new sync modules, AWS support #5

Open
vjsingh1984 wants to merge 11 commits into gregwood-db:main from vjsingh1984:feature/aws-enhancements

@vjsingh1984

Fixes #3

Summary

This PR is a comprehensive refactoring of the DR sync tools in 11 commits: core fixes first, then new features in 4 phases, building on the existing stacked PR approach. All changes maintain backward compatibility while adding production-ready features for AWS + Databricks customers.

What Changed

Core Improvements (Commits from base)

| Area | Before | After |
|------|--------|-------|
| Bugs | 6 confirmed bugs (crash, data corruption, operator precedence) | ✅ All fixed |
| Resource leaks | 4 scripts leaked serverless warehouses | ✅ All fixed with context managers |
| Code duplication | ~400 lines duplicated 8 times | ✅ Consolidated into dr_sync/ package |
| Logging | 76 print() calls, no structure | ✅ Structured logging with levels |
| Config | Only common.py hardcoded | DRSyncConfig with env var support |
| Safety | No dry-run, no validation | --dry-run flag + CSV validation |

New Features

Phase 1: Packaging, CI/CD, Tests (Commit a298d96)

  • pyproject.toml: Proper Python package with CLI entry point
  • Makefile: install, install-dev, test, lint, format, clean commands
  • .github/workflows/ci.yml: CI/CD for Python 3.9-3.12 with linting and testing
  • 40 unit tests: 53% coverage for dr_sync/ package

Phase 2: Core Infrastructure (Commit ebd5f02)

  • dr_sync/retry.py: Exponential backoff decorator for SDK operations
  • dr_sync/checkpoint.py: Resumable sync tracking (state in .dr_sync_state/)
  • dr_sync/filter.py: Include/exclude glob patterns for selective sync
  • dr_sync/registry.py: Plugin-based sync module registration with dependency resolution
  • dr_sync/cli.py: Unified CLI: dr-sync run --all, checkpoint management
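For reviewers unfamiliar with the pattern, the exponential backoff decorator listed for dr_sync/retry.py could look roughly like the sketch below. The name `retry_with_backoff`, its parameters, and the jitter are assumptions for illustration, not the PR's actual signature.

```python
import functools
import random
import time


def retry_with_backoff(max_attempts=4, base_delay=0.5, max_delay=8.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Retry the wrapped callable, doubling the delay after each failure."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                    # A little jitter spreads out retries from parallel workers.
                    sleep(delay + random.uniform(0, delay / 10))

        return wrapper

    return decorator


calls = {"n": 0}


# Hypothetical SDK call that fails twice with a transient error, then succeeds.
@retry_with_backoff(max_attempts=3, base_delay=0.01, retry_on=(ConnectionError,))
def flaky_list_jobs():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return ["job-a", "job-b"]


print(flaky_list_jobs())
```

Restricting `retry_on` to transient error types keeps permanent failures (for example, a permission error) from being retried pointlessly.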

Phase 3: AWS-Focused Sync Modules (Commit ebd5f02)

  • sync_jobs.py: Jobs/Workflows with checkpointing and filtering
  • sync_cluster_policies.py: Cluster policy definitions (parallel sync)
  • sync_instance_pools.py: Instance pools with AWS instance type remapping
  • sync_instance_profiles.py: AWS IAM profile registration
  • sync_secret_scopes.py: Secret scope metadata and ACLs (NOT values by design)
  • sync_notebooks.py: Notebooks with folder structure preservation

Phase 4: Documentation (Commit 034fb13)

  • Updated README.md with CLI usage, installation, and all sync modules
  • New docs/aws_usage_guide.md for AWS customers (IAM, S3 CRR, Secrets Manager, CloudWatch, VPC)

Unified CLI

# Run all sync in dependency order
dr-sync run --all

# Dry-run to preview
dr-sync run --all --dry-run

# Selective filtering
dr-sync run --all --include "prod.*.*" --exclude "*.staging.*"

# Resume from checkpoint
dr-sync run --all --resume

# Checkpoint management
dr-sync checkpoint list
dr-sync checkpoint clear jobs
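Running modules "in dependency order" implies some form of topological sort over the registered sync modules. A minimal sketch follows, assuming a decorator-based registry; the `register` helper and the dependencies shown between modules are hypothetical and chosen only to illustrate the ordering.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

_REGISTRY = {}  # module name -> (sync function, names it depends on)


def register(name, depends_on=()):
    """Decorator that records a sync function and its prerequisites."""

    def decorator(func):
        _REGISTRY[name] = (func, tuple(depends_on))
        return func

    return decorator


def resolve_order():
    """Return module names so that every prerequisite comes first."""
    graph = {name: set(deps) for name, (_, deps) in _REGISTRY.items()}
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


def run_all():
    """Call every registered sync function in dependency order."""
    return [_REGISTRY[name][0]() for name in resolve_order()]


# Hypothetical dependency edges, for illustration only.
@register("jobs", depends_on=["cluster_policies", "instance_pools"])
def sync_jobs():
    return "jobs"


@register("instance_pools", depends_on=["instance_profiles"])
def sync_instance_pools():
    return "instance_pools"


@register("cluster_policies")
def sync_cluster_policies():
    return "cluster_policies"


@register("instance_profiles")
def sync_instance_profiles():
    return "instance_profiles"


print(resolve_order())
```

Registration order does not matter here: `jobs` is registered first but always runs last because everything else is a direct or indirect prerequisite.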

Backward Compatibility

100% preserved - All existing workflows continue to work:

# Traditional workflow still works
# Edit common.py, then:
python sync_tables.py

# New env var workflow
export DR_SYNC_SOURCE_HOST=...
export DR_SYNC_CATALOGS_TO_COPY=cat1,cat2
python sync_tables.py --dry-run

Installation

pip install databricks-dr-sync

This installs the dr-sync CLI command.

Test Coverage

  • 40 unit tests passing
  • 53% code coverage for dr_sync/ package
  • All linting passing (black, ruff, mypy)
  • CI/CD via GitHub Actions on Python 3.9-3.12

Next Steps for Maintainer

  1. Review the documentation in README.md and docs/aws_usage_guide.md
  2. Test in a dev environment with sample workspaces
  3. Run tests: pytest
  4. Check CI/CD: https://github.com/gregwood-db/databricks-dr-examples/actions

Related Issues

This PR addresses the issues identified in the gap analysis:

  • Warehouse resource leaks → Fixed
  • No safety nets → Dry-run, validation added
  • No observability → Structured logging added
  • No CI/CD → GitHub Actions workflow added
  • No tests → 40 unit tests added
  • AWS-specific gaps → 6 new sync modules added

vjsingh1984 and others added 11 commits February 18, 2026 22:34
- sync_views.py: Fix incorrect header (said sync_tables.py)
- sync_views.py: Fix f-string on dict key ("status" field)
- sync_views.py: Fix operator precedence in filter (missing parens)
- sync_grs_ext.py: Fix missing f-string prefix in error status
- sync_grs_ext.py: Remove unpopulated loaded_table_types column
  that caused DataFrame column length mismatch
- examples/clone_to_secondary.py: Fix catalog list (was single
  string instead of two list elements)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- sync_tables.py: Wrap both source and target warehouse usage in
  try/finally blocks to guarantee cleanup on success or failure
- sync_shared_tables.py: Add try/finally for target warehouse cleanup
- sync_grs_ext.py: Add try/finally for target warehouse cleanup
- sync_views.py: Add try/finally for target warehouse cleanup
- sync_shared_tables.py: Replace sys.exit() with raise RuntimeError()
  to avoid killing notebook kernels; remove unused sys import

Each leaked warehouse persists in the workspace (even auto-stopped)
and can be accidentally restarted, incurring cost.
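The try/finally guarantee described here is the same one a context manager such as managed_warehouse() provides. The sketch below illustrates the idea with a stand-in client; `FakeWarehouseClient` and its `create`/`delete` methods are hypothetical and are not the Databricks SDK API.

```python
from contextlib import contextmanager


class FakeWarehouseClient:
    """Stand-in for a workspace client; NOT the Databricks SDK API."""

    def __init__(self):
        self.active = set()
        self._next_id = 0

    def create(self, name):
        self._next_id += 1
        warehouse_id = f"wh-{self._next_id}"
        self.active.add(warehouse_id)
        return warehouse_id

    def delete(self, warehouse_id):
        self.active.discard(warehouse_id)


@contextmanager
def managed_warehouse(client, name):
    """Create a temporary warehouse and always delete it on exit."""
    warehouse_id = client.create(name)
    try:
        yield warehouse_id
    finally:
        # Runs on success, on exceptions, and on early returns alike.
        client.delete(warehouse_id)


client = FakeWarehouseClient()
try:
    with managed_warehouse(client, "dr-sync-temp") as warehouse_id:
        raise RuntimeError("statement failed")  # simulate a mid-sync crash
except RuntimeError:
    pass

print(client.active)  # set(): nothing leaked despite the failure
```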

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Create dr_sync/ package with reusable utilities:
- sql_utils.py: execute_statement_sync() with timeout/cancellation,
  managed_warehouse() context manager, drop_table_if_exists()
- workspace.py: create_client() factory with env var support
- csv_mapping.py: load_mapping() with validation, lookup_value()
- thread_utils.py: parallel_map() with error isolation, ProgressCounter
- exceptions.py: DRSyncError hierarchy (ConfigurationError, MappingError,
  StatementError, WarehouseError, SyncError)

Refactor sync scripts to use shared utilities:
- Replace 8 identical polling loops with execute_statement_sync()
- Replace 5 warehouse creation blocks with managed_warehouse()
- Replace 2 duplicated drop_table functions with drop_table_if_exists()
- Replace raw pd.read_csv calls with load_mapping/lookup_value
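As an illustration of what "parallel_map() with error isolation" can mean, here is a minimal sketch built on ThreadPoolExecutor. The return shape (a pair of dicts keyed by item) and the `create_schema` worker are assumptions; the PR's real helper may differ.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parallel_map(func, items, max_workers=4):
    """Apply func to each item in parallel; return (results, errors).

    Both dicts are keyed by item, so items must be hashable. A failure in
    one item is recorded and does not stop the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(func, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:  # isolate the failure to this item
                errors[item] = exc
    return results, errors


# Hypothetical worker: one schema fails, the rest still complete.
def create_schema(name):
    if name == "broken":
        raise ValueError(f"cannot create schema {name}")
    return f"created {name}"


results, errors = parallel_map(create_schema, ["sales", "broken", "hr"])
print(sorted(results), sorted(errors))
```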

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add dr_sync/config.py with DRSyncConfig dataclass supporting:
  - from_common_module(): backward-compatible import from common.py
  - from_env(): secure configuration via DR_SYNC_* environment variables
  - validate(): returns list of configuration errors
- Add .env.example template with all DR_SYNC_* variable names
- Update all 9 sync scripts to auto-detect config source:
  env vars used when DR_SYNC_SOURCE_HOST is set, else common.py

Existing users editing common.py are not disrupted.
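A reduced sketch of the env-var path is shown below. Only the three variable names that appear in this PR (DR_SYNC_SOURCE_HOST, DR_SYNC_CATALOGS_TO_COPY, DR_SYNC_DRY_RUN) are used; the field names, the accepted truthy strings, and the `use_env_config` helper are assumptions, and the real dataclass presumably carries more fields.

```python
import os
from dataclasses import dataclass, field
from typing import List, Mapping


@dataclass
class DRSyncConfig:
    source_host: str = ""
    catalogs_to_copy: List[str] = field(default_factory=list)
    dry_run: bool = False

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "DRSyncConfig":
        """Build a config from DR_SYNC_* variables (comma-separated lists)."""
        catalogs = env.get("DR_SYNC_CATALOGS_TO_COPY", "")
        return cls(
            source_host=env.get("DR_SYNC_SOURCE_HOST", ""),
            catalogs_to_copy=[c.strip() for c in catalogs.split(",") if c.strip()],
            # Assumed truthy spellings; the PR does not specify them.
            dry_run=env.get("DR_SYNC_DRY_RUN", "").lower() in ("1", "true", "yes"),
        )

    def validate(self) -> List[str]:
        """Return a list of configuration errors (empty when valid)."""
        errors = []
        if not self.source_host:
            errors.append("source_host is required")
        if not self.catalogs_to_copy:
            errors.append("catalogs_to_copy must list at least one catalog")
        return errors


def use_env_config(env: Mapping[str, str] = os.environ) -> bool:
    """Auto-detect: env vars are used when DR_SYNC_SOURCE_HOST is set."""
    return bool(env.get("DR_SYNC_SOURCE_HOST"))


example_env = {
    "DR_SYNC_SOURCE_HOST": "https://source.example.com",
    "DR_SYNC_CATALOGS_TO_COPY": "cat1, cat2",
    "DR_SYNC_DRY_RUN": "true",
}
config = DRSyncConfig.from_env(example_env)
print(config, config.validate())
```

Returning a list from `validate()` rather than raising on the first problem lets a script report every configuration error in one run.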

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add dr_sync/log.py with setup_logging() for console + optional file
  output, compatible with Databricks notebooks (stdout)
- Replace all 76 print() calls across 9 sync scripts with logger calls:
  - logger.info() for normal progress messages
  - logger.warning() for skip conditions (GCP, already exists)
  - logger.error() for failures (missing mappings, creation errors)
- Update dr_sync/sql_utils.py to use logging in managed_warehouse()
  and drop_table_if_exists()
- All log messages use %s lazy formatting instead of f-strings
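The following sketch shows one way setup_logging() could be structured, along with why %s lazy formatting matters. The logger name "dr_sync", the format string, and the parameters are assumptions.

```python
import logging
import sys


def setup_logging(level="INFO", log_file=None):
    """Configure a stdout logger, plus an optional file handler."""
    logger = logging.getLogger("dr_sync")
    logger.setLevel(level)
    # Clearing handlers keeps repeated notebook runs from duplicating output.
    logger.handlers.clear()
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    handlers = [logging.StreamHandler(sys.stdout)]
    if log_file:
        handlers.append(logging.FileHandler(log_file))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


logger = setup_logging("WARNING")

# Lazy formatting: the message is only interpolated if the record is emitted,
# so this suppressed INFO call never pays the string-formatting cost.
logger.info("Syncing catalog %s", "cat1")
logger.warning("Skipping %s: already exists", "cat1.sales")
```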

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add CSV validation functions to dr_sync/csv_mapping.py:
  - validate_catalog_mapping(): checks for duplicates
  - validate_cred_mapping(): checks cloud-specific required columns
  - validate_ext_location_mapping(): checks for empty URLs

- Add --dry-run flag to all 9 sync scripts:
  - SQL scripts: skip warehouse creation entirely, log planned SQL ops
  - SDK scripts: skip create/update API calls, log what would change
  - Supported via config.dry_run and DR_SYNC_DRY_RUN env var

- Add --log-level flag (DEBUG/INFO/WARNING/ERROR) to all scripts

- CLI args only activate in __name__ == "__main__" blocks, so
  notebook execution is not affected
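A sketch of how the two flags and the `__main__` guard might fit together is shown below. The `sync_tables` function, the injected `execute` callable, and the placeholder statement text are hypothetical; only the flag names and log levels come from the commit message.

```python
import argparse
import logging

logger = logging.getLogger("dr_sync.example")


def build_parser():
    parser = argparse.ArgumentParser(description="Example sync script")
    parser.add_argument("--dry-run", action="store_true",
                        help="log planned operations without executing them")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    return parser


def sync_tables(tables, execute, dry_run=False):
    """Run one statement per table, or only log the plan in dry-run mode."""
    planned = []
    for table in tables:
        statement = f"<sync statement for {table}>"  # placeholder, not real SQL
        if dry_run:
            logger.info("[dry-run] would execute: %s", statement)
        else:
            execute(statement)
        planned.append(statement)
    return planned


# Argument parsing lives behind the guard, so importing this module or
# running its functions from a notebook never touches sys.argv.
if __name__ == "__main__":
    args = build_parser().parse_args()
    logging.basicConfig(level=args.log_level)
    sync_tables(["cat1.sales.orders"], execute=print, dry_run=args.dry_run)
```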

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace O(n) DataFrame scans with O(1) dict lookups in
sync_creds_and_locs.py and sync_catalogs_and_schemas.py. Push Spark SQL
filters down from Python to the database engine in sync_grs_ext.py and
sync_tables.py. Parallelize schema creation across catalogs.

Run black formatter across all Python files. Fix E402 import order in
dr_sync/sql_utils.py. Add ruff.toml with Databricks notebook builtins
(spark, sql, display). Add .gitignore for Python, IDE, and env files.

Phase 1 - Foundation:
- pyproject.toml: Proper Python package with metadata, dependencies, CLI entry point
- Makefile: Developer convenience commands (install, test, lint, format, clean)
- .github/workflows/ci.yml: GitHub Actions CI/CD for Python 3.9-3.12
- tests/conftest.py: Pytest fixtures for mock WorkspaceClient and configs
- tests/test_*.py: 40 unit tests for all dr_sync modules (53% coverage)

Phase 2 - Core Infrastructure:
- dr_sync/retry.py: Exponential backoff decorator for SDK calls
- dr_sync/checkpoint.py: Resumable sync tracking with state files
- dr_sync/filter.py: Include/exclude glob pattern filtering for resources
- dr_sync/registry.py: Plugin-based sync module registration with dependency resolution
- dr_sync/cli.py: Unified CLI with run, list, and checkpoint commands
- dr_sync/__init__.py: Export all new public APIs
- .gitignore: Add .dr_sync_state/ for checkpoint files
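To make the checkpoint/resume idea concrete, here is a minimal sketch that stores completed resource IDs per module under a state directory. The `Checkpoint` class, the one-JSON-file-per-module layout, and the method names are assumptions; only the `.dr_sync_state/` directory name comes from this PR.

```python
import json
import tempfile
from pathlib import Path


class Checkpoint:
    """Track which resources a sync module has already processed."""

    def __init__(self, module, state_dir=".dr_sync_state"):
        self.path = Path(state_dir) / f"{module}.json"
        self.done = set()
        if self.path.exists():
            self.done = set(json.loads(self.path.read_text()))

    def is_done(self, resource_id):
        return resource_id in self.done

    def mark_done(self, resource_id):
        self.done.add(resource_id)
        self.path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file and rename, so a crash mid-write cannot
        # leave a truncated state file behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(sorted(self.done)))
        tmp.replace(self.path)

    def clear(self):
        self.done.clear()
        self.path.unlink(missing_ok=True)


# Simulate an interrupted run followed by a resumed one.
state_dir = tempfile.mkdtemp()
first_run = Checkpoint("jobs", state_dir)
first_run.mark_done("job-1")

resumed = Checkpoint("jobs", state_dir)
pending = [job for job in ["job-1", "job-2"] if not resumed.is_done(job)]
print(pending)  # ['job-2']
```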

All tests pass with 53% code coverage.

Phase 3 - New Sync Modules (AWS + Databricks customers):

- sync_jobs.py: Sync Databricks Jobs/Workflows definitions
  * Sync job tasks, clusters, schedules, triggers
  * Supports checkpointing, resume, and filtering
  * Handles ResourceAlreadyExists gracefully

- sync_cluster_policies.py: Sync cluster policies
  * Creates policies with same JSON in target
  * Parallel sync with ThreadPoolExecutor

- sync_instance_pools.py: Sync instance pools
  * Supports AWS instance type remapping between regions
  * Preserves min/max size, idle timeout settings

- sync_instance_profiles.py: Sync AWS instance profile registrations
  * Registers instance profiles in target workspace
  * IAM roles must exist in target AWS account
  * Skips validation to allow cross-account use

- sync_secret_scopes.py: Sync secret scope metadata and ACLs
  * Syncs scope definitions and permission grants
  * Does NOT sync secret values (by design - documented)
  * Supports Databricks and AWS Secrets Manager backends

- sync_notebooks.py: Export/import notebooks preserving folder structure
  * Supports SOURCE, JUPYTER, DBC formats
  * Creates directory structure in target
  * Skips system paths (/Workspace/Shared, /Users/)
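The instance type remapping mentioned for sync_instance_pools.py can be pictured as a lookup applied to each pool definition before it is created in the target region. In this sketch the `PoolSpec` dataclass, its field names, and the mapping entry are illustrative assumptions, not the module's actual data model or a recommended mapping.

```python
from dataclasses import dataclass, replace

# Hypothetical mapping from source-region to target-region instance types.
INSTANCE_TYPE_MAP = {"i3.xlarge": "i4i.xlarge"}


@dataclass(frozen=True)
class PoolSpec:
    name: str
    node_type_id: str
    min_idle_instances: int
    max_capacity: int
    idle_instance_autotermination_minutes: int


def remap_pool(pool, type_map):
    """Return a copy of pool with its instance type remapped if needed.

    Types without a mapping entry are kept as-is; sizing and idle-timeout
    settings are preserved unchanged.
    """
    target_type = type_map.get(pool.node_type_id, pool.node_type_id)
    return replace(pool, node_type_id=target_type)


source_pool = PoolSpec("etl-pool", "i3.xlarge", 1, 10, 30)
target_pool = remap_pool(source_pool, INSTANCE_TYPE_MAP)
print(target_pool)
```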

All modules include:
- --dry-run flag for preview mode
- --log-level flag for configurable logging
- Integration with DRSyncConfig from common.py or env vars
- Structured logging via dr_sync.log

Phase 4 - Documentation:

- README.md updates:
  * Add overview section listing all capabilities
  * Document new unified CLI with dr-sync command
  * Document checkpoint/resume feature
  * Document structured logging and env var configuration
  * Add installation instructions (pip install -e .)
  * Add development section (tests, linting, code quality tools)
  * Document all 6 new sync modules (jobs, cluster_policies, instance_pools, instance_profiles, secret_scopes, notebooks)
  * Add See Also section linking to AWS guide and implementation plan

- docs/aws_usage_guide.md:
  * IAM role setup for cross-account access
  * S3 cross-region replication (CRR) configuration
  * S3 landing zone structure and lifecycle policies
  * AWS Secrets Manager integration for secret scopes
  * EC2 instance type remapping for instance pools
  * CloudWatch monitoring and logging setup
  * VPC networking and PrivateLink configuration
  * Cost optimization tips
  * Full example DR run sequence
  * Troubleshooting common AWS issues

Development

Successfully merging this pull request may close these issues.

6 bugs, warehouse leaks, no logging/dry-run, code duplication, and performance issues
