Comprehensive refactoring: packaging, tests, new sync modules, AWS support #5
Open
vjsingh1984 wants to merge 11 commits into
- sync_views.py: Fix incorrect header (said sync_tables.py)
- sync_views.py: Fix f-string on dict key ("status" field)
- sync_views.py: Fix operator precedence in filter (missing parens)
- sync_grs_ext.py: Fix missing f-string prefix in error status
- sync_grs_ext.py: Remove unpopulated loaded_table_types column
that caused DataFrame column length mismatch
- examples/clone_to_secondary.py: Fix catalog list (was single
string instead of two list elements)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
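The operator precedence fix in sync_views.py can be illustrated with a minimal sketch. The actual filter expression is not shown in this PR text, so the variable names and values below are hypothetical; the sketch only assumes the bug had the common form of mixing `==` with `&` without parentheses.

```python
# Hypothetical values; the real filter in sync_views.py is not shown here.
is_view = 1
is_managed = 0

# Without parentheses, `&` binds tighter than `==`, so this parses as the
# chained comparison: is_view == (1 & is_managed) == 0
unparenthesized = is_view == 1 & is_managed == 0

# With parentheses, each comparison is evaluated first, then combined.
parenthesized = (is_view == 1) & (is_managed == 0)

print(unparenthesized, parenthesized)
```

The same rule applies to boolean masks built from column comparisons, where each comparison needs its own parentheses before being combined with `&` or `|`.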
- sync_tables.py: Wrap both source and target warehouse usage in try/finally blocks to guarantee cleanup on success or failure
- sync_shared_tables.py: Add try/finally for target warehouse cleanup
- sync_grs_ext.py: Add try/finally for target warehouse cleanup
- sync_views.py: Add try/finally for target warehouse cleanup
- sync_shared_tables.py: Replace sys.exit() with raise RuntimeError() to avoid killing notebook kernels; remove unused sys import
Each leaked warehouse persists in the workspace (even auto-stopped) and can be accidentally restarted, incurring cost.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
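A minimal sketch of the try/finally cleanup pattern described in this commit. `FakeWarehouses` and `run_sync` are hypothetical stand-ins written for illustration; they are not the Databricks SDK or the code in this PR. The point is only that the delete call runs whether the work succeeds or raises, and that failures surface as RuntimeError rather than sys.exit().

```python
class FakeWarehouses:
    """Hypothetical stand-in for a warehouse API; it only records state."""

    def __init__(self):
        self.active = set()
        self._next_id = 0

    def create(self):
        self._next_id += 1
        warehouse_id = f"wh-{self._next_id}"
        self.active.add(warehouse_id)
        return warehouse_id

    def delete(self, warehouse_id):
        self.active.discard(warehouse_id)


def run_sync(warehouses, work):
    """Create a warehouse, run work(warehouse_id), and always delete it."""
    warehouse_id = warehouses.create()
    try:
        return work(warehouse_id)
    finally:
        # Runs on success and on failure, so no warehouse is leaked.
        warehouses.delete(warehouse_id)


def failing_work(warehouse_id):
    # Raising (instead of calling sys.exit()) lets a notebook kernel survive.
    raise RuntimeError(f"sync failed on {warehouse_id}")


warehouses = FakeWarehouses()
result = run_sync(warehouses, lambda wid: f"ok on {wid}")

try:
    run_sync(warehouses, failing_work)
except RuntimeError as exc:
    error_message = str(exc)
```

After both calls, `warehouses.active` is empty even though the second call failed.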
Create dr_sync/ package with reusable utilities:
- sql_utils.py: execute_statement_sync() with timeout/cancellation, managed_warehouse() context manager, drop_table_if_exists()
- workspace.py: create_client() factory with env var support
- csv_mapping.py: load_mapping() with validation, lookup_value()
- thread_utils.py: parallel_map() with error isolation, ProgressCounter
- exceptions.py: DRSyncError hierarchy (ConfigurationError, MappingError, StatementError, WarehouseError, SyncError)
Refactor sync scripts to use shared utilities:
- Replace 8 identical polling loops with execute_statement_sync()
- Replace 5 warehouse creation blocks with managed_warehouse()
- Replace 2 duplicated drop_table functions with drop_table_if_exists()
- Replace raw pd.read_csv calls with load_mapping/lookup_value
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
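As an illustration of what "parallel_map() with error isolation" could look like, here is a sketch under stated assumptions. The real signature and return type in thread_utils.py are not shown in this PR text, so the `(results, errors)` return shape, the `max_workers` parameter, and the `sync_table` helper are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parallel_map(func, items, max_workers=4):
    """Apply func to each item in a thread pool.

    Returns (results, errors), both keyed by item, so that one failing
    item does not abort the others. Items must be hashable.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(func, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:  # isolate the failure to this item
                errors[item] = exc
    return results, errors


def sync_table(name):
    # Hypothetical per-item work; one item fails on purpose.
    if name == "bad_table":
        raise ValueError("cannot sync bad_table")
    return f"synced {name}"


results, errors = parallel_map(sync_table, ["t1", "bad_table", "t2"])
```

Collecting errors per item, rather than letting the first exception propagate, is one way to let a long sync finish and report all failures together.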
- Add dr_sync/config.py with DRSyncConfig dataclass supporting:
  - from_common_module(): backward-compatible import from common.py
  - from_env(): secure configuration via DR_SYNC_* environment variables
  - validate(): returns list of configuration errors
- Add .env.example template with all DR_SYNC_* variable names
- Update all 9 sync scripts to auto-detect config source: env vars used when DR_SYNC_SOURCE_HOST is set, else common.py
Existing users editing common.py are not disrupted.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
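The config auto-detection described above might be sketched as follows. Only DR_SYNC_SOURCE_HOST and DR_SYNC_DRY_RUN are named in this PR; the field names, DR_SYNC_TARGET_HOST, the accepted truthy strings, and the `use_env_config` helper are assumptions for illustration and may differ from the real DRSyncConfig.

```python
import os
from dataclasses import dataclass


@dataclass
class DRSyncConfig:
    # Field names are illustrative; the real dataclass may define others.
    source_host: str = ""
    target_host: str = ""
    dry_run: bool = False

    @classmethod
    def from_env(cls, env=None):
        """Build a config from DR_SYNC_* variables (env defaults to os.environ)."""
        env = os.environ if env is None else env
        return cls(
            source_host=env.get("DR_SYNC_SOURCE_HOST", ""),
            target_host=env.get("DR_SYNC_TARGET_HOST", ""),  # hypothetical name
            dry_run=env.get("DR_SYNC_DRY_RUN", "").lower() in ("1", "true", "yes"),
        )

    def validate(self):
        """Return a list of configuration errors (empty when valid)."""
        errors = []
        if not self.source_host:
            errors.append("source_host is required")
        if not self.target_host:
            errors.append("target_host is required")
        return errors


def use_env_config(env=None):
    """Env vars win when DR_SYNC_SOURCE_HOST is set; otherwise use common.py."""
    env = os.environ if env is None else env
    return bool(env.get("DR_SYNC_SOURCE_HOST"))


example_env = {
    "DR_SYNC_SOURCE_HOST": "https://source.example.com",
    "DR_SYNC_DRY_RUN": "true",
}
config = DRSyncConfig.from_env(example_env)
problems = config.validate()
```

Returning a list from `validate()` rather than raising on the first problem lets a script report every missing setting in one pass.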
- Add dr_sync/log.py with setup_logging() for console + optional file output, compatible with Databricks notebooks (stdout)
- Replace all 76 print() calls across 9 sync scripts with logger calls:
  - logger.info() for normal progress messages
  - logger.warning() for skip conditions (GCP, already exists)
  - logger.error() for failures (missing mappings, creation errors)
- Update dr_sync/sql_utils.py to use logging in managed_warehouse() and drop_table_if_exists()
- All log messages use %s lazy formatting instead of f-strings
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
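A short sketch of the %s lazy-formatting style mentioned above, using only the standard logging module. This `setup_logging` is a simplified, hypothetical version (the real one in dr_sync/log.py also supports file output and may take different parameters); the logger name and table name are made up.

```python
import io
import logging
import sys


def setup_logging(level="INFO", stream=None):
    """Simplified, hypothetical setup: one stream handler, stdout by default."""
    logger = logging.getLogger("dr_sync_example")
    logger.setLevel(level)
    logger.handlers.clear()
    handler = logging.StreamHandler(sys.stdout if stream is None else stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


buffer = io.StringIO()
logger = setup_logging("WARNING", stream=buffer)

# Arguments are passed separately; the message is only formatted if the
# record is actually emitted, so this INFO call does no string work.
logger.info("Syncing table %s", "main.sales.orders")
logger.warning("Skipping %s: already exists", "main.sales.orders")

output = buffer.getvalue()
```

With an f-string the message would be built before the logger checks the level; passing arguments defers that work.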
- Add CSV validation functions to dr_sync/csv_mapping.py:
  - validate_catalog_mapping(): checks for duplicates
  - validate_cred_mapping(): checks cloud-specific required columns
  - validate_ext_location_mapping(): checks for empty URLs
- Add --dry-run flag to all 9 sync scripts:
  - SQL scripts: skip warehouse creation entirely, log planned SQL ops
  - SDK scripts: skip create/update API calls, log what would change
  - Supported via config.dry_run and DR_SYNC_DRY_RUN env var
- Add --log-level flag (DEBUG/INFO/WARNING/ERROR) to all scripts
- CLI args only activate in __name__ == "__main__" blocks, so notebook execution is not affected
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
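The two flags could be wired up roughly as below. This is a sketch, not the scripts' actual argument parser: `parse_args`, `create_schema`, and the schema name are hypothetical, and an explicit argument list is passed here so the example does not read the real command line. In a script, the call would sit inside the `__name__ == "__main__"` block as described above.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags; argv=None would fall back to sys.argv."""
    parser = argparse.ArgumentParser(description="Example sync script")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="log planned operations without executing them",
    )
    parser.add_argument(
        "--log-level",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
    )
    return parser.parse_args(argv)


def create_schema(name, dry_run, executed):
    """Hypothetical operation: record the call unless dry_run is set."""
    if dry_run:
        return f"[dry-run] would create schema {name}"
    executed.append(name)  # stands in for the real create call
    return f"created schema {name}"


executed = []
args = parse_args(["--dry-run", "--log-level", "DEBUG"])
message = create_schema("main.sales", args.dry_run, executed)
```

In dry-run mode `executed` stays empty, which mirrors the "skip create/update API calls, log what would change" behavior.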
Replace O(n) DataFrame scans with O(1) dict lookups in sync_creds_and_locs.py and sync_catalogs_and_schemas.py. Push Spark SQL filters down from Python to the database engine in sync_grs_ext.py and sync_tables.py. Parallelize schema creation across catalogs.
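The scan-versus-lookup change can be sketched with plain Python rows standing in for DataFrame rows. The column names `source_catalog` and `target_catalog` and the sample data are assumptions; the real mapping CSVs may use different headers.

```python
import csv
import io

# Hypothetical mapping file contents.
MAPPING_CSV = """source_catalog,target_catalog
sales,sales_dr
finance,finance_dr
"""

rows = list(csv.DictReader(io.StringIO(MAPPING_CSV)))


def lookup_by_scan(rows, source):
    """O(n): walk every row for each lookup."""
    for row in rows:
        if row["source_catalog"] == source:
            return row["target_catalog"]
    return None


# Build the index once; each later lookup is an average O(1) dict access.
catalog_map = {row["source_catalog"]: row["target_catalog"] for row in rows}

scanned = lookup_by_scan(rows, "finance")
indexed = catalog_map.get("finance")
missing = catalog_map.get("hr")
```

The gain matters when the lookup runs inside a loop over many resources, since the scan cost is paid once instead of per item.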
Run black formatter across all Python files. Fix E402 import order in dr_sync/sql_utils.py. Add ruff.toml with Databricks notebook builtins (spark, sql, display). Add .gitignore for Python, IDE, and env files.
Phase 1 - Foundation:
- pyproject.toml: Proper Python package with metadata, dependencies, CLI entry point
- Makefile: Developer convenience commands (install, test, lint, format, clean)
- .github/workflows/ci.yml: GitHub Actions CI/CD for Python 3.9-3.12
- tests/conftest.py: Pytest fixtures for mock WorkspaceClient and configs
- tests/test_*.py: 40 unit tests for all dr_sync modules (53% coverage)
Phase 2 - Core Infrastructure:
- dr_sync/retry.py: Exponential backoff decorator for SDK calls
- dr_sync/checkpoint.py: Resumable sync tracking with state files
- dr_sync/filter.py: Include/exclude glob pattern filtering for resources
- dr_sync/registry.py: Plugin-based sync module registration with dependency resolution
- dr_sync/cli.py: Unified CLI with run, list, and checkpoint commands
- dr_sync/__init__.py: Export all new public APIs
- .gitignore: Add .dr_sync_state/ for checkpoint files
All tests pass with 53% code coverage.
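For readers unfamiliar with the pattern, an exponential backoff decorator like the one dr_sync/retry.py is described as providing might look like this. The parameter names, defaults, and the injectable `sleep` argument are assumptions made for this sketch, not the module's actual interface.

```python
import functools
import time


def retry(max_attempts=3, base_delay=0.5, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped call, doubling the delay after each failure."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator


# Record delays instead of sleeping so the example runs instantly.
delays = []
calls = {"count": 0}


@retry(max_attempts=4, base_delay=0.5, exceptions=(ConnectionError,), sleep=delays.append)
def flaky_call():
    # Hypothetical SDK call that fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


outcome = flaky_call()
```

Here the call succeeds on the third attempt after waiting 0.5 and then 1.0 seconds. A production version would typically add jitter and restrict `exceptions` to errors known to be transient.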
Phase 3 - New Sync Modules (AWS + Databricks customers):
- sync_jobs.py: Sync Databricks Jobs/Workflows definitions
  * Sync job tasks, clusters, schedules, triggers
  * Supports checkpointing, resume, and filtering
  * Handles ResourceAlreadyExists gracefully
- sync_cluster_policies.py: Sync cluster policies
  * Creates policies with same JSON in target
  * Parallel sync with ThreadPoolExecutor
- sync_instance_pools.py: Sync instance pools
  * Supports AWS instance type remapping between regions
  * Preserves min/max size, idle timeout settings
- sync_instance_profiles.py: Sync AWS instance profile registrations
  * Registers instance profiles in target workspace
  * IAM roles must exist in target AWS account
  * Skips validation to allow cross-account use
- sync_secret_scopes.py: Sync secret scope metadata and ACLs
  * Syncs scope definitions and permission grants
  * Does NOT sync secret values (by design - documented)
  * Supports Databricks and AWS Secrets Manager backends
- sync_notebooks.py: Export/import notebooks preserving folder structure
  * Supports SOURCE, JUPYTER, DBC formats
  * Creates directory structure in target
  * Skips system paths (/Workspace/Shared, /Users/)
All modules include:
- --dry-run flag for preview mode
- --log-level flag for configurable logging
- Integration with DRSyncConfig from common.py or env vars
- Structured logging via dr_sync.log
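The filtering support mentioned for sync_jobs.py relies on include/exclude glob patterns. A minimal sketch of that idea with the standard library follows; `matches_filter`, its defaults, and the job names are hypothetical, and the real dr_sync/filter.py may resolve include/exclude conflicts differently.

```python
from fnmatch import fnmatchcase


def matches_filter(name, include=("*",), exclude=()):
    """Keep a name if it matches any include pattern and no exclude pattern."""
    if not any(fnmatchcase(name, pattern) for pattern in include):
        return False
    return not any(fnmatchcase(name, pattern) for pattern in exclude)


# Hypothetical job names.
job_names = ["etl_daily", "etl_hourly", "ml_training", "etl_daily_test"]

selected = [
    name
    for name in job_names
    if matches_filter(name, include=["etl_*"], exclude=["*_test"])
]
```

In this sketch exclude patterns take priority over include patterns, which is a design assumption rather than documented behavior.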
Phase 4 - Documentation:
- README.md updates:
  * Add overview section listing all capabilities
  * Document new unified CLI with dr-sync command
  * Document checkpoint/resume feature
  * Document structured logging and env var configuration
  * Add installation instructions (pip install -e .)
  * Add development section (tests, linting, code quality tools)
  * Document all 6 new sync modules (jobs, cluster_policies, instance_pools, instance_profiles, secret_scopes, notebooks)
  * Add See Also section linking to AWS guide and implementation plan
- docs/aws_usage_guide.md:
  * IAM role setup for cross-account access
  * S3 cross-region replication (CRR) configuration
  * S3 landing zone structure and lifecycle policies
  * AWS Secrets Manager integration for secret scopes
  * EC2 instance type remapping for instance pools
  * CloudWatch monitoring and logging setup
  * VPC networking and PrivateLink configuration
  * Cost optimization tips
  * Full example DR run sequence
  * Troubleshooting common AWS issues
Fixes #3
Summary
This is a comprehensive refactoring of the DR sync tools with 10 commits organized across 4 major phases, building on the existing stacked PR approach. All changes maintain backward compatibility while adding significant production-ready features for AWS + Databricks customers.
What Changed
Core Improvements (Commits from base)
- dr_sync/ package
- print() calls, no structure
- common.py hardcoded
- DRSyncConfig with env var support
- --dry-run flag + CSV validation
New Features
Phase 1: Packaging, CI/CD, Tests (Commit a298d96)
- pyproject.toml: Proper Python package with CLI entry point
- Makefile: install, install-dev, test, lint, format, clean commands
- .github/workflows/ci.yml: CI/CD for Python 3.9-3.12 with linting and testing
- dr_sync/ package
Phase 2: Core Infrastructure (Commit ebd5f02)
- dr_sync/retry.py: Exponential backoff decorator for SDK operations
- dr_sync/checkpoint.py: Resumable sync tracking (state in .dr_sync_state/)
- dr_sync/filter.py: Include/exclude glob patterns for selective sync
- dr_sync/registry.py: Plugin-based sync module registration with dependency resolution
- dr_sync/cli.py: Unified CLI: dr-sync run --all, checkpoint management
Phase 3: AWS-Focused Sync Modules (Commit ebd5f02)
- sync_jobs.py: Jobs/Workflows with checkpointing and filtering
- sync_cluster_policies.py: Cluster policy definitions (parallel sync)
- sync_instance_pools.py: Instance pools with AWS instance type remapping
- sync_instance_profiles.py: AWS IAM profile registration
- sync_secret_scopes.py: Secret scope metadata and ACLs (NOT values by design)
- sync_notebooks.py: Notebooks with folder structure preservation
Phase 4: Documentation (Commit 034fb13)
- README.md with CLI usage, installation, and all sync modules
- docs/aws_usage_guide.md for AWS customers (IAM, S3 CRR, Secrets Manager, CloudWatch, VPC)
Unified CLI
Backward Compatibility
100% preserved - All existing workflows continue to work:
Installation
This installs the dr-sync CLI command.
Test Coverage
- dr_sync/ package
Next Steps for Maintainer
- README.md and docs/aws_usage_guide.md
- pytest
Related Issues
This PR addresses the issues identified in the gap analysis: