Adding feature to integrate Teradata profiler with Lakebridge#2343
Adding feature to integrate Teradata profiler with Lakebridge#2343dey-abhishek wants to merge 61 commits into
Conversation
…ction/dataframe normalization
…d when use_pdcr=False
…ion testing for TD
…r_replace_dashboard, causing a cryptic AssertionError on dashboard.dashboard_id assertion instead of the real SDK error
…reased the SQL preview for top queries in dashboard
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2343 +/- ##
==========================================
+ Coverage 69.10% 69.18% +0.07%
==========================================
Files 105 105
Lines 9482 9503 +21
Branches 1050 1052 +2
==========================================
+ Hits 6553 6575 +22
+ Misses 2735 2731 -4
- Partials 194 197 +3 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
✅ 169/169 passed, 9 flaky, 2 skipped, 57m40s total Flaky tests:
Running from acceptance #4930 |
Addresses review comment: the source-agnostic Profiler class should not contain Teradata-specific implementation, solved in a general way. Moves the Teradata PDCR/DBQL-core step-selection logic out of Profiler into a dedicated teradata_pipeline module, invoked through a generic PIPELINE_CONFIGURATORS registry (mirrors the _create_connector pattern). Profiler no longer references any specific platform; any source can register a configurator to adjust its pipeline from credentials or probed capabilities.
Addresses review comment to improve fetch rather than add a parallel probe API. The PDCR capability check now uses fetch with a narrow exception catch inside teradata_pipeline, so the shared DatabaseManager keeps a single fetch path and no swallow-everything probe method.
Addresses review comment questioning FetchResult._normalize_text. teradatasql runs a UTF-8 session and returns CHAR/VARCHAR/CLOB as already-decoded str (bytes only for BYTE/VARBYTE, which the profiler queries do not select), so the silent byte/surrogate replacement was unnecessary and risked masking genuine encoding problems. Removes the normalizer, documents the UTF-8 guarantee on TeradataConnector, and updates the test to assert text round-trips.
Addresses review comment to follow the same pattern as the other profiler configurators: drops the redundant 'refer to the documentation' line and adds env-mode handling so the stored value is an environment variable name (matching Snowflake), making the documented env secret-vault mode work for the password.
Addresses review comment that profiler queries must redact query text. The DBQL-core extract now emits '[REDACTED]' for SQLTextInfo so raw SQL never leaves the source system.
- overview.mdx: mark Teradata as profiler-supported - teradata.mdx: reword driver prerequisite (no driver install required), remove the use_pdcr override bullet and password admonition, fix the output path to use --output-folder and the versioned extract filename, rewrite the International Character Support section for the UTF-8 session, and add a SQL Text Redaction section.
Fixes an integration-test regression from the pipeline-configurator refactor: _prepare_extractor_and_config read the credentials file whenever the source required a connector, even when the caller injected an extractor (as the Teradata integration tests do with a mock). It now returns early when an extractor is supplied or no connector is needed, so credentials are read only when the profiler builds the extractor itself.
|
@m-abulazm - Can you re-review again. Addressed all the points. happy to address any lingering concerns |
|
|
||
| - Network access to the Teradata system. | ||
| - Python package dependency used by Lakebridge: `teradatasqlalchemy` (installed automatically with Lakebridge). | ||
| - No driver install required for the Teradata profiler — the `teradatasql` driver ships with Lakebridge. |
There was a problem hiding this comment.
| - No driver install required for the Teradata profiler — the `teradatasql` driver ships with Lakebridge. | |
| - No driver install required for the Teradata profiler. |
Implementation details and should not be part of customer docs
| # In env mode the stored value is the name of an environment variable that | ||
| # EnvGetter resolves at runtime, not the password itself, so prompt accordingly. | ||
| if secret_vault_type == "env": | ||
| password = self.prompts.question("Enter the environment variable name holding the password") | ||
| else: | ||
| password = self.prompts.password("Enter the password details") |
There was a problem hiding this comment.
please do not add unnecessary comments
| # Only build an extractor from credentials when the caller didn't inject one (e.g. tests) | ||
| # and the source actually needs a connector. Otherwise leave the pipeline untouched so the | ||
| # credentials file is never read unnecessarily. | ||
| if extractor is not None or not self._connector_required: | ||
| return pipeline_config, extractor | ||
|
|
||
| extractor = self._setup_extractor(platform, cred_file_path) | ||
|
|
||
| # Let a source optionally adjust its pipeline at runtime (e.g. toggle steps based on | ||
| # credentials or probed capabilities). The agnostic profiler stays source-unaware. |
There was a problem hiding this comment.
same review regarding comments
|
|
||
| class TeradataConnector(_BaseConnector): | ||
| def _connect(self) -> Engine: | ||
| # teradatasql communicates with the database over a UTF-8 session and returns CHAR/VARCHAR/CLOB |
| if frame[column].dtype == "object": | ||
| frame[column] = frame[column].map(self._normalize_text) | ||
| return frame | ||
| # we have an empty result-set. Character data arrives already decoded as ``str`` from the drivers |
| return self.connector.fetch(query) | ||
| except OperationalError as e: | ||
| logger.exception(f"Error connecting to the database: {e}") | ||
| # Log quietly and let the caller decide how to surface this. Some callers treat a failure |
|
|
||
| PipelineConfigurator = Callable[[PipelineConfig, Mapping[str, object], DatabaseManager | None], PipelineConfig] | ||
|
|
||
| PIPELINE_CONFIGURATORS: dict[str, PipelineConfigurator] = { |
There was a problem hiding this comment.
this needs to be its own PR to allow for a proper design discussion
There was a problem hiding this comment.
Will emulate how Redshift variants are implemented. that way the changes are in line with existing design patterns
Replaces the generic pipeline-configurator registry with first-class profiler
variants, mirroring the existing Redshift pattern, per review feedback that the
generic dynamic-step mechanism should be designed in its own PR.
- _constants: add TERADATA_VARIANTS ('core', 'pdcr'); register teradata_core and
teradata_pdcr in SOURCE_SYSTEM_TO_PIPELINE_CFG; source_system_family collapses
teradata_* to the 'teradata' family (shared connector/credentials/configurator).
- resources: split into teradata/core (DBQL-core workload) and teradata/pdcr (PDCR
aggregates); common system/storage/object/UDF/user steps shared across both.
- profiler/database_manager: reverted to main (the source-agnostic class no longer
contains any Teradata logic); removed the registry, teradata_pipeline, and the
PDCR preflight probe. Kept the SQLTextInfo '[REDACTED]' redaction.
- removed the review-flagged inline comments; simplified the docs driver line and
added a 'Choosing a Profiler Variant' section.
- gitignore: ignore local .credentials files.
Validated end-to-end against a live Teradata system: teradata_core extracts
cleanly; teradata_pdcr fails only on the PDCR steps when PDCRINFO is absent (no
silent fallback, by design). make fmt clean; make test 1302 passed.
… note - profiler.py: drop the test-only extractor injection seam so the source-agnostic Profiler is byte-identical to main (matches Oracle/Snowflake/MSSQL, which have no profiler-execution tests). Removed the redundant mock-extractor integration tests that only re-exercised shared framework plumbing. - database_manager.fetch: log the full driver error (teradatasql embeds a Go stack trace + the SQL) at debug and raise only the concise first line, so a failed step no longer dumps walls of text; full detail remains under --debug. - docs: note that an 'Error 3802 ... Database PDCRINFO does not exist' means PDCR is unavailable -> re-run with --source-tech teradata_core. make fmt clean; make test 1302 passed. Verified on a live box: teradata_pdcr now fails with concise one-line errors when PDCRINFO is absent.
Completes the previous commit, which only captured the integration-test deletion (its git add aborted on a stale pathspec). This commits the changes that were left uncommitted: - profiler.py: drop the test-only extractor injection seam so the source-agnostic Profiler is byte-identical to main (matches Oracle/Snowflake/MSSQL). - database_manager.fetch: log the full driver error at debug and raise only the concise first line, so a failed step no longer dumps the teradatasql Go stack trace + SQL; full detail remains under --debug. - test_teradata_profiler.py: drop the obsolete mock-extractor argument. - docs: note that 'Error 3802 ... Database PDCRINFO does not exist' means PDCR is unavailable -> re-run with --source-tech teradata_core.
|
@m-abulazm - Can you review the PR now |
…iler # Conflicts: # src/databricks/labs/lakebridge/assessments/__init__.py # src/databricks/labs/lakebridge/assessments/_constants.py # tests/unit/assessment/test_assessment.py
- docs/teradata.mdx: correct source-tech/variant CLI usage (teradata + --variant) - test_teradata_profiler: use (source_system, variant) split - ConfigureTeradataAssessment + TeradataConnector: support custom port (default 1025) - remove unused extract_folder from teradata pipeline configs
|
@m-abulazm & @dgomez04 - Can you review the PR and approve if it looks good |
dgomez04
left a comment
There was a problem hiding this comment.
Approved-with-changes. Please review my comments and resolve before merging.
Core logic looks good to me.
| LogTbl.SpoolUsage | ||
| FROM | ||
| DBC.DBQLogTbl AS LogTbl | ||
| JOIN |
There was a problem hiding this comment.
This join is dead. Feel free to delete it entirely.
If it's being kept to filter to queries that had SQL logged, then add AND SQLTbl.SqlRowNo = 1
| LogTbl.AppID, | ||
| LogTbl.UserName, | ||
| LogTbl.SessionID, | ||
| -- Raw SQL text is redacted at extraction time so query text never leaves the source system. |
| "SQLAlchemy~=2.0.40", | ||
| "pygls~=2.1.1", | ||
| "duckdb>=1.2.2,<1.5.0", | ||
| "teradatasqlalchemy", |
There was a problem hiding this comment.
We should pin this to a specific version.
| cache_dir = ".venv/pytest-cache" | ||
| asyncio_mode = "auto" | ||
| asyncio_default_fixture_loop_scope="function" | ||
| markers = [ |
There was a problem hiding this comment.
This is not used and should be removed.
|
|
||
| # Regular expression matching correct class names. Overrides class-naming-style. | ||
| # If left empty, class names will be checked with the set naming style. | ||
| class-rgx = "[A-Z_][a-zA-Z0-9]+$" |
There was a problem hiding this comment.
Why are we relaxing the global lint rules as part of this pull request? This should either be explained in the pull request description or moved entirely.
Applies to both class-rgx and method-rgx
There was a problem hiding this comment.
yes lets not change pylint rules
| from databricks.labs.lakebridge.assessments.profiler_config import PipelineConfig | ||
| from databricks.labs.lakebridge.assessments.profiler import Profiler | ||
|
|
||
| TERADATA_VARIANTS = [("teradata", "core"), ("teradata", "pdcr")] |
There was a problem hiding this comment.
| TERADATA_VARIANTS = [("teradata", "core"), ("teradata", "pdcr")] | |
| TERADATA_VARIANTS = SOURCE_SYSTEM_VARIANTS["teradata"] |
| TERADATA_VARIANTS = [("teradata", "core"), ("teradata", "pdcr")] | ||
|
|
||
|
|
||
| @pytest.mark.parametrize("source_system,variant", TERADATA_VARIANTS) |
There was a problem hiding this comment.
| @pytest.mark.parametrize("source_system,variant", TERADATA_VARIANTS) | |
| @pytest.mark.parametrize("variant", TERADATA_VARIANTS) |
| @pytest.mark.parametrize("source_system,variant", TERADATA_VARIANTS) | ||
| def test_teradata_profile_missing_platform_config(source_system: str, variant: str) -> None: | ||
| with pytest.raises(ValueError, match=f"Cannot Proceed without a valid pipeline configuration for {source_system}"): | ||
| Profiler(source_system, variant).profile() |
There was a problem hiding this comment.
| Profiler(source_system, variant).profile() | |
| Profiler("teradata", variant).profile() |
| mock_teradata_connector.assert_called_once_with(sample_config) | ||
|
|
||
|
|
||
| def test_fetch_result_to_df_preserves_international_text() -> None: |
There was a problem hiding this comment.
this tests the functionality of pandas library more than our code and should not be part of our codebase
|
|
||
| # Regular expression matching correct class names. Overrides class-naming-style. | ||
| # If left empty, class names will be checked with the set naming style. | ||
| class-rgx = "[A-Z_][a-zA-Z0-9]+$" |
There was a problem hiding this comment.
yes lets not change pylint rules
| import yaml | ||
|
|
||
|
|
||
| def build_overridden_pipeline_config( |
|
|
||
|
|
||
| @pytest.mark.parametrize("variant", TERADATA_VARIANTS) | ||
| def test_teradata_variants_have_pipeline_config_path(variant: str) -> None: |
There was a problem hiding this comment.
this test can be generalized independent of teradata. we do not need it if we test it elsewhere
There was a problem hiding this comment.
Done. Replaced both per-source tests with one generalized test_profiler_variants.py covering redshift + teradata
…t relaxation - core DBQL extract: drop dead DBQLSQLTbl join and stray comment - pin teradatasqlalchemy to ~=20.0.0.9 - remove unused 'teradata' pytest marker - revert global class-rgx/method-rgx to main's strict values
…nt test - test_teradata_profiler: derive variants from SOURCE_SYSTEM_VARIANTS, parametrize on variant - remove FetchResult i18n test (exercised pandas, not our code) - replace per-source variant tests with generalized test_profiler_variants (covers redshift + teradata) - remove unused profiler_test_helpers
|
@m-abulazm / @dgomez04 - Can you check now. |
What does this PR do?
Adds **Teradata profiler ** to Lakebridge. The profiler connects to a Teradata system, runs a set of extraction queries defined entirely in configuration, and produces a local DuckDB extract.
Teradata ships in two variants —
coreandpdcr— selected at extraction time via the existing--variantflag, each backed by its own pipeline configuration.Relevant implementation details
assessments/_constants.py): registersteradatainPROFILER_SOURCE_SYSTEMand its variants (core,pdcr) inSOURCE_SYSTEM_VARIANTS, mirroring Redshift.connections/database_manager.py):TeradataConnector(SQLAlchemyteradatasqldriver) + factory registration; supports a configurableport(default1025). Adds theteradatasqlalchemydependency.assessments/configure_assessment.py):ConfigureTeradataAssessmentcollects host / port / user / password / default database, withlocal | envsecret-vault handling.resources/assessments/teradata/): per-variantcore/pipeline_config.ymlandpdcr/pipeline_config.ymlplus shared SQL/DDL assets for system info, nodes info, disk/usage aggregates, storage, object types, user databases, UDFs, PDCR aggregates, and DBQL-core.corevariant extractstd_dbql_core_info_extract(DBQL-core); thepdcrvariant extractstd_pdcr_info_agg_extractandtd_pdcr_sp_exe_info_agg_extract(PDCR aggregated history). Selected via--source-tech teradata --variant core|pdcr.docs/.../assessment/profiler/teradata.mdx+ index): Teradata profiler guide covering prerequisites, the two variants, and how to choose between them.This PR makes no changes to the generic CLI, installer, deployment, or dashboard code — Teradata is supported purely through the existing config-driven extractor + variant framework, mirroring Redshift/Oracle.
Caveats / things to watch out for when reviewing
create-profiler-dashboardCLI command, and no SQLTextInfo masking/tagging in this PR. (The DBQL-core extract redactsSQLTextInfoto[REDACTED]at extraction time, so raw query text never leaves the source.)teradatasqlalchemyis added as a runtime dependency.corevspdcr) is explicit via--variant; if PDCR (PDCRINFO) is unavailable, use--variant core.Linked issues
Resolves #..
Functionality
Tests
test_teradata_profiler.py— variant (core/pdcr) pipeline-config resolution + missing/invalid-config handlingtest_teradata_variants.py— variant → pipeline-config path mappingtest_assessment.py—ConfigureTeradataAssessmentdialog + factorytest_database_manager.py—TeradataConnectormake testgreen (1276 passed, aside from an unrelated pre-existingtest_cli_analyzetimeout flake);make fmt/ ruff / mypy / pylint 10.00/10.