Skip to content

Adding feature to integrate Teradata profiler with Lakebridge#2343

Open
dey-abhishek wants to merge 61 commits into
mainfrom
feature/teradata_profiler
Open

Adding feature to integrate Teradata profiler with Lakebridge#2343
dey-abhishek wants to merge 61 commits into
mainfrom
feature/teradata_profiler

Conversation

@dey-abhishek

@dey-abhishek dey-abhishek commented Mar 23, 2026

Copy link
Copy Markdown

What does this PR do?

Adds **Teradata profiler ** to Lakebridge. The profiler connects to a Teradata system, runs a set of extraction queries defined entirely in configuration, and produces a local DuckDB extract.

Teradata ships in two variantscore and pdcr — selected at extraction time via the existing --variant flag, each backed by its own pipeline configuration.

Relevant implementation details

  • Source registration (assessments/_constants.py): registers teradata in PROFILER_SOURCE_SYSTEM and its variants (core, pdcr) in SOURCE_SYSTEM_VARIANTS, mirroring Redshift.
  • Connector (connections/database_manager.py): TeradataConnector (SQLAlchemy teradatasql driver) + factory registration; supports a configurable port (default 1025). Adds the teradatasqlalchemy dependency.
  • Credential dialog (assessments/configure_assessment.py): ConfigureTeradataAssessment collects host / port / user / password / default database, with local | env secret-vault handling.
  • Extraction resources (resources/assessments/teradata/): per-variant core/pipeline_config.yml and pdcr/pipeline_config.yml plus shared SQL/DDL assets for system info, nodes info, disk/usage aggregates, storage, object types, user databases, UDFs, PDCR aggregates, and DBQL-core.
  • Variant behavior: the core variant extracts td_dbql_core_info_extract (DBQL-core); the pdcr variant extracts td_pdcr_info_agg_extract and td_pdcr_sp_exe_info_agg_extract (PDCR aggregated history). Selected via --source-tech teradata --variant core|pdcr.
  • Docs (docs/.../assessment/profiler/teradata.mdx + index): Teradata profiler guide covering prerequisites, the two variants, and how to choose between them.

This PR makes no changes to the generic CLI, installer, deployment, or dashboard code — Teradata is supported purely through the existing config-driven extractor + variant framework, mirroring Redshift/Oracle.

Caveats / things to watch out for when reviewing

  • Scope is extractor only: no Unity Catalog ingestion, no Lakeview dashboard publishing, no create-profiler-dashboard CLI command, and no SQLTextInfo masking/tagging in this PR. (The DBQL-core extract redacts SQLTextInfo to [REDACTED] at extraction time, so raw query text never leaves the source.)
  • teradatasqlalchemy is added as a runtime dependency.
  • Variant selection (core vs pdcr) is explicit via --variant; if PDCR (PDCRINFO) is unavailable, use --variant core.

Linked issues

Resolves #..

Functionality

  • added relevant user documentation
  • added new CLI command
  • modified existing command
  • added Teradata profiler extraction resources (config-driven; no framework changes)

Tests

  • manually tested (extractor CLI run end-to-end)
  • added unit tests
    • test_teradata_profiler.py — variant (core/pdcr) pipeline-config resolution + missing/invalid-config handling
    • test_teradata_variants.py — variant → pipeline-config path mapping
    • test_assessment.pyConfigureTeradataAssessment dialog + factory
    • test_database_manager.pyTeradataConnector
  • added integration tests

make test green (1276 passed, aside from an unrelated pre-existing test_cli_analyze timeout flake); make fmt / ruff / mypy / pylint 10.00/10.

@codecov

codecov Bot commented Mar 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 58.33333% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.18%. Comparing base (7ebd945) to head (b7acf3e).

Files with missing lines Patch % Lines
...ks/labs/lakebridge/connections/database_manager.py 20.00% 8 Missing ⚠️
...abs/lakebridge/assessments/configure_assessment.py 84.61% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2343      +/-   ##
==========================================
+ Coverage   69.10%   69.18%   +0.07%     
==========================================
  Files         105      105              
  Lines        9482     9503      +21     
  Branches     1050     1052       +2     
==========================================
+ Hits         6553     6575      +22     
+ Misses       2735     2731       -4     
- Partials      194      197       +3     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions

github-actions Bot commented Mar 23, 2026

Copy link
Copy Markdown

✅ 169/169 passed, 9 flaky, 2 skipped, 57m40s total

Flaky tests:

  • 🤪 test_installs_and_runs_local_bladebridge (13.081s)
  • 🤪 test_installs_and_runs_pypi_bladebridge (22.458s)
  • 🤪 test_recon_for_report_type_is_data (29.301s)
  • 🤪 test_transpiles_informatica_to_sparksql (25.017s)
  • 🤪 test_transpiles_informatica_to_sparksql_non_interactive[False] (23.672s)
  • 🤪 test_transpiles_informatica_to_sparksql_non_interactive[True] (25.114s)
  • 🤪 test_transpile_teradata_sql_non_interactive[True] (27.346s)
  • 🤪 test_transpile_teradata_sql (6.812s)
  • 🤪 test_transpile_teradata_sql_non_interactive[False] (8.038s)

Running from acceptance #4930

@dey-abhishek dey-abhishek added the feat/profiler Issues related to profilers label Mar 24, 2026
Addresses review comment: the source-agnostic Profiler class should not
contain Teradata-specific implementation, solved in a general way.

Moves the Teradata PDCR/DBQL-core step-selection logic out of Profiler into
a dedicated teradata_pipeline module, invoked through a generic
PIPELINE_CONFIGURATORS registry (mirrors the _create_connector pattern).
Profiler no longer references any specific platform; any source can register
a configurator to adjust its pipeline from credentials or probed capabilities.
Addresses review comment to improve fetch rather than add a parallel probe
API. The PDCR capability check now uses fetch with a narrow exception catch
inside teradata_pipeline, so the shared DatabaseManager keeps a single
fetch path and no swallow-everything probe method.
Addresses review comment questioning FetchResult._normalize_text. teradatasql
runs a UTF-8 session and returns CHAR/VARCHAR/CLOB as already-decoded str
(bytes only for BYTE/VARBYTE, which the profiler queries do not select), so
the silent byte/surrogate replacement was unnecessary and risked masking
genuine encoding problems. Removes the normalizer, documents the UTF-8
guarantee on TeradataConnector, and updates the test to assert text round-trips.
Addresses review comment to follow the same pattern as the other profiler
configurators: drops the redundant 'refer to the documentation' line and adds
env-mode handling so the stored value is an environment variable name (matching
Snowflake), making the documented env secret-vault mode work for the password.
Addresses review comment that profiler queries must redact query text. The
DBQL-core extract now emits '[REDACTED]' for SQLTextInfo so raw SQL never
leaves the source system.
- overview.mdx: mark Teradata as profiler-supported
- teradata.mdx: reword driver prerequisite (no driver install required),
  remove the use_pdcr override bullet and password admonition, fix the output
  path to use --output-folder and the versioned extract filename, rewrite the
  International Character Support section for the UTF-8 session, and add a SQL
  Text Redaction section.
Fixes an integration-test regression from the pipeline-configurator refactor:
_prepare_extractor_and_config read the credentials file whenever the source
required a connector, even when the caller injected an extractor (as the
Teradata integration tests do with a mock). It now returns early when an
extractor is supplied or no connector is needed, so credentials are read only
when the profiler builds the extractor itself.
@dey-abhishek dey-abhishek requested a review from m-abulazm June 12, 2026 20:29
@dey-abhishek

Copy link
Copy Markdown
Author

@m-abulazm - Can you re-review again. Addressed all the points. happy to address any lingering concerns

@dey-abhishek dey-abhishek requested a review from dgomez04 June 12, 2026 20:32

- Network access to the Teradata system.
- Python package dependency used by Lakebridge: `teradatasqlalchemy` (installed automatically with Lakebridge).
- No driver install required for the Teradata profiler — the `teradatasql` driver ships with Lakebridge.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- No driver install required for the Teradata profiler — the `teradatasql` driver ships with Lakebridge.
- No driver install required for the Teradata profiler.

Implementation details and should not be part of customer docs

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +377 to +382
# In env mode the stored value is the name of an environment variable that
# EnvGetter resolves at runtime, not the password itself, so prompt accordingly.
if secret_vault_type == "env":
password = self.prompts.question("Enter the environment variable name holding the password")
else:
password = self.prompts.password("Enter the password details")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please do not add unnecessary comments

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +83 to +92
# Only build an extractor from credentials when the caller didn't inject one (e.g. tests)
# and the source actually needs a connector. Otherwise leave the pipeline untouched so the
# credentials file is never read unnecessarily.
if extractor is not None or not self._connector_required:
return pipeline_config, extractor

extractor = self._setup_extractor(platform, cred_file_path)

# Let a source optionally adjust its pipeline at runtime (e.g. toggle steps based on
# credentials or probed capabilities). The agnostic profiler stays source-unaware.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same review regarding comments

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


class TeradataConnector(_BaseConnector):
def _connect(self) -> Engine:
# teradatasql communicates with the database over a UTF-8 session and returns CHAR/VARCHAR/CLOB

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please avoid comments

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if frame[column].dtype == "object":
frame[column] = frame[column].map(self._normalize_text)
return frame
# we have an empty result-set. Character data arrives already decoded as ``str`` from the drivers

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return self.connector.fetch(query)
except OperationalError as e:
logger.exception(f"Error connecting to the database: {e}")
# Log quietly and let the caller decide how to surface this. Some callers treat a failure

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


PipelineConfigurator = Callable[[PipelineConfig, Mapping[str, object], DatabaseManager | None], PipelineConfig]

PIPELINE_CONFIGURATORS: dict[str, PipelineConfigurator] = {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be its own PR to allow for a proper design discussion

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will emulate how Redshift variants are implemented. that way the changes are in line with existing design patterns

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Replaces the generic pipeline-configurator registry with first-class profiler
variants, mirroring the existing Redshift pattern, per review feedback that the
generic dynamic-step mechanism should be designed in its own PR.

- _constants: add TERADATA_VARIANTS ('core', 'pdcr'); register teradata_core and
  teradata_pdcr in SOURCE_SYSTEM_TO_PIPELINE_CFG; source_system_family collapses
  teradata_* to the 'teradata' family (shared connector/credentials/configurator).
- resources: split into teradata/core (DBQL-core workload) and teradata/pdcr (PDCR
  aggregates); common system/storage/object/UDF/user steps shared across both.
- profiler/database_manager: reverted to main (the source-agnostic class no longer
  contains any Teradata logic); removed the registry, teradata_pipeline, and the
  PDCR preflight probe. Kept the SQLTextInfo '[REDACTED]' redaction.
- removed the review-flagged inline comments; simplified the docs driver line and
  added a 'Choosing a Profiler Variant' section.
- gitignore: ignore local .credentials files.

Validated end-to-end against a live Teradata system: teradata_core extracts
cleanly; teradata_pdcr fails only on the PDCR steps when PDCRINFO is absent (no
silent fallback, by design). make fmt clean; make test 1302 passed.
… note

- profiler.py: drop the test-only extractor injection seam so the source-agnostic
  Profiler is byte-identical to main (matches Oracle/Snowflake/MSSQL, which have no
  profiler-execution tests). Removed the redundant mock-extractor integration tests
  that only re-exercised shared framework plumbing.
- database_manager.fetch: log the full driver error (teradatasql embeds a Go stack
  trace + the SQL) at debug and raise only the concise first line, so a failed step
  no longer dumps walls of text; full detail remains under --debug.
- docs: note that an 'Error 3802 ... Database PDCRINFO does not exist' means PDCR is
  unavailable -> re-run with --source-tech teradata_core.

make fmt clean; make test 1302 passed. Verified on a live box: teradata_pdcr now
fails with concise one-line errors when PDCRINFO is absent.
Completes the previous commit, which only captured the integration-test deletion
(its git add aborted on a stale pathspec). This commits the changes that were left
uncommitted:

- profiler.py: drop the test-only extractor injection seam so the source-agnostic
  Profiler is byte-identical to main (matches Oracle/Snowflake/MSSQL).
- database_manager.fetch: log the full driver error at debug and raise only the
  concise first line, so a failed step no longer dumps the teradatasql Go stack
  trace + SQL; full detail remains under --debug.
- test_teradata_profiler.py: drop the obsolete mock-extractor argument.
- docs: note that 'Error 3802 ... Database PDCRINFO does not exist' means PDCR is
  unavailable -> re-run with --source-tech teradata_core.
@dey-abhishek dey-abhishek requested a review from m-abulazm June 16, 2026 13:29
@dey-abhishek

Copy link
Copy Markdown
Author

@m-abulazm - Can you review the PR now

dey-abhishek and others added 5 commits June 22, 2026 22:32
…iler

# Conflicts:
#	src/databricks/labs/lakebridge/assessments/__init__.py
#	src/databricks/labs/lakebridge/assessments/_constants.py
#	tests/unit/assessment/test_assessment.py
- docs/teradata.mdx: correct source-tech/variant CLI usage (teradata + --variant)
- test_teradata_profiler: use (source_system, variant) split
- ConfigureTeradataAssessment + TeradataConnector: support custom port (default 1025)
- remove unused extract_folder from teradata pipeline configs
@dey-abhishek

dey-abhishek commented Jun 24, 2026

Copy link
Copy Markdown
Author

@m-abulazm & @dgomez04 - Can you review the PR and approve if it looks good

@dgomez04 dgomez04 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved-with-changes. Please review my comments and resolve before merging.

Core logic looks good to me.

LogTbl.SpoolUsage
FROM
DBC.DBQLogTbl AS LogTbl
JOIN

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This join is dead. Feel free to delete it entirely.

If it's being kept to filter to queries that had SQL logged, then add AND SQLTbl.SqlRowNo = 1

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQL refactored

LogTbl.AppID,
LogTbl.UserName,
LogTbl.SessionID,
-- Raw SQL text is redacted at extraction time so query text never leaves the source system.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove comment.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed comment

Comment thread pyproject.toml Outdated
"SQLAlchemy~=2.0.40",
"pygls~=2.1.1",
"duckdb>=1.2.2,<1.5.0",
"teradatasqlalchemy",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should pin this to a specific version.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added version

Comment thread pyproject.toml Outdated
cache_dir = ".venv/pytest-cache"
asyncio_mode = "auto"
asyncio_default_fixture_loop_scope="function"
markers = [

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not used and should be removed.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment thread pyproject.toml

# Regular expression matching correct class names. Overrides class-naming-style.
# If left empty, class names will be checked with the set naming style.
class-rgx = "[A-Z_][a-zA-Z0-9]+$"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we relaxing the global lint rules as part of this pull request? This should either be explained in the pull request description or moved entirely.

Applies to both class-rgx and method-rgx

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes lets not change pylint rules

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

from databricks.labs.lakebridge.assessments.profiler_config import PipelineConfig
from databricks.labs.lakebridge.assessments.profiler import Profiler

TERADATA_VARIANTS = [("teradata", "core"), ("teradata", "pdcr")]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TERADATA_VARIANTS = [("teradata", "core"), ("teradata", "pdcr")]
TERADATA_VARIANTS = SOURCE_SYSTEM_VARIANTS["teradata"]

@dey-abhishek dey-abhishek Jun 25, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

TERADATA_VARIANTS = [("teradata", "core"), ("teradata", "pdcr")]


@pytest.mark.parametrize("source_system,variant", TERADATA_VARIANTS)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@pytest.mark.parametrize("source_system,variant", TERADATA_VARIANTS)
@pytest.mark.parametrize("variant", TERADATA_VARIANTS)

@dey-abhishek dey-abhishek Jun 25, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@pytest.mark.parametrize("source_system,variant", TERADATA_VARIANTS)
def test_teradata_profile_missing_platform_config(source_system: str, variant: str) -> None:
with pytest.raises(ValueError, match=f"Cannot Proceed without a valid pipeline configuration for {source_system}"):
Profiler(source_system, variant).profile()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Profiler(source_system, variant).profile()
Profiler("teradata", variant).profile()

@dey-abhishek dey-abhishek Jun 25, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

mock_teradata_connector.assert_called_once_with(sample_config)


def test_fetch_result_to_df_preserves_international_text() -> None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this tests the functionality of pandas library more than our code and should not be part of our codebase

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done removed

Comment thread pyproject.toml

# Regular expression matching correct class names. Overrides class-naming-style.
# If left empty, class names will be checked with the set naming style.
class-rgx = "[A-Z_][a-zA-Z0-9]+$"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes lets not change pylint rules

Comment thread tests/unit/profiler_test_helpers.py Outdated
import yaml


def build_overridden_pipeline_config(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not used. please remove

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. removed



@pytest.mark.parametrize("variant", TERADATA_VARIANTS)
def test_teradata_variants_have_pipeline_config_path(variant: str) -> None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test can be generalized independent of teradata. we do not need it if we test it elsewhere

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Replaced both per-source tests with one generalized test_profiler_variants.py covering redshift + teradata

…t relaxation

- core DBQL extract: drop dead DBQLSQLTbl join and stray comment
- pin teradatasqlalchemy to ~=20.0.0.9
- remove unused 'teradata' pytest marker
- revert global class-rgx/method-rgx to main's strict values
…nt test

- test_teradata_profiler: derive variants from SOURCE_SYSTEM_VARIANTS, parametrize on variant
- remove FetchResult i18n test (exercised pandas, not our code)
- replace per-source variant tests with generalized test_profiler_variants (covers redshift + teradata)
- remove unused profiler_test_helpers
@dey-abhishek

Copy link
Copy Markdown
Author

@m-abulazm / @dgomez04 - Can you check now.

@dey-abhishek dey-abhishek requested a review from m-abulazm June 25, 2026 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feat/profiler Issues related to profilers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants