[DoNotMerge] feat(ingestion): make PySpark optional for s3/abs profiling#386
Draft
kyungsoo-datahub wants to merge 64 commits into
…ConfigurationError
… into profiling.common
…ub_ge_profiler helper
…non-orderable types
… Trino JSON columns
…f missing SparkProfiler
…or pyspark-optional
…mediation CVEs: CVE-2025-66418, CVE-2025-66471, CVE-2026-21441, CVE-2026-44431, CVE-2026-44432 (urllib3), CVE-2022-40897, CVE-2024-6345, CVE-2025-47273 (setuptools). elasticsearch and profiling-ge pin conflicting urllib3 ranges; resolved via uv override-dependencies so the dev lock always uses the patched version. Introduces aws-common as an explicit extra with self-references from all aws-using extras; updates verify_pyproject_equivalence.py accordingly.
…id airflow conflict
setuptools CVEs are build-time vulnerabilities. Adding >=78.1.1 to base_requirements breaks airflow 2.7.3 integration (gcloud-aio-auth pins setuptools<67). CVE protection is enforced via constraints.txt (==81.0.0) for Marsh installs and uv.lock for dev.
pytest.importorskip at module level converts the collection error into a clean skip when the elasticsearch extra is absent from the dev+integration-tests venv.
…ct#17453) Co-authored-by: Cursor <cursoragent@cursor.com>
PR Title Check Failed
Your PR title must follow the required format. See the Contributing Guide for allowed types and format details.
Unity Catalog's discriminated-union profiling default still pointed at UnityCatalogGEProfilerConfig after PR datahub-project#17465 flipped the global GEProfilingBaseConfig default to sqlalchemy. Aligns Unity with the rest of the sources: GE is now opt-in across the board. Users who relied on Unity Catalog GE profiling must now set profiling.method: ge explicitly and install acryl-datahub[profiling-ge]. (cherry picked from commit a00dfb3)
Pydantic's discriminated union fails with union_tag_not_found when a recipe supplies profiling as a dict without a method key. The Field default only applies when the field is entirely absent. Add a before-validator that injects method: sqlalchemy when the key is missing.
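The normalization step described in this commit can be sketched as a plain function, independent of Pydantic. This is a minimal illustration, not the PR's code: `inject_default_method` and `DEFAULT_PROFILING_METHOD` are hypothetical names, and only the `method` key and the `sqlalchemy` default come from the commit message.

```python
from typing import Any

# Default discriminator value named in the commit message.
DEFAULT_PROFILING_METHOD = "sqlalchemy"


def inject_default_method(value: Any) -> Any:
    """Return the profiling config with a `method` key guaranteed for dict input.

    Hypothetical helper: it mirrors what a "before" validator would do
    ahead of discriminated-union resolution.
    """
    if isinstance(value, dict) and "method" not in value:
        # Build a new dict so the caller's recipe is not mutated.
        return {**value, "method": DEFAULT_PROFILING_METHOD}
    # Non-dict input, or a dict that already names a method, passes through.
    return value


recipe_profiling = {"enabled": True}
normalized = inject_default_method(recipe_profiling)
```

In Pydantic v2 a function like this would typically be attached with `field_validator(..., mode="before")`, so it runs before the union looks for its tag; how the PR actually wires it up is not shown in this conversation.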
Table registration gate at source.py:584 checked is_ge_profiling() only, so method: sqlalchemy populated an empty self.tables and profiled nothing. Add is_sqlalchemy_profiling() and uses_table_level_profiler() helpers and replace the gate check to cover both table-level methods.
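A rough sketch of the gate fix, assuming a simplified config object. The three helper names come from the commit message; `ProfilingConfigSketch`, `register_tables`, and the `"analyze"` value used as a non-table-level method are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ProfilingConfigSketch:
    """Hypothetical stand-in for the source's profiling config."""

    method: str = "sqlalchemy"

    def is_ge_profiling(self) -> bool:
        return self.method == "ge"

    def is_sqlalchemy_profiling(self) -> bool:
        return self.method == "sqlalchemy"

    def uses_table_level_profiler(self) -> bool:
        # Both GE and SQLAlchemy profile table by table.
        return self.is_ge_profiling() or self.is_sqlalchemy_profiling()


def register_tables(config: ProfilingConfigSketch, discovered: Iterable[str]) -> List[str]:
    """Hypothetical gate: keep tables only when a table-level profiler runs."""
    # Before the fix, the gate checked is_ge_profiling() alone, so
    # method: sqlalchemy left the table list empty.
    if config.uses_table_level_profiler():
        return list(discovered)
    return []


tables = ["catalog.schema.orders", "catalog.schema.users"]
```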
… tables and managing connections per catalog (cherry picked from commit 4ccd0bf)
…eval and enhance workunit generation (cherry picked from commit 2345928)
…ance Co-authored-by: hsheth2 <hsheth2@gmail.com> (cherry picked from commit 47e1b99)
Override get_profiler_instance to dispatch on profiling.method while using get_sql_alchemy_url(database=db_name) so both GE and SQLAlchemy paths resolve the correct catalog in the connection URL.
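The dispatch described above might look roughly like the following. Only `get_profiler_instance`, `get_sql_alchemy_url(database=...)`, and the `ge`/`sqlalchemy` method values come from the commit message; the classes, the URL layout, and the stub profiler are assumptions made so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceConfigSketch:
    """Hypothetical config; the real URL construction is not shown in the PR page."""

    method: str
    base_url: str

    def get_sql_alchemy_url(self, database: Optional[str] = None) -> str:
        # Assumed layout: append the catalog name as the database path segment.
        return f"{self.base_url}/{database}" if database else self.base_url


@dataclass
class StubProfiler:
    """Placeholder for the GE or SQLAlchemy profiler objects."""

    kind: str
    url: str


def get_profiler_instance(config: SourceConfigSketch, db_name: str) -> StubProfiler:
    # Resolve the URL once, with the catalog, so both paths share it.
    url = config.get_sql_alchemy_url(database=db_name)
    if config.method == "ge":
        return StubProfiler(kind="ge", url=url)
    if config.method == "sqlalchemy":
        return StubProfiler(kind="sqlalchemy", url=url)
    raise ValueError(f"Unsupported profiling method: {config.method!r}")


profiler = get_profiler_instance(
    SourceConfigSketch(method="sqlalchemy", base_url="dialect://host"), "main"
)
```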
Bundle Report
Bundle size has no change ✅
PySpark and pydeequ are no longer installed by default with the s3, abs, or databricks/unity-catalog extras. A new [pyspark] extra carries those deps for users who need them.
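One common way to treat a dependency as optional is to check for it at the point of use and fail with an install hint. The sketch below uses only the standard library; `require_optional` is a hypothetical helper, not a function from this PR, and the error wording is an assumption.

```python
import importlib.util


def require_optional(module_name: str, extra: str) -> None:
    """Raise a helpful ImportError if a top-level optional module is missing.

    Hypothetical helper; `extra` is the extras name to suggest to the user.
    """
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install 'acryl-datahub[{extra}]'"
        )


def spark_profiling_available() -> bool:
    """Report whether PySpark-backed profiling could be used in this environment."""
    try:
        require_optional("pyspark", "pyspark")
    except ImportError:
        return False
    return True
```

Calling `require_optional("pyspark", "pyspark")` just before the Spark-based code path keeps the import cost and the failure local to users who actually enable that path.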