
[DoNotMerge] feat(ingestion): make PySpark optional for s3/abs profiling #386

Draft
kyungsoo-datahub wants to merge 64 commits into master from experimental/marsh-cve-clean

@kyungsoo-datahub

Collaborator

PySpark and pydeequ are no longer installed by default with s3,
abs, or databricks/unity-catalog extras. A new [pyspark] extra
carries those deps for users who need them.
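A minimal sketch of how a source could guard the now-optional dependency, assuming an import-time check is acceptable. The helper name `require_optional` and the error wording are hypothetical and not taken from this PR; only the `[pyspark]` extra and the `acryl-datahub` package name come from the description and commit messages here.

```python
import importlib.util


def require_optional(module_name: str, extra: str, feature: str) -> None:
    """Raise a clear error when an optional top-level module is not installed.

    Hypothetical helper for illustration; the PR may implement this differently.
    """
    # find_spec returns None for a top-level module that cannot be found,
    # without actually importing it.
    if importlib.util.find_spec(module_name) is None:
        raise ModuleNotFoundError(
            f"{feature} needs the optional '{module_name}' package; "
            f"install it with: pip install 'acryl-datahub[{extra}]'"
        )
```

A caller would run something like `require_optional("pyspark", "pyspark", "s3 profiling")` just before the code path that needs Spark, so users without the extra get an actionable message instead of a bare import failure.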

treff7es added 30 commits May 15, 2026 22:04
kyungsoo-datahub and others added 14 commits May 15, 2026 22:10
…mediation

CVEs: CVE-2025-66418, CVE-2025-66471, CVE-2026-21441, CVE-2026-44431, CVE-2026-44432 (urllib3),
CVE-2022-40897, CVE-2024-6345, CVE-2025-47273 (setuptools).

elasticsearch and profiling-ge pin conflicting urllib3 ranges; resolved via
uv override-dependencies so the dev lock always uses the patched version.
Introduces aws-common as an explicit extra with self-references from all
aws-using extras; updates verify_pyproject_equivalence.py accordingly.
…id airflow conflict

setuptools CVEs are build-time vulnerabilities. Adding >=78.1.1 to base_requirements
breaks airflow 2.7.3 integration (gcloud-aio-auth pins setuptools<67). CVE protection
is enforced via constraints.txt (==81.0.0) for Marsh installs and uv.lock for dev.

pytest.importorskip at module level converts the collection error
into a clean skip when the elasticsearch extra is absent from the
dev+integration-tests venv.
@kyungsoo-datahub marked this pull request as draft May 16, 2026 05:47
@kyungsoo-datahub changed the title from "feat(ingestion): make PySpark optional for s3/abs profiling" to "[DoNotMerge] feat(ingestion): make PySpark optional for s3/abs profiling" May 16, 2026
@github-actions


PR Title Check Failed

Your PR title must follow the format: <type>[optional scope]: <description>

Examples:

  • feat(ingestion): add Snowflake v2 source
  • fix: resolve crash on empty dashboard

See the Contributing Guide for allowed types and format details.

kyungsoo-datahub and others added 8 commits May 17, 2026 07:01
Unity Catalog's discriminated-union profiling default still pointed at
UnityCatalogGEProfilerConfig after PR datahub-project#17465 flipped the global
GEProfilingBaseConfig default to sqlalchemy. Aligns Unity with the rest of
the sources: GE is now opt-in across the board.

Users who relied on Unity Catalog GE profiling must now set
profiling.method: ge explicitly and install acryl-datahub[profiling-ge].

(cherry picked from commit a00dfb3)

Pydantic's discriminated union fails with union_tag_not_found when a
recipe supplies profiling as a dict without a method key. The Field
default only applies when the field is entirely absent. Add a
before-validator that injects method: sqlalchemy when the key is missing.
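
The injection step described above can be sketched as a plain function. This is an illustration under assumptions, not the PR's code: the function name and constant are hypothetical, and in Pydantic v2 it would presumably be wired up through something like `@field_validator("profiling", mode="before")` on the owning config model.

```python
from typing import Any

# Assumed default, per the commit message ("injects method: sqlalchemy").
DEFAULT_PROFILING_METHOD = "sqlalchemy"


def inject_default_method(value: Any) -> Any:
    """Add the discriminator key when a profiling dict omits it.

    Non-dict inputs and dicts that already carry "method" pass through
    untouched, so an explicit user choice is never overridden.
    """
    if isinstance(value, dict) and "method" not in value:
        # Return a copy rather than mutating the caller's recipe dict.
        return {**value, "method": DEFAULT_PROFILING_METHOD}
    return value
```

With this in place, a recipe fragment such as `{"enabled": True}` reaches the discriminated union as `{"enabled": True, "method": "sqlalchemy"}`, while `{"method": "ge"}` is left as written.
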
Table registration gate at source.py:584 checked is_ge_profiling() only,
so method: sqlalchemy populated an empty self.tables and profiled nothing.
Add is_sqlalchemy_profiling() and uses_table_level_profiler() helpers and
replace the gate check to cover both table-level methods.
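
A rough sketch of the helper trio, assuming the config exposes a `profiling` object with `enabled` and `method` fields. The dataclasses here are stand-ins for illustration; only the three method names come from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class ProfilingSketch:
    # Stand-in for the real profiling config; field names are assumptions.
    enabled: bool = True
    method: str = "sqlalchemy"


@dataclass
class SourceConfigSketch:
    profiling: ProfilingSketch = field(default_factory=ProfilingSketch)

    def is_ge_profiling(self) -> bool:
        return self.profiling.enabled and self.profiling.method == "ge"

    def is_sqlalchemy_profiling(self) -> bool:
        return self.profiling.enabled and self.profiling.method == "sqlalchemy"

    def uses_table_level_profiler(self) -> bool:
        # The gate should pass for either table-level method, not just GE.
        return self.is_ge_profiling() or self.is_sqlalchemy_profiling()
```

The table registration gate would then check `uses_table_level_profiler()` instead of `is_ge_profiling()`, so `method: sqlalchemy` no longer leaves `self.tables` empty.
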
… tables and managing connections per catalog

(cherry picked from commit 4ccd0bf)
…eval and enhance workunit generation

(cherry picked from commit 2345928)
…ance

Co-authored-by: hsheth2 <hsheth2@gmail.com>
(cherry picked from commit 47e1b99)

Override get_profiler_instance to dispatch on profiling.method while
using get_sql_alchemy_url(database=db_name) so both GE and SQLAlchemy
paths resolve the correct catalog in the connection URL.
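
The dispatch described in this commit might look roughly like the sketch below. Everything except the names `get_profiler_instance` and `get_sql_alchemy_url(database=...)` is hypothetical: the classes are placeholders and the URL format is invented purely so the example runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfigSketch:
    method: str = "sqlalchemy"
    base_url: str = "dialect://host"  # placeholder, not a real connection URL

    def get_sql_alchemy_url(self, database: Optional[str] = None) -> str:
        # Append the catalog name when one is given.
        return f"{self.base_url}/{database}" if database else self.base_url


@dataclass
class ProfilerSketch:
    kind: str
    url: str


def get_profiler_instance(config: ConfigSketch, db_name: str) -> ProfilerSketch:
    # Resolve the URL once so both paths target the same catalog.
    url = config.get_sql_alchemy_url(database=db_name)
    if config.method == "ge":
        return ProfilerSketch(kind="ge", url=url)
    if config.method == "sqlalchemy":
        return ProfilerSketch(kind="sqlalchemy", url=url)
    raise ValueError(f"unsupported table-level profiling method: {config.method!r}")
```

The point of the sketch is that the URL is built from `db_name` before branching, so neither profiler path can fall back to a URL without the catalog.
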
@codecov

codecov Bot commented May 18, 2026


Bundle Report

Bundle size has no change ✅



5 participants