Skip to content

[DataLoader] Add support to convert filters to SQL strings#632

Open
robreeves wants to merge 6 commits into
linkedin:mainfrom
robreeves:sparkfilter
Open

[DataLoader] Add support to convert filters to SQL strings#632
robreeves wants to merge 6 commits into
linkedin:mainfrom
robreeves:sparkfilter

Conversation

@robreeves

@robreeves robreeves commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds a public to_sql(filter, target=SqlTarget.SPARK) to the DataLoader so callers can render
the filter DSL (col(...) > ..., etc.) as a SQL boolean expression for a chosen target — Spark,
Trino, or DataFusion. This unifies TableTransformer to also use SqlTarget instead of leaking SQLGlot into the public API.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

  • Client-facing API Changes: New public exports in openhouse.dataloader: to_sql(filter, target=SqlTarget.SPARK) -> str and the SqlTarget enum (SPARK, TRINO, DATA_FUSION). to_sql returns a WHERE-clause-ready predicate for the chosen target; callers select a SqlTarget, never a raw dialect string.
  • Internal API Changes: TableTransformer.dialect is now typed SqlTarget (was str); to_datafusion_sql(sql, source_dialect) now takes source_dialect: SqlTarget (was str); the standalone _to_datafusion_sql helper was removed (the loader's internal DataFusion query now calls to_sql(filters, SqlTarget.DATA_FUSION)).
  • Bug Fixes: Filter NaN checks (is_nan() / is_not_nan()) previously rendered col IS NAN, which DataFusion 53 cannot parse. They now render isnan(col) (Spark/DataFusion) / is_nan(col) (Trino), all valid in their dialect.
  • New Features: to_sql renders any filter DSL expression to Spark, Trino, or DataFusion SQL.
  • Refactoring: The filter→DataFusion and filter→Spark/Trino paths now share one dialect-agnostic sqlglot AST builder (_filter_to_expr / _literal_to_expr) rendered via .sql(dialect=...), replacing the hand-rolled f-string builder. SqlTarget is the single SQL-flavor vocabulary across to_sql, TableTransformer.dialect, and to_datafusion_sql.
  • Documentation: Updated README.md and CLAUDE.md to document to_sql / SqlTarget and the TableTransformer.dialect type.
  • Tests: Added TestSparkSqlConversion and TestTrinoSqlConversion (per-node assertions + sqlglot parse-validation per dialect). Updated the DataFusion filter tests to render via to_sql(..., SqlTarget.DATA_FUSION), updated the transformer/transpiler tests to use SqlTarget, and removed now-unreachable tests.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

  • make verify passes locally — ruff lint, ruff format --check, mypy (no issues), and the full unit suite (340 tests). This includes TestDataFusionFilterExecution, which executes the generated SQL against a real DataFusion SessionContext, so the unified rendering is validated against the engine.
  • Added: TestSparkSqlConversion, TestTrinoSqlConversion.
  • Updated: the filter→DataFusion conversion tests (now via to_sql), the to_datafusion_sql transpiler tests (now SqlTarget), and all TableTransformer subclasses in the test suite.
  • I did not run the Docker integration tests for this change — it only alters SQL rendering/typing, which the unit tests plus the real-DataFusion execution tests cover.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

  • Breaking Changes: TableTransformer subclasses must now pass a SqlTarget (e.g. SqlTarget.SPARK) instead of a dialect string, and to_datafusion_sql's source_dialect parameter is now a SqlTarget. These are internal/advanced APIs (the transformer ABC is not in the top-level __init__ exports).

robreeves added 4 commits June 9, 2026 20:48
Add a public to_sql(filter, target=SqlTarget.SPARK) that renders a Filter
DSL object as a SQL boolean expression for a target. The underlying sqlglot
dialect is hidden behind the SqlTarget enum so callers never pass a dialect
string.

Internally, _filter_to_expr() builds a single dialect-agnostic sqlglot AST
from the Filter tree; to_sql() renders it with .sql(dialect=target.value) and
_to_datafusion_sql() renders the same AST with the default dialect (signature
unchanged). This replaces the hand-rolled f-string DataFusion builder.

Side effect: NaN checks now render isnan(col) instead of the unparseable
'col IS NAN' (DataFusion 53 rejects IS NAN), and non-finite float literals use
canonical NaN/Infinity spellings accepted by both DataFusion and Spark.

Tests: TestSparkSqlConversion asserts per-node Spark output, validates each
rendering parses as Spark via sqlglot, and checks Spark vs DataFusion differ
only in identifier quoting. make verify passes (ruff, mypy, 332 tests).
Use SqlTarget across the library instead of raw dialect strings:

- Add SqlTarget.TRINO and SqlTarget.DATA_FUSION (now {SPARK, TRINO, DATA_FUSION}).
- Type TableTransformer.dialect as SqlTarget (was str); data_loader bridges to the
  string-based to_datafusion_sql() transpiler via transformer.dialect.value.
- Drop the standalone _to_datafusion_sql(); the loader's internal DataFusion query
  now uses to_sql(filters, SqlTarget.DATA_FUSION). Everything funnels through
  to_sql() -> _filter_to_expr().

NaN is dialect-specific: Trino spells it is_nan() while Spark/DataFusion use isnan().
Thread the target through _filter_to_expr and pick the right function name per target.

Breaking change for TableTransformer subclasses (an internal API): pass a SqlTarget,
e.g. SqlTarget.SPARK. Updated all subclasses + tests; removed the now-impossible
bad-dialect transformer test (string validation is still covered by test_datafusion_sql).

make verify passes (ruff, mypy, 344 tests).
…rectly

Inline to_sql(expr, SqlTarget.DATA_FUSION) at the 31 call sites that used the
leftover _to_datafusion_sql test helper, and remove the helper. The DataFusion
filter rendering is now exercised through the public to_sql() path with no
indirection. No production change; the to_datafusion_sql statement transpiler is
untouched. make verify passes (344 tests).
Make to_datafusion_sql accept a SqlTarget instead of a raw dialect string, so the
transformer transpilation path is SqlTarget-typed end to end (data_loader now
passes transformer.dialect directly, no .value).

- Drop the 'unsupported source dialect' validation: every SqlTarget value is a
  valid sqlglot dialect, so the check is dead. Compare the no-op case with
  'is SqlTarget.DATA_FUSION'; use .value for the sqlglot parse + error message.
- Tests: convert spark/datafusion cases to SqlTarget; remove the mysql/postgres
  transpilation cases and the unsupported-dialect test (not expressible/reachable
  once the param is a SqlTarget).

make verify passes (ruff, mypy, 340 tests).
@robreeves robreeves changed the title feat(dataloader): to_sql() to render filter DSL as Spark/Trino/DataFusion SQL [DataLoader] Add support to convert filters to SQL strings Jun 15, 2026
…rk filters

Add a case to the Livy-driven integration test that reads rows via
OpenHouseDataLoader with a filter, converts the same filter to Spark SQL via
to_sql(filters, SqlTarget.SPARK), reads from Spark with that WHERE clause, and
asserts the two row sets match. The filter excludes one matching-by-score row via
a name IN, so an ignored predicate on either side fails the assertion.

LivySession gains a query() that returns result rows (shared submit/poll loop
factored into _run); execute() is unchanged behaviorally.

Static checks pass (ruff, py_compile, unit suite); the Dockerized integration
suite must be run via make integration-tests.
@robreeves robreeves marked this pull request as ready for review June 17, 2026 17:13
…e explicitly

Normalize the float/Decimal input to a float once, then identify NaN / +Infinity
/ -Infinity with explicit positive checks instead of inferring -Infinity from a
fallthrough. The impossible case now raises (a guard) rather than silently
producing '-Infinity', and there's a single spelling source (no separate
str(Decimal) branch). Behavior is unchanged; 340 tests pass.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants