[DataLoader] Add support to convert filters to SQL strings#632
Open
robreeves wants to merge 6 commits into
Open
[DataLoader] Add support to convert filters to SQL strings#632robreeves wants to merge 6 commits into
robreeves wants to merge 6 commits into
Conversation
Add a public to_sql(filter, target=SqlTarget.SPARK) that renders a Filter DSL object as a SQL boolean expression for a target. The underlying sqlglot dialect is hidden behind the SqlTarget enum so callers never pass a dialect string. Internally, _filter_to_expr() builds a single dialect-agnostic sqlglot AST from the Filter tree; to_sql() renders it with .sql(dialect=target.value) and _to_datafusion_sql() renders the same AST with the default dialect (signature unchanged). This replaces the hand-rolled f-string DataFusion builder. Side effect: NaN checks now render isnan(col) instead of the unparseable 'col IS NAN' (DataFusion 53 rejects IS NAN), and non-finite float literals use canonical NaN/Infinity spellings accepted by both DataFusion and Spark. Tests: TestSparkSqlConversion asserts per-node Spark output, validates each rendering parses as Spark via sqlglot, and checks Spark vs DataFusion differ only in identifier quoting. make verify passes (ruff, mypy, 332 tests).
Use SqlTarget across the library instead of raw dialect strings:
- Add SqlTarget.TRINO and SqlTarget.DATA_FUSION (now {SPARK, TRINO, DATA_FUSION}).
- Type TableTransformer.dialect as SqlTarget (was str); data_loader bridges to the
string-based to_datafusion_sql() transpiler via transformer.dialect.value.
- Drop the standalone _to_datafusion_sql(); the loader's internal DataFusion query
now uses to_sql(filters, SqlTarget.DATA_FUSION). Everything funnels through
to_sql() -> _filter_to_expr().
NaN is dialect-specific: Trino spells it is_nan() while Spark/DataFusion use isnan().
Thread the target through _filter_to_expr and pick the right function name per target.
Breaking change for TableTransformer subclasses (an internal API): pass a SqlTarget,
e.g. SqlTarget.SPARK. Updated all subclasses + tests; removed the now-impossible
bad-dialect transformer test (string validation is still covered by test_datafusion_sql).
make verify passes (ruff, mypy, 344 tests).
…rectly Inline to_sql(expr, SqlTarget.DATA_FUSION) at the 31 call sites that used the leftover _to_datafusion_sql test helper, and remove the helper. The DataFusion filter rendering is now exercised through the public to_sql() path with no indirection. No production change; the to_datafusion_sql statement transpiler is untouched. make verify passes (344 tests).
Make to_datafusion_sql accept a SqlTarget instead of a raw dialect string, so the transformer transpilation path is SqlTarget-typed end to end (data_loader now passes transformer.dialect directly, no .value). - Drop the 'unsupported source dialect' validation: every SqlTarget value is a valid sqlglot dialect, so the check is dead. Compare the no-op case with 'is SqlTarget.DATA_FUSION'; use .value for the sqlglot parse + error message. - Tests: convert spark/datafusion cases to SqlTarget; remove the mysql/postgres transpilation cases and the unsupported-dialect test (not expressible/reachable once the param is a SqlTarget). make verify passes (ruff, mypy, 340 tests).
…rk filters Add a case to the Livy-driven integration test that reads rows via OpenHouseDataLoader with a filter, converts the same filter to Spark SQL via to_sql(filters, SqlTarget.SPARK), reads from Spark with that WHERE clause, and asserts the two row sets match. The filter excludes one matching-by-score row via a name IN, so an ignored predicate on either side fails the assertion. LivySession gains a query() that returns result rows (shared submit/poll loop factored into _run); execute() is unchanged behaviorally. Static checks pass (ruff, py_compile, unit suite); the Dockerized integration suite must be run via make integration-tests.
…e explicitly Normalize the float/Decimal input to a float once, then identify NaN / +Infinity / -Infinity with explicit positive checks instead of inferring -Infinity from a fallthrough. The impossible case now raises (a guard) rather than silently producing '-Infinity', and there's a single spelling source (no separate str(Decimal) branch). Behavior is unchanged; 340 tests pass.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a public
to_sql(filter, target=SqlTarget.SPARK)to the DataLoader so callers can renderthe filter DSL (
col(...) > ..., etc.) as a SQL boolean expression for a chosen target — Spark,Trino, or DataFusion. This unifies
TableTransformerto also use SqlTarget instead of leaking SQLGlot into the public API.Changes
For all the boxes checked, please include additional details of the changes made in this pull request.
openhouse.dataloader:to_sql(filter, target=SqlTarget.SPARK) -> strand theSqlTargetenum (SPARK,TRINO,DATA_FUSION).to_sqlreturns aWHERE-clause-ready predicate for the chosen target; callers select aSqlTarget, never a raw dialect string.TableTransformer.dialectis now typedSqlTarget(wasstr);to_datafusion_sql(sql, source_dialect)now takessource_dialect: SqlTarget(wasstr); the standalone_to_datafusion_sqlhelper was removed (the loader's internal DataFusion query now callsto_sql(filters, SqlTarget.DATA_FUSION)).is_nan()/is_not_nan()) previously renderedcol IS NAN, which DataFusion 53 cannot parse. They now renderisnan(col)(Spark/DataFusion) /is_nan(col)(Trino), all valid in their dialect.to_sqlrenders any filter DSL expression to Spark, Trino, or DataFusion SQL._filter_to_expr/_literal_to_expr) rendered via.sql(dialect=...), replacing the hand-rolled f-string builder.SqlTargetis the single SQL-flavor vocabulary acrossto_sql,TableTransformer.dialect, andto_datafusion_sql.README.mdandCLAUDE.mdto documentto_sql/SqlTargetand theTableTransformer.dialecttype.TestSparkSqlConversionandTestTrinoSqlConversion(per-node assertions +sqlglotparse-validation per dialect). Updated the DataFusion filter tests to render viato_sql(..., SqlTarget.DATA_FUSION), updated the transformer/transpiler tests to useSqlTarget, and removed now-unreachable tests.Testing Done
For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.
make verifypasses locally —rufflint,ruff format --check,mypy(no issues), and the full unit suite (340 tests). This includesTestDataFusionFilterExecution, which executes the generated SQL against a real DataFusionSessionContext, so the unified rendering is validated against the engine.TestSparkSqlConversion,TestTrinoSqlConversion.to_sql), theto_datafusion_sqltranspiler tests (nowSqlTarget), and allTableTransformersubclasses in the test suite.Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.
TableTransformersubclasses must now pass aSqlTarget(e.g.SqlTarget.SPARK) instead of a dialect string, andto_datafusion_sql'ssource_dialectparameter is now aSqlTarget. These are internal/advanced APIs (the transformer ABC is not in the top-level__init__exports).