SNOW-3266495: Defer HAVING, ORDER BY, LIMIT clauses in SCOS compatibility mode#4132
sfc-gh-joshi wants to merge 3 commits into main
Conversation
src/snowflake/snowpark/dataframe.py
self._pending_having = None
self._pending_order_by = None
self._pending_limit = None
I'm curious how multiple filters would affect the plan generation:
df1 = df.groupBy(...).agg(...)
df2 = df1.filter(...).limit(...).filter(...)
df3 = df1.filter(...).filter(...).limit(...)
df4 = df1.limit(...).filter(...).filter(...)
- do df2, df3, df4 output the same query?
- what's the behavior in spark and do we align with spark behavior after your code change?
That's a good point. It looks like in spark, filter is not commutative across a sequence of df.filter(...).orderBy(...).limit(...).filter(...) (the final call will see a deterministic subset of rows based on the prior order/limit). I'll need to do some more testing to see what this means for SQL generation, and whether the cases you mentioned have similar problems.
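The non-commutativity of filter and limit can be illustrated with a toy model in plain Python (hypothetical data, lists standing in for DataFrame rows; not Snowpark code):

```python
# Toy illustration: filter and limit do not commute.
# Rows stand in for a DataFrame; limit(n) takes the first n rows.
rows = [1, 2, 3, 4, 5, 6]

def keep_even(rs):
    return [r for r in rs if r % 2 == 0]

def limit(rs, n):
    return rs[:n]

filter_then_limit = limit(keep_even(rows), 2)   # -> [2, 4]
limit_then_filter = keep_even(limit(rows, 2))   # -> [2]
```

Because the two orderings yield different rows, a filter that appears after a limit cannot simply be folded into the same query block as the limit.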
@sfc-gh-aling I added some test cases covering this behavior, and checked the output against spark. The changes are:
- Operations that occur after a LIMIT now always produce a new sub-query, since FILTER -> LIMIT and LIMIT -> FILTER are not equivalent.
- Consecutive filter operations are now conjoined into a single HAVING clause. I don't think SQL has the short-circuiting evaluation behavior that imperative languages do, so I believe this should always be equivalent. The Spark `explain` plans I looked at also combined filter clauses into a single operator.
- Consecutive ordering operations are now combined into a single ORDER BY, with the last ordering clause appearing first in the SQL.
Most of these cases were previously broken in SCOS, as the only sequence of operations that would have produced valid SQL was `df.groupby(...).agg(...).filter(...).orderBy(...).limit(...)`, where the operations appeared in the same order required by SQL.
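The two combination rules can be sanity-checked with a toy Python model (hypothetical data; lists of tuples standing in for rows, `sorted` standing in for ORDER BY):

```python
rows = [(1, "b"), (2, "a"), (2, "b"), (3, "a")]

p1 = lambda r: r[0] >= 2          # first filter predicate
p2 = lambda r: r[1] == "a"        # second filter predicate

# filter(p1) followed by filter(p2) ...
chained = [r for r in [r for r in rows if p1(r)] if p2(r)]
# ... equals a single filter on the conjunction (p1 AND p2)
conjoined = [r for r in rows if p1(r) and p2(r)]
assert chained == conjoined

# orderBy(a) followed by orderBy(b): with a stable sort, sorting by a
# and then by b equals a single sort on (b, a) -- last key first,
# matching "the last ordering clause appearing first in the SQL".
by_a_then_b = sorted(sorted(rows, key=lambda r: r[0]), key=lambda r: r[1])
combined = sorted(rows, key=lambda r: (r[1], r[0]))
assert by_a_then_b == combined
```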
nice, thanks for checking the behavior!
re 1: does `orderBy` also produce a subquery like `limit` does?
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-3266495
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Snowflake SQL requires the HAVING, ORDER BY, and LIMIT clauses of a GROUP BY statement to appear in that particular order, for example: `SELECT c, COUNT(*) FROM t GROUP BY c HAVING COUNT(*) > 1 ORDER BY c LIMIT 10`. Re-arranging any of these clauses produces a syntax error.
Currently, when `_is_snowpark_connect_compatible_mode` is set, the order in which these clauses are added depends on the order in which they are specified by the user. That is, `df.groupBy(...).agg(...).sort(...).filter(...)` would emit ORDER BY before HAVING, resulting in a compilation error.

This PR defers the generation of these clauses for GROUP BY aggregations. Instead of appending each clause to the analyzer tree when a method is called, the new `DataFrame._build_post_agg_df` method ensures the clauses are added in the correct order.
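A minimal sketch of the deferral idea (hypothetical class and method names, not the actual implementation in this PR): pending clauses are recorded in whatever order the user calls them, and only rendered in the SQL-mandated order when the query is built.

```python
class PostAggState:
    """Toy model of deferring HAVING / ORDER BY / LIMIT emission."""

    def __init__(self, base_sql):
        self.base_sql = base_sql
        self.having = []      # predicates, conjoined into one HAVING
        self.order_by = []    # ordering keys, last call first
        self.limit = None

    def filter(self, predicate):
        self.having.append(predicate)
        return self

    def sort(self, *keys):
        self.order_by = list(keys) + self.order_by
        return self

    def take(self, n):
        self.limit = n
        return self

    def to_sql(self):
        # Render clauses in the order Snowflake SQL requires,
        # regardless of the order the methods were called in.
        sql = self.base_sql
        if self.having:
            sql += " HAVING " + " AND ".join(self.having)
        if self.order_by:
            sql += " ORDER BY " + ", ".join(self.order_by)
        if self.limit is not None:
            sql += f" LIMIT {self.limit}"
        return sql

# Clauses specified "out of order" still render in the legal order.
q = PostAggState("SELECT c, COUNT(*) FROM t GROUP BY c")
q.sort("c").filter("COUNT(*) > 1").take(10)
print(q.to_sql())
```

Here `sort(...).filter(...).take(...)` is called with ORDER BY first, yet the generated SQL places HAVING before ORDER BY before LIMIT.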