SNOW-3236596: cte implement dup node detection #4143
Open
sfc-gh-aling wants to merge 3 commits into main
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-3236596
For now, the fix applies only to SCOS mode. The Snowpark Python CTE implementation is built on the assumption that independent DataFrames can be CTE optimized.
As a result, the CTE optimizer merges two independently constructed DataFrames that happen to produce identical SQL into a single CTE. This causes incorrect results when data generation functions like uuid_string() or random() are used: for example, df1.union_all(df2) would return duplicate values instead of two independent evaluations.
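The failure mode described above can be reproduced in a toy model that does not touch Snowflake at all. This is a minimal sketch, not Snowpark code: `evaluate`, `union_all_independent`, and `union_all_with_shared_cte` are hypothetical names, and `uuid.uuid4()` stands in for a server-side call to uuid_string().

```python
import uuid


def evaluate(sql: str) -> str:
    # Hypothetical stand-in for executing a query that calls uuid_string():
    # every evaluation produces a fresh value, regardless of the SQL text.
    return uuid.uuid4().hex


def union_all_independent(sql_a: str, sql_b: str) -> list:
    # Each branch of the union is evaluated on its own.
    return [evaluate(sql_a), evaluate(sql_b)]


def union_all_with_shared_cte(sql_a: str, sql_b: str) -> list:
    # Branches with identical SQL text are evaluated once and reused,
    # which mimics collapsing them into a single CTE.
    cache = {}
    rows = []
    for sql in (sql_a, sql_b):
        if sql not in cache:
            cache[sql] = evaluate(sql)
        rows.append(cache[sql])
    return rows


sql = "SELECT uuid_string() AS id"
independent = union_all_independent(sql, sql)
merged = union_all_with_shared_cte(sql, sql)

print(independent[0] != independent[1])  # two independent evaluations
print(merged[0] == merged[1])            # the single shared evaluation is repeated
```

Keying the cache on SQL text alone is what makes the two branches indistinguishable in this model.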
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
The root cause is that find_duplicate_subtrees in cte_utils.py identifies duplicates purely by encoded_node_id_with_query (a hash of the generated SQL). Two different Python objects (df1 and df2) that produce the same SQL get the same encoded id, causing them to be treated as duplicates and collapsed into a single CTE.
Fix:
When _is_snowpark_connect_compatible_mode is True, we now track the Python object identity (id(node)) alongside the encoded node id during tree traversal. A new helper _node_occurrence_count distinguishes between:
This behavior is gated behind _is_snowpark_connect_compatible_mode, so existing behavior is unchanged by default.
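The idea of tracking object identity next to the SQL-derived id can be sketched with a simplified plan tree. This is an illustration under stated assumptions, not the code in this PR: `Node`, `find_duplicates`, and the `identity_aware` flag are hypothetical, the SQL string itself stands in for the encoded node id, and the flag plays the role of the compatibility-mode gate.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(eq=False)
class Node:
    # Hypothetical plan node; `sql` stands in for the encoded node id.
    sql: str
    children: List["Node"] = field(default_factory=list)


def find_duplicates(root: Node, identity_aware: bool) -> Set[str]:
    # Record the Python object identity of every visit, grouped by SQL.
    visits = defaultdict(list)
    stack = [root]
    while stack:
        node = stack.pop()
        visits[node.sql].append(id(node))
        stack.extend(node.children)

    duplicates = set()
    for sql, ids in visits.items():
        if identity_aware:
            # Duplicate only if the same object was reached more than once.
            is_dup = len(ids) > len(set(ids))
        else:
            # SQL-only behavior: any repeated SQL counts as a duplicate.
            is_dup = len(ids) > 1
        if is_dup:
            duplicates.add(sql)
    return duplicates


SQL = "SELECT uuid_string() AS id"

# df1.union_all(df2): two distinct objects that generate the same SQL.
independent = Node("union", [Node(SQL), Node(SQL)])

# df.union_all(df): one object referenced twice.
shared = Node(SQL)
self_union = Node("union", [shared, shared])

print(find_duplicates(independent, identity_aware=False))  # {'SELECT uuid_string() AS id'}
print(find_duplicates(independent, identity_aware=True))   # set()
print(find_duplicates(self_union, identity_aware=True))    # {'SELECT uuid_string() AS id'}
```

In this sketch, a genuinely reused subtree is still eligible for a CTE, while two separately built subtrees with matching SQL are left alone when the flag is on.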