SNOW-3236596: cte implement dup node detection by sfc-gh-aling · Pull Request #4143 · snowflakedb/snowpark-python

sfc-gh-aling · 2026-03-27T21:16:51Z

Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

Fixes SNOW-3236596

The fix is only for SCOS mode alone for now. The snowpark python CTE implementation is built upon the assumption that independent DataFrames can be CTE optimized.

When the CTE optimizer merges two independently constructed DataFrames that happen to produce identical SQL into a single CTE. This causes incorrect results when data generation functions like uuid_string() or random() are used — for example, df1.union_all(df2) would return duplicate values instead of two independent evaluations.

Fill out the following pre-review checklist:
- I am adding a new automated test(s) to verify correctness of my new code
  - If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
- I am adding new logging messages
- I am adding a new telemetry message
- I am adding new credentials
- I am adding a new dependency
- If this is a new feature/behavior, I'm adding the Local Testing parity changes.
- I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
- If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
Please describe how your code solves the related issue.

The root cause is that find_duplicate_subtrees in cte_utils.py identifies duplicates purely by encoded_node_id_with_query (a hash of the generated SQL). Two different Python objects (df1 and df2) that produce the same SQL get the same encoded id, causing them to be treated as duplicates and collapsed into a single CTE.

Fix:
When _is_snowpark_connect_compatible_mode is True, we now track the Python object identity (id(node)) alongside the encoded node id during tree traversal. A new helper _node_occurrence_count distinguishes between:

True duplicates: The same Python object appearing multiple times (e.g. df.union_all(df)) — these are still deduplicated into a CTE.
Coincidental matches: Different Python objects that produce identical SQL (e.g. df1.union_all(df2)) — these are no longer merged.
This behavior is gated behind _is_snowpark_connect_compatible_mode so existing behavior is unchanged by default.

src/snowflake/snowpark/_internal/compiler/cte_utils.py

sfc-gh-aling added 2 commits March 27, 2026 14:14

implement dup node detection

433e998

add tests

f5c03c8

sfc-gh-aling marked this pull request as ready for review March 27, 2026 22:25

sfc-gh-aling requested review from a team as code owners March 27, 2026 22:25

sfc-gh-aling requested review from sfc-gh-aalam, sfc-gh-joshi and sfc-gh-yixie March 27, 2026 22:25

graphite-app bot reviewed Mar 27, 2026

View reviewed changes

src/snowflake/snowpark/_internal/compiler/cte_utils.py Show resolved Hide resolved

bug fix

2aca150

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SNOW-3236596: cte implement dup node detection#4143

SNOW-3236596: cte implement dup node detection#4143
sfc-gh-aling wants to merge 3 commits intomainfrom
SNOW-3236596-cte-disable-cte-for-same-sql-hash-but-with-different-plan-id

sfc-gh-aling commented Mar 27, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sfc-gh-aling commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sfc-gh-aling commented Mar 27, 2026 •

edited

Loading