
SNOW-3236596: cte implement dup node detection#4143

Open
sfc-gh-aling wants to merge 3 commits into main from
SNOW-3236596-cte-disable-cte-for-same-sql-hash-but-with-different-plan-id

Conversation

Contributor

@sfc-gh-aling sfc-gh-aling commented Mar 27, 2026

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-3236596

For now, the fix applies only to SCOS mode. The Snowpark Python CTE implementation is built on the assumption that independent DataFrames can be CTE optimized.

The CTE optimizer merges two independently constructed DataFrames that happen to produce identical SQL into a single CTE. This causes incorrect results when data generation functions like uuid_string() or random() are used: for example, df1.union_all(df2) would return duplicate values instead of two independent evaluations.
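A minimal, self-contained sketch of the flawed detection (not the actual Snowpark internals; `PlanNode` and `find_duplicates_by_sql_hash` are illustrative stand-ins for the plan tree and for hash-based duplicate detection):

```python
import hashlib
from collections import Counter


class PlanNode:
    """Toy stand-in for a query plan node that renders to SQL."""

    def __init__(self, sql, children=()):
        self.sql = sql
        self.children = list(children)

    @property
    def encoded_id(self):
        # Analogous to encoded_node_id_with_query: a hash of the SQL only.
        return hashlib.sha256(self.sql.encode()).hexdigest()


def find_duplicates_by_sql_hash(root):
    """Flag any SQL hash seen more than once as a 'duplicate' subtree."""
    counts = Counter()

    def walk(node):
        counts[node.encoded_id] += 1
        for child in node.children:
            walk(child)

    walk(root)
    return {h for h, n in counts.items() if n > 1}


# df1 and df2 are built independently but render to identical SQL.
df1 = PlanNode("SELECT uuid_string() AS id")
df2 = PlanNode("SELECT uuid_string() AS id")
union = PlanNode("UNION ALL", [df1, df2])

# Same SQL => same hash, so the two independent nodes are wrongly
# flagged as duplicates and would be collapsed into one CTE.
assert df1.encoded_id == df2.encoded_id
assert find_duplicates_by_sql_hash(union) == {df1.encoded_id}
```

Once collapsed into one CTE, uuid_string() is evaluated once and both branches of the union see the same values, which is the observed bug.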

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
  3. Please describe how your code solves the related issue.

The root cause is that find_duplicate_subtrees in cte_utils.py identifies duplicates purely by encoded_node_id_with_query (a hash of the generated SQL). Two different Python objects (df1 and df2) that produce the same SQL get the same encoded id, causing them to be treated as duplicates and collapsed into a single CTE.

Fix:
When _is_snowpark_connect_compatible_mode is True, we now track the Python object identity (id(node)) alongside the encoded node id during tree traversal. A new helper _node_occurrence_count distinguishes between:

  • True duplicates: the same Python object appearing multiple times (e.g. df.union_all(df)); these are still deduplicated into a CTE.
  • Coincidental matches: different Python objects that produce identical SQL (e.g. df1.union_all(df2)); these are no longer merged.

This behavior is gated behind _is_snowpark_connect_compatible_mode, so existing behavior is unchanged by default.
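A hedged sketch of the fixed detection, counting occurrences per Python object identity (`id(node)`) under each SQL hash; names such as `find_true_duplicates` and `PlanNode` are illustrative, not the actual helpers in cte_utils.py:

```python
import hashlib
from collections import Counter, defaultdict


class PlanNode:
    """Toy stand-in for a query plan node that renders to SQL."""

    def __init__(self, sql, children=()):
        self.sql = sql
        self.children = list(children)

    @property
    def encoded_id(self):
        return hashlib.sha256(self.sql.encode()).hexdigest()


def find_true_duplicates(root):
    """Return SQL hashes where the *same object* appears more than once."""
    # SQL hash -> Counter of object ids seen under that hash.
    occurrences = defaultdict(Counter)

    def walk(node):
        occurrences[node.encoded_id][id(node)] += 1
        for child in node.children:
            walk(child)

    walk(root)
    # Only a repeated object identity counts as a true duplicate;
    # distinct objects with identical SQL are left alone.
    return {
        h for h, ids in occurrences.items()
        if any(n > 1 for n in ids.values())
    }


df = PlanNode("SELECT uuid_string() AS id")
true_dup = PlanNode("UNION ALL", [df, df])  # same object referenced twice
coincidental = PlanNode(
    "UNION ALL",
    [PlanNode("SELECT uuid_string() AS id"),
     PlanNode("SELECT uuid_string() AS id")],  # distinct objects, same SQL
)

assert find_true_duplicates(true_dup) == {df.encoded_id}   # still a CTE
assert find_true_duplicates(coincidental) == set()         # no longer merged
```

The design choice mirrors the PR's split: object identity captures genuine reuse of one DataFrame, while the SQL hash alone cannot distinguish reuse from coincidence.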

@sfc-gh-aling sfc-gh-aling marked this pull request as ready for review March 27, 2026 22:25
@sfc-gh-aling sfc-gh-aling requested review from a team as code owners March 27, 2026 22:25