SNOW-3283826: retry upon cte error for snowpark connect mode#4140

Open

sfc-gh-aling wants to merge 6 commits intomainfrom

aling-cte-retry-v2

Contributor

sfc-gh-aling commented Mar 26, 2026 •

edited

Loading

Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

Fixes SNOW-3283826
Fill out the following pre-review checklist:
- I am adding a new automated test(s) to verify correctness of my new code
  - If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
- I am adding new logging messages
- I am adding a new telemetry message
- I am adding new credentials
- I am adding a new dependency
- If this is a new feature/behavior, I'm adding the Local Testing parity changes.
- I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
- If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
Please describe how your code solves the related issue.

Description

When CTE optimization rewrites a query, the resulting SQL may occasionally fail at execution time with a ProgrammingError even though the equivalent unoptimized SQL would succeed. This change adds a transparent retry mechanism that catches these failures and re-executes with unoptimized SQL, preventing user-visible errors.

How it works

When a ProgrammingError occurs during execution in ServerConnection.get_result_set:

Guard checks — Retry only triggers when all conditions are met: the query is synchronous (block=True), Snowpark Connect compatible mode is enabled, CTE optimization is enabled, and the failed query actually contains CTE syntax (starts with WITH ).
Retry with unoptimized SQL — Calls SnowflakePlan.get_execution_queries_without_cte(), which recompiles the plan via PlanCompiler.compile(skip_cte_optimization=True) without mutating any session state. The unoptimized queries are then executed. If the retry also fails, the original error is re-raised.
Auto-disable — When the threshold is positive (context._cte_error_threshold, default 3), each successful retry increments a per-session counter (session._cte_error_count). Once the counter reaches the threshold, CTE optimization is automatically disabled for the remainder of the session with a warning log. Setting the threshold to 0 disables auto-disable entirely and skips counter increments.
Telemetry — Every successful retry sends a cte_execution_retry telemetry event containing the plan UUID, error message, api_calls chain, and both the failed CTE query ID and the successful retry query ID (real Snowflake sfqids, not raw SQL). When auto-disable triggers, a separate cte_optimization_auto_disabled event is sent.

Files changed

File	Change
`server_connection.py`	Core retry logic in `get_result_set`, extracted `_execute_queries` helper
`snowflake_plan.py`	Added `get_execution_queries_without_cte()`
`plan_compiler.py`	Added `skip_cte_optimization` parameter to `compile()`
`context.py`	Added `_cte_error_threshold` module-level variable
`session.py`	Added `_cte_error_count` instance attribute
`telemetry.py`	Added `send_cte_execution_retry_telemetry` and `send_cte_optimization_auto_disabled_telemetry`
`telemetry_constants.py`	Added `CTE_QUERY_ID`, `RETRY_QUERY_ID`, `CTE_ERROR_COUNT`
`tests/integ/test_cte.py`	10 integration tests covering retry success/failure, guard conditions, auto-disable, telemetry data, and threshold=0

Test plan

sfc-gh-aling added 2 commits

March 26, 2026 11:20


          retry upon cte error

f21e1f5


          update

ab022cb

sfc-gh-aling marked this pull request as ready for review

March 26, 2026 20:30

sfc-gh-aling requested review from a team as code owners

March 26, 2026 20:30

sfc-gh-aling requested review from sfc-gh-aalam, sfc-gh-yixie and sfc-gh-yuwang

March 26, 2026 20:30

sfc-gh-aling changed the title ~~SNOW-3283826: retry upon cte error~~ SNOW-3283826: retry upon cte error for snowpark connect mode

graphite-app bot reviewed

View reviewed changes

src/snowflake/snowpark/_internal/server_connection.py Outdated Show resolved Hide resolved

sfc-gh-aalam approved these changes

View reviewed changes

sfc-gh-joshi requested changes

View reviewed changes

src/snowflake/snowpark/_internal/server_connection.py Outdated

Comment on lines +749 to +750

		# main query are Snowpark-generated temp object DDL (e.g., CREATE
		# TEMP FILE FORMAT IF NOT EXISTS) that are idempotent. CTE rewriting

Contributor

sfc-gh-joshi Mar 27, 2026

Are the post-action queries of the CTE-optimized attempt run on error? Also, is state like temporary identifier suffixes regenerated on the retry attempt?

Contributor Author

sfc-gh-aling Mar 27, 2026

yes, they are run in the final block, and I added tests to confirm the behavior
yes, it's regenerated with fresh id

src/snowflake/snowpark/_internal/server_connection.py Outdated

Comment on lines +787 to +789

+                                  with session._lock:
+                                      session._cte_error_count += 1
+                                      current_count = session._cte_error_count

Contributor

sfc-gh-joshi Mar 27, 2026

Can we make this lock more fine-grained to avoid potential contention with other code acquiring this lock to access other variables?

Contributor Author

sfc-gh-aling Mar 27, 2026

lock operation consolidated

src/snowflake/snowpark/_internal/server_connection.py Outdated

Comment on lines +800 to +801

		with session._lock:
		session._cte_optimization_enabled = False

Contributor

sfc-gh-joshi Mar 27, 2026

There's a minor TOCTOU race condition with the code as-written. If another thread issues a CTE query that errors out between the two lock acquisitions in this function, it could issue an additional query with CTE enabled even though the threshold was already exceeded.

Checking current_count and setting _cte_optimization_enabled = False should occur in the previous lock acquisition block. The warning log can still be safely kept outside it.

Contributor Author

sfc-gh-aling Mar 27, 2026

lock operation consolidated


          update tests and fix bug

1e36445

sfc-gh-aling requested a review from sfc-gh-joshi

March 27, 2026 23:55

graphite-app bot reviewed

View reviewed changes

src/snowflake/snowpark/_internal/server_connection.py Show resolved Hide resolved

sfc-gh-aling and others added 2 commits

March 27, 2026 17:03


          Update src/snowflake/snowpark/_internal/server_connection.py

05cb048

Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>


          Update src/snowflake/snowpark/_internal/server_connection.py

a8b238b

Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>

graphite-app bot reviewed

View reviewed changes

src/snowflake/snowpark/_internal/server_connection.py Show resolved Hide resolved


          lint

1d847d0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

graphite-app[bot] graphite-app[bot] left review comments

sfc-gh-aalam sfc-gh-aalam approved these changes

sfc-gh-yixie Awaiting requested review from sfc-gh-yixie sfc-gh-yixie is a code owner automatically assigned from snowflakedb/snowpark-python-api-reviewers

sfc-gh-yuwang Awaiting requested review from sfc-gh-yuwang sfc-gh-yuwang is a code owner automatically assigned from snowflakedb/snowpark-python-api-reviewers

sfc-gh-joshi Awaiting requested review from sfc-gh-joshi

Requested changes must be addressed to merge this pull request.

Labels

None yet