Skip to content

SNOW-3283826: retry upon cte error for snowpark connect mode#4140

Open
sfc-gh-aling wants to merge 6 commits intomainfrom
aling-cte-retry-v2
Open

SNOW-3283826: retry upon cte error for snowpark connect mode#4140
sfc-gh-aling wants to merge 6 commits intomainfrom
aling-cte-retry-v2

Conversation

@sfc-gh-aling
Copy link
Copy Markdown
Contributor

@sfc-gh-aling sfc-gh-aling commented Mar 26, 2026

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-3283826

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
  3. Please describe how your code solves the related issue.

Description

When CTE optimization rewrites a query, the resulting SQL may occasionally fail at execution time with a ProgrammingError even though the equivalent unoptimized SQL would succeed. This change adds a transparent retry mechanism that catches these failures and re-executes with unoptimized SQL, preventing user-visible errors.

How it works

When a ProgrammingError occurs during execution in ServerConnection.get_result_set:

  1. Guard checks — Retry only triggers when all conditions are met: the query is synchronous (block=True), Snowpark Connect compatible mode is enabled, CTE optimization is enabled, and the failed query actually contains CTE syntax (starts with WITH ).

  2. Retry with unoptimized SQL — Calls SnowflakePlan.get_execution_queries_without_cte(), which recompiles the plan via PlanCompiler.compile(skip_cte_optimization=True) without mutating any session state. The unoptimized queries are then executed. If the retry also fails, the original error is re-raised.

  3. Auto-disable — When the threshold is positive (context._cte_error_threshold, default 3), each successful retry increments a per-session counter (session._cte_error_count). Once the counter reaches the threshold, CTE optimization is automatically disabled for the remainder of the session with a warning log. Setting the threshold to 0 disables auto-disable entirely and skips counter increments.

  4. Telemetry — Every successful retry sends a cte_execution_retry telemetry event containing the plan UUID, error message, api_calls chain, and both the failed CTE query ID and the successful retry query ID (real Snowflake sfqids, not raw SQL). When auto-disable triggers, a separate cte_optimization_auto_disabled event is sent.

Files changed

File Change
server_connection.py Core retry logic in get_result_set, extracted _execute_queries helper
snowflake_plan.py Added get_execution_queries_without_cte()
plan_compiler.py Added skip_cte_optimization parameter to compile()
context.py Added _cte_error_threshold module-level variable
session.py Added _cte_error_count instance attribute
telemetry.py Added send_cte_execution_retry_telemetry and send_cte_optimization_auto_disabled_telemetry
telemetry_constants.py Added CTE_QUERY_ID, RETRY_QUERY_ID, CTE_ERROR_COUNT
tests/integ/test_cte.py 10 integration tests covering retry success/failure, guard conditions, auto-disable, telemetry data, and threshold=0

Test plan

  • test_cte_retry_succeeds_increments_counter — counter goes 0→1 on successful retry
  • test_cte_retry_fails_reraises_original_error — both CTE and unoptimized fail, counter stays 0
  • test_cte_retry_no_retry_when_cte_disabled — error propagates when CTE optimization is off
  • test_cte_retry_no_retry_when_not_compatible_mode — error propagates when compat mode is False
  • test_cte_retry_no_retry_on_async_queriesblock=False skips retry
  • test_cte_retry_auto_disable_at_threshold — counter hits 3, CTE auto-disabled, telemetry sent
  • test_cte_retry_counter_below_threshold_does_not_disable — counter=1, CTE stays enabled
  • test_cte_retry_telemetry_contains_reproduction_data — real sfqids for both failed and retry queries, api_calls, plan_uuid, error_message
  • test_cte_retry_threshold_zero_disables_auto_disable — threshold=0 means no counter increment, no auto-disable
  • test_cte_retry_no_retry_when_query_has_no_cte_prefix — non-CTE query error propagates directly

@sfc-gh-aling sfc-gh-aling marked this pull request as ready for review March 26, 2026 20:30
@sfc-gh-aling sfc-gh-aling requested review from a team as code owners March 26, 2026 20:30
@sfc-gh-aling sfc-gh-aling changed the title SNOW-3283826: retry upon cte error SNOW-3283826: retry upon cte error for snowpark connect mode Mar 26, 2026
Comment on lines +749 to +750
# main query are Snowpark-generated temp object DDL (e.g., CREATE
# TEMP FILE FORMAT IF NOT EXISTS) that are idempotent. CTE rewriting
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the post-action queries of the CTE-optimized attempt run on error? Also, is state like temporary identifier suffixes regenerated on the retry attempt?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. yes, they are run in the final block, and I added tests to confirm the behavior
  2. yes, it's regenerated with fresh id

Comment on lines +787 to +789
with session._lock:
session._cte_error_count += 1
current_count = session._cte_error_count
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this lock more fine-grained to avoid potential contention with other code acquiring this lock to access other variables?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lock operation consolidated

Comment on lines +800 to +801
with session._lock:
session._cte_optimization_enabled = False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a minor TOCTOU race condition with the code as-written. If another thread issues a CTE query that errors out between the two lock acquisitions in this function, it could issue an additional query with CTE enabled even though the threshold was already exceeded.

Checking current_count and setting _cte_optimization_enabled = False should occur in the previous lock acquisition block. The warning log can still be safely kept outside it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lock operation consolidated

sfc-gh-aling and others added 2 commits March 27, 2026 17:03
Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>
Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants