SNOW-3283826: retry upon cte error for snowpark connect mode#4140
SNOW-3283826: retry upon cte error for snowpark connect mode#4140sfc-gh-aling wants to merge 6 commits intomainfrom
Conversation
| # main query are Snowpark-generated temp object DDL (e.g., CREATE | ||
| # TEMP FILE FORMAT IF NOT EXISTS) that are idempotent. CTE rewriting |
There was a problem hiding this comment.
Are the post-action queries of the CTE-optimized attempt run on error? Also, is state like temporary identifier suffixes regenerated on the retry attempt?
There was a problem hiding this comment.
- yes, they are run in the final block, and I added tests to confirm the behavior
- yes, it's regenerated with fresh id
| with session._lock: | ||
| session._cte_error_count += 1 | ||
| current_count = session._cte_error_count |
There was a problem hiding this comment.
Can we make this lock more fine-grained to avoid potential contention with other code acquiring this lock to access other variables?
There was a problem hiding this comment.
lock operation consolidated
| with session._lock: | ||
| session._cte_optimization_enabled = False |
There was a problem hiding this comment.
There's a minor TOCTOU race condition with the code as-written. If another thread issues a CTE query that errors out between the two lock acquisitions in this function, it could issue an additional query with CTE enabled even though the threshold was already exceeded.
Checking current_count and setting _cte_optimization_enabled = False should occur in the previous lock acquisition block. The warning log can still be safely kept outside it.
There was a problem hiding this comment.
lock operation consolidated
Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>
Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-3283826
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Description
When CTE optimization rewrites a query, the resulting SQL may occasionally fail at execution time with a
ProgrammingErroreven though the equivalent unoptimized SQL would succeed. This change adds a transparent retry mechanism that catches these failures and re-executes with unoptimized SQL, preventing user-visible errors.How it works
When a
ProgrammingErroroccurs during execution inServerConnection.get_result_set:Guard checks — Retry only triggers when all conditions are met: the query is synchronous (
block=True), Snowpark Connect compatible mode is enabled, CTE optimization is enabled, and the failed query actually contains CTE syntax (starts withWITH).Retry with unoptimized SQL — Calls
SnowflakePlan.get_execution_queries_without_cte(), which recompiles the plan viaPlanCompiler.compile(skip_cte_optimization=True)without mutating any session state. The unoptimized queries are then executed. If the retry also fails, the original error is re-raised.Auto-disable — When the threshold is positive (
context._cte_error_threshold, default 3), each successful retry increments a per-session counter (session._cte_error_count). Once the counter reaches the threshold, CTE optimization is automatically disabled for the remainder of the session with a warning log. Setting the threshold to 0 disables auto-disable entirely and skips counter increments.Telemetry — Every successful retry sends a
cte_execution_retrytelemetry event containing the plan UUID, error message,api_callschain, and both the failed CTE query ID and the successful retry query ID (real Snowflake sfqids, not raw SQL). When auto-disable triggers, a separatecte_optimization_auto_disabledevent is sent.Files changed
server_connection.pyget_result_set, extracted_execute_querieshelpersnowflake_plan.pyget_execution_queries_without_cte()plan_compiler.pyskip_cte_optimizationparameter tocompile()context.py_cte_error_thresholdmodule-level variablesession.py_cte_error_countinstance attributetelemetry.pysend_cte_execution_retry_telemetryandsend_cte_optimization_auto_disabled_telemetrytelemetry_constants.pyCTE_QUERY_ID,RETRY_QUERY_ID,CTE_ERROR_COUNTtests/integ/test_cte.pyTest plan
test_cte_retry_succeeds_increments_counter— counter goes 0→1 on successful retrytest_cte_retry_fails_reraises_original_error— both CTE and unoptimized fail, counter stays 0test_cte_retry_no_retry_when_cte_disabled— error propagates when CTE optimization is offtest_cte_retry_no_retry_when_not_compatible_mode— error propagates when compat mode is Falsetest_cte_retry_no_retry_on_async_queries—block=Falseskips retrytest_cte_retry_auto_disable_at_threshold— counter hits 3, CTE auto-disabled, telemetry senttest_cte_retry_counter_below_threshold_does_not_disable— counter=1, CTE stays enabledtest_cte_retry_telemetry_contains_reproduction_data— real sfqids for both failed and retry queries, api_calls, plan_uuid, error_messagetest_cte_retry_threshold_zero_disables_auto_disable— threshold=0 means no counter increment, no auto-disabletest_cte_retry_no_retry_when_query_has_no_cte_prefix— non-CTE query error propagates directly