Skip to content

Refactor tool call retry logic with explicit scenario labels#2813

Open
onmete wants to merge 2 commits intoopenshift:mainfrom
onmete:fix/tool-call-retry-logic
Open

Refactor tool call retry logic with explicit scenario labels#2813
onmete wants to merge 2 commits intoopenshift:mainfrom
onmete:fix/tool-call-retry-logic

Conversation

@onmete
Copy link
Contributor

@onmete onmete commented Mar 13, 2026

Description

Refactors _execute_single_tool_call in ols/src/tools/tools.py to make the retry/error-handling logic explicit and correct.

Changes:

  • Renames _is_retryable_tool_error_is_transient_error (clearer: transient is an error property, retryable is a policy outcome)
  • Labels the four execution scenarios explicitly in the code:
    • Scenario 1 — precondition failure (missing tool name, unknown tool, sensitive args): always includes DO_NOT_RETRY_REMINDER
    • Scenario 2 — transient infrastructure error: auto-retry with exponential backoff
    • Scenario 3a — non-transient error on the very first attempt: no DO_NOT_RETRY_REMINDER so the LLM knows it may retry with corrected arguments
    • Scenario 3b — any other failure (exhausted retries, or non-transient after a prior transient retry): includes DO_NOT_RETRY_REMINDER
  • Fixes DO_NOT_RETRY_REMINDER inclusion: previously keyed on all_attempts_exhausted; now correctly keyed on attempt == 0 and not _is_transient_error(error) to identify clean first-attempt non-transient failures
  • Eliminates a redundant get_tool_by_name call: the precondition block now saves the resolved StructuredTool and passes it directly into execute_tool_call, removing the duplicate lookup on every successful execution
  • Updates execute_tool_call signature to accept StructuredTool instead of tool_name: str + all_mcp_tools: list[StructuredTool]

Type of change

  • Refactor

Related Tickets & Documents

  • Related Issue #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • make test-unit — 883 passed
  • make test-integration — 95 passed
  • make verify — black, ruff, pylint (10.00/10), mypy all clean

Made with Cursor

- Rename _is_retryable_tool_error to _is_transient_error for clarity
- Label retry scenarios 1, 2, 3a, 3b in _execute_single_tool_call
- Fix DO_NOT_RETRY_REMINDER: omit only on first-attempt non-transient
  failure (Scenario 3a — LLM may retry with corrected args); add it for
  all other failure paths
- Eliminate double get_tool_by_name lookup by saving the resolved tool
  in the precondition block and passing it directly to execute_tool_call
- Update execute_tool_call signature to accept StructuredTool instead
  of tool name + tools list
- Update tests to match revised semantics and new signature

Made-with: Cursor
@openshift-ci
Copy link

openshift-ci bot commented Mar 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign raptorsun for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link

openshift-ci bot commented Mar 13, 2026

@onmete: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-ols-cluster 62b2a65 link true /test e2e-ols-cluster

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

tool_output = f"Error: {e}. {DO_NOT_RETRY_REMINDER}"
status = "error"
logger.error("Tool call missing name: %s", tool_call)
logger.error(tool_output)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A cleaner way is to return here and not have else

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The else clause on a try block is idiomatic Python - it means "run this only if no exception was raised".

@blublinsky
Copy link
Contributor

So basically, this PR is trying to save on retries for non-retryable calls. The savings can be quite small, but... Can we have a story for this with the estimate

@onmete
Copy link
Contributor Author

onmete commented Mar 16, 2026

@blublinsky This is about not telling LLM to never try the tool when it makes the wrong args at first atttempt. Not about saving retries.

@blublinsky
Copy link
Contributor

blublinsky commented Mar 16, 2026

was trying to do the same thing and came up with a simpler one. I hope

def _is_transient_tool_error(error: Exception) -> bool:
    """Return true if a tool execution error is likely transient."""
    # Canonical timeout/connection exceptions: usually safe to retry.
    if isinstance(error, (TimeoutError, asyncio.TimeoutError, ConnectionError)):
        return True
    # OS/network stack errors (e.g., socket issues) are often transient.
    if isinstance(error, OSError):
        return True
    # Fallback text matching for provider-specific exception types/messages.
    error_text = str(error).lower()
    return any(
        token in error_text
        for token in ("timeout", "temporar", "connection reset", "connection closed")
    )


def _is_rate_limited_tool_error(error: Exception) -> bool:
    """Return true if a tool execution error indicates rate-limiting."""
    error_text = str(error).lower()
    return any(
        token in error_text for token in ("rate limit", "429", "too many requests")
    )
    
    
  async def _execute_with_retries(
    *,
    tool_id: str,
    tool: StructuredTool,
    tool_args: dict[str, object],
    max_tokens: int,
) -> ToolExecutionEvent:
    """Execute one tool call with retry policy and convert result into tool_result event.

    Args:
        tool_id: Correlation ID of the originating tool call.
        tool: Tool to execute.
        tool_args: Arguments passed to the tool.
        max_tokens: Maximum tokens allowed for tool output truncation.

    Returns:
        Final tool-result event representing either success or terminal failure.
    """
    tool_name = tool.name
    attempts = MAX_TOOL_CALL_RETRIES + 1
    last_error_text = "unknown error"
    was_truncated = False

    for attempt in range(attempts):
        try:
            _status, tool_output, was_truncated, structured_content = (
                await execute_tool_call(tool, tool_args, max_tokens)
            )
            return _tool_result_event(
                content=tool_output,
                status="success",
                tool_call_id=tool_id,
                truncated=was_truncated,
                structured_content=structured_content,
            )
        except Exception as error:
            last_error_text = str(error)
            is_rate_limited_error = _is_rate_limited_tool_error(error)
            should_retry = _is_transient_tool_error(error) or is_rate_limited_error
            if attempt < MAX_TOOL_CALL_RETRIES and should_retry:
                logger.warning(
                    "Retrying tool '%s' after local transient failure on attempt %d/%d: %s",
                    tool_name,
                    attempt + 1,
                    attempts,
                    error,
                )
                backoff_base = (
                    RATE_LIMIT_RETRY_BACKOFF_SECONDS
                    if is_rate_limited_error
                    else RETRY_BACKOFF_SECONDS
                )
                await asyncio.sleep(backoff_base * (2**attempt))
                continue
            break

    reason = " ".join(last_error_text.split())
    if len(reason) > 220:
        reason = f"{reason[:217]}..."
    tool_output = f"Tool '{tool_name}' failed: {reason}"
    return _tool_result_event(
        content=tool_output,
        status="error",
        tool_call_id=tool_id,
        truncated=was_truncated,
    )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants