Refactor tool call retry logic with explicit scenario labels by onmete · Pull Request #2813 · openshift/lightspeed-service

onmete · 2026-03-13T15:17:36Z

Description

Refactors _execute_single_tool_call in ols/src/tools/tools.py to make the retry/error-handling logic explicit and correct.

Changes:

Renames _is_retryable_tool_error → _is_transient_error (clearer: transient is an error property, retryable is a policy outcome)
Labels the four execution scenarios explicitly in the code:
- Scenario 1 — precondition failure (missing tool name, unknown tool, sensitive args): always includes DO_NOT_RETRY_REMINDER
- Scenario 2 — transient infrastructure error: auto-retry with exponential backoff
- Scenario 3a — non-transient error on the very first attempt: no DO_NOT_RETRY_REMINDER so the LLM knows it may retry with corrected arguments
- Scenario 3b — any other failure (exhausted retries, or non-transient after a prior transient retry): includes DO_NOT_RETRY_REMINDER
Fixes DO_NOT_RETRY_REMINDER inclusion: previously keyed on all_attempts_exhausted; now correctly keyed on attempt == 0 and not _is_transient_error(error) to identify clean first-attempt non-transient failures
Eliminates a redundant get_tool_by_name call: the precondition block now saves the resolved StructuredTool and passes it directly into execute_tool_call, removing the duplicate lookup on every successful execution
Updates execute_tool_call signature to accept StructuredTool instead of tool_name: str + all_mcp_tools: list[StructuredTool]

Type of change

Refactor

Related Tickets & Documents

Related Issue #

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

make test-unit — 883 passed
make test-integration — 95 passed
make verify — black, ruff, pylint (10.00/10), mypy all clean

Made with Cursor

- Rename _is_retryable_tool_error to _is_transient_error for clarity - Label retry scenarios 1, 2, 3a, 3b in _execute_single_tool_call - Fix DO_NOT_RETRY_REMINDER: omit only on first-attempt non-transient failure (Scenario 3a — LLM may retry with corrected args); add it for all other failure paths - Eliminate double get_tool_by_name lookup by saving the resolved tool in the precondition block and passing it directly to execute_tool_call - Update execute_tool_call signature to accept StructuredTool instead of tool name + tools list - Update tests to match revised semantics and new signature Made-with: Cursor

openshift-ci · 2026-03-13T15:18:02Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign raptorsun for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2026-03-13T16:34:53Z

@onmete: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-ols-cluster	`62b2a65`	link	true	`/test e2e-ols-cluster`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

ols/src/tools/tools.py

blublinsky · 2026-03-14T20:27:59Z

ols/src/tools/tools.py

+        tool_output = f"Error: {e}. {DO_NOT_RETRY_REMINDER}"
        status = "error"
-        logger.error("Tool call missing name: %s", tool_call)
+        logger.error(tool_output)


A cleaner way is to return here and not have else

The else clause on a try block is idiomatic Python - it means "run this only if no exception was raised".

blublinsky · 2026-03-14T20:36:18Z

So basically, this PR is trying to save on retries for non-retryable calls. The savings can be quite small, but... Can we have a story for this with the estimate

onmete · 2026-03-16T09:06:07Z

@blublinsky This is about not telling LLM to never try the tool when it makes the wrong args at first atttempt. Not about saving retries.

blublinsky · 2026-03-16T12:55:41Z

was trying to do the same thing and came up with a simpler one. I hope

def _is_transient_tool_error(error: Exception) -> bool:
    """Return true if a tool execution error is likely transient."""
    # Canonical timeout/connection exceptions: usually safe to retry.
    if isinstance(error, (TimeoutError, asyncio.TimeoutError, ConnectionError)):
        return True
    # OS/network stack errors (e.g., socket issues) are often transient.
    if isinstance(error, OSError):
        return True
    # Fallback text matching for provider-specific exception types/messages.
    error_text = str(error).lower()
    return any(
        token in error_text
        for token in ("timeout", "temporar", "connection reset", "connection closed")
    )


def _is_rate_limited_tool_error(error: Exception) -> bool:
    """Return true if a tool execution error indicates rate-limiting."""
    error_text = str(error).lower()
    return any(
        token in error_text for token in ("rate limit", "429", "too many requests")
    )
    
    
  async def _execute_with_retries(
    *,
    tool_id: str,
    tool: StructuredTool,
    tool_args: dict[str, object],
    max_tokens: int,
) -> ToolExecutionEvent:
    """Execute one tool call with retry policy and convert result into tool_result event.

    Args:
        tool_id: Correlation ID of the originating tool call.
        tool: Tool to execute.
        tool_args: Arguments passed to the tool.
        max_tokens: Maximum tokens allowed for tool output truncation.

    Returns:
        Final tool-result event representing either success or terminal failure.
    """
    tool_name = tool.name
    attempts = MAX_TOOL_CALL_RETRIES + 1
    last_error_text = "unknown error"
    was_truncated = False

    for attempt in range(attempts):
        try:
            _status, tool_output, was_truncated, structured_content = (
                await execute_tool_call(tool, tool_args, max_tokens)
            )
            return _tool_result_event(
                content=tool_output,
                status="success",
                tool_call_id=tool_id,
                truncated=was_truncated,
                structured_content=structured_content,
            )
        except Exception as error:
            last_error_text = str(error)
            is_rate_limited_error = _is_rate_limited_tool_error(error)
            should_retry = _is_transient_tool_error(error) or is_rate_limited_error
            if attempt < MAX_TOOL_CALL_RETRIES and should_retry:
                logger.warning(
                    "Retrying tool '%s' after local transient failure on attempt %d/%d: %s",
                    tool_name,
                    attempt + 1,
                    attempts,
                    error,
                )
                backoff_base = (
                    RATE_LIMIT_RETRY_BACKOFF_SECONDS
                    if is_rate_limited_error
                    else RETRY_BACKOFF_SECONDS
                )
                await asyncio.sleep(backoff_base * (2**attempt))
                continue
            break

    reason = " ".join(last_error_text.split())
    if len(reason) > 220:
        reason = f"{reason[:217]}..."
    tool_output = f"Tool '{tool_name}' failed: {reason}"
    return _tool_result_event(
        content=tool_output,
        status="error",
        tool_call_id=tool_id,
        truncated=was_truncated,
    )

openshift-ci bot requested review from blublinsky and joshuawilson March 13, 2026 15:18

falox suggested changes Mar 14, 2026

View reviewed changes

ols/src/tools/tools.py Show resolved Hide resolved

ols/src/tools/tools.py Outdated Show resolved Hide resolved

ols/src/tools/tools.py Outdated Show resolved Hide resolved

blublinsky reviewed Mar 14, 2026

View reviewed changes

Log tool call on error

7dd74fd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor tool call retry logic with explicit scenario labels#2813

Refactor tool call retry logic with explicit scenario labels#2813
onmete wants to merge 2 commits intoopenshift:mainfrom
onmete:fix/tool-call-retry-logic

onmete commented Mar 13, 2026

Uh oh!

openshift-ci bot commented Mar 13, 2026

Uh oh!

openshift-ci bot commented Mar 13, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

blublinsky Mar 14, 2026

Uh oh!

onmete Mar 16, 2026

Uh oh!

blublinsky commented Mar 14, 2026

Uh oh!

onmete commented Mar 16, 2026

Uh oh!

blublinsky commented Mar 16, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

onmete commented Mar 13, 2026

Description

Type of change

Related Tickets & Documents

Checklist before requesting a review

Testing

Uh oh!

openshift-ci bot commented Mar 13, 2026

Uh oh!

openshift-ci bot commented Mar 13, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

blublinsky Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

onmete Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

blublinsky commented Mar 14, 2026

Uh oh!

onmete commented Mar 16, 2026

Uh oh!

blublinsky commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

blublinsky commented Mar 16, 2026 •

edited

Loading