Refactor tool call retry logic with explicit scenario labels#2813
Refactor tool call retry logic with explicit scenario labels#2813onmete wants to merge 2 commits intoopenshift:mainfrom
Conversation
- Rename _is_retryable_tool_error to _is_transient_error for clarity - Label retry scenarios 1, 2, 3a, 3b in _execute_single_tool_call - Fix DO_NOT_RETRY_REMINDER: omit only on first-attempt non-transient failure (Scenario 3a — LLM may retry with corrected args); add it for all other failure paths - Eliminate double get_tool_by_name lookup by saving the resolved tool in the precondition block and passing it directly to execute_tool_call - Update execute_tool_call signature to accept StructuredTool instead of tool name + tools list - Update tests to match revised semantics and new signature Made-with: Cursor
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@onmete: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
ols/src/tools/tools.py
Outdated
| tool_output = f"Error: {e}. {DO_NOT_RETRY_REMINDER}" | ||
| status = "error" | ||
| logger.error("Tool call missing name: %s", tool_call) | ||
| logger.error(tool_output) |
There was a problem hiding this comment.
A cleaner way is to return here and not have else
There was a problem hiding this comment.
The else clause on a try block is idiomatic Python - it means "run this only if no exception was raised".
|
So basically, this PR is trying to save on retries for non-retryable calls. The savings can be quite small, but... Can we have a story for this with the estimate |
|
@blublinsky This is about not telling LLM to never try the tool when it makes the wrong args at first atttempt. Not about saving retries. |
|
was trying to do the same thing and came up with a simpler one. I hope |
Description
Refactors
_execute_single_tool_callinols/src/tools/tools.pyto make the retry/error-handling logic explicit and correct.Changes:
_is_retryable_tool_error→_is_transient_error(clearer: transient is an error property, retryable is a policy outcome)DO_NOT_RETRY_REMINDERDO_NOT_RETRY_REMINDERso the LLM knows it may retry with corrected argumentsDO_NOT_RETRY_REMINDERDO_NOT_RETRY_REMINDERinclusion: previously keyed onall_attempts_exhausted; now correctly keyed onattempt == 0 and not _is_transient_error(error)to identify clean first-attempt non-transient failuresget_tool_by_namecall: the precondition block now saves the resolvedStructuredTooland passes it directly intoexecute_tool_call, removing the duplicate lookup on every successful executionexecute_tool_callsignature to acceptStructuredToolinstead oftool_name: str+all_mcp_tools: list[StructuredTool]Type of change
Related Tickets & Documents
Checklist before requesting a review
Testing
make test-unit— 883 passedmake test-integration— 95 passedmake verify— black, ruff, pylint (10.00/10), mypy all cleanMade with Cursor