Fix async retry losing temperature→1.0 on LLMNoResponseError#3356
Fix async retry losing temperature→1.0 on LLMNoResponseError#3356malhotra5 wants to merge 8 commits into
Conversation
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
Coverage Report •
|
||||||||||||||||||||||||||||||
f18cfb8 to
1846b92
Compare
The async methods (acompletion, aresponses) used an AsyncRetrying 'async for' loop that has no wrapped function for tenacity to re-invoke with updated kwargs. The before_sleep callback wrote temperature=1.0 to retry_state.kwargs, but the loop body used a local call_kwargs that was never updated. Replace the AsyncRetrying loop with the same @Retry decorator pattern used by the sync methods. Tenacity's decorator handles async functions natively and automatically threads retry_state.kwargs into the wrapped function, so the temperature bump works identically for both sync and async paths. Also deduplicate shared logic across completion/acompletion and responses/aresponses: - _current_metrics_snapshot(): replaces 4 identical MetricsSnapshot constructions - _prepare_completion_params(): shared setup (format messages, tools, mock tools, select_chat_options, telemetry ctx) - _prepare_responses_params(): shared setup for responses path - _validate_chat_response(): shared post-processing (mock tools, telemetry, empty-choices check) Removes the now-unused async_retry() method and AsyncRetrying import from RetryMixin. Co-authored-by: openhands <openhands@all-hands.dev>
1846b92 to
321d7ac
Compare
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
The root-cause analysis in the PR description is spot-on: tenacity's AsyncRetrying loop manages the attempt context-manager but does not re-invoke the user's coroutine on each retry — so any retry_state.kwargs mutations written by before_sleep (the temperature bump) were silently dropped. Switching to @retry_decorator, which natively wraps async def functions and threads retry_state.kwargs into each call, is exactly the right fix.
The refactoring into shared helpers (_current_metrics_snapshot, _prepare_completion_params, _prepare_responses_params, _validate_chat_response) eliminates 4× duplication cleanly and the docstrings are accurate. The three new async regression tests directly validate the fixed behaviour.
Two items are worth a follow-up:
_prepare_completion_paramsmutates itskwargsargument in-place (inline comment below).- The
aresponsesasync retry path received the same structural fix but has no new regression tests (inline comment below).
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation.
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 QA Report: PARTIAL
I reproduced the base async completion retry bug and verified this PR fixes the observable LLM.acompletion() temperature retry behavior; one adjacent async Responses path could not be independently confirmed.
Does this PR achieve its stated goal?
Yes for the directly reproducible completion-path bug. On origin/main, the same SDK call sequence produced sync_temperatures=[0.0, 1.0] but async_temperatures=[0.0, 0.0], confirming async retry lost the temperature=1.0 bump after LLMNoResponseError. On the PR branch, the same script produced async_temperatures=[0.0, 1.0] and returned successfully after retry, showing the fix works for the user-facing LLM.acompletion() path. I could not separately prove an aresponses() 0→1 retry transition because the Responses path sent temperature=1.0 on the first transport call in this environment even when configured with temperature=0.0.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed and installed the uv-managed workspace environment. |
| CI Status | 🟡 GitHub checks at review time: 23 successful, 4 skipped, 5 pending, 0 failing. |
| Functional Verification | ✅ Reproduced base bug and verified PR fix via real SDK LLM.completion() / LLM.acompletion() calls with provider responses simulated at the LiteLLM boundary. |
Functional Verification
Test 1: Async completion retry applies the temperature bump after LLMNoResponseError
The verification script imported the SDK, created an LLM(temperature=0.0, num_retries=2), and called the public completion() and acompletion() APIs. To avoid external API credentials and make the empty-response condition deterministic, only the LiteLLM transport boundary was patched to return an empty choices=[] response followed by a successful response; the SDK retry, error handling, and response parsing paths were exercised normally.
Step 1 — Reproduce baseline without the fix:
Ran git checkout --detach origin/main && OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_completion_retry_clean.py; echo qa_exit=$?:
sync_temperatures=[0.0, 1.0]
async_temperatures=[0.0, 0.0]
expected_async_temperatures=[0.0, 1.0]
result=FAIL
qa_exit=1
This confirms the reported bug: the sync retry picked up temperature=1.0, but async retry resent temperature=0.0 on the second attempt.
Step 2 — Apply the PR's changes:
Checked out fix/async-retry-temperature-bump at 321d7ac9abfb3dc1ba0a4dea9082fcbf83cad768.
Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_completion_retry_clean.py; echo qa_exit=$?:
sync_temperatures=[0.0, 1.0]
async_temperatures=[0.0, 1.0]
expected_async_temperatures=[0.0, 1.0]
result=PASS
qa_exit=0
This shows the async retry now receives the same temperature bump as the sync path, and the public SDK call completes successfully after the retry.
Unable to Verify
I attempted a similar before/after script for LLM.aresponses() by simulating LLMNoResponseError then a successful ResponsesAPIResponse. In this environment, the Responses path sent temperature=1.0 on the first transport call even with both instance-level and call-level temperature=0.0, so there was no observable 0.0 → 1.0 retry delta to compare. Future QA guidance could document a concrete provider/model configuration that naturally sends temperature=0.0 through the Responses API and can produce an LLMNoResponseError retry.
Issues Found
None.
Final verdict: PARTIAL (core completion-path bug verified fixed; adjacent aresponses() temperature transition not independently observable in this environment).
This review was created by an AI agent (OpenHands) on behalf of the user.
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 Taste Rating: Acceptable — the direction is good, but I found retry-kwargs regressions in the async paths.
Risk: 🟡 MEDIUM because this touches core LLM retry behavior and provider-specific parameter filtering; human maintainer/eval judgment is still recommended before approval.
This review was generated by an AI agent (OpenHands) on behalf of the user.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26252814170
- Add defensive kwargs = dict(kwargs) at the top of both _prepare_completion_params and _prepare_responses_params so callers are not affected by internal mutations (e.g. kwargs['tools'] = ...). - Add 3 aresponses regression tests covering temperature bump on retry, retry-then-succeed, and retry exhaustion — parallel to the existing acompletion tests. Co-authored-by: openhands <openhands@all-hands.dev>
Extract shared helpers to eliminate code duplication across completion/acompletion/responses/aresponses: Layer 1 — _make_retry_decorator(): replaces 4 copies of the 7-line retry decorator configuration. Layer 2 — _build_completion_result() / _build_responses_result(): shared result-building logic (response → Message → LLMResponse), replaces 2×5 identical lines per pair. Layer 3 — responses stream processing: - _build_responses_call_kwargs(): shared litellm call kwargs dict - _process_stream_event(): event type matching + delta extraction - _finalize_stream_response(): post-stream validation + output patching Net result: -33 lines, and the sync/async responses _one_attempt bodies now differ only in the transport call (litellm_responses vs await litellm_aresponses) and iteration (for vs async for). Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 QA Report: PARTIAL
The PR fixes the async chat-completion retry temperature regression in a real SDK call path; one Responses-API-specific 0→1.0 delta could not be reproduced in this environment.
Does this PR achieve its stated goal?
Partially verified. The core reported bug for LLM.acompletion() is fixed: on origin/main, the same public SDK call retried with temperatures [0.0, 0.0], while PR commit a7616581 retried with [0.0, 1.0] and returned an LLMResponse. The sync control path stayed [0.0, 1.0], so the refactor did not break the already-working sync retry behavior in this exercised path. I also exercised LLM.aresponses(), but the simulated Responses path sent [1.0, 1.0] on both base and PR, so I could not use it to prove a before/after 0→1.0 bump for that API.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully and installed the uv workspace environment. |
| CI Status | ⏳ At review time: 22 checks successful, 2 skipped, 6 still in progress. |
| Functional Verification | ✅ LLM.acompletion() regression fixed; sync completion still behaves correctly. |
Functional Verification
Test 1: Async chat completion retry applies temperature bump
Step 1 — Reproduce / establish baseline without the fix:
Checked out origin/main (8f406a88) and ran a small script that imports the SDK, constructs LLM(temperature=0.0, num_retries=2), calls await llm.acompletion(...), and drives the real SDK retry path with a deterministic empty first model response followed by a successful response.
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run --no-sync python /tmp/qa_async_retry_temperature.py:
{
"async_completion_under_test": {
"temperatures": [0.0, 0.0],
"text": "ok"
},
"sync_completion_control": {
"temperatures": [0.0, 1.0],
"text": "ok"
}
}This confirms the bug exists on base: async retry reaches the successful second attempt, but it reuses temperature=0.0 instead of applying the retry temperature bump. The sync control shows the expected behavior on the same scenario.
Step 2 — Apply the PR's changes:
Checked out PR commit a7616581e188ad8c17e8d3ba1cf6a41d7c0950b1.
Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run --no-sync python /tmp/qa_async_retry_temperature.py:
{
"async_completion_under_test": {
"temperatures": [0.0, 1.0],
"text": "ok"
},
"sync_completion_control": {
"temperatures": [0.0, 1.0],
"text": "ok"
}
}This shows the PR fixes the core async chat-completion regression: the retry attempt now receives temperature=1.0 and the SDK returns a successful LLMResponse. The sync control remained unchanged.
Test 2: Async Responses API path still retries and returns an SDK response
Using the same before/after script, I also exercised await llm.aresponses(...) with a first-call LLMNoResponseError and a second successful Responses API result.
Base and PR both returned:
{
"async_responses_under_test": {
"temperatures": [1.0, 1.0],
"text": "ok"
}
}This verifies the Responses API path still retries and returns an LLMResponse, but this scenario did not reproduce a base-branch [0.0, 0.0] failure pattern for aresponses().
Unable to Verify
I could not verify the exact aresponses() 0→1.0 before/after delta claimed in the PR description because the exercised Responses API path supplied temperature=1.0 on both attempts in both base and PR. I also did not use a live LLM provider because the bug requires deterministic empty/no-response behavior, which is not reliably inducible through a normal real-provider user request; instead I used a deterministic fake LiteLLM transport while exercising the real public SDK LLM methods and retry logic.
Suggested future QA guidance: add a documented minimal script or local fake-provider recipe in AGENTS.md for reproducing LLM retry edge cases without relying on nondeterministic external provider behavior.
Issues Found
None.
This review was created by an AI agent (OpenHands) on behalf of the user.
Final verdict: PARTIAL
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
The root-cause diagnosis and fix are both correct. Tenacity's @retry decorator natively handles async def functions and threads retry_state.kwargs into each re-invocation, whereas the AsyncRetrying loop pattern had no wrapped function for tenacity to call with updated kwargs — so before_sleep's temperature=1.0 mutation was silently lost on every async retry. Switching acompletion and aresponses to @_make_retry_decorator() aligns them with the sync path and fixes the bug cleanly.
The refactoring is solid: the extracted helpers (_prepare_completion_params, _prepare_responses_params, _validate_chat_response, _build_completion_result, _build_responses_result, _build_responses_call_kwargs, _process_stream_event, _finalize_stream_response) eliminate substantial duplication while keeping each public method shallow and readable. The defensive kwargs = dict(kwargs) copies in the two prepare helpers (added in the follow-up commit) are correct.
One observation on the new aresponses tests (inline below).
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation.
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 Taste Rating: Acceptable — the async retry fix is directionally right, but I found one public SDK API removal that needs addressing.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM. This touches core LLM retry behavior and the SDK public surface.
VERDICT: ❌ Needs rework: restore or formally deprecate LLM.async_retry before removal.
This review was generated by an AI agent (OpenHands) on behalf of the user.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26254832519
Test that when a streaming response completes without a ResponseCompletedEvent (i.e. _finalize_stream_response raises LLMNoResponseError), the retry mechanism still bumps temperature from 0→1.0. Uses a minimal _FakeAsyncStreamIterator that inherits from ResponsesAPIStreamingIterator to pass isinstance checks. Co-authored-by: openhands <openhands@all-hands.dev>
|
@OpenHands please address the unresolved review comments, and note that we should be retaining the original code comments and logs wherever the original code has moved |
|
@malhotra5 your session has expired. Please login again at OpenHands Cloud and try again. |
…ed deltas, add sync tests - _validate_chat_response returns just resp (raw_resp is consumed internally by Telemetry.on_response); both call sites updated - Restore the mock-tool logger.debug and the informative comments lost when code moved into _prepare_completion_params / _prepare_responses_params - _process_stream_event gains emit_deltas so the ModelResponseStream chunk is not built when there is no stream callback to receive it - Add sync responses retry + sync streaming retry regression tests
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 QA Report: PARTIAL
Verified the async chat-completion retry bug before/after: the PR changes acompletion() from retrying at [0.0, 0.0] to retrying at [0.0, 1.0]; the Responses temperature delta was not independently reproducible in this environment.
Does this PR achieve its stated goal?
Partially verified. Yes for the reproducible LLM.acompletion() path: after a simulated empty-choices provider response, the retry warning fired and the second LiteLLM call received temperature=1.0 on the PR, while origin/main retried with temperature=0.0. I also exercised LLM.aresponses() retry/recovery, but in this black-box probe the Responses path sent temperature=1.0 on both base and PR before retry, so I could not establish the exact before/after temperature-bump delta for that path.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed and installed the uv-managed dev environment |
| CI Status | 🟡 Most checks observed passing; image/binary plus review/QA automation checks were still pending when checked |
| Functional Verification | 🟡 acompletion() fix verified before/after; aresponses() retry executed but temperature-specific delta was inconclusive |
Functional Verification
Test 1: Async chat completion retries with bumped temperature
I used a short SDK script that calls the public LLM.acompletion() method with real SDK message objects. To deterministically trigger the user-facing bug without external provider credentials, the script temporarily replaced the LiteLLM async transport with a provider stand-in that returns an empty ModelResponse(choices=[]) first, then a valid assistant message, while recording the actual kwargs passed into the transport.
Step 1 — Reproduce / establish baseline without the fix:
Ran:
git checkout --detach origin/main
OPENHANDS_SUPPRESS_BANNER=1 uv run python -u /tmp/qa_async_retry_probe.py > /tmp/qa_probe_main.out 2>&1Relevant output:
LLMNoResponseError with temperature=0, setting temperature to 1.0 for next attempt.
acompletion temperatures=[0.0, 0.0]; content=[TextContent(cache_prompt=False, type='text', text='async chat recovered')]
This confirms the bug on the base branch: the retry callback decided to bump temperature, but the second async chat transport call still received 0.0.
Step 2 — Apply the PR's changes:
Checked out the reviewed PR commit:
git checkout --detach 6daf9a2dc0b140d707330e8e85035bfb88d9d799Step 3 — Re-run with the fix in place:
Ran the same probe:
OPENHANDS_SUPPRESS_BANNER=1 uv run python -u /tmp/qa_async_retry_probe.py > /tmp/qa_probe_pr.out 2>&1Relevant output:
LLMNoResponseError with temperature=0, setting temperature to 1.0 for next attempt.
acompletion temperatures=[0.0, 1.0]; content=[TextContent(cache_prompt=False, type='text', text='async chat recovered')]
This shows the PR fixes the observed async retry behavior for chat completions: the retry-modified temperature now reaches the second provider call and the public SDK call recovers successfully.
Test 2: Async Responses retry path executes and recovers
The same probe exercised LLM.aresponses() by making the first transport call raise LLMNoResponseError and the second return a valid ResponsesAPIResponse.
Base and PR both recovered successfully, with this recorded temperature sequence:
origin/main: aresponses temperatures=[1.0, 1.0]; content=[TextContent(cache_prompt=False, type='text', text='async responses recovered')]
PR commit: aresponses temperatures=[1.0, 1.0]; content=[TextContent(cache_prompt=False, type='text', text='async responses recovered')]
This verifies the async Responses retry path still executes and returns a normal LLMResponse on the PR, but it does not prove the specific 0→1.0 retry mutation because this path already normalized/sent 1.0 before retry in the exercised configuration.
Unable to Verify
I could not produce a before/after aresponses() temperature sequence of [0.0, 0.0] on base versus [0.0, 1.0] on the PR. I tried the public LLM.aresponses() path with multiple model names and explicit temperature=0.0, but the transport-level kwargs recorded temperature=1.0 on the first call for both base and PR. Future QA guidance in AGENTS.md could document a concrete Responses API configuration/provider where temperature=0.0 reaches LiteLLM unchanged, if that path needs manual verification.
Issues Found
None from the verified behavior.
This review was created by an AI agent (OpenHands) on behalf of the user.
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
The root-cause diagnosis is precise, and the fix is the correct minimal change: tenacity's @retry decorator natively handles async def functions and threads retry_state.kwargs into every retry call, so the temperature bump written by before_sleep is automatically picked up. The old AsyncRetrying loop pattern had no wrapped function for tenacity to re-invoke, so the bump was silently discarded.
The accompanying refactoring (_prepare_*_params, _validate_chat_response, _build_*_result, _process_stream_event, _finalize_stream_response) is well-scoped and eliminates real duplication without introducing indirection. The defensive kwargs = dict(kwargs) copies in both _prepare_completion_params and _prepare_responses_params correctly guard against the kwargs['tools'] = ... mutation being visible to callers. The emit_deltas optimisation in _process_stream_event is a nice touch.
Test coverage is comprehensive: the three commit rounds added temperature-bump regression tests for acompletion, aresponses, and both streaming paths, plus retry-exhaustion and retry-then-succeed variants across all four call paths.
One minor observation below. Overall this is in good shape.
This review was generated by an AI agent (OpenHands) on behalf of the user via OpenHands Automation.
all-hands-bot
left a comment
There was a problem hiding this comment.
🟢 Taste Rating: Good taste — the retry wrapper change is simple and matches the sync path. I didn’t find actionable code issues in the current diff.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM. This touches core LLM retry behavior and can affect agent/eval performance, so I’m leaving a COMMENT rather than approving; human maintainer/eval judgment is recommended.
VERDICT: ✅ Worth merging after maintainer validation/eval sign-off.
KEY INSIGHT: Using Tenacity’s decorator restores one source of truth for retry kwargs across sync and async paths.
This review was generated by an AI agent (OpenHands) on behalf of the user.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26374664213
Avoids the per-retry sleep in the two stream-path tests without changing what the temperature-bump assertions verify.
Why
The
before_sleepcallback bumps temperature from 0→1.0 onLLMNoResponseErrorby mutatingretry_state.kwargs. In the sync path, tenacity's@retrydecorator wraps_one_attemptand automatically threadsretry_state.kwargsinto each retry call — so the temperature override is picked up viafinal_kwargs = {**call_kwargs, **retry_kwargs}.The async methods (
acompletion,aresponses) instead used anasync for attempt in AsyncRetryingloop. This pattern has no wrapped function for tenacity to re-invoke with updated kwargs.before_sleepwrotetemperature=1.0toretry_state.kwargs, but the loop body used a localcall_kwargsthat was never updated — so the bump was silently lost.With temperature=0 the model tends to repeat the same empty response, so all retries fail identically.
Repro (temperatures per attempt): sync
[0, 1.0, 1.0], async[0, 0, 0](before fix).Summary
AsyncRetryingloop inacompletionandaresponseswith the same@retrydecorator pattern used by the sync methods — tenacity's decorator natively handlesasync deffunctions and automatically threadsretry_state.kwargsinto each retry_current_metrics_snapshot()— replaces 4 identicalMetricsSnapshotconstructions_prepare_completion_params()— shared setup (format messages, tools, mock tools, select_chat_options, telemetry ctx)_prepare_responses_params()— shared setup for responses path_validate_chat_response()— shared post-processing (mock tools, telemetry, empty-choices check)async_retry()method andAsyncRetryingimport fromRetryMixinHow to Test
All 742 LLM tests pass (including 3 new async regression tests):
The key regression test (
test_async_no_response_retry_bumps_temperature) asserts that on the second attempt,litellm_acompletionis called withtemperature=1.0.Type
Notes
RetryMixin._build_before_sleepandretry_decoratorare unchangedAgent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:de3387c-pythonRun
All tags pushed for this build
About Multi-Architecture Support
de3387c-python) is a multi-arch manifest supporting both amd64 and arm64de3387c-python-amd64) are also available if needed