Retry on 429 with Retry-After / exp backoff; omit_temperature for Opus 4.7+ #3
Open
to-ha wants to merge 4 commits into
Conversation
…e flag

Two cloud-resilience knobs added to the OpenAI-compatible client:

- HTTP 429 responses are now retried up to 3 times. Honors the standard `Retry-After` header (integer seconds) when present; falls back to exponential backoff anchored at 60s otherwise. After exhausting retries the original RuntimeError is raised, so structurally-too-large requests (e.g. a single call exceeding Anthropic's per-minute input-token cap on Tier 1) still fail fast rather than looping. Other 4xx/5xx errors are unchanged: they raise immediately. A minimal sketch of the loop follows this note.
- New `omit_temperature` flag in ClientConfig. When true, the `temperature` field is not sent in the request payload. Needed for Claude Opus 4.7+, which deprecated the parameter and rejects requests that include it. Defaults to false; existing configs are unaffected.

Tested against:

- gpt-5.1 jquery (16 functions, OpenAI Tier 1 / 500k TPM): 13 successful retries; the run completes where the pre-patch fail-fast aborted after 2 errors.
- claude-sonnet-4-6 http_server (Anthropic Tier 1 / 30k ITPM): 11/11 through repeated 60s sleeps.
- claude-opus-4-7 http_server: 11/11 with omit_temperature=true.
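For reference, a minimal sketch of the retry loop described above, assuming an `httpx`-based client. `MAX_RETRIES_ON_429`, the 60s backoff anchor, and the `Retry-After` handling come from this PR; the function name and surrounding shape are illustrative, not the actual `bench/client.py` code.

```python
import time

import httpx

MAX_RETRIES_ON_429 = 3   # retry budget from this PR
BACKOFF_BASE_S = 60      # fallback backoff anchor: 60s, 120s, 240s

def post_with_429_retry(client: httpx.Client, url: str, payload: dict) -> httpx.Response:
    """POST `payload`; retry only on HTTP 429, everything else fails fast."""
    for attempt in range(MAX_RETRIES_ON_429 + 1):
        resp = client.post(url, json=payload)
        if resp.status_code != 429:
            if resp.status_code >= 400:
                # Other 4xx/5xx are unchanged: raise immediately, no retry.
                raise RuntimeError(f"HTTP {resp.status_code}: {resp.text}")
            return resp
        if attempt == MAX_RETRIES_ON_429:
            break  # budget exhausted; surface the error below
        retry_after = resp.headers.get("Retry-After")
        if retry_after and retry_after.isdigit():
            delay = int(retry_after)                 # honor the server's hint
        else:
            delay = BACKOFF_BASE_S * (2 ** attempt)  # 60, 120, 240, ...
        time.sleep(delay)
    raise RuntimeError(f"HTTP 429 persisted after {MAX_RETRIES_ON_429} retries")
```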
Eight assertion cases covering the new client behavior, no live server:

- happy path (single 200, no retry, no sleep)
- 429 with Retry-After header → sleeps the indicated duration
- 429 without Retry-After → exponential backoff (60, 120, …)
- bounded retry: after MAX_RETRIES_ON_429 the underlying error surfaces
- 500 / 400 errors are not retried (raise immediately)
- omit_temperature=true drops the field from the payload
- omit_temperature default (false) keeps the field

Uses unittest.mock to patch httpx.Client and time.sleep so the test runs in milliseconds; one case is sketched after this list. Style mirrors smoke_test.py (assert-based, exits non-zero on failure).
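One of the eight cases, sketched under the same assumptions and reusing the hypothetical `post_with_429_retry` from the sketch above; the real `client_smoke_test.py` may name things differently.

```python
import unittest.mock as mock

import httpx

def test_429_honors_retry_after():
    """429 with Retry-After: the client sleeps the indicated duration, then succeeds."""
    req = httpx.Request("POST", "http://test")
    limited = httpx.Response(429, headers={"Retry-After": "7"}, request=req)
    ok = httpx.Response(200, json={"choices": []}, request=req)
    fake_client = mock.Mock(spec=httpx.Client)
    fake_client.post.side_effect = [limited, ok]   # first call 429, second 200
    with mock.patch("time.sleep") as fake_sleep:   # no real waiting
        resp = post_with_429_retry(fake_client, "http://test", {})
    assert resp.status_code == 200
    fake_sleep.assert_called_once_with(7)          # slept exactly the header value
    assert fake_client.post.call_count == 2

test_429_honors_retry_after()
```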
- README: new section under the server-setup notes describing the 429-retry behavior, with an example output line and the bounded-retry semantics. The module map gains client_smoke_test.py.
- CONFIG_README: omit_temperature added to the model-config field table and to the hosted-model gotcha matrix (Claude Opus 4.7+, hosted-tier rate limits).
… defaults

Three new model configs:

- claude-opus-4-7: uses omit_temperature=true (Opus 4.7 deprecated the field and rejects requests that include it); a hypothetical sketch of the config follows this note.
- gpt-5-1: GPT-5.1 chat-completions, mainstream OpenAI frontier; requires use_max_completion_tokens=true and temperature=1.0 like the rest of the GPT-5 family.
- gpt-5-2: GPT-5.2 is reasoning-on-by-default; reasoning_effort='none' skips CoT (accepted on the GPT-5 family).

claude-sonnet-4-6 tightened: max_tokens 8000 → 1500 (the 20-line answer needs much less), and relax_indent=true added for consistency with the local-MLX configs that ship the same flag.
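A hypothetical sketch of what `configs/models/claude-opus-4-7.toml` could look like; only `omit_temperature` (and the deprecation rationale) is confirmed by this PR, every other key is illustrative.

```toml
# Illustrative only: key names other than omit_temperature are assumptions.
model = "claude-opus-4-7"

# Opus 4.7+ rejects requests that include `temperature`,
# so tell the client to drop the field from the payload entirely.
omit_temperature = true
```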
Summary
Two cloud-resilience knobs in `bench/client.py`, so that a `bench.py` run against a rate-limited hosted endpoint no longer aborts the moment a TPM cap kicks in, and so that Claude Opus 4.7+ (which deprecated `temperature`) can be used at all. Both changes are backwards-compatible; existing local-server configs are unaffected. The new `omit_temperature` defaults to `false`, and the 429 retry loop only activates on 429.

Changes
1. HTTP 429 retry with `Retry-After` / exponential backoff (`bench/client.py`). Retries up to `MAX_RETRIES_ON_429` (3) times; honors the `Retry-After` header (integer seconds) when present.
2. `omit_temperature` flag (`bench/client.py`, `bench/config.py`). New field in `ClientConfig`; when true, the `temperature` field is not sent in the request payload (a short sketch follows).
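A minimal sketch of how the flag could gate payload construction, assuming a dataclass-style `ClientConfig`; apart from `omit_temperature` and `temperature`, the names here are illustrative rather than the actual `bench/config.py` definitions.

```python
from dataclasses import dataclass

@dataclass
class ClientConfig:
    model: str
    temperature: float = 0.0
    omit_temperature: bool = False   # new in this PR; defaults to false

def build_payload(cfg: ClientConfig, messages: list) -> dict:
    payload = {"model": cfg.model, "messages": messages}
    if not cfg.omit_temperature:
        # Claude Opus 4.7+ rejects requests that include the field,
        # so only attach it when the config says the model accepts it.
        payload["temperature"] = cfg.temperature
    return payload

# Opus 4.7+ config: `temperature` never enters the request payload.
assert "temperature" not in build_payload(
    ClientConfig(model="claude-opus-4-7", omit_temperature=True), []
)
```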
Tested against

- `gpt-5.1` jquery (16 functions, OpenAI Tier 1 / 500k TPM): 13 successful retries and the run completes; without the patch, fail-fast aborted after 2 errors.
- `claude-sonnet-4-6` http_server (Anthropic Tier 1 / 30k ITPM): 11/11 through repeated 60s sleeps with `Retry-After` honored.
- `claude-opus-4-7` http_server with `omit_temperature=true`: 11/11; the same config without the flag was rejected with `temperature is deprecated for this model`.
- `smoke_test.py` and 6 local model runs pass.

New test file: `client_smoke_test.py`
Eight assertion cases covering the new client behavior, no live server. Uses `unittest.mock` to patch `httpx.Client` and `time.sleep`; runs in milliseconds. Style mirrors `smoke_test.py` (assert-based, exits non-zero on failure).

Configs added
- `configs/models/claude-opus-4-7.toml`: uses `omit_temperature=true` (Opus 4.7 deprecated the field).
- `configs/models/gpt-5-1.toml`: GPT-5.1 chat-completions, mainstream OpenAI frontier.
- `configs/models/gpt-5-2.toml`: GPT-5.2 reasoning-on-by-default; `reasoning_effort="none"` skips CoT.
- `configs/models/claude-sonnet-4-6.toml` tightened: `max_tokens` 8000 → 1500 (the 20-line answer needs much less than 8000), plus `relax_indent=true` for consistency with the local-MLX configs that ship the same flag.

Docs
- `README.md`: new "Hosted endpoints with rate limits" section with example output, plus `client_smoke_test.py` added to the module map.
- `configs/CONFIG_README.md`: `omit_temperature` in the model-config field table and the hosted-model gotcha matrix.