
Retry on 429 with Retry-After / exp backoff; omit_temperature for Opus 4.7+ #3

Open

to-ha wants to merge 4 commits into alexziskind1:main from to-ha:feat/cloud-retry-on-429

Conversation

to-ha commented May 9, 2026

Summary

This PR adds two cloud-resilience knobs to bench/client.py so that a bench.py run against a rate-limited hosted endpoint no longer aborts the moment a TPM cap kicks in, and so that Claude Opus 4.7+ (which deprecated temperature) can be used at all.

Both changes are backwards-compatible: existing local-server configs are unaffected. The new omit_temperature flag defaults to false, and the 429 retry loop only activates on a 429 response.

Changes

1. HTTP 429 retry with Retry-After / exponential backoff (bench/client.py)

  • On 429, sleep and retry up to MAX_RETRIES_ON_429 (3) times.
  • Honors the standard Retry-After header (integer seconds) when present.
  • Falls back to exponential backoff anchored at 60s when the header is missing.
  • Other 4xx/5xx errors are not retried (raise immediately, same as before).
  • Bounded retry: structurally-too-large requests still fail fast rather than looping forever. A sketch of the loop follows below.
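
For illustration, a minimal sketch of the retry loop. The function name post_with_retry, the BACKOFF_ANCHOR_SECONDS constant, and the exact httpx call shape are assumptions for this sketch, not the real bench/client.py code; only MAX_RETRIES_ON_429 (3) and the 60s anchor come from this PR:

```python
# Hypothetical sketch of the 429 retry loop described above.
import time

import httpx

MAX_RETRIES_ON_429 = 3
BACKOFF_ANCHOR_SECONDS = 60  # fallback anchor when Retry-After is absent


def post_with_retry(client: httpx.Client, url: str, payload: dict) -> httpx.Response:
    """POST `payload`, retrying only on HTTP 429, bounded by MAX_RETRIES_ON_429."""
    for attempt in range(MAX_RETRIES_ON_429 + 1):
        response = client.post(url, json=payload)
        if response.status_code != 429:
            response.raise_for_status()  # other 4xx/5xx raise immediately, as before
            return response
        if attempt == MAX_RETRIES_ON_429:
            break  # bounded: give up after the final retry instead of looping forever
        retry_after = response.headers.get("Retry-After")
        if retry_after is not None:
            delay = int(retry_after)  # standard integer-seconds form of the header
        else:
            delay = BACKOFF_ANCHOR_SECONDS * (2 ** attempt)  # 60, 120, 240, ...
        time.sleep(delay)
    raise RuntimeError(f"still rate-limited after {MAX_RETRIES_ON_429} retries on 429")
```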

2. omit_temperature flag (bench/client.py, bench/config.py)

  • New optional bool in ClientConfig. When true, the temperature field is not sent in the request payload.
  • Required for Claude Opus 4.7+, which deprecated the parameter and rejects requests that include it.
  • Default false; existing configs are unaffected. A sketch of the payload construction follows below.
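
A minimal sketch of the flag's effect on the payload, assuming a dataclass-style ClientConfig; every field name other than omit_temperature is hypothetical, since the real bench/config.py schema isn't shown here:

```python
# Hypothetical sketch of omit_temperature in payload construction.
from dataclasses import dataclass


@dataclass
class ClientConfig:
    model: str
    temperature: float = 0.0
    omit_temperature: bool = False  # new flag; defaults to false


def build_payload(cfg: ClientConfig, messages: list[dict]) -> dict:
    payload = {"model": cfg.model, "messages": messages}
    if not cfg.omit_temperature:
        payload["temperature"] = cfg.temperature  # dropped entirely for Opus 4.7+ configs
    return payload
```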

Tested against

  • gpt-5.1 jquery (16 functions, OpenAI Tier 1 / 500k TPM): 13 successful retries and the run completes; without the patch, fail-fast aborted after 2 errors.
  • claude-sonnet-4-6 http_server (Anthropic Tier 1 / 30k ITPM): 11/11 through repeated 60s sleeps with Retry-After honored.
  • claude-opus-4-7 http_server with omit_temperature=true: 11/11; the same config without the flag was rejected with "temperature is deprecated for this model".
  • Existing local configs (qwen3.5/3.6 family, no 429 path): unchanged behavior. The existing smoke_test.py and 6 local model runs pass.

New test file: client_smoke_test.py

Eight assertion cases covering the new client behavior, no live server. Uses unittest.mock to patch httpx.Client and time.sleep, runs in milliseconds. Style mirrors smoke_test.py (assert-based, exits non-zero on failure).

```
$ uv run python client_smoke_test.py
✓ happy path: single 200, returns content, sends temperature
✓ 429 with Retry-After: sleeps the indicated duration, retries once
✓ 429 without Retry-After: exponential backoff (60, 120)
✓ bounded retry: after 3 retries, raises RuntimeError
✓ non-429 error (500): raises immediately, no retry
✓ non-429 error (400): raises immediately, no retry
✓ omit_temperature=true: temperature absent from payload
✓ default (omit_temperature=false): temperature present in payload
✅ all client smoke checks passed
```
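
For reference, one of the eight cases might look roughly like this. This is a sketch that assumes the post_with_retry shape from the earlier sketch is importable from bench.client; the real client_smoke_test.py may structure its mocks differently:

```python
# Sketch of the "429 with Retry-After" case, patching httpx.Client and time.sleep.
from unittest import mock

import httpx

from bench.client import post_with_retry  # assumption: actual import path may differ


def check_429_with_retry_after():
    rate_limited = mock.Mock(status_code=429, headers={"Retry-After": "7"})
    ok = mock.Mock(status_code=200, headers={})
    client = mock.Mock(spec=httpx.Client)
    client.post.side_effect = [rate_limited, ok]  # first call 429, second call 200

    with mock.patch("time.sleep") as fake_sleep:  # no real waiting: runs in milliseconds
        response = post_with_retry(client, "https://example.invalid/v1/chat/completions", {})

    fake_sleep.assert_called_once_with(7)  # Retry-After header honored
    assert client.post.call_count == 2     # exactly one retry
    assert response.status_code == 200
    print("✓ 429 with Retry-After: sleeps the indicated duration, retries once")
```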

Configs added

  • configs/models/claude-opus-4-7.toml — uses omit_temperature=true (Opus 4.7 deprecated the field).
  • configs/models/gpt-5-1.toml — GPT-5.1 chat-completions, mainstream OpenAI frontier.
  • configs/models/gpt-5-2.toml — GPT-5.2 reasoning-on-by-default; reasoning_effort="none" skips CoT.

configs/models/claude-sonnet-4-6.toml tightened: max_tokens 8000 → 1500 (the 20-line answer needs much less than 8000), and relax_indent=true added for consistency with the local-MLX configs that ship the same flag.
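
For illustration, the Opus config might look roughly like this. A hypothetical sketch: only omit_temperature = true is confirmed by this PR, and the remaining field names are assumptions about the repo's TOML schema:

```toml
# configs/models/claude-opus-4-7.toml — hypothetical sketch; field names other
# than omit_temperature are assumptions about the repo's config schema.
name = "claude-opus-4-7"
omit_temperature = true   # Opus 4.7+ rejects requests that include temperature
max_tokens = 1500
```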

Docs

  • README.md: new "Hosted endpoints with rate limits" section with example output, plus client_smoke_test.py added to the module map.
  • configs/CONFIG_README.md: omit_temperature in the model-config field table and the hosted-model gotcha matrix.

to-ha added 4 commits May 9, 2026 06:14
…e flag

Two cloud-resilience knobs added to the OpenAI-compatible client:

- HTTP 429 responses are now retried up to 3 times. Honors the standard
  `Retry-After` header (integer seconds) when present; falls back to
  exponential backoff anchored at 60s otherwise. After exhausting retries
  the original RuntimeError is raised, so structurally-too-large requests
  (e.g. a single call exceeding Anthropic's per-minute input-token cap on
  Tier 1) still fail fast rather than looping. Other 4xx/5xx errors are
  unchanged — they raise immediately.

- New `omit_temperature` flag in ClientConfig. When true, the `temperature`
  field is not sent in the request payload. Needed for Claude Opus 4.7+,
  which deprecated the parameter and rejects requests that include it.
  Default false; existing configs are unaffected.

Tested against:
- gpt-5.1 jquery (16 functions, OpenAI Tier 1 / 500k TPM): 13 successful
  retries and the run completes; without the patch, fail-fast aborted
  after 2 errors.
- claude-sonnet-4-6 http_server (Anthropic Tier 1 / 30k ITPM): 11/11
  through repeated 60s sleeps.
- claude-opus-4-7 http_server: 11/11 with omit_temperature=true.

Eight assertion cases covering the new client behavior, no live server:

- happy path (single 200, no retry, no sleep)
- 429 with Retry-After header → sleeps the indicated duration
- 429 without Retry-After → exponential backoff (60, 120, …)
- bounded retry: after MAX_RETRIES_ON_429 the underlying error surfaces
- 500 / 400 errors are not retried (raise immediately)
- omit_temperature=true drops the field from the payload
- omit_temperature default (false) keeps the field

Uses unittest.mock to patch httpx.Client and time.sleep so the test runs
in milliseconds. Style mirrors smoke_test.py (assert-based, exits non-zero
on failure).

- README: new section under server-setup notes describing the 429-retry
  behavior with example output line and the bounded-retry semantics.
  Module map gains client_smoke_test.py.
- CONFIG_README: omit_temperature added to the model-config field table
  and to the hosted-model gotcha matrix (Claude Opus 4.7+, hosted-tier
  rate limits).

… defaults

Three new model configs:

- claude-opus-4-7: uses omit_temperature=true (Opus 4.7 deprecated the
  field and rejects requests that include it).
- gpt-5-1: GPT-5.1 chat-completions, mainstream OpenAI frontier, requires
  use_max_completion_tokens=true and temperature=1.0 like the rest of the
  GPT-5 family.
- gpt-5-2: GPT-5.2 reasoning-on-by-default; reasoning_effort='none' skips
  CoT (accepted on the GPT-5 family).

claude-sonnet-4-6 tightened: max_tokens 8000 → 1500 (the 20-line answer
needs much less), and relax_indent=true added for consistency with the
local-MLX configs that ship the same flag.