
Retry on 429 with Retry-After / exp backoff; omit_temperature for Opus 4.7+ #3

Open

to-ha wants to merge 4 commits into alexziskind1:main from to-ha:feat/cloud-retry-on-429

Conversation

to-ha commented May 9, 2026

Summary

This PR adds two cloud-resilience knobs to bench/client.py so that a bench.py run against a rate-limited hosted endpoint no longer aborts the moment a TPM cap kicks in, and so that Claude Opus 4.7+ (which deprecated temperature) can be used at all.

Both changes are backwards-compatible: existing local-server configs are unaffected. The new omit_temperature flag defaults to false, and the 429 retry loop only activates on a 429 response.

Changes

1. HTTP 429 retry with Retry-After / exponential backoff (bench/client.py)

  • On 429, sleep and retry up to MAX_RETRIES_ON_429 (3) times.
  • Honors the standard Retry-After header (integer seconds) when present.
  • Falls back to exponential backoff anchored at 60s when the header is missing.
  • Other 4xx/5xx errors are not retried (raise immediately, same as before).
  • Bounded retry: structurally-too-large requests still fail fast rather than looping forever. A sketch of the loop follows below.
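
For illustration, a minimal sketch of the retry loop. The function name post_with_retry, the BACKOFF_ANCHOR_SECONDS constant, and the exact httpx call shape are assumptions for this sketch, not the real bench/client.py code; only MAX_RETRIES_ON_429 (3) and the 60s anchor come from this PR:

```python
# Hypothetical sketch of the 429 retry loop described above.
import time

import httpx

MAX_RETRIES_ON_429 = 3
BACKOFF_ANCHOR_SECONDS = 60  # fallback anchor when Retry-After is absent


def post_with_retry(client: httpx.Client, url: str, payload: dict) -> httpx.Response:
    """POST `payload`, retrying only on HTTP 429, bounded by MAX_RETRIES_ON_429."""
    for attempt in range(MAX_RETRIES_ON_429 + 1):
        response = client.post(url, json=payload)
        if response.status_code != 429:
            response.raise_for_status()  # other 4xx/5xx raise immediately, as before
            return response
        if attempt == MAX_RETRIES_ON_429:
            break  # bounded: give up after the final retry instead of looping forever
        retry_after = response.headers.get("Retry-After")
        if retry_after is not None:
            delay = int(retry_after)  # standard integer-seconds form of the header
        else:
            delay = BACKOFF_ANCHOR_SECONDS * (2 ** attempt)  # 60, 120, 240, ...
        time.sleep(delay)
    raise RuntimeError(f"still rate-limited after {MAX_RETRIES_ON_429} retries on 429")
```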

2. omit_temperature flag (bench/client.py, bench/config.py)

  • New optional bool in ClientConfig. When true, the temperature field is not sent in the request payload.
  • Required for Claude Opus 4.7+, which deprecated the parameter and rejects requests that include it.
  • Default false; existing configs are unaffected. A sketch of the payload construction follows below.
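
A minimal sketch of the flag's effect on the payload, assuming a dataclass-style ClientConfig; every field name other than omit_temperature is hypothetical, since the real bench/config.py schema isn't shown here:

```python
# Hypothetical sketch of omit_temperature in payload construction.
from dataclasses import dataclass


@dataclass
class ClientConfig:
    model: str
    temperature: float = 0.0
    omit_temperature: bool = False  # new flag; defaults to false


def build_payload(cfg: ClientConfig, messages: list[dict]) -> dict:
    payload = {"model": cfg.model, "messages": messages}
    if not cfg.omit_temperature:
        payload["temperature"] = cfg.temperature  # dropped entirely for Opus 4.7+ configs
    return payload
```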

Tested against

  • gpt-5.1 jquery (16 functions, OpenAI Tier 1 / 500k TPM): 13 successful retries and the run completes; without the patch, fail-fast aborted after 2 errors.
  • claude-sonnet-4-6 http_server (Anthropic Tier 1 / 30k ITPM): 11/11 through repeated 60s sleeps with Retry-After honored.
  • claude-opus-4-7 http_server with omit_temperature=true: 11/11; the same config without the flag was rejected with "temperature is deprecated for this model".
  • Existing local configs (qwen3.5/3.6 family, no 429 path): unchanged behavior. The existing smoke_test.py and 6 local model runs pass.

New test file: client_smoke_test.py

Eight assertion cases covering the new client behavior, no live server. Uses unittest.mock to patch httpx.Client and time.sleep, runs in milliseconds. Style mirrors smoke_test.py (assert-based, exits non-zero on failure).

```
$ uv run python client_smoke_test.py
✓ happy path: single 200, returns content, sends temperature
✓ 429 with Retry-After: sleeps the indicated duration, retries once
✓ 429 without Retry-After: exponential backoff (60, 120)
✓ bounded retry: after 3 retries, raises RuntimeError
✓ non-429 error (500): raises immediately, no retry
✓ non-429 error (400): raises immediately, no retry
✓ omit_temperature=true: temperature absent from payload
✓ default (omit_temperature=false): temperature present in payload
✅ all client smoke checks passed
```
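
For reference, one of the eight cases might look roughly like this. This is a sketch that assumes the post_with_retry shape from the earlier sketch is importable from bench.client; the real client_smoke_test.py may structure its mocks differently:

```python
# Sketch of the "429 with Retry-After" case, patching httpx.Client and time.sleep.
from unittest import mock

import httpx

from bench.client import post_with_retry  # assumption: actual import path may differ


def check_429_with_retry_after():
    rate_limited = mock.Mock(status_code=429, headers={"Retry-After": "7"})
    ok = mock.Mock(status_code=200, headers={})
    client = mock.Mock(spec=httpx.Client)
    client.post.side_effect = [rate_limited, ok]  # first call 429, second call 200

    with mock.patch("time.sleep") as fake_sleep:  # no real waiting: runs in milliseconds
        response = post_with_retry(client, "https://example.invalid/v1/chat/completions", {})

    fake_sleep.assert_called_once_with(7)  # Retry-After header honored
    assert client.post.call_count == 2     # exactly one retry
    assert response.status_code == 200
    print("✓ 429 with Retry-After: sleeps the indicated duration, retries once")
```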

Configs added

  • configs/models/claude-opus-4-7.toml — uses omit_temperature=true (Opus 4.7 deprecated the field).
  • configs/models/gpt-5-1.toml — GPT-5.1 chat-completions, mainstream OpenAI frontier.
  • configs/models/gpt-5-2.toml — GPT-5.2 reasoning-on-by-default; reasoning_effort="none" skips CoT.

configs/models/claude-sonnet-4-6.toml tightened: max_tokens 8000 → 1500 (the 20-line answer needs much less than 8000), and relax_indent=true added for consistency with the local-MLX configs that ship the same flag.
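
For illustration, the Opus config might look roughly like this. A hypothetical sketch: only omit_temperature = true is confirmed by this PR, and the remaining field names are assumptions about the repo's TOML schema:

```toml
# configs/models/claude-opus-4-7.toml — hypothetical sketch; field names other
# than omit_temperature are assumptions about the repo's config schema.
name = "claude-opus-4-7"
omit_temperature = true   # Opus 4.7+ rejects requests that include temperature
max_tokens = 1500
```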

Docs

  • README.md: new "Hosted endpoints with rate limits" section with example output, plus client_smoke_test.py added to the module map.
  • configs/CONFIG_README.md: omit_temperature in the model-config field table and the hosted-model gotcha matrix.

to-ha added 4 commits May 9, 2026 06:14
…e flag

Two cloud-resilience knobs added to the OpenAI-compatible client:

- HTTP 429 responses are now retried up to 3 times. Honors the standard
  `Retry-After` header (integer seconds) when present; falls back to
  exponential backoff anchored at 60s otherwise. After exhausting retries
  the original RuntimeError is raised, so structurally-too-large requests
  (e.g. a single call exceeding Anthropic's per-minute input-token cap on
  Tier 1) still fail fast rather than looping. Other 4xx/5xx errors are
  unchanged — they raise immediately.

- New `omit_temperature` flag in ClientConfig. When true, the `temperature`
  field is not sent in the request payload. Needed for Claude Opus 4.7+,
  which deprecated the parameter and rejects requests that include it.
  Default false; existing configs are unaffected.

Tested against:
- gpt-5.1 jquery (16 functions, OpenAI Tier 1 / 500k TPM): 13 successful
  retries and the run completes; without the patch, fail-fast aborted
  after 2 errors.
- claude-sonnet-4-6 http_server (Anthropic Tier 1 / 30k ITPM): 11/11
  through repeated 60s sleeps.
- claude-opus-4-7 http_server: 11/11 with omit_temperature=true.

Eight assertion cases covering the new client behavior, no live server:

- happy path (single 200, no retry, no sleep)
- 429 with Retry-After header → sleeps the indicated duration
- 429 without Retry-After → exponential backoff (60, 120, …)
- bounded retry: after MAX_RETRIES_ON_429 the underlying error surfaces
- 500 / 400 errors are not retried (raise immediately)
- omit_temperature=true drops the field from the payload
- omit_temperature default (false) keeps the field

Uses unittest.mock to patch httpx.Client and time.sleep so the test runs
in milliseconds. Style mirrors smoke_test.py (assert-based, exits non-zero
on failure).

- README: new section under server-setup notes describing the 429-retry
  behavior with example output line and the bounded-retry semantics.
  Module map gains client_smoke_test.py.
- CONFIG_README: omit_temperature added to the model-config field table
  and to the hosted-model gotcha matrix (Claude Opus 4.7+, hosted-tier
  rate limits).

… defaults

Three new model configs:

- claude-opus-4-7: uses omit_temperature=true (Opus 4.7 deprecated the
  field and rejects requests that include it).
- gpt-5-1: GPT-5.1 chat-completions, mainstream OpenAI frontier, requires
  use_max_completion_tokens=true and temperature=1.0 like the rest of the
  GPT-5 family.
- gpt-5-2: GPT-5.2 reasoning-on-by-default; reasoning_effort='none' skips
  CoT (accepted on the GPT-5 family).

claude-sonnet-4-6 tightened: max_tokens 8000 → 1500 (the 20-line answer
needs much less), and relax_indent=true added for consistency with the
local-MLX configs that ship the same flag.