feat: add per-credential response-header-timeout and treat 524 as transient error#2060
feat: add per-credential response-header-timeout and treat 524 as transient error#2060shellus wants to merge 3 commits intorouter-for-me:mainfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the resilience and responsiveness of the proxy when dealing with slow or unresponsive upstream services, particularly those behind Cloudflare. By allowing per-credential configuration of response header timeouts and treating Cloudflare's 524 timeout as a transient error, the system can detect and recover from issues much faster, reducing overall request latency during credential failures from several minutes to under a minute. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
977edce to
b5b601a
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces a per-credential response-header-timeout and adds HTTP 524 to the list of transient errors. The changes are generally well-implemented, but I found a scenario where the new timeout is not applied correctly. Specifically, if an http.Transport is provided via context, the response-header-timeout is ignored. I've suggested a fix to address this. The addition of HTTP 524 is correct.
Note: Security Review did not run due to the size of the PR.
| // Apply ResponseHeaderTimeout when no proxy transport was set. | ||
| if respHeaderTimeout > 0 && httpClient.Transport == nil { | ||
| httpClient.Transport = &http.Transport{ | ||
| ResponseHeaderTimeout: respHeaderTimeout, | ||
| } | ||
| } |
There was a problem hiding this comment.
The current logic for applying ResponseHeaderTimeout doesn't handle the case where an http.Transport is already set on the httpClient from the context. The timeout is only applied if httpClient.Transport is nil.
This means if a transport is provided via ctx.Value("cliproxy.roundtripper"), the per-credential timeout will be ignored, which is likely not the intended behavior.
I suggest modifying this block to also handle cases where an *http.Transport is already present. It's important to clone the existing transport before modifying it to avoid side effects on shared transports.
// Apply ResponseHeaderTimeout if set.
if respHeaderTimeout > 0 {
if transport, ok := httpClient.Transport.(*http.Transport); ok {
// Clone the transport to avoid modifying a shared instance.
clonedTransport := transport.Clone()
clonedTransport.ResponseHeaderTimeout = respHeaderTimeout
httpClient.Transport = clonedTransport
} else if httpClient.Transport == nil {
httpClient.Transport = &http.Transport{
ResponseHeaderTimeout: respHeaderTimeout,
}
}
}There was a problem hiding this comment.
Fixed in 6aeb2aa. Now the code clones the existing transport (if it's *http.Transport) and applies ResponseHeaderTimeout to the clone, avoiding modification of shared instances. Thanks for catching this!
b5b601a to
6aeb2aa
Compare
…nsient error
- Add ResponseHeaderTimeout config field to ClaudeKey for per-upstream
timeout control (in seconds). Only limits wait for first response
header; streaming responses are unaffected.
- Pass timeout value through Auth.Attributes to executor layer.
- Apply ResponseHeaderTimeout to http.Transport in proxy_helpers.go,
covering proxy, context-RoundTripper, and bare-transport code paths.
- Add 524 (Cloudflare timeout) to transient error list in conductor.go.
- Wrap Go net.Error timeout as statusErr{code: 504} in claude_executor.go
so conductor properly applies 1-minute cooldown on timeout failures,
preventing repeated 15s waits on the same failing upstream.
luispater
left a comment
There was a problem hiding this comment.
Thanks for the improvement here. The intent makes sense, but I think this still has two blocking issues:
- Blocking: the new transient handling is only wired into
applyAuthFailureState, while normal request execution records failures through the model-scoped branch inMarkResult. That branch still only treats408/500/502/503/504as transient and still hard-codes a 1 minute cooldown. In practice,524will not trigger cooldown for normal Claude requests, andtransient-error-cooldownwill be ignored as well. - Blocking: in
newProxyAwareHTTPClient, whenresponse_header_timeoutis set and there is no proxy/context transport, the code creates a zero-valuehttp.Transportinstead of cloning the default transport. That drops existing default behavior such asProxyFromEnvironment, dial/TLS timeouts, HTTP/2, and idle connection settings.
I’d recommend fixing those two paths and adding targeted tests for model-scoped cooldown handling plus the no-proxy transport construction.
Code ReviewGood intent on per-credential timeouts and treating 524 as transient. Found two blocking issues. Blocking Issues1. Transient 524 handling not wired for normal request path The new transient error logic only applies to // new transient error added here
if statusCode == 524 || ... {
...
}But normal Claude requests record failures through the model-scoped branch in if statusCode == 408 || statusCode == 500 || ... || statusCode == 504 {
// 524 is NOT in this list
...
}In practice:
2. Zero-value Transport drops defaults In if responseHeaderTimeout > 0 {
return &http.Client{
Timeout: totalTimeout,
Transport: &http.Transport{
ResponseHeaderTimeout: responseHeaderTimeout,
},
}
}When no proxy/context transport exists, this creates a zero-value
Instead, clone if responseHeaderTimeout > 0 {
transport := http.DefaultTransport.(*http.Transport).Clone()
transport.ResponseHeaderTimeout = responseHeaderTimeout
return &http.Client{
Timeout: totalTimeout,
Transport: transport,
}
}TestsAdd:
SummaryRequest changes before merge. The transient handling needs full coverage, and the transport construction should preserve default behavior. |
- Add 524 to transient error list in MarkResult model-scoped branch - Add transient_error_cooldown config support in model-scoped branch - Use http.DefaultTransport.Clone() instead of zero-value Transport Fixes code review feedback from PR router-for-me#2060
e0de114 to
0c70af4
Compare
|
Thanks for the detailed review! Both issues have been fixed in the latest commit. Issue A: 524 not covered in model-scoped pathFixed in Added 524 to the transient error list and implemented case 408, 500, 502, 503, 504, 524:
if quotaCooldownDisabledForAuth(auth) {
state.NextRetryAfter = time.Time{}
} else {
cooldown := 1 * time.Minute
if auth.Attributes != nil {
if v, ok := auth.Attributes["transient_error_cooldown"]; ok {
if secs, err := strconv.Atoi(v); err == nil && secs > 0 {
cooldown = time.Duration(secs) * time.Second
}
}
}
next := now.Add(cooldown)
state.NextRetryAfter = next
}Now both normal Claude requests (99%) and auth requests (1%) properly handle 524 with configurable cooldown. Issue B: Zero-value Transport drops defaultsFixed in Replaced zero-value } else if httpClient.Transport == nil {
transport := http.DefaultTransport.(*http.Transport).Clone()
transport.ResponseHeaderTimeout = respHeaderTimeout
httpClient.Transport = transport
}This ensures Both fixes are now live in the PR branch. Ready for re-review. |
Summary
response-header-timeoutconfig field toclaude-api-keyentries (per-credential, in seconds)net.Errortimeout asstatusErr{code: 504}so conductor properly applies 1-minute cooldown on timeout failuresMotivation
Upstreams behind Cloudflare CDN have a ~100s default 524 timeout. Without per-credential timeout control, each failing credential wastes 2+ minutes before failover triggers. With
response-header-timeout: 15, failover triggers within ~15 seconds. The 504 wrapping ensures the failing credential enters cooldown, so subsequent requests skip it immediately.Config example
Files changed
internal/config/config.goResponseHeaderTimeoutfield toClaudeKeyinternal/watcher/synthesizer/config.goAuth.Attributesinternal/runtime/executor/proxy_helpers.goResponseHeaderTimeouttohttp.Transportsdk/cliproxy/auth/conductor.gointernal/runtime/executor/claude_executor.gonet.Errortimeout asstatusErr{code: 504}Test plan
response-header-timeout: 15on an upstream behind Cloudflareresponse-header-timeoutbehave as before (no timeout)