fix(websocket): reset pinned auth after unrecoverable upstream auth errors#1946
fix(websocket): reset pinned auth after unrecoverable upstream auth errors#1946excelwang wants to merge 2 commits intorouter-for-me:mainfrom
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
Concrete failure mode this fixes:
So a single bad upstream credential can effectively poison the whole interactive session until reconnect. This PR only resets the pinned auth on unrecoverable upstream auth/quota signals so the next turn can fail over normally. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 875491db34
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return false | ||
| } | ||
| switch errMsg.StatusCode { | ||
| case http.StatusUnauthorized, http.StatusPaymentRequired, http.StatusForbidden, http.StatusNotFound, http.StatusTooManyRequests: |
There was a problem hiding this comment.
Avoid resetting pinned auth on generic 404 responses
shouldResetPinnedAuthForWebsocketError currently treats any 404 as an unrecoverable auth error, but websocket upstreams can return 404 for request/state issues (for example, stale previous_response_id) that are not credential failures. In that case this branch clears pinnedAuthID and closes the execution session, so the next turn can be routed to a different auth and lose session continuity even if the client corrects the request. This makes a recoverable request error degrade into cross-auth context loss.
Useful? React with 👍 / 👎.
xkonjin
left a comment
There was a problem hiding this comment.
Quick review pass:
- Main risk area here is auth/session state and stale credential handling.
- Good to see test coverage move with the code; I’d still make sure it exercises the unhappy path around auth/session state and stale credential handling rather than only the happy path.
- Before merge, I’d smoke-test the behavior touched by openai_responses_websocket.go, openai_responses_websocket_test.go with malformed input / retry / rollback cases, since that’s where this class of change usually breaks.
|
Follow-up on the stale credential / unhappy-path concern: I re-ran targeted tests against the touched websocket path. Re-ran locally:
Coverage split is:
This change does not alter retry backoff logic itself; it only decides when a poisoned pinned auth must be dropped so the next turn can fail over normally. |
luispater
left a comment
There was a problem hiding this comment.
Summary
This correctly clears the pinned auth after unrecoverable websocket auth/quota errors, but it still leaves websocket-v2 continuation enabled for the next turn.
Key findings
- Blocking: after
resetPinnedAuth, the handler clearspinnedAuthID, but the next request still falls back towebsocketUpstreamSupportsIncrementalInputForModel(...). That check only asks whether any available auth for the model supports websocket incremental input, sonormalizeResponseSubsequentRequest(...)can still forward the oldprevious_response_idunchanged to a newly selected auth. At that point the new auth does not own the old upstream continuation/session, so the next turn can still fail with the same continuation error instead of actually failing over. - Non-blocking: the added test only covers the helper classification. Please add an end-to-end regression that pins auth A, forces an unrecoverable websocket error, then sends a follow-up turn with
previous_response_idand verifies the handler no longer forwards that upstream continuation token after the pin reset.
Test plan
go test ./sdk/api/handlers/openai -count=1
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Follow-up pushed in This addresses the continuation-token blocker, not just the helper classification. After an unrecoverable websocket auth/quota error resets the pinned auth, the next turn now forces incremental continuation off once, so the stale I also added the requested end-to-end regression test:
That scenario covers:
Re-ran:
|
xkonjin
left a comment
There was a problem hiding this comment.
Code Review
Overall: ✅ Approvable — the core logic is correct, but the continuation-token leak that luispater flagged needs verification.
What's good
resetPinnedAuthDisablesIncrementalInputflag correctly blocksprevious_response_idforwarding for exactly one turn after a pin reset — the flag is cleared only after a successful non-resetting turn (!resetPinnedAuth), which is the right state machine.shouldResetPinnedAuthForWebsocketErroris well-scoped: status codes 401/402/403/404/429 plus textual quota/invalidation signals. Limiting 404 to the websocket path (where it indicates the upstream session is gone) is correct — this is a different semantic than 404 on REST.CloseExecutionSessionis called on pin reset, which is important for connection pool hygiene.- The end-to-end test (
TestResponsesWebsocket_DropsPreviousResponseIDAfterPinnedAuthReset) exercises the full sequence: success → auth failure → failover with stripped continuation token. This directly addresses luispater's requested regression test.
Issue (luispater's blocker — needs verification)
The test verifies executor.payloads[2] does not contain previous_response_id. But look at the turn-3 request construction:
thirdRequest := fmt.Sprintf(`{..., "previous_response_id":%q,...}`, previousResponseID)The client sends previous_response_id on turn 3. The test then checks that the upstream payload (executor.payloads[2]) doesn't contain it — which exercises normalizeResponseSubsequentRequest stripping the field when allowIncrementalInputWithPreviousResponseID == false.
Please confirm that normalizeResponseSubsequentRequest (or equivalent) is the codepath that removes previous_response_id when the flag is false, and that this behavior is exercised here. If the test passes by coincidence (e.g. the field is stripped elsewhere unconditionally), the regression coverage is weaker than it appears.
Non-blocking
- The
resetPinnedAuthDisablesIncrementalInputflag persists until a successful non-resetting turn. If there are multiple consecutive auth failures (second auth also fails), the flag stays true correctly — but consider whether back-to-back resets could put the session in a state where no incremental input is ever re-enabled if new auths keep failing. Probably fine for the edge case, but worth a comment. strings.TrimSpace(pinnedAuthID)is used in the log comparison butpinnedAuthIDis always assigned from auth ID strings that shouldn't have whitespace. The trim is harmless but slightly misleading.
Summary
This fixes a severe websocket failover bug where a session could stay pinned to an exhausted or invalid upstream auth even after the upstream had already returned an unrecoverable credential error.
In practice, once a websocket session got pinned to a bad auth, subsequent turns in the same session could continue failing instead of switching to another healthy credential. For multi-auth deployments, that turns a single upstream auth failure into a user-visible session outage.
Root Cause
/v1/responseswebsocket handling keeps a pinned auth for session continuity, but the pin was not cleared when upstream returned quota exhaustion or token invalidation style errors.Fix
Why this matters
This is operationally severe because:
Tests
go test ./sdk/api/handlers/openai -count=1Scope
This PR is limited to websocket pinned-auth recovery behavior and does not include any usage aggregation changes.