Skip to content

fix(websocket): reset pinned auth after unrecoverable upstream auth errors#1946

Open
excelwang wants to merge 2 commits intorouter-for-me:mainfrom
excelwang:pr/upstream-websocket-unpin-auth
Open

fix(websocket): reset pinned auth after unrecoverable upstream auth errors#1946
excelwang wants to merge 2 commits intorouter-for-me:mainfrom
excelwang:pr/upstream-websocket-unpin-auth

Conversation

@excelwang
Copy link

Summary

This fixes a severe websocket failover bug where a session could stay pinned to an exhausted or invalid upstream auth even after the upstream had already returned an unrecoverable credential error.

In practice, once a websocket session got pinned to a bad auth, subsequent turns in the same session could continue failing instead of switching to another healthy credential. For multi-auth deployments, that turns a single upstream auth failure into a user-visible session outage.

Root Cause

/v1/responses websocket handling keeps a pinned auth for session continuity, but the pin was not cleared when upstream returned quota exhaustion or token invalidation style errors.

Fix

  • detect unrecoverable upstream auth / quota errors in websocket forwarding
  • reset the pinned auth when those errors are observed
  • close the upstream execution session so the next turn can reselect a healthy auth

Why this matters

This is operationally severe because:

  • one bad upstream auth can poison the whole websocket session
  • healthy fallback auths remain unused even though scheduler failover is otherwise available
  • users see repeated failures in the same interactive session until they reconnect or the bad auth recovers

Tests

  • go test ./sdk/api/handlers/openai -count=1

Scope

This PR is limited to websocket pinned-auth recovery behavior and does not include any usage aggregation changes.

@gemini-code-assist
Copy link
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@excelwang
Copy link
Author

Concrete failure mode this fixes:

  1. a /v1/responses websocket session pins auth A
  2. auth A later hits quota exhaustion or token invalidation upstream
  3. the same websocket session keeps the stale pin
  4. subsequent turns keep failing even though another healthy auth is available

So a single bad upstream credential can effectively poison the whole interactive session until reconnect. This PR only resets the pinned auth on unrecoverable upstream auth/quota signals so the next turn can fail over normally.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 875491db34

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

return false
}
switch errMsg.StatusCode {
case http.StatusUnauthorized, http.StatusPaymentRequired, http.StatusForbidden, http.StatusNotFound, http.StatusTooManyRequests:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid resetting pinned auth on generic 404 responses

shouldResetPinnedAuthForWebsocketError currently treats any 404 as an unrecoverable auth error, but websocket upstreams can return 404 for request/state issues (for example, stale previous_response_id) that are not credential failures. In that case this branch clears pinnedAuthID and closes the execution session, so the next turn can be routed to a different auth and lose session continuity even if the client corrects the request. This makes a recoverable request error degrade into cross-auth context loss.

Useful? React with 👍 / 👎.

Copy link

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick review pass:

  • Main risk area here is auth/session state and stale credential handling.
  • Good to see test coverage move with the code; I’d still make sure it exercises the unhappy path around auth/session state and stale credential handling rather than only the happy path.
  • Before merge, I’d smoke-test the behavior touched by openai_responses_websocket.go, openai_responses_websocket_test.go with malformed input / retry / rollback cases, since that’s where this class of change usually breaks.

@excelwang
Copy link
Author

Follow-up on the stale credential / unhappy-path concern:

I re-ran targeted tests against the touched websocket path.

Re-ran locally:

  • go test ./sdk/api/handlers/openai -run "TestShouldResetPinnedAuthForWebsocketError|TestNormalizeResponsesWebsocketRequestCreateWithHistory|TestNormalizeResponsesWebsocketRequestWithPreviousResponseIDMergedWhenIncrementalDisabled" -count=1

Coverage split is:

  • TestShouldResetPinnedAuthForWebsocketError exercises the unrecoverable upstream auth/quota signals that should break the stale pin (token_invalidated, quota/429-style signal, and non-credential control case).
  • The existing normalization tests in the same file cover the request-shaping path around websocket history / previous-response handling so the pin-reset change is not only validated on the happy path.

This change does not alter retry backoff logic itself; it only decides when a poisoned pinned auth must be dropped so the next turn can fail over normally.

Copy link
Collaborator

@luispater luispater left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This correctly clears the pinned auth after unrecoverable websocket auth/quota errors, but it still leaves websocket-v2 continuation enabled for the next turn.

Key findings

  • Blocking: after resetPinnedAuth, the handler clears pinnedAuthID, but the next request still falls back to websocketUpstreamSupportsIncrementalInputForModel(...). That check only asks whether any available auth for the model supports websocket incremental input, so normalizeResponseSubsequentRequest(...) can still forward the old previous_response_id unchanged to a newly selected auth. At that point the new auth does not own the old upstream continuation/session, so the next turn can still fail with the same continuation error instead of actually failing over.
  • Non-blocking: the added test only covers the helper classification. Please add an end-to-end regression that pins auth A, forces an unrecoverable websocket error, then sends a follow-up turn with previous_response_id and verifies the handler no longer forwards that upstream continuation token after the pin reset.

Test plan

  • go test ./sdk/api/handlers/openai -count=1

@chatgpt-codex-connector
Copy link

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@excelwang
Copy link
Author

Follow-up pushed in a3caa0bc.

This addresses the continuation-token blocker, not just the helper classification. After an unrecoverable websocket auth/quota error resets the pinned auth, the next turn now forces incremental continuation off once, so the stale previous_response_id is not forwarded to the newly selected auth.

I also added the requested end-to-end regression test:

  • TestResponsesWebsocket_DropsPreviousResponseIDAfterPinnedAuthReset

That scenario covers:

  1. auth A handles the first turn and establishes the original response id
  2. a follow-up turn on pinned auth A hits an unrecoverable websocket auth error (token_invalidated)
  3. the next turn still arrives with the old previous_response_id
  4. failover switches to auth B
  5. the forwarded upstream payload for auth B no longer contains that stale continuation token

Re-ran:

  • go test ./sdk/api/handlers/openai -count=1

Copy link

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Overall: ✅ Approvable — the core logic is correct, but the continuation-token leak that luispater flagged needs verification.

What's good

  • resetPinnedAuthDisablesIncrementalInput flag correctly blocks previous_response_id forwarding for exactly one turn after a pin reset — the flag is cleared only after a successful non-resetting turn (!resetPinnedAuth), which is the right state machine.
  • shouldResetPinnedAuthForWebsocketError is well-scoped: status codes 401/402/403/404/429 plus textual quota/invalidation signals. Limiting 404 to the websocket path (where it indicates the upstream session is gone) is correct — this is a different semantic than 404 on REST.
  • CloseExecutionSession is called on pin reset, which is important for connection pool hygiene.
  • The end-to-end test (TestResponsesWebsocket_DropsPreviousResponseIDAfterPinnedAuthReset) exercises the full sequence: success → auth failure → failover with stripped continuation token. This directly addresses luispater's requested regression test.

Issue (luispater's blocker — needs verification)

The test verifies executor.payloads[2] does not contain previous_response_id. But look at the turn-3 request construction:

thirdRequest := fmt.Sprintf(`{..., "previous_response_id":%q,...}`, previousResponseID)

The client sends previous_response_id on turn 3. The test then checks that the upstream payload (executor.payloads[2]) doesn't contain it — which exercises normalizeResponseSubsequentRequest stripping the field when allowIncrementalInputWithPreviousResponseID == false.

Please confirm that normalizeResponseSubsequentRequest (or equivalent) is the codepath that removes previous_response_id when the flag is false, and that this behavior is exercised here. If the test passes by coincidence (e.g. the field is stripped elsewhere unconditionally), the regression coverage is weaker than it appears.

Non-blocking

  • The resetPinnedAuthDisablesIncrementalInput flag persists until a successful non-resetting turn. If there are multiple consecutive auth failures (second auth also fails), the flag stays true correctly — but consider whether back-to-back resets could put the session in a state where no incremental input is ever re-enabled if new auths keep failing. Probably fine for the edge case, but worth a comment.
  • strings.TrimSpace(pinnedAuthID) is used in the log comparison but pinnedAuthID is always assigned from auth ID strings that shouldn't have whitespace. The trim is harmless but slightly misleading.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants