Skip to content

fix(auth): avoid stale runtime state when re-adding disabled auth files#1945

Open
excelwang wants to merge 3 commits intorouter-for-me:mainfrom
excelwang:pr/upstream-auth-remove-readd-fix
Open

fix(auth): avoid stale runtime state when re-adding disabled auth files#1945
excelwang wants to merge 3 commits intorouter-for-me:mainfrom
excelwang:pr/upstream-auth-remove-readd-fix

Conversation

@excelwang
Copy link

@excelwang excelwang commented Mar 8, 2026

Summary

Fixes #2061.

When an auth file is deleted in the current upstream flow, the runtime entry is kept and marked disabled. If a file with the same identity is later recreated, Update(...) can currently inherit stale ModelStates from that removed/disabled entry.

This PR keeps the existing disable-based removal behavior and only changes the state inheritance rule:

  • keep deletion as disabled + Update
  • do not preserve ModelStates from an existing auth that is already disabled
  • preserve the existing ModelStates inheritance behavior for normal in-place auth updates

Why this approach

This addresses the original bug without changing the existing disable semantics that scheduler and management paths already rely on.

Tests

  • manager update still preserves model state for normal auth updates
  • manager update does not preserve stale model state from a disabled auth entry
  • management delete + re-add of the same auth file no longer carries stale runtime state into the recreated auth

@gemini-code-assist
Copy link
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@excelwang
Copy link
Author

Concrete failure mode this fixes:

  1. an auth enters cooldown / error state
  2. the operator deletes the bad auth file
  3. the operator adds a fresh replacement file with the same auth ID / filename
  4. the fresh auth still inherits the deleted entry's stale runtime state and remains unschedulable until the old cooldown expires

So this is not just a cleanup issue. In production it means a healthy replacement credential can remain unavailable for minutes or hours after rotation, which is why I split this out as a standalone bugfix.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a3e448a3d1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

return nil
}
}
if err := m.store.Delete(ctx, id); err != nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Delete persisted auth via record path instead of ID

Calling m.store.Delete(ctx, id) here assumes auth IDs are valid delete keys, but file-backed auth IDs can contain path separators (e.g. nested IDs like team/user.json). In FileTokenStore.resolveDeletePath, any ID containing a separator is treated as a direct filesystem path rather than being rooted under AuthDir, so removal can target the wrong location and leave the actual auth file undeleted (or delete an unrelated relative path) when an auth is removed and later reloaded. This regression is introduced by the new Remove persistence path and is triggered whenever removed auth IDs are not simple basename IDs.

Useful? React with 👍 / 👎.

Copy link

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick review pass:

  • Main risk area here is auth/session state and stale credential handling.
  • Good to see test coverage move with the code; I’d still make sure it exercises the unhappy path around auth/session state and stale credential handling rather than only the happy path.
  • Before merge, I’d smoke-test the behavior touched by auth_files.go, auth_files_delete_test.go, conductor.go (+2 more) with malformed input / retry / rollback cases, since that’s where this class of change usually breaks.

@excelwang
Copy link
Author

Follow-up on the auth/session state concern:

I re-ran targeted tests around the exact stale-state path this patch fixes. The failure mode here is not request parsing, but runtime auth metadata surviving deletion and being reused when the same auth file is added back.

Re-ran locally:

  • go test ./sdk/cliproxy/auth -run TestManager_Remove_DropsStaleModelStatesForRecreatedAuth -count=1
  • go test ./internal/api/handlers/management -run TestDeleteAuthFile_AllowsReaddingSameNameWithoutStaleState -count=1

The first covers the unhappy path where recreated auth would otherwise inherit stale model/cooldown state from the deleted entry. The second covers the management deletion flow end-to-end and verifies the same-name auth can be re-added and selected again instead of remaining stuck behind old runtime state.

So the risky path here is specifically exercised at both the manager layer and the management delete/re-add flow.

Copy link
Collaborator

@luispater luispater left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This fixes the stale ModelStates / cooldown reuse problem, but the merge result introduces a scheduler regression on current main.

Key findings

  • Blocking: Manager.Remove deletes the auth from m.auths, but it never removes the auth from m.scheduler. On current main, pickNext / pickNextMixed use the scheduler fast path, so a deleted auth can stay selectable from cached model shards until a full scheduler rebuild happens.
  • Non-blocking: the new tests only cover GetByID / stale ModelStates. Please add a regression test that removes an auth after it has already been scheduled and verifies manager.pickNext (or scheduler.pickSingle) can no longer return it.

Test plan

  • go test ./sdk/cliproxy/auth ./internal/api/handlers/management -count=1

@excelwang
Copy link
Author

Follow-up pushed in 8aecc9fb.

I pulled latest main again and re-checked the selection path around this change. In the current auth manager code path I could not find a separate m.scheduler cache; pickNext / pickNextMixed still select from m.auths directly. To cover the exact concern anyway, I added the requested regression test for the post-selection remove path:

  • TestManager_Remove_PreviouslySelectedAuthIsNoLongerPickable

That test:

  1. registers two auths for the same model
  2. selects auth-1 once through pickNext
  3. removes auth-1
  4. verifies the next pickNext can only return auth-2

Re-ran:

  • go test ./sdk/cliproxy/auth ./internal/api/handlers/management -count=1

If there is another scheduler/cache path you want this exercised against, point me at the file and I can extend the regression there too.

@excelwang excelwang changed the title fix(auth): fully remove deleted auth state before re-adding auth file fix(auth): avoid stale runtime state when re-adding disabled auth files Mar 11, 2026
@chatgpt-codex-connector
Copy link

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@excelwang
Copy link
Author

Updated this PR to follow #2061 directly.\n\nThis no longer changes deletion semantics to a hard remove. It keeps the existing disable-based removal flow and only prevents from inheriting stale from an already-disabled auth entry when the same auth file is recreated.

@excelwang
Copy link
Author

Updated this PR to follow #2061 directly.

This no longer changes deletion semantics to a hard remove. It keeps the existing disable-based removal flow and only prevents Update from inheriting stale model state from an already-disabled auth entry when the same auth file is recreated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Re-created auth file inherits stale runtime state when re-added with the same path

3 participants