fix(auth): avoid stale runtime state when re-adding disabled auth files#1945
fix(auth): avoid stale runtime state when re-adding disabled auth files#1945excelwang wants to merge 3 commits intorouter-for-me:mainfrom
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
Concrete failure mode this fixes:
So this is not just a cleanup issue. In production it means a healthy replacement credential can remain unavailable for minutes or hours after rotation, which is why I split this out as a standalone bugfix. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a3e448a3d1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
sdk/cliproxy/auth/conductor.go
Outdated
| return nil | ||
| } | ||
| } | ||
| if err := m.store.Delete(ctx, id); err != nil { |
There was a problem hiding this comment.
Delete persisted auth via record path instead of ID
Calling m.store.Delete(ctx, id) here assumes auth IDs are valid delete keys, but file-backed auth IDs can contain path separators (e.g. nested IDs like team/user.json). In FileTokenStore.resolveDeletePath, any ID containing a separator is treated as a direct filesystem path rather than being rooted under AuthDir, so removal can target the wrong location and leave the actual auth file undeleted (or delete an unrelated relative path) when an auth is removed and later reloaded. This regression is introduced by the new Remove persistence path and is triggered whenever removed auth IDs are not simple basename IDs.
Useful? React with 👍 / 👎.
xkonjin
left a comment
There was a problem hiding this comment.
Quick review pass:
- Main risk area here is auth/session state and stale credential handling.
- Good to see test coverage move with the code; I’d still make sure it exercises the unhappy path around auth/session state and stale credential handling rather than only the happy path.
- Before merge, I’d smoke-test the behavior touched by auth_files.go, auth_files_delete_test.go, conductor.go (+2 more) with malformed input / retry / rollback cases, since that’s where this class of change usually breaks.
|
Follow-up on the auth/session state concern: I re-ran targeted tests around the exact stale-state path this patch fixes. The failure mode here is not request parsing, but runtime auth metadata surviving deletion and being reused when the same auth file is added back. Re-ran locally:
The first covers the unhappy path where recreated auth would otherwise inherit stale model/cooldown state from the deleted entry. The second covers the management deletion flow end-to-end and verifies the same-name auth can be re-added and selected again instead of remaining stuck behind old runtime state. So the risky path here is specifically exercised at both the manager layer and the management delete/re-add flow. |
luispater
left a comment
There was a problem hiding this comment.
Summary
This fixes the stale ModelStates / cooldown reuse problem, but the merge result introduces a scheduler regression on current main.
Key findings
- Blocking:
Manager.Removedeletes the auth fromm.auths, but it never removes the auth fromm.scheduler. On current main,pickNext/pickNextMixeduse the scheduler fast path, so a deleted auth can stay selectable from cached model shards until a full scheduler rebuild happens. - Non-blocking: the new tests only cover
GetByID/ staleModelStates. Please add a regression test that removes an auth after it has already been scheduled and verifiesmanager.pickNext(orscheduler.pickSingle) can no longer return it.
Test plan
go test ./sdk/cliproxy/auth ./internal/api/handlers/management -count=1
|
Follow-up pushed in I pulled latest
That test:
Re-ran:
If there is another scheduler/cache path you want this exercised against, point me at the file and I can extend the regression there too. |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Updated this PR to follow #2061 directly.\n\nThis no longer changes deletion semantics to a hard remove. It keeps the existing disable-based removal flow and only prevents from inheriting stale from an already-disabled auth entry when the same auth file is recreated. |
|
Updated this PR to follow #2061 directly. This no longer changes deletion semantics to a hard remove. It keeps the existing disable-based removal flow and only prevents Update from inheriting stale model state from an already-disabled auth entry when the same auth file is recreated. |
Summary
Fixes #2061.
When an auth file is deleted in the current upstream flow, the runtime entry is kept and marked disabled. If a file with the same identity is later recreated,
Update(...)can currently inherit staleModelStatesfrom that removed/disabled entry.This PR keeps the existing disable-based removal behavior and only changes the state inheritance rule:
disabled + UpdateModelStatesfrom an existing auth that is already disabledModelStatesinheritance behavior for normal in-place auth updatesWhy this approach
This addresses the original bug without changing the existing disable semantics that scheduler and management paths already rely on.
Tests