Skip to content

fix(server): refresh api-keys cache on miss + TTL so multi-instance deploys see new accounts (#2351)#2681

Open
yeyitech wants to merge 2 commits into
volcengine:mainfrom
yeyitech:fix/2351-api-keys-cache-invalidation
Open

fix(server): refresh api-keys cache on miss + TTL so multi-instance deploys see new accounts (#2351)#2681
yeyitech wants to merge 2 commits into
volcengine:mainfrom
yeyitech:fix/2351-api-keys-cache-invalidation

Conversation

@yeyitech

Copy link
Copy Markdown
Contributor

Summary

Closes #2351. In a multi-instance deployment behind a load balancer, an account/user created via POST /api/v1/admin/accounts on Instance A wrote to shared AGFS but never propagated to Instance B's in-memory api-keys cache (loaded once at startup). Subsequent authenticated requests routed to Instance B returned 401.

Maintainer @zhoujh01 confirmed in the issue thread: "We also welcome community members to share their code."

This PR keeps it simple — no new infrastructure dependency:

  1. Cache-miss reload — when a request presents an X-API-Key that isn't in the in-memory cache, the cache reloads from AGFS once before declaring 401. New accounts created on a peer instance become visible on the next request.
  2. 30 s TTL — cache entries expire and refresh periodically so already-cached entries also stay fresh.
  3. Local-write invalidation — admin endpoints invalidate the local cache immediately on successful write so the originating instance never has to wait.
  4. Reload deduplication — concurrent misses for the same key share one reload via an asyncio.Lock, so a thundering-herd doesn't multiply storage reads.

No new dependencies (no Redis / no pub-sub). Storage format unchanged. TTL is a module constant; no new config field.

Test plan

  • pytest tests/server/test_api_keys_cache_invalidation.py -x -q passes (5 cases).
  • Manual: spin up two server processes against the same AGFS; create an account on one, immediately auth against the other → succeeds within one retry.
  • Existing tests/server/test_auth.py and admin tests still pass.

Closes #2351

…eploys see new accounts (volcengine#2351)

In a multi-instance deployment behind a load balancer, accounts and
users created on Instance A via /api/v1/admin/accounts wrote to shared
AGFS but never propagated to Instance B's in-memory api-keys cache,
which loaded once at startup. Subsequent requests routed to Instance B
returned 401 even though the new account was persisted.

The cache now refreshes on cache-miss before declaring an unknown key
unauthorized, and entries expire after a 30-second TTL so changes
written elsewhere are picked up on the next request after the TTL.
Local writes (admin endpoints) invalidate immediately so the
originating instance always sees its own writes without waiting.

Concurrent misses for the same key dedupe through an asyncio.Lock so a
thundering-herd of 401s after a fresh account creation only triggers
one reload.

No new dependencies (no Redis / no pub-sub). Storage format unchanged.
TTL is a module constant; no new config field.

Closes volcengine#2351

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ✅

2351 - Fully compliant

Compliant requirements:

  • Fix multi-instance API key cache inconsistency
  • Ensure newly created accounts/users are visible across instances
  • Ensure regenerated API keys invalidate old keys across instances
  • Ensure deleted users are no longer authenticated across instances
⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🏅 Score: 92
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Private Attribute Access

APIKeyManager.resolve_with_refresh accesses private attributes and methods of self._legacy (e.g., _cache_is_stale, _reload_locked, _loaded_at, _unknown_key_reload_at). This creates tight coupling between the two classes and makes future refactoring harder.

async def resolve_with_refresh(self, api_key: str) -> ResolvedIdentity:
    """Resolve an API key, reloading from AGFS on cache miss / TTL expiry.

    See ``LegacyAPIKeyManager.resolve_with_refresh`` for the policy. The
    new-format fast path goes through the same ``resolve`` body, so
    cache-miss reload behaves identically for new and legacy keys.
    """
    import time as _time

    if not api_key:
        raise UnauthenticatedError("Missing API Key")

    if self._legacy._cache_is_stale():
        await self._legacy._reload_locked(force=False)

    try:
        return self.resolve(api_key)
    except UnauthenticatedError:
        pass

    if self._legacy._should_skip_unknown_key_reload(api_key):
        raise UnauthenticatedError("Invalid API Key")

    observed_loaded_at = self._legacy._loaded_at
    await self._legacy._reload_locked(force=True)
    try:
        return self.resolve(api_key)
    except UnauthenticatedError:
        if self._legacy._loaded_at != observed_loaded_at:
            self._legacy._unknown_key_reload_at[api_key] = _time.monotonic()
        raise

@github-actions

Copy link
Copy Markdown

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Add resolve callable parameter for reusability

Add an optional resolve_callable parameter to allow reuse by the new API key manager
without accessing private members.

openviking/server/api_keys/legacy.py [300-347]

-async def resolve_with_refresh(self, api_key: str) -> ResolvedIdentity:
+async def resolve_with_refresh(
+    self,
+    api_key: str,
+    resolve_callable=None,
+) -> ResolvedIdentity:
     """Like ``resolve`` but reloads from AGFS on cache miss / TTL expiry.
 
     Two-pronged freshness for multi-instance deploys:
       1. Cache-miss reload — an unknown ``api_key`` triggers one forced
          reload before declaring 401, so accounts created on a peer
          instance become visible on the next request.
       2. TTL refresh — entries older than ``ACCOUNTS_CACHE_TTL_SECONDS``
          are refreshed before the lookup, bounding staleness for entries
          we already think we know.
 
     Concurrent callers dedupe through ``_reload_lock`` so a thundering
     herd of 401s after a fresh account creation only hits AGFS once.
     """
+    resolve_callable = resolve_callable or self.resolve
     if not api_key:
         raise UnauthenticatedError("Missing API Key")
 
     # Periodic refresh — picks up rotations / removals on peer instances.
     if self._cache_is_stale():
         await self._reload_locked(force=False)
 
     try:
-        return self.resolve(api_key)
+        return resolve_callable(api_key)
     except UnauthenticatedError:
         pass
 
     # Cache miss. Avoid hammering AGFS for keys we just confirmed missing
     # *and for which a reload has already been performed*. If a peer
     # caller is mid-reload, our _reload_locked call below will piggy-back
     # on that reload rather than starting a new one.
     if self._should_skip_unknown_key_reload(api_key):
         raise UnauthenticatedError("Invalid API Key")
 
     observed_loaded_at = self._loaded_at
     await self._reload_locked(force=True)
     # Only stamp the negative cache *after* the reload resolved the
     # question (key truly absent). Stamping before would make peer
     # callers in the same burst short-circuit their own reload attempt
     # without ever seeing the post-reload state.
     try:
-        return self.resolve(api_key)
+        return resolve_callable(api_key)
     except UnauthenticatedError:
         # Only count this as a confirmed miss if our caller (or a peer)
         # actually reloaded — otherwise a concurrent caller's stamp could
         # cause us to mark a key as missing without re-checking AGFS.
         if self._loaded_at != observed_loaded_at:
             self._unknown_key_reload_at[api_key] = time.monotonic()
         raise
Suggestion importance[1-10]: 5

__

Why: This valid change improves encapsulation and reusability by allowing a custom resolve callable, enabling the new API key manager to avoid accessing private legacy members. It’s a useful maintainability improvement but not critical.

Low

@qin-ctx qin-ctx requested a review from zhoujh01 June 17, 2026 08:10
@ZaynJarvis ZaynJarvis added bug Something isn't working scenario:kernel Core server, runtime, storage, retrieval, SDK, CLI, or Studio behavior. urgency:fatal Severe production breakage, crash, security risk, or broad outage. labels Jun 23, 2026
@ZaynJarvis

Copy link
Copy Markdown
Collaborator

Direction matches #2351: cache-miss reload + TTL + local invalidation is a reasonable no-new-infra bridge.

But this PR is not mergeable as-is:

  • CI is red.
  • Diff includes unrelated parser files (openviking/parse/..., tests/parse/test_code_tools.py) that are outside the API-key cache fix.

Please split/drop unrelated parser changes and rerun the targeted auth/cache tests. After that, this can be reviewed as the #2351 fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working Review effort 3/5 scenario:kernel Core server, runtime, storage, retrieval, SDK, CLI, or Studio behavior. urgency:fatal Severe production breakage, crash, security risk, or broad outage.

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

[Bug]:Account/user created on one instance is invisible to other instances in multi-instance deployment

2 participants