fix(hub): disambiguate supervisor auth failure reasons #50
Merged
verifyApiKeyWithCapability now returns a discriminated union
({ ok: true | false, reason: 'not_found' | 'revoked' | 'deleted' |
'missing_capability' }) instead of collapsing every failure to null.
The supervisor WS auth handler logs the precise reason and closes
with code 4001 + a meaningful reason text (api_key_not_found,
api_key_revoked, missing_supervisor_capability) so prod auth bugs
stop masquerading as silent connection failures.
- hub/src/db/supervisor-dal.ts: VerifyApiKeyResult union, separate
revoked + capability gates, structured log lines per branch
- hub/src/ws/agent.ts: switch over verified.reason, distinct close
reason strings, structured auth_error payload (adds .reason field)
- hub/test/supervisor-auth-disambiguation.test.ts: REMO_E2E_DB_URL-gated
coverage of all four runtime branches + always-on exhaustive-switch
smoke test
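
The union and the switch over `verified.reason` described above can be sketched roughly as follows. This is a minimal illustration, not the code in `hub/src/ws/agent.ts`: the helper names `closeReasonFor` and `describeAuthFailure` are hypothetical, the success branch is reduced to `{ ok: true }`, and the close reason used for `deleted` is an assumption, since the PR only names three reason strings.

```typescript
// Failure reasons named in the PR description.
type FailureReason = "not_found" | "revoked" | "deleted" | "missing_capability";

// Discriminated union on `ok`; the real success branch likely carries key
// details, which are omitted here (assumption).
type VerifyApiKeyResult =
  | { ok: true }
  | { ok: false; reason: FailureReason };

// Per the PR, the close code stays 4001 for every auth failure.
const AUTH_CLOSE_CODE = 4001;

// Hypothetical helper: maps a failure reason to the WS close reason text.
function closeReasonFor(reason: FailureReason): string {
  switch (reason) {
    case "not_found":
      return "api_key_not_found";
    case "revoked":
      return "api_key_revoked";
    case "missing_capability":
      return "missing_supervisor_capability";
    case "deleted":
      // ASSUMPTION: the PR does not name a close reason for this reserved
      // branch; this string is a placeholder for illustration only.
      return "api_key_deleted";
    default: {
      // Exhaustiveness check: adding a new reason without a case above
      // makes this assignment a compile-time error.
      const unreachable: never = reason;
      throw new Error(`unhandled reason: ${String(unreachable)}`);
    }
  }
}

// Hypothetical helper: returns the close frame details, or null on success.
function describeAuthFailure(
  result: VerifyApiKeyResult
): { code: number; reason: string } | null {
  if (result.ok) return null;
  return { code: AUTH_CLOSE_CODE, reason: closeReasonFor(result.reason) };
}
```

The `never` assignment in the default branch is what lets an exhaustive-switch smoke test run without a database: it is checked by the compiler rather than at runtime.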
finedesignz added a commit that referenced this pull request May 26, 2026
…th (#69)

Two silent failure paths in handleAgentAuth made it impossible to diagnose "new sessions not populating in the UI" from Coolify logs:

1. Invalid api_key: verifyApiKey returns null, hub sends auth_error frame and closes 4001, but logs NOTHING. The user's only signal is a bare "[agent] connection opened" with no follow-up, indistinguishable from a network blip. Now logs `[agent] auth fail reason=invalid_api_key hash=<8hex>... host=<hostname>` matching the supervisor disambiguation pattern from PR #50.
2. Missing project_dir AND rootless_sessions: Phase 05 made project_dir optional in the AgentAuth zod schema (rootless-only agents), but the subsequent `msg.project_dir.replace(...)` would throw at runtime, again with no diagnostic. Explicit reject with reason=no_project_or_rootless.

Adds hub/test/agent-auth-logging.test.ts (REMO_E2E_DB_URL gated) covering both new log lines + verifying the schema-reject path (already logged) is intact.

Root cause of the user-visible bug: agents reconnecting with stale/wrong API keys after recent dashboard changes were silently rejected. The fix itself is purely diagnostic: once deployed, the operator can see exactly why an "[agent] connection opened" didn't produce "[agent] authenticated" and act (rotate key, point agent at correct hub, etc.).
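
The two reject paths in that commit might look something like the sketch below. It is an illustration under stated assumptions, not the actual `handleAgentAuth` code: `AgentAuthMsg`, `authFailLogLine`, and `rejectReason` are hypothetical names, "missing" is taken to mean `undefined`, and the key hash is assumed to arrive as a hex string whose first 8 characters are logged.

```typescript
// Hypothetical subset of the AgentAuth message: only the two fields
// discussed in the commit message. Both are optional here (assumption
// for rootless_sessions; the commit states it for project_dir).
interface AgentAuthMsg {
  project_dir?: string;
  rootless_sessions?: unknown[];
}

// Builds a log line in the format quoted in the commit message:
// [agent] auth fail reason=<reason> hash=<8hex>... host=<hostname>
function authFailLogLine(reason: string, keyHashHex: string, host: string): string {
  return `[agent] auth fail reason=${reason} hash=${keyHashHex.slice(0, 8)}... host=${host}`;
}

// Returns the reject reason, or null when the message may proceed.
function rejectReason(msg: AgentAuthMsg, keyIsValid: boolean): string | null {
  if (!keyIsValid) return "invalid_api_key";
  // Guard before any msg.project_dir.replace(...) call, which would throw
  // when project_dir is undefined.
  if (msg.project_dir === undefined && msg.rootless_sessions === undefined) {
    return "no_project_or_rootless";
  }
  return null;
}
```

A caller would log `authFailLogLine(...)` before sending the `auth_error` frame, so that every "[agent] connection opened" line is followed by either an authenticated line or an explicit failure reason.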
Summary
- `verifyApiKeyWithCapability` returns a discriminated union (`not_found` / `revoked` / `deleted` / `missing_capability` / `ok`) instead of collapsing every failure to `null`.
- The supervisor WS auth handler closes with `4001` + a meaningful reason text (`api_key_not_found`, `api_key_revoked`, `missing_supervisor_capability`) so the supervisor sees WHY auth failed.
- `auth_error` payload now includes a `reason` field; close code stays 4001 for backward compatibility.

Why
Prod supervisor auth failures (incl. the `237dfc95…` incident logged in MEMORY.md) currently log nothing distinguishable: the legacy single-query `WHERE revoked_at IS NULL` collapsed three distinct failure modes (key gone / key revoked / key has wrong capabilities) into one silent null. Diagnosing required SSH-ing into Coolify. After this PR the hub logs `reason=revoked` / `reason=missing_capability` / `reason=not_found` on every fail, and the supervisor sees a 4001 close with a parseable reason string.

Notes
- The `deleted` branch is reserved in the union for forward-compat: the `api_keys.deleted_at` column doesn't exist yet (only `revoked_at`). It is never returned at runtime today but typechecks as exhaustive.
- Empty `capabilities[]` is still treated as legacy all-caps (no behavior change for legacy keys).

Test plan
- `bun test hub/test/`: 247 pass, 0 fail
- `hub/test/supervisor-auth-disambiguation.test.ts`: 5 DB-gated branches (`not_found`, `ok`, `revoked`, `missing_capability`, legacy empty caps) + always-on exhaustive-switch type smoke
- … `zewfc6g9dw3c4h88z2jd2o4g` and confirm a freshly-revoked supervisor key now logs `[supervisor] auth fail reason=revoked` on next reconnect
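
For reference, the separate gates that replace the single `WHERE revoked_at IS NULL` query can be modelled as a pure function over an already-fetched row. This is a sketch under assumptions: `ApiKeyRow`, `KeyCheck`, and `classifyKey` are hypothetical names, the row shape is reduced to the two columns mentioned in this PR, and the capability name `"supervisor"` in the usage note is illustrative.

```typescript
// Hypothetical, reduced row shape: only the columns this PR mentions.
interface ApiKeyRow {
  revoked_at: Date | null;
  capabilities: string[];
}

type KeyCheck =
  | { ok: true }
  | { ok: false; reason: "not_found" | "revoked" | "deleted" | "missing_capability" };

// Classifies a looked-up row. `row` is undefined when no key matched.
// The "deleted" reason is reserved in the type but never produced here,
// mirroring the note that api_keys.deleted_at does not exist yet.
function classifyKey(row: ApiKeyRow | undefined, required: string): KeyCheck {
  // Gate 1: no such key at all.
  if (row === undefined) return { ok: false, reason: "not_found" };

  // Gate 2: key exists but has been revoked.
  if (row.revoked_at !== null) return { ok: false, reason: "revoked" };

  // Gate 3: capability check. An empty list is treated as a legacy key
  // holding all capabilities, so it always passes.
  const legacyAllCaps = row.capabilities.length === 0;
  const hasCapability = row.capabilities.some((c) => c === required);
  if (!legacyAllCaps && !hasCapability) {
    return { ok: false, reason: "missing_capability" };
  }

  return { ok: true };
}
```

For example, `classifyKey({ revoked_at: null, capabilities: [] }, "supervisor")` passes as a legacy key, while the same call with `capabilities: ["agent"]` yields `missing_capability`. The gate order matters: a revoked key that also lacks the capability is reported as `revoked`.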