Skip to content

fix(hub): disambiguate supervisor auth failure reasons#50

Merged
finedesignz merged 1 commit into
mainfrom
fix/supervisor-auth-disambiguation
May 26, 2026
Merged

fix(hub): disambiguate supervisor auth failure reasons#50
finedesignz merged 1 commit into
mainfrom
fix/supervisor-auth-disambiguation

Conversation

@finedesignz
Copy link
Copy Markdown
Owner

Summary

  • verifyApiKeyWithCapability returns a discriminated union (not_found / revoked / deleted / missing_capability / ok) instead of collapsing every failure to null.
  • Supervisor WS auth handler logs the precise reason and closes the socket with code 4001 + a meaningful reason text (api_key_not_found, api_key_revoked, missing_supervisor_capability) so the supervisor sees WHY auth failed.
  • auth_error payload now includes a reason field; close code stays 4001 for backward compatibility.

Why

Prod supervisor auth failures (incl. the 237dfc95… incident logged in MEMORY.md) currently log nothing distinguishable — the legacy single-query WHERE revoked_at IS NULL collapsed three distinct failure modes (key gone / key revoked / key has wrong capabilities) into one silent null. Diagnosing required SSH-ing into Coolify. After this PR the hub logs reason=revoked / reason=missing_capability / reason=not_found on every fail, and the supervisor sees a 4001 close with a parseable reason string.

Notes

  • deleted branch is reserved in the union for forward-compat — api_keys.deleted_at column doesn't exist yet (only revoked_at). It is never returned at runtime today but typechecks as exhaustive.
  • Empty capabilities[] is still treated as legacy all-caps (no behavior change for legacy keys).

Test plan

  • bun test hub/test/ — 247 pass, 0 fail
  • New file hub/test/supervisor-auth-disambiguation.test.ts: 5 DB-gated branches (not_found, ok, revoked, missing_capability, legacy empty caps) + always-on exhaustive-switch type smoke
  • Post-merge: redeploy via Coolify zewfc6g9dw3c4h88z2jd2o4g and confirm a freshly-revoked supervisor key now logs [supervisor] auth fail reason=revoked on next reconnect

verifyApiKeyWithCapability now returns a discriminated union
({ ok: true | false, reason: 'not_found' | 'revoked' | 'deleted' |
'missing_capability' }) instead of collapsing every failure to null.
The supervisor WS auth handler logs the precise reason and closes
with code 4001 + a meaningful reason text (api_key_not_found,
api_key_revoked, missing_supervisor_capability) so prod auth bugs
stop masquerading as silent connection failures.

- hub/src/db/supervisor-dal.ts: VerifyApiKeyResult union, separate
  revoked + capability gates, structured log lines per branch
- hub/src/ws/agent.ts: switch over verified.reason, distinct close
  reason strings, structured auth_error payload (adds .reason field)
- hub/test/supervisor-auth-disambiguation.test.ts: REMO_E2E_DB_URL-gated
  coverage of all four runtime branches + always-on exhaustive-switch
  smoke test
@finedesignz finedesignz merged commit 2daf904 into main May 26, 2026
1 check passed
@finedesignz finedesignz deleted the fix/supervisor-auth-disambiguation branch May 26, 2026 14:35
finedesignz added a commit that referenced this pull request May 26, 2026
…th (#69)

Two silent failure paths in handleAgentAuth made it impossible to diagnose
"new sessions not populating in the UI" from Coolify logs:

1. Invalid api_key: verifyApiKey returns null, hub sends auth_error frame
   and closes 4001 — but logs NOTHING. The user's only signal is a bare
   "[agent] connection opened" with no follow-up, indistinguishable from a
   network blip. Now logs `[agent] auth fail reason=invalid_api_key
   hash=<8hex>... host=<hostname>` matching the supervisor disambiguation
   pattern from PR #50.

2. Missing project_dir AND rootless_sessions: Phase 05 made project_dir
   optional in the AgentAuth zod schema (rootless-only agents), but the
   subsequent `msg.project_dir.replace(...)` would throw at runtime, again
   with no diagnostic. Explicit reject with reason=no_project_or_rootless.

Adds hub/test/agent-auth-logging.test.ts (REMO_E2E_DB_URL gated) covering
both new log lines + verifying the schema-reject path (already logged) is
intact.

Root cause of the user-visible bug: agents reconnecting with stale/wrong
API keys after recent dashboard changes were silently rejected. The fix
itself is purely diagnostic — once deployed, the operator can see exactly
why an "[agent] connection opened" didn't produce "[agent] authenticated"
and act (rotate key, point agent at correct hub, etc.).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant