Skip to content

feat(s3-datastore, gcs-datastore): detect SSO/credential errors and prepend remediation hint#120

Merged
stack72 merged 1 commit intomainfrom
worktree-precious-wobbling-haven
May 4, 2026
Merged

feat(s3-datastore, gcs-datastore): detect SSO/credential errors and prepend remediation hint#120
stack72 merged 1 commit intomainfrom
worktree-precious-wobbling-haven

Conversation

@stack72
Copy link
Copy Markdown
Contributor

@stack72 stack72 commented May 4, 2026

Summary

  • Refs swamp-club#226. Detect AWS SDK and GCP ADC credential failures in the S3 and GCS datastore extension error wrappers, and prepend a swamp-flavoured "Datastore session expired" / "Datastore credentials rejected" summary line that names the cause and points at the right remediation command (aws sso login / gcloud auth application-default login). The SDK detail and cause chain remain underneath; error.name is unchanged so existing call-site checks keep working.
  • Filed swamp-club#228 as out-of-scope follow-up to apply the same treatment to @swamp/aws-sm-vault.

Why

When AWS SSO sessions expire or GCP ADC refresh tokens are revoked mid-execution, the resulting S3OperationError [CredentialsProviderError] or GcsOperationError surfaces with the remediation hint (aws sso login) buried inside the SDK stack — users initially attribute the failure to lock contention or a backend issue. This change front-loads the swamp-flavoured summary so the cause and next action are visible first.

What

S3 (datastore/s3/extensions/datastores/_lib/s3_client.ts):

  • New exported pure helpers: classifyAwsCredentialError, deriveAwsErrorCode, formatAwsCredentialHint, AwsCredentialErrorKind.
  • wrapError calls them; cause-chain walk handles minified SDK builds where the real class name appears as _CredentialsProviderError on .cause (per the issue's stack trace).
  • Detects CredentialsProviderError + ExpiredTokenException (session-expired) and InvalidAccessKeyId + SignatureDoesNotMatch + 403 AccessDenied (credentials-rejected). AWS_PROFILE is read for the hint; falls back to generic phrasing when unset.
  • Existing 401/403 generic hint stays as fallback for non-credential auth failures (e.g. BucketRegionMismatch).

GCS (datastore/gcs/extensions/datastores/_lib/gcs_client.ts):

  • New exported pure helpers: classifyGcpCredentialError, formatGcpCredentialHint, tokenRefreshError, GcpCredentialErrorKind.
  • tokenFromUserCredentials and tokenFromServiceAccount now throw a typed GcsOperationError with name === "CredentialsProviderError" when the response body indicates invalid_grant — the GCP equivalent of expired SSO.
  • send() prepends the credentials-rejected hint on 401/403; suppresses the old generic hint when the credential-specific hint fires.

Manifests: Both datastore/s3 and datastore/gcs bumped from 2026.05.04.1 to 2026.05.04.2 (CalVer same-day .2).

Test plan

  • S3 unit tests for deriveAwsErrorCode — 7 cases covering e.Code precedence, e.name fallback, generic Error ignored, single + multiple leading underscore strip on cause, non-Error cause ignored, no-signal undefined.
  • S3 unit tests for classifyAwsCredentialError — 8 cases covering each error code branch and status guards.
  • S3 unit tests for formatAwsCredentialHint — 5 cases covering profile set/unset variants for both kinds + other returns undefined.
  • S3 integration tests via withMockServer — 4 cases for server-side errors (InvalidAccessKeyId, AccessDenied, SignatureDoesNotMatch, BucketRegionMismatch regression) + 1 case for 500 regression.
  • S3 client-side CredentialsProviderError integration test — sets AWS_PROFILE to a non-existent profile, unsets all access-key env vars, points config files at empty temp file, disables IMDS, calls headBucket(). Asserts the wrapped message starts with "Datastore session expired", includes the configured profile in the aws sso login --profile hint, and S3OperationError.name reflects the underlying SDK error name. This is the issue's primary failure path, exercised end-to-end against a real SDK chain failure.
  • GCS unit tests for classifyGcpCredentialError / formatGcpCredentialHint — 8 cases.
  • GCS unit tests for tokenRefreshError — 4 cases covering invalid_grant (session-expired framing + name = CredentialsProviderError), non-invalid_grant 4xx (generic name), 5xx (no credential framing), and a composition test that proves the resulting error flows through classifyGcpCredentialError to session-expired.
  • GCS integration tests via Deno.serve({ port: 0 }) — 401, 500 regression, 404/412 narrow-type regression guards. Existing 403 and gcs_cache_sync_test.ts 403 tests updated to assert the new hint.
  • assertDatastoreExportConformance continues to pass for both extensions.
  • deno check / lint / fmt --check / test / install --frozen clean for both extensions.

Out of scope

  • @swamp/aws-sm-vault has the same problem but no wrapError pattern to extend — introducing one is a meaningful separate lift. Filed as swamp-club#228.
  • The swamp-core summarizeSyncError reference path lives outside this repo.

🤖 Generated with Claude Code

…repend remediation hint (#226)

When AWS SSO sessions expire or GCP ADC refresh tokens are revoked, the
underlying SDK error currently surfaces buried inside an S3OperationError
or GcsOperationError stack, with the remediation hint (`aws sso login` /
`gcloud auth application-default login`) hard to find. Users initially
attribute the failure to the lock layer or backend.

Both wrappers now classify auth-class failures and prepend a
swamp-flavoured summary line:

- "Datastore session expired: ... Run 'aws sso login --profile <P>' ..."
- "Datastore credentials rejected by AWS: verify '<P>' ..."
- "Datastore session expired: your GCP ADC ... 'gcloud auth application-default login' ..."
- "Datastore credentials rejected by GCS: verify GOOGLE_APPLICATION_CREDENTIALS ..."

Detection is via pure exported helpers (classifyAwsCredentialError,
deriveAwsErrorCode, classifyGcpCredentialError, formatAwsCredentialHint,
formatGcpCredentialHint, tokenRefreshError) so each branch can be
unit-tested without constructing real SDK errors. The wrappers stay
private; only the helpers are exported.

S3 detection covers CredentialsProviderError + ExpiredTokenException
(session-expired) and InvalidAccessKeyId + SignatureDoesNotMatch + 403
AccessDenied (credentials-rejected). The cause-chain walk handles
minified SDK builds where the real class name appears as
_CredentialsProviderError on .cause.

GCS refactors tokenFromUserCredentials and tokenFromServiceAccount to
throw a typed GcsOperationError with name "CredentialsProviderError"
on invalid_grant responses, so the same classifier path fires for ADC
refresh failures.

The .name field on both wrapper classes is preserved as the underlying
SDK error name so existing `error.name === "..."` checks at call sites
keep working.

Filed swamp-club#228 to apply the same treatment to @swamp/aws-sm-vault.

Refs: swamp-club#226
Manifests: 2026.05.04.1 → 2026.05.04.2 for both extensions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Blocking Issues

None.

Suggestions

  1. Double docblock on tokenRefreshError (gcs_client.ts lines ~278–291): Two consecutive JSDoc blocks appear before the function — likely a merge artifact. TypeScript associates only the last one with the function; the first is dead documentation. The first block should be removed.

  2. classifyGcpCredentialError's causeName parameter is always undefined in production send() calls: send() always calls classifyGcpCredentialError(undefined, response.status), so the causeName === "CredentialsProviderError" branch in that function is never reached from send(). The session-expired path for GCS flows through tokenRefreshError() directly (which embeds the hint before throwing). The function is correctly exercised in the composition unit test (err.name is passed in), but a brief note in the JSDoc that causeName is intended for callers re-classifying a caught error — not for send() itself — would clarify the design intent.

  3. Stale section comment in gcs_client_test.ts (line ~467): The block comment preceding the wrapper integration tests says "writing a temporary authorized_user credentials file, pointing the OAuth token endpoint at our mock server, and intercepting the refresh request" but the actual tests call tokenRefreshError directly as pure unit tests. The description matches the test design, not what was implemented.


The implementation is well-structured: pure helper functions for testability, proper cause-chain walking for minified SDK builds, regression guards on non-credential 403s, env vars restored in finally blocks throughout, and sanitizeResources: false following the existing file-level comment pattern. Test coverage is thorough across both providers.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial Review

Critical / High

None found.

Medium

  1. GCS error type change in token-refresh functions (gcs_client.ts:310–319, gcs_client.ts:336–340).
    tokenFromServiceAccount and tokenFromUserCredentials previously threw plain Error; they now throw GcsOperationError via tokenRefreshError(). This is a behavioral change: any catch block that dispatches on instanceof GcsOperationError and inspects httpStatusCode will now match these token-endpoint errors when it didn't before.

    Breaking example: A caller does catch (err) { if (err instanceof GcsOperationError && err.httpStatusCode === 400) { /* "GCS said bad request" */ } }. Before this PR, a 400 from the OAuth2 token endpoint (e.g. invalid_grant) would be a plain Error and skip this block. Now it enters the block with a status code whose semantic origin is the token endpoint, not GCS.

    Verified safe in existing code: isRetryableError in gcs_cache_sync.ts:182–195 handles the new type correctly — 400 → not retried (correct), 5xx → retried (arguably an improvement over the old behavior where 5xx token-endpoint failures were non-retryable plain Errors). The name field ("CredentialsProviderError" / "TokenRefreshError") disambiguates if needed. No action required unless there are call sites outside this repo that match instanceof GcsOperationError.

Low

  1. Duplicate JSDoc on tokenRefreshError (gcs_client.ts:278–293). Two consecutive /** */ blocks precede the function. The first (lines 278–284) is a stale draft; only the second (lines 285–293) is attached to the function by the parser. Harmless, but messy — delete the first block.

  2. Unquoted AWS_PROFILE in remediation hint (s3_client.ts:192). formatAwsCredentialHint interpolates the profile name bare: `aws sso login --profile ${awsProfile}`. If AWS_PROFILE contains spaces (uncommon but not prohibited by AWS config format), the copy-paste command would be broken. Consider wrapping: `aws sso login --profile '${awsProfile}'`.

  3. tokenFromMetadataServer still throws plain Error (gcs_client.ts:376). The other two token functions now use tokenRefreshError()GcsOperationError, but the metadata-server path still throws new Error(...). This means metadata-server token failures won't get the swamp-flavoured hint. The remediation for metadata-server failures is different (check instance SA, not run gcloud auth), so this may be intentional — but it's worth a comment to make that explicit.

Verdict

PASS. The new helpers are pure, well-tested, and correctly integrated into both providers' error paths. Classification logic handles all documented error codes with appropriate regression guards. The medium finding is a real behavioral change but is handled correctly by the existing retry classifier. Solid work.

@stack72 stack72 merged commit 0f83dad into main May 4, 2026
30 checks passed
@stack72 stack72 deleted the worktree-precious-wobbling-haven branch May 4, 2026 18:31
stack72 added a commit that referenced this pull request May 4, 2026
…121)

Five small follow-ups from the post-merge review of #120:

1. gcs_client.ts:tokenRefreshError — collapse the duplicate JSDoc block
   left over from the test-export commit (parser only attached the
   second one; the first was dead documentation).

2. gcs_client_test.ts — rewrite the stale "Wrapper integration tests"
   section comment that described an OAuth-mocking design that was
   never implemented (we export tokenRefreshError and unit-test it
   directly instead).

3. classifyGcpCredentialError — add a JSDoc note that the causeName
   parameter is for callers re-classifying a caught error. send() never
   reaches the cause-name branch because tokenRefreshError short-
   circuits the session-expired path at construction time.

4. s3_client.ts:formatAwsCredentialHint — wrap the profile name in
   double quotes inside the single-quoted command, so the copy-pasted
   `aws sso login --profile "..."` stays valid for profiles that
   contain spaces (uncommon but legal in AWS config). Updated unit and
   integration tests to assert the new shape; added a multi-word-
   profile unit test.

5. tokenFromMetadataServer — add a comment explaining why this path
   intentionally throws plain Error rather than going through
   tokenRefreshError. The metadata-server failure remediation differs
   from the user-credential / service-account paths (check instance SA
   and IAM scopes, not run gcloud auth), so stamping
   `name = "CredentialsProviderError"` here would mislead the
   classifier into prepending the wrong hint.

Manifest bumps: 2026.05.04.2 → 2026.05.04.3 for both extensions.

Refs: swamp-club#226, #120

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stack72 added a commit that referenced this pull request May 6, 2026
…ion hint (#228) (#126)

Follow-up to swamp-club#226 (PR #120/#121 covered s3-datastore and
gcs-datastore). The aws-sm vault wraps the AWS SDK and surfaced the
same CredentialsProviderError discoverability issue when SSO sessions
expire mid-secret-fetch.

Adds vault/aws-sm/extensions/vaults/aws_sm_errors.ts containing:

  - AwsSmOperationError: error wrapper that preserves the SDK error's
    `name` (so error.name === "ResourceNotFoundException" still works)
    and exposes httpStatusCode, code, and requestId.
  - classifyAwsCredentialError, deriveAwsErrorCode,
    formatAwsCredentialHint: pure helpers mirroring the s3 pattern.
  - wrapAwsSmError: front-loads "Vault session expired:" or "Vault
    credentials rejected by AWS:" with `aws sso login --profile <P>`
    pointing at the next action.

aws_sm.ts wraps every .send() call and routes errors through
wrapAwsSmError. The put() create-on-missing fallback migrates from
`error instanceof ResourceNotFoundException` to
`error.name === "ResourceNotFoundException"`, which is functionally
equivalent and removes the SDK class import.

Empirical grounding (probe at @aws-sdk/client-secrets-manager@3.1024.0,
2026-05-06):

  - Pre-flight credential failure surfaces as
    `name === "CredentialsProviderError"` directly on the outer error
    with no cause chain. The cause-walk + leading-underscore-strip in
    deriveAwsErrorCode remains as defensive coverage for minified SDK
    builds and the older #226 stack-trace pattern.
  - HTTP 400 with a body lacking `__type` produces err.name="Unknown"
    and err.message="UnknownError". wrapAwsSmError mirrors the s3
    wrapper's `code !== "Unknown"` and `message !== "UnknownError"`
    filters so the wrapped output stays clean.

Both shapes are covered by direct wrapAwsSmError tests in
aws_sm_errors_test.ts (24 tests). aws_sm_test.ts adds 6 behavioural
tests covering all four operations, the put create-on-missing
regression guard for the instanceof → name migration, and the
malformed-body defense.

Bumps the inline SDK pin in aws_sm.ts from @3.1010.0 → @3.1024.0 so
the new aws_sm_errors.ts and aws_sm.ts agree (matches deno.json's
import map).

Manifest: 2026.04.22.2 → 2026.05.06.1.

Refs: swamp-club#228, swamp-club#226

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant