
test(harden): stabilize live suite — bundle resolve, S3 SlowDown retry, SQS post-create race#254

Open
sam-goodwin wants to merge 10 commits into main from claude/gracious-khorana-ad8809

Conversation


sam-goodwin (Contributor) commented May 7, 2026

Stabilizes the live test suite — 3 consecutive clean runs (673 passed / 42 skipped) after these fixes. All but one of these failure modes show up only under parallel-suite load.

Bundle: resolve alchemy/Stack (and friends) without requiring lib/

Rolldown was picking the import condition in alchemy's package.json#exports, which points at ./lib/*.js. On a fresh checkout lib/ doesn't exist; rolldown emits [UNRESOLVED_IMPORT], treats the path as external, and the Lambda crashes on init. Surfaces as opaque "Function URL returned 502" timeouts across SQS/SNS/DynamoDB/Kinesis Bindings tests and the Lambda Function test.

```ts
const SOURCE_CONDITIONS = ["bun", "workerd", "import", "default"];
```

This resolves straight to source `.ts`. Rolldown handles TS natively, and every package exposing a `bun` source condition also publishes `src/`.
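
For reference, a minimal sketch of wiring these conditions into a rolldown build; `resolve.conditionNames` is rolldown's option name, while the entry point and surrounding call are illustrative:

```ts
import { build } from "rolldown";

const SOURCE_CONDITIONS = ["bun", "workerd", "import", "default"];

await build({
  input: "src/handler.ts", // illustrative entry point
  resolve: {
    // Source-first: workspace subpath imports like alchemy/Stack resolve
    // to src/*.ts instead of the unbuilt lib/*.js `import` target.
    conditionNames: SOURCE_CONDITIONS,
  },
});
```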

S3 SlowDown — patched in distilled

S3 503 SlowDown was previously surfaced as UnknownAwsError, which the blanket awsRetryFactory in Providers.ts couldn't recognize. Patched in distilled to classify it as ThrottlingError + RetryableError: alchemy-run/distilled#270. Once that lands, the existing 10-attempt jittered-exponential retry handles it transparently — no alchemy-side change needed.
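
For reference, a sketch of what that blanket retry plausibly looks like in Effect; the names are hypothetical, only the shape (jittered exponential backoff, 10-attempt cap) follows the description above:

```ts
import { Effect, Schedule } from "effect";

// Hypothetical reconstruction of awsRetryFactory's schedule: jittered
// exponential backoff capped at 10 attempts.
const awsRetrySchedule = Schedule.exponential("200 millis").pipe(
  Schedule.jittered,
  Schedule.intersect(Schedule.recurs(10)),
);

// Stand-in for distilled's category check once #270 lands.
declare const isThrottlingError: (error: unknown) => boolean;

const withAwsRetry = <A, E, R>(effect: Effect.Effect<A, E, R>) =>
  effect.pipe(
    Effect.retry({ while: isThrottlingError, schedule: awsRetrySchedule }),
  );
```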

SQS Queue reconcile: tolerate post-create eventual consistency

```ts
sqs.getQueueAttributes({ QueueUrl: queueUrl, AttributeNames: ["All"] }).pipe(
  Effect.retry({
    while: (e) => e._tag === "QueueDoesNotExist",
    schedule: Schedule.fixed(500).pipe(Schedule.both(Schedule.recurs(60))),
  }),
)
```

Same-tick GetQueueAttributes after CreateQueue can return QueueDoesNotExist for a few hundred ms.

DynamoDB Bindings test: bump readiness budget

75 -> 100 retries (150s -> 200s). Under parallel-suite load, the Bindings handler's /scan probe was returning 500 for the entire 150s window while the IAM scan policy propagated. Still inside the beforeAll 240s timeout.
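
The numbers imply a ~2s poll interval; a sketch of the probe loop under that assumption (the /scan endpoint comes from the test, the loop itself is illustrative):

```ts
// 100 attempts x 2s ≈ 200s, still inside the 240s beforeAll timeout.
const waitForScanReady = async (functionUrl: string): Promise<void> => {
  for (let attempt = 0; attempt < 100; attempt++) {
    const res = await fetch(`${functionUrl}/scan`);
    if (res.ok) return; // IAM scan policy is finally observable
    await new Promise((resolve) => setTimeout(resolve, 2_000));
  }
  throw new Error("/scan probe still returning 500 after 200s");
};
```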

Hyperdrive test: skip pending real Postgres origin

Cloudflare validates not just DNS but TCP-level connectivity at create time, so any placeholder host (db.example.com, example.com:5432, etc.) is rejected. Skipped both cases with a TODO(hyperdrive) to provision a Neon project as the fixture.

…create race, DynamoDB Bindings readiness, Hyperdrive skip

Bundle: when rolldown resolves `alchemy/Stack` (and other workspace
subpath imports) it picks the package.json `import` condition which
points at `lib/*.js`. On a fresh checkout `lib/` doesn't exist; rolldown
emits `[UNRESOLVED_IMPORT]`, treats the path as external, and the
deployed Lambda crashes on init — which surfaced as opaque "Function
URL returned 502" timeouts across SQS/SNS/DynamoDB/Kinesis Bindings and
the Lambda Function test. Default `resolve.conditionNames` now starts
with `bun`/`workerd` so we resolve straight to source `.ts`.

```ts
const SOURCE_CONDITIONS = ["bun", "workerd", "import", "default"];
```

S3 Assets: `PutObject`/`HeadObject` retry on the SlowDown 503 that
distilled surfaces as `UnknownAwsError { errorTag: "SlowDown" }`, with
a ~64s exponential budget. Without this, a busy account fails the
first Lambda upload of a parallel-suite run with `AssetsUploadError`.
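
Assuming a 500ms base, seven doublings land on that budget (0.5 + 1 + 2 + 4 + 8 + 16 + 32 = 63.5s); a sketch of the since-reverted shim under that assumption:

```ts
import { Effect, Schedule } from "effect";

// The predicate from Assets.ts (see the review thread below).
declare const isS3SlowDown: (error: unknown) => boolean;

// ~64s worst-case delay budget across 7 retries.
const retrySlowDown = <A, E, R>(effect: Effect.Effect<A, E, R>) =>
  effect.pipe(
    Effect.retry({
      while: isS3SlowDown,
      schedule: Schedule.exponential("500 millis").pipe(
        Schedule.intersect(Schedule.recurs(7)),
      ),
    }),
  );
```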

SQS Queue reconcile: post-create `getQueueAttributes` retries on
`QueueDoesNotExist` for ~30s. SQS is eventually consistent post-
`CreateQueue` and the same-tick attribute fetch can briefly miss.

DynamoDB Bindings test: bump readiness budget from 75 to 100 retries
(150s -> 200s) so IAM-policy propagation under parallel-suite load
doesn't trip the 500-from-/scan probe.

Hyperdrive test: skip both cases pending a real publicly-routable
postgres origin. Cloudflare validates not just DNS but TCP-level
connectivity at create time, so any placeholder host is rejected.

alchemy-version-bot Bot commented May 7, 2026

Install the packages built from this commit:

- alchemy: `bun add alchemy@https://pkg.ing/alchemy/d36404c`
- @alchemy.run/better-auth: `bun add @alchemy.run/better-auth@https://pkg.ing/@alchemy.run/better-auth/d36404c`
- @alchemy.run/pr-package: `bun add @alchemy.run/pr-package@https://pkg.ing/@alchemy.run/pr-package/d36404c`

Comment thread on packages/alchemy/src/AWS/Assets.ts, lines +13 to +26 (Outdated)
```ts
/**
 * S3 throttles `PutObject` per-prefix at modest QPS (the well-known
 * "503 SlowDown" / "Please reduce your request rate" response). Distilled
 * surfaces these as `UnknownAwsError` with `errorTag: "SlowDown"` because
 * they're not part of the official PutObject error model. They are always
 * safe to retry with backoff — the request never reached S3's storage
 * tier — and the same alchemy stack uploading lots of Lambda code in a
 * burst will trip this on a busy account.
 */
const isS3SlowDown = (error: unknown): boolean => {
  if (typeof error !== "object" || error === null) return false;
  const e = error as { _tag?: string; errorTag?: string };
  return e._tag === "UnknownAwsError" && e.errorTag === "SlowDown";
};
```
Contributor Author

this looks like something we need to patch in distilled for s3

Contributor Author

Patched in distilled — alchemy-run/distilled#270. Reverted the alchemy-side workaround in 90cf3d8; once distilled releases, the existing blanket awsRetryFactory will retry SlowDown via the new ThrottlingError + RetryableError categories.


alchemy-version-bot Bot commented May 7, 2026

Website Preview Deployed

URL: https://alchemyeffectwebsite-worker-pr-254-pf4dtuqvhbdmn6vb.testing-2b2.workers.dev

Built from commit cf5f8dd.


This comment updates automatically with each push.

… ThrottlingError now

Now that distilled@>=0.18.0 classifies S3 `SlowDown` as
`ThrottlingError + RetryableError` (alchemy-run/distilled#270), the
blanket `awsRetryFactory` in Providers.ts already retries it with
exponential backoff + jitter and a 10-attempt cap. The local retry on
`headObject`/`putObject` was a temporary shim — drop it.
…F/AWS transient classification, DynamoDB readiness rework, KV title uniqueness, EventSourceMapping IAM retry

Surfaced and fixed across ~20 back-to-back stress runs of the live suite.

Cloudflare Worker reconcile: retry `createScriptSubdomain` on
`WorkerNotFound` (CF code 10013). The script-put endpoint and the
subdomain endpoint are independently consistent — a same-tick toggle
right after `putWorker` briefly observes the script as missing. ~30s
budget; any other tag propagates immediately.
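
A sketch of that retry, following the same pattern as the SQS snippet above; treating the ~30s budget as 60 × 500ms is an assumption:

```ts
import { Effect, Schedule } from "effect";

// Stand-in for the distilled call named above; signature assumed.
declare const createScriptSubdomain: Effect.Effect<void, { readonly _tag: string }>;

const enableSubdomain = createScriptSubdomain.pipe(
  Effect.retry({
    while: (e) => e._tag === "WorkerNotFound", // CF code 10013
    schedule: Schedule.fixed("500 millis").pipe(
      Schedule.intersect(Schedule.recurs(60)), // ~30s budget
    ),
  }),
);
```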

AWS Providers retry: also retry on Effect `HttpClientError` whose
`reason._tag === "TransportError"` (or `EmptyBodyError`). These wrap
undici-level connection failures (`ConnectTimeoutError`, dropped
sockets) which distilled doesn't tag as `NetworkError` — so the
existing `isTransientError`/`isThrottlingError` predicate misses them.
The request never reached AWS at all; always safe to retry.
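
A sketch of the widened predicate, using the `isHttpClientError` guard mentioned in a later commit; the `reason._tag` shape is taken from the description, and exact types depend on the Effect/distilled versions in play:

```ts
import { isHttpClientError } from "@effect/platform/HttpClientError";

// Undici-level connection failures never reached AWS; safe to retry.
const isTransientTransport = (error: unknown): boolean => {
  if (!isHttpClientError(error)) return false;
  const reason = (error as { reason?: { _tag?: string } }).reason;
  return reason?._tag === "TransportError" || reason?._tag === "EmptyBodyError";
};
```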

Cloudflare Providers retry: also retry `CloudflareHttpError` 401/403/5xx.
This is the catch-all distilled raises when CF returns a non-JSON body
(HTML 520 pages, edge auth blips that produce a bare `Unauthorized`).
Not API-level permission failures; transient by construction.

EventSourceMapping IAM retry: include "The security token included in
the request is invalid" in `retryPermissionsPropagation`. Same root
cause as the messages already listed (STS lag for the freshly-attached
trust policy on Lambda's execution role); same recovery (wait + retry).
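
Illustratively, if `retryPermissionsPropagation` matches on message substrings, the change amounts to one more entry (only the quoted string comes from this commit; the rest is a placeholder):

```ts
const PROPAGATION_MESSAGES: readonly string[] = [
  // ...the messages already listed, plus:
  "The security token included in the request is invalid",
];

const isPermissionsPropagationLag = (message: string): boolean =>
  PROPAGATION_MESSAGES.some((marker) => message.includes(marker));
```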

DynamoDB Bindings test: split the readiness probe into a no-binding
`/ready` (Lambda is up) and a `/scan` warmup that retries until the
scan policy is observable from inside the live Lambda. The handler
now returns a structured 500 body via `Effect.catch` instead of
collapsing to Lambda's generic "Internal Server Error" through
`Effect.orDie`, so future failures are diagnosable.
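
A sketch of the split, assuming an Effect-style router inside the handler; the routes come from the description, everything else is illustrative:

```ts
import { Effect } from "effect";

// Stand-in for the DynamoDB binding call made by the handler.
declare const scanTable: Effect.Effect<unknown, { readonly _tag: string }>;

const route = (path: string) =>
  path === "/ready"
    ? Effect.succeed({ status: 200, body: "ok" }) // no binding exercised
    : scanTable.pipe(
        Effect.map(() => ({ status: 200, body: "ok" })),
        // Structured 500 instead of Effect.orDie collapsing everything
        // into Lambda's generic "Internal Server Error".
        Effect.catchAll((error) =>
          Effect.succeed({
            status: 500,
            body: JSON.stringify({ error: error._tag }),
          }),
        ),
      );
```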

SNS Bindings test: per-call retry on `TransientHttp` 5xx via a tagged
error. `/ready` doesn't exercise the SNS binding, so a successful
readiness probe doesn't imply `/publish` will work — IAM
`sns:Publish` propagation can lag the Lambda's auth cache. ~90s
per-call budget.
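
A sketch of the tagged-error retry; splitting the ~90s budget as 30 × 3s is an assumption:

```ts
import { Data, Effect, Schedule } from "effect";

class TransientHttp extends Data.TaggedError("TransientHttp")<{
  readonly status: number;
}> {}

// One attempt: a 5xx becomes a retryable tagged error.
const publishOnce = (functionUrl: string) =>
  Effect.tryPromise(() => fetch(`${functionUrl}/publish`, { method: "POST" })).pipe(
    Effect.filterOrFail(
      (res) => res.status < 500,
      (res) => new TransientHttp({ status: res.status }),
    ),
  );

// ~90s per-call budget while sns:Publish propagates.
const publish = (functionUrl: string) =>
  publishOnce(functionUrl).pipe(
    Effect.retry({
      while: (e) => e._tag === "TransientHttp",
      schedule: Schedule.fixed("3 seconds").pipe(
        Schedule.intersect(Schedule.recurs(30)),
      ),
    }),
  );
```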

KV Namespace test "create, update, delete": randomize titles per run.
KV titles are globally unique per account; a hard-coded title made the
test fragile under repeated stress runs (any prior interrupted run
left an orphan, blocking the rename in the next run with
`NamespaceTitleAlreadyExists`).
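
For example (prefix illustrative):

```ts
const title = `alchemy-kv-test-${crypto.randomUUID().slice(0, 8)}`;
```
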
…g; cover more failure modes

Switch from `error as { _tag?: string }` casts to either Effect's
`Predicate.isTagged` (for retry `while` predicates that don't need
field access) or `instanceof` against the actual exported error
classes (for predicates that need to read fields like `status` or
`code`). Cleaner reads, no escape-hatch casts.

  AWS Providers retry: use `isHttpClientError` from Effect and access
  `error.reason._tag` directly off the typed `HttpClientError`.

  Cloudflare Providers retry: import `Forbidden`, `TooManyRequests`,
  `Unauthorized` from `@distilled.cloud/core/errors` and
  `CloudflareHttpError` from `@distilled.cloud/cloudflare/Errors`;
  match with `instanceof`. Adds a bounded `Unauthorized` (CF code
  10000) retry — the same schedule's `recurs(8)` cap keeps a real
  invalid token surfacing within ~22s, rather than letting CF auth-
  edge blips fail tests outright.

  Cloudflare Worker `setWorkerSubdomain`: use
  `Predicate.isTagged("WorkerNotFound")`.

  Cloudflare QueueConsumer `isQueueHandlerMissing`: also match
  `BadRequest` (distilled 0.18 routes 4xx without a code-tagged
  match through the HTTP_STATUS_MAP, so code 11001 sometimes
  arrives as a `BadRequest` instead of `UnknownCloudflareError`).

  AWS SQS Queue: factor out `tolerateMissingQueue` and apply to
  every post-`CreateQueue` SQS call — `getQueueAttributes`,
  `setQueueAttributes`, `listQueueTags`, `tagQueue`. SQS's
  post-create eventual-consistency window can flake any of them,
  not just the first. Uses `Predicate.isTagged("QueueDoesNotExist")`.
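
A sketch of that helper as described; the name comes from the commit, the schedule reuses the earlier ~30s window:

```ts
import { Effect, Predicate, Schedule } from "effect";

const tolerateMissingQueue = <A, E, R>(effect: Effect.Effect<A, E, R>) =>
  effect.pipe(
    Effect.retry({
      while: Predicate.isTagged("QueueDoesNotExist"),
      schedule: Schedule.fixed("500 millis").pipe(
        Schedule.intersect(Schedule.recurs(60)), // ~30s window
      ),
    }),
  );

// e.g. tolerateMissingQueue(sqs.getQueueAttributes({ QueueUrl, AttributeNames: ["All"] }))
```
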
…rrors

Per review: predicates that take `error: unknown` and downcast are a
red flag. Two fixes:

  - QueueConsumer: inline the retry predicate where the error type is
    known from the upstream effect (createConsumer / updateConsumer).
    `e._tag === "UnknownCloudflareError" && e.code === 11001` and
    `e._tag === "BadRequest" && /…/.test(e.message)` typecheck without
    casts because TypeScript narrows on `_tag` against the inferred
    error union.

  - Cloudflare retry factory: the `while` callback is genuinely called
    with `unknown` (distilled wraps every API call), but we don't need
    `instanceof`. Use `Schema.is(<TaggedErrorClass>)` to derive a typed
    refinement guard:

      const isForbidden = Schema.is(Forbidden);
      const isUnauthorized = Schema.is(Unauthorized);
      const isCloudflareHttpError = Schema.is(CloudflareHttpError);
      const isTooManyRequests = Schema.is(TooManyRequests);

    The guards refine `unknown` to the typed instance, so subsequent
    field reads (`error.message`, `error.status`) typecheck cleanly.
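
Hypothetical use in the factory's `while` callback, reading fields through the refined types:

```ts
const isRetryableCloudflare = (error: unknown): boolean =>
  isTooManyRequests(error) ||
  isForbidden(error) ||
  isUnauthorized(error) || // bounded by the schedule's recurs(8) cap
  (isCloudflareHttpError(error) &&
    (error.status === 401 || error.status === 403 || error.status >= 500));
```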