
test(harden): stabilize live suite — bundle resolve, S3 SlowDown retry, SQS post-create race#254

Open
sam-goodwin wants to merge 10 commits into main from claude/gracious-khorana-ad8809

Conversation


sam-goodwin (Contributor) commented May 7, 2026

Stabilizes the live test suite — 3 consecutive clean runs (673 passed / 42 skipped) after these fixes. All but one of these failure modes show up only under parallel-suite load.

Bundle: resolve alchemy/Stack (and friends) without requiring lib/

Rolldown was picking the import condition in alchemy's package.json#exports, which points at ./lib/*.js. On a fresh checkout lib/ doesn't exist; rolldown emits [UNRESOLVED_IMPORT], treats the path as external, and the Lambda crashes on init. Surfaces as opaque "Function URL returned 502" timeouts across SQS/SNS/DynamoDB/Kinesis Bindings tests and the Lambda Function test.

```ts
const SOURCE_CONDITIONS = ["bun", "workerd", "import", "default"];
```

This resolves straight to source `.ts`. Rolldown handles TS natively, and every package exposing a `bun` source condition also publishes `src/`.
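
For reference, a minimal sketch of wiring these conditions into a rolldown build; `resolve.conditionNames` is rolldown's option name, while the entry point and surrounding call are illustrative:

```ts
import { build } from "rolldown";

const SOURCE_CONDITIONS = ["bun", "workerd", "import", "default"];

await build({
  input: "src/handler.ts", // illustrative entry point
  resolve: {
    // Source-first: workspace subpath imports like alchemy/Stack resolve
    // to src/*.ts instead of the unbuilt lib/*.js `import` target.
    conditionNames: SOURCE_CONDITIONS,
  },
});
```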

S3 SlowDown — patched in distilled

S3 503 SlowDown was previously surfaced as UnknownAwsError, which the blanket awsRetryFactory in Providers.ts couldn't recognize. Patched in distilled to classify it as ThrottlingError + RetryableError: alchemy-run/distilled#270. Once that lands, the existing 10-attempt jittered-exponential retry handles it transparently — no alchemy-side change needed.
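
For reference, a sketch of what that blanket retry plausibly looks like in Effect; the names are hypothetical, only the shape (jittered exponential backoff, 10-attempt cap) follows the description above:

```ts
import { Effect, Schedule } from "effect";

// Hypothetical reconstruction of awsRetryFactory's schedule: jittered
// exponential backoff capped at 10 attempts.
const awsRetrySchedule = Schedule.exponential("200 millis").pipe(
  Schedule.jittered,
  Schedule.intersect(Schedule.recurs(10)),
);

// Stand-in for distilled's category check once #270 lands.
declare const isThrottlingError: (error: unknown) => boolean;

const withAwsRetry = <A, E, R>(effect: Effect.Effect<A, E, R>) =>
  effect.pipe(
    Effect.retry({ while: isThrottlingError, schedule: awsRetrySchedule }),
  );
```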

SQS Queue reconcile: tolerate post-create eventual consistency

```ts
sqs.getQueueAttributes({ QueueUrl: queueUrl, AttributeNames: ["All"] }).pipe(
  Effect.retry({
    while: (e) => e._tag === "QueueDoesNotExist",
    schedule: Schedule.fixed(500).pipe(Schedule.both(Schedule.recurs(60))),
  }),
)
```

Same-tick GetQueueAttributes after CreateQueue can return QueueDoesNotExist for a few hundred ms.

DynamoDB Bindings test: bump readiness budget

75 -> 100 retries (150s -> 200s). Under parallel-suite load, the Bindings handler's /scan probe was returning 500 for the entire 150s window while the IAM scan policy propagated. Still inside the beforeAll 240s timeout.
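
The numbers imply a ~2s poll interval; a sketch of the probe loop under that assumption (the /scan endpoint comes from the test, the loop itself is illustrative):

```ts
// 100 attempts x 2s ≈ 200s, still inside the 240s beforeAll timeout.
const waitForScanReady = async (functionUrl: string): Promise<void> => {
  for (let attempt = 0; attempt < 100; attempt++) {
    const res = await fetch(`${functionUrl}/scan`);
    if (res.ok) return; // IAM scan policy is finally observable
    await new Promise((resolve) => setTimeout(resolve, 2_000));
  }
  throw new Error("/scan probe still returning 500 after 200s");
};
```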

Hyperdrive test: skip pending real Postgres origin

Cloudflare validates not just DNS but TCP-level connectivity at create time, so any placeholder host (db.example.com, example.com:5432, etc.) is rejected. Skipped both cases with a TODO(hyperdrive) to provision a Neon project as the fixture.

…create race, DynamoDB Bindings readiness, Hyperdrive skip

Bundle: when rolldown resolves `alchemy/Stack` (and other workspace
subpath imports) it picks the package.json `import` condition which
points at `lib/*.js`. On a fresh checkout `lib/` doesn't exist; rolldown
emits `[UNRESOLVED_IMPORT]`, treats the path as external, and the
deployed Lambda crashes on init — which surfaced as opaque "Function
URL returned 502" timeouts across SQS/SNS/DynamoDB/Kinesis Bindings and
the Lambda Function test. Default `resolve.conditionNames` now starts
with `bun`/`workerd` so we resolve straight to source `.ts`.

```ts
const SOURCE_CONDITIONS = ["bun", "workerd", "import", "default"];
```

S3 Assets: `PutObject`/`HeadObject` retry on the SlowDown 503 that
distilled surfaces as `UnknownAwsError { errorTag: "SlowDown" }`, with
a ~64s exponential budget. Without this, a busy account fails the
first Lambda upload of a parallel-suite run with `AssetsUploadError`.
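
Assuming a 500ms base, seven doublings land on that budget (0.5 + 1 + 2 + 4 + 8 + 16 + 32 = 63.5s); a sketch of the since-reverted shim under that assumption:

```ts
import { Effect, Schedule } from "effect";

// The predicate from Assets.ts (see the review thread below).
declare const isS3SlowDown: (error: unknown) => boolean;

// ~64s worst-case delay budget across 7 retries.
const retrySlowDown = <A, E, R>(effect: Effect.Effect<A, E, R>) =>
  effect.pipe(
    Effect.retry({
      while: isS3SlowDown,
      schedule: Schedule.exponential("500 millis").pipe(
        Schedule.intersect(Schedule.recurs(7)),
      ),
    }),
  );
```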

SQS Queue reconcile: post-create `getQueueAttributes` retries on
`QueueDoesNotExist` for ~30s. SQS is eventually consistent post-
`CreateQueue` and the same-tick attribute fetch can briefly miss.

DynamoDB Bindings test: bump readiness budget from 75 to 100 retries
(150s -> 200s) so IAM-policy propagation under parallel-suite load
doesn't trip the 500-from-/scan probe.

Hyperdrive test: skip both cases pending a real publicly-routable
postgres origin. Cloudflare validates not just DNS but TCP-level
connectivity at create time, so any placeholder host is rejected.

alchemy-version-bot Bot commented May 7, 2026

Install the packages built from this commit:

- alchemy: `bun add alchemy@https://pkg.ing/alchemy/d36404c`
- @alchemy.run/better-auth: `bun add @alchemy.run/better-auth@https://pkg.ing/@alchemy.run/better-auth/d36404c`
- @alchemy.run/pr-package: `bun add @alchemy.run/pr-package@https://pkg.ing/@alchemy.run/pr-package/d36404c`

Comment thread on packages/alchemy/src/AWS/Assets.ts, lines +13 to +26 (Outdated)
```ts
/**
 * S3 throttles `PutObject` per-prefix at modest QPS (the well-known
 * "503 SlowDown" / "Please reduce your request rate" response). Distilled
 * surfaces these as `UnknownAwsError` with `errorTag: "SlowDown"` because
 * they're not part of the official PutObject error model. They are always
 * safe to retry with backoff — the request never reached S3's storage
 * tier — and the same alchemy stack uploading lots of Lambda code in a
 * burst will trip this on a busy account.
 */
const isS3SlowDown = (error: unknown): boolean => {
  if (typeof error !== "object" || error === null) return false;
  const e = error as { _tag?: string; errorTag?: string };
  return e._tag === "UnknownAwsError" && e.errorTag === "SlowDown";
};
```
Contributor Author

this looks like something we need to patch in distilled for s3

Contributor Author

Patched in distilled — alchemy-run/distilled#270. Reverted the alchemy-side workaround in 90cf3d8; once distilled releases, the existing blanket awsRetryFactory will retry SlowDown via the new ThrottlingError + RetryableError categories.


alchemy-version-bot Bot commented May 7, 2026

Website Preview Deployed

URL: https://alchemyeffectwebsite-worker-pr-254-pf4dtuqvhbdmn6vb.testing-2b2.workers.dev

Built from commit cf5f8dd.


This comment updates automatically with each push.

… ThrottlingError now

Now that distilled@>=0.18.0 classifies S3 `SlowDown` as
`ThrottlingError + RetryableError` (alchemy-run/distilled#270), the
blanket `awsRetryFactory` in Providers.ts already retries it with
exponential backoff + jitter and a 10-attempt cap. The local retry on
`headObject`/`putObject` was a temporary shim — drop it.
…F/AWS transient classification, DynamoDB readiness rework, KV title uniqueness, EventSourceMapping IAM retry

Surfaced and fixed across ~20 back-to-back stress runs of the live suite.

Cloudflare Worker reconcile: retry `createScriptSubdomain` on
`WorkerNotFound` (CF code 10013). The script-put endpoint and the
subdomain endpoint are independently consistent — a same-tick toggle
right after `putWorker` briefly observes the script as missing. ~30s
budget; any other tag propagates immediately.
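
A sketch of that retry, following the same pattern as the SQS snippet above; treating the ~30s budget as 60 × 500ms is an assumption:

```ts
import { Effect, Schedule } from "effect";

// Stand-in for the distilled call named above; signature assumed.
declare const createScriptSubdomain: Effect.Effect<void, { readonly _tag: string }>;

const enableSubdomain = createScriptSubdomain.pipe(
  Effect.retry({
    while: (e) => e._tag === "WorkerNotFound", // CF code 10013
    schedule: Schedule.fixed("500 millis").pipe(
      Schedule.intersect(Schedule.recurs(60)), // ~30s budget
    ),
  }),
);
```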

AWS Providers retry: also retry on Effect `HttpClientError` whose
`reason._tag === "TransportError"` (or `EmptyBodyError`). These wrap
undici-level connection failures (`ConnectTimeoutError`, dropped
sockets) which distilled doesn't tag as `NetworkError` — so the
existing `isTransientError`/`isThrottlingError` predicate misses them.
The request never reached AWS at all; always safe to retry.
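
A sketch of the widened predicate, using the `isHttpClientError` guard mentioned in a later commit; the `reason._tag` shape is taken from the description, and exact types depend on the Effect/distilled versions in play:

```ts
import { isHttpClientError } from "@effect/platform/HttpClientError";

// Undici-level connection failures never reached AWS; safe to retry.
const isTransientTransport = (error: unknown): boolean => {
  if (!isHttpClientError(error)) return false;
  const reason = (error as { reason?: { _tag?: string } }).reason;
  return reason?._tag === "TransportError" || reason?._tag === "EmptyBodyError";
};
```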

Cloudflare Providers retry: also retry `CloudflareHttpError` 401/403/5xx.
This is the catch-all distilled raises when CF returns a non-JSON body
(HTML 520 pages, edge auth blips that produce a bare `Unauthorized`).
Not API-level permission failures; transient by construction.

EventSourceMapping IAM retry: include "The security token included in
the request is invalid" in `retryPermissionsPropagation`. Same root
cause as the messages already listed (STS lag for the freshly-attached
trust policy on Lambda's execution role); same recovery (wait + retry).
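
Illustratively, if `retryPermissionsPropagation` matches on message substrings, the change amounts to one more entry (only the quoted string comes from this commit; the rest is a placeholder):

```ts
const PROPAGATION_MESSAGES: readonly string[] = [
  // ...the messages already listed, plus:
  "The security token included in the request is invalid",
];

const isPermissionsPropagationLag = (message: string): boolean =>
  PROPAGATION_MESSAGES.some((marker) => message.includes(marker));
```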

DynamoDB Bindings test: split the readiness probe into a no-binding
`/ready` (Lambda is up) and a `/scan` warmup that retries until the
scan policy is observable from inside the live Lambda. The handler
now returns a structured 500 body via `Effect.catch` instead of
collapsing to Lambda's generic "Internal Server Error" through
`Effect.orDie`, so future failures are diagnosable.
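
A sketch of the split, assuming an Effect-style router inside the handler; the routes come from the description, everything else is illustrative:

```ts
import { Effect } from "effect";

// Stand-in for the DynamoDB binding call made by the handler.
declare const scanTable: Effect.Effect<unknown, { readonly _tag: string }>;

const route = (path: string) =>
  path === "/ready"
    ? Effect.succeed({ status: 200, body: "ok" }) // no binding exercised
    : scanTable.pipe(
        Effect.map(() => ({ status: 200, body: "ok" })),
        // Structured 500 instead of Effect.orDie collapsing everything
        // into Lambda's generic "Internal Server Error".
        Effect.catchAll((error) =>
          Effect.succeed({
            status: 500,
            body: JSON.stringify({ error: error._tag }),
          }),
        ),
      );
```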

SNS Bindings test: per-call retry on `TransientHttp` 5xx via a tagged
error. `/ready` doesn't exercise the SNS binding, so a successful
readiness probe doesn't imply `/publish` will work — IAM
`sns:Publish` propagation can lag the Lambda's auth cache. ~90s
per-call budget.
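
A sketch of the tagged-error retry; splitting the ~90s budget as 30 × 3s is an assumption:

```ts
import { Data, Effect, Schedule } from "effect";

class TransientHttp extends Data.TaggedError("TransientHttp")<{
  readonly status: number;
}> {}

// One attempt: a 5xx becomes a retryable tagged error.
const publishOnce = (functionUrl: string) =>
  Effect.tryPromise(() => fetch(`${functionUrl}/publish`, { method: "POST" })).pipe(
    Effect.filterOrFail(
      (res) => res.status < 500,
      (res) => new TransientHttp({ status: res.status }),
    ),
  );

// ~90s per-call budget while sns:Publish propagates.
const publish = (functionUrl: string) =>
  publishOnce(functionUrl).pipe(
    Effect.retry({
      while: (e) => e._tag === "TransientHttp",
      schedule: Schedule.fixed("3 seconds").pipe(
        Schedule.intersect(Schedule.recurs(30)),
      ),
    }),
  );
```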

KV Namespace test "create, update, delete": randomize titles per run.
KV titles are globally unique per account; a hard-coded title made the
test fragile under repeated stress runs (any prior interrupted run
left an orphan, blocking the rename in the next run with
`NamespaceTitleAlreadyExists`).
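
For example (prefix illustrative):

```ts
const title = `alchemy-kv-test-${crypto.randomUUID().slice(0, 8)}`;
```
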
…g; cover more failure modes

Switch from `error as { _tag?: string }` casts to either Effect's
`Predicate.isTagged` (for retry `while` predicates that don't need
field access) or `instanceof` against the actual exported error
classes (for predicates that need to read fields like `status` or
`code`). Cleaner reads, no escape-hatch casts.

  AWS Providers retry: use `isHttpClientError` from Effect and access
  `error.reason._tag` directly off the typed `HttpClientError`.

  Cloudflare Providers retry: import `Forbidden`, `TooManyRequests`,
  `Unauthorized` from `@distilled.cloud/core/errors` and
  `CloudflareHttpError` from `@distilled.cloud/cloudflare/Errors`;
  match with `instanceof`. Adds a bounded `Unauthorized` (CF code
  10000) retry — the same schedule's `recurs(8)` cap keeps a real
  invalid token surfacing within ~22s, rather than letting CF auth-
  edge blips fail tests outright.

  Cloudflare Worker `setWorkerSubdomain`: use
  `Predicate.isTagged("WorkerNotFound")`.

  Cloudflare QueueConsumer `isQueueHandlerMissing`: also match
  `BadRequest` (distilled 0.18 routes 4xx without a code-tagged
  match through the HTTP_STATUS_MAP, so code 11001 sometimes
  arrives as a `BadRequest` instead of `UnknownCloudflareError`).

  AWS SQS Queue: factor out `tolerateMissingQueue` and apply to
  every post-`CreateQueue` SQS call — `getQueueAttributes`,
  `setQueueAttributes`, `listQueueTags`, `tagQueue`. SQS's
  post-create eventual-consistency window can flake any of them,
  not just the first. Uses `Predicate.isTagged("QueueDoesNotExist")`.
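
A sketch of that helper as described; the name comes from the commit, the schedule reuses the earlier ~30s window:

```ts
import { Effect, Predicate, Schedule } from "effect";

const tolerateMissingQueue = <A, E, R>(effect: Effect.Effect<A, E, R>) =>
  effect.pipe(
    Effect.retry({
      while: Predicate.isTagged("QueueDoesNotExist"),
      schedule: Schedule.fixed("500 millis").pipe(
        Schedule.intersect(Schedule.recurs(60)), // ~30s window
      ),
    }),
  );

// e.g. tolerateMissingQueue(sqs.getQueueAttributes({ QueueUrl, AttributeNames: ["All"] }))
```
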
…rrors

Per review: predicates that take `error: unknown` and downcast are a
red flag. Two fixes:

  - QueueConsumer: inline the retry predicate where the error type is
    known from the upstream effect (createConsumer / updateConsumer).
    `e._tag === "UnknownCloudflareError" && e.code === 11001` and
    `e._tag === "BadRequest" && /…/.test(e.message)` typecheck without
    casts because TypeScript narrows on `_tag` against the inferred
    error union.

  - Cloudflare retry factory: the `while` callback is genuinely called
    with `unknown` (distilled wraps every API call), but we don't need
    `instanceof`. Use `Schema.is(<TaggedErrorClass>)` to derive a typed
    refinement guard:

      const isForbidden = Schema.is(Forbidden);
      const isUnauthorized = Schema.is(Unauthorized);
      const isCloudflareHttpError = Schema.is(CloudflareHttpError);
      const isTooManyRequests = Schema.is(TooManyRequests);

    The guards refine `unknown` to the typed instance, so subsequent
    field reads (`error.message`, `error.status`) typecheck cleanly.
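
Hypothetical use in the factory's `while` callback, reading fields through the refined types:

```ts
const isRetryableCloudflare = (error: unknown): boolean =>
  isTooManyRequests(error) ||
  isForbidden(error) ||
  isUnauthorized(error) || // bounded by the schedule's recurs(8) cap
  (isCloudflareHttpError(error) &&
    (error.status === 401 || error.status === 403 || error.status >= 500));
```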