test(harden): stabilize live suite — bundle resolve, S3 SlowDown retry, SQS post-create race #254
Open
sam-goodwin wants to merge 10 commits into
Conversation
…create race, DynamoDB Bindings readiness, Hyperdrive skip
Bundle: when rolldown resolves `alchemy/Stack` (and other workspace
subpath imports) it picks the package.json `import` condition which
points at `lib/*.js`. On a fresh checkout `lib/` doesn't exist; rolldown
emits `[UNRESOLVED_IMPORT]`, treats the path as external, and the
deployed Lambda crashes on init — which surfaced as opaque "Function
URL returned 502" timeouts across SQS/SNS/DynamoDB/Kinesis Bindings and
the Lambda Function test. Default `resolve.conditionNames` now starts
with `bun`/`workerd` so we resolve straight to source `.ts`.
```ts
const SOURCE_CONDITIONS = ["bun", "workerd", "import", "default"];
```
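For context, a minimal sketch of where those conditions plug in: rolldown accepts `resolve.conditionNames` in its input options. The entry path below is hypothetical; the real wiring lives in alchemy's bundle step.
```ts
import { build } from "rolldown";

const SOURCE_CONDITIONS = ["bun", "workerd", "import", "default"];

// With `bun`/`workerd` first, workspace subpath imports like
// `alchemy/Stack` resolve to their `.ts` source instead of the
// `import` condition's not-yet-built `lib/*.js`.
await build({
  input: "src/handler.ts", // hypothetical entry
  resolve: { conditionNames: SOURCE_CONDITIONS },
});
```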
S3 Assets: `PutObject`/`HeadObject` retry on the SlowDown 503, which distilled
surfaces as `UnknownAwsError { errorTag: "SlowDown" }`. ~64s exponential
budget. Without this, a busy account fails the first Lambda upload of a
parallel-suite run with `AssetsUploadError`.
SQS Queue reconcile: post-create `getQueueAttributes` retries on
`QueueDoesNotExist` for ~30s. SQS is eventually consistent
post-`CreateQueue` and the same-tick attribute fetch can briefly miss.
DynamoDB Bindings test: bump readiness budget from 75 to 100 retries
(150s -> 200s) so IAM-policy propagation under parallel-suite load
doesn't trip the 500-from-/scan probe.
Hyperdrive test: skip both cases pending a real publicly-routable
postgres origin. Cloudflare validates not just DNS but TCP-level
connectivity at create time, so any placeholder host is rejected.
Install the packages built from this commit:
- alchemy: `bun add alchemy@https://pkg.ing/alchemy/d36404c`
- @alchemy.run/better-auth: `bun add @alchemy.run/better-auth@https://pkg.ing/@alchemy.run/better-auth/d36404c`
- @alchemy.run/pr-package: `bun add @alchemy.run/pr-package@https://pkg.ing/@alchemy.run/pr-package/d36404c`
sam-goodwin commented on May 7, 2026
Comment on lines +13 to +26
```ts
/**
 * S3 throttles `PutObject` per-prefix at modest QPS (the well-known
 * "503 SlowDown" / "Please reduce your request rate" response). Distilled
 * surfaces these as `UnknownAwsError` with `errorTag: "SlowDown"` because
 * they're not part of the official PutObject error model. They are always
 * safe to retry with backoff — the request never reached S3's storage
 * tier — and the same alchemy stack uploading lots of Lambda code in a
 * burst will trip this on a busy account.
 */
const isS3SlowDown = (error: unknown): boolean => {
  if (typeof error !== "object" || error === null) return false;
  const e = error as { _tag?: string; errorTag?: string };
  return e._tag === "UnknownAwsError" && e.errorTag === "SlowDown";
};
```
sam-goodwin (Author):
this looks like something we need to patch in distilled for s3
sam-goodwin (Author):
Patched in distilled — alchemy-run/distilled#270. Reverted the alchemy-side workaround in 90cf3d8; once distilled releases, the existing blanket awsRetryFactory will retry SlowDown via the new ThrottlingError + RetryableError categories.
Website Preview Deployed
URL: https://alchemyeffectwebsite-worker-pr-254-pf4dtuqvhbdmn6vb.testing-2b2.workers.dev
Built from commit
This comment updates automatically with each push.
… ThrottlingError now

Now that distilled@>=0.18.0 classifies S3 `SlowDown` as `ThrottlingError + RetryableError` (alchemy-run/distilled#270), the blanket `awsRetryFactory` in Providers.ts already retries it with exponential backoff + jitter and a 10-attempt cap. The local retry on `headObject`/`putObject` was a temporary shim — drop it.
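For reference, the rough shape of that blanket policy, as a hedged sketch: the real `awsRetryFactory` lives in Providers.ts and its predicate comes from distilled's classification, so the names below are illustrative.
```ts
import { Effect, Schedule } from "effect";

// Stand-in for distilled's ThrottlingError/RetryableError classification.
declare const isRetryableAwsError: (error: unknown) => boolean;

// Exponential backoff with jitter, capped at 10 attempts.
const awsRetrySchedule = Schedule.exponential("200 millis").pipe(
  Schedule.jittered,
  Schedule.intersect(Schedule.recurs(10)),
);

const awsRetry = <A, E, R>(call: Effect.Effect<A, E, R>) =>
  call.pipe(
    Effect.retry({ schedule: awsRetrySchedule, while: isRetryableAwsError }),
  );
```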
…F/AWS transient classification, DynamoDB readiness rework, KV title uniqueness, EventSourceMapping IAM retry

Surfaced and fixed across ~20 back-to-back stress runs of the live suite.

Cloudflare Worker reconcile: retry `createScriptSubdomain` on `WorkerNotFound` (CF code 10013). The script-put endpoint and the subdomain endpoint are independently consistent — a same-tick toggle right after `putWorker` briefly observes the script as missing. ~30s budget; any other tag propagates immediately.

AWS Providers retry: also retry on Effect `HttpClientError` whose `reason._tag === "TransportError"` (or `EmptyBodyError`). These wrap undici-level connection failures (`ConnectTimeoutError`, dropped sockets) which distilled doesn't tag as `NetworkError` — so the existing `isTransientError`/`isThrottlingError` predicate misses them. The request never reached AWS at all; always safe to retry.

Cloudflare Providers retry: also retry `CloudflareHttpError` 401/403/5xx. This is the catch-all distilled raises when CF returns a non-JSON body (HTML 520 pages, edge auth blips that produce a bare `Unauthorized`). Not API-level permission failures; transient by construction.

EventSourceMapping IAM retry: include "The security token included in the request is invalid" in `retryPermissionsPropagation`. Same root cause as the messages already listed (STS lag for the freshly-attached trust policy on Lambda's execution role); same recovery (wait + retry).

DynamoDB Bindings test: split the readiness probe into a no-binding `/ready` (Lambda is up) and a `/scan` warmup that retries until the scan policy is observable from inside the live Lambda. The handler now returns a structured 500 body via `Effect.catch` instead of collapsing to Lambda's generic "Internal Server Error" through `Effect.orDie`, so future failures are diagnosable.

SNS Bindings test: per-call retry on `TransientHttp` 5xx via a tagged error. `/ready` doesn't exercise the SNS binding, so a successful readiness probe doesn't imply `/publish` will work — IAM `sns:Publish` propagation can lag the Lambda's auth cache. ~90s per-call budget.

KV Namespace test "create, update, delete": randomize titles per run. KV titles are globally unique per account; a hard-coded title made the test fragile under repeated stress runs (any prior interrupted run left an orphan, blocking the rename in the next run with `NamespaceTitleAlreadyExists`).
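A hedged sketch of the transport-blip predicate described above, written in the cast style this commit used (the follow-up commit below replaces such casts with typed guards). Field shapes follow the commit message; the exact tags, and whether `EmptyBodyError` sits under `reason`, depend on the pinned @effect/platform version.
```ts
// Matches Effect HttpClientError values wrapping undici-level connection
// failures that distilled doesn't tag as NetworkError. The request never
// reached AWS, so retrying is always safe.
const isTransportBlip = (error: unknown): boolean => {
  if (typeof error !== "object" || error === null) return false;
  const e = error as { reason?: { _tag?: string } };
  return (
    e.reason?._tag === "TransportError" || e.reason?._tag === "EmptyBodyError"
  );
};
```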
…g; cover more failure modes
Switch from `error as { _tag?: string }` casts to either Effect's
`Predicate.isTagged` (for retry `while` predicates that don't need
field access) or `instanceof` against the actual exported error
classes (for predicates that need to read fields like `status` or
`code`). Cleaner reads, no escape-hatch casts.
AWS Providers retry: use `isHttpClientError` from Effect and access
`error.reason._tag` directly off the typed `HttpClientError`.
Cloudflare Providers retry: import `Forbidden`, `TooManyRequests`,
`Unauthorized` from `@distilled.cloud/core/errors` and
`CloudflareHttpError` from `@distilled.cloud/cloudflare/Errors`;
match with `instanceof`. Adds a bounded `Unauthorized` (CF code
10000) retry — the same schedule's `recurs(8)` cap keeps a real
invalid token surfacing within ~22s, rather than letting CF auth-edge
blips fail tests outright.
Cloudflare Worker `setWorkerSubdomain`: use
`Predicate.isTagged("WorkerNotFound")`.
Cloudflare QueueConsumer `isQueueHandlerMissing`: also match
`BadRequest` (distilled 0.18 routes 4xx without a code-tagged
match through the HTTP_STATUS_MAP, so code 11001 sometimes
arrives as a `BadRequest` instead of `UnknownCloudflareError`).
AWS SQS Queue: factor out `tolerateMissingQueue` and apply to
every post-`CreateQueue` SQS call — `getQueueAttributes`,
`setQueueAttributes`, `listQueueTags`, `tagQueue`. SQS's
post-create eventual-consistency window can flake any of them,
not just the first. Uses `Predicate.isTagged("QueueDoesNotExist")`.
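A sketch of what `tolerateMissingQueue` plausibly looks like, assuming the ~30s budget from the earlier commit (the schedule details are illustrative, not the exact source):
```ts
import { Effect, Predicate, Schedule } from "effect";

// Retry any post-CreateQueue SQS call while the eventual-consistency
// window still reports the queue as missing; give up after ~30s.
const tolerateMissingQueue = <A, E, R>(call: Effect.Effect<A, E, R>) =>
  call.pipe(
    Effect.retry({
      while: Predicate.isTagged("QueueDoesNotExist"),
      schedule: Schedule.exponential("500 millis").pipe(
        Schedule.union(Schedule.spaced("5 seconds")), // cap per-retry delay at 5s
        Schedule.upTo("30 seconds"),
      ),
    }),
  );
```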
…rrors
Per review: predicates that take `error: unknown` and downcast are a
red flag. Two fixes:
- QueueConsumer: inline the retry predicate where the error type is
known from the upstream effect (createConsumer / updateConsumer).
`e._tag === "UnknownCloudflareError" && e.code === 11001` and
`e._tag === "BadRequest" && /…/.test(e.message)` typecheck without
casts because TypeScript narrows on `_tag` against the inferred
error union.
- Cloudflare retry factory: the `while` callback is genuinely called
with `unknown` (distilled wraps every API call), but we don't need
`instanceof`. Use `Schema.is(<TaggedErrorClass>)` to derive a typed
refinement guard:
```ts
import { Schema } from "effect";
import {
  Forbidden,
  TooManyRequests,
  Unauthorized,
} from "@distilled.cloud/core/errors";
import { CloudflareHttpError } from "@distilled.cloud/cloudflare/Errors";

const isForbidden = Schema.is(Forbidden);
const isUnauthorized = Schema.is(Unauthorized);
const isCloudflareHttpError = Schema.is(CloudflareHttpError);
const isTooManyRequests = Schema.is(TooManyRequests);
```
The guards refine `unknown` to the typed instance, so subsequent
field reads (`error.message`, `error.status`) typecheck cleanly.
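Continuing from the guards above, a minimal sketch of how they might compose in the retry factory's `while` predicate, per the 401/403/5xx policy from the earlier commit (the composition is illustrative, not the exact source):
```ts
const isRetryableCloudflareError = (error: unknown): boolean =>
  isTooManyRequests(error) ||
  isUnauthorized(error) ||
  isForbidden(error) ||
  // `status` reads typecheck because the guard refines to the error class.
  (isCloudflareHttpError(error) && error.status >= 500);
```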
…ransients; bump retry budget to ~60s
Stabilizes the live test suite — 3 consecutive clean runs (673 passed / 42 skipped) after these fixes. All but one of these failure modes show up only under parallel-suite load.
Bundle: resolve `alchemy/Stack` (and friends) without requiring `lib/`

Rolldown was picking the `import` condition in `alchemy`'s `package.json#exports`, which points at `./lib/*.js`. On a fresh checkout `lib/` doesn't exist; rolldown emits `[UNRESOLVED_IMPORT]`, treats the path as external, and the Lambda crashes on init. Surfaces as opaque "Function URL returned 502" timeouts across the SQS/SNS/DynamoDB/Kinesis Bindings tests and the Lambda Function test.

Resolves straight to source `.ts`. Rolldown handles TS natively, and every package exposing a `bun` source condition also publishes `src/`.

S3 SlowDown — patched in distilled

S3 503 `SlowDown` was previously surfaced as `UnknownAwsError`, which the blanket `awsRetryFactory` in `Providers.ts` couldn't recognize. Patched in distilled to classify it as `ThrottlingError + RetryableError`: alchemy-run/distilled#270. Once that lands, the existing 10-attempt jittered-exponential retry handles it transparently — no alchemy-side change needed.

SQS Queue reconcile: tolerate post-create eventual consistency

Same-tick `GetQueueAttributes` after `CreateQueue` can return `QueueDoesNotExist` for a few hundred ms.

DynamoDB Bindings test: bump readiness budget

75 -> 100 retries (150s -> 200s). The Bindings handler's `/scan` probe was returning 500 the entire 150s window under parallel-suite load while the IAM scan policy propagated. Still inside the `beforeAll` 240s timeout.

Hyperdrive test: skip pending real Postgres origin

Cloudflare validates not just DNS but TCP-level connectivity at create time, so any placeholder host (`db.example.com`, `example.com:5432`, etc.) is rejected. Skipped both cases with a `TODO(hyperdrive)` to provision a Neon project as the fixture.