@effect/cluster SQL storage can stall after runner transition while runners stay healthy #6248

## Bug Report

### Versions

- `effect`: `3.21.2`
- `@effect/cluster`: `0.58.2`
- `@effect/workflow`: `0.18.1`
- `@effect/platform-node`: `0.106.0`
- Runtime: Node.js
- Storage: PostgreSQL
- Cluster layer: `NodeClusterSocket.layer({ storage: "sql" })`

### Description

We observed a production incident where SQL-backed workflow consumption stopped after a worker runner transition.

The worker processes stayed alive and the runners continued to appear healthy, but rows in `cluster_messages` stopped being consumed. Existing workflow messages remained unprocessed and `last_read` stopped advancing.

The relevant log emitted by `@effect/cluster` was:

```text
Could not find entity manager for address, retrying
```

The affected entity type was a workflow entity. Entity names and IDs are omitted because they come from a private application.

### Observed Behavior

After a rolling deployment / scale-out event:

- New runners registered successfully.
- Runner health checks kept reporting healthy.
- The runner socket started listening.
- `cluster_messages` contained unprocessed due messages.
- `cluster_messages.last_read` stopped advancing.
- Worker-level health checks stayed green because the process was alive.
- No workflow handlers were invoked after the stall.

This created a state where the cluster looked healthy externally, but SQL message consumption was effectively dead.

### Expected Behavior

A runner should not remain healthy while its SQL storage read loop is no longer making progress.

Also, if the storage read loop sees a message for an entity type whose entity manager is not registered yet, it should not permanently reserve or stall that message.

Any of the following alternatives would be acceptable:

- do not read/reserve messages until the relevant entity type is registered;
- leave `last_read` untouched for messages whose entity manager does not exist yet;
- fail the runner if it cannot recover from this state;
- expose a cluster-level health signal when SQL message consumption is stalled.
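To make the last option concrete, here is a minimal sketch of what such a signal could compute. All names (`DrainSnapshot`, `evaluateDrainHealth`, `maxStallMs`) are hypothetical and are not part of `@effect/cluster`; the sketch assumes the runner can report how many due, unprocessed messages exist and when it last made consumption progress.

```typescript
// Hypothetical shape: none of these names come from @effect/cluster.
interface DrainSnapshot {
  // Number of due, unprocessed rows visible to this runner.
  readonly dueUnprocessed: number
  // Epoch milliseconds of the last observed consumption progress.
  readonly lastProgressAt: number
}

type DrainHealth = "healthy" | "stalled"

function evaluateDrainHealth(
  snapshot: DrainSnapshot,
  now: number,
  maxStallMs: number
): DrainHealth {
  // No backlog means there is nothing to drain, so idleness is not a stall.
  if (snapshot.dueUnprocessed === 0) return "healthy"
  // Backlog exists: report a stall once no progress was made for too long.
  return now - snapshot.lastProgressAt > maxStallMs ? "stalled" : "healthy"
}
```

A worker-level health check could then fail when this returns `"stalled"`, instead of only checking that the process is alive.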

### Why This Looks Like a Cluster Lifecycle Issue

The cluster runner can start listening and begin SQL storage polling before all workflow entities are registered.

With existing backlog during a runner transition, the storage read loop can encounter messages for workflow entity types before their entity managers are available.

The log line quoted in the description (`Could not find entity manager for address, retrying`) comes from `Sharding.ts`.

After that, the runner can keep reporting healthy even though message consumption no longer progresses.

### Related Issue

This looks related to:

- https://github.com/Effect-TS/effect/issues/6155

That issue also describes SQL-backed cluster runners that are registered/healthy but non-functional after runner transitions.

### Suggested Fix Direction

The SQL storage read loop should avoid making durable progress on messages that cannot be dispatched to an entity manager.

Potential fix directions:

- delay SQL polling until entity registration has completed;
- do not update `last_read` for messages whose `entityType` is not registered;
- synchronously release/reset messages that were read before an entity manager existed;
- reopen the storage read latch when `registerEntity` completes;
- expose cluster storage-drain health so applications can fail fast.
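Two of these directions (not updating `last_read` for unregistered entity types, and re-polling when `registerEntity` completes) can be illustrated with a small in-memory model. This is only a sketch of the intended behavior, not the `@effect/cluster` implementation: `StoredMessage`, `RunnerModel`, and `poll` are hypothetical names, and `lastRead` merely stands in for `cluster_messages.last_read`.

```typescript
// Hypothetical in-memory model; names are illustrative only.
interface StoredMessage {
  readonly id: number
  readonly entityType: string
  lastRead: number | null // stands in for cluster_messages.last_read
  processed: boolean
}

type Handler = (message: StoredMessage) => void

class RunnerModel {
  private readonly storage: StoredMessage[]
  private readonly managers = new Map<string, Handler>()

  constructor(storage: StoredMessage[]) {
    this.storage = storage
  }

  registerEntity(entityType: string, handler: Handler, now: number): void {
    this.managers.set(entityType, handler)
    // Re-poll right away so backlog that was skipped earlier is picked up.
    this.poll(now)
  }

  // Returns the number of messages dispatched in this cycle.
  poll(now: number): number {
    let dispatched = 0
    for (const message of this.storage) {
      if (message.processed) continue
      const handler = this.managers.get(message.entityType)
      // No entity manager yet: leave the row untouched, including lastRead.
      if (handler === undefined) continue
      message.lastRead = now
      handler(message)
      message.processed = true
      dispatched++
    }
    return dispatched
  }
}
```

In this model, a poll that runs before registration makes no durable change to the message, so a later registration can still dispatch it.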

### Suggested Regression Test

A regression test could:

1. Seed SQL storage with an unprocessed message for a clustered entity.
2. Start a runner with SQL storage.
3. Delay registration of the entity manager.
4. Assert that the message is not left stuck with a stale `last_read`.
5. Register the entity.
6. Assert that the message is eventually consumed.
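The steps above could be sketched as follows. `RunnerUnderTest` is a hypothetical test seam, not an existing API; a real test would back it with SQL storage and an actual runner. The sketch takes the strictest reading of step 4 (that `last_read` stays unset before registration), which is an assumption. `makeFakeRunner` is an in-memory stand-in so the sketch runs on its own, and its `reserveBeforeLookup` flag models only one guess at the failure mode.

```typescript
// Hypothetical test seam; a real test would use SQL storage and a real runner.
interface RunnerUnderTest {
  seed(entityType: string): number // step 1: returns the message id
  pollOnce(now: number): void // one storage read cycle
  register(entityType: string): void // step 5
  lastRead(id: number): number | null
  isProcessed(id: number): boolean
}

// Runs steps 1 to 6 and returns a list of failed assertions.
function checkDelayedRegistration(runner: RunnerUnderTest): string[] {
  const failures: string[] = []
  const id = runner.seed("Workflow")
  runner.pollOnce(1000) // steps 2-3: polling starts, entity not registered
  if (runner.lastRead(id) !== null) {
    failures.push("last_read was set before registration") // step 4
  }
  runner.register("Workflow")
  runner.pollOnce(2000)
  if (!runner.isProcessed(id)) {
    failures.push("message not consumed after registration") // step 6
  }
  return failures
}

interface Row {
  entityType: string
  lastRead: number | null
  processed: boolean
}

// In-memory fake. With reserveBeforeLookup = true it stamps lastRead before
// checking for an entity manager and never re-reads a stamped row.
function makeFakeRunner(reserveBeforeLookup: boolean): RunnerUnderTest {
  const rows: Row[] = []
  const registered = new Set<string>()
  return {
    seed(entityType: string): number {
      rows.push({ entityType, lastRead: null, processed: false })
      return rows.length - 1
    },
    pollOnce(now: number): void {
      for (const row of rows) {
        if (row.processed) continue
        if (reserveBeforeLookup) {
          if (row.lastRead !== null) continue // already reserved
          row.lastRead = now
        }
        if (!registered.has(row.entityType)) continue
        row.lastRead = now
        row.processed = true
      }
    },
    register(entityType: string): void {
      registered.add(entityType)
    },
    lastRead(id: number): number | null {
      return rows[id]?.lastRead ?? null
    },
    isProcessed(id: number): boolean {
      return rows[id]?.processed ?? false
    }
  }
}
```

Running `checkDelayedRegistration(makeFakeRunner(false))` reports no failures, while the variant with `reserveBeforeLookup` set to `true` fails both assertions.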
