@effect/cluster SQL storage can stall after runner transition while runners stay healthy #6248

@AMoreaux

Bug Report

Versions

  • effect: 3.21.2
  • @effect/cluster: 0.58.2
  • @effect/workflow: 0.18.1
  • @effect/platform-node: 0.106.0
  • Runtime: Node.js
  • Storage: PostgreSQL
  • Cluster layer: NodeClusterSocket.layer({ storage: "sql" })

Description

We observed a production incident where SQL-backed workflow consumption stopped after a worker runner transition.

The worker processes stayed alive and the runners continued to appear healthy, but messages in cluster_messages stopped being consumed. Existing workflow messages remained unprocessed and last_read stopped advancing.

The relevant log emitted by @effect/cluster was:

Could not find entity manager for address, retrying

The affected entity type was a workflow entity. Entity names and IDs are omitted because they come from a private application.

Observed Behavior

After a rolling deployment / scale-out event:

  • New runners registered successfully.
  • Runner health status stayed healthy.
  • The runner socket started listening.
  • cluster_messages contained unprocessed due messages.
  • cluster_messages.last_read stopped advancing.
  • Worker-level health checks stayed green because the process was alive.
  • No workflow handlers were invoked after the stall.

This created a state where the cluster looked healthy externally, but SQL message consumption was effectively dead.

Expected Behavior

A runner should not remain healthy while its SQL storage read loop is no longer making progress.

Also, if the storage read loop sees a message for an entity type whose entity manager is not registered yet, it should not permanently reserve or stall that message.

Any of the following alternatives would be acceptable:

  • do not read/reserve messages until the relevant entity type is registered;
  • leave last_read untouched for messages whose entity manager does not exist yet;
  • fail the runner if it cannot recover from this state;
  • expose a cluster-level health signal when SQL message consumption is stalled.
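
To illustrate the last alternative, here is a minimal sketch of a drain-stall check an application could run on top of its own periodic queries against cluster_messages. Everything in it is hypothetical and not part of @effect/cluster: the `DrainSample` shape, the `isDrainStalled` function, and the `maxIdleMs` threshold are assumptions, and the sample values are assumed to come from application-side aggregates (a count of due unprocessed rows and the newest last_read).

```typescript
// Hypothetical snapshot of SQL storage state, gathered by the application.
// None of these names exist in @effect/cluster.
interface DrainSample {
  readonly takenAtMs: number
  // Number of due, unprocessed rows in cluster_messages at sample time.
  readonly dueUnprocessed: number
  // Newest last_read value as epoch milliseconds, or null if nothing was read.
  readonly newestLastReadMs: number | null
}

// Reports a stall when a backlog exists across the whole window,
// yet neither last_read advanced nor the backlog shrank.
function isDrainStalled(
  prev: DrainSample,
  curr: DrainSample,
  maxIdleMs: number
): boolean {
  const hasBacklog = prev.dueUnprocessed > 0 && curr.dueUnprocessed > 0
  const windowElapsed = curr.takenAtMs - prev.takenAtMs >= maxIdleMs
  const lastReadAdvanced =
    (curr.newestLastReadMs ?? 0) > (prev.newestLastReadMs ?? 0)
  const backlogShrank = curr.dueUnprocessed < prev.dueUnprocessed
  return hasBacklog && windowElapsed && !lastReadAdvanced && !backlogShrank
}
```

A worker-level health check could fail when this returns true, so the orchestrator restarts a runner whose process is alive but whose storage read loop has stopped making progress.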

Why This Looks Like a Cluster Lifecycle Issue

The cluster runner can start listening and begin SQL storage polling before all workflow entities are registered.

With existing backlog during a runner transition, the storage read loop can encounter messages for workflow entity types before their entity managers are available.

The log comes from Sharding.ts:

Could not find entity manager for address, retrying

After that, the runner may still remain healthy while message consumption no longer progresses.

Related Issue

This looks related to:

That issue also describes SQL-backed cluster runners that are registered/healthy but non-functional after runner transitions.

Suggested Fix Direction

The SQL storage read loop should avoid making durable progress on messages that cannot be dispatched to an entity manager.

Potential fix directions:

  • delay SQL polling until entity registration has completed;
  • do not update last_read for messages whose entityType is not registered;
  • synchronously release/reset messages that were read before an entity manager existed;
  • reopen the storage read latch when registerEntity completes;
  • expose cluster storage-drain health so applications can fail fast.
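
As a rough illustration of the first direction (delaying SQL polling until entity registration has completed), the sketch below gates the first storage read behind a registration barrier. `RegistrationGate`, `startPolling`, and `pollOnce` are hypothetical names introduced for this example. They are not @effect/cluster APIs and do not reflect how Sharding.ts is actually structured.

```typescript
// Hypothetical helper; not an @effect/cluster API.
class RegistrationGate {
  private readonly pending: Set<string>
  private readonly waiters: Array<() => void> = []

  constructor(expectedEntityTypes: ReadonlyArray<string>) {
    this.pending = new Set(expectedEntityTypes)
  }

  // Call once the entity manager for `entityType` is available.
  markRegistered(entityType: string): void {
    this.pending.delete(entityType)
    if (this.pending.size === 0) {
      // Wake every poller that was waiting for registration to finish.
      for (const resume of this.waiters.splice(0)) resume()
    }
  }

  isReady(): boolean {
    return this.pending.size === 0
  }

  // Resolves immediately if ready, otherwise after the last registration.
  whenReady(): Promise<void> {
    if (this.isReady()) return Promise.resolve()
    return new Promise<void>((resolve) => {
      this.waiters.push(resolve)
    })
  }
}

// A polling loop would await the gate before its first storage read.
async function startPolling(
  gate: RegistrationGate,
  pollOnce: () => void
): Promise<void> {
  await gate.whenReady()
  pollOnce()
}
```

This approach requires knowing the expected entity types up front, which may not hold for every application. The other directions in the list avoid that requirement.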

Suggested Regression Test

A regression test could:

  1. Seed SQL storage with an unprocessed message for a clustered entity.
  2. Start a runner with SQL storage.
  3. Delay registration of the entity manager.
  4. Assert that the message is not left stuck with stale last_read.
  5. Register the entity.
  6. Assert that the message is eventually consumed.
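
The steps above can be sketched against an in-memory stand-in. This is only a model of the desired behavior, assuming the "do not update last_read for unregistered entity types" direction. `FakeRunner`, `StoredMessage`, `runRegressionScenario`, and the entity type name are all hypothetical, and a real test would run against SQL storage and an actual runner.

```typescript
// Hypothetical in-memory stand-ins; not @effect/cluster types.
interface StoredMessage {
  readonly id: number
  readonly entityType: string
  lastRead: number | null
  processed: boolean
}

type Handler = (message: StoredMessage) => void

class FakeRunner {
  private readonly messages: StoredMessage[]
  private readonly handlers = new Map<string, Handler>()

  constructor(messages: StoredMessage[]) {
    this.messages = messages
  }

  registerEntity(entityType: string, handler: Handler): void {
    this.handlers.set(entityType, handler)
  }

  // One storage read pass; returns the ids it dispatched.
  poll(nowMs: number): number[] {
    const dispatched: number[] = []
    for (const message of this.messages) {
      if (message.processed) continue
      const handler = this.handlers.get(message.entityType)
      // No entity manager yet: skip without touching lastRead.
      if (handler === undefined) continue
      message.lastRead = nowMs
      handler(message)
      message.processed = true
      dispatched.push(message.id)
    }
    return dispatched
  }
}

function runRegressionScenario() {
  // 1. Seed storage with an unprocessed message.
  const message: StoredMessage = {
    id: 1,
    entityType: "ExampleWorkflow",
    lastRead: null,
    processed: false
  }
  // 2. Start a runner over that storage.
  const runner = new FakeRunner([message])
  // 3. Registration is delayed: poll before any entity is registered.
  const beforeRegistration = runner.poll(1_000)
  // 4. Capture lastRead to check that it was not made stale.
  const lastReadWhileUnregistered = message.lastRead
  // 5. Register the entity.
  const handled: number[] = []
  runner.registerEntity("ExampleWorkflow", (m) => {
    handled.push(m.id)
  })
  // 6. The next pass should consume the message.
  const afterRegistration = runner.poll(2_000)
  return { beforeRegistration, lastReadWhileUnregistered, afterRegistration, handled }
}
```

In a real regression test, steps 4 and 6 would query cluster_messages directly rather than inspect an in-memory object.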
