Skip to content

fix(replication): stop connecting to removed nodes after remove_node#220

Merged
kriszyp merged 1 commit into
mainfrom
kris/jovial-mccarthy-062ae3
May 23, 2026
Merged

fix(replication): stop connecting to removed nodes after remove_node#220
kriszyp merged 1 commit into
mainfrom
kris/jovial-mccarthy-062ae3

Conversation

@kriszyp
Copy link
Copy Markdown
Member

@kriszyp kriszyp commented May 22, 2026

Fixes #189.

Changes

Three related bugs allowed replication to keep retrying connections to a removed node's hostname indefinitely.

connect() missing early-return guard (replicationConnection.ts): connect() did not check intentionallyUnsubscribed at entry. If a retry timer was already scheduled when unsubscribe() ran, the timer would fire, open a new socket, and the node kept connecting. Now connect() returns immediately when intentionallyUnsubscribed is true.

once('finished') race in getSubscriptionConnection (replicator.ts): The cleanup handler unconditionally deleted dbConnections[dbName]. When a node is removed and re-added with the same URL, the old connection's deferred 'finished' event deleted the new connection's entry — making future unsubscribeFromNode() calls unable to find and stop it. Fixed with an identity check (dbConnections.get(dbName) === connection) before deleting.

nodeMap not cleaned up on delete (subscriptionManager.ts): onNodeUpdate's delete path never called nodeMap.delete(hostname), so removed nodes accumulated in nodeMap indefinitely.

Also added null-safety to unsubscribe() (?.close instead of .close) in case it's called before the first connect().

Test

New test in replicationReconnect.test.mjs: kill peer → remove_node → restart peer → verify node1 does not reconnect. Without the connect() guard, node1 would reconnect once node0 is reachable again; with the fix the retry loop is permanently halted.

🤖 Generated by Claude Sonnet 4.6 on behalf of Kris.

Three related bugs caused replication to keep trying to connect to a
removed node's hostname indefinitely after remove_node:

1. NodeReplicationConnection.connect() did not check
   intentionallyUnsubscribed at entry. If a retry timer fired after
   unsubscribe() was called, connect() would open a new socket to the
   removed node, reset the close handler's intentional flag, and loop
   again. Now connect() returns immediately when
   intentionallyUnsubscribed is true.

2. The once('finished') cleanup in getSubscriptionConnection deleted
   dbConnections[dbName] unconditionally. When a node is removed and
   re-added (same URL), the old connection's deferred 'finished' event
   would delete the new connection's entry, making future
   unsubscribeFromNode() calls unable to find and stop it. Fixed with
   an identity check before deleting.

3. onNodeUpdate's delete path never called nodeMap.delete(hostname), so
   removed nodes accumulated in nodeMap indefinitely.

Also added socket null-safety in unsubscribe() (?.close instead of .close)
for the edge case where unsubscribe() is called before connect() runs.

Adds integration test: kill peer → remove_node → restart peer → verify
node1 does NOT reconnect, directly exercising the retry-loop halt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kriszyp kriszyp requested a review from heskew May 22, 2026 15:42
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 22, 2026

Reviewed; no blockers found.

@kriszyp kriszyp marked this pull request as ready for review May 22, 2026 15:50
@kriszyp kriszyp requested a review from a team as a code owner May 22, 2026 15:50
@kriszyp kriszyp requested a review from cb1kenobi May 22, 2026 20:01
@kriszyp kriszyp merged commit 5eb7b7f into main May 23, 2026
23 checks passed
@kriszyp kriszyp deleted the kris/jovial-mccarthy-062ae3 branch May 23, 2026 03:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cluster continues connecting to removed nodes (stale connection state after remove_node)

2 participants