bitswap/bsnet: don't mark peer unresponsive on a single send failure#1164
Open
D4ryl00 wants to merge 4 commits into
Open
bitswap/bsnet: don't mark peer unresponsive on a single send failure#1164D4ryl00 wants to merge 4 commits into
D4ryl00 wants to merge 4 commits into
Conversation
…failure streamMessageSender.send() marked a peer unresponsive on every failed attempt to open a stream or send a message. send() is called from multiAttempt(), which already resets the stream, retries, and marks the peer unresponsive only after all retries are exhausted, so the per-attempt marking is both redundant and premature. It is harmful after a connection bounce (a disconnect quickly followed by a reconnect): the inbound disconnect notification is suppressed because the peer is already connected again, so the ConnectEventManager stays in the responsive state and the reconnect Connected event is a no-op. A single send failure on the stale stream then transitions the peer to unresponsive and emits PeerDisconnected, sidelining a peer whose connection is in fact live. There is no recovery path: returning to responsive requires an incoming message (OnMessage), which never arrives because we have stopped sending our wantlist. Fetches from that peer hang until it fully disconnects. Rely on multiAttempt() to mark the peer unresponsive after retries fail. Signed-off-by: D4ryl00 <d4ryl00@gmail.com>
Signed-off-by: D4ryl00 <d4ryl00@gmail.com>
… peer unresponsive Regression test for the reconnect hang: a recoverable send error (failure on the first attempt, success on retry) must not emit a PeerDisconnected, i.e. must not mark the still-connected peer unresponsive. Signed-off-by: D4ryl00 <d4ryl00@gmail.com>
D4ryl00
added a commit
to D4ryl00/go-orbit-db
that referenced
this pull request
Jun 2, 2026
boxo v0.34+ marks a peer unresponsive on a single bitswap send failure, which sidelines a peer that reconnected (its disconnect being suppressed) with no recovery path, hanging replication after a peer drops and reconnects. This caused TestReplicateAutomatically to hang intermittently under load. Point boxo at a fork branch carrying the one-line fix (based on v0.39.0, the version kubo v0.41 pins). Drop this replace once the upstream fix is released. Upstream issue: ipfs/boxo#1163 Upstream PR: ipfs/boxo#1164 Signed-off-by: D4ryl00 <d4ryl00@gmail.com>
gammazero
reviewed
Jun 2, 2026
gammazero
reviewed
Jun 2, 2026
…d unresponsive Signed-off-by: D4ryl00 <d4ryl00@gmail.com>
gammazero
approved these changes
Jun 3, 2026
Contributor
|
@lidel Final check before merge? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #1163
Problem
Since v0.34.0, a Bitswap block fetch can hang forever after a peer disconnects and quickly reconnects.
streamMessageSender.send()marks the peer unresponsive on every failed attempt to open a stream or send a message. Butsend()is called frommultiAttempt(), which already resets the stream, retries, and marks the peer unresponsive only after all retries are exhausted — so the per-attempt marking is redundant and premature.After a connection bounce, the inbound disconnect is suppressed (the peer is already reconnected, so the bsnet notifiee's
Disconnectedreturns early), leaving theConnectEventManagerinresponsiveand making the reconnectConnecteda no-op. A single send failure on the stale stream then marks the still-connected peerunresponsiveand emitsPeerDisconnected, with no recovery path (returning toresponsiveneeds an inboundOnMessagethat never comes because we stop sending our wantlist). The fetch hangs until the peer fully disconnects.Fix
Drop the two premature
MarkUnresponsivecalls insend()and rely onmultiAttempt()to mark the peer unresponsive once retries are exhausted.Testing
go test ./bitswap/network/...passes.