fixes for race conditions on disconnects #249
ryanofsky wants to merge 6 commits into bitcoin-core:master
Conversation
Add test for race condition in makeThread that can currently trigger segfaults as reported: bitcoin/bitcoin#34711 bitcoin/bitcoin#34756 The test currently crashes and will be fixed in the next commit. Co-authored-by: Ryan Ofsky <ryan@ofsky.org> git-bisect-skip: yes
This fixes a race condition in makeThread that can currently trigger segfaults as reported: bitcoin/bitcoin#34711 bitcoin/bitcoin#34756 The bug can be reproduced by running the unit test added in the previous commit or by calling makeThread and immediately disconnecting or destroying the returned thread. The bug is not new and has existed since makeThread was implemented, but it was found due to a new functional test in bitcoin core and with antithesis testing (see details in linked issues). The fix was originally posted in bitcoin/bitcoin#34711 (comment)
Add test for disconnect race condition in the mp.Context PassField() overload that can currently trigger segfaults as reported in bitcoin/bitcoin#34777. The test currently crashes and will be fixed in the next commit. Co-authored-by: Ryan Ofsky <ryan@ofsky.org> git-bisect-skip: yes
This fixes a race condition in the mp.Context PassField() overload (used to execute async requests) that can currently trigger segfaults, as reported in bitcoin/bitcoin#34777, when it calls call_context.getParams() after a disconnect. The bug can be reproduced by running the unit test added in the previous commit and was also seen in antithesis (see details in the linked issue), but should be unlikely to happen normally because PassField checks for cancellation and returns early before actually using the getParams() result. This bug was introduced in commit 0174450, which started to cancel requests on disconnects. Before that commit, requests would continue to execute after a disconnect and it was ok to call getParams(). This fix was originally posted in bitcoin/bitcoin#34777 (comment)
Add test for a disconnect race condition in the mp.Context PassField() overload reported in bitcoin/bitcoin#34782. The test currently crashes with AddressSanitizer, but will be fixed in the next commit. It's also possible to reproduce the bug without AddressSanitizer by adding an assert:

```diff
--- a/include/mp/type-context.h
+++ b/include/mp/type-context.h
@@ -101,2 +101,3 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
     server_context.cancel_lock = &cancel_lock;
+    KJ_DEFER(server_context.cancel_lock = nullptr);
     server.m_context.loop->sync([&] {
@@ -111,2 +112,3 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
     MP_LOG(*server.m_context.loop, Log::Info) << "IPC server request #" << req << " canceled while executing.";
+    assert(server_context.cancel_lock);
     // Lock cancel_mutex here to block the event loop
```

Co-authored-by: Ryan Ofsky <ryan@ofsky.org> git-bisect-skip: yes
This fixes a race condition in the mp.Context PassField() overload (used to execute async requests) that can currently trigger segfaults, as reported in bitcoin/bitcoin#34782, when a cancellation happens after the request executes but before it returns. The bug can be reproduced by running the unit test added in the previous commit and was also seen in antithesis (see details in the linked issue), but should be unlikely to happen normally because the cancellation would have to happen in a very short window for there to be a problem. This bug was introduced in commit 0174450, which started to cancel requests on disconnects. Before that commit, no cancellation callback was present. This fix was originally posted in bitcoin/bitcoin#34782 (comment)
Closing this, replaced by #250!

Updated 9536b63 -> 884c846. Updated 884c846 -> 2fb97e8.
ACK 2fb97e8 I checked that the fixes themselves are still the same as when I last looked (#250 (comment)). Since the tests here precede their fixes, it was also easy to confirm that each test actually caught something and the fix made it go away. I also lightly checked that they failed for the right reasons. The improved test code looks good to me as well. CI passed on Bitcoin Core. Our new TSan Bitcoin Core job passed too on #257, but it might be good to manually restart it a couple of times. I'll let the TSan job run locally for a while to see if it finds anything.
```cpp
//
// The test works by using the `makethread` hook to start a disconnect as
// soon as ProxyServer<ThreadMap>::makeThread is called, and using the
// `makethread_created` hook to sleep 100ms after the thread is created but
```
In 88cacd4 "test: worker thread destroyed before it is initialized": nit, it's only waiting 10ms
It's been running for well over an hour now without hitting anything.
Concept ACK |
lgtm
Changes make sense to me, and I'm not seeing the same failures anymore after testing this out a bit, though as Sjors mentioned, hitting them can be flaky.
There was a problem hiding this comment.
Code review ACK 2fb97e8
- 88cacd4 "test: worker thread destroyed before it is initialized"
- f09731e "race fix: m_on_cancel called after request finishes"
- 75c5425 "test: getParams() called after request cancel"
- e69b6bf "race fix: getParams() called after request cancel"
- 846a43a "test: m_on_cancel called after request finishes"
- 2fb97e8 "race fix: m_on_cancel called after request finishes"
The current approach of test hooks in the event loop, with branching at each call site, is a bit awkward imo.
A potentially better approach could be to have virtual notification methods in the event loop that are triggered at points in the code that tests are interested in.
In certain code paths, the event loop simply invokes those notifications unconditionally.
By default, the implementations of those methods are cheap no-ops: no branching, no null checks.
In tests, we then subclass EventLoop and override those methods to inject test-specific behaviour, keeping test concerns entirely out of production code.
see diff
diff --git a/include/mp/proxy-io.h b/include/mp/proxy-io.h
index 3594708..5174fe8 100644
--- a/include/mp/proxy-io.h
+++ b/include/mp/proxy-io.h
@@ -251,7 +251,7 @@ public:
LogFn{[old_callback = std::move(old_callback)](LogMessage log_data) {old_callback(log_data.level == Log::Raise, std::move(log_data.message));}},
context){}
- ~EventLoop();
+ virtual ~EventLoop();
//! Run event loop. Does not return until shutdown. This should only be
//! called once from the m_thread_id thread. This will block until
@@ -341,21 +341,22 @@ public:
//! External context pointer.
void* m_context;
- //! Hook called when ProxyServer<ThreadMap>::makeThread() is called.
- std::function<void()> testing_hook_makethread;
+ //! Virtual method called on the event loop thread when ProxyServer<ThreadMap>::makeThread()
+ //! is called to create a new remote worker thread.
+ virtual void onThreadCreate() { /* Nothing to notify by default. */ }
- //! Hook called on the worker thread inside makeThread(), after the thread
- //! context is set up and thread_context promise is fulfilled, but before it
- //! starts waiting for requests.
- std::function<void()> testing_hook_makethread_created;
+ //! Virtual method called on a new worker thread after the thread context
+ //! is set up and the thread_context promise is fulfilled, but before the
+ //! thread starts waiting for requests.
+ virtual void onThreadCreated() { /* Nothing to notify by default. */ }
- //! Hook called on the worker thread when it starts to execute an async
- //! request. Used by tests to control timing or inject behavior at this
- //! point in execution.
- std::function<void()> testing_hook_async_request_start;
+ //! Virtual method called on a worker thread when it starts to execute an
+ //! asynchronous request.
+ virtual void onAsyncRequestStart() { /* Nothing to notify by default. */ }
- //! Hook called on the worker thread just before returning results.
- std::function<void()> testing_hook_async_request_done;
+ //! Virtual method called on a worker thread after an asynchronous request
+ //! finishes executing, but before the response is sent.
+ virtual void onAsyncRequestDone() { /* Nothing to notify by default. */ }
};
//! Single element task queue used to handle recursive capnp calls. (If the
diff --git a/include/mp/type-context.h b/include/mp/type-context.h
index 9c7f21b..12218c3 100644
--- a/include/mp/type-context.h
+++ b/include/mp/type-context.h
@@ -73,7 +73,7 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
auto invoke = [self = kj::mv(self), call_context = kj::mv(server_context.call_context), &server, req, fn, args...](CancelMonitor& cancel_monitor) mutable {
MP_LOG(*server.m_context.loop, Log::Debug) << "IPC server executing request #" << req;
EventLoop& loop = *server.m_context.loop;
- if (loop.testing_hook_async_request_start) loop.testing_hook_async_request_start();
+ loop.onAsyncRequestStart();
ServerContext server_context{server, call_context, req};
{
// Before invoking the function, store a reference to the
@@ -192,7 +192,7 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
}
// End of scope: if KJ_DEFER was reached, it runs here
}
- if (loop.testing_hook_async_request_done) loop.testing_hook_async_request_done();
+ loop.onAsyncRequestDone();
return call_context;
};
diff --git a/src/mp/proxy.cpp b/src/mp/proxy.cpp
index f36e19f..0c9d146 100644
--- a/src/mp/proxy.cpp
+++ b/src/mp/proxy.cpp
@@ -411,7 +411,7 @@ ProxyServer<ThreadMap>::ProxyServer(Connection& connection) : m_connection(conne
kj::Promise<void> ProxyServer<ThreadMap>::makeThread(MakeThreadContext context)
{
- if (m_connection.m_loop->testing_hook_makethread) m_connection.m_loop->testing_hook_makethread();
+ m_connection.m_loop->onThreadCreate();
const std::string from = context.getParams().getName();
std::promise<ThreadContext*> thread_context;
std::thread thread([&thread_context, from, this]() {
@@ -420,7 +420,7 @@ kj::Promise<void> ProxyServer<ThreadMap>::makeThread(MakeThreadContext context)
g_thread_context.waiter = std::make_unique<Waiter>();
Lock lock(g_thread_context.waiter->m_mutex);
thread_context.set_value(&g_thread_context);
- if (loop.testing_hook_makethread_created) loop.testing_hook_makethread_created();
+ loop.onThreadCreated();
// Wait for shutdown signal from ProxyServer<Thread> destructor (signal
// is just waiter getting set to null.)
g_thread_context.waiter->wait(lock, [] { return !g_thread_context.waiter; });
diff --git a/test/mp/test/test.cpp b/test/mp/test/test.cpp
index 4f71a55..d3dd6d8 100644
--- a/test/mp/test/test.cpp
+++ b/test/mp/test/test.cpp
@@ -60,6 +60,24 @@ static_assert(std::is_integral_v<decltype(kMP_MINOR_VERSION)>, "MP_MINOR_VERSION
* client_disconnect manually, but false allows testing more ProxyClient
* behavior and the "IPC client method called after disconnect" code path.
*/
+
+//! EventLoop subclass used by tests to override virtual methods and inject
+//! test-specific behavior.
+class TestEventLoop : public EventLoop
+{
+public:
+ using EventLoop::EventLoop;
+ std::function<void()> on_thread_create;
+ std::function<void()> on_thread_created;
+ std::function<void()> on_async_request_start;
+ std::function<void()> on_async_request_done;
+
+ void onThreadCreate() override { if (on_thread_create) on_thread_create(); }
+ void onThreadCreated() override { if (on_thread_created) on_thread_created(); }
+ void onAsyncRequestStart() override { if (on_async_request_start) on_async_request_start(); }
+ void onAsyncRequestDone() override { if (on_async_request_done) on_async_request_done(); }
+};
+
class TestSetup
{
public:
@@ -69,17 +87,19 @@ public:
std::promise<std::unique_ptr<ProxyClient<messages::FooInterface>>> client_promise;
std::unique_ptr<ProxyClient<messages::FooInterface>> client;
ProxyServer<messages::FooInterface>* server{nullptr};
+ TestEventLoop* test_loop{nullptr};
//! Thread variable should be after other struct members so the thread does
//! not start until the other members are initialized.
std::thread thread;
TestSetup(bool client_owns_connection = true)
- : thread{[&] {
- EventLoop loop("mptest", [](mp::LogMessage log) {
+ : thread{[&, client_owns_connection] {
+ TestEventLoop loop("mptest", [](mp::LogMessage log) {
// Info logs are not printed by default, but will be shown with `mptest --verbose`
KJ_LOG(INFO, log.level, log.message);
if (log.level == mp::Log::Raise) throw std::runtime_error(log.message);
});
+ test_loop = &loop;
auto pipe = loop.m_io_context.provider->newTwoWayPipe();
auto server_connection =
@@ -336,31 +356,31 @@ KJ_TEST("Worker thread destroyed before it is initialized")
// Regression test for bitcoin/bitcoin#34711, bitcoin/bitcoin#34756
// where worker thread is destroyed before it starts.
//
- // The test works by using the `makethread` hook to start a disconnect as
- // soon as ProxyServer<ThreadMap>::makeThread is called, and using the
- // `makethread_created` hook to sleep 100ms after the thread is created but
- // before it starts waiting, so without the bugfix,
+ // The test works by overriding the `onThreadCreate` method to start a
+ // disconnect as soon as ProxyServer<ThreadMap>::makeThread is called, and
+ // overriding the `onThreadCreated` method to sleep after the thread is
+ // created but before it starts waiting, so without the bugfix,
// ProxyServer<Thread>::~ProxyServer would run and destroy the waiter,
// causing a SIGSEGV in the worker thread after the sleep.
- TestSetup setup;
- ProxyClient<messages::FooInterface>* foo = setup.client.get();
- foo->initThreadMap();
- setup.server->m_impl->m_fn = [] {};
-
- EventLoop& loop = *setup.server->m_context.connection->m_loop;
- loop.testing_hook_makethread = [&] {
- setup.server_disconnect_later();
- };
- loop.testing_hook_makethread_created = [&] {
- std::this_thread::sleep_for(std::chrono::milliseconds(10));
- };
-
bool disconnected{false};
- try {
- foo->callFnAsync();
- } catch (const std::runtime_error& e) {
- KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
- disconnected = true;
+ {
+ TestSetup setup;
+ setup.test_loop->on_thread_create = [&] {
+ setup.server_disconnect_later();
+ };
+ setup.test_loop->on_thread_created = [&] {
+ std::this_thread::sleep_for(std::chrono::milliseconds(10));
+ };
+ ProxyClient<messages::FooInterface>* foo = setup.client.get();
+ foo->initThreadMap();
+ setup.server->m_impl->m_fn = [] {};
+
+ try {
+ foo->callFnAsync();
+ } catch (const std::runtime_error& e) {
+ KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
+ disconnected = true;
+ }
}
KJ_EXPECT(disconnected);
}
@@ -370,29 +390,31 @@ KJ_TEST("Calling async IPC method, with server disconnect racing the call")
// Regression test for bitcoin/bitcoin#34777 heap-use-after-free where
// an async request is canceled before it starts to execute.
//
- // Use testing_hook_async_request_start to trigger a disconnect from the
+ // Override onAsyncRequestStart to trigger a disconnect from the
// worker thread as soon as it begins to execute an async request. Without
// the bugfix, the worker thread would trigger a SIGSEGV after this by
// calling call_context.getParams().
- TestSetup setup;
- ProxyClient<messages::FooInterface>* foo = setup.client.get();
- foo->initThreadMap();
- setup.server->m_impl->m_fn = [] {};
-
- EventLoop& loop = *setup.server->m_context.connection->m_loop;
- loop.testing_hook_async_request_start = [&] {
- setup.server_disconnect();
- // Sleep is neccessary to let the event loop fully clean up after the
- // disconnect and trigger the SIGSEGV.
- std::this_thread::sleep_for(std::chrono::milliseconds(10));
- };
-
- try {
- foo->callFnAsync();
- KJ_EXPECT(false);
- } catch (const std::runtime_error& e) {
- KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
+ bool disconnected{false};
+ {
+ TestSetup setup;
+ setup.test_loop->on_async_request_start = [&] {
+ setup.server_disconnect();
+ // Sleep is neccessary to let the event loop fully clean up after the
+ // disconnect and trigger the SIGSEGV.
+ std::this_thread::sleep_for(std::chrono::milliseconds(10));
+ };
+ ProxyClient<messages::FooInterface>* foo = setup.client.get();
+ foo->initThreadMap();
+ setup.server->m_impl->m_fn = [] {};
+
+ try {
+ foo->callFnAsync();
+ } catch (const std::runtime_error& e) {
+ KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
+ disconnected = true;
+ }
}
+ KJ_EXPECT(disconnected);
}
KJ_TEST("Calling async IPC method, with server disconnect after cleanup")
@@ -401,27 +423,29 @@ KJ_TEST("Calling async IPC method, with server disconnect after cleanup")
// an async request is canceled after it finishes executing but before the
// response is sent.
//
- // Use testing_hook_async_request_done to trigger a disconnect from the
+ // Override onAsyncRequestDone to trigger a disconnect from the
// worker thread after it execute an async requests but before it returns.
// Without the bugfix, the m_on_cancel callback would be called at this
// point accessing the cancel_mutex stack variable that had gone out of
// scope.
- TestSetup setup;
- ProxyClient<messages::FooInterface>* foo = setup.client.get();
- foo->initThreadMap();
- setup.server->m_impl->m_fn = [] {};
-
- EventLoop& loop = *setup.server->m_context.connection->m_loop;
- loop.testing_hook_async_request_done = [&] {
- setup.server_disconnect();
- };
-
- try {
- foo->callFnAsync();
- KJ_EXPECT(false);
- } catch (const std::runtime_error& e) {
- KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
+ bool disconnected{false};
+ {
+ TestSetup setup;
+ setup.test_loop->on_async_request_done = [&] {
+ setup.server_disconnect();
+ };
+ ProxyClient<messages::FooInterface>* foo = setup.client.get();
+ foo->initThreadMap();
+ setup.server->m_impl->m_fn = [] {};
+
+ try {
+ foo->callFnAsync();
+ } catch (const std::runtime_error& e) {
+ KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
+ disconnected = true;
+ }
}
+ KJ_EXPECT(disconnected);
}
KJ_TEST("Make simultaneous IPC calls on single remote thread")

```diff
     if (m_connection.m_loop->testing_hook_makethread) m_connection.m_loop->testing_hook_makethread();
     const std::string from = context.getParams().getName();
     std::promise<ThreadContext*> thread_context;
     std::thread thread([&thread_context, from, this]() {
-        g_thread_context.thread_name = ThreadName(m_connection.m_loop->m_exe_name) + " (from " + from + ")";
+        EventLoop& loop{*m_connection.m_loop};
+        g_thread_context.thread_name = ThreadName(loop.m_exe_name) + " (from " + from + ")";
         g_thread_context.waiter = std::make_unique<Waiter>();
         thread_context.set_value(&g_thread_context);
         if (loop.testing_hook_makethread_created) loop.testing_hook_makethread_created();
```
In "test: worker thread destroyed before it is initialized" 88cacd4
Why not initialize the loop reference at the top of the method and capture it in the thread? That would make the first if statement less verbose and more readable.
Suggested change:

```diff
-    if (m_connection.m_loop->testing_hook_makethread) m_connection.m_loop->testing_hook_makethread();
-    const std::string from = context.getParams().getName();
-    std::promise<ThreadContext*> thread_context;
-    std::thread thread([&thread_context, from, this]() {
-        g_thread_context.thread_name = ThreadName(m_connection.m_loop->m_exe_name) + " (from " + from + ")";
-        EventLoop& loop{*m_connection.m_loop};
-        g_thread_context.thread_name = ThreadName(loop.m_exe_name) + " (from " + from + ")";
-        g_thread_context.waiter = std::make_unique<Waiter>();
-        thread_context.set_value(&g_thread_context);
-        if (loop.testing_hook_makethread_created) loop.testing_hook_makethread_created();
+    EventLoop& loop{*m_connection.m_loop};
+    if (loop.testing_hook_makethread) loop.testing_hook_makethread();
+    const std::string from = context.getParams().getName();
+    std::promise<ThreadContext*> thread_context;
+    std::thread thread([&thread_context, &loop, from, this]() {
+        g_thread_context.thread_name = ThreadName(loop.m_exe_name) + " (from " + from + ")";
+        g_thread_context.waiter = std::make_unique<Waiter>();
+        thread_context.set_value(&g_thread_context);
+        if (loop.testing_hook_makethread_created) loop.testing_hook_makethread_created();
```
```diff
@@ -63,6 +64,7 @@ class TestSetup
 {
```
In "test: worker thread destroyed before it is initialized" 88cacd4
The comment can be adjusted for the disconnect_later addition:

```diff
- * Provides client_disconnect and server_disconnect lambdas that can be used to
- * trigger disconnects and test handling of broken and closed connections.
+ * Provides disconnection lambdas that can be used to trigger
+ * disconnects and test handling of broken and closed connections.
  *
```
```cpp
kj::Promise<void> ProxyServer<ThreadMap>::makeThread(MakeThreadContext context)
{
    if (m_connection.m_loop->testing_hook_makethread) m_connection.m_loop->testing_hook_makethread();
```
In "race fix: m_on_cancel called after request finishes" f09731e
IIUC, the reason there is a race is that we do not lock before passing the thread context to the thread context promise.
If a disconnect occurs in the interval between setting the thread context and acquiring the lock, a race happens: in that interval the client drops the connection, ProxyServer is destroyed, and the worker thread then tries to acquire the lock and access the waiter, which has already been destroyed, causing a SIGSEGV.
Acquiring the lock before calling set_value prevents this by ensuring the worker thread holds the mutex before the main thread can proceed. The ProxyServer destructor must acquire the same mutex to signal shutdown, so it blocks until the worker thread is safely inside its wait loop, guaranteeing the waiter is never destroyed while the worker thread is still trying to access it.
```cpp
g_thread_context.waiter = std::make_unique<Waiter>();
thread_context.set_value(&g_thread_context);   // ← signals main thread
                                               // ← race window opens here
Lock lock(g_thread_context.waiter->m_mutex);   // ← worker tries to lock
```
When set_value is called, the main thread unblocks and proceeds:
```text
Main thread                           Worker thread
────────────────────────────────      ────────────────────────────────
thread_context.get_future().get()     set_value() → main thread wakes
creates ProxyServer<Thread>
client drops reference
~ProxyServer<Thread> runs
waiter set to nullptr
waiter destroyed                      Lock(g_thread_context.waiter->m_mutex)
                                      ← waiter is gone → SIGSEGV
```
It would be nice to mention this in the commit message.
```cpp
        server_context.request_canceled = true;
    };
    // Update requests_threads map if not canceled.
    const auto& params = call_context.getParams();
```
In e69b6bf race fix: getParams() called after request cancel
Perhaps add a comment here (also trying to test my understanding of the fix):

```cpp
// It is safe to call getParams() here because this code runs inside sync(),
// which executes on the event loop thread. Even if cancellation is initiated
// by the client at this point, the params structs will only be freed after
// sync() returns and the event loop resumes processing the cancellation.
```
```cpp
// we do not want to be notified because
// cancel_mutex and server_context could be out of
// scope when it happens.
cancel_monitor.m_on_cancel = nullptr;
```
Gemini wrote a mermaid diagram
Before 2fb97e8
```mermaid
sequenceDiagram
    participant EL as Event Loop Thread
    participant W as Worker Thread
    participant CM as CancelMonitor (m_on_cancel)
    Note over W: --- PHASE: EXECUTION ---
    W->>W: fn.invoke(server_context, cancel_mutex)
    Note over W: invocation finished
    W->>W: [Pop Stack Frame]
    Note right of W: server_context & cancel_mutex are now FREED
    Note over EL, W: --- THE RACE WINDOW ---
    Note right of EL: Client Disconnects NOW
    EL->>CM: Trigger ~CancelProbe()
    CM->>CM: Execute m_on_cancel()
    CM-->>W: Access freed server_context / cancel_mutex
    Note over CM: CRASH: Use-After-Free
    W->>EL: m_loop->sync() (Too late!)
```
After 2fb97e8
```mermaid
sequenceDiagram
    participant EL as Event Loop Thread
    participant W as Worker Thread
    participant CM as CancelMonitor (m_on_cancel)
    Note over W: --- PHASE: EXECUTION ---
    W->>W: fn.invoke(server_context, cancel_mutex)
    Note over W: --- PHASE: CLEANUP ---
    W->>EL: m_loop->sync() [START]
    Note right of EL: SYNC BARRIER (On Event Loop)
    EL->>CM: m_on_cancel = nullptr
    Note right of EL: DISARMED: Callback is gone
    W->>W: [Pop Stack Frame]
    Note right of W: server_context & cancel_mutex are now FREED
    Note over EL, W: --- THE DISCONNECT WINDOW ---
    EL->>EL: Trigger ~CancelProbe()
    EL->>EL: Check m_on_cancel (is NULL)
    Note right of EL: SAFE: No callback to trigger
    EL-->>W: sync() returns [END]
    W->>W: Worker exits safely
```
The PR fixes 3 race conditions on disconnects that were detected in Bitcoin Core CI runs and by antithesis:

- `capnp::CallContext<ipc::capnp::messages::BlockTemplate::GetBlockParams, ipc::capnp::messages::BlockTemplate::GetBlockResults>::getParams()` bitcoin/bitcoin#34777 (comment)