Skip to content

proxy-client: fix TSan data race in clientDestroy#286

Merged
ryanofsky merged 2 commits into
bitcoin-core:masterfrom
ryanofsky:pr/drace
Jun 11, 2026
Merged

proxy-client: fix TSan data race in clientDestroy#286
ryanofsky merged 2 commits into
bitcoin-core:masterfrom
ryanofsky:pr/drace

Conversation

@ryanofsky

@ryanofsky ryanofsky commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

#273 added a new test which exposed a longstanding race condition in clientDestroy logging code, causing the sanitize CI job to now fail with a TSAN error. Fix the race by simplifying the logging code.

Details about the fix are in the commit message. TSAN error looked like

[ TEST ] test.cpp:437: Destroying ProxyClient<> with destroy method after peer disconnect
...
  Write of size 8 at 0x721400005020 by thread T24 (mutexes: write M0):
    #0 std::__1::__function::__func<mp::ProxyClientBase<mp::test::messages::FooCallback, mp::test::FooCallback>::ProxyClientBase(mp::test::messages::FooCallback::Client, mp::Connection*, bool)::{lambda()#1}, void ()>::operator()() /home/runner/work/libmultiprocess/libmultiprocess/include/mp/proxy-io.h:522 (mptest+0x242b20)
    #1 void mp::Unlock<mp::Lock, std::__1::function<void ()>&>(mp::Lock&, std::__1::function<void ()>&) /nix/store/6z6xj37ljxwrwd5w3rs0zyfkrxd48lzl-libcxx-21.1.2-dev/include/c++/v1/__functional/function.h:274 (mptest+0x35582d)
    #2 mp::Connection::~Connection() /home/runner/work/libmultiprocess/libmultiprocess/src/mp/proxy.cpp:150 (mptest+0x34e596)
    ...
    #11 mp::test::TestSetup::TestSetup(bool)::{lambda()#1}::operator()() const /home/runner/work/libmultiprocess/libmultiprocess/test/mp/test/test.cpp:112 (mptest+0x1504fe)

  Previous read of size 8 at 0x721400005020 by thread T26:
    #0 void mp::clientDestroy<mp::ProxyClient<mp::test::messages::FooCallback> >(mp::ProxyClient<mp::test::messages::FooCallback>&) /home/runner/work/libmultiprocess/libmultiprocess/include/mp/proxy-types.h:647 (mptest+0x342d2a)
     ...
    #21 std::__1::__function::__func<mp::ProxyServerBase<mp::test::messages::FooInterface, mp::test::FooImplementation>::~ProxyServerBase()::{lambda()#1}, void ()>::operator()() /nix/store/6z6xj37ljxwrwd5w3rs0zyfkrxd48lzl-libcxx-21.1.2-dev/include/c++/v1/__functional/function.h:174 (mptest+0x347c76)
     ...
    #23 void* std::__1::__thread_proxy[abi:ne210101]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, mp::EventLoop::startAsyncThread()::$_0> >(void*) /home/runner/work/libmultiprocess/libmultiprocess/src/mp/proxy.cpp:309 (mptest+0x353ef2)

  Thread T24 (tid=10032, running) created by main thread at:
    ...
    #4 mp::test::TestCase437::run() /home/runner/work/libmultiprocess/libmultiprocess/test/mp/test/test.cpp:449 (mptest+0x1459a7)

https://github.com/bitcoin-core/libmultiprocess/actions/runs/26851297744/job/79183796152

Note: #273 made it possible for this TSAN error to trigger, but it didn't trigger most of the time, so it was not detected in CI. There is also an interaction with #218 and #273 + #218 together make it much more likely to trigger

clientDestroy() reads m_context.connection to decide whether to use
MP_LOG vs KJ_LOG, but Connection::~Connection() can set it to null
concurrently from the event loop thread (via the disconnect_cb sync
cleanup callback) while the destructor runs on an async cleanup thread,
causing a TSan-reported data race. The race is exposed by the test added
in commit 90be835 ("test: regression for ~ProxyClient destroy after peer
disconnect").

The KJ_LOG fallback was only needed before commit 315ff53 ("refactor:
Add ProxyContext EventLoop* member"), when logging required going
through connection->m_loop. Since that commit, m_context.loop is a
direct EventLoopRef that is always valid regardless of whether
m_context.connection is null. The KJ_LOG branch is now dead code, so
drop it and the connection check entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@DrahtBot

DrahtBot commented Jun 3, 2026

Copy link
Copy Markdown

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK maflcko, ViniciusCestarii, xyzconstant
Concept ACK w0xlt
Stale ACK enirox001

If your review is incorrectly listed, please copy-paste <!--meta-tag:bot-skip--> into the comment that the bot should ignore.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #293 (proxy-types: prevent race condition when reading client.m_context.connection by ViniciusCestarii)
  • #269 (proxy: add local connection limit to ListenConnections by enirox001)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

gen.cpp uses IWYU pragma: keep for include which appears to be necessary for
newer versions of capnproto (1.4.0 but not 1.1.0)
@ryanofsky

ryanofsky commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

@w0xlt

w0xlt commented Jun 4, 2026

Copy link
Copy Markdown

Concept ACK,

@enirox001 enirox001 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 039e5ac

Built and ran mptest with -fsanitize=thread, all 11 tests pass cleanly.

The fix makes sense to me. Instead of adding synchronization for what constitutes a log line, we drop the branch and unconditionally log via MP_LOG.

@ryanofsky

Copy link
Copy Markdown
Collaborator Author

Thanks for the reviews. I also went back and tried to debug why CI did not catch the TSAN error. It turns out to be because an interaction between #273 and #218. The new test added in #273 does not trigger the TSAN error deterministically. When it is run standalone without #218 there's usually no error, although I did see a TSAN error once running the test maybe 10 times. But when #273 and #218 are combined, the test triggers TSAN failures most of the time. This seems to happen because #218 changes clientDestroy to call CxxTypeName which can change thread timing.

@ryanofsky

Copy link
Copy Markdown
Collaborator Author

Started to merge this but I saw some more IWYU errors locally that seem better to fix. I think the errors don't happen in CI because nixpkgs are older and use an older version of capnproto. I also ran IWYU locally before my last push, but didn't see these errors because I was using a combined branch with #287 with example files removed. Errors look like:

/home/ryan/mp/build-default/example/calculator.capnp.proxy-client.c++ should add these lines:
#include <capnp/common.h>                    // for MessageSize
#include <string>                            // for basic_string

Updated 039e5ac -> 90982f7 (pr/drace.3 -> pr/drace.4, compare) touching up IWYU commit to fix more local IWYU errors

@maflcko

maflcko commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

lgtm ACK 90982f7

A bit tedious, that different capnp versions require different set of includes, but I guess there is nothing that can be done about that in this repo.

@DrahtBot DrahtBot requested a review from enirox001 June 10, 2026 10:40
@ryanofsky

Copy link
Copy Markdown
Collaborator Author

A bit tedious, that different capnp versions require different set of includes, but I guess there is nothing that can be done about that in this repo.

Yes, this repo is taking a different strategy for IWYU than the Bitcoin core repo so it is easier to run IWYU locally.

With Bitcoin core AFAIK you need to run IWYU in a container and can't run it as part of your normal development process because headers must be changed to match what a single CI machine enforces, and pragmas are discouraged.

With this repo there is no special IWYU configuration. IWYU can be run in any configuration and "pragma: keeps" can be added freely to turn off complaints from other configurations. From my perspective this is good because there's no need for a container to run IWYU, and you can waste less time debugging IWYU errors and figuring out what "correct" includes are. You get the main benefits of IWYU (faster compiles, cleaner headers) at the cost of needing a few mysterious pragmas. I think this is about as much as you can hope for with IWYU.

@maflcko

maflcko commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Right, it is obviously more correct to run iwyu for the full range of all supported third-party dependency versions (especially if they have different include requirements).

Maybe it could make sense to expand the CI to cover both oldeps-iwyu and nightly-iwyu (or so) here?

I think for Bitcoin Core this doesn't matter, because the dependencies don't change the include requirements across versions. (E.g. boost multi index is just a single include header)

If that were to change, we should do the same. But no strong opinion, and this seems to be trailing off a bit 😅

@ViniciusCestarii ViniciusCestarii left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK, 90982f7

This change removes differentiation on the logging when the connection was torn down "IPC interrupted client destroy" but I believe this is fine and simpler than locking a mutex.

@xyzconstant

Copy link
Copy Markdown
Contributor

tACK 90982f7

@ryanofsky ryanofsky merged commit 9885d7d into bitcoin-core:master Jun 11, 2026
20 of 23 checks passed
@w0xlt

w0xlt commented Jun 11, 2026

Copy link
Copy Markdown

post-merge ACK

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants