An alternative TLS implementation to fix a crash on musl/aarch64 by zhengyu123 · Pull Request #600 · DataDog/java-profiler

zhengyu123 · 2026-06-17T17:41:59Z

What does this PR do?:
This PR is intended to replace thread_local variables (except Otel ones) with an alternative implementation, which is based on pthread_get/setspecific() API.

Motivation:
We have encountered a quite few defects related to TLS - whose implementation is compiler specific. In some circumstances, the implementation is very restrictive that does not fit our usage pattern - it manifests in SCP-1238

Additional Notes:
To limit the scope, I only replaced 3 thread locals in livenessTracker which led to the crash reported in SCP-1238](https://datadoghq.atlassian.net/browse/SCP-1238)

Converting ProfiledThread to use newly introduced ThreadLocal will be done in separate PR.

How to test the change?:

CI tests
New test case.

For Datadog employees:

If this PR touches code that signs or publishes builds or packages, or handles
credentials of any kind, I've requested a security review (run the dd:platform-security-review
skill, or file a request via the PSEC review form).
bewaire also runs automatically on every PR.
This PR doesn't touch any of that.
JIRA: PROF-15114

Unsure? Have a question? Request a review!

dd-octo-sts · 2026-06-17T17:58:29Z

CI Test Results

Run: #27779791157 | Commit: 93f91c9 | Duration: 14m 55s (longest job)

✅ All 32 test jobs passed

Status Overview

JDK	glibc-aarch64/debug	glibc-amd64/debug	musl-aarch64/debug	musl-amd64/debug
8	-	✅	-	-
8-ibm	-	✅	-	-
8-j9	✅	✅	-	-
8-librca	-	-	✅	✅
8-orcl	-	✅	-	-
11	-	✅	-	-
11-j9	✅	✅	-	-
11-librca	-	-	✅	✅
17	✅	✅	-	-
17-graal	✅	✅	-	-
17-j9	✅	✅	-	-
17-librca	-	-	✅	✅
21	✅	✅	-	-
21-graal	✅	✅	-	-
21-librca	-	-	✅	✅
25	✅	✅	-	-
25-graal	✅	✅	-	-
25-librca	-	-	✅	✅

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0

Updated: 2026-06-18 18:30:00 UTC

datadog-prod-us1-4 · 2026-06-17T18:03:39Z

✨ Fix all issues with BitsAI

⚠️ Warnings

🚦 14 Pipeline jobs failed

DataDog/java-profiler | benchmarks-candidate-aarch64: [alloc]

DataDog/java-profiler | benchmarks-candidate-aarch64: [cpu,wall,alloc,memleak]

DataDog/java-profiler | benchmarks-candidate-aarch64: [cpu,wall]

View all 14 failed jobs.

Useful? React with 👍 / 👎

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: 13c8891 | Docs | Datadog PR Page | Give us feedback!}

dd-octo-sts · 2026-06-18T13:21:10Z

Reliability & Chaos Results

✅ All reliability & chaos checks passed Pipeline: https://gitlab.ddbuild.io/DataDog/java-profiler/-/pipelines/119735058

Copilot

Pull request overview

This PR introduces a pthread-key-based ThreadLocal helper as an alternative to C++ thread_local, and migrates LivenessTracker’s per-thread random/subsampling state to use it (motivated by musl/aarch64 TLS issues). It also adds C++ unit tests validating the new TLS wrapper behavior.

Changes:

Added ddprof-lib/src/main/cpp/threadLocal.h implementing a ThreadLocal<T> wrapper on top of pthread_key_*.
Updated LivenessTracker::track() to replace thread_local PRNG/distribution/skipped-state with ThreadLocal instances.
Added threadLocal_ut.cpp gtest coverage for ThreadLocal set/get semantics, per-thread isolation, and per-thread cleanup.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 12 comments.

File	Description
`ddprof-lib/src/main/cpp/threadLocal.h`	New pthread-key TLS abstraction (generic + `double` specialization).
`ddprof-lib/src/main/cpp/livenessTracker.cpp`	Switches liveness subsampling TLS state from `thread_local` to the new `ThreadLocal` wrapper.
`ddprof-lib/src/test/cpp/threadLocal_ut.cpp`	Adds gtests validating correctness and cleanup behavior of the new TLS wrapper.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13c88912f3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-18T18:14:12Z

+        int err = pthread_key_create(&_key, F);
+        // What to do if we can not create a key?
+        assert(err == 0);


Handle pthread_key_create failures outside assert

When the process exhausts pthread TLS keys, pthread_key_create returns an error; in release builds this project compiles with -DNDEBUG, so this assertion is removed and the static _key remains zero/unusable. Subsequent liveness sampling get()/set() calls can then operate on an unrelated key or repeatedly allocate and leak per-thread RNG state instead of failing cleanly, so the error path needs non-assert handling such as an unconditional abort or disabling this TLS instance.

Useful? React with 👍 / 👎.

jbachorik · 2026-06-19T07:15:09Z

+    ThreadLocal& operator=(const ThreadLocal&) = delete;
+
+    ThreadLocal() {
+        int err = pthread_key_create(&_key, F);


[HIGH · robustness] pthread_key_create failure is guarded only by assert(), which is compiled out under NDEBUG. In a release build a failed key creation leaves _key uninitialized; all subsequent get()/set() calls then operate on an arbitrary or zero key, silently aliasing another thread's TSD slot (e.g. the ProfiledThread key in thread.cpp).

Suggestion: replace the assert with an unconditional abort:

if (err != 0) { fprintf(stderr, "ThreadLocal: pthread_key_create failed: %d\n", err); abort(); }

This matches the existing error handling in thread.cpp and makes failures visible in production.

jbachorik · 2026-06-19T07:15:09Z

+    }
+
+    ~ThreadLocal() {
+        pthread_key_delete(_key);


[HIGH · robustness] pthread_key_delete(_key) is called unconditionally in the destructor. If pthread_key_create failed in the constructor and the assert was compiled out (NDEBUG / release build), _key stays at its zero-initialized value (0). Calling pthread_key_delete(0) deletes a key we don't own — on musl, key 0 is legitimately held by another component (see thread.cpp:91–93) — corrupting the global pthread key table.

Suggestion: track construction success with a bool _key_valid member:

_key_valid = (pthread_key_create(&_key, F) == 0); // ... if (_key_valid) pthread_key_delete(_key);

Apply the same pattern to the double specialization.

jbachorik · 2026-06-19T07:15:09Z

+    T get() {
+        void* p = pthread_getspecific(_key);
+        if (p == nullptr && C != nullptr) {
+            p = C();


[HIGH · robustness] After p = C() allocates an object, set() asserts on pthread_setspecific; under NDEBUG the assert is stripped. If pthread_setspecific then fails silently, the TSD slot stays null. Every subsequent get() re-invokes C() and leaks the prior allocation — unbounded per-call leak for the lazy-create path.

Suggestion: check pthread_setspecific directly after C() and free on failure:

p = C(); if (pthread_setspecific(_key, (const void*)p) != 0) { if (F) F(p); return T{}; }

Alternatively, make set() call abort() instead of assert so failure is not silently swallowed in release builds.

jbachorik · 2026-06-19T07:15:09Z

+        pthread_key_delete(_key);
+    }
+
+    void set(T value) {


[MEDIUM · correctness] The primary template casts T directly to const void* with no compile-time guard that sizeof(T) == sizeof(void*). Instantiation with a sub-pointer-width type (e.g. ThreadLocal<int>) compiles silently but produces implementation-defined behaviour on the set/get round-trip for values that set the sign bit.

Suggestion: add a static_assert in the primary template constructor mirroring the guards already in the double specialization:

static_assert(sizeof(T) == sizeof(void*), "ThreadLocal<T> requires sizeof(T)==sizeof(void*); use a pointer type or add a specialization");

jbachorik · 2026-06-19T07:15:09Z

+        pthread_key_delete(_key);
+    }
+
+    void set(T value) {


[MEDIUM · correctness] set(nullptr) on a lazy-create ThreadLocal (i.e. C != nullptr) is semantically identical to 'never set': the next get() call will invoke C() and return a fresh object instead of null. This makes it impossible to store a null value or 'unset' without triggering re-creation, which is a silent foot-gun for the planned future conversions mentioned in the PR description.

Suggestion: document explicitly that set(nullptr) is reserved for clear() and must not be used when C != nullptr, or provide a separate reset() method that bypasses the lazy-create path.

jbachorik · 2026-06-19T07:15:09Z

+        return value;
+    }
+
+    void clear() {


[MEDIUM · robustness] ThreadLocal<double>::clear() uses assert(err == 0) for the pthread_setspecific return value; under NDEBUG the assert is stripped. If pthread_setspecific fails silently, the TSD slot retains the stale encoded double bits; the next get() sees a non-null value and decodes garbage instead of returning 0.0.

Suggestion: mirror the defensive pattern in the generic clear() — read the existing value first, attempt the set, and only proceed if it succeeded — or use abort() instead of assert so the failure is not silently swallowed in release builds.

jbachorik · 2026-06-19T07:15:09Z

    }
  }
-  skipped = 0; // reset the subsampling skipped bytes
+  skipped.set(0); // reset the subsampling skipped bytes


[LOW · correctness] (Pre-existing, not introduced by this PR) skipped.set(0) resets the accumulated skipped-bytes counter even when the current event was dropped (table full), losing weight that should carry forward to the next successfully recorded sample.

Suggestion: move skipped.set(0) inside the idx < _table_cap branch so it only resets when the sample was actually recorded.

jbachorik · 2026-06-19T07:15:09Z

+
+// The create function lazily initializes storage on first get() and is invoked
+// exactly once per thread; subsequent get()s return the same pointer.
+TEST_F(ThreadLocalTest, Lazy_CreateOncePerThread) {


[MEDIUM · completeness] The Lazy_* tests share g_tracked_tl, g_created, and g_freed as namespace-scope statics. Counts are reset per test but the TSD key itself is shared. In parallel test runs or with a thread-pool GTest runner where worker threads are reused, freed counts can be non-deterministic.

Suggestion: use a per-test-local ThreadLocal instance (or scoped key creation/deletion in SetUp/TearDown) so each test owns its own TSD key and counters, eliminating cross-test contamination regardless of thread reuse.

jbachorik · 2026-06-19T07:15:09Z

+    ASSERT_NE(nullptr, first);
+    EXPECT_EQ(first, second);          // same instance reused (pointer compare only)
+    EXPECT_EQ(1234, value);            // created via create_tracked()
+    EXPECT_EQ(1, g_created.load());    // created exactly once


[LOW · completeness] Lazy_CreateOncePerThread resets g_freed to 0 but never asserts its value after the thread joins. A stub implementation that omits CLEAN_FUNC entirely would still pass this test.

Suggestion: add EXPECT_EQ(1, g_freed.load()); after the g_created assertion to verify the TSD destructor is called on thread exit.

jbachorik · 2026-06-19T07:15:09Z

+
+// Per-thread accumulation mirrors LivenessTracker's `skipped` usage: each thread
+// keeps its own running sum, isolated from the others.
+TEST_F(ThreadLocalTest, Double_PerThreadAccumulation) {


[LOW · completeness] The test suite has zero coverage for clear() on either ThreadLocal specialization. A stub clear() that did nothing would pass all 8 tests without failure.

Suggestion: add at minimum: (1) a test for generic clear() verifying the stored pointer is zeroed and the destructor ran; (2) a test for ThreadLocal<double>::clear() verifying get() returns 0.0 afterwards; (3) a test for clear() when F == nullptr verifying the slot is actually nulled.

zhengyu123 added 3 commits June 17, 2026 13:36

v0

8d2b637

v1

02614cf

An alternative to thread local

9840b2f

zhengyu123 added 2 commits June 17, 2026 20:37

v2

a4aa192

Added a test

17889ce

zhengyu123 changed the title ~~Zgu/tls replacement~~ WIP: An alternative TLS implementation to fix a crash on musl/aarch64 Jun 18, 2026

zhengyu123 requested a review from Copilot June 18, 2026 13:29

Copilot started reviewing on behalf of zhengyu123 June 18, 2026 13:29 View session

Copilot AI reviewed Jun 18, 2026

View reviewed changes

zhengyu123 and others added 13 commits June 18, 2026 09:57

Potential fix for pull request finding

f38c9cd

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

b387308

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

c2b7014

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

a632c1c

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

a3d8950

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

792a4a3

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

f9c9b28

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

86c2e2e

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

2e788fd

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

282a346

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

5cf4a1c

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

337b80f

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

fix

c190286

zhengyu123 changed the title ~~WIP: An alternative TLS implementation to fix a crash on musl/aarch64~~ An alternative TLS implementation to fix a crash on musl/aarch64 Jun 18, 2026

zhengyu123 marked this pull request as ready for review June 18, 2026 18:09

zhengyu123 requested a review from a team as a code owner June 18, 2026 18:09

Merge branch 'main' into zgu/tls_replacement

13c8891

chatgpt-codex-connector Bot reviewed Jun 18, 2026

View reviewed changes

jbachorik reviewed Jun 19, 2026

View reviewed changes

Conversation

zhengyu123 commented Jun 17, 2026 • edited by atlassian Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dd-octo-sts Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI Test Results

Status Overview

Uh oh!

datadog-prod-us1-4 Bot commented Jun 17, 2026 • edited by datadog-official Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Warnings

Uh oh!

dd-octo-sts Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reliability & Chaos Results

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

zhengyu123 commented Jun 17, 2026 •

edited by atlassian Bot

Loading

dd-octo-sts Bot commented Jun 17, 2026 •

edited

Loading

datadog-prod-us1-4 Bot commented Jun 17, 2026 •

edited by datadog-official Bot

Loading

dd-octo-sts Bot commented Jun 18, 2026 •

edited

Loading