add TaskBlock events for blocking intervals by kaahos · Pull Request #570 · DataDog/java-profiler

kaahos · 2026-06-01T13:42:52Z

What does this PR do?:

Adds datadog.TaskBlock JFR events for blocking intervals such as LockSupport.park, Object.wait, and monitor contention. It also adds the recording APIs that dd-trace-java uses for instrumented blocking operations such as Thread.sleep.

Motivation:

This builds on paul.fournillon/wallclock-suppression (#560) by preserving visibility into blocked spans as explicit duration events.

Additional Notes:

This PR does not add the dd-trace-java instrumentation itself; Thread.sleep emission depends on that side calling the new profiler APIs. Monitor callback support is HotSpot-specific, and virtual-thread carrier attribution is avoided.

How to test the change?:

./.claude/commands/build-and-summarize :ddprof-test:testDebug -Ptests="*.wallclock.*TaskBlockTest"
./.claude/commands/build-and-summarize :ddprof-lib:gtestDebug

For Datadog employees:

If this PR touches code that signs or publishes builds or packages, or handles
credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
This PR doesn't touch any of that.
JIRA: [JIRA-14354]

dd-octo-sts · 2026-06-01T13:57:37Z

CI Test Results

Run: #27772272408 | Commit: ff62c9a | Duration: 18m 6s (longest job)

✅ All 32 test jobs passed

Status Overview

JDK	glibc-aarch64/debug	glibc-amd64/debug	musl-aarch64/debug	musl-amd64/debug
8	-	✅	-	-
8-ibm	-	✅	-	-
8-j9	✅	✅	-	-
8-librca	-	-	✅	✅
8-orcl	-	✅	-	-
11	-	✅	-	-
11-j9	✅	✅	-	-
11-librca	-	-	✅	✅
17	✅	✅	-	-
17-graal	✅	✅	-	-
17-j9	✅	✅	-	-
17-librca	-	-	✅	✅
21	✅	✅	-	-
21-graal	✅	✅	-	-
21-librca	-	-	✅	✅
25	✅	✅	-	-
25-graal	✅	✅	-	-
25-librca	-	-	✅	✅

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0

Updated: 2026-06-18 16:23:08 UTC

kaahos · 2026-06-03T13:20:47Z

@copilot review this PR

Copilot · 2026-06-03T13:26:39Z

Reviewed and found a flaky assertion in NativeSocketTaskBlockTest: it required zero TaskBlock events globally, which can fail when an untraced socket op emits one. Fixed in d21e493 by asserting that no TaskBlock is emitted for the traced span specifically, and re-ran :ddprof-test:testDebug -Ptests="*.wallclock.*TaskBlockTest" successfully.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d21e4936d2

kaahos · 2026-06-12T12:38:04Z

The previous comments have been addressed, and the new features have been merged into this branch.

jbachorik · 2026-06-16T09:04:30Z

+    // JVMThread::current() is the native thread self pointer. On OpenJ9/Zing it
+    // is not a HotSpot JavaThread*; only HotSpot may reinterpret it as VMThread*.
+    if (!VM::isHotspot() || JVMThread::current() == nullptr) {
+        return nullptr;
+    }
    return VMThread::cast(JVMThread::current());


This seems to be the wrong place to check for this. We should not be able to end up in vmstructs-related code when not running on HotSpot.

jbachorik · 2026-06-16T09:04:59Z

+    if (!VM::isHotspot()) {
+        return nullptr;
+    }


Likewise, this code should already be called only on HotSpot.

jbachorik · 2026-06-16T09:06:13Z

-    // Use atomic load: keys[] can be written concurrently via CAS in put()
-    // when a table is promoted to prev but still has in-flight insertions.


Why are these comments removed?

jbachorik · 2026-06-16T09:06:48Z

-      // Use acquireTrace() to pair with the RELEASE store in setTrace().
-      // If still PREPARING, treat as not found: callers will create a new entry.


Likewise - I think the comments are still valid

jbachorik · 2026-06-16T09:11:09Z

+  u64 _num_suppressed_sampled_run;
+  u64 _num_task_block_emitted;
+  u64 _num_task_block_skipped_trace_context;
+  u64 _num_task_block_skipped_too_short;



Clarification: are these counters meant to drive the data reconstruction from the samples, or are they just counters?
If the latter, we should use the COUNTERS instead (they are written to the recording automatically).

jbachorik · 2026-06-16T09:56:26Z

+  }
+
+  Context context = ContextApi::snapshot();
+  if (context.spanId != 0) {


This seems wrong - you bail out when spanId is set for the context?

jbachorik · 2026-06-16T10:17:50Z

+int NativeSocketInterposer::close_hook(int fd) {
+  int ret;
+  if (_orig_close == nullptr) {
+    ret = static_cast<int>(syscall(SYS_close, fd));
+  } else {
+    ret = _orig_close(fd);
+  }
+  int saved_errno = errno;
+  if (ret == 0) {
+    NativeSocketInterposer::instance()->clearFdType(fd);
+  }
+  errno = saved_errno;
+  return ret;
+}


close is hooked to invalidate the fd-type cache here, but the other fd-closing paths are not: dup2/dup3 implicitly close their target fd, and neither is in NativeIoHookIndex. A raw syscall(SYS_close) or a close in an unpatched library has the same effect. After such a close + fd reuse, _fd_type_cache[fd] holds a stale verdict until the next hooked close or clearFdTypeCache() (start/stop).

Impact is observability-only - call(fn) always runs the real I/O, so only the NativeBlockScope wrapping decision is affected: a reused fd can be misattributed (stale SOCKET → file read recorded as IO_WAIT) or under-sampled (stale NON_SOCKET → real socket I/O not wrapped). Self-healing on next close/reset, no crash or semantic effect.

Since we already hook close for exactly this reason, consider hooking dup2/dup3 to clearFdType the target fd and close the gap.
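
A minimal sketch of what such a hook could look like, assuming a flat per-fd atomic cache. The names `g_fd_type`, `kMaxFds`, and this `dup2_hook` are stand-ins for illustration, not the PR's actual code, and the sketch calls libc `dup2` directly rather than going through an interposed original pointer:

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <unistd.h>

// Assumed cache layout: one slot per fd, 0 means "unknown / not cached".
static const int kMaxFds = 1024;
static std::atomic<unsigned> g_fd_type[kMaxFds];

// Drop the cached verdict for one fd so the next I/O op re-probes it.
static void clearFdType(int fd) {
  if (fd >= 0 && fd < kMaxFds) {
    g_fd_type[fd].store(0, std::memory_order_release);
  }
}

// Hypothetical dup2 hook: on success, newfd now refers to a different open
// file description, so whatever was cached for it is stale.
static int dup2_hook(int oldfd, int newfd) {
  int ret = dup2(oldfd, newfd);
  int saved_errno = errno;
  // dup2(fd, fd) closes nothing and leaves the fd as it was, so only
  // invalidate when the target was really replaced.
  if (ret >= 0 && oldfd != newfd) {
    clearFdType(newfd);
  }
  errno = saved_errno;
  return ret;
}
```

A dup3 hook would have the same shape. Because dup3 fails with EINVAL when oldfd equals newfd, the `ret >= 0` check alone would be enough there.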

jbachorik · 2026-06-16T10:18:53Z

+  if (type != 0) {
+    _fd_type_cache[fd].store((gen << FD_TYPE_GEN_SHIFT) | type,
+                             std::memory_order_release);
+  }
+  return type;


When probe() returns 0 — i.e. getsockopt(SO_TYPE) fails with something other than ENOTSOCK (e.g. EBADF) - the result is intentionally not cached, so this fd gets a fresh getsockopt syscall on every intercepted I/O op rather than a one-time cold-path cost. For a valid in-use fd this shouldn't occur, but it's the only path where the classifier degrades to one syscall per call. Worth a brief comment noting the non-cacheable case is deliberate (transient errors must not be cached) so it isn't mistaken for an oversight.
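
For illustration, a simplified sketch of the classifier with the suggested comment in place. It assumes a plain per-fd cache and leaves out the generation bits from the diff above; `FdType`, `probeFd`, `classifyFd`, and `g_cache` are hypothetical names:

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <sys/socket.h>
#include <unistd.h>

enum FdType : unsigned { FD_UNKNOWN = 0, FD_SOCKET = 1, FD_NON_SOCKET = 2 };

static const int kCacheSize = 1024;  // assumed cache size
static std::atomic<unsigned> g_cache[kCacheSize];

// Ask the kernel what the fd is. Only ENOTSOCK is a definitive "not a socket".
static unsigned probeFd(int fd) {
  int so_type = 0;
  socklen_t len = sizeof(so_type);
  if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &so_type, &len) == 0) {
    return FD_SOCKET;
  }
  if (errno == ENOTSOCK) {
    return FD_NON_SOCKET;
  }
  return FD_UNKNOWN;  // e.g. EBADF: no verdict
}

static unsigned classifyFd(int fd) {
  if (fd < 0 || fd >= kCacheSize) {
    return FD_UNKNOWN;
  }
  unsigned cached = g_cache[fd].load(std::memory_order_acquire);
  if (cached != FD_UNKNOWN) {
    return cached;
  }
  unsigned type = probeFd(fd);
  // Deliberate: FD_UNKNOWN is never cached. A transient probe failure must
  // not become a permanent verdict, so such an fd costs one getsockopt per
  // call until a probe succeeds or reports ENOTSOCK.
  if (type != FD_UNKNOWN) {
    g_cache[fd].store(type, std::memory_order_release);
  }
  return type;
}
```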

jbachorik · 2026-06-16T10:24:02Z

+#include <sys/syscall.h>
+#include <unistd.h>
+
+namespace {


Probably not necessary. All the functions here are already static and internally linked. The rest of the codebase is not really using anonymous namespaces.

jbachorik · 2026-06-16T10:25:27Z

+  int entry_errno = errno;
+  bool eligible = NativeSocketInterposer::instance()->isDatagramSocket(fd);
+  errno = entry_errno;
+  return runNativeIoHook<Ret>(eligible, NativeBlockKind::UDP_RECEIVE, fd, fn,
+                              call);
+}


I keep seeing this pattern of storing and restoring errno.
Can we have an RAII helper for that?
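
One possible shape for such a helper, as a sketch only. `ErrnoGuard` is a hypothetical name, and `probeThatClobbersErrno` is a stand-in for calls like `isDatagramSocket(fd)` that may overwrite errno:

```cpp
#include <cassert>
#include <cerrno>

// Saves errno on construction and restores it when the scope ends.
class ErrnoGuard {
public:
  ErrnoGuard() noexcept : _saved(errno) {}
  ~ErrnoGuard() noexcept { errno = _saved; }
  ErrnoGuard(const ErrnoGuard&) = delete;
  ErrnoGuard& operator=(const ErrnoGuard&) = delete;

private:
  int _saved;
};

// Stand-in for a classifier call that overwrites errno as a side effect.
static bool probeThatClobbersErrno(int fd) {
  errno = EBADF;
  return fd >= 0;
}

// The caller's errno survives the eligibility check without manual
// save/restore lines around it.
static bool isEligible(int fd) {
  ErrnoGuard guard;
  return probeThatClobbersErrno(fd);
}
```

The return value is computed before the guard's destructor runs, so the restore happens last. The same guard also fits the close_hook case if it is constructed right after the real close call, since it captures whatever errno holds at that point.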

jbachorik

Can you add more test coverage?

Tier 1

directly threatens the design's core invariant (no test exists, and the race is real, not theoretical):

Concurrent enter/exit/snapshot on the same ThreadFilter slot. enterBlockedRun, snapshotAndExitBlockedRun, and the signal-handler's markSampledThisRun are all lock-free on shared slot state, but every test drives them single-threaded. The whole feature is built on a signal handler racing the application thread — and that interleaving is never tested. This is the headline gap.
Snapshot consistency under concurrent mutation. snapshotBlockRun does several independent relaxed loads (sampled_this_run, owner, anchor_sample_id, suppressed_sample_count). No test mutates state mid-snapshot to confirm the result is self-consistent (e.g. anchored=false but non-zero suppressed count).
TaskBlockQueue concurrent push/pop. It's an MPSC-style sequence-cell queue (the hard-to-get-right kind), tested only single-threaded. No multi-producer race, no cell-reuse-staleness test.

Tier 2

error/early-exit paths that are written but never executed by a test:

NativeBlockScope constructor gates. Of its ~6 early-exit gates (!taskBlockAsyncActive, !filter.enabled, current==nullptr, non-Java thread, slot_id<0, enterBlockedRun==0), only the spanId!=0 skip is tested. The others are silent - a regression that flips one would pass CI.
Null orig* function pointer → ENOSYS. The interposer handles it; no test injects null.
getsockopt failure with errno ≠ ENOTSOCK - uncovered, and it's the one real perf cliff.

Tier 3

lifecycle, worth one test each:

Profiler restart with changed args (wallprecheck=true then false; ASGCT↔JVMTI engine switch with precheck active): the static-singleton counters and drain-thread lifecycle across restart are untested. Restart in production isn't a supported mode, which lowers this priority.
JVMTI-unavailable / J9 fallback — precheck wiring on non-ASGCT engines is asserted by the agents to silently no-op; worth a test to confirm it degrades cleanly rather than leaving weight slots un-restored.
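
As a rough illustration of the Tier 1 queue item, here is a sketch of a multi-producer stress test. It runs against a stand-in bounded queue with per-cell sequence numbers, not the PR's TaskBlockQueue, whose API is not shown in this thread. `SeqCellQueue` and `stressMultiProducer` are hypothetical names:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in multi-producer, single-consumer bounded queue. Each cell carries a
// sequence number that says which lap it is ready for.
template <size_t N>
class SeqCellQueue {
  static_assert((N & (N - 1)) == 0, "N must be a power of two");
  struct Cell {
    std::atomic<size_t> seq;
    uint64_t value;
  };
  Cell _cells[N];
  std::atomic<size_t> _tail{0};  // shared by producers
  size_t _head = 0;              // touched only by the consumer

public:
  SeqCellQueue() {
    for (size_t i = 0; i < N; i++) {
      _cells[i].seq.store(i, std::memory_order_relaxed);
    }
  }

  bool push(uint64_t v) {
    size_t pos = _tail.load(std::memory_order_relaxed);
    for (;;) {
      Cell& c = _cells[pos & (N - 1)];
      size_t seq = c.seq.load(std::memory_order_acquire);
      intptr_t diff = (intptr_t)seq - (intptr_t)pos;
      if (diff == 0) {
        // Cell is free for this lap: try to claim the position.
        if (_tail.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
          c.value = v;
          c.seq.store(pos + 1, std::memory_order_release);  // publish
          return true;
        }
      } else if (diff < 0) {
        return false;  // cell not yet consumed from the previous lap: full
      } else {
        pos = _tail.load(std::memory_order_relaxed);  // lost the race, retry
      }
    }
  }

  bool pop(uint64_t& out) {
    Cell& c = _cells[_head & (N - 1)];
    if (c.seq.load(std::memory_order_acquire) != _head + 1) {
      return false;  // empty, or claimed but not yet published
    }
    out = c.value;
    c.seq.store(_head + N, std::memory_order_release);  // free for next lap
    _head++;
    return true;
  }
};

// Several producers push tagged values while one consumer checks that nothing
// is lost or duplicated and that each producer's values arrive in order.
static bool stressMultiProducer(int producers, int per_producer) {
  SeqCellQueue<64> q;
  std::vector<uint64_t> next(producers, 0);
  std::vector<std::thread> threads;
  for (int p = 0; p < producers; p++) {
    threads.emplace_back([&q, p, per_producer] {
      for (int i = 0; i < per_producer; i++) {
        uint64_t v = ((uint64_t)p << 32) | (uint64_t)i;
        while (!q.push(v)) std::this_thread::yield();
      }
    });
  }
  bool ok = true;
  long remaining = (long)producers * per_producer;
  while (remaining > 0) {
    uint64_t v = 0;
    if (!q.pop(v)) { std::this_thread::yield(); continue; }
    uint32_t p = (uint32_t)(v >> 32);
    uint32_t i = (uint32_t)v;
    if (p >= (uint32_t)producers || i != next[p]) ok = false;
    else next[p]++;
    remaining--;
  }
  for (auto& t : threads) t.join();
  return ok;
}
```

The small capacity (64) is chosen so that cells are reused many times during the run, which is where stale-sequence bugs would surface. Running the same test under ThreadSanitizer would add further signal.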

kaahos added 11 commits May 28, 2026 15:28

feat: wall-clock precheck and signal suppression

1045a85

Merge branch 'main' into paul.fournillon/wallclock-suppression

aed7b1a

fix

7a250b6

Merge branch 'main' into paul.fournillon/wallclock-suppression

6028fdd

fix: fix build + tests

1e1bcd1

fix: fix mem leaks in tests

b1cb73f

fix: track wall precheck block state in thread filter

a3a9462

fix: arm wall precheck after recording sample

137065c

fix: include wait states in wall precheck suppression

c7caa46

Fix ProfiledThread ownership in park_state_ut

55073d0

Add Java block-state bridge for wall-clock precheck

619449a

Fix wall-clock thread filter reset

1cd0f8b

kaahos force-pushed the paul.fournillon/wallclock-taskblock branch from bcf3c2b to a46d368 Compare June 2, 2026 08:24

Gate wall-clock precheck on untraced context

3ee7f42

kaahos force-pushed the paul.fournillon/wallclock-taskblock branch 3 times, most recently from 3f2f71a to d1e3210 Compare June 3, 2026 13:00

Copilot started work on behalf of kaahos June 3, 2026 13:21 View session

Copilot finished work on behalf of kaahos June 3, 2026 13:27

kaahos marked this pull request as ready for review June 3, 2026 13:40

kaahos requested a review from a team as a code owner June 3, 2026 13:40

chatgpt-codex-connector Bot reviewed Jun 3, 2026

Comment thread ddprof-lib/src/main/cpp/wallClock.cpp Outdated

kaahos force-pushed the paul.fournillon/wallclock-taskblock branch 2 times, most recently from 2101007 to a94677a Compare June 3, 2026 14:05

Merge branch 'main' into paul.fournillon/wallclock-suppression

6bda356

kaahos force-pushed the paul.fournillon/wallclock-taskblock branch from a94677a to 681582c Compare June 3, 2026 14:38

kaahos force-pushed the paul.fournillon/wallclock-taskblock branch from 3168f13 to a7cf33f Compare June 12, 2026 11:22

jbachorik reviewed Jun 15, 2026

Comment thread ddprof-lib/src/main/cpp/codeCache.cpp Outdated

Comment thread ddprof-lib/src/main/cpp/wallClockCounters.h

Comment thread ddprof-lib/src/main/cpp/nativeSocketInterposer.cpp

jbachorik requested changes Jun 16, 2026

kaahos added 18 commits June 18, 2026 15:50

fix: clean up branch based on PR review recommendations

a764667

fix: remove TaskBlock snapshot mechanism

0940fbe

chore: isolate TaskBlock recording infrastructure

dc3fdb8

chore: group park and monitor TaskBlock producers

d032564

chore: split out native socket interposition

3163a96

fix: address JFR recording review

1f3627e

fix: address TaskBlock recorder review

3d7ab5e

fix: FLAG_PARKED was published before context

c451125

fix: address merge regressions

d2b453e

fix: fix wallprecheck anchoring and delegated sample recording

5b3e863

fix: fix TaskBlock monitor ownership and counters

b07a466

fix: fix native socket hook correctness

bddb37c

fix: fix blocked-run ownership and exit races and add TaskBlockQueue concurrency coverage test

9eaecdc

fix: refresh native socket fd type after dup2 and dup3

05d8933

fix: document native patcher and call trace concurrency invariants

fc0a61a

fix: make TaskBlock events carry direct stack references

64bec6a

fix: test TaskBlock stack reference capture

673289b

test: assert TaskBlock events are self-contained in JFR tests

c278669

kaahos force-pushed the paul.fournillon/wallclock-taskblock branch from 98cb129 to c278669 Compare June 18, 2026 15:57
