add TaskBlock events for blocking intervals#570
Conversation
This comment has been minimized.
This comment has been minimized.
CI Test ResultsRun: #27772272408 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled Summary: Total: 32 | Passed: 32 | Failed: 0 Updated: 2026-06-18 16:23:08 UTC |
bcf3c2b to
a46d368
Compare
3f2f71a to
d1e3210
Compare
|
@copilot review this PR |
Reviewed and found a flaky assertion in |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d21e4936d2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
2101007 to
a94677a
Compare
a94677a to
681582c
Compare
3168f13 to
a7cf33f
Compare
|
the previous comments have been addressed and the new features merged into this branch. |
| // JVMThread::current() is the native thread self pointer. On OpenJ9/Zing it | ||
| // is not a HotSpot JavaThread*; only HotSpot may reinterpret it as VMThread*. | ||
| if (!VM::isHotspot() || JVMThread::current() == nullptr) { | ||
| return nullptr; | ||
| } | ||
| return VMThread::cast(JVMThread::current()); |
There was a problem hiding this comment.
This seems to be the wrong place to check for this. We should not be able to end up in vmstructs releated code when not running on hotspot.
| if (!VM::isHotspot()) { | ||
| return nullptr; | ||
| } |
There was a problem hiding this comment.
Likewise, this code should be called only on hotspot already.
| // Use atomic load: keys[] can be written concurrently via CAS in put() | ||
| // when a table is promoted to prev but still has in-flight insertions. |
There was a problem hiding this comment.
Why are these comments removed?
| // Use acquireTrace() to pair with the RELEASE store in setTrace(). | ||
| // If still PREPARING, treat as not found: callers will create a new entry. |
There was a problem hiding this comment.
Likewise - I think the comments are still valid
| u64 _num_suppressed_sampled_run; | ||
| u64 _num_task_block_emitted; | ||
| u64 _num_task_block_skipped_trace_context; | ||
| u64 _num_task_block_skipped_too_short; | ||
|
|
There was a problem hiding this comment.
Clarification - are these counters supposed to be used to drive the data reconstruction from the samples? Or they are just counters?
If it is the latter case, we should rather use the COUNTERS (they will get automatically written in the recording)
| } | ||
|
|
||
| Context context = ContextApi::snapshot(); | ||
| if (context.spanId != 0) { |
There was a problem hiding this comment.
This seems wrong - you bail out when spanId is set for the context?
| int NativeSocketInterposer::close_hook(int fd) { | ||
| int ret; | ||
| if (_orig_close == nullptr) { | ||
| ret = static_cast<int>(syscall(SYS_close, fd)); | ||
| } else { | ||
| ret = _orig_close(fd); | ||
| } | ||
| int saved_errno = errno; | ||
| if (ret == 0) { | ||
| NativeSocketInterposer::instance()->clearFdType(fd); | ||
| } | ||
| errno = saved_errno; | ||
| return ret; | ||
| } |
There was a problem hiding this comment.
close is hooked to invalidate the fd-type cache here, but the other fd-closing paths are not: dup2/dup3 implicitly close their target fd, and neither is in NativeIoHookIndex. A raw syscall(SYS_close) or a close in an unpatched library has the same effect. After such a close + fd reuse, _fd_type_cache[fd] holds a stale verdict until the next hooked close or clearFdTypeCache() (start/stop).
Impact is observability-only - call(fn) always runs the real I/O, so only the NativeBlockScope wrapping decision is affected: a reused fd can be misattributed (stale SOCKET → file read recorded as IO_WAIT) or under-sampled (stale NON_SOCKET → real socket I/O not wrapped). Self-healing on next close/reset, no crash or semantic effect.
Since we already hook close for exactly this reason, consider hooking dup2/dup3 to clearFdType the target fd and closing the gap.
| if (type != 0) { | ||
| _fd_type_cache[fd].store((gen << FD_TYPE_GEN_SHIFT) | type, | ||
| std::memory_order_release); | ||
| } | ||
| return type; |
There was a problem hiding this comment.
When probe() returns 0 — i.e. getsockopt(SO_TYPE) fails with something other than ENOTSOCK (e.g. EBADF) - the result is intentionally not cached, so this fd gets a fresh getsockopt syscall on every intercepted I/O op rather than a one-time cold-path cost. For a valid in-use fd this shouldn't occur, but it's the only path where the classifier degrades to one syscall per call. Worth a brief comment noting the non-cacheable case is deliberate (transient errors must not be cached) so it isn't mistaken for an oversight.
| #include <sys/syscall.h> | ||
| #include <unistd.h> | ||
|
|
||
| namespace { |
There was a problem hiding this comment.
Probably not necessary. All the functions here are already static and internally linked. The rest of the codebase is not really using anonymous namespaces.
| int entry_errno = errno; | ||
| bool eligible = NativeSocketInterposer::instance()->isDatagramSocket(fd); | ||
| errno = entry_errno; | ||
| return runNativeIoHook<Ret>(eligible, NativeBlockKind::UDP_RECEIVE, fd, fn, | ||
| call); | ||
| } |
There was a problem hiding this comment.
I wonder - I am seeing this pattern of storing and restoring errno.
Can we have a RAII for that?
There was a problem hiding this comment.
Can you add more test coverage?
Tier 1
directly threatens the design's core invariant (no test exists, and the race is real, not theoretical):
-
Concurrent enter/exit/snapshot on the same ThreadFilter slot. enterBlockedRun, snapshotAndExitBlockedRun, and the signal-handler's markSampledThisRun are all lock-free on shared slot state, but every test drives them single-threaded. The whole feature is built on a signal handler racing the application thread — and that interleaving is never tested. This is the headline gap.
-
Snapshot consistency under concurrent mutation. snapshotBlockRun does several independent relaxed loads (sampled_this_run, owner, anchor_sample_id, suppressed_sample_count). No test mutates state mid-snapshot to confirm the result is self-consistent (e.g. anchored=false but non-zero suppressed count).
-
TaskBlockQueue concurrent push/pop. It's an MPSC-style sequence-cell queue (the hard-to-get-right kind), tested only single-threaded. No multi-producer race, no cell-reuse-staleness test.
### Tier 2 error/early-exit paths that are written but never executed by a test: -
NativeBlockScope constructor gates. Of its ~6 early-exit gates (!taskBlockAsyncActive, !filter.enabled, current==nullptr, non-Java thread, slot_id<0, enterBlockedRun==0), only the spanId!=0 skip is tested. The others are silent - a regression that flips one would pass CI.
-
Null orig* function pointer → ENOSYS. The interposer handles it; no test injects null.
-
getsockopt failure with errno ≠ ENOTSOCK - uncovered, and it's the one real perf cliff.
Tier 3
lifecycle, worth one test each:
- Profiler restart with changed args (wallprecheck=true then false; ASGCT↔JVMTI engine switch with precheck active) — the static-singleton counters and drain-thread lifecycle across restart are untested. Note the prior memory: restart-in-prod isn't a supported mode, which lowers this priority.
- JVMTI-unavailable / J9 fallback — precheck wiring on non-ASGCT engines is asserted by the agents to silently no-op; worth a test to confirm it degrades cleanly rather than leaving weight slots un-restored.
…concurrency coverage test
98cb129 to
c278669
Compare
What does this PR do?:
Adds
datadog.TaskBlockJFR events for blocking intervals such asLockSupport.park,Object.wait, monitor contention, including recording APIs used bydd-trace-javafor instrumented blocking operations such asThread.sleep.Motivation:
This builds on
paul.fournillon/wallclock-suppression(#560) by preserving visibility into blocked spans as explicit duration events.Additional Notes:
This PR does not add the dd-trace-java instrumentation itself;
Thread.sleepemission depends on that side calling the new profiler APIs. Monitor callback support is HotSpot-specific, and virtual-thread carrier attribution is avoided.How to test the change?:
./.claude/commands/build-and-summarize :ddprof-test:testDebug -Ptests="*.wallclock.*TaskBlockTest"./.claude/commands/build-and-summarize :ddprof-lib:gtestDebugFor Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance.Unsure? Have a question? Request a review!