feat(profiling): add off-cpu cause label to stack profiler samples #18668
vlad-scherbich wants to merge 4 commits into
Pull request overview
Adds an off cpu cause pprof label to stack-profiler samples that have non-zero off-CPU time, classifying the likely blocking reason (sleep/lock/io/other). This extends the existing off-CPU approximation work by attaching a low-cardinality, UI-friendly categorization to each relevant sample.
Changes:
- Capture the leaf (top) frame name during stack rendering and classify it into an off-CPU cause at stack end.
- Emit the new `off cpu cause` label alongside off-CPU time samples when off-CPU collection is enabled and `off_cpu_ns > 0`.
- Add tests for the new label (sleep/lock) and document the feature via a release note.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| ddtrace/internal/datadog/profiling/stack/src/stack_renderer.cpp | Tracks leaf frame name and emits `off cpu cause` label for off-CPU samples. |
| ddtrace/internal/datadog/profiling/stack/include/stack_renderer.hpp | Extends ThreadState to store the leaf frame name for classification. |
| ddtrace/internal/datadog/profiling/dd_wrapper/include/libdatadog_helpers.hpp | Adds off_cpu_cause to the exported label key set. |
| tests/profiling/collector/test_stack.py | Updates off-CPU type note and adds tests asserting the new label for sleep/lock scenarios. |
| releasenotes/notes/profiling-offcpu-cause-label-a3b2c1d4e5f6a7b8.yaml | Documents the new `off cpu cause` label and its semantics/limitations. |
…rocess isolation
- Add test_off_cpu_cause_io: verifies socket recv is tagged cause='io'
- Fix test_off_cpu_cause_lock flakiness: filter to lock-thread samples before asserting cause, skip if none collected (CPU time may be unavailable)
- Convert all cause tests to @pytest.mark.subprocess to avoid ddup init ordering issues (ProfilerState::start uses std::call_once)
02c0858 to fecc4b0
…v var name
- render_frame: the leaf-frame name was looked up twice in the Echion string table (once for top_frame_name, once for name_str). Cache the result in leaf_name_str and reuse it to avoid the duplicate lookup on the hot path.
- demo script: correct env var from DD_PROFILING_STACK_V2_OFFCPU_TIME_ENABLED to _DD_PROFILING_STACK_OFFCPU_TIME_ENABLED
Summary
Follow-up to #18623 (off-CPU time approximation). Adds an `off cpu cause` pprof label to each stack sample that has non-zero off-CPU time, classifying the likely reason a thread was blocked.
- `sleep`: explicit sleep (`time.sleep`, `asyncio.sleep`)
- `lock`: lock/semaphore/event wait (`threading.Lock.acquire`, `asyncio.Lock.acquire`, `threading.Event.wait`, etc.)
- `io`: blocking socket I/O (`socket.recv`, `socket.send`)
- `other`: off-CPU observed but cause not identifiable from the Python frame (OS preemption, C-extension blocking)
How it works
The leaf frame (the first frame rendered by echion, which is the frame where the thread is currently blocked) is captured in `ThreadState.top_frame_name` during `render_frame()` / `render_native_frame()`. At `render_stack_end()`, the name is matched against keyword patterns to produce the cause string, which is then emitted as a pprof label alongside the off-CPU value.
The label is only emitted when:
- off-CPU collection is enabled (`_DD_PROFILING_STACK_OFFCPU_TIME_ENABLED=true`)
- `off_cpu_ns > 0` for the sample
Limitations (documented in release note)
OS-level causes (involuntary preemption, page faults, futex contention) are indistinguishable from Python frame inspection alone; those fall through to `"other"`. True kernel-level attribution requires eBPF (`sched_switch`). The label is a Python-semantic approximation intended for backend aggregation and UI filtering.
Testing
- `demo_offcpu_approximation.py` demo script
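For reviewers who want to reason about the classification described under "How it works" without reading the C++, here is a minimal Python sketch of the same idea. It is not the code in `stack_renderer.cpp`: the keyword table, the function names `classify_off_cpu_cause` and `off_cpu_labels`, and the substring-matching rule are all assumptions made for illustration. Only the four cause values, the two emission conditions, and the `off cpu cause` label name come from this PR.

```python
from typing import Dict

# Hypothetical keyword table. The actual patterns used by the PR are not
# shown in this description; these are derived from its listed examples.
_CAUSE_KEYWORDS = (
    ("sleep", ("sleep",)),
    ("lock", ("acquire", "wait")),
    ("io", ("recv", "send")),
)


def classify_off_cpu_cause(leaf_frame_name: str) -> str:
    """Map a leaf frame name to one of: sleep, lock, io, other."""
    name = leaf_frame_name.lower()
    for cause, keywords in _CAUSE_KEYWORDS:
        if any(keyword in name for keyword in keywords):
            return cause
    # No keyword matched: the cause cannot be inferred from the Python frame.
    return "other"


def off_cpu_labels(leaf_frame_name: str, off_cpu_ns: int, enabled: bool) -> Dict[str, str]:
    """Return the label to attach to a sample, or nothing if it does not qualify."""
    # Both conditions from the PR description must hold.
    if not enabled or off_cpu_ns <= 0:
        return {}
    return {"off cpu cause": classify_off_cpu_cause(leaf_frame_name)}
```

Under these assumptions, a sample whose leaf frame is `recv` with 10 ns of off-CPU time would carry `{"off cpu cause": "io"}`, while the same frame with `off_cpu_ns == 0` would carry no cause label at all. Keeping the result to four fixed strings is what makes the label low-cardinality.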