feat(profiling): add off-cpu cause label to stack profiler samples #18668

Draft
vlad-scherbich wants to merge 4 commits into vlad/profiling-offcpu-approximation from vlad/profiling-offcpu-cause-label

vlad-scherbich commented Jun 18, 2026

Summary

Follow-up to #18623 (off-CPU time approximation). Adds an off cpu cause pprof label to each stack sample that has non-zero off-CPU time, classifying the likely reason a thread was blocked.

  • sleep — explicit sleep (time.sleep, asyncio.sleep)
  • lock — lock/semaphore/event wait (threading.Lock.acquire, asyncio.Lock.acquire, threading.Event.wait, etc.)
  • io — blocking socket I/O (socket.recv, socket.send)
  • other — off-CPU observed but cause not identifiable from the Python frame (OS preemption, C-extension blocking)

How it works

The leaf frame (first frame rendered by echion, which is the frame where the thread is currently blocked) is captured in ThreadState.top_frame_name during render_frame() / render_native_frame(). At render_stack_end(), the name is matched against keyword patterns to produce the cause string, then emitted as a pprof label alongside the off-CPU value.
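
The keyword matching described above can be sketched in Python as follows. This is only an illustration: the real classifier lives in stack_renderer.cpp, and the keyword table and the function name `classify_off_cpu_cause` are assumptions, not the patterns used by this PR.

```python
# Hypothetical keyword table. The actual patterns are defined in
# stack_renderer.cpp and are not listed in this description.
CAUSE_KEYWORDS = (
    ("sleep", ("sleep",)),
    ("lock", ("acquire", "wait")),
    ("io", ("recv", "send")),
)


def classify_off_cpu_cause(leaf_frame_name: str) -> str:
    """Map a leaf frame name to one of: sleep, lock, io, other."""
    name = leaf_frame_name.lower()
    for cause, keywords in CAUSE_KEYWORDS:
        # Substring match, so "lock.acquire" and "Event.wait" both hit "lock".
        if any(keyword in name for keyword in keywords):
            return cause
    # Nothing matched: OS preemption, C-extension blocking, and similar.
    return "other"
```

With this table, the leaf frames from the demo output (`sleep`, `lock.acquire`, `socket.recv`, `monotonic`) map to sleep, lock, io, and other respectively.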

The label is only emitted when:

  1. Off-CPU collection is enabled (_DD_PROFILING_STACK_OFFCPU_TIME_ENABLED=true)
  2. off_cpu_ns > 0 for the sample
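
A minimal sketch of the two conditions above, written in Python for readability. The helper names are hypothetical, and treating only the string "true" as enabled is an assumption; the profiler's actual flag parsing may accept other values.

```python
import os

OFFCPU_ENV_VAR = "_DD_PROFILING_STACK_OFFCPU_TIME_ENABLED"


def off_cpu_collection_enabled(environ=os.environ) -> bool:
    # Assumption: only "true" (case-insensitive) turns collection on.
    return environ.get(OFFCPU_ENV_VAR, "").strip().lower() == "true"


def should_emit_cause_label(off_cpu_ns: int, environ=os.environ) -> bool:
    # Both conditions must hold: collection enabled and non-zero off-CPU time.
    return off_cpu_collection_enabled(environ) and off_cpu_ns > 0
```

A sample with `off_cpu_ns == 0` therefore carries no cause label even when collection is enabled.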

Limitations (documented in release note)

OS-level causes (involuntary preemption, page faults, futex contention) are indistinguishable from Python frame inspection alone — those fall through to "other". True kernel-level attribution requires eBPF (sched_switch). The label is a Python-semantic approximation intended for backend aggregation and UI filtering.

Testing

$ .riot/venv_py31211_deps/bin/python scripts/demo_offcpu_approximation.py --duration 5
Profiling 5.0s  •  8 threads: sleeper / lock-waiter / event-waiter / queue-waiter / io-waiter / spinner / cpu-fibonacci / cpu-hash

─── Time summary ───────────────────────────────────────────────────────────────
Thread                     wall (ms)      cpu (ms)    off-cpu (ms)   off-cpu %
────────────────────────────────────────────────────────────────────────────────
MainThread                    5030.6           0.3          5030.4      100.0%
cpu-fibonacci-thread          5003.5        1618.4          3899.4       77.9%
cpu-hash-thread               5004.4        1655.6          3843.5       76.8%
event-waiter-thread           5000.8           2.0          4998.8      100.0%
io-waiter-thread              5019.7           2.6          5017.1       99.9%
lock-waiter-thread            5030.6           0.0          5030.6      100.0%
queue-waiter-thread           5011.5           2.4          5009.1      100.0%
sleeper-thread                5011.5           1.3          5010.1      100.0%
spinner-thread                5011.5        1739.6          3805.9       75.9%

─── Off-CPU cause breakdown ────────────────────────────────────────────────────
Thread                           sleep            lock              io           other
────────────────────────────────────────────────────────────────────────────────
MainThread                     5005.8ms           24.6ms            0.0ms            0.0ms
cpu-fibonacci-thread              0.0ms            0.0ms            0.0ms         3899.4ms
cpu-hash-thread                   0.0ms            0.0ms            0.0ms         3843.5ms
event-waiter-thread               0.0ms         4998.8ms            0.0ms            0.0ms
io-waiter-thread                  0.0ms            0.0ms         4650.7ms          366.3ms
lock-waiter-thread                0.0ms         5030.6ms            0.0ms            0.0ms
queue-waiter-thread               0.0ms         5003.8ms            0.0ms            5.2ms
sleeper-thread                 5010.1ms            0.0ms            0.0ms            0.0ms
spinner-thread                    0.0ms            0.0ms            0.0ms         3805.9ms

─── Top off-CPU stacks  (heaviest single sample per thread) ────────────────────

  [MainThread]  cause=sleep  5005.8 ms
      ├─ sleep                                    time
      ├─ main                                     demo_offcpu_approximation.py
      └─ <module>                                 demo_offcpu_approximation.py

  [cpu-fibonacci-thread]  cause=other  2930.9 ms
      ├─ monotonic                                time
      ├─ cpu_fibonacci                            demo_offcpu_approximation.py
      ├─ Thread.run                               threading.py
      ├─ Thread._bootstrap_inner                  threading.py
      ├─ init_stack.<locals>.thread_bootstrap_inner threading.py
      └─ Thread._bootstrap                        threading.py

  [cpu-hash-thread]  cause=other  2875.5 ms
      ├─ openssl_sha256                           _hashlib
      ├─ cpu_hash                                 demo_offcpu_approximation.py
      ├─ Thread.run                               threading.py
      ├─ Thread._bootstrap_inner                  threading.py
      ├─ init_stack.<locals>.thread_bootstrap_inner threading.py
      └─ Thread._bootstrap                        threading.py

  [event-waiter-thread]  cause=lock  4988.8 ms
      ├─ lock.acquire                             
      ├─ Condition.wait                           threading.py
      ├─ Event.wait                               threading.py
      ├─ event_waiter                             demo_offcpu_approximation.py
      ├─ Thread.run                               threading.py
      └─ Thread._bootstrap_inner                  threading.py

  [io-waiter-thread]  cause=io  4650.7 ms
      ├─ socket.recv                              
      ├─ io_waiter                                demo_offcpu_approximation.py
      ├─ Thread.run                               threading.py
      ├─ Thread._bootstrap_inner                  threading.py
      ├─ init_stack.<locals>.thread_bootstrap_inner threading.py
      └─ Thread._bootstrap                        threading.py

  [lock-waiter-thread]  cause=lock  5030.6 ms
      ├─ lock.acquire                             
      ├─ lock_waiter                              demo_offcpu_approximation.py
      ├─ Thread.run                               threading.py
      ├─ Thread._bootstrap_inner                  threading.py
      ├─ init_stack.<locals>.thread_bootstrap_inner threading.py
      └─ Thread._bootstrap                        threading.py

  [queue-waiter-thread]  cause=lock  4993.3 ms
      ├─ lock.acquire                             
      ├─ Condition.wait                           threading.py
      ├─ Queue.get                                queue.py
      ├─ queue_waiter                             demo_offcpu_approximation.py
      ├─ Thread.run                               threading.py
      └─ Thread._bootstrap_inner                  threading.py

  [sleeper-thread]  cause=sleep  5010.1 ms
      ├─ sleep                                    time
      ├─ sleeper                                  demo_offcpu_approximation.py
      ├─ Thread.run                               threading.py
      ├─ Thread._bootstrap_inner                  threading.py
      ├─ init_stack.<locals>.thread_bootstrap_inner threading.py
      └─ Thread._bootstrap                        threading.py

  [spinner-thread]  cause=other  3011.7 ms
      ├─ monotonic                                time
      ├─ spinner                                  demo_offcpu_approximation.py
      ├─ Thread.run                               threading.py
      ├─ Thread._bootstrap_inner                  threading.py
      ├─ init_stack.<locals>.thread_bootstrap_inner threading.py
      └─ Thread._bootstrap                        threading.py
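
As a quick sanity check on the output above, the per-cause columns should add up to each thread's off-CPU total. The snippet below copies the numbers from the two tables and computes the largest gap; the function name is hypothetical and the check is not part of the demo script.

```python
# thread: (off_cpu_ms, sleep_ms, lock_ms, io_ms, other_ms), copied from the
# "Time summary" and "Off-CPU cause breakdown" tables above.
ROWS = {
    "MainThread": (5030.4, 5005.8, 24.6, 0.0, 0.0),
    "cpu-fibonacci-thread": (3899.4, 0.0, 0.0, 0.0, 3899.4),
    "cpu-hash-thread": (3843.5, 0.0, 0.0, 0.0, 3843.5),
    "event-waiter-thread": (4998.8, 0.0, 4998.8, 0.0, 0.0),
    "io-waiter-thread": (5017.1, 0.0, 0.0, 4650.7, 366.3),
    "lock-waiter-thread": (5030.6, 0.0, 5030.6, 0.0, 0.0),
    "queue-waiter-thread": (5009.1, 0.0, 5003.8, 0.0, 5.2),
    "sleeper-thread": (5010.1, 5010.1, 0.0, 0.0, 0.0),
    "spinner-thread": (3805.9, 0.0, 0.0, 0.0, 3805.9),
}


def max_breakdown_gap(rows) -> float:
    """Largest |off-cpu total - sum of per-cause columns| across threads, in ms."""
    return max(abs(total - sum(parts)) for total, *parts in rows.values())


# Every thread's breakdown matches its total to within display rounding.
assert max_breakdown_gap(ROWS) < 0.15
```

The only non-zero gaps are about 0.1 ms for io-waiter-thread and queue-waiter-thread, which is consistent with each column being rounded to one decimal place.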

cit-pr-commenter-54b7da Bot commented Jun 18, 2026
Codeowners resolved as

ddtrace/internal/datadog/profiling/stack/src/stack_renderer.cpp         @DataDog/profiling-python
scripts/demo_offcpu_approximation.py                                    @DataDog/apm-core-python

datadog-datadog-prod-us1-2 Bot commented Jun 18, 2026

⚠️ Warnings

🚦 21 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-py | build linux serverless: [amd64, cp315-cp315, v113741238-d2b8243-manylinux2014_x86_64, 1]

DataDog/apm-reliability/dd-trace-py | build linux serverless: [amd64, cp315-cp315, v113741491-d2b8243-musllinux_1_2_x86_64, 1]

DataDog/apm-reliability/dd-trace-py | build linux serverless: [arm64, cp315-cp315, v113741357-d2b8243-manylinux2014_aarch64, 1]

View all 21 failed jobs.

ℹ️ Info

No other issues found

🧪 All tests passed
❄️ No new flaky tests detected

🔗 Commit SHA: c92cf40

Copilot AI left a comment

Pull request overview

Adds an off cpu cause pprof label to stack-profiler samples that have non-zero off-CPU time, classifying the likely blocking reason (sleep/lock/io/other). This extends the existing off-CPU approximation work by attaching a low-cardinality, UI-friendly categorization to each relevant sample.

Changes:

  • Capture the leaf (top) frame name during stack rendering and classify it into an off-CPU cause at stack end.
  • Emit the new off cpu cause label alongside off-CPU time samples when off-CPU collection is enabled and off_cpu_ns > 0.
  • Add tests for the new label (sleep/lock) and document the feature via a release note.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| ddtrace/internal/datadog/profiling/stack/src/stack_renderer.cpp | Tracks leaf frame name and emits off cpu cause label for off-CPU samples. |
| ddtrace/internal/datadog/profiling/stack/include/stack_renderer.hpp | Extends ThreadState to store the leaf frame name for classification. |
| ddtrace/internal/datadog/profiling/dd_wrapper/include/libdatadog_helpers.hpp | Adds off_cpu_cause to the exported label key set. |
| tests/profiling/collector/test_stack.py | Updates off-CPU type note and adds tests asserting the new label for sleep/lock scenarios. |
| releasenotes/notes/profiling-offcpu-cause-label-a3b2c1d4e5f6a7b8.yaml | Documents the new off cpu cause label and its semantics/limitations. |

Comment thread tests/profiling/collector/test_stack.py
Comment thread ddtrace/internal/datadog/profiling/stack/src/stack_renderer.cpp

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment thread scripts/demo_offcpu_approximation.py Outdated
Comment thread ddtrace/internal/datadog/profiling/stack/src/stack_renderer.cpp
…rocess isolation

- Add test_off_cpu_cause_io: verifies socket recv is tagged cause='io'
- Fix test_off_cpu_cause_lock flakiness: filter to lock-thread samples before
  asserting cause, skip if none collected (CPU time may be unavailable)
- Convert all cause tests to @pytest.mark.subprocess to avoid ddup init
  ordering issues (ProfilerState::start uses std::call_once)
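
The "filter to the thread's samples, then skip if none were collected" pattern from the commit message above might look roughly like this. It is a sketch only: samples are modelled as plain label dicts, the "thread name" key and both helper names are assumptions, and the real tests in test_stack.py use the project's own pprof helpers.

```python
CAUSE_LABEL = "off cpu cause"  # label key as written in this PR
THREAD_LABEL = "thread name"   # assumed key for the thread-name label


def causes_for_thread(samples, thread_name):
    """Collect the cause labels seen on one thread's samples."""
    return {
        labels[CAUSE_LABEL]
        for labels in samples
        if labels.get(THREAD_LABEL) == thread_name and CAUSE_LABEL in labels
    }


def check_lock_cause(samples, thread_name):
    """Return "skip" when the thread has no labelled samples, else assert."""
    causes = causes_for_thread(samples, thread_name)
    if not causes:
        # No off-CPU samples for this thread (CPU time may be unavailable).
        return "skip"
    assert "lock" in causes, f"expected 'lock', got {sorted(causes)}"
    return "pass"
```

Filtering by thread first keeps samples from unrelated threads, such as a sleeping main thread, from influencing the assertion.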
vlad-scherbich force-pushed the vlad/profiling-offcpu-cause-label branch from 02c0858 to fecc4b0 on June 18, 2026 17:32
…v var name

- render_frame: leaf-frame name was looked up twice in the Echion string table
  (once for top_frame_name, once for name_str). Cache the result in
  leaf_name_str and reuse it to avoid the duplicate lookup on the hot path.
- demo script: correct env var from DD_PROFILING_STACK_V2_OFFCPU_TIME_ENABLED
  to _DD_PROFILING_STACK_OFFCPU_TIME_ENABLED