perf(profiling): bench branch (do not merge) by taegyunkim · Pull Request #18335 · DataDog/dd-trace-py

taegyunkim · 2026-05-29T13:36:23Z

Description

Synthesis branch for benchmarking the combined effect of the three PROF-14423 split PRs.

Not for merge. This branch exists so the DOE benchmark can measure end-to-end heap-live-size reduction before any of the constituent PRs land.

Combined contents (all rebased onto bb9eea91511d2b71df275a29755707324e552caa):

dd-trace-py PR #18294 (framestack-vector): FrameStack is std::vector<Frame> instead of std::deque<Frame>, with no upfront reserve. Includes the follow-up that drops front insertion in the asyncio task path.
dd-trace-py PR #18295 (unwind-frame-scratch): single seen_frames cycle-detection set lives on EchionSampler, reused across every unwind_frame call.
dd-trace-py PR #18296 (unwind-greenlets-buffer-reuse): current_greenlets becomes vector<StackInfo> with a cursor; snapshots / parents / visited scratch lives on EchionSampler; early-return leak fix; RSS regression test.

Testing

The constituent split PRs have their own test runs; this branch is for benchmarking only.

Risks

Bench-only — must not merge.

Additional Notes

Replaces the prior bundled bench branch unwind-greenlets-memory (#18292); that branch has been deleted upstream.

… allocs The native stack collector represented every captured stack as a std::deque<Frame>. Each construction triggered a chunk-map allocation (_M_initialize_map) and subsequent pushes allocated fixed-size chunks. In a DOE gevent benchmark this combination dominated native heap-live-size for the unwind_greenlets path. Switching FrameStack to std::vector<Frame> with reserve(max_frames) in the default ctor: - Eliminates the chunk-map allocation per construction. - Eliminates per-chunk allocations on push_back, since max_frames is the hard cap on stack depth. - Keeps the cache-eviction-safety property documented above the class: FrameStack owns Frame values, so Frame::get cache eviction is unaffected. Every existing call site uses only push_back, forward iteration, size(), clear(), and integer indexing. The single push_front in the asyncio task path becomes insert(begin(), ...). The inner loop runs at most ~max_frames times so each O(n) insert leaves overall cost O(max_frames^2), well within the prior deque cost for small max_frames. PROF-14423 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

unwind_frame previously constructed a fresh std::unordered_set<PyObject*> on every call to detect cycles in the frame chain. With one sampling iteration walking many stacks (Python thread stack, every greenlet, every coroutine in every asyncio task), this added up to a measurable allocator churn in the sampling hot path. EchionSampler now owns a single seen_frames scratch set. unwind_frame gains a 4-arg overload that takes the scratch by reference and clears it on entry; a 3-arg convenience wrapper keeps fuzz harnesses and other off-sampling-thread callers working unchanged. Only the single sampling thread touches the scratch, so no lock is required. PROF-14423 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… samples unwind_greenlets previously allocated per leaf greenlet on every sample iteration: a new StackInfo (via make_unique), a fresh snapshots vector with per-greenlet parent_chain vectors, and per-call hash sets for parent tracking and cycle detection. In a DOE gevent benchmark this contributed ~half of the native heap-live-size of the entire process. Changes: - EchionSampler gains three sampling-thread-only scratch buffers (greenlet_snapshots, greenlet_parents, greenlet_visited). Only the single sampling thread touches these so no lock is needed. - ThreadInfo::current_greenlets becomes std::vector<StackInfo> (by value, not unique_ptr) with a greenlet_count_ cursor. Entries are kept alive between samples so the inner FrameStack capacity amortizes. unwind_greenlets uses a parallel snap_count cursor on the snapshots scratch so parent_chain vectors retain capacity too. The on-CPU swap is rewritten against current_greenlets[i] values. - ThreadInfo::sample iterates current_greenlets[0..greenlet_count_) and resets greenlet_count_ on the early-return error path. Previously a string-table lookup miss left entries un-cleared until the next sample overwrote them; with the cursor pattern that would have leaked indefinitely on persistent lookup failure. Adds a new test test_gevent_unwind_greenlets_rss_stable that runs 1000 idle greenlets with 30-frame stacks under aggressive (5 ms) sampling and asserts post-warmup RSS grows by < 20 MB. Before the fix this exercise grew RSS by hundreds of MB driven by per-sample StackInfo allocations. PROF-14423 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

reserve(max_frames) committed max_frames * sizeof(Frame) = ~96 KB per FrameStack on construction (max_frames=2048, sizeof(Frame)=48 B). In the current vector<unique_ptr<StackInfo>> ownership model, current_greenlets and current_tasks are cleared every sample, so every iteration would create fresh FrameStacks and pay that reservation cost N times. That is much worse than deque's ~3.5 KB typical per-stack footprint. Defaulting to no reserve lets the vector start at zero capacity and grow geometrically as push_back fills it. Total bytes per stack ends up close to deque's footprint (~3 KB for a 50-frame stack), still in a single contiguous buffer. The reserve only becomes worthwhile when StackInfo entries are reused across samples so the reservation is amortized; that lives in the subsequent buffer-reuse refactor. PROF-14423 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cit-pr-commenter-54b7da · 2026-05-29T13:36:50Z

Codeowners resolved as

src/native/Cargo.lock                                                   @DataDog/apm-core-python
src/native/Cargo.toml                                                   @DataDog/apm-core-python
src/native/py_string.rs                                                 @DataDog/apm-core-python

datadog-prod-us1-6 · 2026-05-29T13:37:33Z

Tests

✨ Fix all issues with BitsAI

⚠️ Warnings

🚦 8 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-py | build linux serverless: [amd64, cp315-cp315, v113741238-d2b8243-manylinux2014_x86_64, 1]

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration.
NotImplementedError: This version of CPython is not supported yet during import of ddtrace.

DataDog/apm-reliability/dd-trace-py | build linux serverless: [amd64, cp315-cp315, v113741491-d2b8243-musllinux_1_2_x86_64, 1]

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration.
NotImplementedError: This version of CPython is not supported yet due to incompatible Python version 3.15.0b1.

DataDog/apm-reliability/dd-trace-py | build linux serverless: [arm64, cp315-cp315, v113741357-d2b8243-manylinux2014_aarch64, 1]

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration.
NotImplementedError: This version of CPython is not supported yet during the import of ddtrace.

View all 8 failed jobs.

ℹ️ Info

No other issues found (see more)

🧪 All tests passed
❄️ No new flaky tests detected

Useful? React with 👍 / 👎

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: fba16a0 | Docs | Datadog PR Page | Give us feedback!}

pr-commenter · 2026-05-29T14:03:52Z

Benchmarks

Benchmark execution time: 2026-05-29 14:03:49

Comparing candidate commit c218376 in PR branch taegyunkim/prof-14423-bench with baseline commit 2195dfd in branch main.

Found 0 performance improvements and 6 performance regressions! Performance is the same for 611 metrics, 10 unstable metrics.

scenario:httppropagationinject-ids_only

🟥 execution_time [+1.576µs; +1.726µs] or [+7.815%; +8.559%]

scenario:iast_aspects-re_match_noaspect

🟥 execution_time [+34.431µs; +40.356µs] or [+10.594%; +12.417%]

scenario:iastaspects-index_aspect

🟥 execution_time [+9.845µs; +15.224µs] or [+8.028%; +12.415%]

scenario:iastaspects-ljust_aspect

🟥 execution_time [+58.150µs; +68.411µs] or [+9.612%; +11.308%]

scenario:iastaspects-title_aspect

🟥 execution_time [+32.190µs; +41.306µs] or [+9.676%; +12.416%]

scenario:iastaspectsospath-ospathbasename_aspect

🟥 execution_time [+109.711µs; +116.194µs] or [+25.964%; +27.499%]

…r-reuse

taegyunkim and others added 5 commits May 27, 2026 17:18

perf(profiling): avoid framestack front insertion

8e557a0

taegyunkim changed the title ~~perf(profiling): PROF-14423 bench branch (do not merge)~~ perf(profiling): bench branch (do not merge) May 29, 2026

taegyunkim added 6 commits May 29, 2026 14:29

Merge branch 'main' into taegyunkim/prof-14423-framestack-vector

5b4efa6

Merge branch 'main' into taegyunkim/prof-14423-framestack-vector

4118698

Merge branch 'main' into taegyunkim/prof-14423-unwind-frame-scratch

d0960f1

Merge branch 'main' into taegyunkim/prof-14423-unwind-greenlets-buffe…

90bcbf6

…r-reuse

Merge taegyunkim/prof-14423-unwind-frame-scratch into bench

5a21206

Merge taegyunkim/prof-14423-unwind-greenlets-buffer-reuse into bench

fba16a0

taegyunkim force-pushed the taegyunkim/prof-14423-bench branch from c218376 to fba16a0 Compare June 1, 2026 13:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(profiling): bench branch (do not merge)#18335

perf(profiling): bench branch (do not merge)#18335
taegyunkim wants to merge 11 commits into
mainfrom
taegyunkim/prof-14423-bench

taegyunkim commented May 29, 2026 •

edited

Loading

Uh oh!

cit-pr-commenter-54b7da Bot commented May 29, 2026

Uh oh!

datadog-prod-us1-6 Bot commented May 29, 2026 •

edited by datadog-datadog-prod-us1 Bot

Loading

Uh oh!

pr-commenter Bot commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

taegyunkim commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Testing

Risks

Additional Notes

Uh oh!

cit-pr-commenter-54b7da Bot commented May 29, 2026

Codeowners resolved as

Uh oh!

datadog-prod-us1-6 Bot commented May 29, 2026 • edited by datadog-datadog-prod-us1 Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Warnings

ℹ️ Info

Uh oh!

pr-commenter Bot commented May 29, 2026

Benchmarks

scenario:httppropagationinject-ids_only

scenario:iast_aspects-re_match_noaspect

scenario:iastaspects-index_aspect

scenario:iastaspects-ljust_aspect

scenario:iastaspects-title_aspect

scenario:iastaspectsospath-ospathbasename_aspect

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

taegyunkim commented May 29, 2026 •

edited

Loading

datadog-prod-us1-6 Bot commented May 29, 2026 •

edited by datadog-datadog-prod-us1 Bot

Loading