Skip to content

perf(profiling): bench branch (do not merge)#18335

Draft
taegyunkim wants to merge 11 commits into
mainfrom
taegyunkim/prof-14423-bench
Draft

perf(profiling): bench branch (do not merge)#18335
taegyunkim wants to merge 11 commits into
mainfrom
taegyunkim/prof-14423-bench

Conversation

@taegyunkim

@taegyunkim taegyunkim commented May 29, 2026

Copy link
Copy Markdown
Contributor

Description

Synthesis branch for benchmarking the combined effect of the three PROF-14423 split PRs.

Not for merge. This branch exists so the DOE benchmark can measure end-to-end heap-live-size reduction before any of the constituent PRs land.

Combined contents (all rebased onto bb9eea91511d2b71df275a29755707324e552caa):

  • dd-trace-py PR #18294 (framestack-vector): FrameStack is std::vector<Frame> instead of std::deque<Frame>, with no upfront reserve. Includes the follow-up that drops front insertion in the asyncio task path.
  • dd-trace-py PR #18295 (unwind-frame-scratch): single seen_frames cycle-detection set lives on EchionSampler, reused across every unwind_frame call.
  • dd-trace-py PR #18296 (unwind-greenlets-buffer-reuse): current_greenlets becomes vector<StackInfo> with a cursor; snapshots / parents / visited scratch lives on EchionSampler; early-return leak fix; RSS regression test.

Testing

The constituent split PRs have their own test runs; this branch is for benchmarking only.

Risks

Bench-only — must not merge.

Additional Notes

Replaces the prior bundled bench branch unwind-greenlets-memory (#18292); that branch has been deleted upstream.

taegyunkim and others added 5 commits May 27, 2026 17:18
… allocs

The native stack collector represented every captured stack as a
std::deque<Frame>. Each construction triggered a chunk-map allocation
(_M_initialize_map) and subsequent pushes allocated fixed-size chunks. In a
DOE gevent benchmark this combination dominated native heap-live-size for
the unwind_greenlets path.

Switching FrameStack to std::vector<Frame> with reserve(max_frames) in the
default ctor:

- Eliminates the chunk-map allocation per construction.
- Eliminates per-chunk allocations on push_back, since max_frames is the
  hard cap on stack depth.
- Keeps the cache-eviction-safety property documented above the class:
  FrameStack owns Frame values, so Frame::get cache eviction is unaffected.

Every existing call site uses only push_back, forward iteration, size(),
clear(), and integer indexing. The single push_front in the asyncio task
path becomes insert(begin(), ...). The inner loop runs at most ~max_frames
times so each O(n) insert leaves overall cost O(max_frames^2), well within
the prior deque cost for small max_frames.

PROF-14423

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
unwind_frame previously constructed a fresh std::unordered_set<PyObject*> on
every call to detect cycles in the frame chain. With one sampling iteration
walking many stacks (Python thread stack, every greenlet, every coroutine in
every asyncio task), this added up to a measurable allocator churn in the
sampling hot path.

EchionSampler now owns a single seen_frames scratch set. unwind_frame gains a
4-arg overload that takes the scratch by reference and clears it on entry; a
3-arg convenience wrapper keeps fuzz harnesses and other off-sampling-thread
callers working unchanged. Only the single sampling thread touches the
scratch, so no lock is required.

PROF-14423

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… samples

unwind_greenlets previously allocated per leaf greenlet on every sample
iteration: a new StackInfo (via make_unique), a fresh snapshots vector with
per-greenlet parent_chain vectors, and per-call hash sets for parent
tracking and cycle detection. In a DOE gevent benchmark this contributed
~half of the native heap-live-size of the entire process.

Changes:

- EchionSampler gains three sampling-thread-only scratch buffers
  (greenlet_snapshots, greenlet_parents, greenlet_visited). Only the single
  sampling thread touches these so no lock is needed.

- ThreadInfo::current_greenlets becomes std::vector<StackInfo> (by value,
  not unique_ptr) with a greenlet_count_ cursor. Entries are kept alive
  between samples so the inner FrameStack capacity amortizes.
  unwind_greenlets uses a parallel snap_count cursor on the snapshots
  scratch so parent_chain vectors retain capacity too. The on-CPU swap is
  rewritten against current_greenlets[i] values.

- ThreadInfo::sample iterates current_greenlets[0..greenlet_count_) and
  resets greenlet_count_ on the early-return error path. Previously a
  string-table lookup miss left entries un-cleared until the next sample
  overwrote them; with the cursor pattern that would have leaked
  indefinitely on persistent lookup failure.

Adds a new test test_gevent_unwind_greenlets_rss_stable that runs 1000 idle
greenlets with 30-frame stacks under aggressive (5 ms) sampling and asserts
post-warmup RSS grows by < 20 MB. Before the fix this exercise grew RSS by
hundreds of MB driven by per-sample StackInfo allocations.

PROF-14423

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reserve(max_frames) committed max_frames * sizeof(Frame) = ~96 KB per
FrameStack on construction (max_frames=2048, sizeof(Frame)=48 B). In the
current vector<unique_ptr<StackInfo>> ownership model, current_greenlets and
current_tasks are cleared every sample, so every iteration would create
fresh FrameStacks and pay that reservation cost N times. That is much worse
than deque's ~3.5 KB typical per-stack footprint.

Defaulting to no reserve lets the vector start at zero capacity and grow
geometrically as push_back fills it. Total bytes per stack ends up close to
deque's footprint (~3 KB for a 50-frame stack), still in a single
contiguous buffer.

The reserve only becomes worthwhile when StackInfo entries are reused
across samples so the reservation is amortized; that lives in the
subsequent buffer-reuse refactor.

PROF-14423

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cit-pr-commenter-54b7da

Copy link
Copy Markdown

Codeowners resolved as

src/native/Cargo.lock                                                   @DataDog/apm-core-python
src/native/Cargo.toml                                                   @DataDog/apm-core-python
src/native/py_string.rs                                                 @DataDog/apm-core-python

@datadog-prod-us1-6

datadog-prod-us1-6 Bot commented May 29, 2026

Copy link
Copy Markdown

Pipelines  Tests

Fix all issues with BitsAI

⚠️ Warnings

🚦 8 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-py | build linux serverless: [amd64, cp315-cp315, v113741238-d2b8243-manylinux2014_x86_64, 1]   View in Datadog   GitLab

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. NotImplementedError: This version of CPython is not supported yet during import of ddtrace.

DataDog/apm-reliability/dd-trace-py | build linux serverless: [amd64, cp315-cp315, v113741491-d2b8243-musllinux_1_2_x86_64, 1]   View in Datadog   GitLab

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. NotImplementedError: This version of CPython is not supported yet due to incompatible Python version 3.15.0b1.

DataDog/apm-reliability/dd-trace-py | build linux serverless: [arm64, cp315-cp315, v113741357-d2b8243-manylinux2014_aarch64, 1]   View in Datadog   GitLab

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. NotImplementedError: This version of CPython is not supported yet during the import of ddtrace.

View all 8 failed jobs.

ℹ️ Info

No other issues found (see more)

🧪 All tests passed
❄️ No new flaky tests detected

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: fba16a0 | Docs | Datadog PR Page | Give us feedback!

@taegyunkim taegyunkim changed the title perf(profiling): PROF-14423 bench branch (do not merge) perf(profiling): bench branch (do not merge) May 29, 2026
@pr-commenter

pr-commenter Bot commented May 29, 2026

Copy link
Copy Markdown

Benchmarks

Benchmark execution time: 2026-05-29 14:03:49

Comparing candidate commit c218376 in PR branch taegyunkim/prof-14423-bench with baseline commit 2195dfd in branch main.

Found 0 performance improvements and 6 performance regressions! Performance is the same for 611 metrics, 10 unstable metrics.

scenario:httppropagationinject-ids_only

  • 🟥 execution_time [+1.576µs; +1.726µs] or [+7.815%; +8.559%]

scenario:iast_aspects-re_match_noaspect

  • 🟥 execution_time [+34.431µs; +40.356µs] or [+10.594%; +12.417%]

scenario:iastaspects-index_aspect

  • 🟥 execution_time [+9.845µs; +15.224µs] or [+8.028%; +12.415%]

scenario:iastaspects-ljust_aspect

  • 🟥 execution_time [+58.150µs; +68.411µs] or [+9.612%; +11.308%]

scenario:iastaspects-title_aspect

  • 🟥 execution_time [+32.190µs; +41.306µs] or [+9.676%; +12.416%]

scenario:iastaspectsospath-ospathbasename_aspect

  • 🟥 execution_time [+109.711µs; +116.194µs] or [+25.964%; +27.499%]

@taegyunkim taegyunkim force-pushed the taegyunkim/prof-14423-bench branch from c218376 to fba16a0 Compare June 1, 2026 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant