perf(profiling): bench branch (do not merge) #18335
… allocs

The native stack collector represented every captured stack as a std::deque<Frame>. Each construction triggered a chunk-map allocation (_M_initialize_map) and subsequent pushes allocated fixed-size chunks. In a DOE gevent benchmark this combination dominated native heap-live-size for the unwind_greenlets path.

Switching FrameStack to std::vector<Frame> with reserve(max_frames) in the default ctor:
- Eliminates the chunk-map allocation per construction.
- Eliminates per-chunk allocations on push_back, since max_frames is the hard cap on stack depth.
- Keeps the cache-eviction-safety property documented above the class: FrameStack owns Frame values, so Frame::get cache eviction is unaffected.

Every existing call site uses only push_back, forward iteration, size(), clear(), and integer indexing. The single push_front in the asyncio task path becomes insert(begin(), ...). The inner loop runs at most ~max_frames times, so each O(n) insert leaves overall cost at O(max_frames^2), well within the prior deque cost for small max_frames.

PROF-14423
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
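As a rough illustration of the push_front replacement described in this commit, the sketch below models FrameStack as a plain std::vector alias. The Frame fields and the push_front and demo_stack helpers are hypothetical stand-ins, not the profiler's actual definitions.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the profiler's Frame; the real type differs.
struct Frame {
    std::string name;
    int line;
};

// One contiguous buffer, no chunk map: the shape the commit switches to.
using FrameStack = std::vector<Frame>;

// Stand-in for deque::push_front. Shifts every existing element by one,
// so each call is O(n) in the current stack depth.
inline void push_front(FrameStack& stack, Frame frame) {
    stack.insert(stack.begin(), std::move(frame));
}

// Builds a small stack using only the operations the commit message lists.
inline FrameStack demo_stack() {
    FrameStack stack;
    stack.push_back({"leaf", 10});
    stack.push_back({"caller", 20});
    push_front(stack, {"innermost", 5});
    return stack;
}
```

Calling push_front in a loop bounded by max_frames gives the quadratic bound mentioned above; for shallow stacks the shifted elements stay within a single contiguous buffer.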
unwind_frame previously constructed a fresh std::unordered_set<PyObject*> on every call to detect cycles in the frame chain. With one sampling iteration walking many stacks (Python thread stack, every greenlet, every coroutine in every asyncio task), this added up to measurable allocator churn in the sampling hot path.

EchionSampler now owns a single seen_frames scratch set. unwind_frame gains a 4-arg overload that takes the scratch by reference and clears it on entry; a 3-arg convenience wrapper keeps fuzz harnesses and other off-sampling-thread callers working unchanged. Only the single sampling thread touches the scratch, so no lock is required.

PROF-14423
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
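The scratch-set pattern can be sketched roughly as follows. FrameNode, SeenSet, and walk_chain are hypothetical names chosen for illustration; the real unwind_frame has a different signature and walks interpreter frames rather than this toy linked list.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Hypothetical frame-chain node; `back` points at the calling frame.
struct FrameNode {
    const FrameNode* back;
};

using SeenSet = std::unordered_set<const FrameNode*>;

// Overload taking caller-owned scratch. The set is cleared on entry, so one
// long-lived instance can serve every call made from the sampling thread.
inline std::size_t walk_chain(const FrameNode* head, std::size_t max_frames,
                              SeenSet& seen) {
    seen.clear();
    std::size_t count = 0;
    for (const FrameNode* f = head; f != nullptr && count < max_frames;
         f = f->back) {
        if (!seen.insert(f).second) {
            break;  // Frame already visited: the chain contains a cycle.
        }
        ++count;
    }
    return count;
}

// Convenience wrapper for callers without a long-lived scratch set.
inline std::size_t walk_chain(const FrameNode* head, std::size_t max_frames) {
    SeenSet local;
    return walk_chain(head, max_frames, local);
}
```

In common standard library implementations clear() keeps the bucket array, so the reused set avoids rebuilding it on every call; node allocations on insert may still occur, and the sketch makes no claim about the exact savings.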
… samples

unwind_greenlets previously allocated per leaf greenlet on every sample iteration: a new StackInfo (via make_unique), a fresh snapshots vector with per-greenlet parent_chain vectors, and per-call hash sets for parent tracking and cycle detection. In a DOE gevent benchmark this accounted for roughly half of the native heap-live-size of the entire process.

Changes:
- EchionSampler gains three sampling-thread-only scratch buffers (greenlet_snapshots, greenlet_parents, greenlet_visited). Only the single sampling thread touches these, so no lock is needed.
- ThreadInfo::current_greenlets becomes std::vector<StackInfo> (by value, not unique_ptr) with a greenlet_count_ cursor. Entries are kept alive between samples so the inner FrameStack capacity amortizes. unwind_greenlets uses a parallel snap_count cursor on the snapshots scratch so parent_chain vectors retain capacity too. The on-CPU swap is rewritten against current_greenlets[i] values.
- ThreadInfo::sample iterates current_greenlets[0..greenlet_count_) and resets greenlet_count_ on the early-return error path. Previously a string-table lookup miss left entries un-cleared until the next sample overwrote them; with the cursor pattern that would have leaked indefinitely on persistent lookup failure.

Adds a new test, test_gevent_unwind_greenlets_rss_stable, that runs 1000 idle greenlets with 30-frame stacks under aggressive (5 ms) sampling and asserts that post-warmup RSS grows by < 20 MB. Before the fix this exercise grew RSS by hundreds of MB, driven by per-sample StackInfo allocations.

PROF-14423
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
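A minimal sketch of the cursor pattern described in this commit, under the assumption that entries live by value in a vector and are reused across samples. GreenletSlots, begin_sample, and next_slot are hypothetical names, and StackInfo is reduced to a single vector of ints.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in: the real StackInfo holds a FrameStack and more.
struct StackInfo {
    std::vector<int> frames;
};

// Entries are stored by value and never destroyed between samples; a cursor
// marks how many of them belong to the current sample.
class GreenletSlots {
public:
    // Resetting the cursor is the only per-sample cleanup. It must also run
    // on early-return error paths, or stale entries would stay "live".
    void begin_sample() { count_ = 0; }

    // Hands out the next slot, growing the pool only when it is exhausted.
    // The returned reference is invalidated by a later call that grows the pool.
    StackInfo& next_slot() {
        if (count_ == slots_.size()) {
            slots_.emplace_back();
        }
        StackInfo& slot = slots_[count_++];
        slot.frames.clear();  // Drops the elements but keeps the capacity.
        return slot;
    }

    std::size_t live() const { return count_; }
    std::size_t allocated() const { return slots_.size(); }
    const StackInfo& at(std::size_t i) const { return slots_[i]; }

private:
    std::vector<StackInfo> slots_;
    std::size_t count_ = 0;
};
```

After a warm-up sample, later samples of the same or smaller size reuse both the slots and each slot's inner buffer, which is the amortization the commit relies on.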
reserve(max_frames) committed max_frames * sizeof(Frame) = ~96 KB per FrameStack on construction (max_frames=2048, sizeof(Frame)=48 B). In the current vector<unique_ptr<StackInfo>> ownership model, current_greenlets and current_tasks are cleared every sample, so every iteration would create fresh FrameStacks and pay that reservation cost N times. That is much worse than deque's ~3.5 KB typical per-stack footprint.

Defaulting to no reserve lets the vector start at zero capacity and grow geometrically as push_back fills it. Total bytes per stack end up close to deque's footprint (~3 KB for a 50-frame stack), still in a single contiguous buffer.

The reserve only becomes worthwhile when StackInfo entries are reused across samples so the reservation is amortized; that lives in the subsequent buffer-reuse refactor.

PROF-14423
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
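The footprint arithmetic in this commit can be checked with a small sketch. The Frame below is only a 48-byte placeholder matching the size quoted above, and kMaxFrames, kReservedBytes, and grown_bytes are hypothetical names; the growth factor of std::vector is implementation-defined, so the grown size is computed rather than assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Placeholder with the 48-byte size quoted in the commit message.
struct Frame {
    unsigned char bytes[48];
};

constexpr std::size_t kMaxFrames = 2048;

// Upfront reservation: 2048 * 48 B = 98304 B, i.e. 96 KiB per stack.
constexpr std::size_t kReservedBytes = kMaxFrames * sizeof(Frame);

// Bytes held by a vector that starts empty and grows through push_back.
// The result depends on the library's growth policy.
inline std::size_t grown_bytes(std::size_t depth) {
    std::vector<Frame> stack;
    for (std::size_t i = 0; i < depth; ++i) {
        stack.push_back(Frame{});
    }
    return stack.capacity() * sizeof(Frame);
}
```

With a doubling policy, a 50-frame stack settles at a capacity of 64 frames, or 3072 B, in line with the ~3 KB figure above; other growth factors land nearby but not necessarily on the same number.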
Benchmarks
Benchmark execution time: 2026-05-29 14:03:49
Comparing candidate commit c218376 in PR branch
Found 0 performance improvements and 6 performance regressions! Performance is the same for 611 metrics, 10 unstable metrics.
scenario:httppropagationinject-ids_only
scenario:iast_aspects-re_match_noaspect
scenario:iastaspects-index_aspect
scenario:iastaspects-ljust_aspect
scenario:iastaspects-title_aspect
scenario:iastaspectsospath-ospathbasename_aspect
c218376 to fba16a0
Description
Synthesis branch for benchmarking the combined effect of the three PROF-14423 split PRs.
Not for merge. This branch exists so the DOE benchmark can measure end-to-end heap-live-size reduction before any of the constituent PRs land.
Combined contents (all rebased onto bb9eea91511d2b71df275a29755707324e552caa):
- framestack-vector: FrameStack is std::vector<Frame> instead of std::deque<Frame>, with no upfront reserve. Includes the follow-up that drops front insertion in the asyncio task path.
- unwind-frame-scratch: a single seen_frames cycle-detection set lives on EchionSampler, reused across every unwind_frame call.
- unwind-greenlets-buffer-reuse: current_greenlets becomes vector<StackInfo> with a cursor; snapshots / parents / visited scratch lives on EchionSampler; early-return leak fix; RSS regression test.
Testing
The constituent split PRs have their own test runs; this branch is for benchmarking only.
Risks
Bench-only — must not merge.
Additional Notes
Replaces the prior bundled bench branch unwind-greenlets-memory (#18292); that branch has been deleted upstream.