fix(wall): break SIGPROF self-chain in signal handler install (#319)

Open
szegedi wants to merge 2 commits into main from fix/signal-handler-self-chain

Conversation


szegedi commented May 4, 2026

Summary

Fix infinite recursion in SignalHandler::HandleProfilerSignal, observed under dd-trace-js with Node.js 20 on x86_64 Linux as ~430 stack frames of HandleProfilerSignal calling itself, each frame paused just after the indirect `callq *%r14` (the call through old_handler_func_).

Root cause

The else-branch of SignalHandler::Install() captures whatever the previous SIGPROF disposition was into old_handler_func_:

```cpp
installed_ = (sigaction(SIGPROF, &sa, &old_handler_) == 0);
old_handler_func_.store(old_handler_.sa_sigaction, std::memory_order_relaxed);
```

If, at that moment, SIGPROF was already pointing at &HandleProfilerSignal while installed_ == false, we end up storing our own address into old_handler_func_, and the next signal delivery loops forever via:

```cpp
auto old_handler = old_handler_func_.load(...);
if (!old_handler) return;
...
old_handler(sig, info, context);  // recurses into HandleProfilerSignal
```

The state (installed_ == false, SIGPROF == &HandleProfilerSignal) requires a non-pprof party to write &HandleProfilerSignal back into SIGPROF after our Restore() cleared installed_. Static analysis of the single-instance pprof + V8 SIGPROF interaction rules V8 out: StartInternal always runs cpuProfiler_->Start (which is what may trigger V8's SignalHandler::IncreaseSamplerCount → V8.Install) before pprof.IncreaseUseCount, and V8.Install only runs when V8's process-wide client_count_ transitions 0→1, at which point pprof has not yet installed in this lifecycle. V8's old_signal_handler_ can therefore never hold &HandleProfilerSignal, so V8.Restore can never write us back into SIGPROF.

The realistic trigger is therefore two copies of @datadog/pprof loaded in the same process (the dlopen key for native addons is the file path, so a bundled copy under node_modules/dd-trace/node_modules/@datadog/pprof plus a top-level dependency at node_modules/@datadog/pprof are two distinct instances with independent static state). With that, the sequence

```
A.Install      ; SIGPROF=A, installed_A=true,  pp_old_A=DEFAULT
B.Install      ; SIGPROF=B, installed_B=true,  pp_old_B=A
A.Restore      ; SIGPROF=DEFAULT, installed_A=false
B.Restore      ; SIGPROF=A,       installed_B=false   ← B writes A back
A.Install (2)  ; captures A as its own old_handler        ← self-loop
```

leaves A's old_handler_func_ pointing at A's own HandleProfilerSignal. A non-pprof third-party SIGPROF user that captured our handler while pprof was active and later restores it produces the same shape.

Fix

Refuse to chain to ourselves in the install path. If the captured previous disposition is &HandleProfilerSignal, store nullptr in old_handler_func_ so the handler's existing if (!old_handler) return; fast-path kicks in. The signal is dropped instead of recursing — same observable behavior the handler already has when SIGPROF was previously unhandled.

This is a strict improvement: in the two-instance case, the other loaded copy still chains to us once via its own old_handler_func_, but a single hop is harmless; only the self-loop runs away.

Test plan

  • Builds cleanly with npm run build.
  • Existing test suite still passes.
  • Manual reproduction (multi-instance @datadog/pprof load and verify no recursion).

Jira: [PROF-14467]

If SIGPROF is already pointing at HandleProfilerSignal when Install()
runs the else-branch (installed_ == false), the captured "previous"
handler is our own function. Storing that into old_handler_func_ turns
the next signal delivery into infinite recursion via the chained call
in HandleProfilerSignal — observed in the wild as ~430 stack frames of
HandleProfilerSignal calling itself before the thread blows its stack.

Refuse to chain to ourselves: if the captured previous disposition is
&HandleProfilerSignal, store nullptr instead, so the handler fast-paths
out via the existing `if (!old_handler) return;` check.

github-actions Bot commented May 4, 2026

Overall package size

Self size: 2.01 MB
Deduped: 2.38 MB
No deduping: 2.38 MB

Dependency sizes

| name | version | self size | total size |
|------|---------|-----------|------------|
| source-map | 0.7.6 | 185.63 kB | 185.63 kB |
| pprof-format | 2.2.1 | 163.06 kB | 163.06 kB |
| node-gyp-build | 4.8.3 | 13.81 kB | 13.81 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe

Review thread on bindings/profilers/wall.cc (outdated)

```cpp
// SIGPROF user restored its own saved disposition — which it had
// captured while we were the active handler — after we Restore()'d),
// calling prev would loop forever on the next signal.
old_handler_func_.store(prev == &HandleProfilerSignal ? nullptr : prev,
```
A reviewer commented:

If old_handler_func_ is null then we won't be able to forward SIGPROF to the v8 profiler and profiling won't work. Maybe we should return an error in StartImpl instead (since this can only occur at profiler startup).

szegedi (Author)

Okay, this required some propagation of boolean flags, but you're right that it's the actual correct solution; thanks for catching it. The whole thing is weird – I saw it in a single crash report today, and I'm not sure how it can arise; working theory is that the customer somehow managed to run the profiler twice? I really don't know. Anyhow, let me know if this now makes sense. It's a bit more complicated now 😁

Storing nullptr in old_handler_func_ avoided the runaway recursion but
left the wall profiler's SIGPROF handler unable to forward signals to
v8's sampler, which silently produces empty profiles. Plumb the install
result out through IncreaseUseCount and StartInternal so StartImpl can
roll back the just-started v8 cpu profile and return a Result error to
the JS caller. The condition can only arise on the very first install
(installed_ == false branch), so StopCore's restart path cannot hit it.
```cpp
auto startProcessCpuTime = startProcessCpuTime_;

if (restart) {
  profileId_ = StartInternal();
```
szegedi (Author)

We're capturing profileId_ in StartInternal() now so the rollback on line 914 has the right id to stop.

@szegedi szegedi added the semver-patch Bug or security fixes, mainly label May 4, 2026