fix(wall): break SIGPROF self-chain in signal handler install#319
fix(wall): break SIGPROF self-chain in signal handler install#319
Conversation
If SIGPROF is already pointing at HandleProfilerSignal when Install() runs the else-branch (installed_ == false), the captured "previous" handler is our own function. Storing that into old_handler_func_ turns the next signal delivery into infinite recursion via the chained call in HandleProfilerSignal — observed in the wild as ~430 stack frames of HandleProfilerSignal calling itself before the thread blows its stack. Refuse to chain to ourselves: if the captured previous disposition is &HandleProfilerSignal, store nullptr instead, so the handler fast-paths out via the existing `if (!old_handler) return;` check.
Overall package sizeSelf size: 2.01 MB Dependency sizes| name | version | self size | total size | |------|---------|-----------|------------| | source-map | 0.7.6 | 185.63 kB | 185.63 kB | | pprof-format | 2.2.1 | 163.06 kB | 163.06 kB | | node-gyp-build | 4.8.3 | 13.81 kB | 13.81 kB |🤖 This report was automatically generated by heaviest-objects-in-the-universe |
| // SIGPROF user restored its own saved disposition — which it had | ||
| // captured while we were the active handler — after we Restore()'d), | ||
| // calling prev would loop forever on the next signal. | ||
| old_handler_func_.store(prev == &HandleProfilerSignal ? nullptr : prev, |
There was a problem hiding this comment.
If old_handler_func_ is null then we won't be able to forward SIGPROF to v8 profiler and profiling won't work, maybe we should return an error in StartImpl (since this can only occur at profiler startup).
There was a problem hiding this comment.
Okay, this required some propagation of boolean flags, but you're right that it's the actual correct solution; thanks for catching it. The whole thing is weird – I saw it in a single crash report today, and I'm not sure how it can arise; working theory is that the customer somehow managed to run the profiler twice? I really don't know. Anyhow, let me know if this now makes sense. It's a bit more complicated now 😁
Storing nullptr in old_handler_func_ avoided the runaway recursion but left the wall profiler's SIGPROF handler unable to forward signals to v8's sampler, which silently produces empty profiles. Plumb the install result out through IncreaseUseCount and StartInternal so StartImpl can roll back the just-started v8 cpu profile and return a Result error to the JS caller. The condition can only arise on the very first install (installed_ == false branch), so StopCore's restart path cannot hit it.
| auto startProcessCpuTime = startProcessCpuTime_; | ||
|
|
||
| if (restart) { | ||
| profileId_ = StartInternal(); |
There was a problem hiding this comment.
We're capturing profileId_ in StartInternal() now so the rollback on line 914 has the right id to stop.
Summary
Fix infinite recursion in
SignalHandler::HandleProfilerSignalobserved under dd-trace-js + Node.js 20 on x86_64 Linux as ~430 stack frames ofHandleProfilerSignalcalling itself, with each frame paused right after the indirectcallq *%r14(i.e. the call toold_handler_func_).Root cause
SignalHandler::Install()else-branch captures whatever was the previous SIGPROF disposition intoold_handler_func_:installed_ = (sigaction(SIGPROF, &sa, &old_handler_) == 0); old_handler_func_.store(old_handler_.sa_sigaction, std::memory_order_relaxed);If, at that moment, SIGPROF was already pointing at
&HandleProfilerSignalwhileinstalled_ == false, we end up storing our own address intoold_handler_func_, and the next signal delivery loops forever via:The state
(installed_ == false, SIGPROF == &HandleProfilerSignal)requires a non-pprof party to write&HandleProfilerSignalback into SIGPROF after ourRestore()clearedinstalled_. Static analysis of single-instance pprof + v8 SIGPROF interaction (StartInternalalways runscpuProfiler_->Start— which is what may trigger V8'sSignalHandler::IncreaseSamplerCount→ V8.Install — beforepprof.IncreaseUseCount) shows that V8'sold_signal_handler_can never end up holding&HandleProfilerSignal, because V8.Install only runs when V8's process-wideclient_count_transitions 0→1, and at that point pprof has not yet installed in this lifecycle. So V8.Restore can never write us back into SIGPROF.The realistic trigger is therefore two copies of
@datadog/pprofloaded in the same process (the dlopen-key for native addons is the file path, so a bundled copy undernode_modules/dd-trace/node_modules/@datadog/pprofplus a top-level dep atnode_modules/@datadog/pprofare two distinct instances with independent static state). With that, the sequenceleaves A's
old_handler_func_pointing at A's ownHandleProfilerSignal. A non-pprof third-party SIGPROF user that captured pprof while pprof was active and later restores it produces the same shape.Fix
Refuse to chain to ourselves in the install path. If the captured previous disposition is
&HandleProfilerSignal, storenullptrinold_handler_func_so the handler's existingif (!old_handler) return;fast-path kicks in. The signal is dropped instead of recursing — same observable behavior the handler already has when SIGPROF was previously unhandled.This is a strict improvement: in the two-instance case, the other loaded copy still chains to us once via its own
old_handler_func_, but a single hop is harmless; only the self-loop runs away.Test plan
npm run build.@datadog/pprofload and verify no recursion).Jira: [PROF-14467]