feat(llmobs): flush writers on SIGTERM with 5s shutdown timeout #17130
ZStriker19 wants to merge 4 commits into 4.6
Conversation
Registers LLMObs.disable with register_on_exit_signal so SIGTERM (e.g. Kubernetes pod termination) triggers a graceful flush of buffered spans and eval metrics before exit. Also joins writer threads with a 5-second default timeout, matching the tracer's SHUTDOWN_TIMEOUT convention.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
class LLMObs(Service):
    _instance = None  # type: LLMObs
    enabled = False
    SHUTDOWN_TIMEOUT = 5
Can we make both this and the SHUTDOWN_TIMEOUT in Span_processor configurable?
I'll discuss with the team
ddtrace/llmobs/_llmobs.py
Outdated
        )

        atexit.register(cls.disable)
        atexit.register_on_exit_signal(cls.disable)
This immediately disables LLMObs on a SIGTERM signal. We have a use case where an application pod continues running for a graceful termination period of a few minutes after receiving SIGTERM. During that period it needs to be able to continue writing LLM Obs traces. This is a generally valid use case.
For that, instead of immediately disabling LLMObs on SIGTERM, it should continue writing for a configurable period of time (configured with something like DD_LLMOBS_GRACEFUL_TERMINATION_PERIOD). So instead of registering cls.disable here, a new classmethod can be registered that waits the configured time period before calling disable().
…IGTERM shutdown
Adds _on_exit_signal classmethod that checks config._llmobs_graceful_termination_period (DD_LLMOBS_GRACEFUL_TERMINATION_PERIOD, default 0). If zero, disables immediately on SIGTERM. If non-zero, starts a threading.Timer so the app can continue writing spans during a Kubernetes graceful termination window before flushing and stopping.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…MEOUT hardcoded
The join timeout is an implementation detail (thread cleanup after a synchronous flush), not a meaningful user-facing knob. Keep it as a hardcoded 5s constant.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
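For readers following the thread, the behaviour described in the _on_exit_signal commit above (disable immediately when the period is zero, otherwise defer with a threading.Timer) can be sketched with the standard library alone. This is not the code from the PR: `schedule_disable`, its parameters, and the choice to make the timer a daemon thread are assumptions made for illustration.

```python
import threading
from typing import Callable, Optional


def schedule_disable(
    disable: Callable[[], None], grace_period_s: float
) -> Optional[threading.Timer]:
    """Run `disable` now if the grace period is zero, otherwise after the delay.

    Hypothetical helper: returns the started timer so the caller can cancel or
    join it, or None when `disable` was invoked synchronously.
    """
    if grace_period_s <= 0:
        disable()
        return None
    timer = threading.Timer(grace_period_s, disable)
    # Assumption of this sketch: a pending timer should not keep the
    # interpreter alive on its own, so it is marked as a daemon thread.
    timer.daemon = True
    timer.start()
    return timer
```

With a period of 0 the callback runs on the calling thread, which matches the "disables immediately" default described in the commit message; any positive value leaves the application free to keep writing spans until the timer fires.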
Performance SLOs
Comparing candidate zach.groves/llmobs-graceful-sigterm-shutdown (4e3b542) with baseline 4.6 (bae3a8a)

📈 Performance Regressions (2 suites)

📈 iastaspects - 118/118
add_aspect: Time 105.736µs (SLO: <130.000µs, -18.7%), vs baseline: +4.7%; Memory 43.657MB (SLO: <46.000MB, -5.1%), vs baseline: +4.9%
add_inplace_aspect: Time 102.586µs (SLO: <130.000µs, -21.1%), vs baseline: +0.4%; Memory 43.679MB (SLO: <46.000MB, -5.0%), vs baseline: +4.8%
add_inplace_noaspect: Time 28.014µs (SLO: <40.000µs, -30.0%), vs baseline: ~same; Memory 43.615MB (SLO: <46.000MB, -5.2%), vs baseline: +4.7%
add_noaspect: Time 49.249µs (SLO: <70.000µs, -29.6%), vs baseline: +0.7%; Memory 43.684MB (SLO: <46.000MB, -5.0%), vs baseline: +5.0%
bytearray_aspect: Time 250.364µs (SLO: <400.000µs, -37.4%), vs baseline: -0.7%; Memory 43.704MB (SLO: <46.000MB, -5.0%), vs baseline: +5.0%
bytearray_extend_aspect: Time 648.645µs (SLO: <800.000µs, -18.9%), vs baseline: ~same; Memory 43.652MB (SLO: <46.000MB, -5.1%), vs baseline: +4.9%
bytearray_extend_noaspect: Time 267.444µs (SLO: <400.000µs, -33.1%), vs baseline: -1.5%; Memory 43.663MB (SLO: <46.000MB, -5.1%), vs baseline: +4.9%
bytearray_noaspect: Time 138.911µs (SLO: <300.000µs, -53.7%), vs baseline: -0.9%; Memory 43.641MB (SLO: <46.000MB, -5.1%), vs baseline: +4.8%
bytes_aspect: Time 218.687µs (SLO: <300.000µs, -27.1%), vs baseline: +0.1%; Memory 43.634MB (SLO: <46.000MB, -5.1%), vs baseline: +4.7%
bytes_noaspect: Time 134.991µs (SLO: <200.000µs, -32.5%), vs baseline: +0.3%; Memory 43.698MB (SLO: <46.000MB, -5.0%), vs baseline: +4.9%
bytesio_aspect: Time 3.775ms (SLO: <5.000ms, -24.5%), vs baseline: -0.7%; Memory 43.700MB (SLO: <46.000MB, -5.0%), vs baseline: +5.1%
bytesio_noaspect: Time 313.825µs (SLO: <420.000µs, -25.3%), vs baseline: +0.3%; Memory 43.676MB (SLO: <46.000MB, -5.1%), vs baseline: +5.0%
capitalize_aspect: Time 89.363µs (SLO: <300.000µs, -70.2%), vs baseline: +1.4%; Memory 43.698MB (SLO: <46.000MB, -5.0%), vs baseline: +5.0%
capitalize_noaspect: Time 254.561µs (SLO: <300.000µs, -15.1%), vs baseline: +3.1%; Memory 43.686MB (SLO: <46.000MB, -5.0%), vs baseline: +4.8%
casefold_aspect: Time 89.176µs (SLO: <500.000µs, -82.2%), vs baseline: +0.4%; Memory 43.722MB (SLO: <46.000MB, -5.0%), vs baseline: +5.0%
casefold_noaspect: Time 302.401µs (SLO: <500.000µs, -39.5%), vs baseline: -0.2%; Memory 43.675MB (SLO: <46.000MB, -5.1%), vs baseline: +4.9%
decode_aspect: Time 86.545µs (SLO: <100.000µs, -13.5%), vs baseline: -0.6%; Memory 43.616MB (SLO: <46.000MB, -5.2%), vs baseline: +4.8%
decode_noaspect: Time 153.794µs (SLO: <210.000µs, -26.8%), vs baseline: +1.4%; Memory 43.719MB (SLO: <46.000MB, -5.0%), vs baseline: +4.8%
encode_aspect: Time 84.216µs (SLO: <200.000µs, -57.9%), vs baseline: -0.5%; Memory 43.726MB (SLO: <46.000MB, -4.9%), vs baseline: +5.0%
encode_noaspect: Time 141.630µs (SLO: <200.000µs, -29.2%), vs baseline: +1.4%; Memory 43.630MB (SLO: <46.000MB, -5.2%), vs baseline: +4.8%
format_aspect: Time 14.588ms (SLO: <19.200ms, -24.0%), vs baseline: +0.2%; Memory 43.776MB (SLO: <46.000MB, -4.8%), vs baseline: +5.0%
format_map_aspect: Time 16.339ms (SLO: <21.500ms, -24.0%), vs baseline: -0.5%; Memory 43.795MB (SLO: <46.000MB, -4.8%), vs baseline: +5.0%
format_map_noaspect: Time 369.508µs (SLO: <500.000µs, -26.1%), vs baseline: -0.2%; Memory 43.615MB (SLO: <46.000MB, -5.2%), vs baseline: +4.7%
format_noaspect: Time 311.852µs (SLO: <500.000µs, -37.6%), vs baseline: +0.8%; Memory 43.722MB (SLO: <46.000MB, -5.0%), vs baseline: +5.0%
index_aspect: Time 127.223µs (SLO: <300.000µs, -57.6%), vs baseline: +3.4%; Memory 43.675MB (SLO: <46.000MB, -5.1%), vs baseline: +5.0%
index_noaspect: Time 40.317µs (SLO: <300.000µs, -86.6%), vs baseline: +0.3%; Memory 43.678MB (SLO: <46.000MB, -5.0%), vs baseline: +5.0%
join_aspect: Time 212.266µs (SLO: <300.000µs, -29.2%), vs baseline: +0.9%; Memory 43.635MB (SLO: <46.000MB, -5.1%), vs baseline: +4.8%
join_noaspect: Time 143.778µs (SLO: <300.000µs, -52.1%), vs baseline: +1.0%; Memory 43.704MB (SLO: <46.000MB, -5.0%), vs baseline: +4.8%
ljust_aspect: Time 566.364µs (SLO: <700.000µs, -19.1%), vs baseline: 📈 +15.6%; Memory 43.658MB (SLO: <46.000MB, -5.1%), vs baseline: +4.9%
ljust_noaspect: Time 254.255µs (SLO: <300.000µs, -15.2%), vs baseline: -0.2%; Memory 43.721MB (SLO: <46.000MB, -5.0%), vs baseline: +5.1%
lower_aspect: Time 293.838µs (SLO: <500.000µs, -41.2%), vs baseline: +0.5%; Memory 43.614MB (SLO: <46.000MB, -5.2%), vs baseline: +4.8%
lower_noaspect: Time 231.443µs (SLO: <300.000µs, -22.9%), vs baseline: +0.6%; Memory 43.651MB (SLO: <46.000MB, -5.1%), vs baseline: +4.9%
lstrip_aspect: Time 0.272ms (SLO: <3.000ms, -90.9%), vs baseline: +0.4%; Memory 43.658MB (SLO: <46.000MB, -5.1%), vs baseline: +4.6%
lstrip_noaspect: Time 0.175ms (SLO: <3.000ms, -94.2%), vs baseline: -0.1%; Memory 43.655MB (SLO: <46.000MB, -5.1%), vs baseline: +4.8%
modulo_aspect: Time 14.313ms (SLO: <18.750ms, -23.7%), vs baseline: +0.5%; Memory 43.655MB (SLO: <46.000MB, -5.1%), vs baseline: +4.6%
modulo_aspect_for_bytearray_bytearray: Time 14.787ms (SLO: <19.350ms, -23.6%), vs baseline: ~same; Memory 43.677MB (SLO: <46.000MB, -5.1%), vs baseline: +4.7%
modulo_aspect_for_bytes: Time 14.409ms (SLO: <18.900ms, -23.8%), vs baseline: +0.3%; Memory 43.690MB (SLO: <46.000MB, -5.0%), vs baseline: +4.8%
modulo_aspect_for_bytes_bytearray: Time 14.615ms (SLO: <19.150ms, -23.7%), vs baseline: +0.3%; Memory 43.745MB (SLO: <46.000MB, -4.9%), vs baseline: +5.1%
modulo_noaspect: Time 0.363ms (SLO: <3.000ms, -87.9%), vs baseline: +0.9%; Memory 43.766MB (SLO: <46.000MB, -4.9%), vs baseline: +5.0%
replace_aspect: Time 18.352ms (SLO: <24.000ms, -23.5%), vs baseline: ~same; Memory 43.652MB (SLO: <46.000MB, -5.1%), vs baseline: +4.7%
replace_noaspect: Time 282.079µs (SLO: <300.000µs, -6.0%), vs baseline: ~same; Memory 43.702MB (SLO: <46.000MB, -5.0%), vs baseline: +5.0%
repr_aspect: Time 317.541µs (SLO: <420.000µs, -24.4%), vs baseline: +0.6%; Memory 43.657MB (SLO: <46.000MB, -5.1%), vs baseline: +4.7%
repr_noaspect: Time 46.678µs (SLO: <90.000µs, -48.1%), vs baseline: +0.3%; Memory 43.706MB (SLO: <46.000MB, -5.0%), vs baseline: +4.9%
rstrip_aspect: Time 382.745µs (SLO: <500.000µs, -23.5%), vs baseline: -0.4%; Memory 43.682MB (SLO: <46.000MB, -5.0%), vs baseline: +4.9%
rstrip_noaspect: Time 184.650µs (SLO: <300.000µs, -38.4%), vs baseline: +1.1%; Memory 43.674MB (SLO: <46.000MB, -5.1%), vs baseline: +4.9%
slice_aspect: Time 180.928µs (SLO: <300.000µs, -39.7%), vs baseline: -0.4%; Memory 43.617MB (SLO: <46.000MB, -5.2%), vs baseline: +4.7%
slice_noaspect: Time 54.696µs (SLO: <90.000µs, -39.2%), vs baseline: +1.2%; Memory 43.816MB (SLO: <46.000MB, -4.7%), vs baseline: +5.2%
stringio_aspect: Time 4.405ms (SLO: <5.000ms, -11.9%), vs baseline: 📈 +15.7%; Memory 43.720MB (SLO: <46.000MB, -5.0%), vs baseline: +5.0%
stringio_noaspect: Time 343.686µs (SLO: <500.000µs, -31.3%), vs baseline: -0.8%; Memory 43.648MB (SLO: <46.000MB, -5.1%), vs baseline: +4.7%
strip_aspect: Time 270.079µs (SLO: <350.000µs, -22.8%), vs baseline: -0.7%; Memory 43.633MB (SLO: <46.000MB, -5.1%), vs baseline: +4.7%
strip_noaspect: Time 174.549µs (SLO: <240.000µs, -27.3%), vs baseline: -1.3%; Memory 43.655MB (SLO: <46.000MB, -5.1%), vs baseline: +5.0%
swapcase_aspect: Time 327.540µs (SLO: <500.000µs, -34.5%), vs baseline: +0.8%; Memory 43.698MB (SLO: <46.000MB, -5.0%), vs baseline: +5.2%
swapcase_noaspect: Time 269.918µs (SLO: <400.000µs, -32.5%), vs baseline: +1.7%; Memory 43.655MB (SLO: <46.000MB, -5.1%), vs baseline: +4.9%
title_aspect: Time 316.329µs (SLO: <500.000µs, -36.7%), vs baseline: -0.2%; Memory 43.635MB (SLO: <46.000MB, -5.1%), vs baseline: +4.6%
title_noaspect: Time 256.512µs (SLO: <400.000µs, -35.9%), vs baseline: +0.8%; Memory 43.698MB (SLO: <46.000MB, -5.0%), vs baseline: +4.9%
translate_aspect: Time 487.304µs (SLO: <700.000µs, -30.4%), vs baseline: -0.9%; Memory 43.609MB (SLO: <46.000MB, -5.2%), vs baseline: +4.8%
translate_noaspect: Time 425.398µs (SLO: <500.000µs, -14.9%), vs baseline: -0.2%; Memory 43.734MB (SLO: <46.000MB, -4.9%), vs baseline: +5.0%
upper_aspect: Time 295.418µs (SLO: <500.000µs, -40.9%), vs baseline: +1.1%; Memory 43.613MB (SLO: <46.000MB, -5.2%), vs baseline: +4.7%
upper_noaspect: Time 230.668µs (SLO: <400.000µs, -42.3%), vs baseline: -0.5%; Memory 43.747MB (SLO: <46.000MB, -4.9%), vs baseline: +5.1%

📈 iastaspectsospath - 24/24
ospathbasename_aspect: Time 521.050µs (SLO: <700.000µs, -25.6%), vs baseline: 📈 +23.1%; Memory 43.515MB (SLO: <46.000MB, -5.4%), vs baseline: +4.7%
ospathbasename_noaspect: Time 432.748µs (SLO: <700.000µs, -38.2%), vs baseline: -0.6%; Memory 43.663MB (SLO: <46.000MB, -5.1%), vs baseline: +5.1%
ospathjoin_aspect: Time 622.069µs (SLO: <700.000µs, -11.1%), vs baseline: +0.3%; Memory 43.630MB (SLO: <46.000MB, -5.2%), vs baseline: +4.9%
ospathjoin_noaspect: Time 631.047µs (SLO: <700.000µs, -9.9%), vs baseline: -0.1%; Memory 43.674MB (SLO: <46.000MB, -5.1%), vs baseline: +5.0%
ospathnormcase_aspect: Time 346.471µs (SLO: <700.000µs, -50.5%), vs baseline: -1.0%; Memory 43.622MB (SLO: <46.000MB, -5.2%), vs baseline: +5.0%
ospathnormcase_noaspect: Time 357.183µs (SLO: <700.000µs, -49.0%), vs baseline: +1.2%; Memory 43.579MB (SLO: <46.000MB, -5.3%), vs baseline: +4.7%
ospathsplit_aspect: Time 482.624µs (SLO: <700.000µs, -31.1%), vs baseline: -0.7%; Memory 43.623MB (SLO: <46.000MB, -5.2%), vs baseline: +4.9%
ospathsplit_noaspect: Time 499.713µs (SLO: <700.000µs, -28.6%), vs baseline: ~same; Memory 43.629MB (SLO: <46.000MB, -5.2%), vs baseline: +5.1%
ospathsplitdrive_aspect: Time 374.650µs (SLO: <700.000µs, -46.5%), vs baseline: -0.3%; Memory 43.610MB (SLO: <46.000MB, -5.2%), vs baseline: +5.0%
ospathsplitdrive_noaspect: Time 72.643µs (SLO: <700.000µs, -89.6%), vs baseline: +1.1%; Memory 43.633MB (SLO: <46.000MB, -5.1%), vs baseline: +5.1%
ospathsplitext_aspect: Time 456.828µs (SLO: <700.000µs, -34.7%), vs baseline: +0.3%; Memory 43.537MB (SLO: <46.000MB, -5.4%), vs baseline: +4.6%
ospathsplitext_noaspect: Time 462.248µs (SLO: <700.000µs, -34.0%), vs baseline: -0.4%; Memory 43.533MB (SLO: <46.000MB, -5.4%), vs baseline: +4.7%

🟡 Near SLO Breach (2 suites)

🟡 flasksimple - 18/18
appsec-get: Time 3.447ms (SLO: <4.750ms, -27.4%), vs baseline: ~same; Memory 55.458MB (SLO: <66.500MB, -16.6%), vs baseline: +4.9%
appsec-post: Time 2.909ms (SLO: <6.750ms, -56.9%), vs baseline: ~same; Memory 55.772MB (SLO: <66.500MB, -16.1%), vs baseline: +4.8%
appsec-telemetry: Time 3.448ms (SLO: <4.750ms, -27.4%), vs baseline: +0.8%; Memory 55.452MB (SLO: <66.500MB, -16.6%), vs baseline: +4.8%
debugger: Time 1.877ms (SLO: <2.000ms, -6.2%), vs baseline: ~same; Memory 49.031MB (SLO: <51.500MB, -4.8%), vs baseline: +4.8%
iast-get: Time 1.869ms (SLO: <2.000ms, -6.6%), vs baseline: +0.1%; Memory 46.003MB (SLO: <49.000MB, -6.1%), vs baseline: +4.7%
profiler: Time 1.913ms (SLO: <2.100ms, -8.9%), vs baseline: +0.2%; Memory 52.135MB (SLO: <52.500MB, 🟡 -0.7%), vs baseline: +4.9%
resource-renaming: Time 3.422ms (SLO: <3.650ms, -6.3%), vs baseline: +0.2%; Memory 55.510MB (SLO: <60.000MB, -7.5%), vs baseline: +5.0%
tracer: Time 3.425ms (SLO: <3.650ms, -6.2%), vs baseline: ~same; Memory 55.438MB (SLO: <60.000MB, -7.6%), vs baseline: +4.9%
tracer-native: Time 3.408ms (SLO: <3.650ms, -6.6%), vs baseline: -0.3%; Memory 55.497MB (SLO: <60.000MB, -7.5%), vs baseline: +5.0%

🟡 tracer - 6/6
large: Time 31.399ms (SLO: <32.950ms, -4.7%), vs baseline: -0.2%; Memory 37.552MB (SLO: <39.250MB, -4.3%), vs baseline: +4.9%
medium: Time 3.098ms (SLO: <3.200ms, -3.2%), vs baseline: -0.5%; Memory 36.333MB (SLO: <38.750MB, -6.2%), vs baseline: +5.0%
small: Time 363.019µs (SLO: <370.000µs, 🟡 -1.9%), vs baseline: +3.7%; Memory 36.294MB (SLO: <38.750MB, -6.3%), vs baseline: +4.9%
Superseded by #17174 which targets main instead of 4.6.
Summary
Registers LLMObs.disable with register_on_exit_signal so SIGTERM (e.g. Kubernetes pod termination during rolling deployments) triggers graceful flushing of buffered LLMObs spans and eval metrics before process exit. Previously these were silently dropped.
Adds LLMObs.SHUTDOWN_TIMEOUT = 5 (matching the tracer and data streams processor convention) and joins writer threads with this timeout during _stop_service.
Closes #17125
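As a rough illustration of the bounded join mentioned in the summary, the snippet below waits for a worker thread for at most SHUTDOWN_TIMEOUT seconds and reports whether it finished. It is a stdlib-only sketch, not the _stop_service code from this PR; `stop_writer` and the stop-event convention are hypothetical.

```python
import threading

SHUTDOWN_TIMEOUT = 5  # seconds; mirrors the constant named in the summary


def stop_writer(
    thread: threading.Thread,
    stop_event: threading.Event,
    timeout: float = SHUTDOWN_TIMEOUT,
) -> bool:
    """Ask a worker to stop and wait at most `timeout` seconds for it.

    Hypothetical helper: returns True if the thread exited in time and
    False if it is still running when the timeout elapses.
    """
    stop_event.set()       # assumed convention: the worker polls this event
    thread.join(timeout)   # join() returns after `timeout` even if the thread is alive
    return not thread.is_alive()
```

The point of the timeout is that shutdown stays bounded: a writer stuck on a slow network call delays process exit by at most the timeout instead of blocking it indefinitely.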
Root cause
LLMObs.enable() only registered disable with standard atexit, which Python does not invoke on SIGTERM (the OS terminates the process before the interpreter shutdown hook runs). The tracer uses both atexit.register and atexit.register_on_exit_signal; LLMObs now does the same.
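To make the root cause concrete: with only an atexit hook, an unhandled SIGTERM kills the process without running Python-level cleanup, so a flush has to be attached to the signal itself. The sketch below shows one way to do that with the standard signal module. It does not reproduce how ddtrace's register_on_exit_signal is implemented (that is not shown in this PR); `make_exit_handler` and `install_flush_on_sigterm` are hypothetical names.

```python
import os
import signal
from typing import Callable


def make_exit_handler(flush: Callable[[], None], previous=signal.SIG_DFL):
    """Build a signal handler that flushes, then defers to the previous handler."""

    def handler(signum, frame):
        flush()
        if callable(previous):
            # Chain to whatever Python-level handler was installed before.
            previous(signum, frame)
        elif previous == signal.SIG_DFL:
            # Restore the default action and re-send the signal so the
            # process still terminates the way the sender expects.
            signal.signal(signum, signal.SIG_DFL)
            os.kill(os.getpid(), signum)
        # SIG_IGN (or an unknown previous handler): nothing more to do.

    return handler


def install_flush_on_sigterm(flush: Callable[[], None]) -> None:
    """Install the handler for SIGTERM; must be called from the main thread."""
    previous = signal.getsignal(signal.SIGTERM)
    signal.signal(signal.SIGTERM, make_exit_handler(flush, previous))
```

Keeping the plain atexit registration alongside a signal hook, as the PR does, covers both normal interpreter exit and termination by signal.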