feat: Stable Session ID headers for telemetry#8352
feat: Stable Session ID headers for telemetry#8352khanayan123 wants to merge 9 commits intomasterfrom
Conversation
BenchmarksBenchmark execution time: 2026-03-31 02:34:00 Comparing candidate commit 9eff4f3 in PR branch Found 7 performance improvements and 7 performance regressions! Performance is the same for 257 metrics, 17 unstable metrics.
|
Execution-Time Benchmarks Report ⏱️Execution-time results for samples comparing This PR (8352) and master. ✅ No regressions detected - check the details below Full Metrics ComparisonFakeDbCommand
HttpMessageHandler
Comparison explanationExecution-time benchmarks measure the whole time it takes to execute a program, and are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are highlighted in **red**. The following thresholds were used for comparing the execution times:
Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard. Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph). Duration chartsFakeDbCommand (.NET Framework 4.8)gantt
title Execution time (ms) FakeDbCommand (.NET Framework 4.8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8352) - mean (72ms) : 70, 74
master - mean (72ms) : 70, 75
section Bailout
This PR (8352) - mean (76ms) : 74, 78
master - mean (76ms) : 74, 77
section CallTarget+Inlining+NGEN
This PR (8352) - mean (1,081ms) : 1033, 1128
master - mean (1,072ms) : 1031, 1114
FakeDbCommand (.NET Core 3.1)gantt
title Execution time (ms) FakeDbCommand (.NET Core 3.1)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8352) - mean (113ms) : 109, 116
master - mean (115ms) : 103, 127
section Bailout
This PR (8352) - mean (113ms) : 110, 116
master - mean (114ms) : 110, 118
section CallTarget+Inlining+NGEN
This PR (8352) - mean (796ms) : 776, 816
master - mean (792ms) : 774, 811
FakeDbCommand (.NET 6)gantt
title Execution time (ms) FakeDbCommand (.NET 6)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8352) - mean (98ms) : 96, 101
master - mean (98ms) : 95, 101
section Bailout
This PR (8352) - mean (100ms) : 98, 102
master - mean (100ms) : 98, 103
section CallTarget+Inlining+NGEN
This PR (8352) - mean (952ms) : 906, 999
master - mean (948ms) : 916, 979
FakeDbCommand (.NET 8)gantt
title Execution time (ms) FakeDbCommand (.NET 8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8352) - mean (99ms) : 95, 103
master - mean (99ms) : 96, 102
section Bailout
This PR (8352) - mean (100ms) : 98, 103
master - mean (100ms) : 98, 102
section CallTarget+Inlining+NGEN
This PR (8352) - mean (834ms) : 800, 867
master - mean (832ms) : 788, 875
HttpMessageHandler (.NET Framework 4.8)gantt
title Execution time (ms) HttpMessageHandler (.NET Framework 4.8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8352) - mean (189ms) : 185, 193
master - mean (212ms) : 207, 218
section Bailout
This PR (8352) - mean (192ms) : 190, 194
master - mean (217ms) : 211, 222
section CallTarget+Inlining+NGEN
This PR (8352) - mean (1,156ms) : 1093, 1219
master - mean (1,250ms) : 1193, 1306
HttpMessageHandler (.NET Core 3.1)gantt
title Execution time (ms) HttpMessageHandler (.NET Core 3.1)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8352) - mean (272ms) : 266, 277
master - mean (309ms) : 299, 318
section Bailout
This PR (8352) - mean (271ms) : 269, 274
master - mean (312ms) : 302, 321
section CallTarget+Inlining+NGEN
This PR (8352) - mean (940ms) : 913, 968
master - mean (1,027ms) : 999, 1056
HttpMessageHandler (.NET 6)gantt
title Execution time (ms) HttpMessageHandler (.NET 6)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8352) - mean (265ms) : 262, 269
master - mean (301ms) : 293, 310
section Bailout
This PR (8352) - mean (265ms) : 262, 267
master - mean (303ms) : 293, 312
section CallTarget+Inlining+NGEN
This PR (8352) - mean (1,145ms) : 1111, 1180
master - mean (1,201ms) : 1165, 1238
HttpMessageHandler (.NET 8)gantt
title Execution time (ms) HttpMessageHandler (.NET 8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8352) - mean (263ms) : 260, 267
master - mean (301ms) : 294, 308
section Bailout
This PR (8352) - mean (263ms) : 261, 265
master - mean (302ms) : 296, 308
section CallTarget+Inlining+NGEN
This PR (8352) - mean (1,031ms) : 986, 1075
master - mean (1,131ms) : 1010, 1253
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
c70ab79 to
14696e1
Compare
Implements the Stable Service Instance Identifier RFC for .NET telemetry. - DD-Session-ID (= runtime_id) added to every telemetry request - DD-Root-Session-ID added only when the process inherited _DD_ROOT_DOTNET_SESSION_ID from a parent (child process scenario) - Root session ID auto-propagates to child processes via Environment.SetEnvironmentVariable() - _DD_ROOT_DOTNET_SESSION_ID registered in supported-configurations.yaml as ConfigurationKeys.Telemetry.RootSessionId Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1334cb0 to
dc79c61
Compare
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4351aa034f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
tracer/src/Datadog.Trace/Telemetry/Transports/JsonTelemetryTransport.cs
Outdated
Show resolved
Hide resolved
Eagerly set _DD_ROOT_DOTNET_SESSION_ID in TracerManager constructor so child processes spawned before the first telemetry flush inherit it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@codex review |
|
Codex Review: Didn't find any major issues. Can't wait for the next one! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
| documentation: |- | ||
| Configuration key for whether telemetry metrics should be sent. | ||
| <see cref="Datadog.Trace.Telemetry.TelemetrySettings.MetricsEnabled"/> | ||
| _DD_ROOT_DOTNET_SESSION_ID: |
There was a problem hiding this comment.
Why the leading underscore? _DD... instead of DD..? Is that something we're doing now for internal env vars? The RFC says DD_ROOT_<LIB>_SESSION_ID.
There was a problem hiding this comment.
Agreed, this seems non standard
There was a problem hiding this comment.
Prefixing environment variables with _DD is convention we use in dd-trace-py to mark a configuration as internal/private. All implementations currently follow this spec, the implementation plan was out of date but I just pushed a fix.
Ideally we'd use _DD to be consistent across SDKs but this isn't a big deal for us. We would like to do what's best for the .NET library and ideally this PR would introduce minimal changes to library internals. From a backend perspective as long as the expected headers sent by the telemetry client we should be good.
There was a problem hiding this comment.
If that's the standard and it follows the spec and the other implementations, then yeah, go for it. I only asked because I had not seen it before and I didn't see in the RFC. Thanks!
(I wonder though it we have places where we look for DD_* env vars that would need to be updated. Probably not production code, but maybe tests or scripts that list all DD env vars for troubleshooting.)
There was a problem hiding this comment.
The _DD prefix was established by the Python implementation (dd-trace-py#16839).
The implementation proposal/parent RFC only mandates the HTTP header names (DD-Session-ID, DD-Root-Session-ID) the env var naming (DD_ROOT_LIB_SESSION_ID) is mentioned only in the "Process Spawning: Fork & Exec Support" appendix as an implementation detail.
The underscore prefix signals this is internal: we don't want to expose it as a public API, and keeping it private gives us flexibility to replace the propagation mechanism in a minor version without a breaking change.
That said, this is purely an internal convention if the DD_ prefix is preferred for the DOTNET implementation, I can update it. It's not customer-facing either way.
There was a problem hiding this comment.
As per discussion on slack and after updating the implementation proposal we will keep _DD_ROOT_LIB_SESSION_ID as the naming convention across SDKs for this env var
| private static string _rootSessionId; | ||
|
|
||
| public static string Get() => LazyInitializer.EnsureInitialized(ref _runtimeId, () => GetImpl()); | ||
| public static string Get() => LazyInitializer.EnsureInitialized(ref _runtimeId, () => GetRuntimeIdImpl()); |
There was a problem hiding this comment.
The naming of public Get() and private GetImpl() was intentional (one method is the private implementation of the other).
To keep that naming convention, we can either rename the private GetRuntimeIdImpl() back to GetImpl(), or rename the public method to GetRuntimeId()`, which I think is better overall to distinguish the two:
GetRuntimeId() -> GetRuntimeIdImpl()
GetRootSessionId() -> GetRootSessionIdImpl()
There was a problem hiding this comment.
RuntimeId.GetRuntimeId() 😅 arguably we need a different name for RuntimeId now but whatever 😅
There was a problem hiding this comment.
RuntimeId.GetRuntimeId()
Whoa. I was looking only at the method names and didn't notice this. We could also go with:
Get() -> GetImpl()
GetRootSessionId() -> GetRootSessionIdImpl()
No strong feelings either way, as long as we keep the internal consistency: Method()->MethodImpl().
There was a problem hiding this comment.
Went with approach 2 to minimize lines changes:
Get() -> GetImpl()
GetRootSessionId() -> GetRootSessionIdImpl()
| Environment.GetEnvironmentVariable(ConfigurationKeys.Telemetry.RootSessionId) | ||
| .Should().Be(rootSessionId); | ||
| } | ||
|
|
There was a problem hiding this comment.
This file contains tests for the Tracer class. Since these two tests are for the RuntimeId class, can you move this into a tracer/test/Datadog.Trace.Tests/Util/RuntimeIdTests.cs file? In Datadog.Trace.Tests we generally try to match the file/namespace layout of Datadog.Trace.
tracer/src/Datadog.Trace/Util/RuntimeId.cs
tracer/test/Datadog.Trace.Tests/Util/RuntimeIdTests.cs
There was a problem hiding this comment.
Moved to tracer/test/Datadog.Trace.Tests/Util/RuntimeIdTests.cs
| var rootSessionId = Datadog.Trace.Util.RuntimeId.GetRootSessionId(); | ||
| Environment.GetEnvironmentVariable(ConfigurationKeys.Telemetry.RootSessionId) | ||
| .Should().Be(rootSessionId); | ||
| } |
There was a problem hiding this comment.
Do we need a test that sets env var _DD_ROOT_DOTNET_SESSION_ID and asserts that RuntimeId.GetRootSessionId()returns the inherited value (not the current runtime_id)?
I'm not sure how we would handle resetting the private static fields in RuntimeId, though. Might need some refactoring to make it more testable.
There was a problem hiding this comment.
Added RuntimeId.ResetForTests() (follows SmallCacheOrNoCache.ResetForTests() pattern) and a single test that covers both paths: first asserts GetRootSessionId() defaults to runtime_id when no env var is set, then resets, pre-sets the env var, and asserts the inherited value is returned instead of the current runtime_id
| documentation: |- | ||
| Configuration key for whether telemetry metrics should be sent. | ||
| <see cref="Datadog.Trace.Telemetry.TelemetrySettings.MetricsEnabled"/> | ||
| _DD_ROOT_DOTNET_SESSION_ID: |
There was a problem hiding this comment.
Agreed, this seems non standard
| var sessionId = RuntimeId.Get(); | ||
| request.AddHeader(TelemetryConstants.SessionIdHeader, sessionId); | ||
| var rootSessionId = RuntimeId.GetRootSessionId(); | ||
| if (rootSessionId != sessionId) | ||
| { | ||
| request.AddHeader(TelemetryConstants.RootSessionIdHeader, rootSessionId); | ||
| } |
There was a problem hiding this comment.
Given that these headers are constants for the lifetime of the process, it would be preferable to add them to the TelemetryAgentHttpHeaderHelper and TelemetryAgentHttpHeaderHelper types. The other headers are added here because they depend on the payload. (Arguably container Metadata should be added elsewhere too, but that's a pre-existing issue 😅 )
There was a problem hiding this comment.
Moved session headers to TelemetryHttpHeaderNames
| private static string _rootSessionId; | ||
|
|
||
| public static string Get() => LazyInitializer.EnsureInitialized(ref _runtimeId, () => GetImpl()); | ||
| public static string Get() => LazyInitializer.EnsureInitialized(ref _runtimeId, () => GetRuntimeIdImpl()); |
There was a problem hiding this comment.
RuntimeId.GetRuntimeId() 😅 arguably we need a different name for RuntimeId now but whatever 😅
| } | ||
|
|
||
| var rootId = Get(); | ||
| EnvironmentHelpers.SetEnvironmentVariable(ConfigurationKeys.Telemetry.RootSessionId, rootId); |
There was a problem hiding this comment.
For clarity, this simply won't work in many cases, such that this whole approach is potentially flawed...
- On Unix, environment changes are not seen by native processes. There's no way to remediate this.
- Users may well call
Process.Start()with a custom env var block, in which case this won't be inherited. - If
UseShellExecute=trueis set when starting a new process, this won't work.
The "good" news is that starting new dotnet processes is not a standard pattern in general, so hopefully it just doesn't really matter...
There was a problem hiding this comment.
as we already instrument ProcessStart , maybe the RootSessionId could be passed down via the instrumentation setting some additional args or something 🤔
There was a problem hiding this comment.
starting new dotnet processes is not a standard pattern
Yet we have a whole instrumentation for it. 😉
tracer/src/Datadog.Trace/ClrProfiler/AutoInstrumentation/Process
(edit: I had not refreshed to see Anna's comment before I posted mine 😅 )
There was a problem hiding this comment.
@lucaspimentel @anna-git @andrewlock Good point about the Process instrumentation.
Would you prefer we inject _DD_ROOT_DOTNET_SESSION_ID directly into the child process's ProcessStartInfo.Environment via the existing ClrProfiler/AutoInstrumentation/Process hooks, rather than relying on Environment.SetEnvironmentVariable()? And would that address the Unix limitations Andrew mentioned and handle cases where the user provides a custom env block or sets UseShellExecute=true?
There was a problem hiding this comment.
I think since dotnet already instruments process start it might make sense to utilize that instead of env vars, we implemented it in a similar way for Node
There was a problem hiding this comment.
Resolved per @andrewlock's recommendation, we'll keep SetEnvironmentVariable() and accept it won't give 100% coverage. Process instrumentation is too risky for this. Spawning child .NET processes is rare enough that this is acceptable.
|
|
||
| // Eagerly initialize the root session ID so child processes inherit it | ||
| // even if spawned before the first telemetry flush. | ||
| _ = RuntimeId.GetRootSessionId(); | ||
|
|
There was a problem hiding this comment.
Rather than putting this here, I wonder if you actually need to call this in Instrumentation.Initialize, and make sure this flows through to the native profiler too. The reason being that they might send telemetry independently (I believe libdatadog does for example), and in that case you need to make sure they also apply the RootSessionId 🤔.
If you do modify that interface and pass it through to the profiler side, then you can remove that here too.
There was a problem hiding this comment.
@andrewlock Yes I believe that libdatadog does send telemetry independently, and those requests currently won't include the session headers.
@mabdinur is working on adding session ID support to libdatadog's FFI and once that lands we'll have a follow-up PR to pass it through from the managed side
I've also moved the GetRootSessionId() call earlier to Instrumentation.InitializeNoNativeParts()
| - implementation: A | ||
| type: string | ||
| default: null | ||
| product: Telemetry |
There was a problem hiding this comment.
I think this needs to be marked as a "platform key" so that it can be read directly from the environment
There was a problem hiding this comment.
Unfortunately, anything starting with _DD_ has to be controlled by ConfigRegistry, hence the constraint in the analyzer
There was a problem hiding this comment.
Any resolution on this @anna-git @andrewlock? Since this is purely an internal implementation detail (the RFC only mandates the HTTP header names, not the env var), I'm happy to rename it to whatever works best with DOTNET standards whether that's dropping the _DD prefix to allow it to be a platform key, or keeping the current approach
There was a problem hiding this comment.
Resolved per discussion with @anna-git , this should stay as a config key in ConfigRegistry (not a platform key). Platform keys are for non-Datadog config keys. Since this is ours (_DD_ prefix), ConfigRegistry is the right place, which is how it's currently implemented.
- Rename private GetRuntimeIdImpl() back to GetImpl() to preserve naming convention - Move session ID headers from per-request in JsonTelemetryTransport to TelemetryHttpHeaderNames default headers (constant for process lifetime) - Move tests from TracerTests.cs to Util/RuntimeIdTests.cs to match source layout - Move eager root session ID init from TracerManager to Instrumentation.InitializeNoNativeParts Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| var headers = new List<KeyValuePair<string, string>> | ||
| { | ||
| new(TelemetryConstants.ClientLibraryLanguageHeader, TracerConstants.Language), | ||
| new(TelemetryConstants.ClientLibraryVersionHeader, TracerConstants.AssemblyVersion), | ||
| new(HttpHeaderNames.TracingEnabled, "false") // don't add automatic instrumentation to requests directed to the agent | ||
| ]; | ||
| new(HttpHeaderNames.TracingEnabled, "false"), // don't add automatic instrumentation to requests directed to the agent | ||
| new(TelemetryConstants.SessionIdHeader, RuntimeId.Get()), | ||
| }; | ||
|
|
||
| AddRootSessionIdHeader(headers); | ||
|
|
||
| return headers.ToArray(); |
There was a problem hiding this comment.
If we're going to use List<T> and we know the number of headers it will contain, we should initialize it with the required size to minimize the heap allocations of growing the list.
We should also look into removing ToArray(), which allocates another array. Maybe we can change the return type from KeyValuePair<string, string>[] to List<KeyValuePair<string, string>>.
There was a problem hiding this comment.
Changed to pre-sized arrays with direct index assignment
| $"{TelemetryConstants.ClientLibraryLanguageHeader}: {TracerConstants.Language}" + DatadogHttpValues.CrLf + | ||
| $"{TelemetryConstants.ClientLibraryVersionHeader}: {TracerConstants.AssemblyVersion}" + DatadogHttpValues.CrLf + | ||
| $"{HttpHeaderNames.TracingEnabled}: false" + DatadogHttpValues.CrLf; | ||
| internal static string HttpSerializedDefaultAgentHeaders { get; } = BuildSerializedAgentHeaders(); |
There was a problem hiding this comment.
This used to be a compile-time constant but now it's doing more work in a static initializer, including calls to RuntimeId.Get() and RuntimeId.GetRootSessionId(). Consider using a lazy initializer instead.
There was a problem hiding this comment.
Changed to LazyInitializer.EnsureInitialized, defers the RuntimeId calls until first access instead of running in the static initializer
| internal static KeyValuePair<string, string>[] GetDefaultIntakeHeaders(TelemetrySettings.AgentlessSettings settings) | ||
| { | ||
| var headerCount = settings.Cloud is null ? 4 : 7; | ||
| var headers = new List<KeyValuePair<string, string>> |
There was a problem hiding this comment.
Same comment as above. If we're doing to use List<T>, try to initialize to the required size and try to avoid the ToArray().
There was a problem hiding this comment.
Changed to pre-sized arrays with direct index assignment
| headers[4] = new(TelemetryConstants.CloudProviderHeader, cloud.Provider); | ||
| headers[5] = new(TelemetryConstants.CloudResourceTypeHeader, cloud.ResourceType); | ||
| headers[6] = new(TelemetryConstants.CloudResourceIdentifierHeader, cloud.ResourceIdentifier); | ||
| serialized += $"{TelemetryConstants.RootSessionIdHeader}: {rootSessionId}" + DatadogHttpValues.CrLf; |
There was a problem hiding this comment.
We shouldn't concatenate strings like this. Consider using a cached StringBuilder or a ValueStringBuilder to reduce heap allocations.
There was a problem hiding this comment.
Changed to StringBuilderCache.Acquire() / GetStringAndRelease()
| $"{HttpHeaderNames.TracingEnabled}: false" + DatadogHttpValues.CrLf + | ||
| $"{TelemetryConstants.SessionIdHeader}: {RuntimeId.Get()}" + DatadogHttpValues.CrLf; | ||
|
|
||
| var sessionId = RuntimeId.Get(); |
There was a problem hiding this comment.
There's another call to RuntimeId.Get() two lines above that we may want to consolidate.
- Use pre-sized arrays instead of List<T> + ToArray() - Lazy-initialize HttpSerializedDefaultAgentHeaders - Use StringBuilderCache instead of string concatenation - Consolidate duplicate RuntimeId.Get() calls Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add RuntimeId.ResetForTests() to allow resetting cached state - Add RootSessionId_InheritsFromEnvVar test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
199c641 to
3406b39
Compare
Summary
Adds stable session ID headers to telemetry requests per the Stable Service Instance Identifier RFC.
runtime_id) added to every telemetry request via default headers_DD_ROOT_DOTNET_SESSION_IDenv var (registered in config registry)Instrumentation.InitializeNoNativeParts()so child processes spawned before the first telemetry flush inherit itFiles changed
Instrumentation.cs— Eagerly initializes root session ID at startupRuntimeId.cs—GetRootSessionId()with env var inheritance and auto-propagationTelemetryConstants.cs— Header name constantsTelemetryHttpHeaderNames.cs— Session headers added to default agent and intake headerssupported-configurations.yaml— Registers_DD_ROOT_DOTNET_SESSION_IDin config registryRuntimeIdTests.cs— Tests for root session ID (default + inherited paths)Related PRs
Test plan
SetsRequiredHeadersvalidates DD-Session-ID present on all telemetryRootSessionId_UsesRuntimeIdWhenNotInherited_AndInheritsWhenSetcovers both paths