feat(tracer): Implement OTLP Traces Export#4600

Open
mtoffl01 wants to merge 49 commits into main from mtoff/otlp-export-traces

Conversation

@mtoffl01
Contributor

@mtoffl01 mtoffl01 commented Mar 25, 2026

What does this PR do?

Implements the OTLP trace export pipeline (See: RFC) for dd-trace-go: when OTEL_TRACES_EXPORTER=otlp is set, the tracer converts Datadog spans into OTLP protobuf format and sends them directly to an OpenTelemetry Collector instead of the Datadog agent.

This adds three new components:

  1. span_to_otlp.go — converts internal Datadog spans to OTLP TracesData protobuf, including resource attributes, span kind/status, events, links, 128-bit trace IDs, and sampling filtering (only sampled spans are exported).
  2. otlp_writer.go — a traceWriter implementation that buffers converted OTLP spans, marshals them to protobuf, and flushes with retry logic.
  3. otlp_transport.go — a lightweight HTTP transport for sending protobuf payloads to an OTLP endpoint, separate from the existing Datadog-protocol transport.

The PR also renames the existing transport interface and config.transport field to ddTransport to make it clear that it handles Datadog-specific traffic (msgpack traces, stats), and ensures ddTransport always points at the Datadog agent even when OTLP mode is active.

Motivation

We are adding native OTLP export support to dd-trace-go so users can send traces directly to any OTLP-compatible collector. A prior PR added the configuration layer that resolves OTEL_TRACES_EXPORTER, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, and OTEL_EXPORTER_OTLP_TRACES_HEADERS. This PR builds the runtime pipeline on top of that configuration.

IMPORTANT NOTE!

The existing agentTraceWriter and ddTransport were too tightly coupled to Datadog conventions (msgpack encoding, agent sampling rate feedback, stats collection) to extend for the OTLP path, so a separate writer and transport were introduced instead. The otlpTraceWriter owns its own otlpTransport directly (rather than reaching through config.ddTransport), keeping OTLP concerns isolated from the Datadog agent path. A follow-up PR may remove ddTransport from the config struct entirely, instead passing the transport directly to each writer and consumer at construction time. This would make all writer implementations structurally consistent, each owning its transport as a field, and eliminate the implicit coupling through config.

An additional PR to disable stats collection in otlp mode is in progress here.

Reviewer's Checklist

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • New code is free of linting errors. You can check this by running make lint locally.
  • New code doesn't break existing tests. You can check this by running make test locally.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • All generated files are up to date. You can check this by running make generate locally.
  • Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild. Make sure all nested modules are up to date by running make fix-modules locally.

Unsure? Have a question? Request a review!

mtoffl01 and others added 30 commits March 23, 2026 09:30
… service name differs from global service name
…me (#4576)

### What does this PR do?

Moves the `strconv.ParseInt` call for `_dd.p.dm` (decision maker) out of the v1 payload flush path and into the write path.

A `dm uint32` field is added to the `trace` struct. It is kept in sync with `propagatingTags[keyDecisionMaker]` at all mutation sites (`setPropagatingTagLocked`, `unsetPropagatingTagLocked`, `replacePropagatingTags`). A shared `unsetPropagatingTagLocked` helper is introduced so the delete-and-clear logic is not duplicated between `unsetPropagatingTag` and `setSamplingPriorityLockedWithForce`.

`payloadV1.push` now reads the pre-computed value via `trace.decisionMaker()` instead of fetching the string from the propagating tags map and parsing it on every flush.

### Motivation

`payloadV1.push` is called once per finished trace chunk. On every call it was acquiring a read lock, doing a map lookup, and then running `strconv.ParseInt` — all to read a value that was set at most once or twice during the lifetime of the trace. Moving the parse to write-time eliminates the per-flush map lookup and parse, replacing them with a single scalar read behind an `RLock`.

### Reviewer's Checklist

- [x] Changed code has unit tests for its functionality at or near 100% coverage.
- [x] New code is free of linting errors. You can check this by running `make lint` locally.
- [x] New code doesn't break existing tests. You can check this by running `make test` locally.
- [x] Add an appropriate team label so this PR gets put in the right place for the release notes.
- [x] All generated files are up to date. You can check this by running `make generate` locally.


Co-authored-by: dario.castane <dario.castane@datadoghq.com>
…ED` once, use a global `bool` (#4548)

Stop checking the environment variable `DD_TRACE_128_BIT_TRACEID_GENERATION_ENABLED` on every span start; check it once and cache the result in a global `bool`.

It would be better to load it in `newConfig` and make the value available from there, but the check happens in `newSpanContext`, and modifying that would ripple through many files, including tests.

Complete #1905 work using the current benchmarking platform.

- [x] Changed code has unit tests for its functionality at or near 100% coverage.
- [x] There is a benchmark for any new code, or changes to existing code.
- [x] New code is free of linting errors. You can check this by running `make lint` locally.
- [x] New code doesn't break existing tests. You can check this by running `make test` locally.
- [x] Add an appropriate team label so this PR gets put in the right place for the release notes.

Unsure? Have a question? Request a review!

Co-authored-by: kakkoyun <kakkoyun@users.noreply.github.com>
Co-authored-by: kemal.akkoyun <kemal.akkoyun@datadoghq.com>
Signed-off-by: Moe Zein <moe.zein@datadoghq.com>
Co-authored-by: Benjamin De Bernardi <debernardi.benjamin@gmail.com>
@pr-commenter

pr-commenter bot commented Mar 25, 2026

Benchmarks

Benchmark execution time: 2026-03-27 18:57:30

Comparing candidate commit 12d75b3 in PR branch mtoff/otlp-export-traces with baseline commit ee8dce2 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 215 metrics, 9 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because most changes are small:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

@mtoffl01 mtoffl01 marked this pull request as ready for review March 25, 2026 18:51
@mtoffl01 mtoffl01 requested a review from a team as a code owner March 25, 2026 18:51
@mtoffl01 mtoffl01 marked this pull request as draft March 25, 2026 19:38
@mtoffl01 mtoffl01 marked this pull request as ready for review March 26, 2026 19:44
@mtoffl01
Contributor Author

mtoffl01 commented Mar 27, 2026

Benchmark comparison: OTLP vs Agent TraceWriter

Time (sec/op)

| Benchmark | OTLP | Agent | Comparison |
| --- | --- | --- | --- |
| Add/1span | 512.8ns | 549.7ns | ~same |
| Add/5spans | 2.513µs | 772.9ns | Agent 3.3x faster |
| Add/10spans | 5.076µs | 1.399µs | Agent 3.6x faster |
| Add/50spans | 25.06µs | 6.370µs | Agent 3.9x faster |
| Flush | 47.12µs | 390.7µs | OTLP 8.3x faster |
| ConcurrentAdd/1 goroutine | 2.893µs | 1.918µs | Agent 1.5x faster |
| ConcurrentAdd/2 goroutines | 6.359µs | 4.543µs | Agent 1.4x faster |
| ConcurrentAdd/4 goroutines | 12.35µs | 7.930µs | Agent 1.6x faster |
| ConcurrentAdd/8 goroutines | 24.53µs | 16.42µs | Agent 1.5x faster |

Allocations (allocs/op)

| Benchmark | OTLP | Agent | Comparison |
| --- | --- | --- | --- |
| Add/1span | 15 | 1 | Agent 15x fewer |
| Add/5spans | 75 | 1 | Agent 75x fewer |
| Add/10spans | 150 | 1 | Agent 150x fewer |
| Add/50spans | 750 | 1 | Agent 750x fewer |
| Flush | 102 | 223 | OTLP 2.2x fewer |

Summary

  • add: OTLP is slower because it allocates proto structs per span (15 allocs/span), whereas the agent encodes directly into a reusable byte buffer (1 alloc/span).
  • flush: OTLP is faster despite needing to proto.Marshal at flush time, whereas the agent's payload is already serialized and just needs to be sent. The difference is likely due to the agent flush path doing more work (payload bookkeeping, rate reading, statsd metrics) than the OTLP flush path (proto.Marshal, send, retry).
  • concurrent add: Both scale linearly with goroutine count, indicating clean mutex behavior with no unexpected contention overhead.

@darccio
Member

darccio commented Mar 27, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c371138f9b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

```go
func newOTLPTraceWriter(c *config) *otlpTraceWriter {
	return &otlpTraceWriter{
		config:    c,
		transport: newOTLPTransport(c.httpClient, c.internalConfig.OTLPTraceURL(), c.internalConfig.OTLPHeaders()),
```

P1 Badge Build OTLP transport with a non-UDS HTTP client

In OTLP mode this constructor reuses c.httpClient, but newConfig builds that client from the Datadog agent URL and, when the agent is configured via unix:// (or auto-detected UDS), the client dialer always targets the Unix socket regardless of request host. As a result, OTLP exports to OTEL_EXPORTER_OTLP_TRACES_ENDPOINT are sent through the agent socket and fail instead of reaching the collector. This breaks OTLP trace export in UDS-based deployments unless users manually override the HTTP client.


Comment on lines +276 to +280
```go
	case int64:
		return otlpIntValue(val)
	case float64:
		return otlpDoubleValue(val)
	case []any:
```

P2 Badge Preserve typed meta_struct values in OTLP conversion

This conversion only treats int64, float64, and []any as typed OTLP values, so common meta_struct payloads like int, []int64, []string, and other concrete slice/integer forms fall into the JSON-string fallback path. That silently changes structured attributes into strings in OTLP output, which loses type information and can break downstream processing that expects numeric/array values.


Contributor Author


It's accurate that there are a few types not captured by this switch but to capture all types would be infinite and recursive. The only types stored in meta_struct and not captured here, that I could find, are:

  1. map[string][]any
  2. map[string][]*stacktrace.Event

```go
	return &otlptrace.Span{
		TraceId:      convertTraceID(s.context.traceID.Upper(), s.context.traceID.Lower()),
		SpanId:       convertSpanID(s.spanID),
		ParentSpanId: convertSpanID(s.parentID),
```
Member


What should be the value of ParentSpanId if s.parentID is 0?

Contributor Author


I think []byte{0}... @zacharycmontoya is that true?

Comment on lines +165 to +170
```go
	add := func(key string, val *otlpcommon.AnyValue) bool {
		if val != nil {
			attrs = append(attrs, &otlpcommon.KeyValue{Key: key, Value: val})
		}
		return len(attrs) < maxAttributesCount
	}
```
Member


Could this be a real function? No need for a closure here, I think.

Also, what will happen with the dropped attributes when len(attrs) == maxAttributesCount? I'm curious too about the returned boolean, because it's checked sometimes but not always.

Contributor Author


  1. Closure vs function: You're right, this closure is built and torn down on every call to convertSpanAttributes, which we expect to happen very frequently.
  2. Once we hit the limit, we silently drop the rest of the attributes for now.
  3. Checking the returned boolean inconsistently was an oversight; it should be fixed now.

"github.com/DataDog/dd-trace-go/v2/internal/version"
)

const maxAttributesCount = 128
Member


Thinking about this limit, I assume it comes from the spec. Isn't the order in convertSpanAttributes relevant?

If meta contains hundreds of tags, which is pretty common, we risk losing important attributes. Maps iterate in random order, so we should probably guarantee that the required attributes the collector expects are added first.

Contributor Author


Yes, but the priority order is explicitly undefined for now, beyond span.kind and operation.name which have the top priority.

From my conversation with @zacharycmontoya the other day:

We should prioritize these two span attributes [span.kind and operation.name] over all others, so I recommend adding them first.
I don't really want to prescribe an order of preference for the meta/metrics/metastruct, only that operation.name and span.type are always set first

Contributor Author


In follow-up PRs we would support the "dropped attributes count" -- a metric that increments for every attribute we have to drop beyond the limit -- but that's not in scope yet

```go
	assert.Equal(t, numAdders*spansPerAdder, totalSpans)
}

// TODO: Add benchmark tests.
```
Member


Aren't benchmarks present in otlp_writer_bench_test.go?

Contributor Author


yes this was leftover 😆


```go
// send posts a protobuf-encoded payload to the configured OTLP endpoint.
func (t *otlpTransport) send(data []byte) error {
	req, err := http.NewRequest("POST", t.endpoint, bytes.NewReader(data))
```
Member


Should we use a context to be able to cancel the request in case there are timeouts?

@darccio darccio requested review from a team and darccio and removed request for darccio March 27, 2026 17:46
mtoffl01 and others added 2 commits March 27, 2026 14:08