feat(data-pipeline)!: export client-computed span stats as OTLP trace metrics #2067
mabdinur wants to merge 10 commits into
🎉 All green! All tests passed. Commit SHA: 3bd3773
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2067 +/- ##
==========================================
+ Coverage 73.42% 73.46% +0.04%
==========================================
Files 465 466 +1
Lines 77949 78552 +603
==========================================
+ Hits 57231 57706 +475
- Misses 20718 20846 +128
Foundation pieces consumed by the OTLP trace-metrics exporter that follows. These are pure additions with no breaking changes.
- libdd-ddsketch: `DDSketch::from_pb` rebuilds a sketch from its protobuf representation (or `None` when the mapping is missing/invalid); a thin `DDSketch::from_encoded` wraps protobuf decoding + `from_pb`. Lets callers read back the ok/error sketches that the span concentrator publishes. Includes a roundtrip test that goes `encode_to_vec` -> `from_encoded` and asserts bin count + total weight survive the trip.
- libdd-trace-utils: extend `OtlpResourceInfo` with two new fields: `hostname` (emitted as the `host.name` resource attribute when set) and `process_tags` (comma-separated `key:value` pairs, each becoming a `dd.<key>` resource attribute). The struct is `#[non_exhaustive]`, so adding fields is forward-compatible.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
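The `process_tags` mapping described above can be sketched as follows. This is a minimal illustration, not the code from this commit: the function name `process_tags_to_attributes` is hypothetical, and skipping malformed entries (no `:` or an empty key) is an assumption the commit message does not state.

```rust
/// Hypothetical helper: turns a comma-separated `key:value` process-tag string
/// into `dd.<key>` resource attributes.
/// Assumption: entries without a ':' or with an empty key are skipped.
fn process_tags_to_attributes(process_tags: &str) -> Vec<(String, String)> {
    process_tags
        .split(',')
        .filter_map(|pair| {
            // Split on the first ':' only, so values may themselves contain ':'.
            let (key, value) = pair.trim().split_once(':')?;
            if key.is_empty() {
                return None;
            }
            Some((format!("dd.{key}"), value.to_string()))
        })
        .collect()
}
```

Splitting on the first `:` only keeps values such as URLs intact; whether the real implementation does the same is not confirmed by the source.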
Make the span concentrator accumulate exact per-cell (ok/error) duration totals and min/max in nanoseconds alongside the existing combined `duration` that the /v0.6/stats agent payload uses, and publish them on a new sidecar that the OTLP trace-metrics path can read.
- `GroupedStats` gains six pub(super) accumulators (`ok_duration`/`ok_min`/`ok_max` + the error trio) updated inside `insert`. They are seeded on the first span in each cell (count == 1) so the natural `0` default cannot masquerade as a real minimum.
- New public types `OtlpExactCell`, `OtlpExactGroup`, `OtlpStatsBucket` carry the exact scalars alongside an unmodified `pb::ClientStatsBucket`. The `grpc_method` field on `OtlpExactGroup` is intentionally introduced here but only ever populated with `String::new()`; a later commit fills it in.
- `StatsBucket::flush` now delegates to a new `flush_with_otlp_exact` which produces both the protobuf bucket (identical bytes) and the parallel sidecar. `SpanConcentrator::flush` and `flush_with_otlp_exact` share a generic `drain_due_buckets` helper so the bucket-window/buffer-len logic stays in one place.
- A new concentrator test drives the full path through `add_span` for 3 ok + 2 error spans and asserts each cell's count/duration/min/max plus `ok_duration + error_duration == group.duration` (the agent field).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
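A minimal sketch of the seeding rule, assuming a simplified layout: the types `ExactCell` and `ExactGroup` are hypothetical stand-ins, whereas the commit stores six flat accumulators on `GroupedStats`. The point it illustrates is that min/max are written on the first span of a cell rather than compared against the `0` default.

```rust
/// Hypothetical per-cell accumulator (one for ok spans, one for error spans),
/// tracking exact totals in nanoseconds.
#[derive(Debug, Default, Clone, PartialEq)]
struct ExactCell {
    count: u64,
    duration: u64,
    min: u64,
    max: u64,
}

impl ExactCell {
    fn insert(&mut self, duration_ns: u64) {
        self.count += 1;
        self.duration += duration_ns;
        if self.count == 1 {
            // Seed on the first span so the zero default is never read as a minimum.
            self.min = duration_ns;
            self.max = duration_ns;
        } else {
            self.min = self.min.min(duration_ns);
            self.max = self.max.max(duration_ns);
        }
    }
}

/// Hypothetical group holding both cells plus the combined agent-side duration.
#[derive(Debug, Default)]
struct ExactGroup {
    ok: ExactCell,
    error: ExactCell,
    duration: u64,
}

impl ExactGroup {
    fn insert(&mut self, duration_ns: u64, is_error: bool) {
        self.duration += duration_ns;
        if is_error {
            self.error.insert(duration_ns);
        } else {
            self.ok.insert(duration_ns);
        }
    }
}
```

Feeding it 3 ok and 2 error spans reproduces the invariant the new concentrator test checks: the two per-cell totals sum to the combined `duration`.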
…nfig
Prepare the data-pipeline OTLP layer to host a second exporter (trace metrics) without changing the existing trace path's behavior or public API.
- `otlp/exporter.rs`: factor the actual POST + retry plumbing into a new crate-private `send_otlp_http(endpoint_url, headers, timeout, ...)` helper. `send_otlp_traces_http` becomes a thin wrapper that pulls fields out of `OtlpTraceConfig` and calls it; the existing public function signature is unchanged, so external callers see no diff. Two new pub(crate) constants (`OTLP_MAX_ATTEMPTS`, `OTLP_SHUTDOWN_MAX_ATTEMPTS`) replace the previous `OTLP_MAX_RETRIES` literal so the trace-metrics worker can use a single attempt on shutdown.
- `otlp/config.rs`: add `OtlpMetricsConfig` mirroring `OtlpTraceConfig` plus an `otel_trace_semantics_enabled` flag for `DD_TRACE_OTEL_SEMANTICS_ENABLED`. Annotated `#[allow(dead_code)]` until a follow-up commit consumes it.
- `trace_exporter/builder.rs`: factor the inline OTLP header-map builder out of `build_async` into a small `build_otlp_header_map` helper and refactor the existing OTLP traces config building to use it. No behavior change; this dedup makes the metrics-config branch trivial when it lands.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…metrics
Wire up the actual OTLP trace-metrics exporter on top of the foundation
pieces from earlier commits.
- New `libdd-data-pipeline/src/otlp/metrics.rs`:
  - `map_stats_to_otlp_metrics` builds an `ExportMetricsServiceRequest`
    JSON value from `&[OtlpStatsBucket]` (one histogram data point per
    aggregation-key (ok|error) cell). `count`/`sum`/`min`/`max` come from
    the sidecar's exact accumulators (ns -> s); `bucketCounts` is projected
    from the per-cell DDSketch onto a fixed 17-bucket spanmetrics-style
    layout. Empty cells are suppressed.
  - `OtlpStatsExporter<C>` runs as a `libdd_shared_runtime::Worker`:
    `trigger` waits one flush interval, `run` flushes + sends with
    `OTLP_MAX_ATTEMPTS`, `shutdown` force-flushes with
    `OTLP_SHUTDOWN_MAX_ATTEMPTS` (single attempt) so the final bucket is
    delivered inside the bounded shutdown window.
  - The mapper consumes `exact.grpc_method` (always empty here) so the
    later breaking-change commit only has to fill it in.
- `otlp/mod.rs`: declare the new `metrics` module, re-export
`OtlpMetricsConfig` and `OtlpStatsExporter`, and extend the module-level
doc to describe the trace-metrics path.
- `trace_exporter/builder.rs`: add `otlp_metrics_endpoint`,
`otlp_metrics_headers` and `otel_trace_semantics_enabled` fields with
matching setters (`set_otlp_metrics_endpoint`, `set_otlp_metrics_headers`,
`enable_otel_trace_semantics`). When both an OTLP metrics endpoint and a
stats bucket size are configured, spawn an `OtlpStatsExporter` worker on
the shared runtime against an unconditionally-started
`SpanConcentrator`; set a new `otlp_stats_enabled` flag on `TraceExporter`
so the agent-info gate cannot later disable stats. The agent /v0.6/stats
payload bytes are unchanged when no OTLP metrics endpoint is set.
- `trace_exporter/mod.rs`: add the `otlp_stats_enabled` field on
`TraceExporter`.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
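The projection onto a fixed 17-bucket layout can be sketched roughly as below. Several things here are assumptions rather than details taken from this commit: the bounds are the OTel spanmetrics-connector defaults as I understand them (16 bounds, hence 17 buckets), sketch bins are simplified to `(representative value in ns, integer weight)` pairs, and the names `BOUNDS_S`, `ns_to_s` and `project_bins` are hypothetical.

```rust
/// Assumed explicit bounds in seconds; 16 bounds give 17 buckets.
const BOUNDS_S: [f64; 16] = [
    0.002, 0.004, 0.006, 0.008, 0.010, 0.050, 0.100, 0.200,
    0.400, 0.800, 1.0, 1.4, 2.0, 5.0, 10.0, 15.0,
];

/// Converts nanoseconds to seconds, the unit used for the histogram data points.
fn ns_to_s(ns: u64) -> f64 {
    ns as f64 / 1e9
}

/// Projects simplified sketch bins (value in ns, weight) onto the fixed layout.
/// Real DDSketch bins carry fractional weights; integers keep the sketch short.
fn project_bins(bins: &[(u64, u64)]) -> [u64; 17] {
    let mut counts = [0u64; 17];
    for &(value_ns, weight) in bins {
        let v = ns_to_s(value_ns);
        // Index of the first bound >= v: OTLP explicit-bounds buckets have
        // inclusive upper bounds, and the last bucket is open-ended.
        let idx = BOUNDS_S.partition_point(|&b| b < v);
        counts[idx] += weight;
    }
    counts
}
```

Because `count`/`sum`/`min`/`max` come from the exact accumulators, only `bucketCounts` inherits the sketch's approximation error in a scheme like this.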
Add the gRPC method name to the aggregation key so spans sharing the same
service/resource/etc. but different `grpc.method.name` aggregate into
distinct groups, and surface the value via the OTLP trace-metrics sidecar
introduced earlier on this branch.
- `aggregation.rs`:
  - New `GRPC_METHOD_FIELD` lookup list (`grpc.method.name`, fallback
    `rpc.method`) consumed by a new `get_grpc_method` helper.
  - New `FixedAggregationKey<T>.grpc_method` field, appended at the END of
    the struct so the `PartialOrd` derive's field order (and therefore the
    ordering of any existing comparisons) is unaffected for the pre-existing
    fields.
  - `BorrowedAggregationKey::from_obfuscated_span` now picks up
    `grpc_method`; `OwnedAggregationKey::From<pb::ClientGroupedStats>` sets
    it to `""` (the agent stats protobuf does not carry it).
  - `StatsBucket::flush_with_otlp_exact` does `std::mem::take` on the key's
    `grpc_method` and moves it into `OtlpExactGroup.grpc_method` before
    encoding the agent payload, so the OTLP path reads it from the sidecar
    while the /v0.6/stats wire format stays byte-for-byte unchanged.
  - Aggregation test gains a case asserting that `grpc.method.name` (and
    by fallthrough, `rpc.method`) are extracted into the key.
- `datadog-ipc/src/shm_stats.rs`: the SHM concentrator's
`FixedAggregationKey` test fixture grows a `grpc_method: ""` field.
BREAKING CHANGE: `FixedAggregationKey<T>` (re-exported from
`libdd_trace_stats::span_concentrator`) gains a public `grpc_method: T`
field. External callers that construct it via a struct literal must add
the field; callers using `Default::default()` are unaffected. The /v0.6/stats
agent protobuf wire format and behavior are unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
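The ordered lookup with fallback might look roughly like this. It is a sketch under assumptions: span meta is modelled as a plain `HashMap<String, String>`, and `GRPC_METHOD_FIELDS` and `lookup_grpc_method` are hypothetical names, not the `GRPC_METHOD_FIELD`/`get_grpc_method` items added by the commit.

```rust
use std::collections::HashMap;

/// Lookup order taken from the commit message: the primary tag first, then the fallback.
const GRPC_METHOD_FIELDS: [&str; 2] = ["grpc.method.name", "rpc.method"];

/// Hypothetical helper: returns the first tag present in `meta`, or "" when
/// neither is set. Assumption: an empty value counts as present.
fn lookup_grpc_method(meta: &HashMap<String, String>) -> &str {
    GRPC_METHOD_FIELDS
        .iter()
        .find_map(|field| meta.get(*field).map(String::as_str))
        .unwrap_or("")
}
```

Returning a borrowed `&str` fits the borrowed-key construction path mentioned above, since no allocation is needed until the key is turned into an owned one.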
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f330f8bee6
…n SDK computes stats
When `otlp_stats_enabled`, add `_dd.stats_computed="true"` to the OTLP ResourceSpans resource attributes and `Datadog-Client-Computed-Stats: yes` to the HTTP request headers. The Agent's OTLP receiver already checks both signals (otlp.go:372, otlp.go:272) and skips its concentrator when either is set, preventing double-counted APM metrics.
The resource attribute survives Collector hops (unlike HTTP headers); the header covers direct SDK→Agent connections. Both are backwards compatible: older Agents and non-Datadog OTLP receivers silently ignore unknown resource attributes and headers.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
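As a rough illustration of the two signals, assuming resource attributes and request headers are both modelled as plain string maps (the real exporter builds an OTLP JSON payload and an HTTP header map), and with `mark_stats_computed` as a hypothetical name:

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: adds both "stats already computed" signals when
/// client-side OTLP stats are enabled, and leaves the maps untouched otherwise.
fn mark_stats_computed(
    otlp_stats_enabled: bool,
    resource_attrs: &mut BTreeMap<String, String>,
    headers: &mut BTreeMap<String, String>,
) {
    if !otlp_stats_enabled {
        return;
    }
    // Travels inside the payload, so it survives Collector hops.
    resource_attrs.insert("_dd.stats_computed".into(), "true".into());
    // Only seen by the first HTTP hop, e.g. a direct SDK -> Agent connection.
    headers.insert("Datadog-Client-Computed-Stats".into(), "yes".into());
}
```

Sending both is redundant on a direct connection, but it means the receiver can skip its concentrator whichever path the data takes.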
… negotiation with OTLP stats
`grpc_method` was part of `FixedAggregationKey` (Hash+PartialEq), splitting same-service gRPC spans into separate buckets that `encode_grouped_stats` then serialised with an empty method, producing duplicate indistinguishable `ClientGroupedStats` rows on the /v0.6/stats path. Move it to `GroupedStats` (value side), set on group creation, and surface it to `OtlpExactGroup` from there. This also removes the one breaking change introduced by the prior commit.
`check_agent_info` returned before `refresh_v1_active` when `otlp_stats_enabled`, preventing V1 protocol negotiation for exporters that combine `enable_v1_protocol` with OTLP metrics. Move the early return to after the V1 refresh so only stats enable/disable is skipped.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
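The difference between key-side and value-side storage can be sketched with a plain `HashMap`. The types `Key` and `Group` and the function `add_span` here are hypothetical simplifications of `FixedAggregationKey`, `GroupedStats` and the concentrator's insert path; keeping the first span's method when later spans differ is an assumption based on "set on group creation".

```rust
use std::collections::HashMap;

/// Hypothetical aggregation key: only these fields take part in Hash/PartialEq.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Key {
    service: String,
    resource: String,
}

/// Hypothetical value side: the method rides along without affecting grouping.
#[derive(Debug)]
struct Group {
    hits: u64,
    grpc_method: String,
}

fn add_span(groups: &mut HashMap<Key, Group>, key: Key, grpc_method: &str) {
    let group = groups.entry(key).or_insert_with(|| Group {
        hits: 0,
        // Set once, when the group is created; later spans do not overwrite it.
        grpc_method: grpc_method.to_string(),
    });
    group.hits += 1;
}
```

With the method on the value side, two spans that differ only in gRPC method land in one group, so the agent payload no longer contains duplicate rows.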
const OTLP_MAX_RETRIES: u32 = 4;
/// Initial backoff between retries (milliseconds).
/// Max total attempts for OTLP export (initial + retries on transient failures).
pub(crate) const OTLP_MAX_ATTEMPTS: u32 = 5;
nit: Any reason to use attempts instead of retries? I'm trying to converge on retries, as mixing the two can cause bugs.
Fair nit — happy to rename back to retries for consistency. The rename to attempts was to make the semantics clearer (OTLP_SHUTDOWN_MAX_ATTEMPTS = 1 means one try, no retries), but retries is fine too if that's the established convention.
ichinaski left a comment
Looks good from the Data Pipeline point of view
- Rename `OTLP_MAX_ATTEMPTS`/`OTLP_SHUTDOWN_MAX_ATTEMPTS` to `OTLP_MAX_RETRIES`/`OTLP_SHUTDOWN_MAX_RETRIES` and rename the `max_attempts` parameter to `max_retries` throughout, converging on the retries convention used elsewhere
- Add `TraceExporterBuilder::set_runtime_id` so callers can supply the language tracer's existing runtime_id; falls back to a generated UUID when not set, ensuring OTLP trace exports and OTLP trace-metrics share the same runtime_id for backend correlation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
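The off-by-one behind the attempts/retries naming discussion can be made concrete with a small loop. This is a sketch only: `run_with_retries` is a hypothetical function, it has no backoff, and it retries on any error, whereas the exporter retries on transient failures.

```rust
/// Hypothetical retry loop in the "retries" convention: `max_retries` counts
/// only the tries after the first, so total calls <= max_retries + 1.
/// Returns the last result together with the number of calls made.
fn run_with_retries<T, E>(
    max_retries: u32,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> (Result<T, E>, u32) {
    let mut calls = 0;
    loop {
        let result = op(calls);
        calls += 1;
        // Stop on success, or once the initial try plus all retries are spent.
        if result.is_ok() || calls > max_retries {
            return (result, calls);
        }
    }
}
```

Under this convention a single-attempt shutdown flush corresponds to `max_retries = 0`, and the earlier "5 attempts" corresponds to `max_retries = 4`.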
Overview
Adds native OTLP trace-metrics export to `libdd-data-pipeline`. The span concentrator's per-cell stats are encoded as a `traces.span.sdk.metrics.duration` delta histogram and POSTed to a configured `/v1/metrics` endpoint on a 10s cadence. This is the shared Rust implementation consumed by dd-trace-py and other Datadog tracers; tracers only supply configuration.
Non-obvious details for reviewers
- `count`/`sum`/`min`/`max` are exact (from per-cell accumulators, ns→s), not reconstructed from the DDSketch. The bucket distribution is projected from the DDSketch onto a fixed 17-bucket layout matching the OTel spanmetrics-connector defaults so payloads are comparable across tracers.
- `check_agent_info` still runs V1 protocol negotiation but skips stats enable/disable; the concentrator starts unconditionally and is not affected by agent info.
- `_dd.stats_computed: "true"` is added to every outbound OTLP trace `ResourceSpans` (and `Datadog-Client-Computed-Stats: yes` to the HTTP request) when stats are enabled. The Agent's OTLP receiver checks this attribute to skip its concentrator, preventing double-counted APM metrics. Set unconditionally regardless of OTel semantics mode. Both signals are backwards compatible with released Agents.
- `grpc_method` is stored in `GroupedStats` (value side), not in `FixedAggregationKey`. This keeps it out of `Hash`/`PartialEq` so gRPC spans with different methods merge into one agent stats bucket, matching pre-existing behaviour. The method is still surfaced on `OtlpExactGroup` for OTLP data points. No breaking changes.
Testing: 154 nextest tests pass across `libdd-trace-stats` and `libdd-data-pipeline`. Companion: DataDog/dd-trace-py#18354.