Skip to content

fix(telemetry): concurrency issue on endpoint unreachable#213

Closed
pablomartinezbernardo wants to merge 1 commit intomainfrom
pmartinez/unreachable-telemetry-concurrency
Closed

fix(telemetry): concurrency issue on endpoint unreachable#213
pablomartinezbernardo wants to merge 1 commit intomainfrom
pmartinez/unreachable-telemetry-concurrency

Conversation

@pablomartinezbernardo
Copy link
Copy Markdown
Contributor

@pablomartinezbernardo pablomartinezbernardo commented May 9, 2025

Description

Apparent when upgrading dd-trace-cpp for rum injection IIS e2e tests. This situation pops up when the request to telemetry takes longer than the drain time in the constructor.

The issue is sort of hiding in this part of the code

TelemetryProxy make_telemetry(const Ctor_param& init) {
  if (!init.configuration.enabled) return NoopTelemetry{};
  return Telemetry{init.configuration, init.logger,    init.client,
                   init.scheduler,     init.agent_url, init.clock};
}

TelemetryProxy& instance(
    const tracing::Optional<Ctor_param>& init = tracing::nullopt) {
  static TelemetryProxy telemetry(make_telemetry(*init));
  return telemetry;
}

void init(FinalizedConfiguration configuration,
          std::shared_ptr<tracing::Logger> logger,
          std::shared_ptr<tracing::HTTPClient> client,
          std::shared_ptr<tracing::EventScheduler> event_scheduler,
          tracing::HTTPClient::URL agent_url, tracing::Clock clock) {
  instance(Ctor_param{configuration, logger, client, event_scheduler, agent_url,
                      clock});
}

A first Telemetry instance is created in make_telemetry. When constructing TelemetryProxy that instance is not passed as a reference, but instead the Telemetry(Telemetry&&) constructor is invoked, which moves members from the first instance to the second. The initial Telemetry instance which attempted to send the app-started event, calls back on error a bit over 2 seconds after (In my Windows machine, it takes ~2200ms to error), accessing a variable (counters_ in this case) that was moved from and now in an invalid state.

If the request takes shorter than the drain time, it is not an issue because when drain() exits all callbacks are already complete, and no more callbacks will be done to the original Telemetry object.

This is not a great solution, because the update to counters_ will be ignored, but created the PR for illustration purposes. A better solution might be to not do send_telemetry("app-started", app_started()); in the constructor and instead do it in the init function, which will avoid receiving callbacks in a moved-from instance.

Motivation

Additional Notes

Jira ticket: [PROJ-IDENT]

@pablomartinezbernardo pablomartinezbernardo requested a review from a team as a code owner May 9, 2025 12:14
@pablomartinezbernardo pablomartinezbernardo requested review from dmehala and removed request for a team May 9, 2025 12:14
@pablomartinezbernardo pablomartinezbernardo marked this pull request as draft May 9, 2025 12:14
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 9, 2025

Codecov Report

Attention: Patch coverage is 64.28571% with 5 lines in your changes missing coverage. Please review.

Project coverage is 86.96%. Comparing base (e14a5a4) to head (8743842).

Files with missing lines Patch % Lines
src/datadog/telemetry/telemetry_impl.cpp 64.28% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #213      +/-   ##
==========================================
- Coverage   87.02%   86.96%   -0.07%     
==========================================
  Files          80       80              
  Lines        5140     5154      +14     
==========================================
+ Hits         4473     4482       +9     
- Misses        667      672       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@pablomartinezbernardo pablomartinezbernardo marked this pull request as ready for review May 9, 2025 12:33
@pablomartinezbernardo pablomartinezbernardo marked this pull request as draft May 9, 2025 15:21
@pr-commenter
Copy link
Copy Markdown

pr-commenter bot commented May 9, 2025

Benchmarks

Benchmark execution time: 2025-05-13 13:04:43

Comparing candidate commit 8743842 in PR branch pmartinez/unreachable-telemetry-concurrency with baseline commit e14a5a4 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 1 metrics, 0 unstable metrics.

@pablomartinezbernardo pablomartinezbernardo force-pushed the pmartinez/unreachable-telemetry-concurrency branch 6 times, most recently from 39ce617 to 1b46b3c Compare May 12, 2025 10:36
@datadog-datadog-prod-us1
Copy link
Copy Markdown

Datadog Summary

✅ Code Quality    ✅ Code Security    ✅ Dependencies


Was this helpful? Give us feedback!

@pablomartinezbernardo pablomartinezbernardo force-pushed the pmartinez/unreachable-telemetry-concurrency branch from 0e3853f to 8743842 Compare May 13, 2025 13:00
@pablomartinezbernardo pablomartinezbernardo marked this pull request as ready for review May 13, 2025 13:21
@dmehala dmehala deleted the pmartinez/unreachable-telemetry-concurrency branch September 14, 2025 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants