Skip to content

Fix trace ID collision in DDEntryPoint.continue()#1270

Merged
kubukoz merged 5 commits intotypelevel:mainfrom
brianmowen:fix/ddentrypoint-ignore-active-span
Feb 4, 2026
Merged

Fix trace ID collision in DDEntryPoint.continue()#1270
kubukoz merged 5 commits intotypelevel:mainfrom
brianmowen:fix/ddentrypoint-ignore-active-span

Conversation

@brianmowen
Copy link
Contributor

When using natchez-datadog with cats-effect in a high-concurrency scenario, multiple unrelated requests can end up with the same trace ID. This causes their spans to appear as siblings in Datadog, making distributed tracing unreliable.

Root Cause

In DDEntryPoint.scala, the root() method correctly uses ignoreActiveSpan():

override def root(name: String, options: Span.Options): Resource[F, Span[F]] = {
  def initSpan(): F[ot.Span] =
    Sync[F]
      .delay(tracer.buildSpan(name).ignoreActiveSpan())  // ✓ Correct
      // ...
}

However, continue() does not use ignoreActiveSpan():

override def continue(name: String, kernel: Kernel, options: Span.Options): Resource[F, Span[F]] = {
  def initSpan(): F[ot.Span] =
    Sync[F]
      .delay {
        val spanContext = tracer.extract(Format.Builtin.HTTP_HEADERS, new TextMapAdapter(kernel.toJava))
        tracer.buildSpan(name).asChildOf(spanContext)  // ✗ Missing ignoreActiveSpan()
      }
      // ...
}

The Datadog tracer uses thread-local storage for the "active span". When continue() is called:

  1. If the kernel has no tracing headers (common for external requests), tracer.extract() returns null
  2. asChildOf(null) is a no-op for setting the parent from headers
  3. Without ignoreActiveSpan(), the tracer falls back to using whatever span is currently "active" on that thread
  4. In fiber-based runtimes like cats-effect, fibers don't have a 1:1 mapping to threads
  5. A fiber may run on a thread that has another fiber's span as the "active span"
  6. Result: The new span inherits the wrong trace ID

Why continueOrElseRoot Doesn't Help

override def continueOrElseRoot(name: String, kernel: Kernel, options: Span.Options): Resource[F, Span[F]] =
  continue(name, kernel, options).flatMap {
    case null => root(name, options)  // Fallback never triggers!
    case span => span.pure[Resource[F, *]]
  }

The fallback to root() only happens if the span itself is null. But continue() always returns a valid span - it just has the wrong parent. The null check doesn't trigger for missing/invalid headers.

Reproduction

We created a minimal reproduction that demonstrates the bug:

test("DDTracer trace ID collision under concurrency") {
  DDTracer.entryPoint[IO] { builder =>
    IO.delay(builder.serviceName("test").build())
  }.use { ep =>
    for {
      traceIds <- Ref.of[IO, List[Option[String]]](List.empty)
      
      // Process 100 concurrent requests
      _ <- (1 to 100).toList.parTraverse { _ =>
        ep.continueOrElseRoot("request", Kernel(Map.empty)).use { span =>
          span.traceId.flatMap(id => traceIds.update(_ :+ id))
        }
      }
      
      ids <- traceIds.get
    } yield {
      val uniqueIds = ids.flatten.distinct
      // BUG: uniqueIds.size is ~20, not 100!
      assertEquals(uniqueIds.size, 100)
    }
  }
}

Results:

  • Expected: 100 unique trace IDs
  • Actual: ~20 unique trace IDs (varies based on thread scheduling)

Proposed Fix

Add ignoreActiveSpan() to continue() and only set the parent from the extracted spanContext if it's not null:

override def continue(name: String, kernel: Kernel, options: Span.Options): Resource[F, Span[F]] = {
  def initSpan(): F[ot.Span] =
    Sync[F]
      .delay {
        val spanContext = tracer.extract(
          Format.Builtin.HTTP_HEADERS,
          new TextMapAdapter(kernel.toJava)
        )
        val builder = tracer.buildSpan(name).ignoreActiveSpan()  // ← ADD THIS
        if (spanContext != null) builder.asChildOf(spanContext)
        else builder
      }
      .flatTap(addSpanKind(_, options.spanKind))
      .flatMap(options.links.foldM(_)(addLink[F](tracer)))
      .flatMap(builder => Sync[F].delay(builder.start()))

  // ... rest unchanged
}

Environment

  • natchez-datadog version: 0.3.8
  • Scala version: 2.13.18
  • cats-effect version: 3.x
  • Runtime: Fiber-based (cats-effect IO)

Impact

This bug affects any application using natchez-datadog with:

  • Concurrent request handling
  • Requests that don't carry incoming trace headers (external requests)
  • Fiber-based runtimes (cats-effect, ZIO via interop, etc.)

Add ignoreActiveSpan() to the continue() method to prevent the
Datadog tracer from inheriting thread-local active spans.

In fiber-based runtimes (cats-effect, ZIO), fibers don't have a
1:1 mapping to threads. Without ignoreActiveSpan(), when continue()
is called with an empty or invalid kernel, the tracer falls back
to the thread's currently active span, causing unrelated requests
to share trace IDs.

The root() method already correctly uses ignoreActiveSpan(), but
continue() was missing it. This fix ensures both methods behave
consistently and prevent trace ID collisions under high concurrency.
@mergify mergify bot added the datadog label Jan 30, 2026
Brian Owen added 3 commits January 30, 2026 13:48
Add ignoreActiveSpan() to the continue() method to prevent the
Datadog tracer from inheriting thread-local active spans.

In fiber-based runtimes (cats-effect, ZIO), fibers don't have a
1:1 mapping to threads. Without ignoreActiveSpan(), when continue()
is called with an empty or invalid kernel, the tracer falls back
to the thread's currently active span, causing unrelated requests
to share trace IDs.

The root() method already correctly uses ignoreActiveSpan(), but
continue() was missing it. This fix ensures both methods behave
consistently and prevent trace ID collisions under high concurrency.

Added test that verifies:
- Without fix: ~13 unique trace IDs for 100 concurrent requests
- With fix: 100 unique trace IDs (each request gets its own trace)
Copy link
Member

@kubukoz kubukoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks good, just one nitpick for the tests.

}
.use { ep =>
for {
traceIds <- AtomicCell[IO].of(List.empty[Option[String]])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: would a Ref suffice here? It doesn't looks like we're using evalModify or others, here nor in the other test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, updated here d184575

@kubukoz kubukoz merged commit 03993df into typelevel:main Feb 4, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants