Skip to content

OTEP: Thread Context: Sharing Thread-Level Information with external readers#4947

Open
scottgerring wants to merge 28 commits into
open-telemetry:mainfrom
scottgerring:profiling-tls-context
Open

OTEP: Thread Context: Sharing Thread-Level Information with external readers#4947
scottgerring wants to merge 28 commits into
open-telemetry:mainfrom
scottgerring:profiling-tls-context

Conversation

@scottgerring

@scottgerring scottgerring commented Mar 16, 2026

Copy link
Copy Markdown
Member

Changes

External readers like the OpenTelemetry eBPF Profiler operate outside the instrumented process and cannot collect information about active OpenTelemetry traces running within the process they are sampling. We (@ivoanjo and @scottgerring) propose a mechanism for OpenTelemetry SDKs to publish thread-level attributes — including trace ID, span ID, and configurable custom attributes — through a standard format based on the ELF Thread Local Storage (TLSDESC) dialect.

Because this mechanism relies on having a native component and knowing when a runtime switches contexts, we consider it optional for SDKs to support, as some runtimes (or even runtime versions) may not be able to feasibly/efficiently (or undesirable, maintenance-wise) to implement it.

When a request context is attached or detached from a thread, the SDK publishes select information to a thread-local variable that external readers such as the eBPF profiler can discover and read. This enables correlation of profiling samples with request context, even when the active span was not sampled by the SDK.

This builds on and extends OTEP 4719: Process Context, using its process context mechanism to store the static, process-scoped reference data (schema version and attribute key map) that the thread-local records reference.

Why open as draft: OTEP 4719 is a dependency for this OTEP, and thus we'll need to wait for that OTEP to land to ensure we have a solid underpinning to build on.

This OTEP is based and heavily inspired on the custom-labels work by [Polar Signals](https://www.polarsignals.com/) and the universal profiling integration by Elastic — big thanks to them for the prior art and inspiration.and everyone that collaborated with us on the draft google doc this is based on.

@scottgerring scottgerring force-pushed the profiling-tls-context branch from 3cf6426 to 047db5a Compare March 26, 2026 09:47
@scottgerring

Copy link
Copy Markdown
Member Author

Now that #4719 has been accepted, this one is ready for review!

@scottgerring scottgerring marked this pull request as ready for review April 8, 2026 07:49
Comment thread oteps/profiles/4947-thread-ctx.md Outdated
Comment thread oteps/profiles/4947-thread-ctx.md Outdated
Comment thread oteps/profiles/4947-thread-ctx.md Outdated
@scottgerring scottgerring changed the title OTEP: Thread Context: Sharing Thread-Level Information with the OpenTelemetry eBPF Profiler OTEP: Thread Context: Sharing Thread-Level Information with external readers Apr 9, 2026
Co-authored-by: Attila Szegedi <szegedi@users.noreply.github.com>
@linux-foundation-easycla

linux-foundation-easycla Bot commented Apr 13, 2026

Copy link
Copy Markdown

CLA Signed

The committers listed above are authorized under a signed CLA.

Comment thread oteps/profiles/4947-thread-ctx.md Outdated

Because this mechanism relies on having a native component and knowing when a runtime switches contexts, we consider it optional for SDKs to support, as some runtimes (or even runtime versions) may not be able to feasibly/efficiently implement it.

The TLSDESC-based publication mechanism and the in-memory **Thread Local Context Record** format are intentionally separable. Runtimes that cannot efficiently expose context through TLSDESC are not required to implement this mechanism. However, we highly recommend reusing the same in-memory record format when publishing equivalent context through a runtime-specific discovery mechanism, rather than defining a runtime-specific payload format. Such mechanisms are out of scope for this OTEP, but reusing the record layout where practical allows readers to share parsing logic across runtimes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both go and node do end up mapping their own high-level things onto OS threads, so there is a world where they could use this thread-local context as well.

Thus at least for me the distinction is about what we can be implemented efficiently as a library -- it doesn't matter that "theoretically" someone could change go/node directly to update the thread-local context when they swap async execution scopes or goroutines, because we need a solution that works today.

This to say, in terms of wording, I think that's what's interesting to get across?

Comment thread oteps/profiles/4947-thread-ctx.md
Comment thread oteps/profiles/4947-thread-ctx.md
Comment thread oteps/profiles/4947-thread-ctx.md Outdated

### Thread Local Context Record

This is the attached thread record itself. SDK-side implementations may choose to hold multiple instances of this for active spans, and attach/detach them by setting the TLS to point to the appropriate entry. We err on the side of simplicity and support stringy (utf-8 bytes) attributes only.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either one works... And I have a small suggestion -- perhaps we could include a "hexdump" of a tiny example to clarify how it looks in practice?

We didn't do this for process context since that's mostly protobuf, but folks bootstrapping their own readers could use as a first test to check their parsing.

| :-------------- | :--------- | :--------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------- |
| trace-id | | uint8[16] | In W3C Trace Context format. Zeroes can be used to indicate none active. If either of trace-id/span-id are set, both must be set. |
| span-id | | uint8[8] | In W3C Trace Context format. |
| valid | | uint8 | This value is set to 1 when the record is valid. Consumers should ignore this record if any other value is set when they read. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this :D

We actually already have some text about this later on in the doc:

A SDK should choose to either set/unset the TLS pointer itself, or the valid flag, but not both.

The intention of this design is to enable flexibility for the writers. We envision that some writers may choose to keep a fixed record for a given thread and can thus mutate it in-place, and that other writers may instead keep the record associated with some other high-level concept (coroutine, request, etc) and swap out the pointer as needed when they become active on any given thread.

Comment thread oteps/profiles/4947-thread-ctx.md Outdated
| attrs-data | | uint8[] | A byte buffer containing the attributes themselves. Its total length is given by `attrs-data-size`. |
| | [x].key | uint8 (*See below for alternative) | Index into the key table |
| | [x].length | uint8 (*See below for alternative) | Length of val string |
| | [x].val | uint8[length] (utf-8 bytes) | Inline array of the string value itself. Exactly 'length' bytes appear here, before the next record begins. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was discussing this with Scott and I realized I wasn't quite sure I fully understood this suggestion.

Are you suggesting:

  1. That the overall "Thread-Local Context Record" should have a total size that is aligned (so that e.g. people don't create one that is "exactly 613 bytes")?
  2. That inside attrs-data the key/length/bytes entries should be aligned?
  3. (Something else?)

Comment thread oteps/profiles/4947-thread-ctx.md Outdated

When a request context is attached to a thread, the SDK:

1. Sets the TLS pointer to 0 to ensure readers see no record during construction

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could restrict this a bit; yet I think allowing "no record" does make things easier on the writer. It allows a new webserver request to start by setting a record and to end by clearing it (before a thread goes back to the idle thread pool or similar).

Requiring always an active record would disallow an implementation of this kind. So I'd classify the "be able to go back to no record" as a nice to have, although yes we could make do without it.

Comment thread oteps/profiles/4947-thread-ctx.md Outdated
Comment thread oteps/profiles/4947-thread-ctx.md

## Explanation

We propose a mechanism for OpenTelemetry SDKs to publish thread-level information reflecting the context of the active request, if any, through a standard format based on the Linux-specific ELF Thread-Local Storage (TLS) TLSDESC dialect.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean it will be targeting Linux only? (e.g., what about MacOS and Windows)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this first version yes, for a few reasons:

  • We expect the first consumers of this to be the eBPF Profiler and OBI both of which are Linux-specific
  • This mechanism is quite low-level and with very tight efficiency goals, so we believe it's better to evaluate how exactly they would map to other OS with more care, than just going "they also have some thread-locals right? just use whatever they have there"
  • While all three OS provide thread-locals, there is a very different security model across them (especially on macOS), and because of that it's not clear that this option would be the best one there

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment thread oteps/profiles/4947-thread-ctx.md Outdated
Comment thread oteps/profiles/4947-thread-ctx.md Outdated
Comment thread oteps/profiles/4947-thread-ctx.md Outdated
Comment thread oteps/profiles/4947-thread-ctx.md Outdated

**Why this layout**: It is scalable; if a writer is configured to publish only the required attributes and no custom fields, it can set `attrs-data-size` to 0; having read the first portion of the structure, the reader stops without having to read the rest.

**Cache Impact:** Likewise, a frugal writer may aim to keep the entire record under 64 bytes (the typical size of a cache line) in which case we might expect the entire record is in cache after the first read. This format takes 28 bytes for the lead in, then an overhead of 2 bytes per key-value pair. For a single pair, we'd have 28 bytes for the value; for two, 26 bytes total. If we expected to track attributes such as `path` and `method`, this means we'd realistically expect to fit in a single cache line of 64 bytes. We recommend keeping at or under 640 bytes for the total record [to match the limit](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/tree/main/design-docs/00002-custom-labels#proposed-solution) on the OTel eBPF Profiler.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a single pair, we'd have 28 bytes for the value; for two, 26 bytes total.
This sounds unexpected to me. Could we rephrase it to make it clear how we get to 28 bytes for a single value but 26 bytes for two values?

Comment thread oteps/profiles/4947-thread-ctx.md
Comment thread oteps/profiles/4947-thread-ctx.md
| :-------------- | :--------- | :--------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------- |
| trace-id | | uint8[16] | In W3C Trace Context format. Zeroes can be used to indicate none active. If either of trace-id/span-id are set, both must be set. |
| span-id | | uint8[8] | In W3C Trace Context format. |
| valid | | uint8 | This value is set to 1 when the record is valid. Consumers should ignore this record if any other value is set when they read. |

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about making valid future proof and allow us to make some changes in the future by adding more flags?

Suggested change
| valid | | uint8 | This value is set to 1 when the record is valid. Consumers should ignore this record if any other value is set when they read. |
| flag | | uint8 | This flag is set to 1 when the record is valid. Consumers should ignore this record if any other value is set when they read. All other values are reserved and treated as invalid |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, we already have the _reserved we could use for flags, so that could be an alternative as well. (Although we're not mandating it should be all zeros for the first version of the spec and perhaps we should?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(just adding that i'm actively not going to touch this one until we're all aligned on the need for valid)

Comment thread oteps/profiles/4947-thread-ctx.md Outdated
Comment thread oteps/profiles/4947-thread-ctx.md
Comment thread oteps/profiles/4947-thread-ctx.md
Co-authored-by: Ivo Anjo <ivo@ivoanjo.me>
Co-authored-by: Florian Lehner <florianl@users.noreply.github.com>
@scottgerring scottgerring requested a review from a team as a code owner June 9, 2026 11:39

> **Note:** The `threadlocal.*` keys are defined here rather than as semantic conventions because they are inter-process coordination metadata, not telemetry attributes, and are not expected to appear in OTLP exports.

The exact format used will be the `repeated KeyValue` protobuf structure from the `ProcessContext.attributes` field standardized in OTEP-4719. A stringified representation of this showing the usage of the elements of that schema along with some example values:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be preferring a write-optimised format here?

Protobufs are size-prefixed - so if you go to mutate this dictionary, you may need to rewrite the entire block or otherwise keep the block in memory with various optimisations. Is that desired?

```

**Why:** this mechanism separates static, process-scoped data from the TLS storage, so that a reader can read it once and not every time it samples a thread.
This reduces the cost of both writing and reading thread samples, while retaining flexibility to store an arbitrary set of extra attributes on samples as required.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern here is whether or not we know the dictionary ahead of time - What does that look like in instrumentation?

I realize we may be able to do this in OBI - how does this work generically in an API/SDK component?

| :-------------- | :--------- | :--------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------- |
| trace-id | | uint8[16] | In W3C Trace Context format. Zeroes can be used to indicate none active. If either of trace-id/span-id are set, both must be set. |
| span-id | | uint8[8] | In W3C Trace Context format. |
| valid | | uint8 | This value is set to 1 when the record is valid. Consumers should ignore this record if any other value is set when they read. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should also include trace flags here somewhere

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting point: This didn't come up so far as none of the readers or similar prior implementations we've seen wanted trace flags.

I'm kinda leaning towards stating it could be included in the (optional) attributes section in the future if readers start wanting this info... 👀

The other alternative would be to take over the _reserved byte that we added for alignment after valid, since that'd be enough to carry traceflags.

| _reserved | | uint8 | One spare byte here to align attrs-data-size at two byte boundary. |
| attrs-data-size | | uint16 | Size of `attrs-data`. This lets the reader know when it has consumed all `attrs-data` records within the TLS buffer. The total record is recommended to stay at or under 640 bytes. |
| attrs-data | | uint8[] | A byte buffer containing the attributes themselves. Its total length is given by `attrs-data-size`. |
| | [x].key | uint8 (*See below for alternative) | Index into the key table. Readers MUST ignore entries whose key index is outside `threadlocal.attribute_key_map`. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC - this means we need to update the overall dictionary from ANY thread.

I have concerns about that - how do we do it safely and notfiy any reader that this memory has changed? How often does this lead to a cache flush in hot paths?

* If it does, collect them
* Note down the TLS offset for **Thread-Local Reference Data** discovered

If the `threadlocal.*` keys are not present in the process context at this point, the reader should defer TLS symbol discovery and re-run step 1.2 when the process context is next updated. Readers can detect process context updates using the polling or prctl-hook mechanisms described in OTEP-4719.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the bit that makes me the most nervous - Does this mean we need to understand all possible keys at startup? Our APIs are dynamic here for what can go into context - does this come with a mechanism to lock that down?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.