Provide ability to capture continuations using the Context API#11663
Provide ability to capture continuations using the Context API#11663mcculls wants to merge 16 commits into
Conversation
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
This PR extends the datadog.context API to support capturing/resuming execution context via a new ContextContinuation abstraction, and wires dd-trace’s scope continuation mechanism to also implement that API (so Context.current().capture() can hold traces similarly to the legacy tracer continuation).
Changes:
- Introduces
ContextContinuation+ContextListenerand addsContext.capture()/ContextManager.capture(...)/ listener registration APIs. - Implements
capture(...)for both the defaultThreadLocalContextManagerand dd-trace’sContinuableScopeManager, and makesAgentScope.Continuationalso implementContextContinuation. - Adds extensive tests for continuation semantics, noop scopes, and listener lifecycle events.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| internal-api/src/main/java/datadog/trace/bootstrap/instrumentation/api/AgentScope.java | Makes AgentScope.Continuation also implement ContextContinuation and maps methods (resume→activate, release→cancel). |
| dd-trace-core/src/main/java/datadog/trace/core/scopemanager/ContinuableScopeManager.java | Adds ContextManager.capture(Context) implementation and an addListener(...) stub. |
| dd-trace-core/src/test/java/datadog/trace/core/scopemanager/ScopeManagerTest.java | Adds tests verifying Context API capture/continuation behavior holds/unblocks traces and activates spans. |
| components/context/src/main/java/datadog/context/Context.java | Adds Context.capture() default method. |
| components/context/src/main/java/datadog/context/ContextManager.java | Adds capture + listener APIs and a static register(ContextListener) helper. |
| components/context/src/main/java/datadog/context/ContextContinuation.java | New public API for capturing/resuming context across execution units. |
| components/context/src/main/java/datadog/context/ContextListener.java | New public listener API for attach/detach/capture/release events. |
| components/context/src/main/java/datadog/context/ThreadLocalContextManager.java | Implements capture/resume/release semantics and listener notification for the default manager. |
| components/context/src/main/java/datadog/context/NoopContextScope.java | New cached noop ContextScope implementation for same-context attach/resume no-ops. |
| components/context/src/main/java/datadog/context/EmptyContextContinuation.java | New noop continuation for root/empty context. |
| components/context/src/main/java/datadog/context/EmptyContext.java | Makes EmptyContext constructor private (enforces singleton). |
| components/context/src/main/java/datadog/context/TestContextManager.java | Delegates new capture and addListener APIs to the underlying manager. |
| components/context/src/test/java/datadog/context/ContextContinuationTest.java | New comprehensive tests for continuation lifecycle, concurrency, holds, and event ordering. |
| components/context/src/test/java/datadog/context/ContextManagerTest.java | Adds tests for noop-scope caching/collisions and listener behaviors. |
| components/context/src/test/java/datadog/context/ContextProvidersForkedTest.java | Updates custom manager stub to implement new APIs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
🟢 Java Benchmark SLOs — All performance SLOs passed
PR vs. master results
Commit: Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion. |
There was a problem hiding this comment.
Early review of the API with a lot of refinement comments - trying to nail the API right before merging it. The implementation feels quite close to the current one so far. It's also nice to see the AgentSpan.Continuation nicely retrofitted to the new ContextContinuation.
Glad to see this piece of context coming! 🤝
396e8a0 to
c8de592
Compare
|
By the way, my comments only are early feedback. I don't expect to have them to get answer in timely manner. |
2754977 to
d40fa11
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d40fa11a82
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
@dougqh I'd like to merge this today unless there are obvious blockers response from Doug (which somehow ended up edited into my comment?)
|
| * }); | ||
| * }</pre> | ||
| * | ||
| * <p>If a continuation is never resumed (e.g. a task is cancelled before it runs), you must release |
There was a problem hiding this comment.
I'm concerned by the requirement to explicitly release.
There are places where that's hard to do.
I also think it complicates handling of dispatch loops. For dispatch loops, I rather guard the loop with ignoring incoming context - rather than stopping it on the originating end.
I say that because dispatch loops are often set-up lazily with a varied set of construction points, so defending against all of them seems messier.
I think if we could come up with a system without this requirement; it would clean-up instrumentation considerably. Right now, I feel like we're taking a step backwards.
There was a problem hiding this comment.
TBH that decision is best left up to the consumer of the capture and release events.
The code here as it stands won't cause any resource leak per se (the continuation is not kept in any internal cache) - what it will do is delay calling the release event until the conditions have been met.
Now the tracer, as a consumer of that event, can decide to forgo waiting on the release event - if it determines that there is either no need to hold back the trace, or that continuing to wait will cause resource issues. It is better placed to make that decision because it knows how many traces are pending, etc.
This avoids overcomplicating the context layer while still allowing downstream consumers to decide how to handle long-running situations. The alternative would be to try and use some kind of weak reference, which would introduce other issues around liveliness.
There was a problem hiding this comment.
I think it would be better to complicate the context layer rather than complicating the numerous instrumentations. But mostly, I just wish we explored this as a group before the decision was made.
There was a problem hiding this comment.
No, this is about the consumer of the events - i.e. the tracer core - not instrumentations.
Instrumentations are what trigger these events, which is why I want to keep the context layer simple - rather than second guessing when to bypass the "rules" and send additional events to the tracer core.
There was a problem hiding this comment.
I'm not proposing that the context layer second guess when to bypass the rules.
But what I'm saying is that a rule that assumes that the user code is well-behaved is problematic, I don't regard that as a bypass. For me, it is just about maintaining our requirement that we "do no harm" even when the user code is misbehaving.
So I'm just asking for the system to be resilient to misuse, that doesn't necessarily mean doing some fix-up -- just not allowing for runaway memory consumption.
|
|
||
| ContextListener[] ls = listeners; | ||
| notifyDetach(previous, ls); | ||
| holder.current = context; |
There was a problem hiding this comment.
It looks like this can grow in and unbounded fashion.
We need to add a guard against that.
There was a problem hiding this comment.
Technically this is guarded by the call stack size - also I would prefer not to add book-keeping until we verify it is needed because it a) can end up consuming resources more than protecting them and b) can lead to objects escaping from the stack which is worse then if we'd kept things simple.
There was a problem hiding this comment.
Doesn't that presume that the end user uses the API correctly, I don't think that's a safe assumption.
We're really not suppose to have any unbounded data structures, so the Context chain is worrying from my perspective. The fact that is now Context and not Scope doesn't change the situation.
There was a problem hiding this comment.
Context scopes are not collected in this layer because the scope stack has gone away - the only place they can accumulate is on the call stack. If a method erroneously calls activate multiple times and throws the scope away then there won't be a leak of scopes because there is no scope stack pinning them in the heap.
To leak context scopes you would need to call the API multiple times with different contexts, and keep the scopes alive in a way where they escape onto the heap. I'm just wary that by re-introducing state back into scopes (or any kind of "stack") that we'll go back to heap-allocating scopes which results in more allocations and more garbage to collect vs. the simple approach used here.
There was a problem hiding this comment.
Yes, what you are saying is true for Scopes; that's correct. However, the Context linked list is itself unbounded. While misuse may be rare or hard, I still think we should guard against it.
And I'm not proposing putting state back into Scope or Continuation. Continuation need only be an immutable reference to Context.
I'm actually trying to move to a model with less state in the user visible objects by moving the linked list out of Context.
There was a problem hiding this comment.
there is no linked list in Context - Context is an immutable object with no links to other contexts
| @Override | ||
| public void close() { | ||
| if (!closed) { | ||
| // check for out-of-order close to avoid corrupting the current state |
There was a problem hiding this comment.
Even if we don't have the same degree of forgiveness in the API, I'd at least like to see the strict mode restored so we can diagnose problems.
There was a problem hiding this comment.
Strict mode will be added in a separate PR
There was a problem hiding this comment.
What's the time frame? Before the next release?
There was a problem hiding this comment.
Strict mode will be added before ThreadLocalContextManager becomes the default implementation.
The default manager for the next release will still be ContinuableScopeManager. This PR is a migration step to get instrumentations away from AgentScope and related structures and towards a lighter API.
This PR does not change the default implementation. It is extending the API to help move instrumentations away from the complex Agent API onto a slimmer more focused API. It does extend I also disagree that the data structure is unbounded, but that's best discussed off-line. Benchmarks so far show that the simpler approach actually reduces memory and overhead due to fewer objects escaping onto the heap. |
Benchmarking shows these scopes aren't alive long enough to pay for the caching
db47043 to
1d73e24
Compare
|
For the most part, I think we should settle the design before discussing performance, but I did want to raise some concerns about the escape analysis claims. Escape analysis in C2 is actually a very brittle optimization, so unfortunately, a microbenchmark doesn't prove much. I fully expect in real world usage the Scope will get allocated because the Scope creation and Scope closing will often sit around I/O work that the JIT will fail to inline fully through. To really verify the optimization is happening, I'd want to see an allocation profile from a real application (for instance Spring PetClinic). If we really want to avoid the allocation, I think a claim check system is likely a better solution. Rather than return an object, we return just a number. Then on close, we provide the same number again, so the stack can be verified. And then we can wrap the claim check mechanism with a Scope object for public and legacy APIs. |
|
I think most the majority of the problems that I’ve raised can be solved with two key changes. Make Context Immutable By making Context immutable, we can solve the continuation leaking problems But doing so requires that Context not contain a pointer to another Context The simplest solution would be a Context[] in ContextHolder and an index into the Context[] The primary downside is that it makes the ContextHolder more expensive to construct which is bad for virtual threads While that is a bit ugly, it is at least nicely contained in one place Adding a NO_CONTEXT mechanism From a survey of the “leak” PRs, the primary problem is context leaking into a dispatch loop I think the cleanest way to that is introduce a NO_CONTEXT mechanism that ignores any surrounding Context already on the stack. |
8760c68 to
df32a37
Compare
I would need to have an example of the dispatch loop code pattern to help me visualize. But adding a mechanism might not fix wrong context API usage from customers because we won't be able to expose it. |
This helps reduce allocations when attaching the same context multiple times.
As stated elsewhere, I agree that the Scopes are bounded, but the Contexts are not. As for benchmarks, I would put API safety - avoiding the unhappy case ahead of maximizing performance in the happy case. I'd like to think we can find approach that can balance safety and performance. |
By dispatch loop, I mean anything of the form... This applies to servers, executors, message listener loops, etc. The trouble with dispatch loops is that they are often set-up lazily and capture the surrounding context. I had Claude do a sweep and this matches most (all) of the PRs where we're fixing context leaks currently. As for API, you are quite right that it isn't something for customers per se. |
If I can add something, the leakage fix PR we're doing so far is to fill the technical debts we have with instrumentations that have not addressed that well in the past. But I think that also is not directly related to this PR in particular but a wider topic |
The way the new context api is supposed to work is by attaching context to those work item using
That's the kind of issue we would have to deal with as part of rolling out the new API. Making sure the new API is correctly leverage across all the instrumentations. Hence the tool for leakage detection from test runs - note that requires having tests with decent coverage. |
|
Thanks for clarifying the implementation during the meeting. |
Thanks - I'll think of a field name with fewer connotations (OTel uses |
bd26491 to
4f2f209
Compare
…urther reduce allocations
4f2f209 to
eb4bec9
Compare
What Does This Do
Introduces basic continuations and events to the Context API
Motivation
This is meant as a stepping stone from the old API towards a simpler foundation
Additional Notes
Benchmarking summary from Claude:
ThreadLocalContextManager: nested attach() calls add zero bytes because theContextScopeImplobjects are stack-allocated by escape analysis.ContinuableScopeManager: eachContinuableScopeobject adds ≈32 B because they escape to the heap via theScopeStacklinked structure.Contributor Checklist
type:and (comp:orinst:) labels in addition to any other useful labelsclose,fix, or any linking keywords when referencing an issueUse
solvesinstead, and assign the PR milestone to the issue/merge. You can also:/merge --commit-message "..."/merge -c/merge -f --reason "reason"; please use this judiciously, as some checks do not run at the PR-level (note: the PR still needs to be mergeable, this will only skip the pre-merge build)Jira ticket: [PROJ-IDENT]