Hi there! As discussed on discord, this is a proposal/design to add OTel traces to the SDK so workflow runs and step attempts show up alongside the rest of a user's traces, with trace context + baggage flowing across replays and child workflows.
The proposal covers workflow traces only; it does not include OTel for metrics, logging, backend spans, DB migrations, etc.
Traces
Example trace
A process-order workflow that calls a child workflow, waits up to an hour for a confirmation signal, cools down for 30 seconds, and then notifies the customer (with a retry). The parent runs across five executions; the child finishes in one.
```ts
defineWorkflow({ name: "process-order" }, async ({ step, input }) => {
  await step.run({ name: "validate-order" }, () => { ... });
  await step.runWorkflow(fulfillShipment, { orderId: input.orderId });
  await step.waitForSignal({
    signal: `confirmed:${input.orderId}`,
    timeout: "1h",
  });
  await step.sleep("cooldown", "30s");
  await step.run({ name: "notify-customer" }, () => { ... });
});
```
Everything lives under the trace_id of the caller that invoked the workflow.
```
app.orders.handler                               (caller's HTTP span)
└── workflow.start process-order                 (creates run-A)
    │
    ├── workflow.execute process-order           [run-A execution 1]
    │   ├── step.run validate-order
    │   └── step.run_workflow fulfill-shipment   (creates run-B, parks)
    │       └── workflow.execute fulfill-shipment  [run-B execution 1]
    │           ├── step.run reserve-inventory
    │           └── step.run dispatch-courier
    │
    ├── workflow.execute process-order           [run-A execution 2, replay=true]
    │   ├── step.run_workflow fulfill-shipment   (child done, OK)
    │   └── step.wait_for_signal                 (parks until signal or timeout)
    │
    ├── workflow.execute process-order           [run-A execution 3, replay=true]
    │   ├── step.wait_for_signal                 (signal received, OK)
    │   └── step.sleep cooldown                  (parks until t+30s)
    │
    ├── workflow.execute process-order           [run-A execution 4, replay=true]  ERROR
    │   └── step.run notify-customer             (fails, schedules retry)
    │
    └── workflow.execute process-order           [run-A execution 5, replay=true]
        └── step.run notify-customer             (prior_failures=1, OK)
```
Things to note:
- All executions of a run are siblings under the original `workflow.start` span, not nested in each other. The carrier captured at run creation is what every execution extracts.
- Replayed steps emit no spans. Execution 2 doesn't re-emit `validate-order`, and execution 5 only re-emits `notify-customer` because that's the one that failed and needs another attempt.
- `step.wait_for_signal` emits twice — once when set up (execution 2) and once when the signal is observed (execution 3). `step.sleep` only emits when set up (execution 3); the resume is harness-driven and silent.
Decisions
I've made the following initial decisions, happy to be challenged on any of them :)
Caller is the remote parent of every execution.
All executions share the caller's trace ID; people would expect "this request kicked off this workflow" to render as parent/child in their trace viewer.
Long sleeps produce late spans.
A workflow that sleeps for a week produces spans a week after the caller's span ended. Backends handle this gracefully but the trace view may split. workflow_run.id is on every span for cross-trace queries.
Carrier lives in `workflow_runs.context.telemetry`.
We use `propagation.inject`/`extract`, so traceparent + tracestate + baggage all ride along, dispatched to whatever propagators the user has registered. (I think this is the column you mentioned in Discord?) Context is captured at the time the workflow is invoked; in the above example:
```jsonc
// run-A.context.telemetry
{
  "traceparent": "00-<trace_id>-<caller_span_id>-01",
  "baggage": "tenant.id=acme,user.id=123"
}

// run-B.context.telemetry
{
  "traceparent": "00-<trace_id>-<run_A_step_run_workflow_span_id>-01",
  "baggage": "tenant.id=acme,user.id=123"
}
```
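To make the stored carrier concrete, here's a hedged sketch of decoding one back into its parts. The helper name and shapes are hypothetical; the field layout follows the W3C Trace Context (`<version>-<trace_id>-<parent_span_id>-<flags>`) and Baggage formats.

```typescript
// Hypothetical helper: decode workflow_runs.context.telemetry into its parts.
type ParsedCarrier = {
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
  baggage: Record<string, string>;
};

function parseTelemetryCarrier(carrier: {
  traceparent: string;
  baggage?: string;
}): ParsedCarrier {
  // traceparent per W3C Trace Context: version-traceid-spanid-flags.
  const [, traceId, parentSpanId, flags] = carrier.traceparent.split("-");
  const baggage: Record<string, string> = {};
  for (const pair of (carrier.baggage ?? "").split(",")) {
    const [key, value] = pair.split("=");
    if (key && value !== undefined) baggage[key.trim()] = value.trim();
  }
  return { traceId, parentSpanId, sampled: flags === "01", baggage };
}
```

In practice the SDK would never parse this itself — `propagation.extract` hands the carrier to whatever propagators are registered — but this is what rides along in the column.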
Replayed steps don't emit spans.
Cached history hits would just duplicate the work that already showed up on an earlier execution.
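A minimal sketch of this rule, assuming a per-run history keyed by step name (the names and shapes here are assumptions, not the SDK's actual internals):

```typescript
// Assumed shape of a recorded step outcome from an earlier execution.
type StepRecord = { status: "ok" | "failed"; priorFailures: number };

// Emit a span only when the step will actually do work this execution:
// a cached success replays silently, a failure gets a fresh attempt.
function shouldEmitSpan(
  history: Map<string, StepRecord>,
  stepName: string,
): boolean {
  const prior = history.get(stepName);
  if (prior?.status === "ok") return false; // cache hit: replay silently
  return true; // first encounter, or retry after a failure
}
```

This matches the example trace: on execution 5, `validate-order` is a cached success (no span) while `notify-customer` has a prior failure and re-emits.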
Spans
Span names are `<operation> <name>`, where `<name>` is the workflow or step name.
| Span | When |
| --- | --- |
| `workflow.start <workflow.name>` | `client.runWorkflow()` is called — wraps carrier injection + run creation |
| `workflow.execute <workflow.name>` | Once per execution |
| `step.run <step.name>` | A `step.run` actually runs (not a cache hit) |
| `step.sleep <step.name>` | First encounter — creates the attempt, ends OK |
| `step.run_workflow <step.name>` | First encounter (creates the child) and on the execution that observes the child finish |
| `step.wait_for_signal` (or `step.wait_for_signal <step.name>` when user-supplied) | First encounter (creates the attempt) and on the execution that observes delivery or timeout |
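The naming rule above reduces to something like the following (helper name hypothetical):

```typescript
// `<operation> <name>`; the name part is dropped when the step has no
// user-supplied name, e.g. an anonymous wait_for_signal.
function spanName(operation: string, name?: string): string {
  return name ? `${operation} ${name}` : operation;
}
```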
Attributes
Common: `workflow.name`, `workflow.version`, `workflow_run.id`, `workflow_run.namespace_id`, `workflow_run.attempt`, `workflow_run.replay`, `worker.id`.

On step spans: `step.name`, `step.kind` (`function` / `sleep` / `workflow` / `signal-wait`), `step_attempt.id`, `step_attempt.prior_failures`.

Kind-specific:

- `sleep` adds `step.resume_at`
- `run_workflow` adds `child_workflow.name`, `child_workflow_run.id`, `step.timeout_at`
- `wait_for_signal` adds `signal.name`, `step.timeout_at`
Inputs, outputs, signal data, etc. are not on spans, as any of these may contain PII that we don't want to ship to users' vendors without their knowledge.
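As a sketch, assembling the common + step attributes could look like this. The input shapes are assumptions for illustration; only the attribute keys mirror the lists above.

```typescript
// Assumed input shapes; not the SDK's real types.
type RunInfo = { id: string; namespaceId: string; attempt: number; replay: boolean };
type StepInfo = {
  name: string;
  kind: "function" | "sleep" | "workflow" | "signal-wait";
  attemptId: string;
  priorFailures: number;
};

// Kind-specific keys (step.resume_at, signal.name, ...) would be merged
// in the same way by the individual step wrappers.
function stepSpanAttributes(
  run: RunInfo,
  step: StepInfo,
  workflowName: string,
  workerId: string,
) {
  return {
    "workflow.name": workflowName,
    "workflow_run.id": run.id,
    "workflow_run.namespace_id": run.namespaceId,
    "workflow_run.attempt": run.attempt,
    "workflow_run.replay": run.replay,
    "worker.id": workerId,
    "step.name": step.name,
    "step.kind": step.kind,
    "step_attempt.id": step.attemptId,
    "step_attempt.prior_failures": step.priorFailures,
  };
}
```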
Status
- `workflow.execute` — OK on completion or park; ERROR on fail or cancel.
- `step.*` — OK on success or first-encounter park; ERROR + `recordException` on failure. A `run_workflow` resolution is ERROR if the child failed, canceled, or timed out.
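These rules boil down to a small mapping; a sketch, with the outcome names assumed:

```typescript
// Assumed outcome names; "parked" covers sleeps/waits that suspend the run.
type ExecutionOutcome = "completed" | "parked" | "failed" | "canceled";

function executionSpanStatus(outcome: ExecutionOutcome): "OK" | "ERROR" {
  return outcome === "failed" || outcome === "canceled" ? "ERROR" : "OK";
}

// Child-workflow resolution: anything short of success is an error.
type ChildResolution = "succeeded" | "failed" | "canceled" | "timed_out";

function childResolutionStatus(resolution: ChildResolution): "OK" | "ERROR" {
  return resolution === "succeeded" ? "OK" : "ERROR";
}
```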
Questions
- Do we want to have a hard dependency on `@opentelemetry/api` in the core, or do we want to ship a `@openworkflow/otel` sidecar package that wires all this up with some kind of middleware/interceptor pattern? Guidance here would be appreciated.
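For reference, the sidecar option could look roughly like this: the core owns a tiny tracer interface with a no-op default, and `@openworkflow/otel` (or user middleware) registers a real implementation that delegates to `@opentelemetry/api`. All names here are hypothetical.

```typescript
// Minimal tracer surface the core would own; zero OTel imports here.
interface WorkflowTracer {
  startSpan(
    name: string,
    attrs: Record<string, unknown>,
  ): { end(error?: Error): void };
}

// No-op default: tracing costs nothing unless something is registered.
const noopTracer: WorkflowTracer = {
  startSpan: () => ({ end() {} }),
};

let activeTracer: WorkflowTracer = noopTracer;

// Registration hook the sidecar package (or the user) calls once at startup.
function registerWorkflowTracer(tracer: WorkflowTracer): void {
  activeTracer = tracer;
}

function currentTracer(): WorkflowTracer {
  return activeTracer;
}
```

The trade-off is the usual one: a direct `@opentelemetry/api` dependency is simpler and the API package is deliberately lightweight, while the interface keeps the core dependency-free at the cost of a second package to version and document.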
I've been looking for something like openworkflow for a while now, so thank you for actually making it happen. Any feedback is much appreciated. Disclaimer: I went back and forth with Opus 4.7 for a couple of hours to generate the above; I've tried to verify any assumptions it made, but apologies if I missed anything 🙏