Hi there! As discussed on discord, this is a proposal/design to add OTel traces to the SDK so workflow runs and step attempts show up alongside the rest of a user's traces, with trace context + baggage flowing across replays and child workflows.
The proposal covers workflow traces only; it does not include OTel for metrics, logging, backend spans, DB migrations, etc.
Traces
Example trace
A process-order workflow that calls a child workflow, waits up to an hour for a confirmation signal, cools down for 30 seconds, and then notifies the customer (with a retry). The parent runs across five executions; the child finishes in one.
```ts
defineWorkflow({ name: "process-order" }, async ({ step, input }) => {
  await step.run({ name: "validate-order" }, () => { ... });
  await step.runWorkflow(fulfillShipment, { orderId: input.orderId });
  await step.waitForSignal({
    signal: `confirmed:${input.orderId}`,
    timeout: "1h",
  });
  await step.sleep("cooldown", "30s");
  await step.run({ name: "notify-customer" }, () => { ... });
});
```
Everything lives under the trace_id of the caller that invoked the workflow.
```
app.orders.handler                               (caller's HTTP span)
└── workflow.start process-order                 (creates run-A)
    │
    ├── workflow.execute process-order           [run-A execution 1]
    │   ├── step.run validate-order
    │   └── step.run_workflow fulfill-shipment   (creates run-B, parks)
    │       └── workflow.execute fulfill-shipment  [run-B execution 1]
    │           ├── step.run reserve-inventory
    │           └── step.run dispatch-courier
    │
    ├── workflow.execute process-order           [run-A execution 2, replay=true]
    │   ├── step.run_workflow fulfill-shipment   (child done, OK)
    │   └── step.wait_for_signal                 (parks until signal or timeout)
    │
    ├── workflow.execute process-order           [run-A execution 3, replay=true]
    │   ├── step.wait_for_signal                 (signal received, OK)
    │   └── step.sleep cooldown                  (parks until t+30s)
    │
    ├── workflow.execute process-order           [run-A execution 4, replay=true]  ERROR
    │   └── step.run notify-customer             (fails, schedules retry)
    │
    └── workflow.execute process-order           [run-A execution 5, replay=true]
        └── step.run notify-customer             (prior_failures=1, OK)
```
Things to note:
- All executions of a run are siblings under the original `workflow.start` span, not nested in each other. The carrier captured at run creation is what every execution extracts.
- Replayed steps emit no spans. Execution 2 doesn't re-emit `validate-order`, and execution 5 only re-emits `notify-customer` because that's the one that failed and needs another attempt.
- `step.wait_for_signal` emits twice — once when set up (execution 2) and once when the signal is observed (execution 3). `step.sleep` only emits when set up (execution 3); the resume is harness-driven and silent.
Decisions
I've made the following initial decisions, happy to be challenged on any of them :)
Caller is the remote parent of every execution.
All executions share the caller's trace ID; people would expect "this request kicked off this workflow" to render as parent/child in their trace viewer.
Long sleeps produce late spans.
A workflow that sleeps for a week produces spans a week after the caller's span ended. Backends handle this gracefully but the trace view may split. workflow_run.id is on every span for cross-trace queries.
Carrier lives in `workflow_runs.context.telemetry`.
We use `propagation.inject`/`extract`, so traceparent + tracestate + baggage all ride along, dispatched to whatever propagators the user has registered. (I think this is the column you mentioned in Discord?) Context is captured at the time the workflow is invoked; in the above example:
```jsonc
// run-A.context.telemetry
{
  "traceparent": "00-<trace_id>-<caller_span_id>-01",
  "baggage": "tenant.id=acme,user.id=123"
}

// run-B.context.telemetry
{
  "traceparent": "00-<trace_id>-<run_A_step_run_workflow_span_id>-01",
  "baggage": "tenant.id=acme,user.id=123"
}
```
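To make the stored carrier concrete, here's a hedged sketch of decoding one back into its parts. The helper name and shapes are hypothetical; the field layout follows the W3C Trace Context (`<version>-<trace_id>-<parent_span_id>-<flags>`) and Baggage formats.

```typescript
// Hypothetical helper: decode workflow_runs.context.telemetry into its parts.
type ParsedCarrier = {
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
  baggage: Record<string, string>;
};

function parseTelemetryCarrier(carrier: {
  traceparent: string;
  baggage?: string;
}): ParsedCarrier {
  // traceparent per W3C Trace Context: version-traceid-spanid-flags.
  const [, traceId, parentSpanId, flags] = carrier.traceparent.split("-");
  const baggage: Record<string, string> = {};
  for (const pair of (carrier.baggage ?? "").split(",")) {
    const [key, value] = pair.split("=");
    if (key && value !== undefined) baggage[key.trim()] = value.trim();
  }
  return { traceId, parentSpanId, sampled: flags === "01", baggage };
}
```

In practice the SDK would never parse this itself — `propagation.extract` hands the carrier to whatever propagators are registered — but this is what rides along in the column.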
Replayed steps don't emit spans.
Cached history hits would just duplicate the work that already showed up on an earlier execution.
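A minimal sketch of this rule, assuming a per-run history keyed by step name (the names and shapes here are assumptions, not the SDK's actual internals):

```typescript
// Assumed shape of a recorded step outcome from an earlier execution.
type StepRecord = { status: "ok" | "failed"; priorFailures: number };

// Emit a span only when the step will actually do work this execution:
// a cached success replays silently, a failure gets a fresh attempt.
function shouldEmitSpan(
  history: Map<string, StepRecord>,
  stepName: string,
): boolean {
  const prior = history.get(stepName);
  if (prior?.status === "ok") return false; // cache hit: replay silently
  return true; // first encounter, or retry after a failure
}
```

This matches the example trace: on execution 5, `validate-order` is a cached success (no span) while `notify-customer` has a prior failure and re-emits.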
Spans
Span names are `<operation> <name>`, where `<name>` is the workflow or step name.
| Span | When |
| --- | --- |
| `workflow.start <workflow.name>` | `client.runWorkflow()` is called — wraps carrier injection + run creation |
| `workflow.execute <workflow.name>` | Once per execution |
| `step.run <step.name>` | A `step.run` actually runs (not a cache hit) |
| `step.sleep <step.name>` | First encounter — creates the attempt, ends OK |
| `step.run_workflow <step.name>` | First encounter (creates the child) and on the execution that observes the child finish |
| `step.wait_for_signal` (or `step.wait_for_signal <step.name>` when user-supplied) | First encounter (creates the attempt) and on the execution that observes delivery or timeout |
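The naming rule above reduces to something like the following (helper name hypothetical):

```typescript
// `<operation> <name>`; the name part is dropped when the step has no
// user-supplied name, e.g. an anonymous wait_for_signal.
function spanName(operation: string, name?: string): string {
  return name ? `${operation} ${name}` : operation;
}
```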
Attributes
Common: `workflow.name`, `workflow.version`, `workflow_run.id`, `workflow_run.namespace_id`, `workflow_run.attempt`, `workflow_run.replay`, `worker.id`.

On step spans: `step.name`, `step.kind` (`function` / `sleep` / `workflow` / `signal-wait`), `step_attempt.id`, `step_attempt.prior_failures`.

Kind-specific:

- `sleep` adds `step.resume_at`
- `run_workflow` adds `child_workflow.name`, `child_workflow_run.id`, `step.timeout_at`
- `wait_for_signal` adds `signal.name`, `step.timeout_at`
Inputs, outputs, signal data, etc. are not on spans, as any of these may contain PII that we don't want to ship to users' vendors without their knowledge.
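As a sketch, assembling the common + step attributes could look like this. The input shapes are assumptions for illustration; only the attribute keys mirror the lists above.

```typescript
// Assumed input shapes; not the SDK's real types.
type RunInfo = { id: string; namespaceId: string; attempt: number; replay: boolean };
type StepInfo = {
  name: string;
  kind: "function" | "sleep" | "workflow" | "signal-wait";
  attemptId: string;
  priorFailures: number;
};

// Kind-specific keys (step.resume_at, signal.name, ...) would be merged
// in the same way by the individual step wrappers.
function stepSpanAttributes(
  run: RunInfo,
  step: StepInfo,
  workflowName: string,
  workerId: string,
) {
  return {
    "workflow.name": workflowName,
    "workflow_run.id": run.id,
    "workflow_run.namespace_id": run.namespaceId,
    "workflow_run.attempt": run.attempt,
    "workflow_run.replay": run.replay,
    "worker.id": workerId,
    "step.name": step.name,
    "step.kind": step.kind,
    "step_attempt.id": step.attemptId,
    "step_attempt.prior_failures": step.priorFailures,
  };
}
```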
Status
- `workflow.execute` — OK on completion or park; ERROR on fail or cancel.
- `step.*` — OK on success or first-encounter park; ERROR + `recordException` on failure. A `run_workflow` resolution is ERROR if the child failed, canceled, or timed out.
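These rules boil down to a small mapping; a sketch, with the outcome names assumed:

```typescript
// Assumed outcome names; "parked" covers sleeps/waits that suspend the run.
type ExecutionOutcome = "completed" | "parked" | "failed" | "canceled";

function executionSpanStatus(outcome: ExecutionOutcome): "OK" | "ERROR" {
  return outcome === "failed" || outcome === "canceled" ? "ERROR" : "OK";
}

// Child-workflow resolution: anything short of success is an error.
type ChildResolution = "succeeded" | "failed" | "canceled" | "timed_out";

function childResolutionStatus(resolution: ChildResolution): "OK" | "ERROR" {
  return resolution === "succeeded" ? "OK" : "ERROR";
}
```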
Questions
- Do we want to have a hard dependency on `@opentelemetry/api` in the core, or do we want to ship a `@openworkflow/otel` sidecar package that wires all this up with some kind of middleware/interceptor pattern? Guidance here would be appreciated.
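For reference, the sidecar option could look roughly like this: the core owns a tiny tracer interface with a no-op default, and `@openworkflow/otel` (or user middleware) registers a real implementation that delegates to `@opentelemetry/api`. All names here are hypothetical.

```typescript
// Minimal tracer surface the core would own; zero OTel imports here.
interface WorkflowTracer {
  startSpan(
    name: string,
    attrs: Record<string, unknown>,
  ): { end(error?: Error): void };
}

// No-op default: tracing costs nothing unless something is registered.
const noopTracer: WorkflowTracer = {
  startSpan: () => ({ end() {} }),
};

let activeTracer: WorkflowTracer = noopTracer;

// Registration hook the sidecar package (or the user) calls once at startup.
function registerWorkflowTracer(tracer: WorkflowTracer): void {
  activeTracer = tracer;
}

function currentTracer(): WorkflowTracer {
  return activeTracer;
}
```

The trade-off is the usual one: a direct `@opentelemetry/api` dependency is simpler and the API package is deliberately lightweight, while the interface keeps the core dependency-free at the cost of a second package to version and document.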
I've been looking for something like openworkflow for a while now, so thank you for actually making it happen. Any feedback is much appreciated. Disclaimer: I went back and forth with Opus 4.7 for a couple of hours to generate the above; I've tried to verify any assumptions it made, but apologies if I missed anything 🙏