.NET: Fix DurableTask CustomStatus 16 KB overflow on multi-executor workflows #6775
kshyju wants to merge 3 commits
…ws (#5745) Publish only a bounded trailing window of events to the orchestration CustomStatus (tagged with an absolute EventsStartIndex) instead of the full cumulative event log, which could exceed the Durable Functions 16 KB CustomStatus cap and fail the orchestration. The complete event log still flows through the uncapped orchestration output and is drained by the streaming consumer at completion, so no event is lost or reordered. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR addresses Durable Functions’ 16 KB CustomStatus limit by publishing a bounded trailing window of workflow events (instead of the full cumulative event log) and introducing an absolute EventsStartIndex so streaming consumers can map window positions back to the full event sequence. This prevents orchestrations from failing when cumulative serialized events exceed the CustomStatus cap, while still returning the complete event log via orchestration output at completion.
Changes:
- Add event-windowing logic in `DurableWorkflowRunner` with conservative size budgeting and `EventsStartIndex` tracking.
- Update `DurableStreamingWorkflowRun` to drain events using `(window, startIndex)` semantics and backfill from final output on completion.
- Add unit tests for window sizing and streaming behavior across sliding windows / behind-window scenarios; update DurableTask changelog.
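The windowing step in the first bullet can be illustrated with a minimal sketch. It is not the PR's `BuildEventWindow`: the class name, the per-element overhead of 3 characters, and the choice to report `events.Count` as the start index for an empty window are all assumptions made for illustration. It only shows the general idea of walking backwards from the newest event until a character budget is exhausted.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not the PR's implementation.
public static class EventWindowSketch
{
    // Assumed cost model: JSON-escaped length plus two quotes and a comma.
    public static int ElementCost(string serializedEvent) =>
        JsonEncodedText.Encode(serializedEvent).ToString().Length + 3;

    // Returns the longest trailing run of events whose summed cost fits the
    // budget, together with the absolute index of the first event in that run.
    public static (List<string> Window, int StartIndex) BuildWindow(
        IReadOnlyList<string> events, int budgetChars)
    {
        int start = events.Count;
        int used = 0;
        while (start > 0)
        {
            int cost = ElementCost(events[start - 1]);
            if (used + cost > budgetChars)
            {
                break; // the next-older event does not fit; stop here
            }
            used += cost;
            start--;
        }

        var window = new List<string>(events.Count - start);
        for (int i = start; i < events.Count; i++)
        {
            window.Add(events[i]);
        }
        return (window, start);
    }
}
```

With three one-character events and a budget of 8, this sketch keeps the last two events and reports a start index of 1. If even the newest event exceeds the budget, the window is empty.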
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| dotnet/src/Microsoft.Agents.AI.DurableTask/Workflows/DurableWorkflowRunner.cs | Implements bounded trailing event window selection and cost estimation for live CustomStatus publishing. |
| dotnet/src/Microsoft.Agents.AI.DurableTask/Workflows/DurableWorkflowLiveStatus.cs | Adds EventsStartIndex to support absolute indexing of windowed live events. |
| dotnet/src/Microsoft.Agents.AI.DurableTask/Workflows/DurableStreamingWorkflowRun.cs | Updates event draining logic to respect EventsStartIndex and safely defer behind-window gaps to completion backfill. |
| dotnet/tests/Microsoft.Agents.AI.DurableTask.UnitTests/Workflows/DurableWorkflowRunnerEventWindowTests.cs | Adds targeted unit tests for event window sizing and cap compliance. |
| dotnet/tests/Microsoft.Agents.AI.DurableTask.UnitTests/Workflows/DurableStreamingWorkflowRunTests.cs | Adds streaming tests validating no duplicates/gaps with sliding windows and completion backfill behavior. |
| dotnet/src/Microsoft.Agents.AI.DurableTask/CHANGELOG.md | Adds an unreleased entry describing the fix. |
Automated Code Review
Reviewers: 5 | Confidence: 86%
✓ Correctness
This PR correctly implements a bounded trailing window for live workflow status events to avoid the 16 KB Durable Functions custom status overflow. The DrainNewEvents logic correctly maps absolute indices, defers when behind the window, and recovers all events at completion. The BuildEventWindow logic correctly selects a trailing window within budget, handles oversized events by returning an empty window, and accounts for pending events. The SerializedElementCost estimation is conservative (uses JsonEncodedText.Encode for exact escaping plus overhead). No correctness bugs found; the invariants (no duplicates, no gaps, no skips, order-preserving) hold across all scenarios including sliding windows, consumer-behind-window, and empty-window cases.
✓ Security Reliability
The PR correctly implements a bounded trailing event window to prevent the 16 KB custom status overflow. The drain/defer invariant in DrainNewEvents is sound, and the windowing logic in BuildEventWindow is correct. One reliability gap: EstimatePendingEventsCost uses raw string Length for the Input field instead of accounting for JSON escaping (unlike SerializedElementCost which correctly uses JsonEncodedText.Encode). Since Input contains serialized request data (likely JSON with escapable characters like quotes and backslashes), the actual serialized cost can exceed the estimate, potentially narrowing the 592-char safety margin in edge cases.
✓ Test Coverage
The test coverage for the new windowed live-status feature is strong. Both the producer side (BuildEventWindow with 5 unit tests covering within-budget, oversized single events, oversized mid-sequence, and pending-events budget reduction) and the consumer side (DrainNewEvents via 3 integration tests covering sliding windows, consumer-behind-window, and empty-window recovery) are well-exercised. The assertions are meaningful (checking exact event ordering, no duplicates, no gaps, and cap compliance). One notable gap is the absence of a backward-compatibility deserialization test: the PR claims adding EventsStartIndex is backward compatible (defaults to 0 for old payloads), but no test guards that contract.
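The backward-compatibility contract mentioned above could be guarded by a check along these lines. This is a sketch only: `LiveStatusSketch` is a hypothetical stand-in for `DurableWorkflowLiveStatus`, and it assumes default `System.Text.Json` options and PascalCase property names, which may differ from the project's actual serialization setup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical stand-in for the live status payload; only the shape matters here.
public sealed class LiveStatusSketch
{
    public List<string> Events { get; set; } = new();

    // An int property that is absent from the JSON keeps its default value of 0.
    public int EventsStartIndex { get; set; }
}

public static class BackCompatSketch
{
    // Deserializes a payload with default serializer options.
    public static LiveStatusSketch Parse(string json) =>
        JsonSerializer.Deserialize<LiveStatusSketch>(json)
        ?? throw new InvalidOperationException("payload was JSON null");
}
```

Parsing an old-style payload such as `{"Events":["a","b"]}` yields `EventsStartIndex == 0`, so window position 0 maps to absolute index 0, as it did before the field existed.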
✓ Failure Modes
The PR correctly implements a bounded trailing window for live status events to prevent the 16 KB custom status overflow. The core invariants are sound: SerializedElementCost uses the same default JavaScriptEncoder as the serializer (verified via DurableSerialization.cs using default options), DrainNewEvents correctly defers when behind the window and backfills from the full output at completion, and BuildEventWindow's budget math is conservative (7600 target vs 8192 cap). The fast-path aliasing of state.AccumulatedEvents is safe because SetCustomStatus is always called before any subsequent mutation in the same synchronous block. No silent failures, lost errors, or unrecoverable race conditions are introduced.
✗ Design Approach
The trailing-window/backfill design looks correct for the normal runner publish path, but one request-port path still bypasses that windowing entirely, so the overflow can still happen when a workflow reaches a pending input after already accumulating a near-limit live window. I also found that the pending-event budget math is not conservative for escaped JSON content, which can still let `CustomStatus` exceed the 16 KB cap for quote/backslash-heavy request payloads.
Flagged Issues
- `BuildEventWindow(..., state.LiveStatus.PendingEvents)` only protects writes that flow through `PublishEventsToLiveStatus`, but `ExecuteRequestPortAsync` still publishes `liveStatus` directly after `PendingEvents.Add(...)` (DurableExecutorDispatcher.cs:117-118). A workflow with an already-full event window can still overflow `SetCustomStatus` the moment it hits a `RequestPort`, so the design does not fully cover the pending-input code path it now budgets for.
- `EstimatePendingEventsCost` reserves only `pending.EventName.Length + pending.Input.Length` even though `PendingRequestPortStatus.Input` is serialized request data. Quotes, backslashes, and control characters in that JSON are escaped again when `DurableWorkflowLiveStatus` is serialized, so the estimate can undercount and still overflow the cap.
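The undercount described in the second flagged issue can be seen by comparing a string's raw length with its JSON-escaped length. The sketch below uses `JsonEncodedText.Encode` with its default encoder. The class and method names are hypothetical, and it is not a copy of `EstimatePendingEventsCost`.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper contrasting the two ways of measuring a string's cost.
public static class EscapedCostSketch
{
    // What a raw-length estimate would reserve.
    public static int RawLength(string value) => value.Length;

    // What the string occupies once embedded in a JSON string literal,
    // excluding the surrounding quotes.
    public static int EscapedLength(string value) =>
        JsonEncodedText.Encode(value).ToString().Length;
}
```

For a payload such as `{"city":"Oslo"}`, every embedded quote is replaced by a longer escape sequence, so `EscapedLength` exceeds `RawLength`. For plain text such as `abc` the two are equal. A budget built on the raw length is therefore only safe for inputs that need no escaping.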
Automated review by kshyju's agents
…t windowing
- Add UTF-8 BOM to the new test file to satisfy the repo .editorconfig charset rule (fixes check-format).
- EstimatePendingEventsCost now measures JSON-escaped length (JsonEncodedText.Encode) for EventName/Input so the reserve is conservative for quote/backslash-heavy payloads.
- Re-window the live status on the RequestPort publish path (DurableExecutorDispatcher) via new DurableWorkflowRunner.TrimLiveStatusToBudget so adding a large pending input to a near-full window cannot overflow the 16 KB custom status cap.
- Point the CHANGELOG bullet at the PR per the DurableTask area convention.
Add covering tests for TrimLiveStatusToBudget.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation & Context
`Microsoft.Agents.AI.DurableTask` republishes the workflow's full cumulative event log to the orchestration `CustomStatus` after every executor/superstep so that streaming clients can observe events while the orchestration is still running.
Durable Functions caps `CustomStatus` at 16 KB (UTF-16). On workflows with enough executors and/or large typed outputs, the accumulated event list grows past that cap, and the next status write throws, failing the entire orchestration. Otherwise-valid workflows therefore crash purely as a function of how many events they emit, which is a correctness/reliability problem rather than a usage error.
Description & Review Guide
What are the major changes?
- Instead of republishing the full cumulative event log to `CustomStatus`, the runner now publishes a bounded trailing window of the most recent events that fits a conservative char budget, tagged with a new absolute `EventsStartIndex` on `DurableWorkflowLiveStatus` so consumers can map window positions back to absolute indices.
- The complete event log still flows through the orchestration output (`DurableWorkflowResult.Events`), which is not subject to the 16 KB `CustomStatus` cap. The streaming consumer drains it from its last-read index at completion.
- `DurableStreamingWorkflowRun` drains relative to `EventsStartIndex` and defers (does not advance its read cursor) whenever it is positioned behind the start of the published window, guaranteeing it never skips an event that scrolled out of the live window before it was delivered.
- … `SuperstepState` for an O(1) fast path in the common under-budget case (avoids re-costing every event on each publish).
What is the impact of these changes?
- `EventsStartIndex` defaults to `0`, so an older consumer reading new status (or a new consumer reading old status) behaves exactly as before.
- … `Input` after the final publish, or a single event that also exceeds the much larger orchestration-output/backend limit, are separate concerns not addressed by this PR.
What do you want reviewers to focus on?
- … `DurableStreamingWorkflowRun.DrainNewEvents` (the consumer must never advance past `EventsStartIndex` while behind the window), and the char-budget math in `BuildEventWindow`/`SerializedElementCost` (the estimate is intentionally conservative so the serialized status stays ≤ 8192 chars).
Related Issue
Fixes #5745
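The consumer-side rule described under "What are the major changes?" (drain relative to `EventsStartIndex`, defer while behind the window, backfill from the full output at completion) can be sketched as follows. This is an illustration under assumptions, not the code in `DurableStreamingWorkflowRun.DrainNewEvents`: the class, method names, and string-typed events are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical consumer-side sketch; cursor is the absolute index of the
// next event the consumer has not yet delivered.
public static class DrainSketch
{
    // Delivers unseen events from a live window. When the cursor is behind
    // the window start, nothing is delivered and the cursor is left unchanged,
    // so the missing events can be recovered later from the full log.
    public static List<string> Drain(
        IReadOnlyList<string> window, int startIndex, ref int cursor)
    {
        var delivered = new List<string>();
        if (cursor < startIndex)
        {
            return delivered; // behind the window: defer, do not skip
        }

        int offset = cursor - startIndex;
        for (int i = offset; i < window.Count; i++)
        {
            delivered.Add(window[i]);
        }
        if (delivered.Count > 0)
        {
            cursor = startIndex + window.Count;
        }
        return delivered;
    }

    // At completion, delivers everything from the cursor to the end of the
    // full event log carried by the orchestration output.
    public static List<string> Backfill(IReadOnlyList<string> fullLog, ref int cursor)
    {
        var delivered = new List<string>();
        for (int i = cursor; i < fullLog.Count; i++)
        {
            delivered.Add(fullLog[i]);
        }
        cursor = Math.Max(cursor, fullLog.Count);
        return delivered;
    }
}
```

In this sketch, a consumer at cursor 2 that sees a window starting at index 3 delivers nothing and stays at 2. The later backfill from the full log then delivers indices 2 onward in order, so no event is skipped or duplicated.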
Contribution Checklist
… `breaking change` label (or add "[BREAKING]" to the title prefix, before or after any language prefix); a workflow keeps the label and title prefix in sync automatically.