.NET: Fix DurableTask CustomStatus 16 KB overflow on multi-executor workflows#6775

Open
kshyju wants to merge 3 commits into main from kshyju-fix-durabletask-customstatus-overflow

@kshyju commented Jun 26, 2026

Contributor

Motivation & Context

Microsoft.Agents.AI.DurableTask republishes the workflow's full cumulative event log to the orchestration CustomStatus after every executor/superstep so that streaming clients can observe events while the orchestration is still running.

Durable Functions caps CustomStatus at 16 KB of UTF-16 text, which is 8,192 characters. On workflows with enough executors or large typed outputs, the accumulated event list grows past that cap and the next status write throws, failing the entire orchestration. Otherwise-valid workflows therefore crash purely as a function of how many events they emit, which is a correctness and reliability problem rather than a usage error.

Description & Review Guide

  • What are the major changes?

    • Instead of writing the whole cumulative event log to CustomStatus, the runner now publishes a bounded trailing window of the most recent events that fits a conservative char budget, tagged with a new absolute EventsStartIndex on DurableWorkflowLiveStatus so consumers can map window positions back to absolute indices.
    • The complete, untrimmed event log still flows through the orchestration output (DurableWorkflowResult.Events), which is not subject to the 16 KB CustomStatus cap. The streaming consumer drains it from its last-read index at completion.
    • DurableStreamingWorkflowRun drains relative to EventsStartIndex and defers (does not advance its read cursor) whenever it is positioned behind the start of the published window — guaranteeing it never skips an event that scrolled out of the live window before it was delivered.
    • Window sizing uses a running char-cost on SuperstepState for an O(1) fast path in the common under-budget case (avoids re-costing every event on each publish).
    • Oversized single events are excluded from the window (an empty window is valid) rather than force-included — force-including would immediately re-overflow the cap. They are delivered via the output at completion like any other event.
  • What is the impact of these changes?

    • The overflow crash is eliminated; affected workflows now complete successfully.
    • Lossless and order-preserving: no event is dropped or reordered. For workflows that were already under budget, live-streaming behavior is unchanged. For oversized workflows, events that scroll out of the live window are delivered at completion instead of mid-run — a timing shift, not a loss.
    • Backward compatible wire contract: EventsStartIndex defaults to 0, so an older consumer reading new status (or a new consumer reading old status) behaves exactly as before.
    • Known out-of-scope limitations (pre-existing, not introduced here): a request port attaching a very large Input after the final publish, or a single event that also exceeds the much larger orchestration-output/backend limit, are separate concerns not addressed by this PR.
  • What do you want reviewers to focus on?

    • The drain/defer invariant in DurableStreamingWorkflowRun.DrainNewEvents (the consumer must never advance past EventsStartIndex while behind the window), and the char-budget math in BuildEventWindow/SerializedElementCost (the estimate is intentionally conservative so the serialized status stays ≤ 8192 chars).
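The producer-side windowing described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the C# code in the PR: `EventLog`, `element_cost`, and `build_event_window` are hypothetical names, events are assumed to be plain strings, and the 7600-char target is the figure quoted in the review thread.

```python
import json

CAP_CHARS = 8192     # 16 KB of UTF-16 text
TARGET_CHARS = 7600  # conservative budget (figure taken from the review thread)


def element_cost(event: str) -> int:
    """Chars the event takes as one JSON array element: escaped string plus a comma."""
    return len(json.dumps(event)) + 1


class EventLog:
    """Hypothetical stand-in for the accumulated events and their running cost."""

    def __init__(self) -> None:
        self.events: list[str] = []
        self.total_cost = 0

    def append(self, event: str) -> None:
        self.events.append(event)
        self.total_cost += element_cost(event)  # keeps the fast-path check O(1)


def build_event_window(log: EventLog, budget: int = TARGET_CHARS) -> tuple[int, list[str]]:
    """Return (start_index, window): the longest trailing run of events within budget."""
    if log.total_cost <= budget:
        return 0, log.events  # fast path: everything fits, nothing is re-costed
    used = 0
    start = len(log.events)
    # Walk backwards from the newest event until the next one would not fit.
    for i in range(len(log.events) - 1, -1, -1):
        cost = element_cost(log.events[i])
        if used + cost > budget:
            break  # an oversized newest event therefore yields an empty window
        used += cost
        start = i
    return start, log.events[start:]
```

In this sketch an empty window is a normal result: `start` then equals the total event count, which tells a consumer that everything so far must come from the final output.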

Related Issue

Fixes #5745

Contribution Checklist

  • The code builds clean without any errors or warnings
  • All unit tests pass, and I have added new tests where possible
  • The PR follows the Contribution Guidelines
  • This PR is linked to an issue and there is no other open PR for this issue (see Related Issue above).
  • This is not a breaking change. If it is a breaking change, add the breaking change label (or add "[BREAKING]" to the title prefix, before or after any language prefix) — a workflow keeps the label and title prefix in sync automatically.

…ws (#5745)

Publish only a bounded trailing window of events to the orchestration CustomStatus (tagged with an absolute EventsStartIndex) instead of the full cumulative event log, which could exceed the Durable Functions 16 KB CustomStatus cap and fail the orchestration. The complete event log still flows through the uncapped orchestration output and is drained by the streaming consumer at completion, so no event is lost or reordered.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 26, 2026 18:37
@moonbox3 added the documentation and .NET labels Jun 26, 2026

Copilot AI left a comment

Contributor

Pull request overview

This PR addresses Durable Functions’ 16 KB CustomStatus limit by publishing a bounded trailing window of workflow events (instead of the full cumulative event log) and introducing an absolute EventsStartIndex so streaming consumers can map window positions back to the full event sequence. This prevents orchestrations from failing when cumulative serialized events exceed the CustomStatus cap, while still returning the complete event log via orchestration output at completion.

Changes:

  • Add event-windowing logic in DurableWorkflowRunner with conservative size budgeting and EventsStartIndex tracking.
  • Update DurableStreamingWorkflowRun to drain events using (window, startIndex) semantics and backfill from final output on completion.
  • Add unit tests for window sizing and streaming behavior across sliding windows / behind-window scenarios; update DurableTask changelog.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| dotnet/src/Microsoft.Agents.AI.DurableTask/Workflows/DurableWorkflowRunner.cs | Implements bounded trailing event window selection and cost estimation for live CustomStatus publishing. |
| dotnet/src/Microsoft.Agents.AI.DurableTask/Workflows/DurableWorkflowLiveStatus.cs | Adds EventsStartIndex to support absolute indexing of windowed live events. |
| dotnet/src/Microsoft.Agents.AI.DurableTask/Workflows/DurableStreamingWorkflowRun.cs | Updates event draining logic to respect EventsStartIndex and safely defer behind-window gaps to completion backfill. |
| dotnet/tests/Microsoft.Agents.AI.DurableTask.UnitTests/Workflows/DurableWorkflowRunnerEventWindowTests.cs | Adds targeted unit tests for event window sizing and cap compliance. |
| dotnet/tests/Microsoft.Agents.AI.DurableTask.UnitTests/Workflows/DurableStreamingWorkflowRunTests.cs | Adds streaming tests validating no duplicates/gaps with sliding windows and completion backfill behavior. |
| dotnet/src/Microsoft.Agents.AI.DurableTask/CHANGELOG.md | Adds an unreleased entry describing the fix. |

Comment thread dotnet/src/Microsoft.Agents.AI.DurableTask/CHANGELOG.md Outdated

@github-actions (bot) left a comment

Contributor

Automated Code Review

Reviewers: 5 | Confidence: 86%

✓ Correctness

This PR correctly implements a bounded trailing window for live workflow status events to avoid the 16 KB Durable Functions custom status overflow. The DrainNewEvents logic correctly maps absolute indices, defers when behind the window, and recovers all events at completion. The BuildEventWindow logic correctly selects a trailing window within budget, handles oversized events by returning an empty window, and accounts for pending events. The SerializedElementCost estimation is conservative (uses JsonEncodedText.Encode for exact escaping plus overhead). No correctness bugs found; the invariants (no duplicates, no gaps, no skips, order-preserving) hold across all scenarios including sliding windows, consumer-behind-window, and empty-window cases.
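The drain/defer behavior summarized above can be illustrated with a small sketch. It is written in Python for brevity and is not the PR's `DrainNewEvents` code: the function names and the string-valued events are assumptions made for the example.

```python
def drain_new_events(cursor: int, start_index: int, window: list[str]) -> tuple[int, list[str]]:
    """Handle one live-status poll; return (new_cursor, events_to_deliver)."""
    if cursor < start_index:
        # Events [cursor, start_index) scrolled out of the window before they
        # were delivered. Reading the window now would skip them, so defer and
        # leave the cursor where it is.
        return cursor, []
    offset = cursor - start_index  # map the absolute cursor into the window
    fresh = window[offset:]
    return cursor + len(fresh), fresh


def drain_at_completion(cursor: int, full_log: list[str]) -> tuple[int, list[str]]:
    """Backfill everything not yet delivered from the complete output log."""
    fresh = full_log[cursor:]
    return cursor + len(fresh), fresh
```

Because the cursor only moves when events are actually handed out, a deferred poll costs nothing but latency: the missing events arrive in order from the full log at completion.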

✓ Security Reliability

The PR correctly implements a bounded trailing event window to prevent the 16 KB custom status overflow. The drain/defer invariant in DrainNewEvents is sound, and the windowing logic in BuildEventWindow is correct. One reliability gap: EstimatePendingEventsCost uses raw string Length for the Input field instead of accounting for JSON escaping (unlike SerializedElementCost which correctly uses JsonEncodedText.Encode). Since Input contains serialized request data (likely JSON with escapable characters like quotes and backslashes), the actual serialized cost can exceed the estimate, potentially narrowing the 592-char safety margin in edge cases.

✓ Test Coverage

The test coverage for the new windowed live-status feature is strong. Both the producer side (BuildEventWindow with 5 unit tests covering within-budget, oversized single events, oversized mid-sequence, and pending-events budget reduction) and the consumer side (DrainNewEvents via 3 integration tests covering sliding windows, consumer-behind-window, and empty-window recovery) are well-exercised. The assertions are meaningful (checking exact event ordering, no duplicates, no gaps, and cap compliance). One notable gap is the absence of a backward-compatibility deserialization test: the PR claims adding EventsStartIndex is backward compatible (defaults to 0 for old payloads), but no test guards that contract.
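A backward-compatibility guard of the kind this review asks for might look like the sketch below. It is a Python illustration only: the `LiveStatus` class, the `parse_live_status` helper, and the JSON property names `events` and `eventsStartIndex` are assumptions, not the actual wire format of DurableWorkflowLiveStatus.

```python
import json
from dataclasses import dataclass, field


@dataclass
class LiveStatus:
    """Hypothetical mirror of the live status payload."""
    events: list[str] = field(default_factory=list)
    events_start_index: int = 0  # absent in old payloads, so it must default to 0


def parse_live_status(payload: str) -> LiveStatus:
    data = json.loads(payload)
    return LiveStatus(
        events=data.get("events", []),
        events_start_index=data.get("eventsStartIndex", 0),
    )


# An old-format payload has no start index; a new one carries it explicitly.
old_status = parse_live_status('{"events": ["a", "b"]}')
new_status = parse_live_status('{"events": ["d"], "eventsStartIndex": 3}')
```

The point of such a test is to pin the default: an old payload read by a new consumer is treated as a window starting at absolute index 0, which matches the previous full-log behavior.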

✓ Failure Modes

The PR correctly implements a bounded trailing window for live status events to prevent the 16 KB custom status overflow. The core invariants are sound: SerializedElementCost uses the same default JavaScriptEncoder as the serializer (verified via DurableSerialization.cs using default options), DrainNewEvents correctly defers when behind the window and backfills from the full output at completion, and BuildEventWindow's budget math is conservative (7600 target vs 8192 cap). The fast-path aliasing of state.AccumulatedEvents is safe because SetCustomStatus is always called before any subsequent mutation in the same synchronous block. No silent failures, lost errors, or unrecoverable race conditions are introduced.

✗ Design Approach

The trailing-window/backfill design looks correct for the normal runner publish path, but one request-port path still bypasses that windowing entirely, so the overflow can still happen when a workflow reaches a pending input after already accumulating a near-limit live window. I also found that the pending-event budget math is not conservative for escaped JSON content, which can still let CustomStatus exceed the 16 KB cap for quote/backslash-heavy request payloads.

Flagged Issues

  • BuildEventWindow(..., state.LiveStatus.PendingEvents) only protects writes that flow through PublishEventsToLiveStatus, but ExecuteRequestPortAsync still publishes liveStatus directly after PendingEvents.Add(...) (DurableExecutorDispatcher.cs:117-118). A workflow with an already-full event window can still overflow SetCustomStatus the moment it hits a RequestPort, so the design does not fully cover the pending-input code path it now budgets for.
  • EstimatePendingEventsCost reserves only pending.EventName.Length + pending.Input.Length even though PendingRequestPortStatus.Input is serialized request data. Quotes, backslashes, and control characters in that JSON are escaped again when DurableWorkflowLiveStatus is serialized, so the estimate can undercount and still overflow the cap.
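The undercount in the second flagged issue can be demonstrated with a short sketch. It uses Python's `json.dumps` as a stand-in for the serializer, so the exact numbers will differ from what the .NET encoder produces; `raw_cost` and `escaped_cost` are hypothetical helpers, not functions from the PR.

```python
import json


def raw_cost(event_name: str, input_json: str) -> int:
    """Estimate based on unescaped string lengths (the approach flagged above)."""
    return len(event_name) + len(input_json)


def escaped_cost(event_name: str, input_json: str) -> int:
    """Estimate based on the escaped form each string takes inside a JSON document."""
    # json.dumps returns the quoted, escaped string; drop the two surrounding quotes.
    return (len(json.dumps(event_name)) - 2) + (len(json.dumps(input_json)) - 2)


# The pending input is itself serialized JSON, so it is full of quotes and
# backslashes that get escaped a second time when the status is serialized.
pending_input = json.dumps({"path": "C:\\temp", "note": 'say "hi"'})

undercount = escaped_cost("approve", pending_input) - raw_cost("approve", pending_input)
```

For plain alphanumeric text the two estimates agree, which is why the gap only shows up with quote-heavy or backslash-heavy request payloads.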

Automated review by kshyju's agents

@github-actions

Contributor

Flagged issue

BuildEventWindow(..., state.LiveStatus.PendingEvents) only protects writes that flow through PublishEventsToLiveStatus, but ExecuteRequestPortAsync still publishes liveStatus directly after PendingEvents.Add(...) (DurableExecutorDispatcher.cs:117-118). A workflow with an already-full event window can still overflow SetCustomStatus the moment it hits a RequestPort, so the design does not fully cover the pending-input code path it now budgets for.


Source: automated DevFlow PR review

@github-actions

Contributor

Flagged issue

EstimatePendingEventsCost reserves only pending.EventName.Length + pending.Input.Length even though PendingRequestPortStatus.Input is serialized request data. Quotes, backslashes, and control characters in that JSON are escaped again when DurableWorkflowLiveStatus is serialized, so the estimate can undercount and still overflow the cap.


Source: automated DevFlow PR review

…t windowing

- Add UTF-8 BOM to the new test file to satisfy the repo .editorconfig charset rule (fixes check-format).

- EstimatePendingEventsCost now measures JSON-escaped length (JsonEncodedText.Encode) for EventName/Input so the reserve is conservative for quote/backslash-heavy payloads.

- Re-window the live status on the RequestPort publish path (DurableExecutorDispatcher) via new DurableWorkflowRunner.TrimLiveStatusToBudget so adding a large pending input to a near-full window cannot overflow the 16 KB custom status cap.

- Point the CHANGELOG bullet at the PR per the DurableTask area convention. Add covering tests for TrimLiveStatusToBudget.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment thread dotnet/src/Microsoft.Agents.AI.DurableTask/Workflows/DurableWorkflowRunner.cs Outdated
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@kshyju marked this pull request as ready for review June 26, 2026 20:43
@kshyju requested a review from a team June 26, 2026 20:43

@github-actions (bot) left a comment

Contributor

Automated Code Review

Reviewers: 5 | Confidence: 89% | Result: All clear

Reviewed: Correctness, Security Reliability, Test Coverage, Failure Modes, Design Approach


Automated review by kshyju's agents

Labels

documentation, .NET

Development

Successfully merging this pull request may close these issues.

.NET: [Bug]: DurableTask: SuperstepState.AccumulatedEvents overflows CustomStatus 16 KB cap on multi-executor workflows with typed outputs

3 participants