Summary
RetryContext.getLastAttemptNumber() violates its documented contract when the underlying retriable task is awaited indirectly via a CompoundTask (e.g. TaskOrchestrationContext.anyOf(...)). For a task that joins the awaited set after CompoundTask.await() has already cycled several times, the counter is inflated before that task has failed even once, so a custom RetryHandler sees a lastAttemptNumber much higher than the actual number of failed attempts.
Documented contract
From RetryContext.getLastAttemptNumber():
Gets the previous retry attempt number. This number starts at 1 and increments each time the retry handler is invoked for a particular task failure.
Observed behavior
In a bounded sliding-window fan-out pattern that awaits retriable activities via ctx.anyOf(...), tasks that enter the in-flight window after the first batch has burned through its retries observe lastAttemptNumber ≈ window size in their retry handler — on the first real failure, before any retry has happened.
Root cause
The inflated counter originates somewhere in RetriableTask.attemptNumber accounting when a freshly-scheduled RetriableTask joins an anyOf membership that is already being drained. Relevant entry points in TaskOrchestrationExecutor.java (1.9.0):
    // line ~2055
    void init() {
        this.startTime = this.startTime == null ? this.context.getCurrentInstant() : this.startTime;
        this.attemptNumber++;
    }

    // line ~2194
    private void initSubTasks() {
        for (Task<V> subTask : this.subTasks) {
            if (subTask instanceof RetriableTask) ((RetriableTask<V>)subTask).init();
        }
    }
What we can pin down from the observed behaviour:
- The retry handler is invoked exactly once per real failure (no double-firing).
- totalRetryTimeMs is consistent with the real number of attempts: it is 0 on the inflated invocations, confirming zero real retries have happened.
- attemptNumber on a RetriableTask scheduled mid-drain is already much greater than 1 by the time its first failure reaches the handler.
- RetriableTasks that exist before the drain loop starts retain a correct counter, even though they participate in many CompoundTask.await() cycles while still pending. So init() cannot be incrementing unconditionally on every initSubTasks() pass; there is effectively an "already initialised" guard for them.
The remaining unknown is what causes a newly constructed RetriableTask to begin life with attemptNumber already inflated — or to be subjected to init() calls that skip the guard the existing tasks benefit from — when it is added to an anyOf membership that has already cycled through several CompoundTask.await() calls. That interaction is the failure surface.
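To make the arithmetic concrete, here is a toy model of only the two excerpts above. It is a sketch, not the real TaskOrchestrationExecutor: the class names ModelRetriableTask and ModelCompoundTask and the helper counterAfterPasses are hypothetical, and the model deliberately has no guard. It shows that an unguarded init() per await pass would make the counter equal the number of passes a task sat through, which is the shape of the inflation seen on mid-drain tasks. It does not explain why tasks that exist before the drain starts are unaffected.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the init()/initSubTasks() excerpts only; NOT the real executor.
public class AttemptCounterModel {

    /** Hypothetical stand-in for RetriableTask: only the counter is modelled. */
    public static final class ModelRetriableTask {
        int attemptNumber = 0;

        void init() {
            attemptNumber++; // mirrors the unconditional increment in the excerpt
        }
    }

    /** Hypothetical stand-in for CompoundTask: one await pass re-inits every member. */
    public static final class ModelCompoundTask {
        private final List<ModelRetriableTask> subTasks;

        ModelCompoundTask(List<ModelRetriableTask> subTasks) {
            this.subTasks = subTasks;
        }

        void awaitPass() {
            for (ModelRetriableTask subTask : subTasks) {
                subTask.init();
            }
        }
    }

    /** Counter a pending task would report after sitting through `passes` await passes. */
    public static int counterAfterPasses(int passes) {
        ModelRetriableTask task = new ModelRetriableTask();
        List<ModelRetriableTask> members = new ArrayList<>();
        members.add(task);
        for (int i = 0; i < passes; i++) {
            // The repro loop builds a fresh compound task on every iteration.
            new ModelCompoundTask(members).awaitPass();
        }
        return task.attemptNumber;
    }

    public static void main(String[] args) {
        System.out.println(counterAfterPasses(1));
        System.out.println(counterAfterPasses(10));
    }
}
```

In this model a task that is still pending after 10 passes reports 10 without having failed once. Whether the real code reaches init() this way for mid-drain tasks is exactly the open question.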
Minimal reproduction
The bug surfaces under these conditions:
- Bounded sliding window: tasks are added to the in-flight list mid-drain (one new task scheduled each time one completes).
- Outer try/catch wrapping ctx.anyOf(inFlight).await(). The CompoundTask returned by anyOf() re-throws the underlying TaskFailedException from its await(), rather than returning the failed task. Without an outer catch the orchestrator dies on the first failure and the bug never gets a chance to manifest on later tasks.
- Enough items to fill a second window after the first has completed. With N total items and a window of W, you need N > W. The first window goes through retries normally (1→max); the second window is where the bug shows up.
Kotlin, Azure Functions Java worker:
    import com.microsoft.azure.functions.ExecutionContext
    import com.microsoft.azure.functions.annotation.FunctionName
    import com.microsoft.durabletask.RetryHandler
    import com.microsoft.durabletask.Task
    import com.microsoft.durabletask.TaskFailedException
    import com.microsoft.durabletask.TaskOptions
    import com.microsoft.durabletask.TaskOrchestrationContext
    import com.microsoft.durabletask.azurefunctions.DurableActivityTrigger
    import com.microsoft.durabletask.azurefunctions.DurableOrchestrationTrigger
    import java.util.logging.Level

    class LastAttemptReproOrchestratorFunction {
        companion object {
            const val ORCHESTRATOR_NAME = "LastAttemptReproOrchestratorFunction"
            const val ACTIVITY_NAME = "AlwaysFailsActivityFunction"
            private const val NUM_ACTIVITIES = 20
            private const val MAX_IN_FLIGHT = 10
            private const val MAX_ATTEMPTS = 5
            private const val SLOW_TASK_STEP_MILLIS = 200L
        }

        @FunctionName(ORCHESTRATOR_NAME)
        fun run(
            @DurableOrchestrationTrigger(name = "ctx") ctx: TaskOrchestrationContext,
            executionContext: ExecutionContext,
        ) {
            val log = executionContext.logger
            val opts = TaskOptions(
                RetryHandler { retryCtx ->
                    // Log only on live execution so replays do not duplicate lines.
                    if (!retryCtx.orchestrationContext.isReplaying) {
                        log.log(
                            Level.WARNING,
                            "RetryHandler invoked: lastAttemptNumber=${retryCtx.lastAttemptNumber} " +
                                "(max=$MAX_ATTEMPTS), totalRetryTimeMs=${retryCtx.totalRetryTime.toMillis()}",
                        )
                    }
                    retryCtx.lastAttemptNumber < MAX_ATTEMPTS
                },
            )

            // Bounded sliding-window fan-out: keep at most MAX_IN_FLIGHT in flight at any time,
            // and start a replacement each time one completes.
            val inFlight = mutableListOf<Task<*>>()
            var nextIndex = 0
            while (nextIndex < MAX_IN_FLIGHT && nextIndex < NUM_ACTIVITIES) {
                inFlight += ctx.callActivity(ACTIVITY_NAME, nextIndex, opts, String::class.java)
                nextIndex++
            }

            while (inFlight.isNotEmpty()) {
                try {
                    val completedTask: Task<*> = ctx.anyOf(inFlight).await()
                    inFlight.remove(completedTask)
                    try {
                        completedTask.await()
                    } catch (e: TaskFailedException) {
                        // expected: every activity always fails
                    }
                } catch (e: TaskFailedException) {
                    // anyOf().await() can throw the underlying TaskFailedException directly
                    // rather than returning the failed task; identify the done task and continue.
                    val done = inFlight.firstOrNull { it.isDone } ?: throw e
                    inFlight.remove(done)
                }
                // Refill the window mid-drain; these late tasks are the ones that misbehave.
                if (nextIndex < NUM_ACTIVITIES) {
                    inFlight += ctx.callActivity(ACTIVITY_NAME, nextIndex, opts, String::class.java)
                    nextIndex++
                }
            }
        }

        @FunctionName(ACTIVITY_NAME)
        fun alwaysFails(@DurableActivityTrigger(name = "index") index: Int): String {
            // Stagger failures so later activities stay pending across many anyOf().await() passes.
            Thread.sleep(index * SLOW_TASK_STEP_MILLIS)
            throw RuntimeException("Always fails (for lastAttemptNumber repro)")
        }
    }
(Plus a trivial HTTP trigger that calls durableContext.client.scheduleNewOrchestrationInstance(ORCHESTRATOR_NAME) to start it.)
The staggered Thread.sleep(index * 200ms) in the activity makes the first batch of activities fail quickly and the second batch slowly, ensuring many anyOf().await() iterations run with the second batch still pending.
Observed output
With NUM_ACTIVITIES=20, MAX_IN_FLIGHT=10, MAX_ATTEMPTS=5, all RetryHandler invoked log lines from one orchestration run:
"20/05/2026, 12:41:38.178","RetryHandler invoked: lastAttemptNumber=1 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:38.251","RetryHandler invoked: lastAttemptNumber=2 (max=5), totalRetryTimeMs=570"
"20/05/2026, 12:41:38.313","RetryHandler invoked: lastAttemptNumber=1 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:38.313","RetryHandler invoked: lastAttemptNumber=3 (max=5), totalRetryTimeMs=661"
"20/05/2026, 12:41:38.465","RetryHandler invoked: lastAttemptNumber=4 (max=5), totalRetryTimeMs=722"
"20/05/2026, 12:41:38.554","RetryHandler invoked: lastAttemptNumber=1 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:38.554","RetryHandler invoked: lastAttemptNumber=5 (max=5), totalRetryTimeMs=873"
"20/05/2026, 12:41:38.623","RetryHandler invoked: lastAttemptNumber=2 (max=5), totalRetryTimeMs=722"
"20/05/2026, 12:41:38.763","RetryHandler invoked: lastAttemptNumber=1 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:38.891","RetryHandler invoked: lastAttemptNumber=3 (max=5), totalRetryTimeMs=1030"
"20/05/2026, 12:41:38.891","RetryHandler invoked: lastAttemptNumber=1 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:39.072","RetryHandler invoked: lastAttemptNumber=2 (max=5), totalRetryTimeMs=961"
"20/05/2026, 12:41:39.135","RetryHandler invoked: lastAttemptNumber=1 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:39.201","RetryHandler invoked: lastAttemptNumber=4 (max=5), totalRetryTimeMs=1298"
"20/05/2026, 12:41:39.340","RetryHandler invoked: lastAttemptNumber=1 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:39.437","RetryHandler invoked: lastAttemptNumber=2 (max=5), totalRetryTimeMs=1170"
"20/05/2026, 12:41:39.556","RetryHandler invoked: lastAttemptNumber=5 (max=5), totalRetryTimeMs=1608"
"20/05/2026, 12:41:39.556","RetryHandler invoked: lastAttemptNumber=3 (max=5), totalRetryTimeMs=1479"
"20/05/2026, 12:41:39.556","RetryHandler invoked: lastAttemptNumber=1 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:39.748","RetryHandler invoked: lastAttemptNumber=1 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:39.818","RetryHandler invoked: lastAttemptNumber=2 (max=5), totalRetryTimeMs=1298"
"20/05/2026, 12:41:39.952","RetryHandler invoked: lastAttemptNumber=1 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:40.021","RetryHandler invoked: lastAttemptNumber=4 (max=5), totalRetryTimeMs=1964"
"20/05/2026, 12:41:40.109","RetryHandler invoked: lastAttemptNumber=3 (max=5), totalRetryTimeMs=1846"
"20/05/2026, 12:41:40.215","RetryHandler invoked: lastAttemptNumber=2 (max=5), totalRetryTimeMs=1543"
"20/05/2026, 12:41:40.512","RetryHandler invoked: lastAttemptNumber=5 (max=5), totalRetryTimeMs=2430"
"20/05/2026, 12:41:40.620","RetryHandler invoked: lastAttemptNumber=2 (max=5), totalRetryTimeMs=1748"
"20/05/2026, 12:41:40.698","RetryHandler invoked: lastAttemptNumber=3 (max=5), totalRetryTimeMs=2227"
"20/05/2026, 12:41:40.782","RetryHandler invoked: lastAttemptNumber=4 (max=5), totalRetryTimeMs=2518"
"20/05/2026, 12:41:41.043","RetryHandler invoked: lastAttemptNumber=2 (max=5), totalRetryTimeMs=1964"
"20/05/2026, 12:41:41.283","RetryHandler invoked: lastAttemptNumber=3 (max=5), totalRetryTimeMs=2623"
"20/05/2026, 12:41:41.425","RetryHandler invoked: lastAttemptNumber=2 (max=5), totalRetryTimeMs=2154"
"20/05/2026, 12:41:41.503","RetryHandler invoked: lastAttemptNumber=5 (max=5), totalRetryTimeMs=3191"
"20/05/2026, 12:41:41.565","RetryHandler invoked: lastAttemptNumber=4 (max=5), totalRetryTimeMs=3106"
"20/05/2026, 12:41:41.828","RetryHandler invoked: lastAttemptNumber=2 (max=5), totalRetryTimeMs=2361"
"20/05/2026, 12:41:41.899","RetryHandler invoked: lastAttemptNumber=3 (max=5), totalRetryTimeMs=3029"
"20/05/2026, 12:41:42.356","RetryHandler invoked: lastAttemptNumber=4 (max=5), totalRetryTimeMs=3692"
"20/05/2026, 12:41:42.437","RetryHandler invoked: lastAttemptNumber=5 (max=5), totalRetryTimeMs=3974"
"20/05/2026, 12:41:42.557","RetryHandler invoked: lastAttemptNumber=3 (max=5), totalRetryTimeMs=3452"
"20/05/2026, 12:41:43.109","RetryHandler invoked: lastAttemptNumber=3 (max=5), totalRetryTimeMs=3834"
"20/05/2026, 12:41:43.179","RetryHandler invoked: lastAttemptNumber=4 (max=5), totalRetryTimeMs=4308"
"20/05/2026, 12:41:43.436","RetryHandler invoked: lastAttemptNumber=5 (max=5), totalRetryTimeMs=4766"
"20/05/2026, 12:41:43.717","RetryHandler invoked: lastAttemptNumber=3 (max=5), totalRetryTimeMs=4238"
"20/05/2026, 12:41:44.034","RetryHandler invoked: lastAttemptNumber=4 (max=5), totalRetryTimeMs=4965"
"20/05/2026, 12:41:44.486","RetryHandler invoked: lastAttemptNumber=5 (max=5), totalRetryTimeMs=5589"
"20/05/2026, 12:41:44.800","RetryHandler invoked: lastAttemptNumber=4 (max=5), totalRetryTimeMs=5519"
"20/05/2026, 12:41:45.517","RetryHandler invoked: lastAttemptNumber=5 (max=5), totalRetryTimeMs=6444"
"20/05/2026, 12:41:45.596","RetryHandler invoked: lastAttemptNumber=4 (max=5), totalRetryTimeMs=6125"
"20/05/2026, 12:41:46.480","RetryHandler invoked: lastAttemptNumber=5 (max=5), totalRetryTimeMs=7206"
"20/05/2026, 12:41:47.480","RetryHandler invoked: lastAttemptNumber=5 (max=5), totalRetryTimeMs=8005"
"20/05/2026, 12:41:49.563","RetryHandler invoked: lastAttemptNumber=10 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:49.792","RetryHandler invoked: lastAttemptNumber=10 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:50.045","RetryHandler invoked: lastAttemptNumber=10 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:50.203","RetryHandler invoked: lastAttemptNumber=10 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:50.405","RetryHandler invoked: lastAttemptNumber=10 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:50.623","RetryHandler invoked: lastAttemptNumber=10 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:50.837","RetryHandler invoked: lastAttemptNumber=10 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:51.017","RetryHandler invoked: lastAttemptNumber=10 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:51.255","RetryHandler invoked: lastAttemptNumber=10 (max=5), totalRetryTimeMs=0"
"20/05/2026, 12:41:51.463","RetryHandler invoked: lastAttemptNumber=10 (max=5), totalRetryTimeMs=0"
The first batch of 10 activities (indices 0–9) retries normally: lastAttemptNumber progresses 1→5 and totalRetryTimeMs rises monotonically per task. The second batch (indices 10–19) each see lastAttemptNumber=10 with totalRetryTimeMs=0 on their first failure. That value equals MAX_IN_FLIGHT, which is consistent with the counter being inflated before the task's first real failure. The retry handler returns false (10 < 5 is false), so the second batch gets zero retries and their first failure propagates. The totalRetryTimeMs=0 confirms no real retry has happened, even though lastAttemptNumber already says 10.
Expected: each task's retry handler sees lastAttemptNumber rising from 1 to 5 as it actually retries.
Actual: tasks entering a refilled window see lastAttemptNumber ≈ window size on first failure — exceeding the retry budget before any real retry happens.
Impact
Any custom RetryHandler that gates retries on lastAttemptNumber (the obvious choice given the docs) silently rejects retries that should fire, under common parallel fan-out patterns. We hit this in production when running 10 items in parallel via anyOf with a 5-attempt cap: the last items in the window saw inflated lastAttemptNumber on their first failure, retries were rejected, the activity failure propagated to the orchestrator, and the orchestration failed instead of recovering.
Workaround: gate on RetryContext.getTotalRetryTime() instead — that field is only written inside RetriableTask.tryRetry() after the timer await and is not affected by initSubTasks.
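A minimal sketch of that workaround as a pure decision function, so it can be unit-tested without the SDK. The class name RetryTimeGate, the helper shouldRetry, and the 8-second budget are assumptions for illustration; only getTotalRetryTime() comes from the library. Note that a time budget is not equivalent to an attempt cap, so the budget has to be chosen to suit the expected retry delays.

```java
import java.time.Duration;

// Hypothetical helper: decide on retries from elapsed retry time instead of
// lastAttemptNumber, which is the field affected by this bug.
public class RetryTimeGate {

    /** Returns true while the accumulated retry time is still below the budget. */
    public static boolean shouldRetry(Duration totalRetryTime, Duration budget) {
        return totalRetryTime.compareTo(budget) < 0;
    }

    public static void main(String[] args) {
        Duration budget = Duration.ofSeconds(8); // assumed budget, tune per workload

        // First failure of a second-window task in the log above: totalRetryTimeMs=0.
        System.out.println(shouldRetry(Duration.ofMillis(0), budget));

        // A first-window task late in its retries: totalRetryTimeMs=8005.
        System.out.println(shouldRetry(Duration.ofMillis(8005), budget));

        // Intended wiring inside a handler (not compiled here, needs the SDK):
        //   retryCtx -> shouldRetry(retryCtx.getTotalRetryTime(), budget)
    }
}
```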
Suggested fix
Without a precise diagnosis it's premature to prescribe a code change, but the goal is clear: RetryContext.getLastAttemptNumber() must reflect only real handler invocations for a given task, matching totalRetryTime. Whatever code path bumps attemptNumber for a freshly-scheduled RetriableTask joining an active CompoundTask needs to be removed or guarded so that the counter starts at 1 on the first real failure for every task, regardless of when in the orchestrator's lifetime the task was scheduled.
A useful invariant to enforce in tests: for every retry-handler invocation, lastAttemptNumber == 1 || totalRetryTime > 0. Any invocation with lastAttemptNumber > 1 and totalRetryTime == 0 is by definition inconsistent with the documented contract.
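The invariant can be sketched as a small checker that runs over captured handler log lines. This assumes the log format used in the repro above; the class name RetryInvariantCheck and its methods are hypothetical, not part of the SDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical checker for: lastAttemptNumber == 1 || totalRetryTime > 0.
public class RetryInvariantCheck {

    // Matches the two fields as printed by the repro's RetryHandler log line.
    private static final Pattern FIELDS =
        Pattern.compile("lastAttemptNumber=(\\d+).*totalRetryTimeMs=(\\d+)");

    /** The invariant itself, on already-parsed values. */
    public static boolean holds(int lastAttemptNumber, long totalRetryTimeMs) {
        return lastAttemptNumber == 1 || totalRetryTimeMs > 0;
    }

    /** Parses one handler log line and applies the invariant to it. */
    public static boolean lineHolds(String logLine) {
        Matcher m = FIELDS.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("not a RetryHandler line: " + logLine);
        }
        return holds(Integer.parseInt(m.group(1)), Long.parseLong(m.group(2)));
    }

    public static void main(String[] args) {
        String[] lines = {
            "RetryHandler invoked: lastAttemptNumber=1 (max=5), totalRetryTimeMs=0",
            "RetryHandler invoked: lastAttemptNumber=2 (max=5), totalRetryTimeMs=570",
            "RetryHandler invoked: lastAttemptNumber=10 (max=5), totalRetryTimeMs=0",
        };
        for (String line : lines) {
            System.out.println((lineHolds(line) ? "ok        " : "VIOLATION ") + line);
        }
    }
}
```

Of the three sample lines taken from the run above, only the lastAttemptNumber=10, totalRetryTimeMs=0 line is flagged.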
Environment
- com.microsoft:durabletask-client:1.9.0
- Azure Functions Java worker on Flex Consumption plan, region norwayeast, Maximum instance count = 10
- Relevant host.json:

    {
      "version": "2.0",
      "functionTimeout": "00:10:00",
      "extensions": {
        "durableTask": {
          "hubName": "MyHub",
          "storageProvider": {
            "partitionCount": 16
          },
          "maxConcurrentActivityFunctions": 10,
          "maxConcurrentOrchestratorFunctions": 20
        }
      },
      "extensionBundle": {
        "id": "Microsoft.Azure.Functions.ExtensionBundle",
        "version": "[4.*, 5.0.0)"
      }
    }