|
| 1 | +--- |
| 2 | +title: "Retry Loop ASL Pattern" |
| 3 | +weight: 10 |
| 4 | +--- |
| 5 | + |
| 6 | +# Step Function Retry & Readiness Polling Pattern |
| 7 | + |
| 8 | +This guide shows the recommended Amazon States Language (ASL) patterns for implementing retry loops and readiness polling in your Step Functions state machine. |
| 9 | + |
| 10 | +## Prerequisites |
| 11 | + |
| 12 | +These patterns require interlock v0.2.2 or later, which adds: |
| 13 | + |
| 14 | +- `failureCategory` in run-checker responses (classifies failures as `TRANSIENT`, `TIMEOUT`, or `PERMANENT`) |
| 15 | +- `retryable` and `retryBackoffSeconds` in orchestrator `logResult` responses |
| 16 | +- `not_ready` result with `pollAdvised` from orchestrator `checkReadiness` |
| 17 | + |
| 18 | +## Failure Retry Loop |
| 19 | + |
| 20 | +When a job fails, the orchestrator's `logResult` action returns retry metadata. The ASL can use this to loop back and retry. |
| 21 | + |
| 22 | +```json |
| 23 | +{ |
| 24 | + "LogRunFailed": { |
| 25 | + "Type": "Task", |
| 26 | + "Resource": "${OrchestratorArn}", |
| 27 | + "Parameters": { |
| 28 | + "action": "logResult", |
| 29 | + "pipelineID.$": "$.pipelineID", |
| 30 | + "scheduleID.$": "$.scheduleID", |
| 31 | + "payload": { |
| 32 | + "status": "FAILED", |
| 33 | + "runID.$": "$.runID", |
| 34 | + "message.$": "$.failureMessage", |
| 35 | + "failureCategory.$": "$.failureCategory" |
| 36 | + } |
| 37 | + }, |
| 38 | + "ResultPath": "$.logResult", |
| 39 | + "Next": "IsRetryable" |
| 40 | + }, |
| 41 | + |
| 42 | + "IsRetryable": { |
| 43 | + "Type": "Choice", |
| 44 | + "Choices": [ |
| 45 | + { |
| 46 | + "Variable": "$.logResult.payload.retryable", |
| 47 | + "BooleanEquals": true, |
| 48 | + "Next": "WaitRetryBackoff" |
| 49 | + } |
| 50 | + ], |
| 51 | + "Default": "ReleaseLockFailed" |
| 52 | + }, |
| 53 | + |
| 54 | + "WaitRetryBackoff": { |
| 55 | + "Type": "Wait", |
| 56 | + "SecondsPath": "$.logResult.payload.retryBackoffSeconds", |
| 57 | + "Next": "AcquireLock" |
| 58 | + } |
| 59 | +} |
| 60 | +``` |
| 61 | + |
| 62 | +### How it works |
| 63 | + |
| 64 | +1. `LogRunFailed` calls the orchestrator with `failureCategory` from the run-checker |
| 65 | +2. The orchestrator computes `retryable` (based on category + attempt count + max attempts) and `retryBackoffSeconds` |
| 66 | +3. `IsRetryable` branches: if retryable, wait and loop back to `AcquireLock`; otherwise, proceed to final cleanup (`ReleaseLockFailed`) |
| 67 | +4. `WaitRetryBackoff` uses `SecondsPath` for dynamic exponential backoff |
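
To make the data flow concrete, here is a sketch of the state that `IsRetryable` and `WaitRetryBackoff` would read once `LogRunFailed` completes. The field paths are taken from the ASL above; every value is an illustrative placeholder, and the real orchestrator response may carry additional fields not shown here.

```json
{
  "pipelineID": "example-pipeline",
  "scheduleID": "example-schedule",
  "runID": "example-run",
  "failureMessage": "example failure message",
  "failureCategory": "TRANSIENT",
  "logResult": {
    "payload": {
      "retryable": true,
      "retryBackoffSeconds": 30
    }
  }
}
```

With this input, `IsRetryable` matches its `BooleanEquals` rule and `WaitRetryBackoff` pauses for 30 seconds before returning to `AcquireLock`. If `retryable` were `false`, the `Default` branch would route to `ReleaseLockFailed` instead.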
| 68 | + |
| 69 | +### Backward compatibility |
| 70 | + |
| 71 | +If the ASL does not pass `failureCategory`, the orchestrator defaults it to `TRANSIENT`, so the failure is treated as retryable, subject to the same attempt limits as any other transient failure. This ensures existing deployments get retry behavior without ASL changes. |
| 72 | + |
| 73 | +## Readiness Polling |
| 74 | + |
| 75 | +When traits fail (data not ready), the orchestrator returns `not_ready` with poll metadata. The ASL can use this to wait and re-evaluate. |
| 76 | + |
| 77 | +```json |
| 78 | +{ |
| 79 | + "CheckReadiness": { |
| 80 | + "Type": "Task", |
| 81 | + "Resource": "${OrchestratorArn}", |
| 82 | + "Parameters": { |
| 83 | + "action": "checkReadiness", |
| 84 | + "pipelineID.$": "$.pipelineID", |
| 85 | + "payload": { |
| 86 | + "traitResults.$": "$.traitResults" |
| 87 | + } |
| 88 | + }, |
| 89 | + "ResultPath": "$.readiness", |
| 90 | + "Next": "IsReady" |
| 91 | + }, |
| 92 | + |
| 93 | + "IsReady": { |
| 94 | + "Type": "Choice", |
| 95 | + "Choices": [ |
| 96 | + { |
| 97 | + "Variable": "$.readiness.result", |
| 98 | + "StringEquals": "proceed", |
| 99 | + "Next": "TriggerJob" |
| 100 | + }, |
| 101 | + { |
| 102 | + "Variable": "$.readiness.result", |
| 103 | + "StringEquals": "not_ready", |
| 104 | + "Next": "WaitReadiness" |
| 105 | + } |
| 106 | + ], |
| 107 | + "Default": "HandleEvaluatorError" |
| 108 | + }, |
| 109 | + |
| 110 | + "WaitReadiness": { |
| 111 | + "Type": "Wait", |
| 112 | + "Seconds": 60, |
| 113 | + "Next": "AcquireLock" |
| 114 | + } |
| 115 | +} |
| 116 | +``` |
| 117 | + |
| 118 | +### How it works |
| 119 | + |
| 120 | +1. `CheckReadiness` evaluates trait results and returns `proceed`, `not_ready`, or `error` |
| 121 | +2. `IsReady` branches on the result: |
| 122 | + - `proceed`: all required traits pass, trigger the job |
| 123 | + - `not_ready`: data not ready yet, wait and re-evaluate (loops back to `AcquireLock`) |
| 124 | + - `error`: evaluator infrastructure failure, handle separately |
| 125 | +3. `WaitReadiness` pauses before re-evaluation (the example uses a fixed 60-second interval; a dynamic interval can be supplied with `SecondsPath` instead) |
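
If the orchestrator response carried a suggested interval, `WaitReadiness` could read it with `SecondsPath` rather than a fixed value. The sketch below assumes a hypothetical `pollIntervalSeconds` field under `$.readiness`. This guide does not document such a field, so substitute whatever your `checkReadiness` response actually provides.

```json
{
  "WaitReadiness": {
    "Type": "Wait",
    "SecondsPath": "$.readiness.pollIntervalSeconds",
    "Next": "AcquireLock"
  }
}
```

This mirrors how `WaitRetryBackoff` uses `SecondsPath` in the failure retry loop; the referenced value needs to be a number of seconds.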
| 126 | + |
| 127 | +### Backward compatibility |
| 128 | + |
| 129 | +The previous `skip` result is replaced by `not_ready`. Existing ASL templates that check `result == "proceed"` with a default fallback will treat `not_ready` the same as `skip`: both hit the default path. No ASL changes are required for existing deployments to continue working. |
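
For reference, an existing Choice state of the kind described here might look like the following sketch. `ReleaseLockSkipped` is a hypothetical name standing in for whatever default path a deployment already uses.

```json
{
  "IsReady": {
    "Type": "Choice",
    "Choices": [
      {
        "Variable": "$.readiness.result",
        "StringEquals": "proceed",
        "Next": "TriggerJob"
      }
    ],
    "Default": "ReleaseLockSkipped"
  }
}
```

Because only `proceed` is matched explicitly, both the old `skip` result and the new `not_ready` result fall through to `Default`.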
| 130 | + |
| 131 | +## Complete Flow |
| 132 | + |
| 133 | +The recommended state machine flow combining both patterns: |
| 134 | + |
| 135 | +``` |
| 136 | +AcquireLock → CheckRunLog → ResolvePipeline → EvaluateTraits → CheckReadiness |
| 137 | + │ |
| 138 | + ┌──────────┼──────────┐ |
| 139 | + │ │ │ |
| 140 | + proceed not_ready error |
| 141 | + │ │ │ |
| 142 | + TriggerJob Wait(60s) Alert+Skip |
| 143 | + │ │ |
| 144 | + PollStatus → AcquireLock |
| 145 | + │ |
| 146 | + ┌─────┼─────┐ |
| 147 | + │ │ |
| 148 | + succeeded failed |
| 149 | + │ │ |
| 150 | + LogCompleted LogFailed |
| 151 | + │ |
| 152 | + ┌─────┼─────┐ |
| 153 | + │ │ |
| 154 | + retryable non-retryable |
| 155 | + │ │ |
| 156 | + Wait(backoff) Cleanup |
| 157 | + │ |
| 158 | + AcquireLock |
| 159 | +``` |