Description
Enables the TaskRunner to save its execution state and resume from that state later, allowing for recovery from failures or pausing long-running workflows without re-executing completed tasks.
Proposed Solution
Feature: Workflow State Persistence (Checkpoint/Resume)
🎯 User Story
"As a DevOps engineer, I want to save the state of a running workflow and resume it later from where it left off (e.g., after a server crash, deployment, or manual pause), so that I don't have to re-run expensive or side-effect-heavy tasks."
❓ Why
- Cost Efficiency: Re-running tasks like AI model training, large data ingestion, or paid API calls wastes money and resources.
- Safety & Idempotency: Some tasks are not idempotent (e.g., "Charge Credit Card", "Send Email"). If a workflow crashes after these steps but before completion, re-running from scratch is dangerous.
- Resilience: Long-running workflows (minutes to hours) are vulnerable to transient infrastructure failures. Resuming from the last successful step lets the workflow recover without discarding the work already completed.
🛠️ What Changes
- State Exposure: TaskRunner and TaskStateManager need to expose the current execution state (results of completed tasks).
- Hydration: TaskRunnerBuilder and TaskStateManager need a way to initialize with a pre-existing state (the snapshot).
- Execution Logic: WorkflowExecutor needs to respect the hydrated state, skipping tasks that are already marked as success in the snapshot while treating them as satisfied dependencies for downstream tasks.
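To make the three changes above concrete, here is a minimal sketch of how snapshot exposure, hydration, and skip-on-success could fit together. It is not the library's actual API: SketchRunner, getSnapshot, the Task shape, and the fields of TaskResult are all hypothetical names chosen for illustration, and the sketch assumes tasks are supplied in dependency order and that task outputs are JSON-serializable.

```typescript
// Hypothetical shapes: the issue only names TaskResult and
// Record<string, TaskResult>; the fields below are assumptions.
type TaskStatus = "success" | "failure" | "cancelled" | "skipped";

interface TaskResult {
  status: TaskStatus;
  output?: unknown; // assumed to be JSON-serializable
}

type Snapshot = Record<string, TaskResult>;

interface Task<TContext> {
  name: string;
  dependsOn: string[];
  run(context: TContext): Promise<unknown>;
}

// Simplified stand-in for TaskRunner + WorkflowExecutor.
// Assumes `tasks` is already sorted so dependencies come first.
class SketchRunner<TContext> {
  private readonly tasks: Task<TContext>[];
  private readonly context: TContext;
  private readonly results: Snapshot;

  // Hydration: an optional snapshot seeds the runner's state.
  constructor(tasks: Task<TContext>[], context: TContext, snapshot: Snapshot = {}) {
    this.tasks = tasks;
    this.context = context;
    this.results = { ...snapshot }; // copy so the caller's object is not mutated
  }

  // State exposure: a detached, JSON-safe copy of the current results.
  getSnapshot(): Snapshot {
    return JSON.parse(JSON.stringify(this.results)) as Snapshot;
  }

  async execute(): Promise<Snapshot> {
    for (const task of this.tasks) {
      // Tasks already marked success in the snapshot are not run again.
      if (this.results[task.name]?.status === "success") continue;

      // Anything else (failure, cancelled, skipped, or never run) is
      // re-evaluated; hydrated successes count as satisfied dependencies.
      const ready = task.dependsOn.every(
        (dep) => this.results[dep]?.status === "success",
      );
      if (!ready) {
        this.results[task.name] = { status: "skipped" };
        continue;
      }

      try {
        const output = await task.run(this.context);
        this.results[task.name] = { status: "success", output };
      } catch {
        this.results[task.name] = { status: "failure" };
      }
    }
    return this.getSnapshot();
  }
}
```

Under these assumptions, a caller would store JSON.stringify(runner.getSnapshot()) somewhere durable, and after a crash pass the parsed object back as the third constructor argument together with a freshly built context. The snapshot carries only task status and results, so rebuilding the context remains the caller's job.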
✅ Acceptance Criteria
- TaskRunner (or TaskStateManager) must expose a method to get a serializable snapshot of the current state (results).
- TaskRunnerBuilder must accept a snapshot to initialize the runner.
- When execute is called with a hydrated state:
  - Tasks marked success in the snapshot MUST NOT run again.
  - Tasks marked success in the snapshot MUST be treated as completed dependencies for pending tasks.
  - Tasks marked failure, cancelled, or skipped in the snapshot SHOULD be re-evaluated (run again).
- Context (TContext) changes made by tasks in the previous run must be manually restored by the user (since context can contain non-serializable objects), OR the snapshot must include a mechanism to warn/handle context.
  - The user passes the context to the TaskRunnerBuilder. The state snapshot only tracks task status/results. If the context needs to be in a certain state for step N+1, the user must provide that context.
- The snapshot is a Record<string, TaskResult>.
⚠️ Constraints
- The TContext object is often non-serializable (contains functions, sockets, etc.). Therefore, this feature only persists the execution graph state (which tasks finished). The user is responsible for re-hydrating the context to a state suitable for resumption if necessary.
Alternatives Considered
No response