10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,15 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.9.0] - 2026-03-12

### Added

- **Dry-run / shadow mode** (`dryRun: true`) — observation-only mode for evaluating Interlock against running pipelines without executing jobs. The stream-router evaluates trigger conditions, validation rules, and SLA projections inline, publishing all observations as EventBridge events. No Step Function executions, no job triggers, no rerun requests. New events: `DRY_RUN_WOULD_TRIGGER`, `DRY_RUN_LATE_DATA`, `DRY_RUN_SLA_PROJECTION`, `DRY_RUN_DRIFT`. `DRY_RUN#` markers are stored with a 7-day TTL for dedup and late-data detection. Post-run drift detection captures a baseline at trigger time and compares it when sensors update. SLA projection reuses the production `handleSLACalculate` for consistent deadline resolution. Requires `schedule.trigger` and `job.type`. Calendar exclusions honored.
- **`DryRunSK` key helper** for DynamoDB `DRY_RUN#` sort keys.
- **`WriteDryRunMarker` / `GetDryRunMarker` store methods** with conditional write (idempotent dedup) and consistent read.
- **Config validation** for dry-run: requires both `job.type` and `schedule.trigger`.

## [0.8.0] - 2026-03-10

### Added
@@ -378,6 +387,7 @@ Initial release of the Interlock STAMP-based safety framework for data pipeline

Released under the [Elastic License 2.0](LICENSE).

[0.9.0]: https://github.com/dwsmith1983/interlock/releases/tag/v0.9.0
[0.8.0]: https://github.com/dwsmith1983/interlock/releases/tag/v0.8.0
[0.7.4]: https://github.com/dwsmith1983/interlock/releases/tag/v0.7.4
[0.7.3]: https://github.com/dwsmith1983/interlock/releases/tag/v0.7.3
49 changes: 49 additions & 0 deletions README.md
@@ -4,6 +4,14 @@ STAMP-based safety framework for data pipeline reliability. Interlock prevents p

The framework applies [Leveson's Systems-Theoretic Accident Model](https://mitpress.mit.edu/9780262016629/engineering-a-safer-world/) to data engineering: pipelines have **declarative validation rules** (feedback), **sensor data in DynamoDB** (process models), and **conditional execution** (safe control actions).

## What Interlock Is (and Isn't)

Interlock is **not** an orchestrator and **not** a scheduler. It's a safety controller — the layer that decides whether a pipeline *should* run, not just whether it's *scheduled* to run. It works with whatever you already have: schedulers like cron, Airflow, Databricks Workflows, or EventBridge, and orchestrators like Dagster, Prefect, or Step Functions.

A scheduler fires because the clock says so. An orchestrator sequences tasks once they're kicked off. Neither asks whether the data your pipeline needs is actually present, fresh, and correct before executing. **Interlock does.** You route the trigger path through Interlock: sensor data lands in DynamoDB, Interlock evaluates readiness against declarative YAML rules, and only triggers the job when preconditions are met. Your scheduler can still provide the clock signal — an EventBridge cron writing a sensor tick, for example — but Interlock decides whether that tick becomes a job run.

**After a run completes**, Interlock keeps watching. It detects post-completion drift (source data changed after your job succeeded), late data arrival, SLA breaches, and silently missed schedules — failure modes that schedulers and orchestrators don't address because they stop paying attention once a job finishes.
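
The sensor-tick handoff described above can be sketched as a plain DynamoDB write. This is a minimal illustration only — the table name `interlock-control` and the `PK`/`SK` key shape are assumptions, not Interlock's actual schema:

```python
import json
import time

# Hypothetical item shape: the real control-table schema is not shown in
# this document, so the table name and PK/SK attribute names are assumptions.
def build_sensor_item(pipeline_id: str, sensor_key: str, payload: dict) -> dict:
    """Build a sensor row an external process would push to the control table."""
    return {
        "PK": f"PIPELINE#{pipeline_id}",
        "SK": f"SENSOR#{sensor_key}",
        "updatedAt": int(time.time()),
        **payload,
    }

item = build_sensor_item("gold-revenue", "upstream-complete", {"status": "ready"})
# An external process (or an EventBridge-cron-driven Lambda providing the
# clock signal) would then write it, e.g.:
#   boto3.resource("dynamodb").Table("interlock-control").put_item(Item=item)
print(json.dumps(item, default=str))
```

Interlock's stream-router picks the write up from the table's stream and decides whether it becomes a job run.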

## How It Works

External processes push sensor data into a DynamoDB **control table**. When a trigger condition is met (cron schedule or sensor arrival), a Step Functions workflow evaluates all validation rules against the current sensor state. If all rules pass, the pipeline job is triggered. EventBridge events provide observability at every stage.
@@ -143,6 +151,47 @@ job:
maxRetries: 2
```

## Dry-Run / Shadow Mode

Evaluate Interlock against running pipelines without executing any jobs. Set `dryRun: true` in your pipeline config — the stream-router observes the real sensor stream and records what it *would* do as EventBridge events, while your existing orchestrator continues running as-is.

```yaml
pipeline:
id: gold-revenue-dryrun
schedule:
trigger:
key: upstream-complete
check: equals
field: status
value: ready
sla:
deadline: "10:00"
expectedDuration: 30m
validation:
trigger: "ALL"
rules:
- key: upstream-complete
check: equals
field: status
value: ready
job:
type: glue
config:
jobName: gold-revenue-etl
dryRun: true
```

Dry-run publishes four observation events:

| Event | Meaning |
|-------|---------|
| `DRY_RUN_WOULD_TRIGGER` | All validation rules passed — Interlock would have triggered the job |
| `DRY_RUN_LATE_DATA` | Sensor updated after the trigger point — would have triggered a re-run |
| `DRY_RUN_SLA_PROJECTION` | Estimated completion vs. deadline — would the SLA be met or breached? |
| `DRY_RUN_DRIFT` | Post-run sensor data changed — would have detected drift and re-run |

No Step Function executions, no job triggers, no rerun requests. Remove `dryRun: true` to switch to live mode — `DRY_RUN#` markers have a 7-day TTL and don't interfere with `TRIGGER#` rows.

## Trigger Types

| Type | SDK/Protocol | Use Case |
2 changes: 2 additions & 0 deletions deploy/terraform/eventbridge.tf
@@ -84,7 +84,9 @@ resource "aws_cloudwatch_event_rule" "alert_events" {
"POST_RUN_DRIFT_INFLIGHT",
"POST_RUN_FAILED",
"POST_RUN_SENSOR_MISSING",
"RERUN_ACCEPTED",
"RERUN_REJECTED",
"JOB_COMPLETED",
"LATE_DATA_ARRIVAL",
"TRIGGER_RECOVERED",
"BASELINE_CAPTURE_FAILED",
12 changes: 12 additions & 0 deletions docs/content/docs/configuration/pipelines.md
@@ -77,6 +77,18 @@ Time-based constraints for pipeline completion. See [Schedules](../schedules/) f

Declarative rules that determine pipeline readiness. See [Validation Rules](#validation-rules) below.

### `dryRun`

Enables observation-only mode. When `true`, Interlock evaluates trigger conditions and validation rules against real sensor data but never starts a Step Function execution or triggers any job. All observations are published as EventBridge events (`DRY_RUN_WOULD_TRIGGER`, `DRY_RUN_LATE_DATA`, `DRY_RUN_SLA_PROJECTION`, `DRY_RUN_DRIFT`).

| Field | Type | Default | Description |
|---|---|---|---|
| `dryRun` | bool | `false` | Enable dry-run / shadow mode |

Requires `schedule.trigger` (sensor-driven evaluation) and `job.type` to be configured. Calendar exclusions are still honored. Remove `dryRun: true` to switch to live mode — `DRY_RUN#` markers have a 7-day TTL and don't interfere with `TRIGGER#` rows.

See [Alerting](../../reference/alerting/) for the full dry-run event reference.

### `job`

Defines how to start the downstream job when validation passes. See [Triggers](../triggers/) for all 8 supported job types.
12 changes: 12 additions & 0 deletions docs/content/docs/guides/_index.md
@@ -0,0 +1,12 @@
---
title: Guides
weight: 5
description: Operational guides for deploying and managing Interlock pipelines.
---

Step-by-step guides for common Interlock workflows.

{{< cards >}}
{{< card link="retry-loop-asl" title="Step Functions State Machine" subtitle="ASL patterns for evaluation, triggering, job polling, and SLA scheduling." >}}
{{< card link="dry-run" title="Dry-Run / Shadow Mode" subtitle="Evaluate pipelines against real data without triggering jobs." >}}
{{< /cards >}}
150 changes: 150 additions & 0 deletions docs/content/docs/guides/dry-run.md
@@ -0,0 +1,150 @@
---
title: "Dry-Run / Shadow Mode"
weight: 20
---

# Dry-Run / Shadow Mode

Dry-run mode lets you evaluate a pipeline against real sensor data without triggering any jobs or starting Step Function executions. Interlock records what it *would* do as EventBridge events, giving you a complete picture of trigger timing, SLA projections, late data arrivals, and drift detection — all without side effects.

## When to Use Dry-Run

- **Onboarding a new pipeline.** Validate that sensor data, validation rules, and scheduling work correctly before committing compute resources.
- **Testing config changes.** Modify thresholds, rules, or SLA deadlines on a dry-run copy and compare its decisions against the live pipeline.
- **Comparing against an existing orchestrator.** Run Interlock in shadow mode alongside Airflow, Dagster, or Prefect to verify it makes the same triggering decisions.

## Setup

Add `dryRun: true` to your pipeline config. The pipeline must have `schedule.trigger` (sensor-driven evaluation) and `job.type` configured.

```yaml
id: bronze-hourly-dryrun
dryRun: true

schedule:
trigger: "sensor"
timezone: "UTC"

validation:
trigger: all_pass
rules:
- field: "files_processed"
operator: ">="
value: 4

job:
type: glue
name: bronze-etl

sla:
deadline: "06:00"
expectedDuration: "45m"

postRun:
rules:
- field: "count"
operator: ">="
value: 1000
driftField: "count"
```

Deploy the config the same way as any live pipeline — via `jsonencode(yamldecode(...))` in Terraform or direct DynamoDB writes.
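
A "direct DynamoDB write" deploy can be sketched as follows, mirroring what Terraform's `jsonencode(yamldecode(...))` produces. The table name and config-row attributes here are hypothetical — Interlock's real config-row schema is not shown in this guide:

```python
import json

# Assumed config-row shape: PK attribute and "config" payload attribute are
# illustrative names, not Interlock's documented schema.
config = {
    "id": "bronze-hourly-dryrun",
    "dryRun": True,
    "schedule": {"trigger": "sensor", "timezone": "UTC"},
    "job": {"type": "glue", "name": "bronze-etl"},
}

item = {
    "PK": f"CONFIG#{config['id']}",
    "config": json.dumps(config),  # store the whole config as a JSON string
}
# boto3.resource("dynamodb").Table("interlock-control").put_item(Item=item)
```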

## What Happens

When sensor data arrives, the stream-router Lambda evaluates the dry-run pipeline through the same validation path as a live pipeline. Instead of starting a Step Function execution, it writes a `DRY_RUN#` marker to the control table and publishes EventBridge events.

```
Sensor arrives → Stream-router
→ Check calendar exclusions (still enforced)
→ Check for existing DRY_RUN# marker
→ Marker exists: publish DRY_RUN_LATE_DATA event
→ No marker: evaluate validation rules
→ Rules fail: no action
→ Rules pass:
→ Write DRY_RUN# marker (7-day TTL)
→ Capture post-run baseline (if PostRun configured)
→ Publish DRY_RUN_WOULD_TRIGGER event
→ Compute and publish DRY_RUN_SLA_PROJECTION (if SLA configured)
```

`DRY_RUN#` markers have a 7-day TTL and are automatically cleaned up by DynamoDB.
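
DynamoDB TTL expiry works on an epoch-seconds attribute, so the 7-day window reduces to simple arithmetic. A sketch — the TTL attribute name on the real marker item is not shown in this document:

```python
import time

SEVEN_DAYS = 7 * 24 * 60 * 60  # 604800 seconds

def dry_run_marker_ttl(now_epoch: int) -> int:
    """Epoch second at which DynamoDB may delete the DRY_RUN# marker."""
    return now_epoch + SEVEN_DAYS

ttl = dry_run_marker_ttl(int(time.time()))
```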

## Monitoring Events

Dry-run pipelines emit four event types to EventBridge:

| Event | Meaning |
|---|---|
| `DRY_RUN_WOULD_TRIGGER` | All validation rules passed — Interlock would have triggered the job |
| `DRY_RUN_LATE_DATA` | Sensor data arrived after the trigger point was already recorded |
| `DRY_RUN_SLA_PROJECTION` | Estimated completion time vs. deadline — `met` or `breach` status |
| `DRY_RUN_DRIFT` | Post-run sensor data changed from the baseline captured at trigger time |

Create an EventBridge rule to capture these events:

```json
{
"source": ["interlock"],
"detail-type": [
{ "prefix": "DRY_RUN_" }
]
}
```

Route events to CloudWatch Logs, SNS, or an SQS queue for analysis. See the [Alerting reference](../../reference/alerting/) for full event payload details.

## SLA Projection

When `sla.expectedDuration` and `sla.deadline` are configured, Interlock computes a projected completion time at the moment validation rules pass. The projection reuses the same SLA calculation logic as production monitoring (including hourly pipeline T+1 adjustment).

The `DRY_RUN_SLA_PROJECTION` event includes:

| Field | Description |
|---|---|
| `status` | `"met"` or `"breach"` |
| `estimatedCompletion` | Trigger time + `expectedDuration` |
| `deadline` | Resolved breach deadline |
| `marginSeconds` | Seconds between estimated completion and deadline (negative = breach) |

This lets you tune `expectedDuration` before going live, avoiding false SLA alerts on day one.
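
The projection in the table above reduces to one line of datetime arithmetic. A sketch of the calculation as described (field names mirror the event payload; this is not Interlock's internal `handleSLACalculate` code, and it omits the hourly T+1 adjustment):

```python
from datetime import datetime, timedelta, timezone

def sla_projection(trigger_time, expected_duration, deadline):
    """Project SLA status: estimated completion is trigger time plus
    expectedDuration; a negative margin means breach."""
    estimated = trigger_time + expected_duration
    margin = int((deadline - estimated).total_seconds())
    return {
        "status": "met" if margin >= 0 else "breach",
        "estimatedCompletion": estimated.isoformat(),
        "deadline": deadline.isoformat(),
        "marginSeconds": margin,
    }

trigger = datetime(2026, 3, 12, 5, 0, tzinfo=timezone.utc)
deadline = datetime(2026, 3, 12, 6, 0, tzinfo=timezone.utc)
proj = sla_projection(trigger, timedelta(minutes=45), deadline)
# 05:00 trigger + 45m = 05:45 estimated, 06:00 deadline -> 900s margin, "met"
```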

## Drift Detection

If `postRun` is configured, Interlock captures a baseline snapshot of the drift field at the moment validation passes. When the post-run sensor updates later, it compares the new value against the baseline.

If the delta exceeds `driftThreshold`, a `DRY_RUN_DRIFT` event is published with the previous count, current count, delta, and the sensor key that changed. This tells you whether your production pipeline would have triggered a drift re-run.
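
The comparison is a simple threshold check on the delta. A sketch of that logic as described above — the payload field names are assumptions, not the documented event schema:

```python
def detect_drift(baseline, current, drift_threshold):
    """Compare the post-run sensor value against the baseline captured at
    trigger time; return an event-like dict when the delta exceeds
    driftThreshold, else None. Field names are illustrative only."""
    delta = abs(current - baseline)
    if delta <= drift_threshold:
        return None  # within tolerance: no DRY_RUN_DRIFT event
    return {"previousCount": baseline, "currentCount": current, "delta": delta}

no_event = detect_drift(1000, 1005, 10)   # within threshold -> None
event = detect_drift(1000, 1100, 10)      # would publish DRY_RUN_DRIFT
```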

## Going Live

To promote a dry-run pipeline to live:

1. Remove `dryRun: true` from the pipeline config (or set it to `false`).
2. Deploy the updated config.

No cleanup is needed. Existing `DRY_RUN#` markers expire naturally via their 7-day TTL and do not interfere with `TRIGGER#` rows that live pipelines use.

## Side-by-Side Deployment

You can run a dry-run pipeline alongside a live pipeline watching the same sensors. Use different pipeline IDs:

```yaml
# Live pipeline
id: bronze-hourly

# Dry-run shadow
id: bronze-hourly-dryrun
dryRun: true
```

Both pipelines receive the same sensor events. The live pipeline triggers jobs; the dry-run pipeline records observations. This is useful for validating config changes before applying them to the live pipeline.

## Troubleshooting

**No events appearing.** Verify that `schedule.trigger` is set to `"sensor"`. Dry-run requires sensor-driven evaluation — cron-only pipelines won't produce events.

**Stale markers.** `DRY_RUN#` markers have a 7-day TTL. If you need to re-evaluate the same date, wait for the marker to expire or delete it manually from the control table.

**Calendar exclusions blocking evaluation.** Calendar exclusions apply to dry-run pipelines the same way as live pipelines. Check your calendar config if events stop appearing on expected dates.

**SLA projection shows `met` but you expected `breach`.** The projection uses the same deadline resolution as production, including timezone handling and T+1 adjustment for hourly pipelines. Verify that `sla.deadline` and `sla.timezone` match your expectations.
13 changes: 13 additions & 0 deletions docs/content/docs/reference/alerting.md
@@ -70,6 +70,19 @@ Published by the **stream-router** when processing rerun requests and late data
| `RERUN_REJECTED` | Rerun request rejected by circuit breaker | No new sensor data since last job completion |
| `RERUN_ACCEPTED` | Rerun request accepted | Rerun passed circuit breaker validation and trigger lock was reset |

### Dry-Run Events

Published by the **stream-router** Lambda for pipelines with `dryRun: true`. These events record what Interlock *would* do without executing any jobs or starting Step Functions.

| Detail Type | Meaning | When |
|---|---|---|
| `DRY_RUN_WOULD_TRIGGER` | All validation rules passed | Interlock would have triggered the pipeline job at this time |
| `DRY_RUN_LATE_DATA` | Sensor updated after trigger point | Sensor data arrived after the dry-run trigger was recorded — would have triggered a re-run |
| `DRY_RUN_SLA_PROJECTION` | Estimated completion vs. deadline | Projects whether the SLA would be met or breached based on `expectedDuration` and `deadline` |
| `DRY_RUN_DRIFT` | Post-run sensor data changed | Sensor value drifted from baseline captured at trigger time — would have triggered a drift re-run |

The `DRY_RUN_SLA_PROJECTION` detail includes `status` (`"met"` or `"breach"`), `estimatedCompletion`, `deadline`, and `marginSeconds` fields.
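
An illustrative (not authoritative) event envelope for that detail type, using only the fields listed above — the values and detail layout are invented for illustration:

```json
{
  "source": "interlock",
  "detail-type": "DRY_RUN_SLA_PROJECTION",
  "detail": {
    "status": "breach",
    "estimatedCompletion": "2026-03-12T10:15:00Z",
    "deadline": "2026-03-12T10:00:00Z",
    "marginSeconds": -900
  }
}
```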

### Watchdog Events

Published by the **watchdog** Lambda, invoked on an EventBridge schedule (default: every 5 minutes).