30 changes: 30 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,34 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.7.2] - 2026-03-08

### Added

- **Configurable sensor trigger deadline** (`trigger.deadline`) — closes auto-trigger window after expiry, publishes `SENSOR_DEADLINE_EXPIRED`.
- **TOCTOU-safe `CreateTriggerIfAbsent` store method** using DynamoDB conditional writes.
- **CloudWatch alarms**: Per-function Lambda error alarms, Step Functions failure alarm, DLQ depth alarms (control, joblog, alert queues), and DynamoDB Stream iterator age alarms. All alarm actions conditionally route to an SNS topic via `sns_alarm_topic_arn`.
- **EventBridge input transformers for alarm routing**: CloudWatch alarm state changes are reshaped into `INFRA_ALARM` InterlockEvent format and routed to both event-sink and alert-dispatcher — zero Go code changes required.
- **Lambda concurrency limits**: Per-function reserved concurrent executions via `lambda_concurrency` object variable (defaults: stream-router=10, orchestrator=10, sla-monitor=5, watchdog=2, event-sink=5, alert-dispatcher=3).
- **Secrets Manager Slack token**: `slack_secret_arn` variable enables alert-dispatcher to read the Slack bot token from Secrets Manager instead of an environment variable. Falls back to `SLACK_BOT_TOKEN` env var if not configured.
- **Lambda trigger IAM scoping**: `enable_lambda_trigger` and `lambda_trigger_arns` variables grant orchestrator `lambda:InvokeFunction` permission scoped to specific function ARNs.
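
The create-if-absent semantics behind `CreateTriggerIfAbsent` can be illustrated with a minimal in-memory sketch. This is not the project's store code: `triggerStore`, its fields, and the `bool` return value are assumptions made for illustration, and a mutex stands in for the atomicity that a DynamoDB conditional write provides on the server side. The point is that the existence check and the write happen as one step, so two concurrent callers cannot both believe they created the record.

```go
package main

import (
	"fmt"
	"sync"
)

// triggerStore is a hypothetical in-memory stand-in for the trigger table.
type triggerStore struct {
	mu       sync.Mutex
	triggers map[string]string // key -> status
}

// CreateTriggerIfAbsent writes the record only if the key does not exist yet.
// The check and the write happen under one lock, mirroring what a conditional
// write gives you in DynamoDB. It reports whether this caller created the record.
func (s *triggerStore) CreateTriggerIfAbsent(key, status string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.triggers[key]; exists {
		return false
	}
	s.triggers[key] = status
	return true
}

func main() {
	store := &triggerStore{triggers: map[string]string{}}
	results := make(chan bool, 8)
	var wg sync.WaitGroup

	// Eight concurrent callers race to create the same trigger record.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			results <- store.CreateTriggerIfAbsent("pipeline#2026-03-08", "RUNNING")
		}()
	}
	wg.Wait()
	close(results)

	created := 0
	for won := range results {
		if won {
			created++
		}
	}
	fmt.Println("callers that created the record:", created)
}
```

A separate read followed by an unconditional write would leave a window between the two calls in which another invocation could create the same record.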

### Changed

- **Sensor-triggered pipelines now receive proactive SLA scheduling** (removed cron-only guard).
- **Trigger deadline check extracted into independent `checkTriggerDeadlines` watchdog scan**.
- **Env var expansion restricted to `INTERLOCK_` prefix**: `os.ExpandEnv` in trigger config (Airflow, Databricks, Lambda) now only expands variables prefixed with `INTERLOCK_`, preventing unintended system variable substitution.
- **`time.Now()` → `d.now()` across all handlers**: All Lambda handlers use dependency-injected time for consistent testability.
- **Config cache deep copy via JSON round-trip**: `GetAll()` returns a deep copy preventing callers from mutating shared cache state.
- **Single-instant rule evaluation**: All validation rules within an evaluation cycle use the same timestamp for temporal consistency.
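
As a rough illustration of the prefix restriction on env var expansion, the sketch below uses the standard library's `os.Expand` with a mapping function that only resolves names starting with `INTERLOCK_`. The function name `expandRestricted` and the choice to leave other references as literal `${VAR}` text are assumptions; the actual implementation may handle non-matching variables differently.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// allowedPrefix is the only variable-name prefix that gets expanded.
const allowedPrefix = "INTERLOCK_"

// expandRestricted expands ${VAR} and $VAR references in s, but only for
// names that start with allowedPrefix. Any other reference is written back
// in ${VAR} form so it survives as literal text (an assumption of this sketch).
func expandRestricted(s string, lookup func(string) string) string {
	return os.Expand(s, func(name string) string {
		if strings.HasPrefix(name, allowedPrefix) {
			return lookup(name)
		}
		return "${" + name + "}"
	})
}

func main() {
	// Hypothetical variables, set here only to make the example self-contained.
	os.Setenv("INTERLOCK_API_HOST", "airflow.internal")
	os.Setenv("DB_PASSWORD", "do-not-leak")

	url := "https://${INTERLOCK_API_HOST}/run?token=${DB_PASSWORD}"
	fmt.Println(expandRestricted(url, os.Getenv))
}
```

Passing the lookup function as a parameter keeps the helper testable without touching the process environment.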

### Fixed

- **Trigger lock release on SFN start failure**: Both rerun and job-failure retry paths release the trigger lock if `StartExecution` fails, preventing permanently stuck pipelines (previously caused 4.5h deadlock).
- **`scheduleSLAAlerts` skip-on-error**: SLA alert scheduling now correctly skips on error instead of falling through to the next handler.
- **9 silent audit write error discards → WARN logging**: All `publishEvent` call sites across stream-router and orchestrator now log errors at WARN level instead of silently discarding them.
- **Missing `EVENTS_TABLE`/`EVENTS_TTL_DAYS` envcheck for alert-dispatcher**: Startup validation now checks for required environment variables.
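
The lock-release fix can be sketched as follows. The interfaces `lockStore` and `executionStarter`, the function `startWithLock`, and the in-memory types are hypothetical stand-ins, not the project's actual types; the sketch only shows the pattern of releasing the trigger lock when the execution start fails.

```go
package main

import (
	"errors"
	"fmt"
)

// lockStore and executionStarter are hypothetical interfaces for this sketch.
type lockStore interface {
	Acquire(pipelineID string) (bool, error)
	Release(pipelineID string) error
}

type executionStarter interface {
	StartExecution(pipelineID string) error
}

// startWithLock acquires the trigger lock, starts the execution, and releases
// the lock again if the start fails, so a later retry is not blocked.
func startWithLock(locks lockStore, sfn executionStarter, pipelineID string) error {
	ok, err := locks.Acquire(pipelineID)
	if err != nil {
		return fmt.Errorf("acquire lock: %w", err)
	}
	if !ok {
		return nil // another invocation already owns the trigger
	}
	if err := sfn.StartExecution(pipelineID); err != nil {
		if relErr := locks.Release(pipelineID); relErr != nil {
			return errors.Join(err, relErr) // surface both failures
		}
		return fmt.Errorf("start execution: %w", err)
	}
	return nil
}

// memLocks is an in-memory lock store used only for the demonstration.
type memLocks struct{ held map[string]bool }

func (m *memLocks) Acquire(id string) (bool, error) {
	if m.held[id] {
		return false, nil
	}
	m.held[id] = true
	return true, nil
}

func (m *memLocks) Release(id string) error {
	delete(m.held, id)
	return nil
}

// failingStarter always fails, simulating a StartExecution error.
type failingStarter struct{}

func (failingStarter) StartExecution(string) error { return errors.New("throttled") }

func main() {
	locks := &memLocks{held: map[string]bool{}}
	err := startWithLock(locks, failingStarter{}, "daily-orders")
	fmt.Println("error:", err)
	fmt.Println("lock still held:", locks.held["daily-orders"])
}
```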

## [0.7.1] - 2026-03-08

### Fixed
@@ -315,6 +343,8 @@ Initial release of the Interlock STAMP-based safety framework for data pipeline

Released under the [Elastic License 2.0](LICENSE).

[0.7.2]: https://github.com/dwsmith1983/interlock/releases/tag/v0.7.2
[0.7.1]: https://github.com/dwsmith1983/interlock/releases/tag/v0.7.1
[0.7.0]: https://github.com/dwsmith1983/interlock/releases/tag/v0.7.0
[0.6.2]: https://github.com/dwsmith1983/interlock/releases/tag/v0.6.2
[0.6.1]: https://github.com/dwsmith1983/interlock/releases/tag/v0.6.1
5 changes: 3 additions & 2 deletions README.md
@@ -155,6 +155,7 @@ job:
| `emr-serverless` | AWS SDK | EMR Serverless job runs |
| `step-function` | AWS SDK | AWS Step Functions executions |
| `databricks` | HTTP (REST 2.1) | Databricks job runs |
| `lambda` | AWS SDK | Direct Lambda invocation |

## Deployment

@@ -170,7 +171,7 @@ module "interlock" {
}
```

The module creates all required infrastructure: DynamoDB tables, Lambda functions, Step Functions state machine, EventBridge rules, and IAM roles.
The module creates all required infrastructure: DynamoDB tables, Lambda functions, Step Functions state machine, EventBridge rules, CloudWatch alarms, and IAM roles. See the [deployment docs](https://dwsmith1983.github.io/interlock/docs/deployment/terraform/) for the full variable reference.

## Example

@@ -208,7 +209,7 @@ interlock/

```bash
make test # Run all tests
make build-lambda # Build 4 Lambda handlers (linux/arm64)
make build-lambda # Build 6 Lambda handlers (linux/arm64)
make lint # go fmt + go vet
```

21 changes: 19 additions & 2 deletions docs/content/docs/architecture/aws.md
@@ -112,7 +112,7 @@ Multi-mode dispatcher invoked by Step Functions. Each invocation specifies a `mo
| `job-poll-exhausted` | Publish `JOB_POLL_EXHAUSTED` event, write timeout joblog entry, set trigger to `FAILED_FINAL` when the job poll window expires |
| `complete-trigger` | Set trigger status to `COMPLETED` (on success) or `FAILED_FINAL` (on failure/timeout). Ensures trigger records reflect terminal state |

Supported trigger types: `http`, `command`, `airflow`, `glue`, `emr`, `emr-serverless`, `step-function`, `databricks`.
Supported trigger types: `http`, `command`, `airflow`, `glue`, `emr`, `emr-serverless`, `step-function`, `databricks`, `lambda`.

### sla-monitor

@@ -145,6 +145,8 @@ Processes messages from the SQS alert queue. Formats pipeline events into Slack

**Threading**: looks up existing thread records in the events table (`THREAD#{scheduleId}#{date}`). If a thread exists for the pipeline-day, replies in-thread. Otherwise, posts a new message and saves the thread timestamp for subsequent alerts.

**Secrets Manager**: when `SLACK_SECRET_ARN` is set, the Slack bot token is read from Secrets Manager at cold start instead of the `SLACK_BOT_TOKEN` environment variable. The module conditionally grants `secretsmanager:GetSecretValue` to the alert-dispatcher role.
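
A minimal sketch of the token resolution order described above, under stated assumptions: `resolveSlackToken` and `secretFetcher` are hypothetical names, the fetcher is an injected stand-in for a Secrets Manager `GetSecretValue` call, and treating a failed secret read as a hard error (rather than falling back to the environment variable) is a choice made for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// secretFetcher is a hypothetical hook; in a real deployment it would wrap
// a Secrets Manager GetSecretValue call.
type secretFetcher func(arn string) (string, error)

// resolveSlackToken prefers the secret referenced by SLACK_SECRET_ARN and
// falls back to SLACK_BOT_TOKEN only when no secret ARN is configured.
func resolveSlackToken(getenv func(string) string, fetch secretFetcher) (string, error) {
	if arn := getenv("SLACK_SECRET_ARN"); arn != "" {
		token, err := fetch(arn)
		if err != nil {
			return "", fmt.Errorf("read slack secret: %w", err)
		}
		return token, nil
	}
	if token := getenv("SLACK_BOT_TOKEN"); token != "" {
		return token, nil
	}
	return "", errors.New("no Slack token configured")
}

func main() {
	// A map-backed environment keeps the example deterministic.
	env := map[string]string{
		"SLACK_SECRET_ARN": "arn:aws:secretsmanager:us-east-1:123456789012:secret:interlock/slack-token-AbCdEf",
	}
	getenv := func(key string) string { return env[key] }
	fakeFetch := func(arn string) (string, error) { return "xoxb-from-secret", nil }

	token, err := resolveSlackToken(getenv, fakeFetch)
	fmt.Println("resolved:", token != "", "error:", err)
}
```

Resolving the token once at cold start, as the text describes, keeps the per-message path free of Secrets Manager calls.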

**Error handling**: Slack API errors return batch item failures so SQS retries individual messages. Thread lookup/save errors are logged but don't fail the message delivery.

## Step Functions State Machine
@@ -217,6 +219,21 @@ The ASL template at `deploy/statemachine.asl.json` uses substitution variables r
- `${sfn_timeout_seconds}` — global execution timeout (default 14400 / 4h)
- `${trigger_max_attempts}` — trigger infrastructure retry count (default 3)

## CloudWatch Alarms

The Terraform module creates CloudWatch alarms across four categories to monitor infrastructure health:

| Category | Alarms | Metric | Threshold |
|---|---|---|---|
| Lambda errors | 6 (one per function) | `Errors` | `>= 1` per 5-minute period |
| SFN failures | 1 (pipeline state machine) | `ExecutionsFailed` | `>= 1` per 5-minute period |
| DLQ depth | 3 (control, joblog, alert queues) | `ApproximateNumberOfMessagesVisible` | `>= 1` |
| Stream iterator age | 2 (control, joblog streams) | `IteratorAge` | `>= 300,000 ms` (5 min) |

Alarm state changes are reshaped into `INFRA_ALARM` events via EventBridge input transformers and routed to event-sink and alert-dispatcher. Optionally, alarms also publish to an SNS topic via the `sns_alarm_topic_arn` variable.

See [Alerting](../../reference/alerting/#cloudwatch-alarms) for details on the event flow and consumer patterns.

## EventBridge

A custom event bus (`{environment}-interlock-events`) receives all framework events. All four Lambda functions have `events:PutEvents` permission on this bus.
@@ -245,7 +262,7 @@ Each Lambda function has its own IAM role with least-privilege policies:
| sla-monitor | None | PutEvents | `scheduler:CreateSchedule`, `scheduler:DeleteSchedule` |
| watchdog | Read control, joblog, rerun; write control only | PutEvents | -- |
| event-sink | Write events table | -- | -- |
| alert-dispatcher | Read/write events table (thread storage) | -- | SQS ReceiveMessage/DeleteMessage |
| alert-dispatcher | Read/write events table (thread storage) | -- | SQS ReceiveMessage/DeleteMessage, conditional `secretsmanager:GetSecretValue` |

Trigger permissions for the orchestrator are opt-in via Terraform variables (`enable_glue_trigger`, `enable_emr_trigger`, etc.).

63 changes: 63 additions & 0 deletions docs/content/docs/deployment/terraform.md
@@ -113,6 +113,69 @@ terraform apply
| `enable_emr_trigger` | bool | `false` | Grant orchestrator Lambda permission to submit EMR steps |
| `enable_emr_serverless_trigger` | bool | `false` | Grant orchestrator Lambda permission to start EMR Serverless jobs |
| `enable_sfn_trigger` | bool | `false` | Grant orchestrator Lambda permission to start Step Functions executions |
| `enable_lambda_trigger` | bool | `false` | Grant orchestrator Lambda permission to invoke Lambda functions |
| `lambda_trigger_arns` | list(string) | `["*"]` | Lambda function ARNs the orchestrator may invoke. The default `["*"]` allows any function; list specific ARNs to scope the IAM policy. Only used when `enable_lambda_trigger` is `true` |
| `slack_bot_token` | string (sensitive) | `""` | Slack Bot API token with `chat:write` scope for alert-dispatcher |
| `slack_channel_id` | string | `""` | Slack channel ID for alert notifications |
| `slack_secret_arn` | string | `""` | ARN of the AWS Secrets Manager secret containing the Slack bot token. When set, alert-dispatcher reads the token from Secrets Manager instead of the `SLACK_BOT_TOKEN` environment variable |
| `events_table_ttl_days` | number | `90` | TTL in days for events table records |
| `lambda_concurrency` | object | See below | Reserved concurrent executions per Lambda function |
| `sns_alarm_topic_arn` | string | `""` | SNS topic ARN for CloudWatch alarm notifications. When set, all alarms send to this topic |

#### Lambda Concurrency Defaults

The `lambda_concurrency` variable is an object with per-function keys:

```hcl
lambda_concurrency = {
  stream_router    = 10
  orchestrator     = 10
  sla_monitor      = 5
  watchdog         = 2
  event_sink       = 5
  alert_dispatcher = 3
}
```

Set any key to `-1` to use unreserved concurrency (Lambda default).

## Secrets Manager

By default, alert-dispatcher reads the Slack bot token from the `SLACK_BOT_TOKEN` environment variable. For production deployments, store the token in AWS Secrets Manager and pass the ARN:

```hcl
module "interlock" {
  source = "path/to/interlock/deploy/terraform"

  slack_secret_arn = "arn:aws:secretsmanager:us-east-1:123456789012:secret:interlock/slack-token-AbCdEf"
  slack_channel_id = "C0123456789"
  # ...
}
```

When `slack_secret_arn` is set, the module automatically grants `secretsmanager:GetSecretValue` to the alert-dispatcher Lambda role. The token must have the `xoxb-` prefix (Bot token).

## CloudWatch Alarms

The module creates CloudWatch alarms across four categories. When `sns_alarm_topic_arn` is set, every alarm also sends its notifications to that SNS topic.

| Category | Alarms | Threshold |
|---|---|---|
| Lambda errors | One per function (6 total) | `Errors >= 1` per 5-minute period |
| Step Functions failures | One for the pipeline state machine | `ExecutionsFailed >= 1` per 5-minute period |
| DLQ depth | One per dead-letter queue (3 total) | `ApproximateNumberOfMessagesVisible >= 1` |
| Stream iterator age | One per DynamoDB Stream mapping (2 total) | `IteratorAge >= 300,000 ms` (5 minutes) |

CloudWatch alarm state changes are reshaped via EventBridge input transformers into `INFRA_ALARM` events and routed to both event-sink and alert-dispatcher. This means infrastructure alerts appear in the events table and Slack alongside pipeline lifecycle events — no additional Go code required.

```hcl
module "interlock" {
  source = "path/to/interlock/deploy/terraform"

  sns_alarm_topic_arn = aws_sns_topic.ops_alerts.arn
  # ...
}
```

## What Gets Created

42 changes: 42 additions & 0 deletions docs/content/docs/reference/alerting.md
@@ -53,6 +53,12 @@ Published by the **orchestrator** and **stream-router** Lambdas during the pipel
| `INFRA_FAILURE` | Unrecoverable infrastructure error | Step Functions execution reaches Fail state |
| `SFN_TIMEOUT` | Step Functions execution timed out | Global `TimeoutSeconds` exceeded (configurable via `sfn_timeout_seconds` Terraform variable) |
| `DATA_DRIFT` | Post-run drift detected | Post-run evaluation detected data quality drift against baseline |
| `POST_RUN_DRIFT` | Post-run sensor changed after completion | Sensor value drifted from baseline after job completed |
| `POST_RUN_DRIFT_INFLIGHT` | Post-run sensor changed while job running | Informational drift detected during an active execution |
| `POST_RUN_SENSOR_MISSING` | No post-run sensor data received | Watchdog detected no post-run sensor within `sensorTimeout` |
| `BASELINE_CAPTURE_FAILED` | Baseline capture error | Error occurred while capturing the post-run baseline at trigger completion |
| `SENSOR_DEADLINE_EXPIRED` | Sensor trigger window closed without pipeline starting | The configured `trigger.deadline` passed before the sensor trigger fired, closing the auto-trigger window; manual restart required via `RERUN_REQUEST` |
| `PIPELINE_EXCLUDED` | Pipeline excluded by calendar | Sensor, rerun, job-failure, or post-run drift skipped due to calendar exclusion |

### Rerun Events

@@ -62,6 +68,7 @@ Published by the **stream-router** when processing rerun requests and late data
|---|---|---|
| `LATE_DATA_ARRIVAL` | Sensor updated after job completed | Sensor `updatedAt` is newer than joblog `completedAt` |
| `RERUN_REJECTED` | Rerun request rejected by circuit breaker | No new sensor data since last job completion |
| `RERUN_ACCEPTED` | Rerun request accepted | Rerun passed circuit breaker validation and trigger lock was reset |

### Watchdog Events

@@ -215,3 +222,38 @@ Since EventBridge supports many target types, you can build any alert delivery p
| Custom processing | Lambda function | Enrich, deduplicate, or aggregate events |
| Queue for batch processing | SQS queue | Downstream systems that process events in batches |
| Cross-account delivery | EventBridge in another account | Centralized observability |

## CloudWatch Alarms

The Terraform module creates CloudWatch alarms that monitor infrastructure health independently of pipeline events. Alarm state changes are reshaped into `INFRA_ALARM` events via EventBridge input transformers and routed to both event-sink and alert-dispatcher — no additional Go code required.

### Alarm Categories

| Category | Count | Metric | Threshold |
|---|---|---|---|
| Lambda errors | 6 (one per function) | `Errors` | `>= 1` per 5-minute period |
| Step Functions failures | 1 | `ExecutionsFailed` | `>= 1` per 5-minute period |
| DLQ depth | 3 (control, joblog, alert) | `ApproximateNumberOfMessagesVisible` | `>= 1` |
| Stream iterator age | 2 (control, joblog) | `IteratorAge` | `>= 300,000 ms` (5 minutes) |

### How It Works

1. CloudWatch detects a metric threshold breach and transitions the alarm to `ALARM` state
2. The alarm state change publishes to the default EventBridge bus
3. An EventBridge rule with an **input transformer** reshapes the alarm into an `INFRA_ALARM` event with the standard Interlock event structure
4. The transformed event routes to both event-sink (→ events table) and the SQS alert queue (→ alert-dispatcher → Slack)
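
To make the mapping in step 3 concrete, here is a sketch of an equivalent transformation written in Go. In the module itself the reshaping is done declaratively by the EventBridge input transformer, not by Go code, so this only illustrates the shape of the mapping. The input fields (`detail.alarmName`, `detail.state.value`, `detail.state.reason`, `time`) come from the CloudWatch Alarm State Change event; the `infraAlarm` output struct, its JSON field names, and the sample alarm name are assumptions, since the Interlock event structure is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// alarmStateChange holds the fields this sketch reads from a
// "CloudWatch Alarm State Change" event.
type alarmStateChange struct {
	Time   string `json:"time"`
	Detail struct {
		AlarmName string `json:"alarmName"`
		State     struct {
			Value  string `json:"value"`
			Reason string `json:"reason"`
		} `json:"state"`
	} `json:"detail"`
}

// infraAlarm is a hypothetical target shape; the field names are assumptions.
type infraAlarm struct {
	EventType string `json:"eventType"`
	AlarmName string `json:"alarmName"`
	State     string `json:"state"`
	Reason    string `json:"reason"`
	Timestamp string `json:"timestamp"`
}

// reshape maps the alarm event onto the INFRA_ALARM shape.
func reshape(raw []byte) (infraAlarm, error) {
	var in alarmStateChange
	if err := json.Unmarshal(raw, &in); err != nil {
		return infraAlarm{}, fmt.Errorf("decode alarm event: %w", err)
	}
	return infraAlarm{
		EventType: "INFRA_ALARM",
		AlarmName: in.Detail.AlarmName,
		State:     in.Detail.State.Value,
		Reason:    in.Detail.State.Reason,
		Timestamp: in.Time,
	}, nil
}

func main() {
	raw := []byte(`{
		"time": "2026-03-08T12:00:00Z",
		"detail": {
			"alarmName": "interlock-orchestrator-errors",
			"state": {"value": "ALARM", "reason": "Threshold Crossed"}
		}
	}`)

	out, err := reshape(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	encoded, err := json.Marshal(out)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(encoded))
}
```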

### SNS Integration

Optionally route alarm notifications to an SNS topic for external consumers (PagerDuty, email, etc.):

```hcl
module "interlock" {
  source = "path/to/interlock/deploy/terraform"

  sns_alarm_topic_arn = aws_sns_topic.ops_alerts.arn
  # ...
}
```

When `sns_alarm_topic_arn` is set, all alarms add the topic as an alarm action alongside the EventBridge route.