Releases · dwsmith1983/interlock

29 Mar 06:38

dwsmith1983

v0.9.4

ead5e6e

v0.9.4 Latest

Split internal/lambda into handler sub-packages

Refactored

Split monolithic internal/lambda/ (46 files) into handler-aligned sub-packages — orchestrator/, stream/, watchdog/, sla/, alert/, sink/
Extracted shared utilities into focused root files — publish, date, exclusion, sensor, schedule, config, terminal
Trigger config registry — replaced buildTriggerConfig switch statement with generic registry map
SLA deadline calculations wired through pkg/sla/ — pure functions decoupled from Lambda handler context
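
To illustrate the registry change, here is a minimal sketch of swapping a switch statement for a lookup map. The release notes only name buildTriggerConfig; the TriggerType, HTTPConfig, CommandConfig, builder, and registry names below are hypothetical and do not reflect the project's actual types.

```go
package main

import "fmt"

// TriggerType and the config structs are hypothetical stand-ins,
// not the project's real definitions.
type TriggerType string

type HTTPConfig struct{ URL string }
type CommandConfig struct{ Command string }

// builder turns raw key/value settings into a typed trigger config.
type builder func(raw map[string]string) (any, error)

// registry maps each trigger type to its builder. Supporting a new trigger
// means adding one entry here instead of another switch case.
var registry = map[TriggerType]builder{
	"http": func(raw map[string]string) (any, error) {
		return HTTPConfig{URL: raw["url"]}, nil
	},
	"command": func(raw map[string]string) (any, error) {
		return CommandConfig{Command: raw["command"]}, nil
	},
}

// buildTriggerConfig looks the builder up rather than switching on the type.
func buildTriggerConfig(t TriggerType, raw map[string]string) (any, error) {
	b, ok := registry[t]
	if !ok {
		return nil, fmt.Errorf("unknown trigger type %q", t)
	}
	return b(raw)
}

func main() {
	cfg, err := buildTriggerConfig("http", map[string]string{"url": "https://example.com/run"})
	fmt.Println(cfg, err)

	_, err = buildTriggerConfig("ftp", nil)
	fmt.Println(err)
}
```

Unknown trigger types fail at the lookup, so there is no default branch to forget.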

Added

pkg/sla/ package — pure SLA deadline calculation functions
PipelineConfig.DeepCopy() method — safe config cache isolation without JSON roundtrip
EventWatchdogDegraded event type — watchdog health observability
Smoke tests for all 6 cmd/lambda/ packages — ValidateEnv coverage
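
A typed deep copy might look roughly like the sketch below. The fields on PipelineConfig and the SLAConfig shape are assumptions made for illustration, since the release notes do not list the real struct.

```go
package main

import "fmt"

// PipelineConfig is a hypothetical, cut-down shape; the real struct's
// fields are not listed in the release notes.
type PipelineConfig struct {
	Name    string
	Sensors []string
	Labels  map[string]string
	SLA     *SLAConfig
}

type SLAConfig struct{ Deadline string }

// DeepCopy clones every reference-typed field so the copy shares no memory
// with the cached original.
func (c *PipelineConfig) DeepCopy() *PipelineConfig {
	if c == nil {
		return nil
	}
	out := *c // copies value fields; slices, maps and pointers still alias
	if c.Sensors != nil {
		out.Sensors = append([]string(nil), c.Sensors...)
	}
	if c.Labels != nil {
		out.Labels = make(map[string]string, len(c.Labels))
		for k, v := range c.Labels {
			out.Labels[k] = v
		}
	}
	if c.SLA != nil {
		sla := *c.SLA
		out.SLA = &sla
	}
	return &out
}

func main() {
	cached := &PipelineConfig{
		Name:    "daily",
		Sensors: []string{"a"},
		Labels:  map[string]string{"env": "prod"},
		SLA:     &SLAConfig{Deadline: "06:00"},
	}
	cp := cached.DeepCopy()
	cp.Sensors[0] = "changed"
	cp.Labels["env"] = "dev"
	cp.SLA.Deadline = "09:00"

	// The cached original is untouched.
	fmt.Println(cached.Sensors[0], cached.Labels["env"], cached.SLA.Deadline) // a prod 06:00
}
```

Compared with a JSON roundtrip, a hand-written copy avoids serialization cost and keeps fields that carry no JSON tag, at the price of having to update DeepCopy whenever a reference-typed field is added.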

Fixed

HandleWatchdog silent error suppression — now returns aggregate errors via errors.Join
HandleWatchdog degraded-state signaling — publishes WATCHDOG_DEGRADED event when checks fail
Config cache isolation — typed DeepCopy() replaces JSON marshal/unmarshal roundtrip
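
The aggregate-error pattern can be sketched as follows. The check type, the check names, and runChecks are hypothetical; only the use of errors.Join (Go 1.20+) comes from the release notes.

```go
package main

import (
	"errors"
	"fmt"
)

// check is a hypothetical stand-in for one watchdog scan.
type check struct {
	name string
	run  func() error
}

var errScan = errors.New("scan failed")

// runChecks runs every check, even after a failure, and returns all failures
// joined into one error. errors.Join returns nil when there is nothing to join.
func runChecks(checks []check) error {
	var errs []error
	for _, c := range checks {
		if err := c.run(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", c.name, err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	err := runChecks([]check{
		{"missed-schedules", func() error { return nil }},
		{"trigger-deadlines", func() error { return errScan }},
		{"sla-alerts", func() error { return errScan }},
	})
	if err != nil {
		// A caller could publish a degraded-state event at this point.
		fmt.Println("watchdog degraded:")
		fmt.Println(err)
	}
}
```

Because each failure is wrapped with %w before joining, errors.Is and errors.As still match the underlying causes on the aggregate.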

14 Mar 06:11

dwsmith1983

v0.9.3

472d942

Security

Encryption at rest for DynamoDB and SQS — All DynamoDB tables explicitly enable server-side encryption. SQS queues use KMS encryption. New optional kms_key_arn variable for custom CMK; defaults to AWS-managed keys.
SSRF protection on trigger HTTP clients — Custom http.Transport with dial-time IP validation rejects connections to private, loopback, link-local, and multicast addresses. Protects HTTP, Airflow, and Databricks triggers.
EventBridge PutEvents partial failure detection — publishEvent now checks FailedEntryCount on the response. Previously, partial failures were silently discarded.
Command trigger shell injection eliminated — Replaced sh -c with direct exec.CommandContext + strings.Fields argument splitting.
lambda_trigger_arns wildcard default removed — Explicit ARN list required when triggers are enabled.
Slack plaintext token deprecation warning — Terraform check block warns at plan time when plaintext token is used without Secrets Manager.
Trigger IAM policy scoping — New variables glue_job_arns, emr_cluster_arns, emr_serverless_app_arns, sfn_trigger_arns with preconditions requiring non-empty values when the corresponding trigger is enabled.
EventBridge bus resource policy — Restricts PutEvents to Lambda execution roles only.
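
As a rough sketch of dial-time IP validation, the standard library's net.Dialer accepts a Control hook that runs after DNS resolution and before the connection is made. The names isBlockedIP, blockedHost, and newSafeClient and the timeouts are assumptions; the project's actual transport is not shown in the release notes.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"syscall"
	"time"
)

// isBlockedIP reports whether ip is in a range the notes list as rejected:
// private, loopback, link-local, or multicast.
func isBlockedIP(ip net.IP) bool {
	return ip.IsPrivate() || ip.IsLoopback() ||
		ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast() || ip.IsMulticast()
}

// blockedHost parses a literal IP; anything unparseable is treated as blocked.
func blockedHost(host string) bool {
	ip := net.ParseIP(host)
	return ip == nil || isBlockedIP(ip)
}

// newSafeClient (hypothetical name) checks the resolved address at dial time,
// so a public hostname that resolves to an internal IP is rejected as well.
func newSafeClient() *http.Client {
	dialer := &net.Dialer{
		Timeout: 5 * time.Second,
		Control: func(network, address string, _ syscall.RawConn) error {
			host, _, err := net.SplitHostPort(address)
			if err != nil {
				return err
			}
			if blockedHost(host) {
				return fmt.Errorf("dial to %s blocked", address)
			}
			return nil
		},
	}
	return &http.Client{
		Transport: &http.Transport{DialContext: dialer.DialContext},
		Timeout:   10 * time.Second,
	}
}

func main() {
	fmt.Println(blockedHost("169.254.169.254"), blockedHost("8.8.8.8")) // true false

	_, err := newSafeClient().Get("http://127.0.0.1:8080/")
	fmt.Println("request error:", err)
}
```

Validating in the dialer rather than on the URL string closes the gap where a hostname passes a pre-check and then resolves to a different address when the connection is opened.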

Bug Fixes

Drift detection silently skipped zero values — New ExtractFloatOk distinguishes absent from zero. Shared DetectDrift function consolidates 3 duplicated drift comparison sites.
RemapPerPeriodSensors map mutation during range — Staging map now collects additions, merged after iteration.
Orphaned rerun burned retry budget — Reordered to lock-first, then write rerun record.
Stream router discarded partial batch failures — Now returns DynamoDBEventResponse with per-record BatchItemFailures.
SLA_MET published when pipeline never ran — Now checks for trigger existence first.
Trigger deadline used SLA timezone instead of schedule timezone — Now resolves against the schedule timezone, falling back to the SLA timezone only when no schedule timezone is set.
Validation mode case-sensitive — Now uses strings.ToUpper(mode).
Epoch timestamp unit mismatch in rerun freshness — Timestamps below 1e12 (seconds) are now converted to milliseconds.
Post-run baseline field collision — Now namespaced by rule key. Existing flat baselines self-heal on next pipeline completion.
publishEvent errors silently discarded in SLA reconcile — Replaced _ = publishEvent(...) with error-logged calls.
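
Two of the fixes above lend themselves to a short sketch: telling an absent field from a zero value, and normalizing epoch seconds to milliseconds. The signatures of extractFloatOk, detectDrift, and toMillis below are assumptions; the release notes name ExtractFloatOk and DetectDrift but do not show them.

```go
package main

import (
	"fmt"
	"math"
)

// extractFloatOk returns the numeric value for key and whether it was present.
// A present zero yields (0, true); a missing key yields (0, false).
func extractFloatOk(data map[string]any, key string) (float64, bool) {
	v, present := data[key]
	if !present {
		return 0, false
	}
	switch n := v.(type) {
	case float64:
		return n, true
	case int:
		return float64(n), true
	case int64:
		return float64(n), true
	}
	return 0, false
}

// detectDrift compares only when both sides are present, so a real drop to
// zero counts as drift while a missing field does not.
func detectDrift(baseline, current map[string]any, field string, threshold float64) bool {
	b, okB := extractFloatOk(baseline, field)
	c, okC := extractFloatOk(current, field)
	if !okB || !okC {
		return false
	}
	return math.Abs(c-b) > threshold
}

// toMillis treats values below 1e12 as epoch seconds and converts them.
func toMillis(ts int64) int64 {
	if ts < 1e12 {
		return ts * 1000
	}
	return ts
}

func main() {
	baseline := map[string]any{"count": 120}
	fmt.Println(detectDrift(baseline, map[string]any{"count": 0}, "count", 0)) // true
	fmt.Println(detectDrift(baseline, map[string]any{}, "count", 0))           // false
	fmt.Println(toMillis(1700000000), toMillis(1700000000000))                 // 1700000000000 1700000000000
}
```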

Changed

Extracted resolveHTTPClient() replacing identical 7-line blocks in ExecuteHTTP and ExecuteAirflow.
Extracted createSLASchedules() replacing duplicated warning/breach schedule loops in watchdog and sla-monitor.
Split watchdog.go (1079 lines) into 5 focused files by domain (~200 lines each).

13 Mar 13:44

dwsmith1983

v0.9.1

be5b24d

Added

DRY_RUN_COMPLETED event — terminal event that closes the dry-run observation loop for each evaluation period. Published after DRY_RUN_WOULD_TRIGGER and DRY_RUN_SLA_PROJECTION, carrying the SLA verdict (met, breach, or n/a) so operators can see each period resolve.

Fixed

Dry-run pipelines could start Step Function executions via rerun and job-failure paths — handleRerunRequest and handleJobFailure did not check cfg.DryRun before calling startSFNWithName, allowing rerun requests and job failure retries to start real SFN executions for dry-run pipelines. Added dry-run guards in both handlers and defense-in-depth in startSFNWithName to suppress execution unconditionally. Watchdog reconciliation loop now skips dry-run pipelines to prevent orphaned trigger locks.
Watchdog scheduled real SLA alerts for dry-run pipelines — scheduleSLAAlerts, detectMissedSchedules, detectMissedInclusionSchedules, checkTriggerDeadlines, detectMissingPostRunSensors, and detectRelativeSLABreaches all iterated over dry-run pipelines without checking cfg.DryRun. This caused EventBridge Scheduler entries for SLA_WARNING/SLA_BREACH, SCHEDULE_MISSED events, and RELATIVE_SLA_BREACH alerts to fire for observation-only pipelines. Added cfg.DryRun guard to all six watchdog functions.
Duplicate JOB_COMPLETED alerts for polled jobs — handleCheckJob in the orchestrator published JOB_COMPLETED when polling detected success, but the stream-router's handleJobSuccess also published the same event when the JOB# record arrived via DynamoDB stream. Removed the orchestrator emission; the stream-router is now the single canonical source for JOB_COMPLETED across all job types.
Missing event types in alert EventBridge rule — added SENSOR_DEADLINE_EXPIRED, IRREGULAR_SCHEDULE_MISSED, RELATIVE_SLA_WARNING, RELATIVE_SLA_BREACH, and all DRY_RUN_* events to the alert rule so they route to Slack via alert-dispatcher.

11 Mar 11:58

dwsmith1983

v0.8.0

4924e86

Added

Inclusion calendar scheduling (schedule.include.dates) — explicit YYYY-MM-DD date lists for pipelines that run on known irregular dates (monthly close, quarterly filing, specific business dates). Mutually exclusive with cron. Watchdog detects missed inclusion dates and publishes IRREGULAR_SCHEDULE_MISSED events.
Relative SLA (sla.maxDuration) — duration-based SLA for ad-hoc pipelines with no predictable schedule. Clock starts at first sensor arrival and covers the entire lifecycle: evaluation → trigger → job → post-run → completion. Warning at 75% of maxDuration (or breachAt - expectedDuration when set). New events: RELATIVE_SLA_WARNING, RELATIVE_SLA_BREACH.
First-sensor-arrival tracking — stream-router records first-sensor-arrival#<date> on lock acquisition (idempotent conditional write). Used as T=0 for relative SLA calculation.
Watchdog defense-in-depth for relative SLA — detectRelativeSLABreaches scans pipelines with maxDuration config and fires RELATIVE_SLA_BREACH if the EventBridge scheduler failed to fire.
WriteSensorIfAbsent store method — conditional PutItem that only writes if the key doesn't exist, used for first-sensor-arrival idempotency.
Config validation for new fields: cron/include mutual exclusion, inclusion date format (YYYY-MM-DD), maxDuration format and 24h cap, maxDuration requires trigger.
Glue false-success detection — verifyGlueRCA now checks both the RCA insight stream (Check 1) and the driver output stream for ERROR/FATAL log4j severity markers (Check 2). Catches Spark failures that Glue reports as SUCCEEDED when the application framework swallows exceptions.
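
The relative SLA arithmetic described above can be sketched as below. The function name, its string-based parameters, and the error messages are assumptions; the 75% warning point, the breachAt - expectedDuration alternative, and the 24h cap come from the notes.

```go
package main

import (
	"fmt"
	"time"
)

// relativeSLA computes warning and breach instants from the first sensor
// arrival (T=0). An empty expectedDuration selects the 75% warning point.
func relativeSLA(arrival, maxDuration, expectedDuration string) (warnAt, breachAt time.Time, err error) {
	t0, err := time.Parse(time.RFC3339, arrival)
	if err != nil {
		return time.Time{}, time.Time{}, fmt.Errorf("parse arrival: %w", err)
	}
	limit, err := time.ParseDuration(maxDuration)
	if err != nil {
		return time.Time{}, time.Time{}, fmt.Errorf("parse maxDuration: %w", err)
	}
	if limit <= 0 || limit > 24*time.Hour {
		return time.Time{}, time.Time{}, fmt.Errorf("maxDuration %s outside (0, 24h]", limit)
	}

	breachAt = t0.Add(limit)
	warnAt = t0.Add(limit * 3 / 4) // default: 75% of maxDuration
	if expectedDuration != "" {
		expected, perr := time.ParseDuration(expectedDuration)
		if perr != nil {
			return time.Time{}, time.Time{}, fmt.Errorf("parse expectedDuration: %w", perr)
		}
		warnAt = breachAt.Add(-expected) // breachAt - expectedDuration
	}
	return warnAt, breachAt, nil
}

func main() {
	warn, breach, _ := relativeSLA("2025-03-01T02:00:00Z", "4h", "")
	fmt.Println(warn.Format("15:04"), breach.Format("15:04")) // 05:00 06:00

	warn, breach, _ = relativeSLA("2025-03-01T02:00:00Z", "4h", "30m")
	fmt.Println(warn.Format("15:04"), breach.Format("15:04")) // 05:30 06:00
}
```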

Changed

SLAConfig.Deadline and SLAConfig.ExpectedDuration are now omitempty — relative SLA configs may omit the wall-clock deadline entirely.
SFN ASL passes maxDuration and sensorArrivalAt to CancelSLASchedules and CancelSLAOnCompleteTriggerFailure states.
sla-monitor handleSLACalculate routes to relative path when MaxDuration + SensorArrivalAt are present.

10 Mar 16:13

dwsmith1983

v0.7.4

f7fd273

Fixed

False SLA warnings/breaches for sensor-triggered daily pipelines — scheduleSLAAlerts resolved the SLA deadline against today's date, but sensor-triggered daily pipelines run T+1 (data for today completes tomorrow). The SLA calculation now shifts the deadline date by +1 day for sensor-triggered daily pipelines.
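
A minimal sketch of the T+1 shift, assuming an HH:MM deadline anchored to a data date. The resolveDeadline helper and its parameters are hypothetical and only illustrate the date arithmetic, not the project's scheduleSLAAlerts code.

```go
package main

import (
	"fmt"
	"time"
)

// resolveDeadline anchors an HH:MM deadline to a data date, then shifts it
// one day for sensor-triggered daily pipelines, whose data for a given date
// completes on the following day.
func resolveDeadline(dataDate, hhmm, tz string, sensorTriggeredDaily bool) (time.Time, error) {
	loc, err := time.LoadLocation(tz)
	if err != nil {
		return time.Time{}, fmt.Errorf("load timezone %q: %w", tz, err)
	}
	deadline, err := time.ParseInLocation("2006-01-02 15:04", dataDate+" "+hhmm, loc)
	if err != nil {
		return time.Time{}, fmt.Errorf("parse deadline: %w", err)
	}
	if sensorTriggeredDaily {
		deadline = deadline.AddDate(0, 0, 1) // T+1
	}
	return deadline, nil
}

func main() {
	cron, _ := resolveDeadline("2025-03-10", "06:00", "UTC", false)
	sensor, _ := resolveDeadline("2025-03-10", "06:00", "UTC", true)
	fmt.Println(cron.Format(time.RFC3339))   // 2025-03-10T06:00:00Z
	fmt.Println(sensor.Format(time.RFC3339)) // 2025-03-11T06:00:00Z
}
```

AddDate is used instead of adding 24 hours so the wall-clock deadline stays the same across daylight-saving transitions.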

09 Mar 23:55

dwsmith1983

v0.7.3

34cfc38

Added

Configurable drift detection field (PostRunConfig.DriftField) — specifies which sensor field to compare for post-run drift detection. Defaults to "sensor_count" for backward compatibility.

Fixed

Post-run drift detection never fired when sensor field was count — drift comparison was hardcoded to "sensor_count" but bronze consumers write "count" in hourly-status sensors, causing both baseline and current values to return 0 and silently suppressing all drift detection.
Two flaky time-sensitive tests — TestSLAMonitor_Reconcile_PastWarningFutureBreach and TestWatchdog_MissedSchedule_DetailFields used real wall-clock time instead of injected NowFunc, causing failures depending on time of day.

09 Mar 16:29

dwsmith1983

v0.7.2

f045ac8

Added

Sensor trigger deadline: Configurable trigger.deadline field (:MM or HH:MM) defines when the auto-trigger window closes for sensor-triggered pipelines. After expiry, the watchdog writes FAILED_FINAL and publishes SENSOR_DEADLINE_EXPIRED. Manual restart via RERUN_REQUEST remains available.
Proactive SLA scheduling for sensor-triggered pipelines: Removed the cron-only guard that silently skipped sensor-triggered pipelines from proactive SLA alert scheduling. Sensor pipelines now receive warning and breach alerts even when their trigger condition is never met.
TOCTOU-safe conditional write: CreateTriggerIfAbsent uses DynamoDB attribute_not_exists(PK) to prevent race conditions between trigger acquisition and deadline enforcement.
Independent trigger deadline watchdog scan: checkTriggerDeadlines runs as its own scan in the watchdog, decoupled from SLA scheduling. Pipelines with a trigger deadline but no SLA config are now covered.

Changed

CloudWatch alarms: Lambda errors, Step Function failures, DLQ depth, and stream iterator age alarms with EventBridge input transformers for alarm routing.
Secrets Manager integration: slack_secret_arn variable allows the alert-dispatcher to retrieve its Slack token from Secrets Manager instead of environment variables.
Environment variable scoping: os.ExpandEnv restricted to INTERLOCK_ prefixed variables across Airflow, Databricks, and Lambda triggers.
Lambda concurrency limits: lambda_concurrency variable sets reserved concurrent executions per function.
Config cache deep copy: JSON round-trip prevents shared state mutation across concurrent reads.
Time injection: time.Now() replaced with d.now() across all handlers for deterministic testing.
Audit logging: 9 silent audit write error discards now emit WARN-level log entries.
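
Scoped environment expansion can be sketched with os.Expand and a filtering lookup. The expandScoped name, the injectable lookup parameter, and the choice to replace disallowed variables with an empty string are assumptions; the notes only state that expansion is restricted to INTERLOCK_ prefixed variables.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const allowedPrefix = "INTERLOCK_"

// expandScoped expands $VAR and ${VAR} references, but only consults lookup
// for names with the allowed prefix. Other names expand to an empty string
// (an assumption), so unrelated process secrets cannot leak into payloads.
func expandScoped(s string, lookup func(string) string) string {
	return os.Expand(s, func(name string) string {
		if strings.HasPrefix(name, allowedPrefix) {
			return lookup(name)
		}
		return ""
	})
}

func main() {
	os.Setenv("INTERLOCK_API_TOKEN", "abc123")
	os.Setenv("AWS_SECRET_ACCESS_KEY", "do-not-leak")

	out := expandScoped("token=${INTERLOCK_API_TOKEN} key=${AWS_SECRET_ACCESS_KEY}", os.Getenv)
	fmt.Println(out) // token=abc123 key=
}
```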

Fixed

Trigger lock release on SFN start failure: Previously caused a 4.5-hour deadlock when StartExecution failed.
SLA alert scheduling skip-on-error: scheduleSLAAlerts no longer falls through to subsequent pipelines on error.
Alert-dispatcher env validation: Missing EVENTS_TABLE and EVENTS_TTL_DAYS now checked at startup.

08 Mar 14:50

dwsmith1983

v0.7.1

f046124

Fixed

Glue RCA false-positive failure classification (Check 2 removed): The verifyGlueRCA Check 2 filter pattern (?Exception ?Error ?FATAL ...) matched benign JVM startup output in Glue's stderr (/aws-glue/jobs/error), causing every SUCCEEDED Glue job to be reclassified as FAILED. Classpath entries like -XX:OnOutOfMemoryError and Glue's internal AnalyzerLogHelper messages contain "Error" as substrings, producing a 100% false positive rate. Removed Check 2 entirely — Check 1 (GlueExceptionAnalysisJobFailed in the RCA log stream) is Glue's purpose-built mechanism for detecting false successes. Post-run validation provides the application-level safety net for data quality issues. (#60)
sla-monitor missing DynamoDB env vars and IAM: Added CONTROL_TABLE, JOBLOG_TABLE, RERUN_TABLE environment variables and a read-only DynamoDB IAM policy for the sla-monitor Lambda. These are required for alert suppression logic (GetTrigger, GetLatestJobEvent) but were never configured in Terraform. Also removed STATE_MACHINE_ARN from orchestrator's ValidateEnv requirements since the orchestrator never starts SFN executions. (#59)

08 Mar 12:33

dwsmith1983

v0.7.0

96a65ee

Added

Event-driven post-run monitoring: Post-run drift detection moves from SFN poll loop to DynamoDB Stream processing. Stream-router compares sensor values against date-scoped baselines captured at trigger completion. Drift above a configurable threshold triggers a rerun via the existing circuit breaker.
Configurable drift threshold: PostRunConfig.DriftThreshold (*float64) sets the minimum sensor count change to trigger drift. Defaults to 0 (any change).
Watchdog post-run sensor absence detection: Detects pipelines that completed without receiving post-run sensor data after a configurable SensorTimeout grace period. Publishes POST_RUN_SENSOR_MISSING event.
Typed trigger errors: TriggerError with Category (PERMANENT/TRANSIENT) and Unwrap() support. HTTP triggers return 4xx as PERMANENT, 5xx as TRANSIENT.
ConfigCache deep copy: GetAll() returns a deep copy of cached configs, preventing shared state mutation.
Validation engine string parsing: toFloat64 handles string-typed numeric fields via strconv.ParseFloat.
New EventBridge events: BASELINE_CAPTURE_FAILED, PIPELINE_EXCLUDED, RERUN_ACCEPTED for complete audit trail coverage across all code paths.
E2E test coverage: 8 new end-to-end test groups plus 30+ unit tests covering safety gap scenarios.
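
The typed trigger error could look something like this sketch. TriggerError, Category, and Unwrap() are named in the notes; the field layout, the classifyHTTP and isRetryable helpers, and the message format are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

type Category string

const (
	Permanent Category = "PERMANENT"
	Transient Category = "TRANSIENT"
)

// TriggerError tags an underlying error with a retry category.
type TriggerError struct {
	Category Category
	Err      error
}

func (e *TriggerError) Error() string { return fmt.Sprintf("%s: %v", e.Category, e.Err) }

// Unwrap exposes the cause so errors.Is and errors.As see through the wrapper.
func (e *TriggerError) Unwrap() error { return e.Err }

// classifyHTTP maps 4xx to PERMANENT and 5xx to TRANSIENT; other codes are not errors.
func classifyHTTP(status int) error {
	switch {
	case status >= 400 && status < 500:
		return &TriggerError{Category: Permanent, Err: fmt.Errorf("http status %d", status)}
	case status >= 500:
		return &TriggerError{Category: Transient, Err: fmt.Errorf("http status %d", status)}
	}
	return nil
}

// isRetryable reports whether err carries a TRANSIENT TriggerError anywhere in its chain.
func isRetryable(err error) bool {
	var te *TriggerError
	return errors.As(err, &te) && te.Category == Transient
}

func main() {
	wrapped := fmt.Errorf("execute trigger: %w", classifyHTTP(503))
	fmt.Println(wrapped)                        // execute trigger: TRANSIENT: http status 503
	fmt.Println(isRetryable(wrapped))           // true
	fmt.Println(isRetryable(classifyHTTP(404))) // false
}
```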

Changed

Post-run removed from SFN: 6 ASL states removed. Job success routes directly to CompleteTrigger. State count: 30 → 24.
Post-run removed from orchestrator: handlePostRun removed. Post-run logic is entirely stream-based.
Watchdog uses dependency-injected time: Deps.NowFunc replaces package variable for consistent testability.
SFN execution name truncation: Names truncated to 80 characters (AWS limit) at all construction sites.
Environment variable scoping: os.ExpandEnv restricted to INTERLOCK_ prefixed variables only.

Fixed

Calendar exclusion uses execution date: isExcludedDate checks the job's execution date (not time.Now()), preventing incorrect exclusions on re-runs for previous days. Supports YYYY-MM-DD and composite YYYY-MM-DDTHH formats.
Atomic lock reset: ResetTriggerLock uses single DynamoDB UpdateItem with attribute_exists(PK) condition, eliminating the race window between delete and create.
Lock release on SFN start failure: Both rerun and job-failure retry paths release the trigger lock if StartExecution fails, preventing permanently stuck pipelines.
Terminal trigger status on calendar exclusion: handleJobFailure sets FAILED_FINAL instead of leaving the lock in RUNNING state to silently expire via TTL.
ASL CompleteTrigger failure path: New states ensure SLA schedules are cancelled before entering terminal Fail state.
Event ordering: RERUN_ACCEPTED only publishes after ResetTriggerLock confirms lock atomicity.
publishEvent error logging: All 17 call sites now log errors at WARN level.
SLA monitor error wrapping: createOneTimeSchedule wraps errors with schedule name context.
HTTP response body sanitization: Error bodies truncated to 512 bytes with control characters stripped.
DynamoDB table protection: All 4 tables have deletion_protection_enabled and point-in-time recovery enabled.
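
The response body sanitization above can be sketched as a truncate-then-strip step. The sanitizeBody name and the handling of a multi-byte rune split by truncation are assumptions; the 512-byte limit and control character stripping come from the notes.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

const maxErrorBody = 512

// sanitizeBody caps the body at maxErrorBody bytes, then removes control
// characters so an upstream response cannot inject newlines or escape
// sequences into logs and alerts.
func sanitizeBody(body []byte) string {
	if len(body) > maxErrorBody {
		body = body[:maxErrorBody]
	}
	// Drop any multi-byte rune cut in half by the truncation.
	s := strings.ToValidUTF8(string(body), "")
	return strings.Map(func(r rune) rune {
		if unicode.IsControl(r) {
			return -1 // a negative return drops the rune
		}
		return r
	}, s)
}

func main() {
	fmt.Printf("%q\n", sanitizeBody([]byte("403 Forbidden\r\n\x1b[31mdenied\x00")))
	// "403 Forbidden[31mdenied"
}
```

Truncating before stripping bounds the work done on an oversized response; the result can be shorter than 512 bytes once control characters are removed.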

08 Mar 05:32

dwsmith1983

v0.6.2

554c8cc

Added

Lambda trigger type: New TriggerLambda for direct AWS Lambda SDK invocation (RequestResponse). LambdaTriggerConfig with functionName and optional payload (supports env-var expansion). Non-polling CheckStatus.
Non-polling trigger synchronous completion: handleTrigger writes success joblog immediately for non-polling triggers (http, command, lambda) and returns sentinel runId so the Step Functions CheckJob JSONPath resolves.

Fixed

SFN crash on non-polling trigger success: Non-polling triggers returned empty runId (omitted via omitempty), causing $.triggerResult.runId JSONPath failure in the CheckJob state. Previously masked because http triggers always failed (403) into the retry/error path.
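
The omitempty behaviour behind this crash is easy to reproduce. In the sketch below the triggerResult struct, the encode helper, and the "sync" sentinel value are hypothetical; the notes only say that an empty runId was omitted and that a sentinel is now returned.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// triggerResult is a hypothetical shape for a trigger handler's output.
type triggerResult struct {
	RunID  string `json:"runId,omitempty"`
	Status string `json:"status"`
}

// syncRunID is a made-up sentinel for triggers that complete synchronously.
const syncRunID = "sync"

func encode(r triggerResult) string {
	b, err := json.Marshal(r)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Empty RunID: the key disappears, so a JSONPath such as
	// $.triggerResult.runId has nothing to resolve.
	fmt.Println(encode(triggerResult{Status: "SUCCEEDED"})) // {"status":"SUCCEEDED"}

	// A sentinel keeps the key present for downstream states.
	fmt.Println(encode(triggerResult{RunID: syncRunID, Status: "SUCCEEDED"})) // {"runId":"sync","status":"SUCCEEDED"}
}
```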
