Releases: dwsmith1983/interlock

v0.9.4

29 Mar 06:38
ead5e6e

Split internal/lambda into handler sub-packages

Refactored

  • Split monolithic internal/lambda/ (46 files) into handler-aligned sub-packages: orchestrator/, stream/, watchdog/, sla/, alert/, sink/
  • Extracted shared utilities into focused root files — publish, date, exclusion, sensor, schedule, config, terminal
  • Trigger config registry — replaced buildTriggerConfig switch statement with generic registry map
  • SLA deadline calculations wired through pkg/sla/ — pure functions decoupled from Lambda handler context
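
To illustrate the registry-map idea from the trigger config item above, here is a minimal sketch. The TriggerConfig fields, the builder signature, and the two registered types are assumptions made for illustration; the release notes do not show the real registry, and "generic" there may refer to Go type parameters, which this sketch does not use.

```go
package main

import (
	"fmt"
	"sort"
)

// TriggerConfig is a hypothetical stand-in for the project's real config type.
type TriggerConfig struct {
	Type           string
	TimeoutSeconds int
}

// builder turns raw settings into a TriggerConfig.
type builder func(raw map[string]string) (TriggerConfig, error)

// registry maps a trigger type to its builder. Supporting a new type means
// adding one entry here instead of another switch case.
var registry = map[string]builder{
	"http": func(raw map[string]string) (TriggerConfig, error) {
		if raw["url"] == "" {
			return TriggerConfig{}, fmt.Errorf("http trigger: url is required")
		}
		return TriggerConfig{Type: "http", TimeoutSeconds: 30}, nil
	},
	"command": func(raw map[string]string) (TriggerConfig, error) {
		if raw["command"] == "" {
			return TriggerConfig{}, fmt.Errorf("command trigger: command is required")
		}
		return TriggerConfig{Type: "command", TimeoutSeconds: 60}, nil
	},
}

// knownTypes lists the registered types in sorted order for error messages.
func knownTypes() []string {
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// buildTriggerConfig looks up the builder for kind and runs it.
func buildTriggerConfig(kind string, raw map[string]string) (TriggerConfig, error) {
	build, ok := registry[kind]
	if !ok {
		return TriggerConfig{}, fmt.Errorf("unknown trigger type %q (known: %v)", kind, knownTypes())
	}
	return build(raw)
}

func main() {
	cfg, err := buildTriggerConfig("http", map[string]string{"url": "https://example.com/hook"})
	fmt.Println(cfg, err)

	_, err = buildTriggerConfig("ftp", nil)
	fmt.Println(err)
}
```

An unknown type fails at lookup time with a message listing the registered types, which a switch statement's default branch would have to maintain by hand.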

Added

  • pkg/sla/ package — pure SLA deadline calculation functions
  • PipelineConfig.DeepCopy() method — safe config cache isolation without JSON roundtrip
  • EventWatchdogDegraded event type — watchdog health observability
  • Smoke tests for all 6 cmd/lambda/ packages: ValidateEnv coverage

Fixed

  • HandleWatchdog silent error suppression — now returns aggregate errors via errors.Join
  • HandleWatchdog degraded-state signaling — publishes WATCHDOG_DEGRADED event when checks fail
  • Config cache isolation — typed DeepCopy() replaces JSON marshal/unmarshal roundtrip
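
The aggregate-error fix can be pictured with a short sketch. It assumes a watchdog that runs a list of independent checks; the check type, the sample checks, and the sentinel error are hypothetical, and only errors.Join (Go 1.20+) is taken from the notes above.

```go
package main

import (
	"errors"
	"fmt"
)

// errStaleLock is a hypothetical sentinel used only in this example.
var errStaleLock = errors.New("stale trigger lock")

// check is one named watchdog scan.
type check struct {
	name string
	run  func() error
}

// runChecks runs every check, even after a failure, and returns all failures
// combined with errors.Join. The result is nil when every check passes.
func runChecks(checks []check) error {
	var errs []error
	for _, c := range checks {
		if err := c.run(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", c.name, err))
		}
	}
	return errors.Join(errs...)
}

// hasStaleLock shows that errors.Is still sees through the joined error.
func hasStaleLock(err error) bool {
	return errors.Is(err, errStaleLock)
}

// sampleChecks returns two failing checks and one passing check.
func sampleChecks() []check {
	return []check{
		{name: "missed-schedules", run: func() error { return nil }},
		{name: "stale-locks", run: func() error { return errStaleLock }},
		{name: "sla-alerts", run: func() error { return errors.New("scheduler unavailable") }},
	}
}

func main() {
	err := runChecks(sampleChecks())
	fmt.Println(err)
	fmt.Println("degraded:", err != nil, "stale lock:", hasStaleLock(err))
}
```

A caller could treat a non-nil result as the signal to publish a degraded-state event, while still surfacing every individual failure to the Lambda runtime.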

v0.9.3

14 Mar 06:11
472d942

Security

  • Encryption at rest for DynamoDB and SQS — All DynamoDB tables explicitly enable server-side encryption. SQS queues use KMS encryption. New optional kms_key_arn variable for custom CMK; defaults to AWS-managed keys.
  • SSRF protection on trigger HTTP clients — Custom http.Transport with dial-time IP validation rejects connections to private, loopback, link-local, and multicast addresses. Protects HTTP, Airflow, and Databricks triggers.
  • EventBridge PutEvents partial failure detection: publishEvent now checks FailedEntryCount on the response. Previously, partial failures were silently discarded.
  • Command trigger shell injection eliminated — Replaced sh -c with direct exec.CommandContext + strings.Fields argument splitting.
  • lambda_trigger_arns wildcard default removed — Explicit ARN list required when triggers are enabled.
  • Slack plaintext token deprecation warning — Terraform check block warns at plan time when plaintext token is used without Secrets Manager.
  • Trigger IAM policy scoping — New variables glue_job_arns, emr_cluster_arns, emr_serverless_app_arns, sfn_trigger_arns with preconditions requiring non-empty values when the corresponding trigger is enabled.
  • EventBridge bus resource policy — Restricts PutEvents to Lambda execution roles only.
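
For the SSRF item above, dial-time validation can be sketched with the Control hook on net.Dialer, which runs after DNS resolution and just before the socket connects. This is one plausible shape under stated assumptions, not the project's actual transport: the function names and timeouts are invented, and rejecting the unspecified address (0.0.0.0) is an extra check not mentioned in the notes.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"syscall"
	"time"
)

// isBlockedIP reports whether ip is in a range that outbound trigger calls
// should never reach. IsUnspecified is an addition beyond the release notes.
func isBlockedIP(ip net.IP) bool {
	return ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() ||
		ip.IsLinkLocalMulticast() || ip.IsMulticast() || ip.IsUnspecified()
}

// ssrfControl is called with the resolved address, so a hostname that
// resolves to an internal IP is rejected just like a literal internal IP.
func ssrfControl(network, address string, _ syscall.RawConn) error {
	host, _, err := net.SplitHostPort(address)
	if err != nil {
		return fmt.Errorf("ssrf guard: %w", err)
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return fmt.Errorf("ssrf guard: %q is not a plain IP address", host)
	}
	if isBlockedIP(ip) {
		return fmt.Errorf("ssrf guard: refusing to dial %s", ip)
	}
	return nil
}

// newGuardedClient builds an HTTP client whose every connection attempt
// passes through ssrfControl.
func newGuardedClient() *http.Client {
	dialer := &net.Dialer{Timeout: 10 * time.Second, Control: ssrfControl}
	return &http.Client{
		Timeout:   30 * time.Second,
		Transport: &http.Transport{DialContext: dialer.DialContext},
	}
}

func main() {
	addrs := []string{"127.0.0.1:80", "10.0.0.5:443", "169.254.169.254:80", "93.184.216.34:443"}
	for _, addr := range addrs {
		fmt.Printf("%-22s %v\n", addr, ssrfControl("tcp4", addr, nil))
	}
	_ = newGuardedClient() // use in place of http.DefaultClient for trigger calls
}
```

Checking at dial time rather than when the URL is parsed also covers redirects and DNS rebinding, because every new connection goes through the same hook.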

Bug Fixes

  • Drift detection silently skipped zero values — New ExtractFloatOk distinguishes absent from zero. Shared DetectDrift function consolidates 3 duplicated drift comparison sites.
  • RemapPerPeriodSensors map mutation during range — Staging map now collects additions, merged after iteration.
  • Orphaned rerun burned retry budget — Reordered to lock-first, then write rerun record.
  • Stream router discarded partial batch failures — Now returns DynamoDBEventResponse with per-record BatchItemFailures.
  • SLA_MET published when pipeline never ran — Now checks for trigger existence first.
  • Trigger deadline used SLA timezone instead of schedule timezone — Falls back to SLA timezone if schedule timezone is not set.
  • Validation mode case-sensitive — Now uses strings.ToUpper(mode).
  • Epoch timestamp unit mismatch in rerun freshness — Timestamps below 1e12 (seconds) are now converted to milliseconds.
  • Post-run baseline field collision — Now namespaced by rule key. Existing flat baselines self-heal on next pipeline completion.
  • publishEvent errors silently discarded in SLA reconcile — Replaced _ = publishEvent(...) with error-logged calls.
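
The absent-versus-zero distinction behind the drift fix can be shown in a few lines. The signatures below are guesses at what ExtractFloatOk and DetectDrift might look like, written in lowercase to mark them as illustrative; the strict greater-than comparison is an assumption chosen so that a threshold of 0 means "any change".

```go
package main

import (
	"fmt"
	"math"
)

// extractFloatOk returns the numeric value of field and whether it was
// present, so a stored 0 is distinguishable from a missing key.
func extractFloatOk(data map[string]any, field string) (float64, bool) {
	v, present := data[field]
	if !present {
		return 0, false
	}
	switch n := v.(type) {
	case float64:
		return n, true
	case int:
		return float64(n), true
	case int64:
		return float64(n), true
	}
	return 0, false // present but not numeric
}

// detectDrift compares one field across baseline and current sensor data.
// ok is false when either side lacks the field, so callers can tell
// "no drift" apart from "could not compare".
func detectDrift(baseline, current map[string]any, field string, threshold float64) (drifted, ok bool) {
	b, okB := extractFloatOk(baseline, field)
	c, okC := extractFloatOk(current, field)
	if !okB || !okC {
		return false, false
	}
	return math.Abs(c-b) > threshold, true
}

func main() {
	baseline := map[string]any{"count": 100.0}

	// A drop to zero is real drift, not a missing value.
	fmt.Println(detectDrift(baseline, map[string]any{"count": 0.0}, "count", 0))

	// A missing field is reported as not comparable.
	fmt.Println(detectDrift(baseline, map[string]any{}, "count", 0))
}
```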

Changed

  • Extracted resolveHTTPClient() replacing identical 7-line blocks in ExecuteHTTP and ExecuteAirflow.
  • Extracted createSLASchedules() replacing duplicated warning/breach schedule loops in watchdog and sla-monitor.
  • Split watchdog.go (1079 lines) into 5 focused files by domain (~200 lines each).

v0.9.1

13 Mar 13:44
be5b24d

Added

  • DRY_RUN_COMPLETED event — terminal event that closes the dry-run observation loop for each evaluation period. Published after DRY_RUN_WOULD_TRIGGER and DRY_RUN_SLA_PROJECTION, carrying the SLA verdict (met, breach, or n/a) so operators can see each period resolve.

Fixed

  • Dry-run pipelines could start Step Function executions via rerun and job-failure paths: handleRerunRequest and handleJobFailure did not check cfg.DryRun before calling startSFNWithName, allowing rerun requests and job failure retries to start real SFN executions for dry-run pipelines. Added dry-run guards in both handlers and defense-in-depth in startSFNWithName to suppress execution unconditionally. Watchdog reconciliation loop now skips dry-run pipelines to prevent orphaned trigger locks.
  • Watchdog scheduled real SLA alerts for dry-run pipelines: scheduleSLAAlerts, detectMissedSchedules, detectMissedInclusionSchedules, checkTriggerDeadlines, detectMissingPostRunSensors, and detectRelativeSLABreaches all iterated over dry-run pipelines without checking cfg.DryRun. This caused EventBridge Scheduler entries for SLA_WARNING/SLA_BREACH, SCHEDULE_MISSED events, and RELATIVE_SLA_BREACH alerts to fire for observation-only pipelines. Added cfg.DryRun guard to all six watchdog functions.
  • Duplicate JOB_COMPLETED alerts for polled jobs: handleCheckJob in the orchestrator published JOB_COMPLETED when polling detected success, but the stream-router's handleJobSuccess also published the same event when the JOB# record arrived via DynamoDB stream. Removed the orchestrator emission; the stream-router is now the single canonical source for JOB_COMPLETED across all job types.
  • Missing event types in alert EventBridge rule — added SENSOR_DEADLINE_EXPIRED, IRREGULAR_SCHEDULE_MISSED, RELATIVE_SLA_WARNING, RELATIVE_SLA_BREACH, and all DRY_RUN_* events to the alert rule so they route to Slack via alert-dispatcher.

v0.8.0

11 Mar 11:58
4924e86

Added

  • Inclusion calendar scheduling (schedule.include.dates) — explicit YYYY-MM-DD date lists for pipelines that run on known irregular dates (monthly close, quarterly filing, specific business dates). Mutually exclusive with cron. Watchdog detects missed inclusion dates and publishes IRREGULAR_SCHEDULE_MISSED events.
  • Relative SLA (sla.maxDuration) — duration-based SLA for ad-hoc pipelines with no predictable schedule. Clock starts at first sensor arrival and covers the entire lifecycle: evaluation → trigger → job → post-run → completion. Warning at 75% of maxDuration (or breachAt - expectedDuration when set). New events: RELATIVE_SLA_WARNING, RELATIVE_SLA_BREACH.
  • First-sensor-arrival tracking — stream-router records first-sensor-arrival#<date> on lock acquisition (idempotent conditional write). Used as T=0 for relative SLA calculation.
  • Watchdog defense-in-depth for relative SLA: detectRelativeSLABreaches scans pipelines with maxDuration config and fires RELATIVE_SLA_BREACH if the EventBridge scheduler failed to fire.
  • WriteSensorIfAbsent store method — conditional PutItem that only writes if the key doesn't exist, used for first-sensor-arrival idempotency.
  • Config validation for new fields: cron/include mutual exclusion, inclusion date format (YYYY-MM-DD), maxDuration format and 24h cap, maxDuration requires trigger.
  • Glue false-success detection: verifyGlueRCA now checks both the RCA insight stream (Check 1) and the driver output stream for ERROR/FATAL log4j severity markers (Check 2). Catches Spark failures that Glue reports as SUCCEEDED when the application framework swallows exceptions.
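
The relative SLA arithmetic described above (breach at arrival + maxDuration, warning at 75% of maxDuration or at breachAt - expectedDuration when set) can be worked through in a small sketch. The type and function names are hypothetical, and treating a zero expectedDuration as "not set" is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// relativeSLA holds the two alert times derived from first sensor arrival.
type relativeSLA struct {
	Warning time.Time
	Breach  time.Time
}

// relativeDeadlines computes warning and breach times from T=0 (arrival).
// expectedDuration == 0 is treated as "not set" in this sketch.
func relativeDeadlines(arrival time.Time, maxDuration, expectedDuration time.Duration) relativeSLA {
	breach := arrival.Add(maxDuration)
	warning := arrival.Add(maxDuration * 3 / 4) // default: 75% of maxDuration
	if expectedDuration > 0 {
		warning = breach.Add(-expectedDuration)
	}
	return relativeSLA{Warning: warning, Breach: breach}
}

// exampleArrival is a fixed first-sensor-arrival time used by the example.
var exampleArrival = time.Date(2025, 3, 11, 8, 0, 0, 0, time.UTC)

func main() {
	// maxDuration of 2h with no expectedDuration: warning 90 minutes in.
	sla := relativeDeadlines(exampleArrival, 2*time.Hour, 0)
	fmt.Println(sla.Warning.Format(time.RFC3339), sla.Breach.Format(time.RFC3339))

	// With expectedDuration of 45m: warning 45 minutes before breach.
	sla = relativeDeadlines(exampleArrival, 2*time.Hour, 45*time.Minute)
	fmt.Println(sla.Warning.Format(time.RFC3339), sla.Breach.Format(time.RFC3339))
}
```

With an 08:00 arrival and a 2h maxDuration, the breach lands at 10:00 and the default warning at 09:30; setting expectedDuration to 45m moves the warning to 09:15.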

Changed

  • SLAConfig.Deadline and SLAConfig.ExpectedDuration are now omitempty — relative SLA configs may omit the wall-clock deadline entirely.
  • SFN ASL passes maxDuration and sensorArrivalAt to CancelSLASchedules and CancelSLAOnCompleteTriggerFailure states.
  • sla-monitor handleSLACalculate routes to relative path when MaxDuration + SensorArrivalAt are present.

v0.7.4

10 Mar 16:13
f7fd273

Fixed

  • False SLA warnings/breaches for sensor-triggered daily pipelines: scheduleSLAAlerts resolved the SLA deadline against today's date, but sensor-triggered daily pipelines run T+1 (data for today completes tomorrow). The SLA calculation now shifts the deadline date by +1 day for sensor-triggered daily pipelines.
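
A worked example of the T+1 shift, under stated assumptions: the function name, its parameters, and the HH:MM deadline format are illustrative, and the data date is passed in explicitly rather than read from the clock.

```go
package main

import (
	"fmt"
	"time"
)

// slaDeadline resolves a wall-clock deadline (HH:MM) for a given data date.
// For sensor-triggered daily pipelines the data for date D completes on D+1,
// so the deadline is placed on the following day.
func slaDeadline(date, deadline, tz string, sensorTriggeredDaily bool) (time.Time, error) {
	loc, err := time.LoadLocation(tz)
	if err != nil {
		return time.Time{}, fmt.Errorf("load timezone %q: %w", tz, err)
	}
	day, err := time.ParseInLocation("2006-01-02", date, loc)
	if err != nil {
		return time.Time{}, fmt.Errorf("parse date %q: %w", date, err)
	}
	clock, err := time.Parse("15:04", deadline)
	if err != nil {
		return time.Time{}, fmt.Errorf("parse deadline %q: %w", deadline, err)
	}
	if sensorTriggeredDaily {
		day = day.AddDate(0, 0, 1) // T+1; AddDate handles month and year rollover
	}
	return time.Date(day.Year(), day.Month(), day.Day(), clock.Hour(), clock.Minute(), 0, 0, loc), nil
}

func main() {
	cronDeadline, _ := slaDeadline("2025-03-31", "06:00", "UTC", false)
	sensorDeadline, _ := slaDeadline("2025-03-31", "06:00", "UTC", true)
	fmt.Println("cron:  ", cronDeadline.Format(time.RFC3339))
	fmt.Println("sensor:", sensorDeadline.Format(time.RFC3339))
}
```

Without the shift, a 06:00 deadline for data dated 31 March would already be in the past when the first sensor for that day arrives, which matches the false warnings described above.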

v0.7.3

09 Mar 23:55
34cfc38

Added

  • Configurable drift detection field (PostRunConfig.DriftField) — specifies which sensor field to compare for post-run drift detection. Defaults to "sensor_count" for backward compatibility.

Fixed

  • Post-run drift detection never fired when sensor field was count — drift comparison was hardcoded to "sensor_count" but bronze consumers write "count" in hourly-status sensors, causing both baseline and current values to return 0 and silently suppressing all drift detection.
  • Two flaky time-sensitive tests: TestSLAMonitor_Reconcile_PastWarningFutureBreach and TestWatchdog_MissedSchedule_DetailFields used real wall-clock time instead of injected NowFunc, causing failures depending on time of day.
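
The flaky-test fix relies on clock injection. A minimal sketch of the pattern follows; only the NowFunc name comes from these notes, while the Deps layout, the now() fallback, and the helper functions are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// Deps carries injectable dependencies. NowFunc is nil in production and
// set to a fixed clock in tests.
type Deps struct {
	NowFunc func() time.Time
}

// now returns the injected time when available and wall-clock time otherwise.
func (d *Deps) now() time.Time {
	if d.NowFunc != nil {
		return d.NowFunc()
	}
	return time.Now()
}

// pastDeadline is an example of logic that must not call time.Now directly.
func (d *Deps) pastDeadline(deadline time.Time) bool {
	return d.now().After(deadline)
}

// at builds a fixed UTC time on an arbitrary example day.
func at(hour, minute int) time.Time {
	return time.Date(2025, 3, 9, hour, minute, 0, 0, time.UTC)
}

// fixedClock returns Deps whose clock always reads t.
func fixedClock(t time.Time) *Deps {
	return &Deps{NowFunc: func() time.Time { return t }}
}

func main() {
	deadline := at(9, 30)
	fmt.Println(fixedClock(at(9, 0)).pastDeadline(deadline))  // before the deadline
	fmt.Println(fixedClock(at(10, 0)).pastDeadline(deadline)) // after the deadline
}
```

A test written against fixedClock gives the same answer at any time of day, which is the property the two flaky tests were missing.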

v0.7.2

09 Mar 16:29
f045ac8

Added

  • Sensor trigger deadline: Configurable trigger.deadline field (:MM or HH:MM) defines when the auto-trigger window closes for sensor-triggered pipelines. After expiry, the watchdog writes FAILED_FINAL and publishes SENSOR_DEADLINE_EXPIRED. Manual restart via RERUN_REQUEST remains available.
  • Proactive SLA scheduling for sensor-triggered pipelines: Removed the cron-only guard that silently skipped sensor-triggered pipelines from proactive SLA alert scheduling. Sensor pipelines now receive warning and breach alerts even when their trigger condition is never met.
  • TOCTOU-safe conditional write: CreateTriggerIfAbsent uses DynamoDB attribute_not_exists(PK) to prevent race conditions between trigger acquisition and deadline enforcement.
  • Independent trigger deadline watchdog scan: checkTriggerDeadlines runs as its own scan in the watchdog, decoupled from SLA scheduling. Pipelines with a trigger deadline but no SLA config are now covered.

Changed

  • CloudWatch alarms: Lambda errors, Step Function failures, DLQ depth, and stream iterator age alarms with EventBridge input transformers for alarm routing.
  • Secrets Manager integration: slack_secret_arn variable allows the alert-dispatcher to retrieve its Slack token from Secrets Manager instead of environment variables.
  • Environment variable scoping: os.ExpandEnv restricted to INTERLOCK_ prefixed variables across Airflow, Databricks, and Lambda triggers.
  • Lambda concurrency limits: lambda_concurrency variable sets reserved concurrent executions per function.
  • Config cache deep copy: JSON round-trip prevents shared state mutation across concurrent reads.
  • Time injection: time.Now() replaced with d.now() across all handlers for deterministic testing.
  • Audit logging: 9 silent audit write error discards now emit WARN-level log entries.
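
The environment variable scoping item can be sketched with os.Expand and a filtering lookup. This is one way to get the described behaviour, not necessarily how the project does it; in particular, replacing out-of-scope variables with an empty string is an assumption, since the notes do not say what happens to them.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// allowedPrefix is the only variable namespace that may be expanded.
const allowedPrefix = "INTERLOCK_"

// expandWith expands $VAR and ${VAR} references in s, consulting lookup only
// for names that start with allowedPrefix. Other references expand to "".
func expandWith(s string, lookup func(string) string) string {
	return os.Expand(s, func(name string) string {
		if strings.HasPrefix(name, allowedPrefix) {
			return lookup(name)
		}
		return "" // assumption: out-of-scope variables are dropped
	})
}

// expandScoped is the production form, backed by the process environment.
func expandScoped(s string) string {
	return expandWith(s, os.Getenv)
}

func main() {
	os.Setenv("INTERLOCK_API_TOKEN", "abc123")
	os.Setenv("AWS_SECRET_ACCESS_KEY", "do-not-leak")

	// Only the INTERLOCK_ variable is substituted.
	fmt.Println(expandScoped("token=${INTERLOCK_API_TOKEN} secret=${AWS_SECRET_ACCESS_KEY}"))
}
```

The point of the restriction is that a pipeline config can no longer pull arbitrary Lambda environment values, such as AWS credentials, into an outbound request.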

Fixed

  • Trigger lock release on SFN start failure: Previously caused a 4.5-hour deadlock when StartExecution failed.
  • SLA alert scheduling skip-on-error: scheduleSLAAlerts no longer falls through to subsequent pipelines on error.
  • Alert-dispatcher env validation: Missing EVENTS_TABLE and EVENTS_TTL_DAYS now checked at startup.

v0.7.1

08 Mar 14:50
f046124

Fixed

  • Glue RCA false-positive failure classification (Check 2 removed): The verifyGlueRCA Check 2 filter pattern (?Exception ?Error ?FATAL ...) matched benign JVM startup output in Glue's stderr (/aws-glue/jobs/error), causing every SUCCEEDED Glue job to be reclassified as FAILED. Classpath entries like -XX:OnOutOfMemoryError and Glue's internal AnalyzerLogHelper messages contain "Error" as substrings, producing a 100% false positive rate. Removed Check 2 entirely — Check 1 (GlueExceptionAnalysisJobFailed in the RCA log stream) is Glue's purpose-built mechanism for detecting false successes. Post-run validation provides the application-level safety net for data quality issues. (#60)

  • sla-monitor missing DynamoDB env vars and IAM: Added CONTROL_TABLE, JOBLOG_TABLE, RERUN_TABLE environment variables and a read-only DynamoDB IAM policy for the sla-monitor Lambda. These are required for alert suppression logic (GetTrigger, GetLatestJobEvent) but were never configured in Terraform. Also removed STATE_MACHINE_ARN from orchestrator's ValidateEnv requirements since the orchestrator never starts SFN executions. (#59)

v0.7.0

08 Mar 12:33
96a65ee

Added

  • Event-driven post-run monitoring: Post-run drift detection moves from SFN poll loop to DynamoDB Stream processing. Stream-router compares sensor values against date-scoped baselines captured at trigger completion. Drift above a configurable threshold triggers a rerun via the existing circuit breaker.
  • Configurable drift threshold: PostRunConfig.DriftThreshold (*float64) sets the minimum sensor count change to trigger drift. Defaults to 0 (any change).
  • Watchdog post-run sensor absence detection: Detects pipelines that completed without receiving post-run sensor data after a configurable SensorTimeout grace period. Publishes POST_RUN_SENSOR_MISSING event.
  • Typed trigger errors: TriggerError with Category (PERMANENT/TRANSIENT) and Unwrap() support. HTTP triggers return 4xx as PERMANENT, 5xx as TRANSIENT.
  • ConfigCache deep copy: GetAll() returns a deep copy of cached configs, preventing shared state mutation.
  • Validation engine string parsing: toFloat64 handles string-typed numeric fields via strconv.ParseFloat.
  • New EventBridge events: BASELINE_CAPTURE_FAILED, PIPELINE_EXCLUDED, RERUN_ACCEPTED for complete audit trail coverage across all code paths.
  • E2E test coverage: 8 new end-to-end test groups plus 30+ unit tests covering safety gap scenarios.
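
The typed trigger error from the list above might look roughly like this. The TriggerError name, the PERMANENT/TRANSIENT categories, Unwrap(), and the 4xx/5xx mapping come from the notes; the field layout and the classifyHTTP and shouldRetry helpers are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Category says whether retrying a failed trigger can help.
type Category string

const (
	Permanent Category = "PERMANENT"
	Transient Category = "TRANSIENT"
)

// TriggerError wraps an underlying error with a retry category.
type TriggerError struct {
	Category Category
	Err      error
}

func (e *TriggerError) Error() string {
	return fmt.Sprintf("trigger failed (%s): %v", e.Category, e.Err)
}

// Unwrap lets errors.Is and errors.As reach the underlying error.
func (e *TriggerError) Unwrap() error { return e.Err }

// classifyHTTP maps a response status to a TriggerError: 4xx is permanent,
// 5xx is transient, anything else is treated as success in this sketch.
func classifyHTTP(status int) error {
	switch {
	case status >= 400 && status < 500:
		return &TriggerError{Category: Permanent, Err: fmt.Errorf("HTTP %d", status)}
	case status >= 500:
		return &TriggerError{Category: Transient, Err: fmt.Errorf("HTTP %d", status)}
	}
	return nil
}

// shouldRetry reports whether err is, or wraps, a transient TriggerError.
func shouldRetry(err error) bool {
	var te *TriggerError
	if errors.As(err, &te) {
		return te.Category == Transient
	}
	return false
}

func main() {
	for _, status := range []int{200, 404, 503} {
		err := classifyHTTP(status)
		fmt.Println(status, err, "retry:", shouldRetry(err))
	}
	wrapped := fmt.Errorf("execute trigger: %w", classifyHTTP(502))
	fmt.Println(shouldRetry(wrapped))
}
```

Because shouldRetry uses errors.As, callers can add their own context with %w without hiding the category from the retry logic.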

Changed

  • Post-run removed from SFN: 6 ASL states removed. Job success routes directly to CompleteTrigger. State count: 30 → 24.
  • Post-run removed from orchestrator: handlePostRun removed. Post-run logic is entirely stream-based.
  • Watchdog uses dependency-injected time: Deps.NowFunc replaces package variable for consistent testability.
  • SFN execution name truncation: Names truncated to 80 characters (AWS limit) at all construction sites.
  • Environment variable scoping: os.ExpandEnv restricted to INTERLOCK_ prefixed variables only.

Fixed

  • Calendar exclusion uses execution date: isExcludedDate checks the job's execution date (not time.Now()), preventing incorrect exclusions on re-runs for previous days. Supports YYYY-MM-DD and composite YYYY-MM-DDTHH formats.
  • Atomic lock reset: ResetTriggerLock uses single DynamoDB UpdateItem with attribute_exists(PK) condition, eliminating the race window between delete and create.
  • Lock release on SFN start failure: Both rerun and job-failure retry paths release the trigger lock if StartExecution fails, preventing permanently stuck pipelines.
  • Terminal trigger status on calendar exclusion: handleJobFailure sets FAILED_FINAL instead of leaving the lock in RUNNING state to silently expire via TTL.
  • ASL CompleteTrigger failure path: New states ensure SLA schedules are cancelled before entering terminal Fail state.
  • Event ordering: RERUN_ACCEPTED only publishes after ResetTriggerLock confirms lock atomicity.
  • publishEvent error logging: All 17 call sites now log errors at WARN level.
  • SLA monitor error wrapping: createOneTimeSchedule wraps errors with schedule name context.
  • HTTP response body sanitization: Error bodies truncated to 512 bytes with control characters stripped.
  • DynamoDB table protection: All 4 tables have deletion_protection_enabled and point-in-time recovery enabled.
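
The response body sanitization item can be sketched as a truncate-then-strip helper. The 512-byte limit and the removal of control characters come from the notes; the function name and the order of the two steps are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// maxErrorBodyBytes caps how much of an error response is kept.
const maxErrorBodyBytes = 512

// sanitizeBody truncates body to maxErrorBodyBytes and removes control
// characters (including newlines and tabs) so the result is safe to embed
// in a single log line or event payload.
func sanitizeBody(body []byte) string {
	if len(body) > maxErrorBodyBytes {
		body = body[:maxErrorBodyBytes]
	}
	return strings.Map(func(r rune) rune {
		if unicode.IsControl(r) {
			return -1 // a negative return drops the rune
		}
		return r
	}, string(body))
}

func main() {
	fmt.Printf("%q\n", sanitizeBody([]byte("upstream\r\n\x00error")))

	big := []byte(strings.Repeat("x", 2000))
	fmt.Println(len(sanitizeBody(big)))
}
```

Truncating by bytes can cut a multi-byte UTF-8 character in half; a stricter version could pass the result through strings.ToValidUTF8.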

v0.6.2

08 Mar 05:32
554c8cc

Added

  • Lambda trigger type: New TriggerLambda for direct AWS Lambda SDK invocation (RequestResponse). LambdaTriggerConfig with functionName and optional payload (supports env-var expansion). Non-polling CheckStatus.
  • Non-polling trigger synchronous completion: handleTrigger writes success joblog immediately for non-polling triggers (http, command, lambda) and returns sentinel runId so the Step Functions CheckJob JSONPath resolves.

Fixed

  • SFN crash on non-polling trigger success: Non-polling triggers returned empty runId (omitted via omitempty), causing $.triggerResult.runId JSONPath failure in the CheckJob state. Previously masked because http triggers always failed (403) into the retry/error path.
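
The omitempty problem is easy to reproduce with encoding/json. In the sketch below, the struct, the status value, and the "sync" sentinel are hypothetical; the notes only say that a sentinel runId is returned so that $.triggerResult.runId resolves.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// triggerResult mirrors the shape implied by $.triggerResult.runId.
type triggerResult struct {
	RunID  string `json:"runId,omitempty"`
	Status string `json:"status"`
}

// syncRunID is a hypothetical sentinel for triggers that finish synchronously.
const syncRunID = "sync"

// encode marshals r and returns the JSON text.
func encode(r triggerResult) string {
	b, err := json.Marshal(r)
	if err != nil {
		return "marshal error: " + err.Error()
	}
	return string(b)
}

// resultFor fills in the sentinel when a non-polling trigger has no run ID,
// so the runId key is always present in the output.
func resultFor(runID string, polling bool) triggerResult {
	if runID == "" && !polling {
		runID = syncRunID
	}
	return triggerResult{RunID: runID, Status: "SUCCEEDED"}
}

func main() {
	// Before: the empty runId is omitted, so a JSONPath on runId has nothing to match.
	fmt.Println(encode(triggerResult{Status: "SUCCEEDED"}))

	// After: the sentinel keeps the key present.
	fmt.Println(encode(resultFor("", false)))
}
```

The first line of output has no runId key at all, which is what made the CheckJob JSONPath fail; the second carries the sentinel.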