Releases: dwsmith1983/interlock
v0.9.4
Split internal/lambda into handler sub-packages
Refactored
- Split monolithic `internal/lambda/` (46 files) into handler-aligned sub-packages: `orchestrator/`, `stream/`, `watchdog/`, `sla/`, `alert/`, `sink/`
- Extracted shared utilities into focused root files: publish, date, exclusion, sensor, schedule, config, terminal
- Trigger config registry: replaced `buildTriggerConfig` switch statement with generic registry map
- SLA deadline calculations wired through `pkg/sla/`: pure functions decoupled from Lambda handler context
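The switch-to-registry change can be pictured with a minimal sketch. Everything except the name `buildTriggerConfig` is hypothetical here (the `TriggerConfig` interface, the two config types, and the map layout are illustrative assumptions, not the project's actual code):

```go
package main

import "fmt"

// TriggerConfig is a hypothetical stand-in for the per-trigger config types.
type TriggerConfig interface {
	Kind() string
}

type httpConfig struct{}

func (httpConfig) Kind() string { return "http" }

type glueConfig struct{}

func (glueConfig) Kind() string { return "glue" }

// registry maps a trigger type name to a constructor. Adding a trigger type
// means adding one entry here rather than another switch case.
var registry = map[string]func() TriggerConfig{
	"http": func() TriggerConfig { return httpConfig{} },
	"glue": func() TriggerConfig { return glueConfig{} },
}

// buildTriggerConfig looks the constructor up instead of switching on the name.
func buildTriggerConfig(triggerType string) (TriggerConfig, error) {
	ctor, ok := registry[triggerType]
	if !ok {
		return nil, fmt.Errorf("unknown trigger type %q", triggerType)
	}
	return ctor(), nil
}

func main() {
	cfg, err := buildTriggerConfig("glue")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg.Kind())
}
```

One plausible benefit of this shape is that unknown trigger types fail in a single place, and tests can iterate over the map to cover every registered type.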
Added
- `pkg/sla/` package: pure SLA deadline calculation functions
- `PipelineConfig.DeepCopy()` method: safe config cache isolation without JSON roundtrip
- `EventWatchdogDegraded` event type: watchdog health observability
- Smoke tests for all 6 `cmd/lambda/` packages: `ValidateEnv` coverage
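A typed deep copy of the kind described for `PipelineConfig.DeepCopy()` might look like the sketch below. The struct fields are assumptions chosen for illustration; the real `PipelineConfig` is not shown in these notes:

```go
package main

import "fmt"

// PipelineConfig here is a reduced, hypothetical version with one slice and
// one map field, the two kinds of fields a shallow copy would share.
type PipelineConfig struct {
	Name    string
	Sensors []string
	Labels  map[string]string
}

// DeepCopy returns a copy that shares no slice or map storage with c.
func (c *PipelineConfig) DeepCopy() *PipelineConfig {
	if c == nil {
		return nil
	}
	out := &PipelineConfig{Name: c.Name}
	if c.Sensors != nil {
		out.Sensors = append([]string(nil), c.Sensors...)
	}
	if c.Labels != nil {
		out.Labels = make(map[string]string, len(c.Labels))
		for k, v := range c.Labels {
			out.Labels[k] = v
		}
	}
	return out
}

func main() {
	cached := &PipelineConfig{
		Name:    "daily",
		Sensors: []string{"bronze"},
		Labels:  map[string]string{"env": "prod"},
	}
	cp := cached.DeepCopy()
	cp.Labels["env"] = "dev" // mutating the copy leaves the cached value alone
	cp.Sensors[0] = "silver"
	fmt.Println(cached.Labels["env"], cached.Sensors[0])
}
```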
Fixed
- `HandleWatchdog` silent error suppression: now returns aggregate errors via `errors.Join`
- `HandleWatchdog` degraded-state signaling: publishes `WATCHDOG_DEGRADED` event when checks fail
- Config cache isolation: typed `DeepCopy()` replaces JSON marshal/unmarshal roundtrip
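The aggregate-error behaviour can be sketched with `errors.Join` (Go 1.20+). The `check` type, `runChecks`, and `errCheckFailed` are hypothetical names; the real `HandleWatchdog` internals are not shown in these notes:

```go
package main

import (
	"errors"
	"fmt"
)

// errCheckFailed is a hypothetical sentinel used only for this demo.
var errCheckFailed = errors.New("check failed")

// check is a hypothetical named watchdog check.
type check struct {
	name string
	run  func() error
}

// runChecks runs every check even if an earlier one fails, then returns all
// failures joined into one error. It returns nil when every check passes.
func runChecks(checks []check) error {
	var errs []error
	for _, c := range checks {
		if err := c.run(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", c.name, err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	err := runChecks([]check{
		{name: "schedules", run: func() error { return errCheckFailed }},
		{name: "sensors", run: func() error { return nil }},
		{name: "deadlines", run: func() error { return errCheckFailed }},
	})
	if err != nil {
		// A caller could publish a degraded-state event at this point.
		fmt.Println("watchdog degraded:")
		fmt.Println(err)
	}
	fmt.Println(errors.Is(err, errCheckFailed))
}
```

Because each failure is wrapped with `%w` before joining, `errors.Is` and `errors.As` still see the underlying causes.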
v0.9.3
Security
- Encryption at rest for DynamoDB and SQS: All DynamoDB tables explicitly enable server-side encryption. SQS queues use KMS encryption. New optional `kms_key_arn` variable for custom CMK; defaults to AWS-managed keys.
- SSRF protection on trigger HTTP clients: Custom `http.Transport` with dial-time IP validation rejects connections to private, loopback, link-local, and multicast addresses. Protects HTTP, Airflow, and Databricks triggers.
- EventBridge PutEvents partial failure detection: `publishEvent` now checks `FailedEntryCount` on the response. Previously, partial failures were silently discarded.
- Command trigger shell injection eliminated: Replaced `sh -c` with direct `exec.CommandContext` + `strings.Fields` argument splitting.
- `lambda_trigger_arns` wildcard default removed: Explicit ARN list required when triggers are enabled.
- Slack plaintext token deprecation warning: Terraform `check` block warns at plan time when plaintext token is used without Secrets Manager.
- Trigger IAM policy scoping: New variables `glue_job_arns`, `emr_cluster_arns`, `emr_serverless_app_arns`, `sfn_trigger_arns` with preconditions requiring non-empty values when the corresponding trigger is enabled.
- EventBridge bus resource policy: Restricts PutEvents to Lambda execution roles only.
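Dial-time IP validation, as described in the SSRF item above, is commonly built on the `Control` hook of `net.Dialer`, which runs after DNS resolution and before the connection is made. The sketch below shows that general approach; `checkDialAddr`, `isBlockedIP`, and `newGuardedClient` are hypothetical names and the project's actual transport may differ:

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"syscall"
	"time"
)

// isBlockedIP reports whether ip falls in a range the release notes list:
// private, loopback, link-local, or multicast.
func isBlockedIP(ip net.IP) bool {
	return ip.IsPrivate() || ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsMulticast()
}

// checkDialAddr matches the net.Dialer.Control signature. The address it
// receives is the already-resolved IP:port, so a hostname that resolves to an
// internal address is still rejected.
func checkDialAddr(network, address string, _ syscall.RawConn) error {
	host, _, err := net.SplitHostPort(address)
	if err != nil {
		return err
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return fmt.Errorf("dial target %q is not an IP address", host)
	}
	if isBlockedIP(ip) {
		return fmt.Errorf("dial to %s blocked", ip)
	}
	return nil
}

// newGuardedClient builds an HTTP client whose every connection attempt
// passes through checkDialAddr.
func newGuardedClient() *http.Client {
	dialer := &net.Dialer{Timeout: 5 * time.Second, Control: checkDialAddr}
	return &http.Client{
		Transport: &http.Transport{DialContext: dialer.DialContext},
		Timeout:   10 * time.Second,
	}
}

func main() {
	_ = newGuardedClient() // construction only; no network call is made here
	for _, addr := range []string{"127.0.0.1:80", "169.254.169.254:80", "93.184.216.34:443"} {
		fmt.Println(addr, checkDialAddr("tcp", addr, nil))
	}
}
```

Validating at dial time rather than on the URL string is what guards against DNS rebinding style tricks, since the check sees the address actually being connected to.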
Bug Fixes
- Drift detection silently skipped zero values: New `ExtractFloatOk` distinguishes absent from zero. Shared `DetectDrift` function consolidates 3 duplicated drift comparison sites.
- `RemapPerPeriodSensors` map mutation during range: Staging map now collects additions, merged after iteration.
- Orphaned rerun burned retry budget: Reordered to lock-first, then write rerun record.
- Stream router discarded partial batch failures: Now returns `DynamoDBEventResponse` with per-record `BatchItemFailures`.
- `SLA_MET` published when pipeline never ran: Now checks for trigger existence first.
- Trigger deadline used SLA timezone instead of schedule timezone: Falls back to SLA timezone if schedule timezone is not set.
- Validation mode case-sensitive: Now uses `strings.ToUpper(mode)`.
- Epoch timestamp unit mismatch in rerun freshness: Timestamps below 1e12 (seconds) are now converted to milliseconds.
- Post-run baseline field collision: Now namespaced by rule key. Existing flat baselines self-heal on next pipeline completion.
- `publishEvent` errors silently discarded in SLA reconcile: Replaced `_ = publishEvent(...)` with error-logged calls.
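The absent-versus-zero distinction in the drift fix can be illustrated with the comma-ok pattern. The signatures below are assumptions (the notes name `ExtractFloatOk` and `DetectDrift` but do not show them), so lowercase hypothetical equivalents are used:

```go
package main

import (
	"fmt"
	"math"
)

// extractFloatOk returns the numeric value stored under key and whether it
// was present at all, so a legitimate 0 is not confused with a missing field.
func extractFloatOk(data map[string]any, key string) (float64, bool) {
	v, present := data[key]
	if !present {
		return 0, false
	}
	switch n := v.(type) {
	case float64:
		return n, true
	case int:
		return float64(n), true
	case int64:
		return float64(n), true
	}
	return 0, false
}

// detectDrift reports drift only when both sides carry the field and the
// absolute change exceeds threshold.
func detectDrift(baseline, current map[string]any, field string, threshold float64) bool {
	b, okB := extractFloatOk(baseline, field)
	c, okC := extractFloatOk(current, field)
	if !okB || !okC {
		return false
	}
	return math.Abs(c-b) > threshold
}

func main() {
	baseline := map[string]any{"count": 0.0} // a real zero, not a missing value
	current := map[string]any{"count": 120.0}
	fmt.Println(detectDrift(baseline, current, "count", 0))
	fmt.Println(detectDrift(map[string]any{}, current, "count", 0))
}
```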
Changed
- Extracted `resolveHTTPClient()` replacing identical 7-line blocks in `ExecuteHTTP` and `ExecuteAirflow`.
- Extracted `createSLASchedules()` replacing duplicated warning/breach schedule loops in watchdog and sla-monitor.
- Split `watchdog.go` (1079 lines) into 5 focused files by domain (~200 lines each).
v0.9.1
Added
- `DRY_RUN_COMPLETED` event: terminal event that closes the dry-run observation loop for each evaluation period. Published after `DRY_RUN_WOULD_TRIGGER` and `DRY_RUN_SLA_PROJECTION`, carrying the SLA verdict (`met`, `breach`, or `n/a`) so operators can see each period resolve.
Fixed
- Dry-run pipelines could start Step Function executions via rerun and job-failure paths: `handleRerunRequest` and `handleJobFailure` did not check `cfg.DryRun` before calling `startSFNWithName`, allowing rerun requests and job failure retries to start real SFN executions for dry-run pipelines. Added dry-run guards in both handlers and defense-in-depth in `startSFNWithName` to suppress execution unconditionally. Watchdog reconciliation loop now skips dry-run pipelines to prevent orphaned trigger locks.
- Watchdog scheduled real SLA alerts for dry-run pipelines: `scheduleSLAAlerts`, `detectMissedSchedules`, `detectMissedInclusionSchedules`, `checkTriggerDeadlines`, `detectMissingPostRunSensors`, and `detectRelativeSLABreaches` all iterated over dry-run pipelines without checking `cfg.DryRun`. This caused EventBridge Scheduler entries for SLA_WARNING/SLA_BREACH, SCHEDULE_MISSED events, and RELATIVE_SLA_BREACH alerts to fire for observation-only pipelines. Added `cfg.DryRun` guard to all six watchdog functions.
- Duplicate `JOB_COMPLETED` alerts for polled jobs: `handleCheckJob` in the orchestrator published `JOB_COMPLETED` when polling detected success, but the stream-router's `handleJobSuccess` also published the same event when the JOB# record arrived via DynamoDB stream. Removed the orchestrator emission; the stream-router is now the single canonical source for `JOB_COMPLETED` across all job types.
- Missing event types in alert EventBridge rule: added `SENSOR_DEADLINE_EXPIRED`, `IRREGULAR_SCHEDULE_MISSED`, `RELATIVE_SLA_WARNING`, `RELATIVE_SLA_BREACH`, and all `DRY_RUN_*` events to the alert rule so they route to Slack via alert-dispatcher.
v0.8.0
Added
- Inclusion calendar scheduling (`schedule.include.dates`): explicit YYYY-MM-DD date lists for pipelines that run on known irregular dates (monthly close, quarterly filing, specific business dates). Mutually exclusive with cron. Watchdog detects missed inclusion dates and publishes `IRREGULAR_SCHEDULE_MISSED` events.
- Relative SLA (`sla.maxDuration`): duration-based SLA for ad-hoc pipelines with no predictable schedule. Clock starts at first sensor arrival and covers the entire lifecycle: evaluation → trigger → job → post-run → completion. Warning at 75% of `maxDuration` (or `breachAt - expectedDuration` when set). New events: `RELATIVE_SLA_WARNING`, `RELATIVE_SLA_BREACH`.
- First-sensor-arrival tracking: stream-router records `first-sensor-arrival#<date>` on lock acquisition (idempotent conditional write). Used as T=0 for relative SLA calculation.
- Watchdog defense-in-depth for relative SLA: `detectRelativeSLABreaches` scans pipelines with `maxDuration` config and fires `RELATIVE_SLA_BREACH` if the EventBridge scheduler failed to fire.
- `WriteSensorIfAbsent` store method: conditional PutItem that only writes if the key doesn't exist, used for first-sensor-arrival idempotency.
- Config validation for new fields: cron/include mutual exclusion, inclusion date format (YYYY-MM-DD), `maxDuration` format and 24h cap, `maxDuration` requires trigger.
- Glue false-success detection: `verifyGlueRCA` now checks both the RCA insight stream (Check 1) and the driver output stream for ERROR/FATAL log4j severity markers (Check 2). Catches Spark failures that Glue reports as SUCCEEDED when the application framework swallows exceptions.
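The relative SLA arithmetic described above (breach at arrival plus `maxDuration`, warning at 75% of `maxDuration` or at `breachAt - expectedDuration` when set) can be written out as a small sketch. The function names are hypothetical and the real `pkg/sla/` API is not shown in these notes:

```go
package main

import (
	"fmt"
	"time"
)

// warningOffset returns how long after first sensor arrival the warning
// should fire. With an expected duration it is maxDuration minus that
// duration (equivalent to breachAt - expectedDuration); otherwise it is 75%
// of maxDuration.
func warningOffset(maxDuration, expectedDuration time.Duration) time.Duration {
	if expectedDuration > 0 {
		return maxDuration - expectedDuration
	}
	return maxDuration * 3 / 4
}

// relativeSLATimes anchors both alert times at the first sensor arrival (T=0).
func relativeSLATimes(arrival time.Time, maxDuration, expectedDuration time.Duration) (warnAt, breachAt time.Time) {
	warnAt = arrival.Add(warningOffset(maxDuration, expectedDuration))
	breachAt = arrival.Add(maxDuration)
	return warnAt, breachAt
}

func main() {
	arrival := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)

	warnAt, breachAt := relativeSLATimes(arrival, 4*time.Hour, 0)
	fmt.Println(warnAt.Format(time.RFC3339), breachAt.Format(time.RFC3339))

	warnAt, breachAt = relativeSLATimes(arrival, 4*time.Hour, 30*time.Minute)
	fmt.Println(warnAt.Format(time.RFC3339), breachAt.Format(time.RFC3339))
}
```

With a 4h `maxDuration`, the first call places the warning 3h after arrival; with a 30m expected duration, the second places it 3h30m after arrival.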
Changed
- `SLAConfig.Deadline` and `SLAConfig.ExpectedDuration` are now `omitempty`; relative SLA configs may omit the wall-clock deadline entirely.
- SFN ASL passes `maxDuration` and `sensorArrivalAt` to `CancelSLASchedules` and `CancelSLAOnCompleteTriggerFailure` states.
- sla-monitor `handleSLACalculate` routes to relative path when `MaxDuration` + `SensorArrivalAt` are present.
v0.7.4
Fixed
- False SLA warnings/breaches for sensor-triggered daily pipelines: `scheduleSLAAlerts` resolved the SLA deadline against today's date, but sensor-triggered daily pipelines run T+1 (data for today completes tomorrow). The SLA calculation now shifts the deadline date by +1 day for sensor-triggered daily pipelines.
v0.7.3
Added
- Configurable drift detection field (`PostRunConfig.DriftField`): specifies which sensor field to compare for post-run drift detection. Defaults to `"sensor_count"` for backward compatibility.
Fixed
- Post-run drift detection never fired when sensor field was `count`: drift comparison was hardcoded to `"sensor_count"` but bronze consumers write `"count"` in hourly-status sensors, causing both baseline and current values to return 0 and silently suppressing all drift detection.
- Two flaky time-sensitive tests: `TestSLAMonitor_Reconcile_PastWarningFutureBreach` and `TestWatchdog_MissedSchedule_DetailFields` used real wall-clock time instead of injected `NowFunc`, causing failures depending on time of day.
v0.7.2
Added
- Sensor trigger deadline: Configurable `trigger.deadline` field (`:MM` or `HH:MM`) defines when the auto-trigger window closes for sensor-triggered pipelines. After expiry, the watchdog writes `FAILED_FINAL` and publishes `SENSOR_DEADLINE_EXPIRED`. Manual restart via `RERUN_REQUEST` remains available.
- Proactive SLA scheduling for sensor-triggered pipelines: Removed the cron-only guard that silently skipped sensor-triggered pipelines from proactive SLA alert scheduling. Sensor pipelines now receive warning and breach alerts even when their trigger condition is never met.
- TOCTOU-safe conditional write: `CreateTriggerIfAbsent` uses DynamoDB `attribute_not_exists(PK)` to prevent race conditions between trigger acquisition and deadline enforcement.
- Independent trigger deadline watchdog scan: `checkTriggerDeadlines` runs as its own scan in the watchdog, decoupled from SLA scheduling. Pipelines with a trigger deadline but no SLA config are now covered.
Changed
- CloudWatch alarms: Lambda errors, Step Function failures, DLQ depth, and stream iterator age alarms with EventBridge input transformers for alarm routing.
- Secrets Manager integration: `slack_secret_arn` variable allows the alert-dispatcher to retrieve its Slack token from Secrets Manager instead of environment variables.
- Environment variable scoping: `os.ExpandEnv` restricted to `INTERLOCK_` prefixed variables across Airflow, Databricks, and Lambda triggers.
- Lambda concurrency limits: `lambda_concurrency` variable sets reserved concurrent executions per function.
- Config cache deep copy: JSON round-trip prevents shared state mutation across concurrent reads.
- Time injection: `time.Now()` replaced with `d.now()` across all handlers for deterministic testing.
- Audit logging: 9 silent audit write error discards now emit WARN-level log entries.
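Prefix-scoped expansion like the `INTERLOCK_` restriction above can be built on `os.Expand`, which takes a custom mapping function. This is a sketch under assumptions: `expandScoped` is a hypothetical helper, and expanding non-matching variables to the empty string is a choice made here, not something the notes confirm:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const allowedPrefix = "INTERLOCK_"

// expandScoped expands $VAR and ${VAR} references in s, but only for names
// carrying the allowed prefix. Other references expand to the empty string
// (an assumption of this sketch), so unrelated process secrets cannot leak
// into trigger payloads.
func expandScoped(s string, lookup func(string) string) string {
	return os.Expand(s, func(name string) string {
		if !strings.HasPrefix(name, allowedPrefix) {
			return ""
		}
		return lookup(name)
	})
}

func main() {
	os.Setenv("INTERLOCK_API_USER", "svc-interlock")
	tmpl := "user=${INTERLOCK_API_USER} key=${AWS_SECRET_ACCESS_KEY}"
	fmt.Println(expandScoped(tmpl, os.Getenv))
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the helper easy to test without touching the process environment.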
Fixed
- Trigger lock release on SFN start failure: Previously caused a 4.5-hour deadlock when `StartExecution` failed.
- SLA alert scheduling skip-on-error: `scheduleSLAAlerts` no longer falls through to subsequent pipelines on error.
- Alert-dispatcher env validation: Missing `EVENTS_TABLE` and `EVENTS_TTL_DAYS` now checked at startup.
v0.7.1
Fixed
- Glue RCA false-positive failure classification (Check 2 removed): The `verifyGlueRCA` Check 2 filter pattern (`?Exception ?Error ?FATAL ...`) matched benign JVM startup output in Glue's stderr (`/aws-glue/jobs/error`), causing every SUCCEEDED Glue job to be reclassified as FAILED. Classpath entries like `-XX:OnOutOfMemoryError` and Glue's internal `AnalyzerLogHelper` messages contain "Error" as substrings, producing a 100% false positive rate. Removed Check 2 entirely; Check 1 (`GlueExceptionAnalysisJobFailed` in the RCA log stream) is Glue's purpose-built mechanism for detecting false successes. Post-run validation provides the application-level safety net for data quality issues. (#60)
- sla-monitor missing DynamoDB env vars and IAM: Added `CONTROL_TABLE`, `JOBLOG_TABLE`, `RERUN_TABLE` environment variables and a read-only DynamoDB IAM policy for the sla-monitor Lambda. These are required for alert suppression logic (`GetTrigger`, `GetLatestJobEvent`) but were never configured in Terraform. Also removed `STATE_MACHINE_ARN` from orchestrator's `ValidateEnv` requirements since the orchestrator never starts SFN executions. (#59)
v0.7.0
Added
- Event-driven post-run monitoring: Post-run drift detection moves from SFN poll loop to DynamoDB Stream processing. Stream-router compares sensor values against date-scoped baselines captured at trigger completion. Drift above a configurable threshold triggers a rerun via the existing circuit breaker.
- Configurable drift threshold: `PostRunConfig.DriftThreshold` (`*float64`) sets the minimum sensor count change to trigger drift. Defaults to 0 (any change).
- Watchdog post-run sensor absence detection: Detects pipelines that completed without receiving post-run sensor data after a configurable `SensorTimeout` grace period. Publishes `POST_RUN_SENSOR_MISSING` event.
- Typed trigger errors: `TriggerError` with `Category` (PERMANENT/TRANSIENT) and `Unwrap()` support. HTTP triggers return 4xx as PERMANENT, 5xx as TRANSIENT.
- ConfigCache deep copy: `GetAll()` returns a deep copy of cached configs, preventing shared state mutation.
- Validation engine string parsing: `toFloat64` handles string-typed numeric fields via `strconv.ParseFloat`.
- New EventBridge events: `BASELINE_CAPTURE_FAILED`, `PIPELINE_EXCLUDED`, `RERUN_ACCEPTED` for complete audit trail coverage across all code paths.
- E2E test coverage: 8 new end-to-end test groups plus 30+ unit tests covering safety gap scenarios.
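A typed error with a category and `Unwrap()` support, as described in the typed trigger errors item, could be shaped roughly as follows. Only the names `TriggerError`, `Category`, and the PERMANENT/TRANSIENT values come from the notes; the field layout and the helpers `classifyHTTPStatus` and `isTransient` are illustrative assumptions:

```go
package main

import (
	"errors"
	"fmt"
)

// Category tells the retry logic whether trying again can help.
type Category string

const (
	Permanent Category = "PERMANENT"
	Transient Category = "TRANSIENT"
)

// TriggerError wraps an underlying error with a retry category.
type TriggerError struct {
	Category Category
	Err      error
}

func (e *TriggerError) Error() string { return fmt.Sprintf("%s: %v", e.Category, e.Err) }

// Unwrap lets errors.Is and errors.As reach the underlying cause.
func (e *TriggerError) Unwrap() error { return e.Err }

// classifyHTTPStatus maps 4xx to PERMANENT and 5xx to TRANSIENT; any other
// status is treated as success in this sketch.
func classifyHTTPStatus(status int) error {
	switch {
	case status >= 400 && status < 500:
		return &TriggerError{Category: Permanent, Err: fmt.Errorf("http status %d", status)}
	case status >= 500:
		return &TriggerError{Category: Transient, Err: fmt.Errorf("http status %d", status)}
	}
	return nil
}

// isTransient reports whether err, or anything it wraps, is a transient
// TriggerError.
func isTransient(err error) bool {
	var te *TriggerError
	return errors.As(err, &te) && te.Category == Transient
}

func main() {
	for _, status := range []int{200, 404, 503} {
		err := classifyHTTPStatus(status)
		fmt.Println(status, err, isTransient(err))
	}
}
```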
Changed
- Post-run removed from SFN: 6 ASL states removed. Job success routes directly to `CompleteTrigger`. State count: 30 → 24.
- Post-run removed from orchestrator: `handlePostRun` removed. Post-run logic is entirely stream-based.
- Watchdog uses dependency-injected time: `Deps.NowFunc` replaces package variable for consistent testability.
- SFN execution name truncation: Names truncated to 80 characters (AWS limit) at all construction sites.
- Environment variable scoping: `os.ExpandEnv` restricted to `INTERLOCK_` prefixed variables only.
Fixed
- Calendar exclusion uses execution date: `isExcludedDate` checks the job's execution date (not `time.Now()`), preventing incorrect exclusions on re-runs for previous days. Supports `YYYY-MM-DD` and composite `YYYY-MM-DDTHH` formats.
- Atomic lock reset: `ResetTriggerLock` uses single DynamoDB `UpdateItem` with `attribute_exists(PK)` condition, eliminating the race window between delete and create.
- Lock release on SFN start failure: Both rerun and job-failure retry paths release the trigger lock if `StartExecution` fails, preventing permanently stuck pipelines.
- Terminal trigger status on calendar exclusion: `handleJobFailure` sets `FAILED_FINAL` instead of leaving the lock in `RUNNING` state to silently expire via TTL.
- ASL CompleteTrigger failure path: New states ensure SLA schedules are cancelled before entering terminal Fail state.
- Event ordering: `RERUN_ACCEPTED` only publishes after `ResetTriggerLock` confirms lock atomicity.
- publishEvent error logging: All 17 call sites now log errors at WARN level.
- SLA monitor error wrapping: `createOneTimeSchedule` wraps errors with schedule name context.
- HTTP response body sanitization: Error bodies truncated to 512 bytes with control characters stripped.
- DynamoDB table protection: All 4 tables have `deletion_protection_enabled` and point-in-time recovery enabled.
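The response body sanitization item (truncate to 512 bytes, strip control characters) can be sketched as below. `sanitizeBody` is a hypothetical helper; the order of operations and the use of `unicode.IsControl` are assumptions of this example:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

const maxBodyBytes = 512

// sanitizeBody caps the body at 512 bytes, then drops control characters
// (including CR, LF, and NUL) so the result is safe to embed in a
// single-line log entry or error message.
func sanitizeBody(body []byte) string {
	if len(body) > maxBodyBytes {
		body = body[:maxBodyBytes]
	}
	return strings.Map(func(r rune) rune {
		if unicode.IsControl(r) {
			return -1 // a negative return value removes the rune
		}
		return r
	}, string(body))
}

func main() {
	raw := []byte("upstream failed\r\n\x00injected: line")
	fmt.Printf("%q\n", sanitizeBody(raw))
}
```

Truncating by bytes can cut a multi-byte UTF-8 character in half; a stricter version could pass the result through `strings.ToValidUTF8` as well.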
v0.6.2
Added
- Lambda trigger type: New `TriggerLambda` for direct AWS Lambda SDK invocation (RequestResponse). `LambdaTriggerConfig` with `functionName` and optional `payload` (supports env-var expansion). Non-polling `CheckStatus`.
- Non-polling trigger synchronous completion: `handleTrigger` writes success joblog immediately for non-polling triggers (http, command, lambda) and returns sentinel `runId` so the Step Functions `CheckJob` JSONPath resolves.
Fixed
- SFN crash on non-polling trigger success: Non-polling triggers returned empty `runId` (omitted via `omitempty`), causing `$.triggerResult.runId` JSONPath failure in the `CheckJob` state. Previously masked because http triggers always failed (403) into the retry/error path.
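The `omitempty` behaviour behind this crash is easy to reproduce in isolation: an empty string field tagged `omitempty` disappears from the JSON, so a JSONPath such as `$.triggerResult.runId` has nothing to resolve. In the sketch below, the struct, the `encodeResult` helper, and the sentinel value `"sync"` are all hypothetical; the notes do not state the actual sentinel:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// triggerResult mirrors the shape implied by the notes: a runId field that
// is dropped from the JSON when empty.
type triggerResult struct {
	RunID string `json:"runId,omitempty"`
}

// syncRunID is a hypothetical sentinel for triggers that complete
// synchronously and therefore have no real run ID.
const syncRunID = "sync"

// encodeResult substitutes the sentinel for non-polling triggers so the
// runId key is always present in their output.
func encodeResult(runID string, polling bool) (string, error) {
	if !polling && runID == "" {
		runID = syncRunID
	}
	b, err := json.Marshal(triggerResult{RunID: runID})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	withoutSentinel, _ := json.Marshal(triggerResult{})
	fmt.Println(string(withoutSentinel)) // the runId key is missing entirely

	withSentinel, _ := encodeResult("", false)
	fmt.Println(withSentinel)
}
```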