Merged
Conversation
Replace os.ExpandEnv() with os.Expand(str, safeEnvLookup) in Airflow, Databricks, and Lambda triggers. The HTTP trigger already used the safe pattern — these three bypassed it, allowing config authors to exfiltrate secrets like AWS_SECRET_ACCESS_KEY via header/body expansion.
Add ReleaseTriggerLock call when startSFN fails after lock acquisition in handleSensorEvent and reconcileSensorTriggers. Without this, the pipeline is deadlocked until lock TTL expiry (4.5h default). The rerun path already implemented this pattern correctly. Also fix scheduleSLAAlerts to skip pipelines on GetTrigger error instead of falling through to schedule SLA alerts for unknown state.
The alert-dispatcher uses both env vars for DynamoDB thread timestamp storage but envcheck didn't validate them at startup, causing silent runtime failures.
Replace all _ = d.Store.WriteJobEvent(...) silent discards with warn-level logging in rerun.go (6 sites) and orchestrator.go (3 sites). No control flow changes — audit write failures are now visible in CloudWatch Logs without blocking the happy path.
Migrate all direct time.Now() calls in internal/lambda/ production files to d.now(), which delegates to NowFunc when set. This enables deterministic time testing across all handlers — previously only watchdog used the injectable pattern. Adds now time.Time parameter to ResolveExecutionDate and handleSLACalculate to propagate injectable time to standalone functions. Also adds JobEventJobPollExhausted to isJobTerminal switch.
Replace shallow struct copy (cp := *v) with JSON marshal/unmarshal to prevent nested pointer fields (SLA, PostRun, Job.Config maps) from being shared between cache and callers. InjectDateArgs mutates Glue.Arguments in-place, which previously corrupted the cache for subsequent reads.
Add metric alarms for: Lambda errors (6 functions), SFN execution failures, DLQ message depth (3 queues), DynamoDB stream iterator age (2 streams). All use for_each to avoid repetition. Optional sns_alarm_topic_arn variable for SNS notifications.
Set reserved_concurrent_executions on all 6 Lambda functions via configurable lambda_concurrency variable. Add conditional lambda:InvokeFunction IAM policy for the Lambda trigger type (enable_lambda_trigger variable, default false).
Add EventBridge rules on the default bus to capture CloudWatch alarm state changes for interlock alarms (prefix-filtered). Input transformers reshape CW alarm events into InterlockEvent format so event-sink and alert-dispatcher handle them natively with zero Go logic changes. All state changes logged to events table; only ALARM transitions route to Slack via the existing alert SQS queue. Updates SQS queue policy to allow both the framework alert rule and the new CW alarm alert rule. Also adds missing event types to the framework alert rule: POST_RUN_DRIFT, RERUN_REJECTED, LATE_DATA_ARRIVAL, TRIGGER_RECOVERED, and 5 others.
Add optional SLACK_SECRET_ARN env var for alert-dispatcher. When set, reads the Slack bot token from Secrets Manager at cold start instead of from plaintext SLACK_BOT_TOKEN env var. Falls back to env var when ARN is empty for backward compatibility. Includes Terraform resources: conditional IAM policy for secretsmanager:GetSecretValue, slack_secret_arn variable. Removes SLACK_BOT_TOKEN from envcheck required vars since it's optional when using Secrets Manager.
- Inline Secrets Manager policy document to eliminate dangling data block that evaluates even when count=0 (S1) - Scope lambda:InvokeFunction to configurable lambda_trigger_arns variable instead of Resource=* (S2) - Conditionally omit SLACK_BOT_TOKEN from Lambda env when SLACK_SECRET_ARN is set (S4)
Add xoxb-/xoxe- prefix check after reading secret value. Logs a warning if the format doesn't match expected Slack bot token prefix, preventing silent auth failures from JSON-encoded or malformed secrets.
- Reuse `now` variable in handleSensorEvent instead of calling d.now() multiple times (avoids time drift between calendar check and date resolution) - Capture time.Now() once in ResolveExecutionDate tests to prevent date-boundary race at midnight
…anager Instead of silently continuing with an empty Slack token when SecretString is nil, exit with an error. This prevents the dispatcher from running in a broken state.
Move the now capture above EvaluateRule so rule evaluation, calendar exclusion, and execution date resolution all use the same instant. Eliminates the last temporal inconsistency in handleSensorEvent.
Guard against starting with an empty token when neither SLACK_BOT_TOKEN nor SLACK_SECRET_ARN resolves to a non-empty value. Prevents the dispatcher from silently dropping all alerts.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Addresses safety, security, and operational hardness findings from the 3-panel expert codebase evaluation.
Safety & Security
os.ExpandEnvtoINTERLOCK_prefix only — prevents credential exfiltration via pipeline config env vars (airflow, databricks, lambda triggers)WriteJobEvent,publishEvent,WriteSensor) — previously silently discarded with_ =time.Now()with injectabled.now()— enables deterministic time in testsEVENTS_TABLEandEVENTS_TTL_DAYSto alert-dispatcher env validationOperational Hardness
ConfigCache.GetAll()via JSON round-trip — prevents mutation of cached datalambda_concurrencyvariable)lambda_trigger_arns)Security hardening (from reviewer pass)
jsonencode(eliminates danglingdatablock)SLACK_BOT_TOKENenv var (only set when not using Secrets Manager)Coverage
internal/lambdainternal/storeinternal/trigger