Skip to content

fix: safety & operational hardening#61

Merged
dwsmith1983 merged 17 commits intomainfrom
fix/safety-operational-hardening
Mar 8, 2026
Merged

fix: safety & operational hardening#61
dwsmith1983 merged 17 commits intomainfrom
fix/safety-operational-hardening

Conversation

@dwsmith1983
Copy link
Copy Markdown
Owner

Summary

Addresses safety, security, and operational hardness findings from the 3-panel expert codebase evaluation.

Safety & Security

  • Restrict os.ExpandEnv to INTERLOCK_ prefix only — prevents credential exfiltration via pipeline config env vars (airflow, databricks, lambda triggers)
  • Release trigger lock on SFN start failure — prevents permanent lock-out of pipelines
  • Log all audit write errors (WriteJobEvent, publishEvent, WriteSensor) — previously silently discarded with _ =
  • Replace all time.Now() with injectable d.now() — enables deterministic time in tests
  • Add EVENTS_TABLE and EVENTS_TTL_DAYS to alert-dispatcher env validation

Operational Hardness

  • Deep copy pipeline configs in ConfigCache.GetAll() via JSON round-trip — prevents mutation of cached data
  • CloudWatch alarms for Lambda errors (6 functions), SFN execution failures, DLQ message depth (3 queues), DynamoDB stream iterator age (2 streams) — all routed through EventBridge into existing event pipeline
  • Reserved concurrency limits on all 6 Lambda functions (configurable via lambda_concurrency variable)
  • Secrets Manager integration for Slack bot token with backward-compatible env var fallback
  • EventBridge rules route CW alarm state changes to event-sink (all states) and SQS alert queue (ALARM only) via input transformers — zero Go code changes
  • Scoped IAM policy for Lambda trigger invocation (configurable via lambda_trigger_arns)

Security hardening (from reviewer pass)

  • Inline Terraform policy with jsonencode (eliminates dangling data block)
  • Conditional SLACK_BOT_TOKEN env var (only set when not using Secrets Manager)
  • Scoped Lambda trigger IAM to specific ARNs
  • Fail-fast on nil/empty Secrets Manager response
  • Fail-fast when no Slack token is configured

Coverage

Package Coverage
internal/lambda 85.6%
internal/store 90.2%
internal/trigger 85.8%

Replace os.ExpandEnv() with os.Expand(str, safeEnvLookup) in Airflow,
Databricks, and Lambda triggers. The HTTP trigger already used the safe
pattern — these three bypassed it, allowing config authors to exfiltrate
secrets like AWS_SECRET_ACCESS_KEY via header/body expansion.
Add ReleaseTriggerLock call when startSFN fails after lock acquisition
in handleSensorEvent and reconcileSensorTriggers. Without this, the
pipeline is deadlocked until lock TTL expiry (4.5h default). The rerun
path already implemented this pattern correctly.

Also fix scheduleSLAAlerts to skip pipelines on GetTrigger error instead
of falling through to schedule SLA alerts for unknown state.
The alert-dispatcher uses both env vars for DynamoDB thread timestamp
storage but envcheck didn't validate them at startup, causing silent
runtime failures.
Replace all _ = d.Store.WriteJobEvent(...) silent discards with
warn-level logging in rerun.go (6 sites) and orchestrator.go (3 sites).
No control flow changes — audit write failures are now visible in
CloudWatch Logs without blocking the happy path.
Migrate all direct time.Now() calls in internal/lambda/ production files
to d.now(), which delegates to NowFunc when set. This enables
deterministic time testing across all handlers — previously only watchdog
used the injectable pattern.

Adds now time.Time parameter to ResolveExecutionDate and
handleSLACalculate to propagate injectable time to standalone functions.
Also adds JobEventJobPollExhausted to isJobTerminal switch.
Replace shallow struct copy (cp := *v) with JSON marshal/unmarshal to
prevent nested pointer fields (SLA, PostRun, Job.Config maps) from being
shared between cache and callers. InjectDateArgs mutates Glue.Arguments
in-place, which previously corrupted the cache for subsequent reads.
Add metric alarms for: Lambda errors (6 functions), SFN execution
failures, DLQ message depth (3 queues), DynamoDB stream iterator age
(2 streams). All use for_each to avoid repetition. Optional
sns_alarm_topic_arn variable for SNS notifications.
Set reserved_concurrent_executions on all 6 Lambda functions via
configurable lambda_concurrency variable. Add conditional
lambda:InvokeFunction IAM policy for the Lambda trigger type
(enable_lambda_trigger variable, default false).
Add EventBridge rules on the default bus to capture CloudWatch alarm
state changes for interlock alarms (prefix-filtered). Input transformers
reshape CW alarm events into InterlockEvent format so event-sink and
alert-dispatcher handle them natively with zero Go logic changes.

All state changes logged to events table; only ALARM transitions route
to Slack via the existing alert SQS queue. Updates SQS queue policy to
allow both the framework alert rule and the new CW alarm alert rule.

Also adds missing event types to the framework alert rule: POST_RUN_DRIFT,
RERUN_REJECTED, LATE_DATA_ARRIVAL, TRIGGER_RECOVERED, and 5 others.
Add optional SLACK_SECRET_ARN env var for alert-dispatcher. When set,
reads the Slack bot token from Secrets Manager at cold start instead of
from plaintext SLACK_BOT_TOKEN env var. Falls back to env var when ARN
is empty for backward compatibility.

Includes Terraform resources: conditional IAM policy for
secretsmanager:GetSecretValue, slack_secret_arn variable. Removes
SLACK_BOT_TOKEN from envcheck required vars since it's optional when
using Secrets Manager.
- Inline Secrets Manager policy document to eliminate dangling data block
  that evaluates even when count=0 (S1)
- Scope lambda:InvokeFunction to configurable lambda_trigger_arns
  variable instead of Resource=* (S2)
- Conditionally omit SLACK_BOT_TOKEN from Lambda env when
  SLACK_SECRET_ARN is set (S4)
Add xoxb-/xoxe- prefix check after reading secret value. Logs a warning
if the format doesn't match expected Slack bot token prefix, preventing
silent auth failures from JSON-encoded or malformed secrets.
- Reuse `now` variable in handleSensorEvent instead of calling d.now()
  multiple times (avoids time drift between calendar check and date
  resolution)
- Capture time.Now() once in ResolveExecutionDate tests to prevent
  date-boundary race at midnight
…anager

Instead of silently continuing with an empty Slack token when
SecretString is nil, exit with an error. This prevents the dispatcher
from running in a broken state.
Move the now capture above EvaluateRule so rule evaluation, calendar
exclusion, and execution date resolution all use the same instant.
Eliminates the last temporal inconsistency in handleSensorEvent.
Guard against starting with an empty token when neither SLACK_BOT_TOKEN
nor SLACK_SECRET_ARN resolves to a non-empty value. Prevents the
dispatcher from silently dropping all alerts.
@github-actions github-actions bot added tests Test changes lambda Lambda handlers deploy Deployment and ASL triggers Trigger types dependencies Dependency updates types Public types (pkg/types) labels Mar 8, 2026
@dwsmith1983 dwsmith1983 self-assigned this Mar 8, 2026
@dwsmith1983 dwsmith1983 merged commit 0dbc117 into main Mar 8, 2026
6 checks passed
@dwsmith1983 dwsmith1983 deleted the fix/safety-operational-hardening branch March 8, 2026 15:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Dependency updates deploy Deployment and ASL lambda Lambda handlers tests Test changes triggers Trigger types types Public types (pkg/types)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant