fix: close 15 STAMP safety gaps for v0.7.0 #57

Merged
dwsmith1983 merged 4 commits into main from fix/safety-gaps on Mar 8, 2026

@dwsmith1983
Owner

Summary

  • Calendar exclusion uses execution date: re-runs for previous days check the job's date (not time.Now()), preventing incorrect exclusions on weekends/holidays
  • Atomic lock reset: ResetTriggerLock uses a single DynamoDB UpdateItem with an attribute_exists(PK) condition, eliminating the delete+create race window
  • 3 new EventBridge events: BASELINE_CAPTURE_FAILED, PIPELINE_EXCLUDED, and RERUN_ACCEPTED close audit trail gaps across all code paths
  • ASL CompleteTrigger failure path: catches trigger errors, cancels SLA schedules, then enters terminal Fail state (prevents orphaned SLA alarms)
  • Lock release on SFN start failure: both rerun and job-failure retry paths release the trigger lock if StartExecution fails
  • Terminal trigger status on calendar exclusion: handleJobFailure sets FAILED_FINAL instead of leaving the lock in RUNNING to expire via TTL
  • Event ordering correctness: RERUN_ACCEPTED only publishes after lock atomicity is confirmed
  • Post-run drift observability: PIPELINE_EXCLUDED event published when drift rerun is skipped by calendar
  • 30+ new tests covering all new paths

Calendar exclusion now uses the job's execution date (not wall clock),
preventing incorrect skips on re-runs for previous days. Lock reset is
atomic via DynamoDB UpdateItem (eliminates delete+create race window).
New events (BASELINE_CAPTURE_FAILED, PIPELINE_EXCLUDED, RERUN_ACCEPTED)
close audit trail gaps. ASL CompleteTrigger failure path cancels SLA
schedules before entering terminal Fail state. All paths that start SFN
executions now release locks on failure.

Safety gaps addressed:
- Calendar exclusion uses execution date, not time.Now()
- Atomic lock reset (ResetTriggerLock) replaces delete+create
- Lock release on SFN StartExecution failure (rerun + job failure)
- BASELINE_CAPTURE_FAILED event on post-run baseline capture error
- PIPELINE_EXCLUDED event on calendar exclusion (sensor + rerun paths)
- RERUN_ACCEPTED event before lock reset (audit completeness)
- INFRA_FAILURE event on lock reset failure
- ASL CompleteTrigger catch → SLA cancellation → Fail state
- Post-run drift skips rerun write on excluded dates
- 30+ new tests covering all new paths
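
The lock-release-on-failure item in the list above can be sketched as follows. This is an assumption-laden illustration, not the handlers from this PR: the triggerLock and executionStarter interfaces, startOrRelease, and the fake types are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// triggerLock and executionStarter are hypothetical interfaces for this sketch.
type triggerLock interface {
	Release(pipelineID string) error
}

type executionStarter interface {
	StartExecution(pipelineID string) error
}

// startOrRelease starts an execution for a pipeline whose lock is already held.
// If the start fails, it releases the lock so the pipeline is not left locked
// until a TTL expires.
func startOrRelease(l triggerLock, s executionStarter, pipelineID string) error {
	err := s.StartExecution(pipelineID)
	if err == nil {
		return nil
	}
	if relErr := l.Release(pipelineID); relErr != nil {
		// Report both failures to the caller.
		return errors.Join(err, relErr)
	}
	return err
}

// recordingLock remembers which locks were released.
type recordingLock struct{ released []string }

func (r *recordingLock) Release(id string) error {
	r.released = append(r.released, id)
	return nil
}

// failingStarter always fails, simulating a StartExecution error.
type failingStarter struct{}

func (failingStarter) StartExecution(string) error {
	return errors.New("StartExecution failed")
}

func main() {
	lock := &recordingLock{}
	err := startOrRelease(lock, failingStarter{}, "pipeline-a")
	fmt.Println("error:", err)
	fmt.Println("released:", lock.released)
}
```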
…ervability)

- handleJobFailure: set trigger to FAILED_FINAL on calendar exclusion
  (prevents orphaned RUNNING lock that silently expires via TTL)
- handleRerunRequest: move RERUN_ACCEPTED event after ResetTriggerLock
  succeeds (event now only publishes when rerun actually starts)
- handlePostRunCompleted: publish PIPELINE_EXCLUDED event when drift
  rerun is skipped due to calendar exclusion (closes observability gap)
@github-actions bot added the labels tests (Test changes), lambda (Lambda handlers), deploy (Deployment and ASL), and types (Public types, pkg/types) on Mar 8, 2026
@dwsmith1983 dwsmith1983 self-assigned this Mar 8, 2026
@github-actions bot added the label docs (Documentation) on Mar 8, 2026
Hardcode constant pipeline/schedule/date values in test helpers that
only ever receive a single value. Remove unused seedSensorForCircuitBreaker
and seedJobSuccess functions. Remove orphaned const blocks.
@dwsmith1983 dwsmith1983 merged commit 96a65ee into main Mar 8, 2026
6 checks passed
@dwsmith1983 dwsmith1983 deleted the fix/safety-gaps branch March 8, 2026 12:29

Labels

deploy (Deployment and ASL), docs (Documentation), lambda (Lambda handlers), tests (Test changes), types (Public types, pkg/types)
