3 changes: 2 additions & 1 deletion .github/docs/README.md
@@ -128,7 +128,7 @@ flowchart LR
### Cleanup And Discovery

- `destroy.yml`
Tears down app layers before shared dependencies, including the shared observability dashboard and any environment-owned shared artifact stacks such as the `dev` code bucket.
Tears down app layers before shared dependencies, including the shared observability dashboard and any environment-owned shared artifact stacks such as the `dev` code bucket. In the `database` job, `dev` now runs `tg_action: init` first to read Terraform outputs from the database stack, then passes `cluster_identifier` and `manual_snapshot_identifier_prefix` into `justfile.deploy` so the cleanup recipe deletes only repo-owned manual Aurora cluster snapshots before Terragrunt destroy. `prod` intentionally retains those manual snapshots.
- `shared_directories_get.yml`
Derives the directory-based matrices used by wrapper workflows and PR action-test discovery.

@@ -206,6 +206,7 @@ Run these checks on every CI, workflow, or deploy-contract change.
- confirm destroy ordering still removes downstream consumers before shared stacks
- check required Terraform variables on destroy as well as apply
- prefer depending on real downstream consumers rather than serializing unrelated shared stacks
- when a runtime or module creates manual backup artifacts outside Terraform resource ownership, decide explicitly whether destroy should delete or retain them by environment and keep that behavior documented in `destroy.yml` contracts

## Wrapper Workflow Summary

25 changes: 25 additions & 0 deletions .github/workflows/destroy.yml
@@ -206,6 +206,31 @@ jobs:
          role-to-assume: ${{ env.AWS_OIDC_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Get database infra outputs
        if: inputs.environment != 'prod'
        id: get-database
        uses: ./.github/actions/terragrunt
        env:
          TF_VAR_database_security_group_id: "destroy-placeholder"
        with:
          tg_directory: infra/live/${{ inputs.environment }}/aws/database
          tg_action: init

      - name: Delete dev manual database snapshots
        if: inputs.environment != 'prod'
        uses: ./.github/actions/just
        env:
          TG_OUTPUTS: ${{ steps.get-database.outputs.tg_outputs }}
          CLUSTER_IDENTIFIER: ${{ fromJson(steps.get-database.outputs.tg_outputs).cluster_identifier.value }}
          MANUAL_SNAPSHOT_PREFIX: ${{ fromJson(steps.get-database.outputs.tg_outputs).manual_snapshot_identifier_prefix.value }}
        with:
          justfile_path: justfile.deploy
          just_action: database-delete-manual-snapshots

      - name: Keep prod manual database snapshots
        if: inputs.environment == 'prod'
        run: echo "Retaining prod manual database snapshots."

      - name: Destroy database infra
        uses: ./.github/actions/terragrunt
        env:
4 changes: 4 additions & 0 deletions AGENTS.md
@@ -18,7 +18,11 @@ Update documentation in the same change:
- keep `.github/docs/README.md` as the source of truth for workflow contracts and CI feasibility checks
- prefer Mermaid diagrams in `.github/docs/README.md` that show jobs, `needs`, and reusable-workflow relationships rather than trying to reproduce the exact GitHub Actions UI
- when adding a new AWS infra type or service family, check whether the deploy role in `infra/live/global_vars.hcl` needs additional `allowed_role_actions` and update it in the same change if required
- when changing Terraform in a way that introduces any new AWS service surface area or API family, even inside an existing module, review `infra/live/global_vars.hcl` for required `allowed_role_actions` updates in the same change; do not limit this check only to obviously new top-level stack types
- before closing any infra change that adds AWS resources, IAM principals, or orchestration services, explicitly verify whether it introduced new permissions for deploy-time creation or mutation and update `infra/live/global_vars.hcl` if needed
- when changing the set of deployable Lambda or ECS runtimes, check whether the shared `observability` dashboard still reflects the current runtime surface and update it in the same change if needed
- when changing `infra/modules/aws/_shared/database/**` recovery behavior, restore-drill behavior, backup retention, reader defaults, or other resilience knobs, include a rough cost comparison in the final response that contrasts `dev`, `standard`, and `critical`; keep it qualitative unless current pricing was explicitly requested, and call out that the Aurora scratch restore's compute and storage dominate drill cost, not Step Functions or EventBridge Scheduler
- when changing a live database stack's `recovery_class`, include a short, conspicuous ANSI-colored warning block in the final response in the form "you have changed from X to Y", followed by a brief note on the likely cost direction, such as higher backup storage, more required readers, or more frequent restore drills; keep it short and awareness-focused rather than explanatory

### Documentation Architecture

1 change: 1 addition & 0 deletions README.md
@@ -170,6 +170,7 @@ see [infra/README.md](infra/README.md#infra-deployment-use-cases).
For Lambda provisioned concurrency patterns and example `provisioned_config` shapes, see [infra/modules/aws/_shared/lambda/README.md](infra/modules/aws/_shared/lambda/README.md).

For ECS scaling patterns and `scaling_strategy` examples, see [infra/modules/aws/_shared/service/README.md](infra/modules/aws/_shared/service/README.md).
For Aurora recovery posture presets such as `dev`, `standard`, and `critical`, plus the optional restore-drill Step Functions skeleton, see [infra/modules/aws/_shared/database/README.md](infra/modules/aws/_shared/database/README.md).

### Deployment Model

2 changes: 1 addition & 1 deletion infra/live/dev/aws/database/terragrunt.hcl
@@ -4,7 +4,7 @@ include "root" {

inputs = {
database_name = "app"
backup_retention_period = 1
recovery_class = "dev"
rds_min_capacity = 0.5
rds_max_capacity = 1.0
rds_max_reader_count = 0
2 changes: 2 additions & 0 deletions infra/live/global_vars.hcl
@@ -11,8 +11,10 @@ locals {
"application-autoscaling:*",
"cloudwatch:*",
"events:*",
"scheduler:*",
"sqs:*",
"sns:*",
"states:*",
"cloudfront:*",
"xray:*",
"ec2:*",
2 changes: 1 addition & 1 deletion infra/live/prod/aws/database/terragrunt.hcl
@@ -4,7 +4,7 @@ include "root" {

inputs = {
database_name = "app"
backup_retention_period = 7
recovery_class = "standard" # "critical" for production workloads, "standard" for non-production workloads
rds_min_capacity = 0.5
rds_max_capacity = 2.0
rds_max_reader_count = 1
117 changes: 116 additions & 1 deletion infra/modules/aws/_shared/database/README.md
@@ -29,7 +29,9 @@ Shared Aurora PostgreSQL Serverless v2 module.
- `publicly_accessible`
- `database_port`
- `engine_version`
- `backup_retention_period`
- `recovery_class`
- `restore_drill`
- `manual_snapshot`
- `rds_min_capacity`
- `rds_max_capacity`
- `rds_max_reader_count`
@@ -45,10 +45,123 @@ Shared Aurora PostgreSQL Serverless v2 module.
- `database_port`
- `readonly_endpoint`
- `readwrite_endpoint`
- `recovery_class`
- `restore_drill_cadence`
- `target_rpo_minutes`
- `target_rto_minutes`
- `restore_drill_enabled`
- `restore_drill_mode`
- `restore_drill_schedule_expression`
- `restore_drill_state_machine_arn`
- `restore_drill_state_machine_name`
- `manual_snapshot_enabled`
- `manual_snapshot_state_machine_arn`
- `manual_snapshot_state_machine_name`
- `manual_snapshot_identifier_prefix`

This module is intentionally Aurora PostgreSQL Serverless v2 specific. It does not currently support provisioned RDS instances or non-Postgres engines.
In this repo the concrete `database` wrapper resolves the VPC and public or private subnet ids, while the shared infra workflow injects `database_security_group_id` from the `security` stack via `TF_VAR_database_security_group_id`.
By default the module tracks the latest matching Aurora PostgreSQL 16.x engine version rather than pinning a specific patch release.
SSM parameter paths are rooted at `/<environment>/<project>/<database>/...` so they do not collide with AWS-reserved `/aws` prefixes.
The runtime contract for database credentials is the Aurora-managed master secret exposed from the cluster. Terraform reads the managed secret ARN directly from the cluster resource rather than doing a separate Secrets Manager lookup during the same apply, because AWS may not populate that managed-secret reference early enough for an immediate data read.
If you need new scale-out readers to inherit cluster tags, keep that automation in a separate stack such as `rds_reader_tagger` rather than pushing event-driven behavior into this shared database module.
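
A minimal sketch of that managed-secret pattern, assuming illustrative resource and variable names rather than the module's actual identifiers:

```hcl
# Illustrative only: resource, variable, and path names are assumptions.
resource "aws_ssm_parameter" "database_secret_arn" {
  name = "/${var.environment}/${var.project}/${var.database_name}/secret_arn"
  type = "String"

  # Read the Aurora-managed master secret ARN straight off the cluster
  # resource instead of a separate Secrets Manager data lookup in the same apply.
  value = aws_rds_cluster.this.master_user_secret[0].secret_arn
}
```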

## Recovery Classes

The shared module derives backup retention, deletion protection, final-snapshot behavior, minimum reader count, and recovery metadata from a single `recovery_class` input.

### `dev`

- 1 day of automated backup retention
- deletion protection disabled
- no final snapshot on destroy
- no required reader instances
- `restore_drill_cadence = "never"`

### `standard`

- 7 days of automated backup retention
- deletion protection enabled
- final snapshot required on destroy
- at least 1 reader instance when multiple subnet AZs are available
- `restore_drill_cadence = "monthly"`

### `critical`

- 35 days of automated backup retention
- deletion protection enabled
- final snapshot required on destroy
- at least 2 reader instances when enough subnet AZs are available
- `restore_drill_cadence = "weekly"`

The module publishes `RecoveryClass`, `RestoreDrillCadence`, `TargetRPOMinutes`, and `TargetRTOMinutes` as cluster tags so operators can see the intended recovery posture directly on the Aurora cluster.
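
As a rough sketch of how those class definitions could be encoded, assuming hypothetical local names rather than the module's actual internals:

```hcl
# Illustrative mapping only; local names are assumptions.
locals {
  recovery_class_settings = {
    dev = {
      backup_retention_period = 1
      deletion_protection     = false
      skip_final_snapshot     = true
      min_reader_count        = 0
      restore_drill_cadence   = "never"
    }
    standard = {
      backup_retention_period = 7
      deletion_protection     = true
      skip_final_snapshot     = false
      min_reader_count        = 1
      restore_drill_cadence   = "monthly"
    }
    critical = {
      backup_retention_period = 35
      deletion_protection     = true
      skip_final_snapshot     = false
      min_reader_count        = 2
      restore_drill_cadence   = "weekly"
    }
  }

  # Settings for the selected class; downstream resources read from this.
  recovery = local.recovery_class_settings[var.recovery_class]
}
```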

## Restore Drill

The shared module can also provision an opt-in restore-drill skeleton. When enabled, it creates:

- a Step Functions state machine for manual restore-drill execution
- an optional EventBridge Scheduler schedule when the mode includes scheduled runs
- the IAM roles needed for the scheduler to start the state machine and for Step Functions to call RDS APIs

Example:

```hcl
recovery_class = "standard"

restore_drill = {
  enabled      = true
  mode         = "manual_and_scheduled"
  use_pitr     = true
  retain_hours = 4
}
```

The schedule expression is derived from `recovery_class`:

- `dev`: no automatic schedule
- `standard`: `rate(30 days)`
- `critical`: `rate(7 days)`
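
When scheduled runs are enabled, the scheduler wiring looks roughly like the sketch below; the resource names and role reference are illustrative, not the module's actual ones:

```hcl
# Illustrative only; names and the fixed rate are assumptions.
resource "aws_scheduler_schedule" "restore_drill" {
  name                = "example-restore-drill"
  schedule_expression = "rate(30 days)" # derived from recovery_class in the module

  flexible_time_window {
    mode = "OFF"
  }

  target {
    arn      = aws_sfn_state_machine.restore_drill.arn
    role_arn = aws_iam_role.restore_drill_scheduler.arn
  }
}
```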

Rough cost guidance by recovery class:

- `dev`: lowest ongoing cost; 1-day automated backups, no final snapshot on destroy, no required reader instances, no scheduled drill by default
- `standard`: moderate cost increase; 7-day backups, final snapshot on destroy, at least 1 reader when multiple subnet AZs are available, monthly scheduled drill if enabled
- `critical`: highest ongoing cost; 35-day backups, final snapshot on destroy, at least 2 readers when enough subnet AZs are available, weekly scheduled drill if enabled

The largest drill-related cost is the temporary restored Aurora cluster and scratch writer instance. Step Functions and EventBridge Scheduler usually contribute negligible cost compared with Aurora compute and storage.

The current Step Functions skeleton:

1. restores a temporary Aurora cluster from PITR
2. waits for the scratch cluster to become available
3. creates one temporary writer instance
4. waits for the instance to become available
5. holds the restored environment for the configured retention window
6. deletes the temporary instance and cluster

This first version does not yet run application-level validation against the restored database. It proves restore orchestration and cleanup only. Add a dedicated validation Lambda or ECS task later once the restore path itself is stable.
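
As a rough illustration of that orchestration, a definition along the lines of the sketch below could drive the drill; the state names, identifiers, wait durations, and exact aws-sdk parameter casing are assumptions, and the real definition should poll RDS for availability rather than rely on fixed waits:

```hcl
# Illustrative skeleton only; identifiers and durations are assumptions.
resource "aws_sfn_state_machine" "restore_drill_example" {
  name     = "example-restore-drill"
  role_arn = aws_iam_role.restore_drill_sfn.arn

  definition = jsonencode({
    StartAt = "RestoreClusterFromPitr"
    States = {
      RestoreClusterFromPitr = {
        Type     = "Task"
        Resource = "arn:aws:states:::aws-sdk:rds:restoreDBClusterToPointInTime"
        Parameters = {
          DbClusterIdentifier       = "example-drill-scratch"
          SourceDbClusterIdentifier = "example-source-cluster"
          UseLatestRestorableTime   = true
        }
        Next = "WaitForCluster"
      }
      WaitForCluster = { Type = "Wait", Seconds = 900, Next = "CreateScratchWriter" }
      CreateScratchWriter = {
        Type     = "Task"
        Resource = "arn:aws:states:::aws-sdk:rds:createDBInstance"
        Parameters = {
          DbInstanceIdentifier = "example-drill-scratch-writer"
          DbClusterIdentifier  = "example-drill-scratch"
          DbInstanceClass      = "db.serverless"
          Engine               = "aurora-postgresql"
        }
        Next = "HoldForRetentionWindow"
      }
      HoldForRetentionWindow = { Type = "Wait", Seconds = 14400, Next = "DeleteScratchWriter" }
      DeleteScratchWriter = {
        Type       = "Task"
        Resource   = "arn:aws:states:::aws-sdk:rds:deleteDBInstance"
        Parameters = { DbInstanceIdentifier = "example-drill-scratch-writer" }
        Next       = "DeleteScratchCluster"
      }
      DeleteScratchCluster = {
        Type     = "Task"
        Resource = "arn:aws:states:::aws-sdk:rds:deleteDBCluster"
        Parameters = {
          DbClusterIdentifier = "example-drill-scratch"
          SkipFinalSnapshot   = true
        }
        End = true
      }
    }
  })
}
```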

## Manual Snapshot

The shared module can also provision an opt-in manual snapshot trigger. This is separate from the restore drill:

- `manual_snapshot` creates a named Aurora cluster snapshot on demand
- `restore_drill` restores a temporary cluster and validates the recovery path

Example:

```hcl
manual_snapshot = {
  enabled = true
}
```

When enabled, the module creates a second Step Functions state machine that:

1. builds a unique snapshot identifier
2. creates a manual Aurora cluster snapshot
3. waits until the snapshot reaches `available`

Use the `manual_snapshot_state_machine_arn` or `manual_snapshot_state_machine_name` output to start it manually from the Step Functions console or CLI.
The module also exposes `manual_snapshot_identifier_prefix` so destroy or cleanup paths can delete only the repo-owned manual snapshots without re-deriving the naming contract outside Terraform.
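
For orientation, the prefix contract could look roughly like the sketch below; the derivation shown is an assumption, not the module's exact naming scheme:

```hcl
# Illustrative only; the real prefix derivation lives inside the module.
locals {
  manual_snapshot_identifier_prefix = "${var.environment}-${var.project}-${var.database_name}-manual"
}

output "manual_snapshot_identifier_prefix" {
  description = "Prefix shared by repo-owned manual cluster snapshots; cleanup paths delete only snapshots whose identifiers start with this value."
  value       = local.manual_snapshot_identifier_prefix
}
```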
63 changes: 63 additions & 0 deletions infra/modules/aws/_shared/database/data.tf
@@ -8,3 +8,66 @@ data "aws_subnet" "selected" {
  for_each = toset(var.subnet_ids)
  id       = each.value
}

data "aws_iam_policy_document" "restore_drill_sfn_assume" {
statement {
effect = "Allow"

principals {
type = "Service"
identifiers = ["states.amazonaws.com"]
}

actions = ["sts:AssumeRole"]
}
}

data "aws_iam_policy_document" "restore_drill_sfn" {
statement {
sid = "RestoreAndDescribeRds"
effect = "Allow"
actions = [
"rds:CreateDBInstance",
"rds:DeleteDBCluster",
"rds:DeleteDBInstance",
"rds:DescribeDBClusters",
"rds:DescribeDBInstances",
"rds:RestoreDBClusterToPointInTime",
]
resources = ["*"]
}

statement {
sid = "CreateAndDescribeManualSnapshots"
effect = "Allow"
actions = [
"rds:CreateDBClusterSnapshot",
"rds:DescribeDBClusterSnapshots",
]
resources = ["*"]
}
}

data "aws_iam_policy_document" "restore_drill_scheduler_assume" {
statement {
effect = "Allow"

principals {
type = "Service"
identifiers = ["scheduler.amazonaws.com"]
}

actions = ["sts:AssumeRole"]
}
}

data "aws_iam_policy_document" "restore_drill_scheduler" {
statement {
sid = "StartRestoreDrillExecution"
effect = "Allow"
actions = [
"states:StartExecution",
]
resources = aws_sfn_state_machine.restore_drill[*].arn
}
}