3 changes: 2 additions & 1 deletion .github/docs/README.md
@@ -128,7 +128,7 @@ flowchart LR
### Cleanup And Discovery

- `destroy.yml`
Tears down app layers before shared dependencies, including the shared observability dashboard and any environment-owned shared artifact stacks such as the `dev` code bucket.
Tears down app layers before shared dependencies, including the shared observability dashboard and any environment-owned shared artifact stacks such as the `dev` code bucket. In the `database` job, `dev` now runs `tg_action: init` first to read Terraform outputs from the database stack, then passes `cluster_identifier` and `manual_snapshot_identifier_prefix` into `justfile.deploy` so the cleanup recipe deletes only repo-owned manual Aurora cluster snapshots before Terragrunt destroy. `prod` intentionally retains those manual snapshots.
- `shared_directories_get.yml`
Derives the directory-based matrices used by wrapper workflows and PR action-test discovery.

@@ -206,6 +206,7 @@ Run these checks on every CI, workflow, or deploy-contract change.
- confirm destroy ordering still removes downstream consumers before shared stacks
- check required Terraform variables on destroy as well as apply
- prefer depending on real downstream consumers rather than serializing unrelated shared stacks
- when a runtime or module creates manual backup artifacts outside Terraform resource ownership, decide explicitly whether destroy should delete or retain them by environment and keep that behavior documented in `destroy.yml` contracts

## Wrapper Workflow Summary

25 changes: 25 additions & 0 deletions .github/workflows/destroy.yml
@@ -206,6 +206,31 @@ jobs:
          role-to-assume: ${{ env.AWS_OIDC_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Get database infra outputs
        if: inputs.environment != 'prod'
        id: get-database
        uses: ./.github/actions/terragrunt
        env:
          TF_VAR_database_security_group_id: "destroy-placeholder"
        with:
          tg_directory: infra/live/${{ inputs.environment }}/aws/database
          tg_action: init

      - name: Delete dev manual database snapshots
        if: inputs.environment != 'prod'
        uses: ./.github/actions/just
        env:
          TG_OUTPUTS: ${{ steps.get-database.outputs.tg_outputs }}
          CLUSTER_IDENTIFIER: ${{ fromJson(steps.get-database.outputs.tg_outputs).cluster_identifier.value }}
          MANUAL_SNAPSHOT_PREFIX: ${{ fromJson(steps.get-database.outputs.tg_outputs).manual_snapshot_identifier_prefix.value }}
        with:
          justfile_path: justfile.deploy
          just_action: database-delete-manual-snapshots

      - name: Keep prod manual database snapshots
        if: inputs.environment == 'prod'
        run: echo "Retaining prod manual database snapshots."

      - name: Destroy database infra
        uses: ./.github/actions/terragrunt
        env:
4 changes: 4 additions & 0 deletions AGENTS.md
@@ -18,7 +18,11 @@ Update documentation in the same change:
- keep `.github/docs/README.md` as the source of truth for workflow contracts and CI feasibility checks
- prefer Mermaid diagrams in `.github/docs/README.md` that show jobs, `needs`, and reusable-workflow relationships rather than trying to reproduce the exact GitHub Actions UI
- when adding a new AWS infra type or service family, check whether the deploy role in `infra/live/global_vars.hcl` needs additional `allowed_role_actions` and update it in the same change if required
- when changing Terraform in a way that introduces any new AWS service surface area or API family, even inside an existing module, review `infra/live/global_vars.hcl` for required `allowed_role_actions` updates in the same change; do not limit this check only to obviously new top-level stack types
- before closing any infra change that adds AWS resources, IAM principals, or orchestration services, explicitly verify whether it introduced new permissions for deploy-time creation or mutation and update `infra/live/global_vars.hcl` if needed
- when changing the set of deployable Lambda or ECS runtimes, check whether the shared `observability` dashboard still reflects the current runtime surface and update it in the same change if needed
- when changing `infra/modules/aws/_shared/database/**` recovery behavior, restore-drill behavior, backup retention, reader defaults, or other resilience knobs, include a rough cost comparison in the final response that contrasts `dev`, `standard`, and `critical`; keep it qualitative unless current pricing was explicitly requested, and call out that the Aurora scratch restore's compute and storage dominate drill cost, not Step Functions or EventBridge Scheduler
- when changing a live database stack's `recovery_class`, include a short, conspicuous ANSI-colored warning block in the final response in the form "you have changed from X to Y", followed by a brief note on the likely cost direction, such as higher backup storage, more required readers, or more frequent restore drills; keep it short and awareness-focused rather than explanatory

### Documentation Architecture

1 change: 1 addition & 0 deletions README.md
@@ -170,6 +170,7 @@ see [infra/README.md](infra/README.md#infra-deployment-use-cases).
For Lambda provisioned concurrency patterns and example `provisioned_config` shapes, see [infra/modules/aws/_shared/lambda/README.md](infra/modules/aws/_shared/lambda/README.md).

For ECS scaling patterns and `scaling_strategy` examples, see [infra/modules/aws/_shared/service/README.md](infra/modules/aws/_shared/service/README.md).
For Aurora recovery posture presets such as `dev`, `standard`, and `critical`, plus the optional restore-drill Step Functions skeleton, see [infra/modules/aws/_shared/database/README.md](infra/modules/aws/_shared/database/README.md).

### Deployment Model

2 changes: 1 addition & 1 deletion infra/live/dev/aws/database/terragrunt.hcl
@@ -4,7 +4,7 @@ include "root" {

inputs = {
database_name = "app"
backup_retention_period = 1
recovery_class = "dev"
rds_min_capacity = 0.5
rds_max_capacity = 1.0
rds_max_reader_count = 0
2 changes: 2 additions & 0 deletions infra/live/global_vars.hcl
@@ -11,8 +11,10 @@ locals {
"application-autoscaling:*",
"cloudwatch:*",
"events:*",
"scheduler:*",
"sqs:*",
"sns:*",
"states:*",
"cloudfront:*",
"xray:*",
"ec2:*",
2 changes: 1 addition & 1 deletion infra/live/prod/aws/database/terragrunt.hcl
@@ -4,7 +4,7 @@ include "root" {

inputs = {
database_name = "app"
backup_retention_period = 7
recovery_class = "standard" # "critical" for production workloads, "standard" for non-production workloads
rds_min_capacity = 0.5
rds_max_capacity = 2.0
rds_max_reader_count = 1
117 changes: 116 additions & 1 deletion infra/modules/aws/_shared/database/README.md
@@ -29,7 +29,9 @@ Shared Aurora PostgreSQL Serverless v2 module.
- `publicly_accessible`
- `database_port`
- `engine_version`
- `backup_retention_period`
- `recovery_class`
- `restore_drill`
- `manual_snapshot`
- `rds_min_capacity`
- `rds_max_capacity`
- `rds_max_reader_count`
@@ -45,10 +45,123 @@ Shared Aurora PostgreSQL Serverless v2 module.
- `database_port`
- `readonly_endpoint`
- `readwrite_endpoint`
- `recovery_class`
- `restore_drill_cadence`
- `target_rpo_minutes`
- `target_rto_minutes`
- `restore_drill_enabled`
- `restore_drill_mode`
- `restore_drill_schedule_expression`
- `restore_drill_state_machine_arn`
- `restore_drill_state_machine_name`
- `manual_snapshot_enabled`
- `manual_snapshot_state_machine_arn`
- `manual_snapshot_state_machine_name`
- `manual_snapshot_identifier_prefix`

This module is intentionally Aurora PostgreSQL Serverless v2 specific. It does not currently support provisioned RDS instances or non-Postgres engines.
In this repo the concrete `database` wrapper resolves the VPC and public or private subnet ids, while the shared infra workflow injects `database_security_group_id` from the `security` stack via `TF_VAR_database_security_group_id`.
By default the module tracks the latest matching Aurora PostgreSQL 16.x engine version rather than pinning a specific patch release.
SSM parameter paths are rooted at `/<environment>/<project>/<database>/...` so they do not collide with AWS-reserved `/aws` prefixes.
The runtime contract for database credentials is the Aurora-managed master secret exposed from the cluster. Terraform reads the managed secret ARN directly from the cluster resource rather than doing a separate Secrets Manager lookup during the same apply, because AWS may not populate that managed-secret reference early enough for an immediate data read.
If you need new scale-out readers to inherit cluster tags, keep that automation in a separate stack such as `rds_reader_tagger` rather than pushing event-driven behavior into this shared database module.
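
A minimal sketch of that managed-secret pattern, assuming illustrative resource and variable names rather than the module's actual identifiers:

```hcl
# Illustrative only: resource, variable, and path names are assumptions.
resource "aws_ssm_parameter" "database_secret_arn" {
  name = "/${var.environment}/${var.project}/${var.database_name}/secret_arn"
  type = "String"

  # Read the Aurora-managed master secret ARN straight off the cluster
  # resource instead of a separate Secrets Manager data lookup in the same apply.
  value = aws_rds_cluster.this.master_user_secret[0].secret_arn
}
```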

## Recovery Classes

The shared module derives backup retention, deletion protection, final-snapshot behavior, minimum reader count, and recovery metadata from a single `recovery_class` input.

### `dev`

- 1 day of automated backup retention
- deletion protection disabled
- no final snapshot on destroy
- no required reader instances
- `restore_drill_cadence = "never"`

### `standard`

- 7 days of automated backup retention
- deletion protection enabled
- final snapshot required on destroy
- at least 1 reader instance when multiple subnet AZs are available
- `restore_drill_cadence = "monthly"`

### `critical`

- 35 days of automated backup retention
- deletion protection enabled
- final snapshot required on destroy
- at least 2 reader instances when enough subnet AZs are available
- `restore_drill_cadence = "weekly"`

The module publishes `RecoveryClass`, `RestoreDrillCadence`, `TargetRPOMinutes`, and `TargetRTOMinutes` as cluster tags so operators can see the intended recovery posture directly on the Aurora cluster.
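
As a rough sketch of how those class definitions could be encoded, assuming hypothetical local names rather than the module's actual internals:

```hcl
# Illustrative mapping only; local names are assumptions.
locals {
  recovery_class_settings = {
    dev = {
      backup_retention_period = 1
      deletion_protection     = false
      skip_final_snapshot     = true
      min_reader_count        = 0
      restore_drill_cadence   = "never"
    }
    standard = {
      backup_retention_period = 7
      deletion_protection     = true
      skip_final_snapshot     = false
      min_reader_count        = 1
      restore_drill_cadence   = "monthly"
    }
    critical = {
      backup_retention_period = 35
      deletion_protection     = true
      skip_final_snapshot     = false
      min_reader_count        = 2
      restore_drill_cadence   = "weekly"
    }
  }

  # Settings for the selected class; downstream resources read from this.
  recovery = local.recovery_class_settings[var.recovery_class]
}
```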

## Restore Drill

The shared module can also provision an opt-in restore-drill skeleton. When enabled, it creates:

- a Step Functions state machine for manual restore-drill execution
- an optional EventBridge Scheduler schedule when the mode includes scheduled runs
- the IAM roles needed for the scheduler to start the state machine and for Step Functions to call RDS APIs

Example:

```hcl
recovery_class = "standard"

restore_drill = {
  enabled      = true
  mode         = "manual_and_scheduled"
  use_pitr     = true
  retain_hours = 4
}
```

The schedule expression is derived from `recovery_class`:

- `dev`: no automatic schedule
- `standard`: `rate(30 days)`
- `critical`: `rate(7 days)`
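
When scheduled runs are enabled, the scheduler wiring looks roughly like the sketch below; the resource names and role reference are illustrative, not the module's actual ones:

```hcl
# Illustrative only; names and the fixed rate are assumptions.
resource "aws_scheduler_schedule" "restore_drill" {
  name                = "example-restore-drill"
  schedule_expression = "rate(30 days)" # derived from recovery_class in the module

  flexible_time_window {
    mode = "OFF"
  }

  target {
    arn      = aws_sfn_state_machine.restore_drill.arn
    role_arn = aws_iam_role.restore_drill_scheduler.arn
  }
}
```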

Rough cost guidance by recovery class:

- `dev`: lowest ongoing cost; 1-day automated backups, no final snapshot on destroy, no required reader instances, no scheduled drill by default
- `standard`: moderate cost increase; 7-day backups, final snapshot on destroy, at least 1 reader when multiple subnet AZs are available, monthly scheduled drill if enabled
- `critical`: highest ongoing cost; 35-day backups, final snapshot on destroy, at least 2 readers when enough subnet AZs are available, weekly scheduled drill if enabled

The largest drill-related cost is the temporary restored Aurora cluster and scratch writer instance. Step Functions and EventBridge Scheduler usually contribute negligible cost compared with Aurora compute and storage.

The current Step Functions skeleton:

1. restores a temporary Aurora cluster from PITR
2. waits for the scratch cluster to become available
3. creates one temporary writer instance
4. waits for the instance to become available
5. holds the restored environment for the configured retention window
6. deletes the temporary instance and cluster

This first version does not yet run application-level validation against the restored database. It proves restore orchestration and cleanup only. Add a dedicated validation Lambda or ECS task later once the restore path itself is stable.
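
As a rough illustration of that orchestration, a definition along the lines of the sketch below could drive the drill; the state names, identifiers, wait durations, and exact aws-sdk parameter casing are assumptions, and the real definition should poll RDS for availability rather than rely on fixed waits:

```hcl
# Illustrative skeleton only; identifiers and durations are assumptions.
resource "aws_sfn_state_machine" "restore_drill_example" {
  name     = "example-restore-drill"
  role_arn = aws_iam_role.restore_drill_sfn.arn

  definition = jsonencode({
    StartAt = "RestoreClusterFromPitr"
    States = {
      RestoreClusterFromPitr = {
        Type     = "Task"
        Resource = "arn:aws:states:::aws-sdk:rds:restoreDBClusterToPointInTime"
        Parameters = {
          DbClusterIdentifier       = "example-drill-scratch"
          SourceDbClusterIdentifier = "example-source-cluster"
          UseLatestRestorableTime   = true
        }
        Next = "WaitForCluster"
      }
      WaitForCluster = { Type = "Wait", Seconds = 900, Next = "CreateScratchWriter" }
      CreateScratchWriter = {
        Type     = "Task"
        Resource = "arn:aws:states:::aws-sdk:rds:createDBInstance"
        Parameters = {
          DbInstanceIdentifier = "example-drill-scratch-writer"
          DbClusterIdentifier  = "example-drill-scratch"
          DbInstanceClass      = "db.serverless"
          Engine               = "aurora-postgresql"
        }
        Next = "HoldForRetentionWindow"
      }
      HoldForRetentionWindow = { Type = "Wait", Seconds = 14400, Next = "DeleteScratchWriter" }
      DeleteScratchWriter = {
        Type       = "Task"
        Resource   = "arn:aws:states:::aws-sdk:rds:deleteDBInstance"
        Parameters = { DbInstanceIdentifier = "example-drill-scratch-writer" }
        Next       = "DeleteScratchCluster"
      }
      DeleteScratchCluster = {
        Type     = "Task"
        Resource = "arn:aws:states:::aws-sdk:rds:deleteDBCluster"
        Parameters = {
          DbClusterIdentifier = "example-drill-scratch"
          SkipFinalSnapshot   = true
        }
        End = true
      }
    }
  })
}
```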

## Manual Snapshot

The shared module can also provision an opt-in manual snapshot trigger. This is separate from the restore drill:

- `manual_snapshot` creates a named Aurora cluster snapshot on demand
- `restore_drill` restores a temporary cluster and validates the recovery path

Example:

```hcl
manual_snapshot = {
  enabled = true
}
```

When enabled, the module creates a second Step Functions state machine that:

1. builds a unique snapshot identifier
2. creates a manual Aurora cluster snapshot
3. waits until the snapshot reaches `available`

Use the `manual_snapshot_state_machine_arn` or `manual_snapshot_state_machine_name` output to start it manually from the Step Functions console or CLI.
The module also exposes `manual_snapshot_identifier_prefix` so destroy or cleanup paths can delete only the repo-owned manual snapshots without re-deriving the naming contract outside Terraform.
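
For orientation, the prefix contract could look roughly like the sketch below; the derivation shown is an assumption, not the module's exact naming scheme:

```hcl
# Illustrative only; the real prefix derivation lives inside the module.
locals {
  manual_snapshot_identifier_prefix = "${var.environment}-${var.project}-${var.database_name}-manual"
}

output "manual_snapshot_identifier_prefix" {
  description = "Prefix shared by repo-owned manual cluster snapshots; cleanup paths delete only snapshots whose identifiers start with this value."
  value       = local.manual_snapshot_identifier_prefix
}
```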
63 changes: 63 additions & 0 deletions infra/modules/aws/_shared/database/data.tf
@@ -8,3 +8,66 @@ data "aws_subnet" "selected" {
  for_each = toset(var.subnet_ids)
  id       = each.value
}

data "aws_iam_policy_document" "restore_drill_sfn_assume" {
statement {
effect = "Allow"

principals {
type = "Service"
identifiers = ["states.amazonaws.com"]
}

actions = ["sts:AssumeRole"]
}
}

data "aws_iam_policy_document" "restore_drill_sfn" {
statement {
sid = "RestoreAndDescribeRds"
effect = "Allow"
actions = [
"rds:CreateDBInstance",
"rds:DeleteDBCluster",
"rds:DeleteDBInstance",
"rds:DescribeDBClusters",
"rds:DescribeDBInstances",
"rds:RestoreDBClusterToPointInTime",
]
resources = ["*"]
}

statement {
sid = "CreateAndDescribeManualSnapshots"
effect = "Allow"
actions = [
"rds:CreateDBClusterSnapshot",
"rds:DescribeDBClusterSnapshots",
]
resources = ["*"]
}
}

data "aws_iam_policy_document" "restore_drill_scheduler_assume" {
statement {
effect = "Allow"

principals {
type = "Service"
identifiers = ["scheduler.amazonaws.com"]
}

actions = ["sts:AssumeRole"]
}
}

data "aws_iam_policy_document" "restore_drill_scheduler" {
statement {
sid = "StartRestoreDrillExecution"
effect = "Allow"
actions = [
"states:StartExecution",
]
resources = aws_sfn_state_machine.restore_drill[*].arn
}
}