Design Spec: gcp-dms-migration-health
Parent: #85
Target: rw-cli-codecollection
Spec
# CodeBundle Design Spec — GCP Database Migration Service (DMS) health
# Parent intake: Inspect DMS Tasks; report on DMS latency; healthy = jobs running, not stuck, timely execution.
codebundle_name: "gcp-dms-migration-health"
target_collection: "rw-cli-codecollection"
display_name: "GCP Database Migration Service (DMS) Health"
author: "rw-codebundle-agent"
purpose: |
Monitors Google Cloud Database Migration Service (DMS) migration jobs for failed or stuck
states, surfaces recent operation failures, and reports replication latency (CDC lag) using
Cloud Monitoring metrics so operators can confirm migrations are progressing and cutover-ready.
tasks:
- name: "List DMS Migration Jobs and Flag Unhealthy States for `${GCP_PROJECT_ID}`"
description: |
Lists migration jobs in the configured region(s) via `gcloud database-migration migration-jobs list`
(and describe as needed). Raises issues for jobs in terminal failure states, jobs stuck in
transitional states longer than expected, or jobs not in RUNNING/CDC when continuous replication
is required. Adds a summary table to the report.
script_name: "list-migration-jobs.sh"
expected_issue_severity: [2, 4]
access_level: "read-only"
data_type: "metrics"
- name: "List Recent DMS Operations and Flag Failures for `${GCP_PROJECT_ID}`"
description: |
Lists recent DMS operations (`gcloud database-migration operations list`) filtered by project/location.
Surfaces failed or cancelled operations and long-running operations that may indicate stuck work.
Correlates operation metadata with migration job names where labels allow.
script_name: "list-dms-operations.sh"
expected_issue_severity: [2, 3]
access_level: "read-only"
data_type: "metrics"
- name: "Report DMS Replication Lag from Cloud Monitoring for `${GCP_PROJECT_ID}`"
description: |
For migration jobs in CDC / continuous replication, reads Cloud Monitoring metrics for
`migration_job/max_replica_sec_lag` and optionally `migration_job/max_replica_bytes_lag`
(resource type `datamigration.googleapis.com/MigrationJob`). Compares against configurable
thresholds and raises issues when lag indicates the destination is materially behind the source.
Documents the metric delay in README/notes (per Google docs, samples can appear up to ~180s after observation).
script_name: "fetch-dms-replication-lag-metrics.sh"
expected_issue_severity: [2, 4]
access_level: "read-only"
data_type: "metrics"
- name: "Summarize DMS Migration Job Details for Flagged Jobs in `${GCP_PROJECT_ID}`"
description: |
For jobs flagged by prior tasks (or for the jobs named in `DMS_JOB_NAMES` when it lists specific job IDs), runs
`gcloud database-migration migration-jobs describe` to capture phase, errors, and static
configuration snippets for the report and for higher-severity issues when error details exist.
script_name: "describe-migration-jobs.sh"
expected_issue_severity: [3, 4]
access_level: "read-only"
data_type: "metrics"
- name: "Optional Error Log Correlation for DMS in `${GCP_PROJECT_ID}`"
description: |
When issues are present, queries Cloud Logging for relevant DMS / datamigration log entries
in a bounded time window to aid triage. Read-only; skipped (no-op) when no unhealthy jobs are found.
script_name: "fetch-dms-error-logs.sh"
expected_issue_severity: [2, 3]
access_level: "read-only"
data_type: "logs-regexp"
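The job-state checks described in the first task can be sketched as follows. This is a minimal illustration in Python rather than the bundle's eventual shell scripts, and it rests on several assumptions: the state names (`FAILED`, `STARTING`, and so on), the `state`, `phase`, and `updateTime` field names, the 30-minute stuck window, and the severity mapping are all hypothetical choices that should be checked against the DMS API and the collection's conventions.

```python
from datetime import datetime, timedelta, timezone

# Assumed state names; verify against the DMS MigrationJob state enum.
FAILED_STATES = {"FAILED"}
TRANSITIONAL_STATES = {
    "CREATING", "STARTING", "RESTARTING", "RESUMING",
    "STOPPING", "UPDATING", "DELETING",
}


def parse_utc(ts: str) -> datetime:
    """Parse a UTC timestamp such as 2024-01-01T11:00:00.123Z (fraction dropped)."""
    base = ts.rstrip("Z").split(".")[0]
    return datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


def classify_job(job, now, stuck_after=timedelta(minutes=30), require_cdc=True):
    """Return (severity, reason) for an unhealthy job, or None when it looks healthy.

    `job` is a dict shaped like one element of the JSON job listing
    (field names are assumptions). Severities 2 and 4 are illustrative.
    """
    state = job.get("state", "STATE_UNSPECIFIED")
    if state in FAILED_STATES:
        return (2, "job is in FAILED state")
    if state in TRANSITIONAL_STATES:
        # A transitional state is only a problem once it has lasted too long.
        age = now - parse_utc(job["updateTime"])
        if age > stuck_after:
            minutes = int(age.total_seconds() // 60)
            return (4, f"job has been {state} for {minutes} minutes")
        return None
    if require_cdc and not (state == "RUNNING" and job.get("phase") == "CDC"):
        phase = job.get("phase", "PHASE_UNSPECIFIED")
        return (4, f"job is {state}/{phase}, expected RUNNING/CDC")
    return None
```

With `require_cdc=False` the last check is skipped, which would suit one-time migrations that never enter continuous replication.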
scope:
level: "Project"
qualifiers:
- GCP_PROJECT_ID
- GCP_DMS_LOCATION
iteration_pattern: |
One SLX per project and DMS region (location) pair. User supplies `GCP_DMS_LOCATION` (e.g. us-central1)
where migration jobs run; optional future enhancement: iterate all `gcloud database-migration locations list`
locations when set to `All` (document cost/latency tradeoff).
resource_types:
- "gcp_dms_migration_jobs"
generation_strategy: |
Match discovered DMS-capable projects; qualify by project_id and dms_location. Resource match pattern
should align with RunWhen Local discovery for Database Migration Service migration jobs once
resource types are registered; until then, generation may use project + location qualifiers only.
env_vars:
- name: GCP_PROJECT_ID
description: "GCP project ID that owns the DMS migration jobs"
required: true
- name: GCP_DMS_LOCATION
description: "DMS regional location (e.g. us-central1) passed to gcloud via --region"
required: true
- name: DMS_JOB_NAMES
description: "Comma-separated migration job IDs, or 'All' to evaluate every job in the location"
required: false
default: "All"
- name: REPLICATION_LAG_SEC_THRESHOLD
description: "Alert when migration_job/max_replica_sec_lag exceeds this many seconds (CDC phase)"
required: false
default: "300"
- name: REPLICATION_LAG_BYTES_THRESHOLD
description: "Optional alert when migration_job/max_replica_bytes_lag exceeds this many bytes; 0 disables the bytes check"
required: false
default: "0"
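How the two lag thresholds might be applied can be sketched as below. This is an illustrative Python sketch, not the bundle's implementation. It assumes that a bytes threshold of 0 means the bytes check is disabled, that lag is only evaluated while the job phase is `CDC`, and that a missing sample (`None`) raises no finding. The function names `load_thresholds` and `lag_findings` are hypothetical.

```python
import os


def load_thresholds(env=None):
    """Read both thresholds, falling back to the defaults listed in env_vars."""
    env = os.environ if env is None else env
    sec = int(env.get("REPLICATION_LAG_SEC_THRESHOLD", "300"))
    byt = int(env.get("REPLICATION_LAG_BYTES_THRESHOLD", "0"))
    return sec, byt


def lag_findings(job_id, phase, sec_lag, bytes_lag, sec_threshold, bytes_threshold):
    """Return human-readable findings for one job's latest lag samples."""
    findings = []
    # Lag is only meaningful during CDC; skip full-dump and other phases.
    if phase != "CDC":
        return findings
    if sec_lag is not None and sec_lag > sec_threshold:
        findings.append(f"{job_id}: replication lag {sec_lag}s exceeds {sec_threshold}s")
    # Assumption: a bytes threshold of 0 disables the bytes check.
    if bytes_threshold > 0 and bytes_lag is not None and bytes_lag > bytes_threshold:
        findings.append(
            f"{job_id}: replication lag {bytes_lag} bytes exceeds {bytes_threshold} bytes"
        )
    return findings
```

Keeping the threshold parsing separate from the comparison makes it easier to test the comparison with fixed sample values.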
secrets:
- name: gcp_credentials
description: "GCP service account JSON with read-only access to DMS and Monitoring"
format: |
Standard GCP service account key JSON; requires roles such as datamigration.viewer,
monitoring.viewer, and logging.viewer as appropriate for list/describe and metric reads.
platform:
name: "gcp"
cli_tools:
- "gcloud database-migration"
- "gcloud monitoring"
- "gcloud logging"
- "jq"
auth_methods:
- "Service account (gcp_credentials / GOOGLE_APPLICATION_CREDENTIALS)"
api_docs: |
https://cloud.google.com/sdk/gcloud/reference/database-migration/
https://cloud.google.com/database-migration/docs/postgres/migration-job-metrics
related_bundles:
- name: "gcp-cloud-function-health"
relationship: "complements"
notes: "Existing GCP project-scoped health pattern (gcloud + RW.CLI); reuse structure and SLI style."
- name: "gcp-project-cost-health"
relationship: "complements"
notes: "Cost reporting for GCP; no operational DMS migration or replication lag coverage."
test_scenarios:
- name: "healthy_cdc_migration"
description: "Migration job in RUNNING/CDC with lag under thresholds and no failed operations"
expected_issues: 0
- name: "failed_migration_job"
description: "At least one job in FAILED state or operation list shows FAILED"
expected_issues: 2
expected_severities: [3, 3]
- name: "high_replication_lag"
description: "CDC job with max_replica_sec_lag above REPLICATION_LAG_SEC_THRESHOLD"
expected_issues: 1
expected_severities: [3]
notes: |
- “DMS Tasks” in the intake is interpreted as DMS migration jobs and their asynchronous operations
(operations API), not Kubernetes Pods.
- Replication lag metrics are only meaningful during CDC; initial dump phases should be excluded
from lag alerting via job phase or metric availability checks.
- SQL Server–specific DMS metrics (e.g. transaction_log_upload_sec_lag) may be added in a follow-up
task if the product needs heterogeneous engine coverage in the same bundle.
- SLI should aggregate a small set of signals (e.g. count of unhealthy jobs + jobs over lag threshold)
consistent with other GCP health bundles in this collection.
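One possible shape for the SLI aggregation mentioned in the notes is sketched below in Python. The scoring rule is an assumption: each signal contributes 1 when its count is zero and 0 otherwise, and the SLI is the mean of the signals. The function name `sli_score` is hypothetical, and the actual aggregation should follow whatever the other GCP health bundles in the collection do.

```python
def sli_score(unhealthy_jobs: int, jobs_over_lag: int) -> float:
    """Combine two binary health signals into a score between 0.0 and 1.0."""
    signals = [
        unhealthy_jobs == 0,  # no failed or stuck migration jobs
        jobs_over_lag == 0,   # no CDC jobs above the lag threshold
    ]
    return sum(1 for ok in signals if ok) / len(signals)
```

Under this rule a project with two unhealthy jobs and no lag breaches scores 0.5, and only a project with no findings in either signal scores 1.0.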