
[design-spec] gcp-dms-migration-health #86

@rw-codebundle-agent

Design Spec: gcp-dms-migration-health

Parent: #85
Target: rw-cli-codecollection

Spec

# CodeBundle Design Spec — GCP Database Migration Service (DMS) health
# Parent intake: Inspect DMS Tasks; report on DMS latency; healthy = jobs running, not stuck, timely execution.

codebundle_name: "gcp-dms-migration-health"
target_collection: "rw-cli-codecollection"
display_name: "GCP Database Migration Service (DMS) Health"
author: "rw-codebundle-agent"

purpose: |
  Monitors Google Cloud Database Migration Service (DMS) migration jobs for failed or stuck
  states, surfaces recent operation failures, and reports replication latency (CDC lag) using
  Cloud Monitoring metrics so operators can confirm migrations are progressing and cutover-ready.

tasks:
  - name: "List DMS Migration Jobs and Flag Unhealthy States for `${GCP_PROJECT_ID}`"
    description: |
      Lists migration jobs in the configured region(s) via `gcloud database-migration migration-jobs list`
      (and describe as needed). Raises issues for jobs in terminal failure states, jobs stuck in
      transitional states longer than expected, or jobs not in RUNNING/CDC when continuous replication
      is required. Adds a summary table to the report.
    script_name: "list-migration-jobs.sh"
    expected_issue_severity: [2, 4]
    access_level: "read-only"
    data_type: "metrics"

  - name: "List Recent DMS Operations and Flag Failures for `${GCP_PROJECT_ID}`"
    description: |
      Lists recent DMS operations (`gcloud database-migration operations list`) filtered by project/location.
      Surfaces failed or cancelled operations and long-running operations that may indicate stuck work.
      Correlates operation metadata with migration job names where labels allow.
    script_name: "list-dms-operations.sh"
    expected_issue_severity: [2, 3]
    access_level: "read-only"
    data_type: "metrics"

  - name: "Report DMS Replication Lag from Cloud Monitoring for `${GCP_PROJECT_ID}`"
    description: |
      For migration jobs in CDC / continuous replication, reads Cloud Monitoring metrics for
      `migration_job/max_replica_sec_lag` and optionally `migration_job/max_replica_bytes_lag`
      (resource type `datamigration.googleapis.com/MigrationJob`). Compares against configurable
      thresholds and raises issues when lag indicates the destination is materially behind the source.
      Documents the metric reporting delay (samples can arrive up to ~180s after observation, per Google docs) in README/notes.
    script_name: "fetch-dms-replication-lag-metrics.sh"
    expected_issue_severity: [2, 4]
    access_level: "read-only"
    data_type: "metrics"

  - name: "Summarize DMS Migration Job Details for Flagged Jobs in `${GCP_PROJECT_ID}`"
    description: |
      For jobs flagged by prior tasks (or the jobs selected by `DMS_JOB_NAMES` when it is set explicitly), runs
      `gcloud database-migration migration-jobs describe` to capture phase, errors, and static
      configuration snippets for the report and for higher-severity issues when error details exist.
    script_name: "describe-migration-jobs.sh"
    expected_issue_severity: [3, 4]
    access_level: "read-only"
    data_type: "metrics"

  - name: "Optional Error Log Correlation for DMS in `${GCP_PROJECT_ID}`"
    description: |
      When issues are present, queries Cloud Logging for relevant DMS / datamigration log entries
      in a bounded time window to aid triage. Read-only; skipped (no-op) when no unhealthy jobs are found.
    script_name: "fetch-dms-error-logs.sh"
    expected_issue_severity: [2, 3]
    access_level: "read-only"
    data_type: "logs-regexp"

scope:
  level: "Project"
  qualifiers:
    - GCP_PROJECT_ID
    - GCP_DMS_LOCATION
  iteration_pattern: |
    One SLX per project and DMS region (location) pair. User supplies `GCP_DMS_LOCATION` (e.g. us-central1)
    where migration jobs run; optional future enhancement: iterate all `gcloud database-migration locations list`
    locations when set to `All` (document cost/latency tradeoff).

resource_types:
  - "gcp_dms_migration_jobs"
generation_strategy: |
  Match discovered DMS-capable projects; qualify by project_id and dms_location. Resource match pattern
  should align with RunWhen Local discovery for Database Migration Service migration jobs once
  resource types are registered; until then, generation may use project + location qualifiers only.

env_vars:
  - name: GCP_PROJECT_ID
    description: "GCP project ID that owns the DMS migration jobs"
    required: true

  - name: GCP_DMS_LOCATION
    description: "DMS regional location (e.g. us-central1) passed to gcloud --location"
    required: true

  - name: DMS_JOB_NAMES
    description: "Comma-separated migration job IDs, or 'All' to evaluate every job in the location"
    required: false
    default: "All"

  - name: REPLICATION_LAG_SEC_THRESHOLD
    description: "Alert when migration_job/max_replica_sec_lag exceeds this many seconds (CDC phase)"
    required: false
    default: "300"

  - name: REPLICATION_LAG_BYTES_THRESHOLD
    description: "Optional: alert when migration_job/max_replica_bytes_lag exceeds this many bytes; 0 disables the bytes check"
    required: false
    default: "0"

secrets:
  - name: gcp_credentials
    description: "GCP service account JSON with read-only access to DMS and Monitoring"
    format: |
      Standard GCP service account key JSON; requires roles such as datamigration.viewer,
      monitoring.viewer, and logging.viewer as appropriate for list/describe and metric reads.

platform:
  name: "gcp"
  cli_tools:
    - "gcloud database-migration"
    - "gcloud monitoring"
    - "gcloud logging"
    - "jq"
  auth_methods:
    - "Service account (gcp_credentials / GOOGLE_APPLICATION_CREDENTIALS)"
  api_docs: |
    https://cloud.google.com/sdk/gcloud/reference/database-migration/
    https://cloud.google.com/database-migration/docs/postgres/migration-job-metrics

related_bundles:
  - name: "gcp-cloud-function-health"
    relationship: "complements"
    notes: "Existing GCP project-scoped health pattern (gcloud + RW.CLI); reuse structure and SLI style."
  - name: "gcp-project-cost-health"
    relationship: "complements"
    notes: "Cost reporting for GCP; no operational DMS migration or replication lag coverage."

test_scenarios:
  - name: "healthy_cdc_migration"
    description: "Migration job in RUNNING/CDC with lag under thresholds and no failed operations"
    expected_issues: 0

  - name: "failed_migration_job"
    description: "At least one job in FAILED state and the operations list shows a FAILED operation"
    expected_issues: 2
    expected_severities: [3, 3]

  - name: "high_replication_lag"
    description: "CDC job with max_replica_sec_lag above REPLICATION_LAG_SEC_THRESHOLD"
    expected_issues: 1
    expected_severities: [3]

notes: |
  - “DMS Tasks” in the intake is interpreted as DMS migration jobs and their asynchronous operations
    (operations API), not Kubernetes Pods.
  - Replication lag metrics are only meaningful during CDC; initial dump phases should be excluded
    from lag alerting via job phase or metric availability checks.
  - SQL Server–specific DMS metrics (e.g. transaction_log_upload_sec_lag) may be added in a follow-up
    task if product needs heterogeneous engine coverage in the same bundle.
  - SLI should aggregate a small set of signals (e.g. count of unhealthy jobs + jobs over lag threshold)
    consistent with other GCP health bundles in this collection.
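
The job-state and lag rules described in the tasks and notes above could be sketched roughly as follows. This is a minimal illustration in Python, not the bundle's actual scripts (which the spec names as shell). The state and phase names (`FAILED`, `RUNNING`, `CDC`, `STARTING`, and so on) and the `name`, `state`, `phase`, and `updateTime` fields are assumed to match the JSON that `gcloud database-migration migration-jobs list --format=json` returns and should be verified against the API. The `classify_job` helper, the 30-minute `stuck_after` window, and the severity numbers are hypothetical choices made for the example; the spec does not define a stuck-state threshold.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed MigrationJob state names; verify against the DMS API before relying on them.
FAILED_STATES = {"FAILED"}
TRANSITIONAL_STATES = {
    "CREATING", "STARTING", "STOPPING", "RESTARTING",
    "RESUMING", "UPDATING", "DELETING",
}


def parse_time(value):
    """Parse an RFC 3339 timestamp, dropping fractional seconds for portability."""
    trimmed = re.sub(r"\.\d+", "", value).replace("Z", "+00:00")
    return datetime.fromisoformat(trimmed)


def classify_job(job, now, lag_sec=None, lag_threshold=300,
                 stuck_after=timedelta(minutes=30)):
    """Return a list of (severity, message) issues for one migration job dict.

    lag_sec is the latest max_replica_sec_lag sample for the job, or None
    when no sample is available. Severities here are illustrative only.
    """
    issues = []
    job_id = job.get("name", "unknown").rsplit("/", 1)[-1]
    state = job.get("state", "STATE_UNSPECIFIED")
    phase = job.get("phase", "PHASE_UNSPECIFIED")

    if state in FAILED_STATES:
        issues.append((3, f"{job_id}: job is in state {state}"))
    elif state in TRANSITIONAL_STATES and "updateTime" in job:
        # Treat a transitional state as stuck once it outlives the window.
        age = now - parse_time(job["updateTime"])
        if age > stuck_after:
            issues.append((4, f"{job_id}: in {state} for {age}"))

    # Lag is only meaningful during CDC, so dump phases are skipped.
    if (state == "RUNNING" and phase == "CDC"
            and lag_sec is not None and lag_sec > lag_threshold):
        issues.append(
            (3, f"{job_id}: replication lag {lag_sec}s exceeds {lag_threshold}s")
        )
    return issues
```

A caller would load the job list JSON, look up the most recent lag sample per job, and pass `datetime.now(timezone.utc)` as `now`; the SLI mentioned in the notes could then be derived from the number of jobs that return a non-empty issue list.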

Labels

- completed: Agent work completed
- design-spec: Architect has produced a design spec
- new-codebundle: Scoped issue for SRE to implement a new CodeBundle
