[design-spec] mongodb-atlas-cluster-health #110

@rw-codebundle-agent

Design Spec: mongodb-atlas-cluster-health

Parent: #107
Target: rw-cli-codecollection

Spec

codebundle_name: "mongodb-atlas-cluster-health"
target_collection: "rw-cli-codecollection"
display_name: "MongoDB Atlas Cluster Health"
author: "rw-codebundle-agent"

purpose: |
  Read-only health checks for MongoDB Atlas clusters within a project: inventory,
  deployment/replica availability signals, and key workload metrics (connections,
  replication lag, resource pressure) via the Atlas Admin API. Helps operators
  catch instability before customer-facing incidents.

tasks:
  - name: "Gather MongoDB Atlas Cluster Inventory for Project `${ATLAS_PROJECT_ID}`"
    description: "Lists clusters and captures edition, MongoDB version, region(s), instance sizes, and paused/maintenance flags for baseline triage context."
    script_name: "gather-atlas-cluster-inventory.sh"
    expected_issue_severity: [1, 2]
    access_level: "read-only"
    data_type: "config"

  - name: "Check MongoDB Atlas Cluster State for Project `${ATLAS_PROJECT_ID}`"
    description: "Evaluates cluster/replica set operational state from Atlas APIs (e.g. transitional states, unhealthy members, upgrade operations) and raises issues when availability is degraded."
    script_name: "check-atlas-cluster-state.sh"
    expected_issue_severity: [3, 4]
    access_level: "read-only"
    data_type: "config"

  - name: "Analyze MongoDB Atlas Cluster Metrics for Project `${ATLAS_PROJECT_ID}`"
    description: "Pulls recent Atlas process/hardware measurements (connections, CPU/disk pressure, replication lag where exposed) and compares them against configurable thresholds."
    script_name: "analyze-atlas-cluster-metrics.sh"
    expected_issue_severity: [2, 4]
    access_level: "read-only"
    data_type: "metrics"

scope:
  level: "Project"
  qualifiers:
    - ATLAS_ORG_ID
    - ATLAS_PROJECT_ID
  iteration_pattern: |
    One SLX per Atlas project (ATLAS_PROJECT_ID). Optional CLUSTER_FILTER env var
    narrows checks to named cluster(s); default evaluates all clusters in the project.

resource_types:
  - "mongodb_atlas_cluster"
generation_strategy: |
  Match discovered mongodb_atlas_cluster resources in RunWhen; the qualifier chain includes
  organization/project/cluster identifiers. Alternatively, scope the SLX to project-level
  discovery when cluster-level assets are not indexed; tasks then iterate clusters via the Atlas API.

env_vars:
  - name: ATLAS_PROJECT_ID
    description: "MongoDB Atlas project ID (24-char hex)"
    required: true

  - name: ATLAS_ORG_ID
    description: "MongoDB Atlas organization ID for audit context and multi-project setups"
    required: false

  - name: CLUSTER_FILTER
    description: "Comma-separated cluster names to limit scope; empty means all clusters in project"
    required: false
    default: ""

  - name: CONNECTION_THRESHOLD
    description: "Issue when normalized connection utilization exceeds this percentage"
    required: false
    default: "85"

  - name: DISK_UTIL_THRESHOLD
    description: "Issue when disk utilization exceeds this percentage"
    required: false
    default: "85"

  - name: REPLICATION_LAG_MS_THRESHOLD
    description: "Issue when replica lag exceeds this many milliseconds (where measurable)"
    required: false
    default: "5000"

secrets:
  - name: atlas_api_key_credentials
    description: "MongoDB Atlas programmatic API key pair"
    format: |
      JSON or env mapping with ATLAS_PUBLIC_API_KEY and ATLAS_PRIVATE_API_KEY (Atlas Admin API)

platform:
  name: "mongodb_atlas"
  cli_tools:
    - "curl"
    - "jq"
    - "mongocli"
  auth_methods:
    - "Atlas Admin API key digest auth (public + private key)"
  api_docs: "https://www.mongodb.com/docs/atlas/reference/api-resources-spec/v2/"

related_bundles:
  - name: "mongodb-health-gcp-promql"
    relationship: "complements"
    notes: "GCP Prometheus/GMP bundle observes MongoDB workloads on GKE; this bundle targets Atlas-hosted clusters via Atlas APIs."
  - name: "mongodb-atlas-operations-health"
    relationship: "complements"
    notes: "Companion bundle covers alerts, backups, and network access while this bundle focuses on runtime cluster health."

test_scenarios:
  - name: "healthy_project"
    description: "All clusters ACTIVE, metrics within thresholds"
    expected_issues: 0

  - name: "degraded_replica"
    description: "Atlas reports elevated replication lag or transitional cluster state"
    expected_issues: 2
    expected_severities: [3, 4]

notes: |
  Prefer the Atlas Admin API v2 over scraping the UI. Respect Atlas rate limits; batch metric
  windows conservatively (e.g. last 15–60 minutes). Implementation should emit structured
  JSON issues compatible with RW.Core.Add Issue patterns used in sibling azure-* bundles.
  Before merge, run `python -m scorer.score <bundle_path>` from codebundle-farm (threshold 70).
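
The threshold checks described under env_vars and the metrics task could be sketched roughly as follows. This is a minimal illustration, not the bundle's implementation: the sample field names (`connection_pct`, `disk_pct`, `replication_lag_ms`), the issue keys (`severity`, `title`, `details`), and the default severity of 2 are all assumptions made for the example. Only the environment variable names and their defaults come from the spec.

```python
import json
import os

# Defaults mirror the env_vars section of the spec.
DEFAULTS = {
    "CONNECTION_THRESHOLD": 85.0,
    "DISK_UTIL_THRESHOLD": 85.0,
    "REPLICATION_LAG_MS_THRESHOLD": 5000.0,
}

# Hypothetical mapping: sample field -> (env var, human label, unit).
CHECKS = {
    "connection_pct": ("CONNECTION_THRESHOLD", "connection utilization", "%"),
    "disk_pct": ("DISK_UTIL_THRESHOLD", "disk utilization", "%"),
    "replication_lag_ms": ("REPLICATION_LAG_MS_THRESHOLD", "replication lag", "ms"),
}


def load_thresholds(env=None):
    """Read thresholds from the environment, falling back to the spec defaults."""
    env = os.environ if env is None else env
    thresholds = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name, "")
        thresholds[name] = float(raw) if raw.strip() else default
    return thresholds


def evaluate(cluster, sample, thresholds, severity=2):
    """Return one issue dict per metric that strictly exceeds its threshold."""
    issues = []
    for field, (env_name, label, unit) in CHECKS.items():
        value = sample.get(field)
        if value is None:  # metric not exposed for this cluster; skip it
            continue
        limit = thresholds[env_name]
        if value > limit:
            issues.append({
                "severity": severity,
                "title": f"High {label} on Atlas cluster `{cluster}`",
                "details": f"{label} is {value}{unit}, above {env_name}={limit}{unit}",
            })
    return issues


if __name__ == "__main__":
    sample = {"connection_pct": 91.0, "disk_pct": 40.0, "replication_lag_ms": None}
    print(json.dumps(evaluate("demo-cluster", sample, load_thresholds({})), indent=2))
```

Skipping metrics whose value is missing follows the spec's "where exposed" and "where measurable" wording: a metric Atlas does not report for a cluster produces no issue rather than a false alarm.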

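The optional CLUSTER_FILTER behaviour from the scope section (comma-separated names, empty meaning all clusters) might be handled along these lines. The helper names `parse_cluster_filter` and `select_clusters` are hypothetical, and the sketch assumes each cluster is a dict with a `name` key, as in a parsed cluster listing.

```python
def parse_cluster_filter(raw):
    """Return the set of requested cluster names, or None when no filter is set."""
    names = {part.strip() for part in (raw or "").split(",") if part.strip()}
    return names or None


def select_clusters(clusters, raw_filter):
    """Apply CLUSTER_FILTER; return (selected clusters, requested-but-missing names)."""
    wanted = parse_cluster_filter(raw_filter)
    if wanted is None:
        # Empty filter: evaluate every cluster in the project.
        return list(clusters), []
    selected = [c for c in clusters if c.get("name") in wanted]
    found = {c["name"] for c in selected}
    missing = sorted(wanted - found)
    return selected, missing
```

Returning the names that matched nothing lets a task report a likely typo in CLUSTER_FILTER instead of silently checking zero clusters.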
Metadata

Labels

completed (Agent work completed), design-spec (Architect has produced a design spec), new-codebundle (Scoped issue for SRE to implement a new CodeBundle)