[design-spec] azure-cosmosdb-health #103

@rw-codebundle-agent

Design Spec: azure-cosmosdb-health

Parent: #102
Target: rw-cli-codecollection

Spec

# --- Identity ---
codebundle_name: "azure-cosmosdb-health"
target_collection: "rw-cli-codecollection"
display_name: "Azure Cosmos DB Health"
author: "rw-codebundle-agent"

# --- Purpose ---
purpose: |
  Monitors Azure Cosmos DB accounts for platform-reported health, account configuration
  (security, resiliency, throughput model), and utilization via Azure Monitor metrics
  including request units, throttling, latency, and storage—supporting right-sizing and
  configuration review. Cosmos DB does not expose classic VM-style CPU/memory; operator-facing
  signals are normalized RU consumption, 429s/throttling, server-side latency, and data/index
  storage against account limits, with lookback windows for trend context.

# --- Tasks ---
# Each task becomes one Robot Framework task + one bash script (3-12 total).
tasks:
  - name: "Check Cosmos DB Resource Health for Account in Resource Group"
    description: "Queries Azure Resource Health for Microsoft.DocumentDB/databaseAccounts in the target resource group; flags provider-reported availability incidents."
    script_name: "cosmosdb-resource-health.sh"
    expected_issue_severity: [2, 3]
    access_level: "read-only"
    data_type: "logs-config"

  - name: "Check Cosmos DB Account Network and Security Configuration"
    description: "Reviews public network access, IP/firewall rules, VNet integration when present, and whether private endpoint usage is expected per policy; does not change settings."
    script_name: "cosmosdb-network-config.sh"
    expected_issue_severity: [2, 3]
    access_level: "read-only"
    data_type: "logs-config"

  - name: "Check Cosmos DB Multi-Region and Failover Readiness"
    description: "Validates number of read regions, write region, and failover/availability zone posture against organizational HA expectations; surfaces single-region and misconfigured failover."
    script_name: "cosmosdb-regions-failover.sh"
    expected_issue_severity: [2, 3]
    access_level: "read-only"
    data_type: "config"

  - name: "Check Cosmos DB Backup and Point-in-Time Recovery Settings"
    description: "Reads backup/continuous backup policy and retention signals exposed by the management plane (API-version dependent); flags disabled or misaligned backup for tier."
    script_name: "cosmosdb-backup-policy.sh"
    expected_issue_severity: [2, 4]
    access_level: "read-only"
    data_type: "config"

  - name: "Analyze Historical RU Consumption and Throttling"
    description: "Pulls Azure Monitor time series (e.g. TotalRequestUnits, NormalizedRUConsumption, and TotalRequests filtered to StatusCode 429 where available) over a configurable lookback; compares usage to the provisioned or autoscale maximum; flags sustained throttling and low headroom."
    script_name: "cosmosdb-ru-throttling-metrics.sh"
    expected_issue_severity: [3, 4]
    access_level: "read-only"
    data_type: "metrics"

  - name: "Analyze Data and Index Storage Utilization"
    description: "Aggregates data and index storage where metrics exist; surfaces growth trends against account or partition limits; supports sizing conversations."
    script_name: "cosmosdb-storage-utilization.sh"
    expected_issue_severity: [2, 3]
    access_level: "read-only"
    data_type: "metrics"

  - name: "Analyze Server-Side Latency and Error Rates"
    description: "Uses ServerSideLatency and request counts by status code (e.g. TotalRequests split by StatusCode, or equivalent) to spot latency regressions and a high error share across APIs; compares against historical windows to reduce false positives."
    script_name: "cosmosdb-latency-error-metrics.sh"
    expected_issue_severity: [3, 4]
    access_level: "read-only"
    data_type: "metrics"

  - name: "Review Throughput Provisioning and Autoscale Fit"
    description: "Distinguishes serverless vs provisioned vs autoscale; checks min/max RU and utilization ratio; provides coarse rightsizing signals (with operator confirmation; no automatic scaling)."
    script_name: "cosmosdb-throughput-provisioning.sh"
    expected_issue_severity: [2, 3]
    access_level: "read-only"
    data_type: "config"

# --- Scope ---
scope:
  level: "ResourceGroup"
  qualifiers:
    - AZURE_RESOURCE_SUBSCRIPTION_ID
    - AZURE_RESOURCE_GROUP
  iteration_pattern: |
    Default: one runbook run per Cosmos DB account discovered in the resource group
    (COSMOSDB_ACCOUNT_NAME=All), or only the named accounts when COSMOSDB_ACCOUNT_NAME
    is a single name or a comma-separated list. List accounts with
    `az cosmosdb list -g "$AZURE_RESOURCE_GROUP"` and run the checks per account.
    Loop variable: `COSMOSDB_ACCOUNT_NAME`.

# --- Resource Discovery ---
resource_types:
  - "azure_cosmos_cosmosdb"
generation_strategy: |
  One SLX per Cosmos DB account in the resource group (qualifiers: resource_group,
  subscription, account name). Resource match on Microsoft.DocumentDB/databaseAccounts
  in RunWhen Local; confirm the exact `resourceTypes` string against the platform
  registry for Azure Cosmos DB before shipping.

# --- Configuration ---
env_vars:
  - name: AZURE_RESOURCE_SUBSCRIPTION_ID
    description: "Azure subscription ID for API and metrics"
    required: true

  - name: AZURE_RESOURCE_GROUP
    description: "Resource group containing Cosmos DB account(s)"
    required: true

  - name: COSMOSDB_ACCOUNT_NAME
    description: "Single account name, comma-separated list, or All for discovery in the RG"
    required: false
    default: "All"

  - name: METRIC_LOOKBACK
    description: "Lookback for historical metrics (e.g. 7d, 30d) passed to `az monitor metrics` or equivalent"
    required: false
    default: "7d"

  - name: RU_UTILIZATION_WARN_PERCENT
    description: "Warn when max normalized RU consumption over lookback exceeds this percent of provisioned/ceiling (operator-tunable)"
    required: false
    default: "80"

  - name: THROTTLE_EVENTS_WARN
    description: "Minimum throttled/429 count (or rate) in lookback to raise an issue"
    required: false
    default: "1"

secrets:
  - name: platform_credentials
    description: "Azure service principal for read-only resource and metrics"
    format: |
      JSON object with: CLIENT_ID, TENANT_ID, CLIENT_SECRET, SUBSCRIPTION_ID
  - name: platform_pat
    description: "Optional alternative auth path if used by collection conventions"
    format: "Plain text PAT string"

# --- Platform Context ---
platform:
  name: "azure"
  cli_tools:
    - "az"
    - "az monitor"
    - "jq"
  auth_methods:
    - "Service Principal (azure_credentials / platform_credentials)"
  api_docs: "https://learn.microsoft.com/en-us/azure/cosmos-db/"

# --- Relationships ---
related_bundles:
  - name: "azure-c7n-codecollection / azure-db-health"
    relationship: "complements"
    notes: "Cloud Custodian SQL/Redis style checks; no Cosmos DB–specific RU, partition, or API metrics—this bundle is specialized for DocumentDB."
  - name: "azure-kv-health"
    relationship: "complements"
    notes: "Same read-only Azure health + metrics + config pattern; reuse structure and issue schema."
  - name: "azure-servicebus-health"
    relationship: "complements"
    notes: "Parallel Azure PaaS health bundle for metrics, resource health, and configuration tasks."

# --- Test Strategy ---
test_scenarios:
  - name: "healthy_cosmos_account"
    description: "Resource healthy, no throttling, RU headroom, backup enabled, multi-region as expected"
    expected_issues: 0

  - name: "undersized_provisioned_ru"
    description: "Sustained RU at ceiling with throttling in lookback"
    expected_issues: 2
    expected_severities: [3, 3]

# --- Notes ---
notes: |
  Azure Cosmos DB surfaces workload health primarily via RU, throttling, and latency—not
  OS-level CPU/RAM. Where the intake asked for “CPU, memory, and RU,” map CPU/memory
  to: normalized RU consumption, physical partition/CPU throttling (if exposed for the
  API), and memory-related signals only if present in public metrics (otherwise document
  the mapping in the README for operators).
  Use consistent labels/tags with other `azure-*-health` bundles. Prefer
  `az monitor metrics` with dimensions (DatabaseName, CollectionName) only when
  cost/latency constraints allow; default to account-level aggregation for first version.
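
The threshold variables in the Configuration section could drive the RU and throttling task roughly as follows. This is a minimal sketch, not the bundle's implementation: the function names `validate_lookback` and `evaluate_ru` and the `severity|title` output format are hypothetical, and it assumes the peak normalized RU percentage and the 429 count have already been extracted from the metrics response.

```shell
# Hypothetical helpers for cosmosdb-ru-throttling-metrics.sh (names are illustrative).

# Accept only the ##d##h shape (for example 7d, 12h, 1d6h) and echo it back.
validate_lookback() {
  local value="$1"
  if [ -n "$value" ] && printf '%s' "$value" | grep -Eq '^([0-9]+d)?([0-9]+h)?$'; then
    printf '%s\n' "$value"
  else
    return 1
  fi
}

# Emit one "severity|title" line per finding (assumed output format).
#   $1 = peak normalized RU consumption over the lookback, in percent
#   $2 = count of throttled (429) requests over the lookback
evaluate_ru() {
  local max_ru_pct="$1"
  local throttled="$2"
  local warn_pct="${RU_UTILIZATION_WARN_PERCENT:-80}"
  local throttle_warn="${THROTTLE_EVENTS_WARN:-1}"

  # awk handles fractional metric values, which plain shell arithmetic cannot.
  if awk -v v="$max_ru_pct" -v t="$warn_pct" 'BEGIN { exit !(v > t) }'; then
    printf '3|Normalized RU consumption peaked at %s%% (threshold %s%%)\n' \
      "$max_ru_pct" "$warn_pct"
  fi
  if awk -v v="$throttled" -v t="$throttle_warn" 'BEGIN { exit !(v >= t) }'; then
    printf '3|%s throttled (429) requests in lookback (threshold %s)\n' \
      "$throttled" "$throttle_warn"
  fi
}

# Untested wiring, shown only as a comment; $account_id is a placeholder:
#   az monitor metrics list --resource "$account_id" \
#     --metric NormalizedRUConsumption --aggregation Maximum \
#     --offset "$(validate_lookback "${METRIC_LOOKBACK:-7d}")" --interval PT1H
```

With the default thresholds, `evaluate_ru 100 250` emits two severity-3 lines, which lines up with the `undersized_provisioned_ru` scenario, while `evaluate_ru 40 0` emits nothing, matching `healthy_cosmos_account`.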

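The account selection described under Scope (a single name, a comma-separated list, or `All`) might be resolved as sketched below. The helper name `resolve_accounts` is hypothetical, and the sketch assumes the discovered names arrive as newline-separated text, for example from `az cosmosdb list -g "$AZURE_RESOURCE_GROUP" --query "[].name" -o tsv`.

```shell
# Hypothetical helper: turn the COSMOSDB_ACCOUNT_NAME selector into one account per line.
#   $1 = selector: "All", a single name, or a comma-separated list
#   $2 = newline-separated account names discovered in the resource group
resolve_accounts() {
  local selector="${1:-All}"
  local discovered="$2"

  if [ "$selector" = "All" ]; then
    # Discovery mode: pass through whatever the resource group contains.
    if [ -n "$discovered" ]; then
      printf '%s\n' "$discovered"
    fi
    return 0
  fi

  # Explicit mode: split on commas; `read` trims surrounding whitespace.
  printf '%s\n' "$selector" | tr ',' '\n' | while read -r name; do
    if [ -n "$name" ]; then
      printf '%s\n' "$name"
    fi
  done
}
```

Each line of output would then become one value of the `COSMOSDB_ACCOUNT_NAME` loop variable. The sketch does not check that explicitly requested names exist in the resource group; whether to fail or warn on an unknown name is left to the implementation.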

Labels: design-spec (Architect has produced a design spec), in-progress (Issue is being actively worked by the agent), new-codebundle (Scoped issue for SRE to implement a new CodeBundle)
