[design-spec] azure-cosmosdb-utilization-health #105

@rw-codebundle-agent

Design Spec: azure-cosmosdb-utilization-health

Parent: #102
Target: rw-cli-codecollection

Spec

codebundle_name: "azure-cosmosdb-utilization-health"
target_collection: "rw-cli-codecollection"
display_name: "Azure Cosmos DB Utilization and Sizing Health"
author: "rw-codebundle-agent"

purpose: |
  Evaluates historical and point-in-time Azure Cosmos DB utilization across request units (RU),
  throttling, latency, and storage growth to support right-sizing and throughput planning.

tasks:
  - name: "Analyze Cosmos DB Normalized RU Consumption Trends for Account `${COSMOSDB_ACCOUNT_NAME}` in Resource Group `${AZURE_RESOURCE_GROUP}`"
    description: "Pulls Azure Monitor time series for normalized RU consumption to detect sustained hot partitions or headroom loss."
    script_name: "cosmosdb-normalized-ru-trends.sh"
    expected_issue_severity: [3, 4]
    access_level: "read-only"
    data_type: "metrics"

  - name: "Analyze Cosmos DB Total Request Units Consumed for Account `${COSMOSDB_ACCOUNT_NAME}` in Resource Group `${AZURE_RESOURCE_GROUP}`"
    description: "Aggregates Total Request Units over the lookback window for workload growth and chargeback-style signals."
    script_name: "cosmosdb-total-ru-consumed.sh"
    expected_issue_severity: [3, 4]
    access_level: "read-only"
    data_type: "metrics"

  - name: "Check Cosmos DB Throttling and HTTP 429 Rate for Account `${COSMOSDB_ACCOUNT_NAME}` in Resource Group `${AZURE_RESOURCE_GROUP}`"
    description: "Correlates throttled requests / 429 indicators with provisioned throughput to flag undersizing."
    script_name: "cosmosdb-throttling-429.sh"
    expected_issue_severity: [3, 4]
    access_level: "read-only"
    data_type: "metrics"

  - name: "Analyze Cosmos DB Server-side Latency for Account `${COSMOSDB_ACCOUNT_NAME}` in Resource Group `${AZURE_RESOURCE_GROUP}`"
    description: "Reviews server-side latency metrics for regressions that often precede saturation or hot keys."
    script_name: "cosmosdb-server-latency.sh"
    expected_issue_severity: [3, 4]
    access_level: "read-only"
    data_type: "metrics"

  - name: "Analyze Cosmos DB Data and Index Storage Utilization for Account `${COSMOSDB_ACCOUNT_NAME}` in Resource Group `${AZURE_RESOURCE_GROUP}`"
    description: "Tracks data and index storage growth; flags rapid expansion that may drive partition count and cost."
    script_name: "cosmosdb-storage-utilization.sh"
    expected_issue_severity: [3, 4]
    access_level: "read-only"
    data_type: "metrics"

  - name: "Analyze Cosmos DB Provisioned Throughput vs Consumed Load for Account `${COSMOSDB_ACCOUNT_NAME}` in Resource Group `${AZURE_RESOURCE_GROUP}`"
    description: "Compares autoscale or manual provisioned RU/s against consumed RU patterns for oversizing and undersizing recommendations."
    script_name: "cosmosdb-throughput-sizing.sh"
    expected_issue_severity: [3, 4]
    access_level: "read-only"
    data_type: "metrics"

scope:
  level: "ResourceGroup"
  qualifiers:
    - AZ_SUBSCRIPTION
    - AZURE_RESOURCE_GROUP
  iteration_pattern: |
    Same as azure-cosmosdb-config-health: run against the account named in `${COSMOSDB_ACCOUNT_NAME}`,
    or discover every account in the resource group when it is set to `All`.

resource_types:
  - "azure_cosmosdb_database_account"

generation_strategy: |
  Pair with azure-cosmosdb-config-health generation: same resource match and SLX qualifiers so
  operators can run configuration and utilization task sets side by side per account.

env_vars:
  - name: AZ_SUBSCRIPTION
    description: "Azure subscription ID (UUID)"
    required: true

  - name: AZURE_RESOURCE_GROUP
    description: "Resource group containing the Cosmos DB account(s)"
    required: true

  - name: COSMOSDB_ACCOUNT_NAME
    description: "Cosmos DB account name, or All for every account in the resource group"
    required: false
    default: "All"

  - name: METRICS_LOOKBACK_DAYS
    description: "Days of historical metrics to analyze for trends"
    required: false
    default: "14"

  - name: NORMALIZED_RU_THRESHOLD_PCT
    description: "Percentage of normalized RU consumption above which to raise sizing issues"
    required: false
    default: "80"

  - name: THROTTLE_EVENTS_THRESHOLD
    description: "Minimum throttled request count (or rate) in the window to flag undersizing"
    required: false
    default: "1"

secrets:
  - name: azure_credentials
    description: "Service principal used by Azure CLI"
    format: |
      JSON object with: CLIENT_ID, TENANT_ID, CLIENT_SECRET, SUBSCRIPTION_ID

platform:
  name: "azure"
  cli_tools:
    - "az"
    - "jq"
  auth_methods:
    - "Service Principal (azure_credentials)"
  api_docs: "https://learn.microsoft.com/en-us/azure/azure-monitor/reference/supported-metrics/microsoft-documentdb-databaseaccounts-metrics"

related_bundles:
  - name: "azure-cosmosdb-config-health"
    relationship: "complements"
    notes: "Configuration and DR/network checks; this bundle focuses on utilization and sizing signals."
  - name: "azure-appservice-plan-health"
    relationship: "complements"
    notes: "Similar weekly utilization and capacity pattern for Azure PaaS; reuse metric query idioms."
  - name: "azure-db-health"
    relationship: "complements"
    notes: "Cloud Custodian bundle in another collection lists broad DB CPU/memory patterns; Cosmos-specific RU metrics are not covered there."

test_scenarios:
  - name: "right_sized_account"
    description: "Normalized RU below threshold, no throttling, stable latency"
    expected_issues: 0

  - name: "undersized_account"
    description: "High normalized RU with throttling events in the window"
    expected_issues: 2
    expected_severities: [3, 4]

notes: |
  The intake request referenced historical CPU and memory. For Cosmos DB, implementers should map
  those intents to service-native metrics (normalized RU, total RU consumed, server-side latency,
  storage) rather than host CPU/RAM, which are not customer-visible the way they are on IaaS.
  Use `az monitor metrics list` with appropriate metric names and dimensions (DatabaseName,
  CollectionName, Region) where APIs allow. Keep tasks under ~60s combined where possible by
  narrowing time grain and parallelizing account loops.
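
The notes above point implementers at `az monitor metrics list`. As a rough sketch only (not the bundle's actual script), a normalized-RU check might look like the following. It assumes the Azure Monitor metric name `NormalizedRUConsumption` with `Maximum` aggregation at a one-hour grain. The function names `fetch_normalized_ru_max` and `count_over_threshold` are hypothetical, and `awk` stands in here for the `jq` post-processing the spec lists.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical and not part of the spec.

# Print one hourly Maximum NormalizedRUConsumption value per line for an account.
# Assumes the Azure CLI is already logged in with the azure_credentials service principal.
fetch_normalized_ru_max() {
  local account="$1" resource_id
  resource_id="$(az cosmosdb show \
    --name "$account" \
    --resource-group "$AZURE_RESOURCE_GROUP" \
    --subscription "$AZ_SUBSCRIPTION" \
    --query id -o tsv)" || return 1
  az monitor metrics list \
    --resource "$resource_id" \
    --metric "NormalizedRUConsumption" \
    --aggregation Maximum \
    --interval PT1H \
    --offset "${METRICS_LOOKBACK_DAYS:-14}d" \
    --query "value[0].timeseries[0].data[].maximum" \
    -o tsv
}

# Read numeric samples on stdin and print "<samples above threshold> <total samples>".
# Non-numeric lines (for example empty or null data points) are skipped.
count_over_threshold() {
  local threshold="$1"
  awk -v t="$threshold" '
    /^[0-9]+(\.[0-9]+)?$/ { n++; if ($1 + 0 > t + 0) over++ }
    END { printf "%d %d\n", over + 0, n + 0 }
  '
}

# Example wiring (requires a real account, so it is left commented out):
# fetch_normalized_ru_max "$COSMOSDB_ACCOUNT_NAME" \
#   | count_over_threshold "${NORMALIZED_RU_THRESHOLD_PCT:-80}"
```

Splitting the fetch from the threshold count keeps the counting logic testable without Azure access. A real task would still need to decide how many samples above the threshold count as "sustained" before raising a severity 3 or 4 issue.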

Labels: completed (Agent work completed), design-spec (Architect has produced a design spec), new-codebundle (Scoped issue for SRE to implement a new CodeBundle)