# --- Identity ---
codebundle_name: "azure-cosmosdb-health"
target_collection: "rw-cli-codecollection"
display_name: "Azure Cosmos DB Health"
author: "rw-codebundle-agent"
# --- Purpose ---
purpose: |
Monitors Azure Cosmos DB accounts for platform-reported health, account configuration
(security, resiliency, throughput model), and utilization via Azure Monitor metrics
including request units, throttling, latency, and storage—supporting right-sizing and
configuration review. Cosmos DB does not expose classic VM-style CPU/memory; operator-facing
signals are normalized RU consumption, 429s/throttling, server-side latency, and data/index
storage against account limits, with lookback windows for trend context.
# --- Tasks ---
# Each task becomes one Robot Framework task + one bash script (3-12 total).
tasks:
- name: "Check Cosmos DB Resource Health for Account in Resource Group"
description: "Queries Azure Resource Health for Microsoft.DocumentDB/databaseAccounts in the target resource group; flags provider-reported availability incidents."
script_name: "cosmosdb-resource-health.sh"
expected_issue_severity: [2, 3]
access_level: "read-only"
data_type: "logs-config"
- name: "Check Cosmos DB Account Network and Security Configuration"
description: "Reviews public network access, IP/firewall rules, VNet integration when present, and whether private endpoints are in use where policy expects them; does not change settings."
script_name: "cosmosdb-network-config.sh"
expected_issue_severity: [2, 3]
access_level: "read-only"
data_type: "logs-config"
- name: "Check Cosmos DB Multi-Region and Failover Readiness"
description: "Validates number of read regions, write region, and failover/availability zone posture against organizational HA expectations; surfaces single-region and misconfigured failover."
script_name: "cosmosdb-regions-failover.sh"
expected_issue_severity: [2, 3]
access_level: "read-only"
data_type: "config"
- name: "Check Cosmos DB Backup and Point-in-Time Recovery Settings"
description: "Reads the backup policy (periodic or continuous) and retention signals exposed by the management plane (API-version dependent); flags a backup mode or retention that does not match what is expected for the tier."
script_name: "cosmosdb-backup-policy.sh"
expected_issue_severity: [2, 4]
access_level: "read-only"
data_type: "config"
- name: "Analyze Historical RU Consumption and Throttling"
description: "Pulls Azure Monitor time series (e.g. TotalRequestUnits, NormalizedRUConsumption, and TotalRequests filtered to StatusCode 429 where available) over a configurable lookback; compares them to provisioned or autoscale max throughput; flags sustained throttling and low headroom."
script_name: "cosmosdb-ru-throttling-metrics.sh"
expected_issue_severity: [3, 4]
access_level: "read-only"
data_type: "metrics"
- name: "Analyze Data and Index Storage Utilization"
description: "Aggregates data and index storage (e.g. DataUsage, IndexUsage) where metrics exist; surfaces growth trends against account or partition limits; supports sizing conversations."
script_name: "cosmosdb-storage-utilization.sh"
expected_issue_severity: [2, 3]
access_level: "read-only"
data_type: "metrics"
- name: "Analyze Server-Side Latency and Error Rates"
description: "Uses ServerSideLatency and TotalRequests split by StatusCode (or equivalent metrics) to spot latency regressions and a high share of failed requests across APIs; compares against historical windows to reduce false positives."
script_name: "cosmosdb-latency-error-metrics.sh"
expected_issue_severity: [3, 4]
access_level: "read-only"
data_type: "metrics"
- name: "Review Throughput Provisioning and Autoscale Fit"
description: "Distinguishes serverless vs provisioned vs autoscale; checks min/max RU and utilization ratio; provides coarse rightsizing signals (with operator confirmation; no automatic scaling)."
script_name: "cosmosdb-throughput-provisioning.sh"
expected_issue_severity: [2, 3]
access_level: "read-only"
data_type: "config"
# --- Scope ---
scope:
level: "ResourceGroup"
qualifiers:
- AZURE_RESOURCE_SUBSCRIPTION_ID
- AZURE_RESOURCE_GROUP
iteration_pattern: |
Default: one runbook per discovered Cosmos DB account in the resource group, or
a single account when `COSMOSDB_ACCOUNT_NAME` names one. List accounts with
`az cosmosdb list -g <resource-group>` and run checks per account; support `All` or
a comma-separated account list. Loop variable: `COSMOSDB_ACCOUNT_NAME`.
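A minimal bash sketch of the account resolution described above, assuming `COSMOSDB_ACCOUNT_NAME` carries either `All` or a comma-separated list. The helper names `list_accounts_in_rg` and `resolve_accounts` are hypothetical and not part of any existing codebundle; only `az cosmosdb list` and the env vars come from this spec.

```shell
#!/usr/bin/env bash
# Sketch only: turn COSMOSDB_ACCOUNT_NAME into one account name per line.
set -euo pipefail

# Hypothetical helper: discover every Cosmos DB account in the resource group.
list_accounts_in_rg() {
  az cosmosdb list \
    --resource-group "$AZURE_RESOURCE_GROUP" \
    --subscription "$AZURE_RESOURCE_SUBSCRIPTION_ID" \
    --query "[].name" --output tsv
}

# Hypothetical helper: expand "All" via discovery, otherwise split the
# comma-separated value and trim surrounding whitespace from each name.
resolve_accounts() {
  local spec="${1:-All}"
  if [[ "$spec" == "All" ]]; then
    list_accounts_in_rg
    return
  fi
  local -a names
  IFS=',' read -ra names <<< "$spec"
  local name
  for name in "${names[@]}"; do
    name="${name#"${name%%[![:space:]]*}"}"   # strip leading whitespace
    name="${name%"${name##*[![:space:]]}"}"   # strip trailing whitespace
    if [[ -n "$name" ]]; then
      printf '%s\n' "$name"
    fi
  done
}
```

A task script could then loop with `resolve_accounts "${COSMOSDB_ACCOUNT_NAME:-All}" | while read -r account; do ...; done`, running each check once per account.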
# --- Resource Discovery ---
resource_types:
- "azure_cosmos_cosmosdb"
generation_strategy: |
One SLX per Cosmos DB account in the resource group (qualifiers: resource_group,
subscription, account name). Resource match on Microsoft.DocumentDB/databaseAccounts
in RunWhen Local; confirm the exact `resourceTypes` string against the platform
registry for Azure Cosmos DB before shipping.
# --- Configuration ---
env_vars:
- name: AZURE_RESOURCE_SUBSCRIPTION_ID
description: "Azure subscription ID for API and metrics"
required: true
- name: AZURE_RESOURCE_GROUP
description: "Resource group containing Cosmos DB account(s)"
required: true
- name: COSMOSDB_ACCOUNT_NAME
description: "Single account name, comma-separated list, or All for discovery in the RG"
required: false
default: "All"
- name: METRIC_LOOKBACK
description: "Lookback for historical metrics (e.g. 7d, 30d) passed to `az monitor metrics` or equivalent"
required: false
default: "7d"
- name: RU_UTILIZATION_WARN_PERCENT
description: "Warn when max normalized RU consumption over lookback exceeds this percent of provisioned/ceiling (operator-tunable)"
required: false
default: "80"
- name: THROTTLE_EVENTS_WARN
description: "Minimum throttled/429 count (or rate) in lookback to raise an issue"
required: false
default: "1"
secrets:
- name: platform_credentials
description: "Azure service principal for read-only resource and metrics"
format: |
JSON object with: CLIENT_ID, TENANT_ID, CLIENT_SECRET, SUBSCRIPTION_ID
- name: platform_pat
description: "Optional alternative auth path if used by collection conventions"
format: "Plain text PAT string"
# --- Platform Context ---
platform:
name: "azure"
cli_tools:
- "az"
- "az monitor"
- "jq"
auth_methods:
- "Service Principal (azure_credentials / platform_credentials)"
api_docs: "https://learn.microsoft.com/en-us/azure/cosmos-db/"
# --- Relationships ---
related_bundles:
- name: "azure-c7n-codecollection / azure-db-health"
relationship: "complements"
notes: "Cloud Custodian SQL/Redis style checks with no Cosmos DB-specific RU, partition, or API metrics; this bundle is specialized for Cosmos DB (Microsoft.DocumentDB)."
- name: "azure-kv-health"
relationship: "complements"
notes: "Same read-only Azure health + metrics + config pattern; reuse structure and issue schema."
- name: "azure-servicebus-health"
relationship: "complements"
notes: "Parallel Azure PaaS health bundle for metrics, resource health, and configuration tasks."
# --- Test Strategy ---
test_scenarios:
- name: "healthy_cosmos_account"
description: "Resource healthy, no throttling, RU headroom, backup enabled, multi-region as expected"
expected_issues: 0
- name: "undersized_provisioned_ru"
description: "Sustained RU at ceiling with throttling in lookback"
expected_issues: 2
expected_severities: [3, 3]
# --- Notes ---
notes: |
Azure Cosmos DB surfaces workload health primarily via RU, throttling, and latency, not
OS-level CPU/RAM. Where the intake asked for “CPU, memory, and RU,” map CPU/memory
to normalized RU consumption and to physical-partition or CPU throttling (if exposed for
the API). Report memory-related signals only if they are present in public metrics;
otherwise document the mapping in the README for operators.
Use consistent labels/tags with other `azure-*-health` bundles. Prefer
`az monitor metrics` with dimensions (DatabaseName, CollectionName) only when
cost/latency constraints allow; default to account-level aggregation for first version.
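As a rough illustration of the account-level default, the sketch below queries `NormalizedRUConsumption` without dimensions and compares the peak to `RU_UTILIZATION_WARN_PERCENT`. The function names and the issue message wording are hypothetical. The one-hour interval is an assumption rather than a requirement of this spec, and the JSON shape read by `jq` (`value[].timeseries[].data[].maximum`) should be confirmed against real `az monitor metrics list` output.

```shell
#!/usr/bin/env bash
# Sketch only: account-level RU headroom check, no per-collection dimensions.
set -euo pipefail

# Hypothetical helper: $1 is the full ARM resource ID of the Cosmos DB account.
fetch_normalized_ru() {
  local account_id="$1"
  az monitor metrics list \
    --resource "$account_id" \
    --metrics NormalizedRUConsumption \
    --aggregation Maximum \
    --offset "${METRIC_LOOKBACK:-7d}" \
    --interval 1h \
    --output json
}

# Reads metrics JSON on stdin and prints the largest datapoint (0 if none).
max_datapoint() {
  jq -r '[.value[].timeseries[].data[].maximum | select(. != null)] | max // 0'
}

# Succeeds when value > limit; awk handles fractional percentages.
exceeds_threshold() {
  awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value + 0 > limit + 0) }'
}

# Prints a one-line issue summary when the peak crosses the warn threshold.
ru_issue_for() {
  local account="$1" peak="$2" limit="${RU_UTILIZATION_WARN_PERCENT:-80}"
  if exceeds_threshold "$peak" "$limit"; then
    printf 'Cosmos DB account %s peaked at %s%% normalized RU (warn above %s%%)\n' \
      "$account" "$peak" "$limit"
  fi
}
```

A task script might chain these as `peak="$(fetch_normalized_ru "$account_id" | max_datapoint)"` followed by `ru_issue_for "$account" "$peak"`. Adding the DatabaseName and CollectionName dimensions would be a later refinement once cost and latency are understood.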
Design Spec: azure-cosmosdb-health
Parent: #102
Target: rw-cli-codecollection