Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
apiVersion: runwhen.com/v1
kind: GenerationRules
spec:
  platform: azure
  generationRules:
    - resourceTypes:
        - azure_cosmosdb_database_account
      matchRules:
        - type: pattern
          pattern: ".+"
          properties: [name]
          mode: substring
      slxs:
        - baseName: azure-cosmosdb-utilization-health
          qualifiers: ["subscription_id", "resource_group"]
          baseTemplateName: azure-cosmosdb-utilization-health
          levelOfDetail: basic
          outputItems:
            - type: slx
            - type: sli
            - type: runbook
              templateName: azure-cosmosdb-utilization-health-taskset.yaml
@@ -0,0 +1,54 @@
apiVersion: runwhen.com/v1
kind: ServiceLevelIndicator
metadata:
  name: {{slx_name}}
  labels:
    {% include "common-labels.yaml" %}
  annotations:
    {% include "common-annotations.yaml" %}
spec:
  displayUnitsLong: Health Score
  displayUnitsShort: score
  locations:
    - {{default_location}}
  description: Composite 0-1 score from normalized RU headroom, HTTP 429 rate, and server-side latency for Cosmos DB {{ match_resource.name }}.
  codeBundle:
    {% if repo_url %}
    repoUrl: {{repo_url}}
    {% else %}
    repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git
    {% endif %}
    {% if ref %}
    ref: {{ref}}
    {% else %}
    ref: main
    {% endif %}
    pathToRobot: codebundles/azure-cosmosdb-utilization-health/sli.robot
  intervalStrategy: intermezzo
  intervalSeconds: 300
  configProvided:
    - name: AZ_SUBSCRIPTION
      value: "{{ subscription_id }}"
    - name: AZURE_RESOURCE_GROUP
      value: "{{ resource_group.name }}"
    - name: COSMOSDB_ACCOUNT_NAME
      value: "{{ match_resource.name }}"
    - name: NORMALIZED_RU_THRESHOLD_PCT
      value: "80"
    - name: THROTTLE_EVENTS_THRESHOLD
      value: "1"
    - name: SERVER_LATENCY_MS_THRESHOLD
      value: "100"
    - name: SLI_METRICS_OFFSET
      value: "2d"
  secretsProvided:
  {% if wb_version %}
    {% include "azure-auth.yaml" ignore missing %}
  {% else %}
    - name: azure_credentials
      workspaceKey: AUTH DETAILS NOT FOUND
  {% endif %}
  alertConfig:
    tasks:
      persona: eager-edgar
      sessionTTL: 10m
@@ -0,0 +1,31 @@
apiVersion: runwhen.com/v1
kind: ServiceLevelX
metadata:
  name: {{ slx_name }}
  labels:
    {% include "common-labels.yaml" %}
  annotations:
    {% include "common-annotations.yaml" %}
spec:
  imageURL: https://storage.googleapis.com/runwhen-nonprod-shared-images/icons/azure/databases/10137-icon-service-Azure-Cosmos-DB.svg
  alias: {{ match_resource.name }} Cosmos DB Utilization Health
  asMeasuredBy: Azure Monitor metrics for normalized RU, 429 rate, latency, storage, and throughput sizing.
  configProvided:
    - name: SLX_PLACEHOLDER
      value: SLX_PLACEHOLDER
  owners:
    - {{ workspace.owner_email }}
  statement: Cosmos DB account {{ match_resource.name }} in resource group {{ resource_group.name }} should show healthy utilization without chronic throttling or mis-sized throughput.
  additionalContext:
    {% include "azure-hierarchy.yaml" ignore missing %}
    qualified_name: "{{ match_resource.qualified_name }}"
  tags:
    {% include "azure-tags.yaml" ignore missing %}
    - name: cloud
      value: azure
    - name: service
      value: cosmosdb
    - name: scope
      value: resource-group
    - name: access
      value: read-only
@@ -0,0 +1,53 @@
apiVersion: runwhen.com/v1
kind: Runbook
metadata:
  name: {{slx_name}}
  labels:
    {% include "common-labels.yaml" %}
  annotations:
    {% include "common-annotations.yaml" %}
spec:
  location: {{default_location}}
  description: Analyze Cosmos DB utilization metrics for {{ match_resource.name }} in resource group {{ resource_group.name }} (subscription {{ subscription_name }}).
  codeBundle:
    {% if repo_url %}
    repoUrl: {{repo_url}}
    {% else %}
    repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git
    {% endif %}
    {% if ref %}
    ref: {{ref}}
    {% else %}
    ref: main
    {% endif %}
    pathToRobot: codebundles/azure-cosmosdb-utilization-health/runbook.robot
  configProvided:
    - name: AZ_SUBSCRIPTION
      value: "{{ subscription_id }}"
    - name: AZURE_RESOURCE_GROUP
      value: "{{ resource_group.name }}"
    - name: COSMOSDB_ACCOUNT_NAME
      value: "{{ match_resource.name }}"
    - name: METRICS_LOOKBACK_DAYS
      value: "14"
    - name: NORMALIZED_RU_THRESHOLD_PCT
      value: "80"
    - name: THROTTLE_EVENTS_THRESHOLD
      value: "1"
    - name: SERVER_LATENCY_MS_THRESHOLD
      value: "100"
    - name: STORAGE_GROWTH_PCT_THRESHOLD
      value: "25"
    - name: UNDERUTILIZED_NORMALIZED_PCT
      value: "15"
    - name: RU_DAILY_GROWTH_RATIO
      value: "1.5"
    - name: AZURE_SUBSCRIPTION_NAME
      value: "{{ subscription_name }}"
  secretsProvided:
  {% if wb_version %}
    {% include "azure-auth.yaml" ignore missing %}
  {% else %}
    - name: azure_credentials
      workspaceKey: AUTH DETAILS NOT FOUND
  {% endif %}
@@ -0,0 +1,8 @@
### Testing `azure-cosmosdb-utilization-health`

This directory holds optional validation and Terraform for a sample Cosmos DB account.

1. **Quick validation**: From `.test`, run `task` to syntax-check all bundle scripts with `bash -n` and check Terraform formatting, or run `bash validate-all-tests.sh` for the syntax check alone.
2. **Terraform**: Configure Azure credentials for your environment, then run `task build-infra` to create a minimal Cosmos DB account in a resource group for live metric queries. Destroy it with `task clean`.

Do not commit secrets. Use `terraform.tfvars` locally for non-production subscriptions only.
@@ -0,0 +1,35 @@
version: "3"

tasks:
  default:
    desc: "Validate bundle shell scripts and (optionally) Terraform format"
    cmds:
      - bash validate-all-tests.sh
      - task: terraform-fmt-check

  terraform-fmt-check:
    desc: "terraform fmt check (requires terraform in PATH)"
    dir: terraform
    cmds:
      - |
        if command -v terraform >/dev/null 2>&1; then
          terraform fmt -check -recursive || terraform fmt -recursive
        else
          echo "terraform not installed; skipping fmt"
        fi

  build-infra:
    desc: "Provision test Cosmos DB account (optional; consumes Azure quota)"
    dir: terraform
    cmds:
      - terraform init
      - terraform apply -auto-approve

  clean:
    desc: "Destroy test infrastructure"
    dir: terraform
    cmds:
      - |
        if [ -f terraform.tfstate ]; then
          terraform destroy -auto-approve
        fi
@@ -0,0 +1,5 @@
terraform {
  backend "local" {
    path = "terraform.tfstate"
  }
}
@@ -0,0 +1,29 @@
resource "azurerm_resource_group" "test_rg" {
  name     = var.resource_group
  location = var.location
}

resource "random_string" "suffix" {
  length  = 8
  upper   = false
  special = false
}

resource "azurerm_cosmosdb_account" "test" {
  name                = "rwcosmos${random_string.suffix.result}"
  location            = azurerm_resource_group.test_rg.location
  resource_group_name = azurerm_resource_group.test_rg.name
  offer_type          = "Standard"
  kind                = "GlobalDocumentDB"

  consistency_policy {
    consistency_level = "Session"
  }

  geo_location {
    location          = azurerm_resource_group.test_rg.location
    failover_priority = 0
  }

  tags = var.tags
}
@@ -0,0 +1,11 @@
output "cosmosdb_account_name" {
  value = azurerm_cosmosdb_account.test.name
}

output "resource_group_name" {
  value = azurerm_resource_group.test_rg.name
}

output "subscription_hint" {
  value = "Use the same subscription ID as Terraform provider context for RunWhen config."
}
@@ -0,0 +1,16 @@
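terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "4.18.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }
}

# azurerm 4.x requires a subscription ID: set ARM_SUBSCRIPTION_ID in the
# environment or add subscription_id to this block.
provider "azurerm" {
  features {}
}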
@@ -0,0 +1,7 @@
resource_group = "azure-cosmosdb-utilization-health-test"
location       = "East US"
tags = {
  env       = "test"
  lifecycle = "deleteme"
  product   = "runwhen"
}
@@ -0,0 +1,12 @@
variable "resource_group" {
  type = string
}

variable "location" {
  type    = string
  default = "East US"
}

variable "tags" {
  type = map(string)
}
@@ -0,0 +1,13 @@
#!/usr/bin/env bash
set -euo pipefail
# Validates shell syntax for bundle scripts (optional shellcheck when installed).
ROOT="$(cd "$(dirname "$0")/.." && pwd)"
echo "Checking scripts under ${ROOT}"
for f in "${ROOT}"/*.sh; do
  [[ -f "$f" ]] || continue
  bash -n "$f"
done
if command -v shellcheck >/dev/null 2>&1; then
  shellcheck "${ROOT}"/*.sh || true
fi
echo "OK: bash syntax check passed for bundle scripts."
68 changes: 68 additions & 0 deletions codebundles/azure-cosmosdb-utilization-health/README.md
@@ -0,0 +1,68 @@
# Azure Cosmos DB Utilization and Sizing Health

This CodeBundle evaluates historical and point-in-time utilization for Azure Cosmos DB using Azure Monitor: normalized RU consumption, total RU, HTTP 429 throttling, server-side latency, storage growth, and throughput sizing signals. It complements configuration-focused bundles (for example `azure-cosmosdb-config-health`) with capacity and cost-oriented metrics.

## Overview

- **Normalized RU trends**: Detects sustained high utilization and rising pressure in the second half of the lookback window compared with the first half.
- **Total RU consumed**: Flags sharp growth in daily `TotalRequestUnits` between halves of the window.
- **Throttling / 429**: Sums `TotalRequests` filtered to status `429` to catch undersizing or hot partitions.
- **Server-side latency**: Compares `ServerSideLatency` hourly averages to a configurable millisecond threshold.
- **Storage**: Tracks `DataUsage` and `IndexUsage` for rapid expansion.
- **Throughput sizing**: Highlights ceiling risk from high normalized RU and possible over-provisioning when normalized RU stays low while `ProvisionedThroughput` remains high.
- **SLI**: A lightweight `sli.robot` averages binary checks (normalized RU, 429 count, latency) into a 0–1 score.

Metric names follow Microsoft’s supported metrics for `Microsoft.DocumentDB/databaseAccounts` (for example `NormalizedRUConsumption`, `TotalRequestUnits`, `TotalRequests` with `StatusCode`, `ServerSideLatency`, `DataUsage`, `IndexUsage`, `ProvisionedThroughput`).
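
The SLI described in the overview averages three binary checks into a 0–1 score. The following is a minimal Python sketch of that idea, not the code in `sli.robot`. The function name, its parameters, and the exact comparison operators (`>=` versus `>`) are assumptions chosen to match the variable descriptions in this README.

```python
from statistics import mean


def sli_score(
    max_normalized_ru_pct: float,
    throttle_count: float,
    max_latency_ms: float,
    ru_threshold_pct: float = 80.0,      # NORMALIZED_RU_THRESHOLD_PCT
    throttle_threshold: float = 1.0,     # THROTTLE_EVENTS_THRESHOLD
    latency_threshold_ms: float = 100.0  # SERVER_LATENCY_MS_THRESHOLD
) -> float:
    """Average three pass/fail checks into a score between 0 and 1.

    Hypothetical helper: each check contributes 1.0 when healthy, 0.0 otherwise.
    """
    checks = [
        # Healthy while normalized RU stays below the threshold (assumed operator).
        1.0 if max_normalized_ru_pct < ru_threshold_pct else 0.0,
        # Healthy while the 429 count stays below the minimum that raises an issue.
        1.0 if throttle_count < throttle_threshold else 0.0,
        # Healthy while latency does not exceed the maximum acceptable value.
        1.0 if max_latency_ms <= latency_threshold_ms else 0.0,
    ]
    return mean(checks)


if __name__ == "__main__":
    print(sli_score(max_normalized_ru_pct=92.0, throttle_count=0, max_latency_ms=35.0))
```

With three checks, the score can only take the values 0, 1/3, 2/3, or 1, so an alert threshold between those steps decides how many failing signals are tolerated.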

## Configuration

### Required variables

- `AZ_SUBSCRIPTION`: Azure subscription ID (UUID) used for `az account set` and metric queries.
- `AZURE_RESOURCE_GROUP`: Resource group containing the Cosmos DB account(s).

### Optional variables

- `COSMOSDB_ACCOUNT_NAME`: Cosmos DB account name, or `All` to scan every account in the group (default: `All`).
- `METRICS_LOOKBACK_DAYS`: Days of history for runbook tasks (default: `14`).
- `NORMALIZED_RU_THRESHOLD_PCT`: Normalized RU percentage that triggers utilization and sizing issues (default: `80`).
- `THROTTLE_EVENTS_THRESHOLD`: Minimum total HTTP 429 count in the window to raise throttling issues (default: `1`).
- `SERVER_LATENCY_MS_THRESHOLD`: Maximum acceptable hourly average `ServerSideLatency` in ms (default: `100`).
- `STORAGE_GROWTH_PCT_THRESHOLD`: Percent growth from start to end of the window on `DataUsage` / `IndexUsage` that flags storage expansion (default: `25`).
- `UNDERUTILIZED_NORMALIZED_PCT`: Normalized RU level used with `ProvisionedThroughput` to suggest over-provisioning (default: `15`).
- `RU_DAILY_GROWTH_RATIO`: Ratio of later-window to earlier-window average daily total RU for spike detection (default: `1.5`).
- `AZURE_SUBSCRIPTION_NAME`: Friendly subscription label for context in reports (default: `Azure Subscription`).

### SLI-only variables

- `SLI_METRICS_OFFSET`: Short lookback for the SLI snapshot (default: `2d`), for example `2d` or `24h`.

### Secrets

- `azure_credentials`: JSON or structured secret consumed by the RunWhen Azure integration (typically `CLIENT_ID`, `TENANT_ID`, `CLIENT_SECRET`, `SUBSCRIPTION_ID` / `AZURE_SUBSCRIPTION_ID`). If absent, ambient `az login` / workload identity is assumed.

## Tasks overview

### Analyze Cosmos DB Normalized RU Consumption Trends

Uses `NormalizedRUConsumption` to detect sustained values above the threshold and an upward trend in the second half of the window relative to the first half.
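
As an illustration only, the two signals in this task can be sketched in Python as follows. The bundle's actual scripts are not shown here; the function name and the `sustained_fraction` parameter (the share of samples that must exceed the threshold before utilization counts as sustained) are hypothetical.

```python
from statistics import mean


def analyze_normalized_ru(
    samples: list[float],
    threshold_pct: float = 80.0,      # NORMALIZED_RU_THRESHOLD_PCT
    sustained_fraction: float = 0.5,  # hypothetical: not a documented variable
) -> dict[str, bool]:
    """Flag sustained high utilization and a rising half-over-half trend.

    `samples` are NormalizedRUConsumption percentages in chronological order.
    """
    if len(samples) < 2:
        # Too little data to split into halves or judge persistence.
        return {"sustained_high": False, "rising": False}

    mid = len(samples) // 2
    first_half, second_half = samples[:mid], samples[mid:]
    above = sum(1 for value in samples if value >= threshold_pct)

    return {
        "sustained_high": above / len(samples) >= sustained_fraction,
        "rising": mean(second_half) > mean(first_half),
    }


if __name__ == "__main__":
    print(analyze_normalized_ru([35.0, 42.0, 81.0, 88.0]))
```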

### Analyze Cosmos DB Total Request Units Consumed

Uses daily `TotalRequestUnits` totals to detect a sharp increase between the first and second half of the lookback period.
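
A rough Python sketch of the half-over-half comparison, assuming the daily totals have already been fetched and ordered oldest to newest. The function names are hypothetical, and how the real task handles an empty or all-zero first half is not documented, so this sketch simply declines to compute a ratio in that case.

```python
from statistics import mean
from typing import Optional


def ru_growth_ratio(daily_totals: list[float]) -> Optional[float]:
    """Ratio of the later-half average daily RU to the earlier-half average."""
    mid = len(daily_totals) // 2
    first_half, second_half = daily_totals[:mid], daily_totals[mid:]
    if not first_half or mean(first_half) == 0:
        # Assumption: no baseline means no meaningful ratio.
        return None
    return mean(second_half) / mean(first_half)


def is_ru_spike(daily_totals: list[float], ratio_threshold: float = 1.5) -> bool:
    """True when growth meets or exceeds RU_DAILY_GROWTH_RATIO (assumed >=)."""
    ratio = ru_growth_ratio(daily_totals)
    return ratio is not None and ratio >= ratio_threshold


if __name__ == "__main__":
    totals = [100.0, 100.0, 200.0, 200.0]
    print(ru_growth_ratio(totals), is_ru_spike(totals))
```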

### Check Cosmos DB Throttling and HTTP 429 Rate

Queries `TotalRequests` with dimension filter `StatusCode eq '429'` and compares the aggregate to `THROTTLE_EVENTS_THRESHOLD`.
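
One way to aggregate such a query result is sketched below in Python. It assumes the response has the usual Azure Monitor metrics shape (`value` → `timeseries` → `data` points carrying a `total` field), as returned for example by `az monitor metrics list` with `--aggregation Total`. The function names and the sample payload are illustrative, not taken from the bundle.

```python
def sum_metric_totals(response: dict) -> float:
    """Sum every `total` data point across all metrics and time series."""
    total = 0.0
    for metric in response.get("value", []):
        for series in metric.get("timeseries", []):
            for point in series.get("data", []):
                # Buckets without traffic may omit `total` or report null.
                total += point.get("total") or 0.0
    return total


def is_throttled(response: dict, throttle_events_threshold: float = 1.0) -> bool:
    """True when the 429 count reaches THROTTLE_EVENTS_THRESHOLD (assumed >=)."""
    return sum_metric_totals(response) >= throttle_events_threshold


# Illustrative payload only; timestamps and counts are made up.
sample = {
    "value": [
        {
            "name": {"value": "TotalRequests"},
            "timeseries": [
                {
                    "data": [
                        {"timeStamp": "2024-01-01T00:00:00Z", "total": 3.0},
                        {"timeStamp": "2024-01-01T01:00:00Z", "total": None},
                        {"timeStamp": "2024-01-01T02:00:00Z", "total": 2.0},
                    ]
                }
            ],
        }
    ]
}

if __name__ == "__main__":
    print(sum_metric_totals(sample), is_throttled(sample))
```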

### Analyze Cosmos DB Server-side Latency

Evaluates `ServerSideLatency` hourly averages against `SERVER_LATENCY_MS_THRESHOLD`.

### Analyze Cosmos DB Data and Index Storage Utilization

Measures relative growth on `DataUsage` and `IndexUsage` against `STORAGE_GROWTH_PCT_THRESHOLD`.
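
The relative growth check can be sketched in Python as below, assuming growth is measured from the first to the last sample in the window, as the `STORAGE_GROWTH_PCT_THRESHOLD` description suggests. The function names are hypothetical, and the handling of a zero starting value is an assumption.

```python
from typing import Optional


def growth_pct(start_bytes: float, end_bytes: float) -> Optional[float]:
    """Percent change from the start to the end of the window."""
    if start_bytes <= 0:
        # Assumption: growth from an empty baseline is undefined, not infinite.
        return None
    return (end_bytes - start_bytes) * 100.0 / start_bytes


def storage_growth_exceeded(
    start_bytes: float,
    end_bytes: float,
    threshold_pct: float = 25.0,  # STORAGE_GROWTH_PCT_THRESHOLD
) -> bool:
    """True when growth meets or exceeds the threshold (assumed >=)."""
    pct = growth_pct(start_bytes, end_bytes)
    return pct is not None and pct >= threshold_pct


if __name__ == "__main__":
    # The same check would run once for DataUsage and once for IndexUsage.
    print(growth_pct(100.0, 130.0), storage_growth_exceeded(100.0, 130.0))
```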

### Analyze Cosmos DB Provisioned Throughput vs Consumed Load

Combines `NormalizedRUConsumption` with `ProvisionedThroughput` for ceiling and over-provisioning hints.
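
A simplified Python sketch of how the two hints might be derived. This README does not say what counts as "high" provisioned throughput, so `large_provisioned_ru` is a hypothetical cut-off, and the function name and return labels are likewise illustrative.

```python
def sizing_hint(
    peak_normalized_pct: float,
    provisioned_ru: float,
    high_pct: float = 80.0,                 # NORMALIZED_RU_THRESHOLD_PCT
    low_pct: float = 15.0,                  # UNDERUTILIZED_NORMALIZED_PCT
    large_provisioned_ru: float = 10000.0,  # hypothetical cut-off for "high"
) -> str:
    """Classify an account as near its ceiling, possibly over-provisioned, or ok."""
    if peak_normalized_pct >= high_pct:
        # Load is close to the provisioned limit.
        return "ceiling-risk"
    if peak_normalized_pct <= low_pct and provisioned_ru >= large_provisioned_ru:
        # Little of a large allocation is being used.
        return "possibly-over-provisioned"
    return "ok"


if __name__ == "__main__":
    print(sizing_hint(peak_normalized_pct=8.0, provisioned_ru=20000.0))
```

Checking the ceiling case first means an account that is both heavily used and heavily provisioned is reported as a ceiling risk rather than as over-provisioned.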