Changes from all commits (27 commits):
- `0353315` Add azure-cosmos-avad-test module for AVAD Change Feed Processor soak… (jeet1995, May 4, 2026)
- `a266d83` Move AVAD soak test into azure-cosmos-benchmark module (jeet1995, May 4, 2026)
- `626b0f1` Fix AVAD code to Java 8 compatibility for benchmark module (jeet1995, May 5, 2026)
- `a7136cb` Fix self-review findings: blocking bugs, watch issues, and chaos scope (jeet1995, May 5, 2026)
- `caa22d9` Revert unrelated hooks.json change (jeet1995, May 5, 2026)
- `949f3e1` Refactor Ingestor to use Cosmos bulk API (jeet1995, May 5, 2026)
- `20a30a3` Remove unused SoakMetrics class (jeet1995, May 5, 2026)
- `7339285` Align close() pattern across all components (jeet1995, May 5, 2026)
- `4966fd7` Share CosmosAsyncClient between workload and ReconciliationWriter (jeet1995, May 5, 2026)
- `040a68e` Replace inline fully qualified names with imports (jeet1995, May 5, 2026)
- `8a226d3` Add JSON-based configuration with env var overrides (jeet1995, May 5, 2026)
- `6c42d4c` Add local + AKS orchestration scripts (jeet1995, May 5, 2026)
- `34d8a26` Fix: remove unsupported setStartTime for AVAD CFP mode (jeet1995, May 5, 2026)
- `85d579b` Fix: enable contentResponseOnWrite on all clients (jeet1995, May 5, 2026)
- `22811a6` Make ReconciliationWriter synchronous (jeet1995, May 5, 2026)
- `fb6d5e1` Fix ingestor bulk response correlation with IdentityHashMap (jeet1995, May 5, 2026)
- `05c6fa2` Fix ingestor: replace Flux.interval+concatMap with blocking loop (jeet1995, May 5, 2026)
- `9f910d0` Remove inline CRTS tracking from AvadReader (jeet1995, May 5, 2026)
- `2c82dea` Propagate ReconciliationWriter failures to handleChanges (jeet1995, May 5, 2026)
- `de96e88` Remove soak-health container — use offline Reconciler instead (jeet1995, May 5, 2026)
- `691819d` Redesign reconciliation: Cosmos-only, per-reader, 5 sources (jeet1995, May 5, 2026)
- `dbc1146` Add Spark LV and AVAD change feed reader notebooks (jeet1995, May 5, 2026)
- `876db3a` Add Spark-based reconciler notebook (jeet1995, May 5, 2026)
- `d422e16` Fix Dockerfile: use -cp with explicit main class, local build (jeet1995, May 5, 2026)
- `d06c8ce` Tune CFP throughput: bulk recon writes, faster polling, region alignment (jeet1995, May 5, 2026)
- `5faaa4e` Fix Spark notebooks: widget config, checkpoint path, column schema (jeet1995, May 6, 2026)
- `432347d` Fix Spark AVAD: use AllVersionsAndDeletes mode, startFrom=Now (jeet1995, May 6, 2026)
21 changes: 21 additions & 0 deletions sdk/cosmos/azure-cosmos-benchmark/avad-soak/Dockerfile
@@ -0,0 +1,21 @@
# Single-stage build for cosmos-avad-test soak runner
# Build context: azure-cosmos-benchmark/ (module root)
# Requires: run mvn package locally first (produces the fat jar)

FROM eclipse-temurin:21-jre-jammy
WORKDIR /app

# Install curl for health probes
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
&& rm -rf /var/lib/apt/lists/*

COPY target/azure-cosmos-benchmark-*-jar-with-dependencies.jar /app/app.jar

# Health endpoint port
EXPOSE 8080

# JVM tuning for container environments
ENV JAVA_OPTS="-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0 -XX:+ExitOnOutOfMemoryError"

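# With "sh -c", the first runtime argument binds to $0 and the rest to $@,
# so any container args are forwarded to the Java main class below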
ENTRYPOINT ["sh", "-c", "java $JAVA_OPTS -cp /app/app.jar com.azure.cosmos.avadtest.Main $0 $@"]
46 changes: 46 additions & 0 deletions sdk/cosmos/azure-cosmos-benchmark/avad-soak/chaos/README.md
@@ -0,0 +1,46 @@
# Cosmos DB Soak Test — Chaos Library

Reusable chaos injection scenarios for AKS-hosted Cosmos DB
consumers. Works with any workload deployed via the soak
infra Helm chart.

## Scenarios

| Scenario | Script | What It Tests |
|----------|--------|---------------|
| Pod Kill | `pod-kill.sh` | Lease rebalancing after random pod loss |
| Partition Split | `partition-split.sh` | Continuation token validity across splits |

## Usage

### Manual — run one scenario

```bash
export NAMESPACE=cosmos-soak
export COSMOS_ACCOUNT=<your-account>
export COSMOS_RG=<your-rg>

# Kill a random AVAD CFP pod
bash chaos/scenarios/pod-kill.sh

# Trigger partition split (2x throughput)
SCALE_FACTOR=2 bash chaos/scenarios/partition-split.sh
```

### Automated — via soak orchestrator

The `run-soak.sh` orchestrator reads `chaos-schedule.yaml`
and fires scenarios on a phase-based schedule:

```
Warm-up → Steady → Chaos → Recovery → repeat
```

See `chaos-schedule.yaml` for interval/parameter config.
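
For illustration only — `run-soak.sh` itself is not part of this diff — an orchestrator loop could consume the schedule with `yq` (v4 assumed) roughly like this:

```bash
# Hypothetical sketch of reading chaos-schedule.yaml; not the actual run-soak.sh
COUNT=$(yq '.schedule | length' chaos-schedule.yaml)
for i in $(seq 0 $((COUNT - 1))); do
  # Skip disabled scenarios
  [ "$(yq ".schedule[$i].enabled" chaos-schedule.yaml)" = "true" ] || continue
  SCENARIO=$(yq ".schedule[$i].scenario" chaos-schedule.yaml)
  INTERVAL_H=$(yq ".schedule[$i].interval_hours" chaos-schedule.yaml)
  echo "Would fire scenarios/${SCENARIO}.sh every ${INTERVAL_H}h"
done
```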

## Adding a New Scenario

1. Create `chaos/scenarios/my-scenario.sh`
2. Use env vars for all parameters (no hardcoded values)
3. Add an entry to `chaos-schedule.yaml`
4. The soak orchestrator will pick it up automatically (a hypothetical example is sketched below)
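
As an illustration only — the scenario name and parameters below are hypothetical — a new scenario and its schedule entry might look like:

```bash
#!/bin/bash
# chaos/scenarios/lease-throttle.sh — hypothetical example scenario
set -euo pipefail

# All parameters come from env vars; no hardcoded values
NAMESPACE="${NAMESPACE:-cosmos-soak}"
THROTTLE_RU="${THROTTLE_RU:-400}"

echo "[$(date '+%H:%M:%S')] Chaos: lease-throttle (RU=$THROTTLE_RU)"
# az/kubectl calls that implement the fault go here
```

```yaml
# chaos-schedule.yaml entry (hypothetical)
- scenario: lease-throttle
  interval_hours: 6
  recovery_minutes: 10
  enabled: true
  params:
    throttle_ru: 400
```
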
@@ -0,0 +1,22 @@
# Chaos schedule configuration
# The soak orchestrator reads this file to determine when to
# fire each chaos scenario.
#
# interval_hours: time between invocations of this scenario
# recovery_minutes: time to wait after chaos before checking health
# enabled: set to false to skip this scenario

schedule:
- scenario: pod-kill
interval_hours: 2
recovery_minutes: 5
enabled: true
params:
component: avad-cfp

- scenario: partition-split
interval_hours: 12
recovery_minutes: 30
enabled: true
params:
scale_factor: 2
@@ -0,0 +1,57 @@
#!/bin/bash
# Partition Split — scale feed container throughput to trigger split
set -euo pipefail

COSMOS_ACCOUNT="${COSMOS_ACCOUNT:?Set COSMOS_ACCOUNT}"
COSMOS_RG="${COSMOS_RG:?Set COSMOS_RG}"
COSMOS_DB="${COSMOS_DB:-graph_db}"
FEED_CONTAINER="${FEED_CONTAINER:-avad-test}"
SCALE_FACTOR="${SCALE_FACTOR:-2}"

echo "[$(date '+%H:%M:%S')] Chaos: partition-split"

# Get current throughput
CURRENT_RU=$(az cosmosdb sql container throughput show \
--account-name "$COSMOS_ACCOUNT" \
--resource-group "$COSMOS_RG" \
--database-name "$COSMOS_DB" \
--name "$FEED_CONTAINER" \
--query "resource.throughput" -o tsv)

TARGET_RU=$((CURRENT_RU * SCALE_FACTOR))

echo " Current feed RU: $CURRENT_RU"
echo " Scaling to: $TARGET_RU RU (${SCALE_FACTOR}x) to trigger split"

# Get pre-split partition count
PRE_SPLIT_PARTITIONS=$(az cosmosdb sql container show \
--account-name "$COSMOS_ACCOUNT" \
--resource-group "$COSMOS_RG" \
--database-name "$COSMOS_DB" \
--name "$FEED_CONTAINER" \
--query "resource.statistics[0].partitionCount" -o tsv 2>/dev/null || echo "unknown")

echo " Pre-split partition count: $PRE_SPLIT_PARTITIONS"

# Scale up
az cosmosdb sql container throughput update \
--account-name "$COSMOS_ACCOUNT" \
--resource-group "$COSMOS_RG" \
--database-name "$COSMOS_DB" \
--name "$FEED_CONTAINER" \
--throughput "$TARGET_RU" \
--output none

echo " Feed container scaled to $TARGET_RU RU"
echo " Partition split may take several minutes to complete"

# Poll for split completion by watching the partition count
WAIT_TIME=0
MAX_WAIT=1800 # 30 minutes
while [ $WAIT_TIME -lt $MAX_WAIT ]; do
  sleep 60
  WAIT_TIME=$((WAIT_TIME + 60))
  CURRENT_PARTITIONS=$(az cosmosdb sql container show \
    --account-name "$COSMOS_ACCOUNT" --resource-group "$COSMOS_RG" \
    --database-name "$COSMOS_DB" --name "$FEED_CONTAINER" \
    --query "resource.statistics[0].partitionCount" -o tsv 2>/dev/null || echo "unknown")
  if [ "$PRE_SPLIT_PARTITIONS" != "unknown" ] && [ "$CURRENT_PARTITIONS" != "unknown" ] && [ "$CURRENT_PARTITIONS" != "$PRE_SPLIT_PARTITIONS" ]; then
    echo "  Split detected: $PRE_SPLIT_PARTITIONS -> $CURRENT_PARTITIONS partitions"
    break
  fi
  echo "  Waiting for split... (${WAIT_TIME}s elapsed)"
done

echo " Partition split chaos event complete (waited ${WAIT_TIME}s)"
@@ -0,0 +1,22 @@
#!/bin/bash
# Pod Kill — kill a random CFP pod to test lease rebalancing
set -euo pipefail

NAMESPACE="${NAMESPACE:-cosmos-soak}"
COMPONENT="${COMPONENT:-avad-cfp}"
LABEL="app.kubernetes.io/component=${COMPONENT}"

echo "[$(date '+%H:%M:%S')] Chaos: pod-kill targeting $COMPONENT"

POD=$(kubectl get pods -n "$NAMESPACE" -l "$LABEL" \
--field-selector=status.phase=Running \
-o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | shuf -n 1)

if [ -z "$POD" ]; then
echo " No running pods found for $LABEL"
exit 0
fi

echo " Killing pod: $POD"
kubectl delete pod "$POD" -n "$NAMESPACE" --grace-period=0 --force
echo " Pod $POD killed"
17 changes: 17 additions & 0 deletions sdk/cosmos/azure-cosmos-benchmark/avad-soak/config.json
@@ -0,0 +1,17 @@
{
"cosmos": {
"endpoint": "https://abhm-cfp-region-test.documents.azure.com:443/",
"regionalEndpoint": "",
"database": "graph_db",
"feedContainer": "avad-test",
"leaseContainer": "avad-test-leases",
"preferredRegion": "West Central US"
},
"ingestor": {
"opsPerSec": 500,
"docSizeBytes": 512,
"logicalPartitionCount": 1000,
"durationSeconds": 1800,
"workerCount": 2
}
}
76 changes: 76 additions & 0 deletions sdk/cosmos/azure-cosmos-benchmark/avad-soak/infra/README.md
@@ -0,0 +1,76 @@
# Cosmos DB Soak Test — Infrastructure

Reusable Helm chart and setup scripts for running Cosmos DB
change feed processor soak tests on AKS.

## Prerequisites

Before deploying, create the required Kubernetes secrets:

```bash
# Cosmos DB key secret (referenced by Helm chart)
kubectl create secret generic <release-name>-secrets \
--namespace cosmos-soak \
--from-literal=cosmos-key="<your-cosmos-key>"

# ACR pull secret (if not using AKS-managed ACR attachment)
kubectl create secret docker-registry acr-secret \
--namespace cosmos-soak \
--docker-server=<acr-name>.azurecr.io \
--docker-username=<sp-id> \
--docker-password=<sp-password>
```

If using AKS with `--attach-acr`, the `acr-secret` is not needed
and can be removed from the chart templates.

## Quick Start

```bash
# 1. Create AKS cluster
./scripts/setup-aks.sh

# 2. Create Cosmos containers
./scripts/setup-cosmos.sh

# 3. Build + push image to ACR
./scripts/setup-acr.sh

# 4. Create secrets (see Prerequisites above)

# 5. Deploy (from repo root)
cd ../..
./run-soak.sh
```

## What This Provides

| Component | Description |
|-----------|-------------|
| `chart/` | Helm chart with templated Deployments, StatefulSets, ConfigMaps, probes |
| `scripts/setup-aks.sh` | AKS cluster provisioning |
| `scripts/setup-cosmos.sh` | Cosmos containers (feed, lease, reconciliation, health) |
| `scripts/setup-acr.sh` | ACR creation + image build/push |

## Reusing for Your Own Workload

1. Build a container image with your workload logic
2. Implement HTTP endpoints: `/health` (liveness), `/ready`
(readiness), `/metrics` (optional)
3. Create a `values-myworkload.yaml` (see the sketch after this list) overriding:
- `image.repository` / `image.tag`
- `cosmos.*` (endpoint, containers, etc.)
- `avadConsumer.replicas` / `lvConsumer.replicas`
4. Deploy: `helm upgrade --install my-soak ./infra/chart -f values-myworkload.yaml`
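
A minimal sketch of such an override file. The key names follow this chart's ConfigMap template and the list above, but the image coordinates and Cosmos values are placeholders you must replace:

```yaml
# values-myworkload.yaml — illustrative override file
image:
  repository: myacr.azurecr.io/my-workload
  tag: "1.0.0"
cosmos:
  endpoint: "https://my-account.documents.azure.com:443/"
  database: my_db
  feedContainer: my-feed
  leaseContainer: my-feed-leases
  preferredRegion: "West US 2"
avadConsumer:
  replicas: 2
lvConsumer:
  replicas: 2
```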

## Azure Resources

Override default resource names via environment variables
in each script:

```bash
export SUBSCRIPTION="<your-subscription-id>"
export RG="<your-resource-group>"
export AKS_CLUSTER="<your-aks-name>"
export ACR_NAME="<your-acr-name>"
```
@@ -0,0 +1,6 @@
apiVersion: v2
name: cosmos-soak
description: Reusable Helm chart for Cosmos DB soak testing on AKS
version: 0.1.0
appVersion: "1.0"
type: application
@@ -0,0 +1,68 @@
{{- define "cosmos-soak.labels" -}}
app.kubernetes.io/name: {{ .Chart.Name }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}

{{- define "cosmos-soak.selectorLabels" -}}
app.kubernetes.io/name: {{ .Chart.Name }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

{{- define "cosmos-soak.cosmosEnv" -}}
- name: COSMOS_ENDPOINT
valueFrom:
configMapKeyRef:
name: {{ .Release.Name }}-config
key: endpoint
- name: COSMOS_KEY
valueFrom:
secretKeyRef:
name: {{ .Release.Name }}-secrets
key: cosmos-key
- name: COSMOS_DATABASE
valueFrom:
configMapKeyRef:
name: {{ .Release.Name }}-config
key: database
- name: COSMOS_FEED_CONTAINER
valueFrom:
configMapKeyRef:
name: {{ .Release.Name }}-config
key: feedContainer
- name: COSMOS_LEASE_CONTAINER
valueFrom:
configMapKeyRef:
name: {{ .Release.Name }}-config
key: leaseContainer
- name: COSMOS_PREFERRED_REGION
valueFrom:
configMapKeyRef:
name: {{ .Release.Name }}-config
key: preferredRegion
- name: OPS_PER_SEC
valueFrom:
configMapKeyRef:
name: {{ .Release.Name }}-config
key: opsPerSec
- name: DOC_SIZE_BYTES
valueFrom:
configMapKeyRef:
name: {{ .Release.Name }}-config
key: docSizeBytes
- name: LOGICAL_PARTITION_COUNT
valueFrom:
configMapKeyRef:
name: {{ .Release.Name }}-config
key: logicalPartitionCount
- name: DURATION_SECONDS
valueFrom:
configMapKeyRef:
name: {{ .Release.Name }}-config
key: durationSeconds
- name: WORKER_COUNT
valueFrom:
configMapKeyRef:
name: {{ .Release.Name }}-config
key: workerCount
{{- end }}
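
The chart's workload templates (not shown in this hunk) would consume these helpers along these lines — a sketch only, with the container name and image values assumed:

```yaml
# Sketch: fragment of a Deployment/StatefulSet template using the helpers above
spec:
  template:
    metadata:
      labels:
        {{- include "cosmos-soak.selectorLabels" . | nindent 8 }}
    spec:
      containers:
        - name: avad-cfp
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          env:
            {{- include "cosmos-soak.cosmosEnv" . | nindent 12 }}
```
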
@@ -0,0 +1,18 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: {{ .Release.Name }}-config
namespace: {{ .Values.namespace }}
labels:
{{- include "cosmos-soak.labels" . | nindent 4 }}
data:
endpoint: {{ .Values.cosmos.endpoint | quote }}
database: {{ .Values.cosmos.database | quote }}
feedContainer: {{ .Values.cosmos.feedContainer | quote }}
leaseContainer: {{ .Values.cosmos.leaseContainer | quote }}
preferredRegion: {{ .Values.cosmos.preferredRegion | quote }}
opsPerSec: {{ .Values.cosmos.opsPerSec | quote }}
docSizeBytes: {{ .Values.cosmos.docSizeBytes | quote }}
logicalPartitionCount: {{ .Values.cosmos.logicalPartitionCount | quote }}
durationSeconds: {{ .Values.cosmos.durationSeconds | quote }}
workerCount: {{ .Values.cosmos.workerCount | quote }}