
Per-Partition Automatic Failover: Faster detection of per-partition write region through availability strategy for writes. #48421

Draft
jeet1995 wants to merge 3 commits into Azure:main from jeet1995:AzCosmos_WriteAvailabilityStrategyForPPAF
@jeet1995 jeet1995 force-pushed the AzCosmos_WriteAvailabilityStrategyForPPAF branch 2 times, most recently from 72cd8e0 to acbf49c Compare March 14, 2026 22:35
…riter accounts

Enable proactive write hedging for Per-Partition Automatic Failover (PPAF) on single-writer
Cosmos DB accounts. When a write to the primary region is slow or failing, the SDK now hedges
the write to a read region. This reduces time-to-recovery from 60-120s on the retry-based path
to roughly the hedging threshold (~1s with the default config).
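
For readers unfamiliar with hedging, the idea can be sketched with plain JDK futures. This is only an illustration under stated assumptions: the SDK itself is built on Reactor, and `WriteHedgingSketch`, `hedge`, and both supplier parameters are hypothetical names that do not appear in this PR.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

// Hypothetical illustration of threshold-based hedging; not the SDK's implementation.
final class WriteHedgingSketch {

    /**
     * Starts the primary write immediately. If no attempt has succeeded once the
     * threshold elapses, starts the hedged write. The returned future completes with
     * the first successful attempt, or fails once both attempts have failed.
     */
    static <T> CompletableFuture<T> hedge(
            Supplier<CompletableFuture<T>> primaryWrite,
            Supplier<CompletableFuture<T>> hedgedWrite,
            Duration threshold) {

        CompletableFuture<T> result = new CompletableFuture<>();
        AtomicInteger failedAttempts = new AtomicInteger();

        // Shared completion handler: first success wins, second failure is terminal.
        BiConsumer<T, Throwable> onAttemptDone = (value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (failedAttempts.incrementAndGet() == 2) {
                result.completeExceptionally(error);
            }
        };

        primaryWrite.get().whenComplete(onAttemptDone);

        // After the threshold, hedge only if the primary has not already succeeded.
        Executor timer =
            CompletableFuture.delayedExecutor(threshold.toMillis(), TimeUnit.MILLISECONDS);
        timer.execute(() -> {
            if (!result.isDone()) {
                hedgedWrite.get().whenComplete(onAttemptDone);
            }
        });

        return result;
    }
}
```

In this sketch a primary that completes before the threshold never triggers the second attempt, while a primary that hangs or fails is covered by the hedged attempt after one threshold interval.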

## Problem

In PPAF-enabled single-writer accounts, when a partition fails over, the SDK learns about it
only from error signals (503, 408, 410). It can take 60-120s before the retry-based path in
GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover marks the region as failed for
that partition.

## Solution

Plug the existing availability strategy (hedging) machinery into the write path for PPAF:

1. **Speculation gating** (RxDocumentClientImpl.getApplicableRegionsForSpeculation):
   - Relax the canUseMultipleWriteLocations() gate for PPAF single-writer accounts
   - Relax the isIdempotentWriteRetriesEnabled gate (PPAF provides partition-level consistency)
   - Use ALL account-level read regions (getAvailableReadRoutingContexts) as hedge candidates,
     not just preferred regions — PPAF failover can target any read region

2. **Routing** (tryAddPartitionLevelLocationOverride + CrossRegionAvailabilityContext):
   - Add ppafWriteHedgeTargetRegion field to CrossRegionAvailabilityContextForRxDocumentServiceRequest
   - In tryAddPartitionLevelLocationOverride: when ppafWriteHedgeTargetRegion is set, create the
     ConcurrentHashMap entry via computeIfAbsent and route via hedgeFailoverInfo.getCurrent()
   - This is synchronous and deterministic: the ConcurrentHashMap is updated in the same request pipeline
   - Thread safety: uses getCurrent() from the computeIfAbsent result (not raw hedgeTarget)
     to avoid routing to a region the concurrent retry path may have marked as failed

3. **Default E2E policy** (evaluatePpafEnforcedE2eLatencyPolicyCfgForWrites):
   - Mirrors the read defaults exactly — symmetric hedging behavior for reads and writes
   - Only applied to point write operations (batch excluded via isPointOperation gate)
   - DIRECT: timeout=networkRequestTimeout+1s, threshold=min(timeout/2, 1s), step=500ms
   - GATEWAY: timeout=min(6s, httpTimeout), threshold=min(timeout/2, 1s), step=500ms

4. **Safety lever** (Configs.isWriteAvailabilityStrategyEnabledWithPpaf):
   - System property COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF (default: true)
   - Allows opt-out without code changes if regression is observed
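
The default values in item 3 can be worked through with a small calculation. The sketch below is an assumption-laden illustration, not the body of evaluatePpafEnforcedE2eLatencyPolicyCfgForWrites: the class name `PpafWriteHedgingDefaults`, its factory methods, and the sample timeouts are hypothetical, and only the formulas come from the description above.

```java
import java.time.Duration;

// Hypothetical holder for the defaults described in item 3; names are illustrative.
final class PpafWriteHedgingDefaults {
    final Duration timeout;
    final Duration threshold;
    final Duration step;

    private PpafWriteHedgingDefaults(Duration timeout) {
        this.timeout = timeout;
        // threshold = min(timeout / 2, 1s)
        this.threshold = min(timeout.dividedBy(2), Duration.ofSeconds(1));
        // step = 500ms in both connection modes
        this.step = Duration.ofMillis(500);
    }

    // DIRECT: timeout = networkRequestTimeout + 1s
    static PpafWriteHedgingDefaults forDirect(Duration networkRequestTimeout) {
        return new PpafWriteHedgingDefaults(networkRequestTimeout.plusSeconds(1));
    }

    // GATEWAY: timeout = min(6s, httpTimeout)
    static PpafWriteHedgingDefaults forGateway(Duration httpTimeout) {
        return new PpafWriteHedgingDefaults(min(Duration.ofSeconds(6), httpTimeout));
    }

    private static Duration min(Duration a, Duration b) {
        return a.compareTo(b) <= 0 ? a : b;
    }
}
```

With an example networkRequestTimeout of 5s, the DIRECT timeout is 6s and the threshold is min(3s, 1s) = 1s. With an example httpTimeout of 1s, the GATEWAY timeout is 1s and the threshold drops to 500ms.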

## Files changed (6)

- Configs.java: Write availability strategy PPAF config flag
- RxDocumentClientImpl.java: Speculation gating, region resolution, write E2E policy
- CrossRegionAvailabilityContextForRxDocumentServiceRequest.java: ppafWriteHedgeTargetRegion field
- ClientRetryPolicy.java: Honor ppafWriteHedgeTargetRegion in tryAddPartitionLevelLocationOverride
- GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover.java: Hedge target handling
  in tryAddPartitionLevelLocationOverride with computeIfAbsent + getCurrent()
- PerPartitionAutomaticFailoverE2ETests.java: 26 new test cases
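
The computeIfAbsent + getCurrent() handling listed for the PPAF manager can be illustrated roughly as follows. This is a simplified sketch under assumptions: `HedgeRoutingSketch`, `FailoverInfo`, `resolveRegion`, and `simulateRetryPathFailover` are hypothetical stand-ins, partition key ranges are reduced to string IDs, and the region names are examples only.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model of per-partition failover routing.
final class HedgeRoutingSketch {

    /** Stand-in for the per-partition failover state tracked by the PPAF manager. */
    static final class FailoverInfo {
        private final AtomicReference<String> current;

        FailoverInfo(String initialRegion) {
            this.current = new AtomicReference<>(initialRegion);
        }

        String getCurrent() {
            return current.get();
        }

        void moveTo(String region) {
            current.set(region);
        }
    }

    private final ConcurrentHashMap<String, FailoverInfo> partitionKeyRangeToFailoverInfo =
        new ConcurrentHashMap<>();

    /**
     * Creates the entry for the hedge target if none exists, then routes to whatever
     * the entry currently points at, which may differ from the raw hedge target.
     */
    String resolveRegion(String partitionKeyRangeId, String hedgeTargetRegion) {
        FailoverInfo info = partitionKeyRangeToFailoverInfo.computeIfAbsent(
            partitionKeyRangeId, id -> new FailoverInfo(hedgeTargetRegion));
        return info.getCurrent();
    }

    /** Simulates the concurrent retry path moving a partition to another region. */
    void simulateRetryPathFailover(String partitionKeyRangeId, String newRegion) {
        FailoverInfo info = partitionKeyRangeToFailoverInfo.get(partitionKeyRangeId);
        if (info != null) {
            info.moveTo(newRegion);
        }
    }
}
```

Reading the region back from the map entry, rather than returning the hedge target passed in, is what lets a hedged request follow a region change made by the retry path in the meantime.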

## Test coverage

| Op      | DIRECT (mocked transport) | GATEWAY (mocked HttpClient) |
|---------|--------------------------|----------------------------|
| Create  | 410/21005 + 503/21008    | delayed write region       |
| Replace | 410/21005                | delayed write region       |
| Upsert  | 410/21005                | delayed write region       |
| Delete  | 410/21005                | delayed write region       |
| Patch   | 410/21005                | delayed write region       |

Additional tests:
- Opt-out via COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF=false
- Batch bypass verification (batch uses retry-based PPAF, not hedging)
- Explicit ConcurrentHashMap verification: after hedge success, asserts the PPAF manager's
  partitionKeyRangeToFailoverInfo entry points to a region != the failed write region
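
The opt-out lever exercised by the first test above is a JVM system property. A minimal sketch of reading such a flag is shown here; how Configs.isWriteAvailabilityStrategyEnabledWithPpaf actually parses it is not shown in this PR description, so the class `PpafWriteHedgingOptOut` and its parsing logic are assumptions. Only the property name and the default of true come from the description.

```java
// Hypothetical reader for the opt-out flag; the real logic lives in Configs.java.
final class PpafWriteHedgingOptOut {
    static final String PROPERTY = "COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF";

    /** Returns true unless the system property is set to something other than "true". */
    static boolean isEnabled() {
        return Boolean.parseBoolean(System.getProperty(PROPERTY, "true"));
    }
}
```

Under this assumption, starting the JVM with `-DCOSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF=false` would switch write hedging off without a code change.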

All assertions are exact match: 2 regions before failover, 1 region after failover.
165 tests total (existing + new), 0 regressions, 0 modifications to existing test logic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995 jeet1995 force-pushed the AzCosmos_WriteAvailabilityStrategyForPPAF branch from b07dde3 to 46125f4 Compare March 14, 2026 23:28
@jeet1995 jeet1995 changed the title from "Az cosmos write availability strategy for ppaf" to "Per-Partition Automatic Failover: Faster detection of per-partition write region through availability strategy for writes." Mar 16, 2026
jeet1995 and others added 2 commits March 16, 2026 14:21
…ible error codes

Add 34 new test configurations to write availability strategy hedging
tests covering all error codes from the base PPAF E2E test suite:

DIRECT mode:
- 503/21008 (SERVICE_UNAVAILABLE) for Replace, Upsert, Delete, Patch
- 403/3 (FORBIDDEN_WRITEFORBIDDEN) for all 5 write ops
- 408/UNKNOWN (REQUEST_TIMEOUT) for all 5 write ops

GATEWAY mode:
- 403/3 (FORBIDDEN_WRITEFORBIDDEN) for all 5 write ops
- 408/UNKNOWN (REQUEST_TIMEOUT) for all 5 write ops
- 408/GATEWAY_ENDPOINT_READ_TIMEOUT (network error) for all 5 write ops
- 503/GATEWAY_ENDPOINT_UNAVAILABLE (network error) for all 5 write ops

Parameterize gateway test method to accept error codes instead of
hardcoding 503. Extend setupHttpClientToThrowCosmosException to support
combined delay + network error mode for gateway-specific fault types.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>