Commit 5758091
Add gRPC client and worker connection resiliency (#135)
* Add gRPC resiliency design spec
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Ignore local worktrees
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add gRPC resiliency implementation plan
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add gRPC resiliency option types
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add gRPC resiliency validation tests
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Thread gRPC resiliency options through constructors
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Strengthen retained client state tests
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add shared gRPC resiliency helpers
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add completion long-poll resiliency test
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add gRPC resiliency edge-case tests
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Harden worker gRPC stream reconnect behavior
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix worker channel cleanup on teardown
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add worker silent disconnect tests
Extend worker resiliency coverage with an end-to-end silent-disconnect recovery test and an explicit reconnect backoff assertion.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add sync client gRPC channel recreation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Reset sync client long-poll failure tracking
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add sync client recreation input test
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add sync client recreation test coverage
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add async client gRPC channel recreation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add async channel recreation transport test
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add gRPC connection resiliency
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Remove repo-wide pytest importlib addopts
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Update gRPC resiliency plan tracking
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix worker channel retirement for in-flight completions
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix worker shutdown channel draining
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Rename Azure Managed gRPC resiliency test module
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix sync client channel cleanup
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Address automated review feedback
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Remove superpowers docs from PR
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Address andystaples PR review feedback
- Make FailureTracker thread-safe with an internal lock so multi-threaded
sync clients can't race the consecutive-failure counter (review [3/10]).
- Track _AsyncWorkerManager pool shutdown via an explicit _pool_is_shutdown
flag instead of reading ThreadPoolExecutor._shutdown (CPython private API,
review [4/10]).
- Collapse identical wrap_execution/wrap_cancellation closures in the worker
stream loop into a single wrap_with_release helper (review [5/10]).
- Promote the retired-channel close delay and jitter exponent cap to named
module-level constants (review [7/10]).
- Key _InFlightChannelTracker on the channel object instead of id(channel)
so the lifetime invariant is local to the tracker (review [9/10]).
- Rename TaskHubGrpcWorker._can_recreate_channel() to the existing
_owns_channel attribute used by the clients, so both files use the same
name for the same concept (review [2/10]).
- Add regression tests for FailureTracker concurrency and for thread-pool
recreation after manager shutdown.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
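The thread-safety fix for FailureTracker can be pictured with a minimal sketch. Only the `threshold` keyword comes from this commit; the method names `record_failure` / `record_success` and the `consecutive_failures` property are hypothetical and may not match the real class.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FailureTracker:
    """Counts consecutive failures; safe to share across threads.

    Hypothetical API for illustration, not the class from this PR.
    """

    def __init__(self, *, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be >= 1")
        self._threshold = threshold
        self._count = 0
        self._lock = threading.Lock()

    def record_failure(self) -> bool:
        """Increment the counter; return True once the threshold is reached."""
        with self._lock:
            self._count += 1
            return self._count >= self._threshold

    def record_success(self) -> None:
        """Any success breaks the streak of consecutive failures."""
        with self._lock:
            self._count = 0

    @property
    def consecutive_failures(self) -> int:
        with self._lock:
            return self._count


tracker = FailureTracker(threshold=1000)
with ThreadPoolExecutor(max_workers=8) as pool:
    # Eight workers each record 125 failures concurrently.
    results = list(
        pool.map(lambda _: [tracker.record_failure() for _ in range(125)], range(8))
    )
```

Because the increment and the threshold comparison happen under the same lock, exactly one caller observes the threshold being reached, which is the property a recreate trigger would rely on.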
* Refactor client resiliency to use UnaryUnaryClientInterceptor
Centralize client failure tracking and channel-recreate triggering in a
`ClientResiliencyInterceptor` (sync) and `AsyncClientResiliencyInterceptor`
(async) instead of the per-call `_invoke_unary` indirection. This addresses
feedback [1/10] on PR #135: resiliency wiring now lives in one place and the
call sites read as normal stub calls.
- Add `ClientResiliencyInterceptor` and `AsyncClientResiliencyInterceptor`
in `durabletask/internal/grpc_resiliency.py`.
- Switch `LONG_POLL_METHODS` and `is_client_transport_failure` to use full
gRPC method paths (`/TaskHubSidecarService/...`) so the interceptor can
match the `method` field on `ClientCallDetails` directly.
- Wire the resiliency interceptor into `TaskHubGrpcClient` and
`AsyncTaskHubGrpcClient`: it is always prepended (defensive copy of any
user interceptors) and re-applied on every channel recreate so all unary
calls flow through it.
- Remove both `_invoke_unary` methods and revert all 34 call sites to
ordinary `self._stub.MethodName(req)` (or `await ...` for async).
- Caller-owned channels (sync and async) deliberately bypass the resiliency
interceptor since they are never recreated; this preserves the caller's
exact channel reference and avoids `grpc.aio`'s lack of a public
`intercept_channel` equivalent.
- Add test shims (`_ResilientSyncTestStub`/`_ResilientAsyncTestStub` plus
`install_resilient_test_stubs`) so tests that patch
`stubs.TaskHubSidecarServiceStub` with `MagicMock` still observe the
failure-tracking pipeline.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
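The "always prepended (defensive copy of any user interceptors)" rule might look like the sketch below. The helper name `with_resiliency_first` is hypothetical, and plain objects stand in for real gRPC interceptors so the snippet runs without grpc installed.

```python
from collections.abc import Sequence


def with_resiliency_first(
    resiliency: object, user_interceptors: Sequence[object] = ()
) -> list[object]:
    """Return a new list with the resiliency interceptor in front.

    The caller's sequence is copied, never mutated, so the same user
    interceptors can be re-applied on every channel recreate.
    """
    return [resiliency, *user_interceptors]


# Stand-ins for interceptor instances (assumption: any objects will do here).
resiliency = object()
user_supplied = [object(), object()]

first_chain = with_resiliency_first(resiliency, user_supplied)
second_chain = with_resiliency_first(resiliency, user_supplied)  # e.g. after a recreate
```

Building a fresh list per channel keeps the resiliency interceptor outermost each time without accumulating duplicates in the user's list.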
* Use inspect.isawaitable in AsyncClientResiliencyInterceptor
Replace the ad-hoc `hasattr(result, '__await__')` check in
`AsyncClientResiliencyInterceptor._record_outcome` with the canonical
`inspect.isawaitable` predicate, and tighten the `on_recreate` callback
annotation to `Callable[[], Union[None, Awaitable[object]]]` so it reflects
the actual contract (sync callbacks return None, async callbacks return an
Awaitable that we await).
Addresses the github-code-quality 'Statement has no effect' warning surfaced
on PR #135 by making the awaitable check explicit and type-driven.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
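A minimal sketch of the `inspect.isawaitable` pattern described above, assuming a hypothetical helper `run_on_recreate`; only the callback annotation is taken from the commit message.

```python
import asyncio
import inspect
from collections.abc import Awaitable, Callable
from typing import Union

# Annotation quoted from the commit message.
OnRecreate = Callable[[], Union[None, Awaitable[object]]]


async def run_on_recreate(on_recreate: OnRecreate) -> None:
    """Call the callback and await its result only when it is awaitable."""
    result = on_recreate()
    if inspect.isawaitable(result):
        await result


calls: list[str] = []


def sync_callback() -> None:
    calls.append("sync")


async def async_callback() -> None:
    await asyncio.sleep(0)
    calls.append("async")


async def main() -> None:
    await run_on_recreate(sync_callback)
    await run_on_recreate(async_callback)


asyncio.run(main())
```

Both callback styles run to completion through the same code path, and the check relies on the standard-library predicate rather than on probing for `__await__` by hand.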
* Discard awaitable result with explicit underscore in resiliency interceptor
CodeQL's `py/ineffectual-statement` heuristic re-flagged `await result` in
`AsyncClientResiliencyInterceptor._record_outcome` after the previous fix:
the rule treats expression statements whose value is discarded as unused, and
does not recognise that `await` is always a side-effecting suspension point
(the whole purpose of the call is to run the async recreate callback to
completion).
Rewriting the line as `_ = await result` keeps the exact same runtime
behaviour but documents the intent (return value intentionally discarded) and
satisfies the linter.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fire-and-forget channel recreate; hoist client state; tighten async interceptor exception handling
Addresses three follow-up review comments on the resiliency interceptor refactor:
[1/3] Channel recreate now runs fire-and-forget (daemon thread for sync,
asyncio.create_task for async). The original RPC error propagates to the
caller without being delayed by DNS, TLS handshake, or contention on
_recreate_lock. A client-side single-flight guard avoids spawning duplicate
work when many failures land in a burst; the existing cooldown still
prevents thrash. close() waits for any in-flight recreate to finish so the
teardown path stays deterministic. A _recreate_done_event (test seam) lets
tests synchronise on completion without polling.
[2/3] Hoisted _closing, _recreate_lock, _last_recreate_time,
_retired_channels / _retired_channel_close_tasks above ClientResiliencyInterceptor
construction in both __init__ methods so the bound recreate callback is safe
to invoke at any time during construction.
[3/3] AsyncClientResiliencyInterceptor now uses 'except Exception' (so
asyncio.CancelledError, KeyboardInterrupt and SystemExit propagate
unchanged) and mirrors the sync interceptor's policy by resetting the
failure counter on non-AioRpcError exceptions. _record_outcome is now
synchronous on both interceptors because the on_recreate callback no
longer awaits the recreate.
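Why 'except Exception' lets cancellation through can be checked with a small standalone sketch. The `guarded` helper is hypothetical and contains no gRPC code; it only relies on `asyncio.CancelledError` deriving from `BaseException` (Python 3.8+).

```python
import asyncio


async def guarded(coro_fn):
    """Run coro_fn, handling ordinary errors but not cancellation."""
    try:
        return await coro_fn()
    except Exception as exc:  # does not match CancelledError on Python 3.8+
        return f"handled {type(exc).__name__}"


async def fail():
    raise ValueError("boom")


async def hang():
    await asyncio.sleep(3600)


async def main():
    handled = await guarded(fail)
    task = asyncio.create_task(guarded(hang))
    await asyncio.sleep(0)  # let the task start and block inside sleep()
    task.cancel()
    try:
        await task
        cancelled = False
    except asyncio.CancelledError:
        cancelled = True
    return handled, cancelled


handled, cancelled = asyncio.run(main())
```

The ordinary error is absorbed by the handler, while the cancellation reaches the awaiting caller unchanged.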
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
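The fire-and-forget recreate with a single-flight guard and cooldown, as described in [1/3], could be sketched for the sync case as follows. `RecreateScheduler` and its members are hypothetical names; the real client keeps this state on the client object itself.

```python
import threading
import time
from collections.abc import Callable


class RecreateScheduler:
    """Runs `recreate` on a daemon thread, at most one at a time."""

    def __init__(self, recreate: Callable[[], None], *, cooldown_seconds: float) -> None:
        self._recreate = recreate
        self._cooldown = cooldown_seconds
        self._lock = threading.Lock()
        self._in_flight = False
        self._last_recreate_time: float | None = None
        self._thread: threading.Thread | None = None
        self.done_event = threading.Event()  # test seam: set when a run finishes

    def schedule(self) -> bool:
        """Start a background recreate unless one is running or cooling down."""
        with self._lock:
            now = time.monotonic()
            cooling = (
                self._last_recreate_time is not None
                and now - self._last_recreate_time < self._cooldown
            )
            if self._in_flight or cooling:
                return False
            self._in_flight = True
            self.done_event.clear()
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()
            return True

    def _run(self) -> None:
        try:
            self._recreate()
        finally:
            with self._lock:
                self._in_flight = False
                self._last_recreate_time = time.monotonic()
            self.done_event.set()

    def close(self) -> None:
        """Wait for any in-flight recreate so teardown stays deterministic."""
        thread = self._thread
        if thread is not None:
            thread.join()


release = threading.Event()
runs: list[int] = []


def slow_recreate() -> None:
    release.wait(timeout=5)  # simulates DNS / TLS handshake latency
    runs.append(1)


scheduler = RecreateScheduler(slow_recreate, cooldown_seconds=60.0)
burst = [scheduler.schedule() for _ in range(5)]  # a burst of failures
release.set()
scheduler.done_event.wait(timeout=5)
during_cooldown = scheduler.schedule()  # rejected: still inside the cooldown
scheduler.close()
```

`schedule()` returns immediately, so the caller that hit the RPC error is never delayed by the recreate work.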
* Modernize PR-new code for Python 3.10+ baseline
Apply three Python 3.10+ idioms to code introduced by this PR:
- PEP 604 union syntax: replace Optional[X] with X | None in grpc_options.py and grpc_resiliency.py.
- Dataclass tightening: add slots=True, kw_only=True to FailureTracker, GrpcRetryPolicyOptions, GrpcChannelOptions, GrpcWorkerResiliencyOptions, and GrpcClientResiliencyOptions. Update the two positional FailureTracker(...) call sites in client.py to use threshold=... kwargs.
- PEP 617 parenthesized context managers: rewrite the 10 chained 'with patch(...), patch(...):' blocks added by this PR in test_client.py. Pre-existing chained sites are left untouched to keep the diff surgical.
Internal-only change (no public API or behavior impact). All 85 resiliency-focused tests pass; the broader non-e2e suite reports 232 passed and 8 skipped.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: Bernd Verst <beverst@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: andystaples <77818326+andystaples@users.noreply.github.com>
1 parent 75e916b commit 5758091
15 files changed
Lines changed: 2844 additions & 99 deletions
File tree
- durabletask-azuremanaged
- durabletask/azuremanaged
- durabletask
- internal
- tests
- durabletask-azuremanaged
- durabletask
0 commit comments