From e5c22881377e478d4f140447b3810d848a421cd1 Mon Sep 17 00:00:00 2001 From: NotAProfDev <84450364+NotAProfDev@users.noreply.github.com> Date: Tue, 30 Jun 2026 18:16:05 +0000 Subject: [PATCH 1/4] docs(adr): WebSocket transport contract (ADR-0032) Fix the oath-adapter-net-ws-api contract that ADR-0029 deferred as "a deliberate later session", mirroring ADR-0030 for HTTP and grounded in IBKR's Client Portal WebSocket: - untyped duplex frame channel; subscription grammar + demux adapter-side - asymmetric shape: Stream recv / RPITIT one-shot send, split owned halves - minimal Frame enum; default stack delivers only data frames to the adapter - separate epoch-stamped lifecycle channel (feed-down is first-class) - recovery split: transport reconnects, adapter replays + differential reconcile - uniform no-silent-drop backpressure guarantee; per-stream policy adapter-side - WsConnector leaf over tokio-tungstenite + rustls - per-transport AuthSource trait, one shared IbkrAuthSource impl Resilience layer stack follows in ADR-0033. Co-Authored-By: Claude Opus 4.8 (1M context) --- ...nsport-contract-duplex-frames-lifecycle.md | 258 ++++++++++++++++++ 1 file changed, 258 insertions(+) create mode 100644 docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md diff --git a/docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md b/docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md new file mode 100644 index 0000000..82cedd1 --- /dev/null +++ b/docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md @@ -0,0 +1,258 @@ +# WebSocket transport contract: untyped duplex frame channel, asymmetric `Stream`/RPITIT split, out-of-band lifecycle + +[ADR-0029](0029-network-adapter-stack-transport-split-compile-time-composition.md) +split the net layer by transport, placed `Service` and the request/reply contracts in +`oath-adapter-net-http-api`, and deferred the WebSocket transport to "a deliberate later +session" — guaranteeing only that the kernel was ready for it (`Layer` machinery + +`ErrorKind` + `Timer` all apply unchanged). This ADR is that session: it fixes the +**`oath-adapter-net-ws-api` contract** — the streaming connection-shape the kernel's +`Service` could not model, what it carries, the leaf seam a backend implements, and the +backend — driven by the first [Broker](../../CONTEXT.md)/[Data Provider](../../CONTEXT.md), +IBKR's Client Portal Web API WebSocket. It mirrors [ADR-0030](0030-http-transport-contract-wire-bytes-streaming-composition.md) +for HTTP. The resilience layers that wrap this contract (reconnect, heartbeat, the +bounded-buffer mechanism §6 names) are specified in **ADR-0033**. + +## The grounding case — IBKR Client Portal WebSocket + +The WS endpoint (`wss://api.ibkr.com/v1/api/ws`, or `localhost:5000` via the gateway) +has **no independent authentication**: it authorizes only by replaying the authenticated +gateway session — the `set-cookie` cookies (gateway mode, e.g. `x-sess-uuid`) or the +`session` value from the REST `/tickle` keepalive. The session idles out after ~5 min +unless `/tickle` is called (~every 60s). So the WS rides **on top of** the REST session, +which is why HTTP (ADR-0030) was built first. Market data is **conflated server-side to +~500ms per instrument** over ~100 subscription lines; `smd` subscriptions **self-terminate +after 10 minutes**. These are reference data for the adapter, not domain terms. + +## Decision + +### 1. Untyped duplex **frame** channel; subscription grammar and demux stay in the adapter + +`net-ws-api` is pure frame transport: a connection over which the adapter **sends** frames +(subscription commands) and **receives** a stream of frames. The transport knows nothing +of subscriptions, topics, or `conid`s. IBKR's grammar (`smd+`/`smh+`/`umd-`), JSON parsing, +the `{"topic":"system","hb":…}` application heartbeat, and **demux** of the one multiplexed +frame stream into per-instrument canonical streams all live in `oath-adapter-ibkr`. This is +the [ADR-0030](0030-http-transport-contract-wire-bytes-streaming-composition.md) §1 / ADR-0003 +anti-corruption boundary carried into WS: a venue's wire grammar must not leak into a shared +crate, and `net-ws-api` stays reusable for a future non-IBKR streaming feed. + +A consequence used throughout: the transport is **grammar-blind**. It cannot distinguish a +market-data frame from an order/execution frame — that classification *is* the demux the +adapter owns — and §5/§6 both turn on this. + +### 2. Asymmetric shape: `Stream` recv, RPITIT send, split owned halves + +`Service` models request→one-reply and cannot model "subscribe → many frames over time" +(ADR-0029 §2). The contract is therefore **asymmetric by operation shape** — the same move +ADR-0030 §2 made (buffered request body / streaming response body): + +```rust +fn connect(&self, handshake: http::Request<()>) + -> impl Future> + Send; + +// recv half — receiving is stream-shaped: +WsSource: impl Stream> + Send + +// send half — sending one frame is request-shaped (one-shot), NOT a Sink: +trait WsSink { fn send(&mut self, f: Frame) -> impl Future> + Send; + fn close(&mut self) -> impl Future> + Send; } +``` + +- **recv is `futures_core::Stream`**, not a hand-rolled pull iterator. This is the + `http_body::Body` precedent (ADR-0030 §3 took a `poll_frame`/`Pin` streaming trait plus + `pin-project-lite` and rejected hand-writing the state machine). `impl Stream + Send` is + monomorphised, zero-box, not object-safe — it honours ADR-0029 §5 (no `dyn`, no + `async-trait`, no per-call alloc) exactly as `impl Future` does; §5 forbids boxing, not + poll-shaped traits. It gives the ADR-0033 reconnect/heartbeat layers `StreamExt`/`unfold` + (async-closure wrapping, idle `timeout`) instead of manual `poll_next`. +- **send is RPITIT one-shot**, mirroring `HttpClient::send` — deliberately **not** `Sink`, + whose `poll_ready`/`start_send`/`poll_flush` *is* the poll-handshake the `Service` design + walked away from. Subscribe/heartbeat traffic is low-volume; no `Sink` backpressure + handshake is warranted. +- **Split owned halves.** `connect` yields a send half and a recv half that move to separate + tasks (IBKR needs concurrent send of subscribe/heartbeat and receive of frames). They are + single-owner — recv is exclusive `&mut self` (inherent in `Stream::poll_next`'s + `Pin<&mut Self>`), not the shared `&self` of `Service`. This is the identity-wrap of + tungstenite's `WebSocketStream::split()` (the ADR-0030 §7 "leaf is nearly an identity + wrap" criterion, applied to WS). + +### 3. `Frame` is a minimal enum; the default stack hands the adapter only data frames + +```rust +enum Frame { Text(Bytes), Binary(Bytes), Ping(Bytes), Pong(Bytes), Close(Option) } +``` + +"Untyped" (§1) means *no venue/JSON typing* — not flattening WebSocket's own protocol frame +kinds, which are transport concerns (RFC 6455), not venue concerns. The enum is the +**leaf/inter-layer** vocabulary. After the ADR-0033 default stack, the **adapter-facing +`WsSource` delivers only `Text`/`Binary` data frames**: the heartbeat layer absorbs protocol +`Ping`/`Pong` (auto-Pong) and `Close` becomes a lifecycle transition (§4). IBKR's +*application* heartbeat is a `Text` frame, so it reaches the adapter — that is venue liveness, +handled adapter-side. Crucially, **control frames bypass the §6 data buffer** and are answered +regardless of consumer drain speed, so a slow data-consumer never starves the Pong that keeps +IBKR from dropping us. + +### 4. Lifecycle is a separate, epoch-stamped channel — not a widened frame item + +Connection health is a **third handle**, not an item interleaved into the data stream: + +```rust +Lifecycle: impl Stream // (a watch-style last-value is an ADR-0033 sub-choice) +enum ConnState { Connected { epoch: u64 }, Stale, Reconnecting, Resumed { epoch: u64 }, Lagged { count: u64 } } +``` + +- **The data stream stays `Result` (§2), uncontaminated by control variants.** +- **The feed-*down* edge is first-class.** For a trading system the safety-critical event is + the feed going *stale*, not its recovery: a stale order/exec stream means we may be blind on + fills and must stop issuing and let risk react. `Stale`/`Reconnecting` are therefore signals + in their own right, feeding [ADR-0004](0004-risk-as-continuous-control-loop.md) (risk control + loop) and [ADR-0022](0022-reliable-order-path-graduated-failure.md) (graduated failure) — not + just `Resumed`. +- **It is the shared signal plane for the ADR-0033 layers** — the reconnect layer emits + `Resumed`, the heartbeat layer emits `Stale`, the buffer layer emits `Lagged` (§6). This is + the WS analogue of what `ErrorKind`/Telemetry is to the HTTP stack. +- **Epoch-stamped.** `Resumed{epoch}` bounds the adapter's reconcile window ("reconcile since + epoch N") and disambiguates an in-flight `umd-` queued against the dead connection vs. the + new one. Ordering is free: a fresh session is silent until resubscribe, so `Resumed` strictly + precedes any post-reconnect frame and `Stale` strictly follows the last pre-drop frame — the + correlation an in-band design would buy is guaranteed by the protocol regardless of channel. + +### 5. Recovery is split: transport re-establishes the connection, adapter replays subscriptions + +Because the transport is grammar-blind (§1), it **cannot** replay subscriptions — it does not +understand `smd+`, and a blind replay-log of sent frames would resurrect `umd-`'d +subscriptions. So: + +- The **transport** (ADR-0033 reconnect layer) rebuilds TCP + the WS handshake, **re-injects + auth** (§8), bumps the epoch, and emits `Resumed{epoch}`. +- The **adapter** owns subscription replay and the **differential** recovery only it can make, + because only it knows which stream a frame belongs to: + - **Market-data** streams resubscribe and accept the gap — a resubscribe returns the current + book (a fresh `LatestValue`, [ADR-0020](0020-bus-trait-delivery-classes-access-patterns.md)), + so the gap self-heals (cf. [ADR-0002](0002-backend-agnostic-bus-canonical-message-model.md): + "acting on stale-but-delivered messages is a separate, consumer-side freshness concern"). + - **Order/exec** streams cannot merely resume — fills may have occurred during the gap — so + the adapter runs a REST **reconciliation** pass + ([ADR-0006](0006-broker-reconciliation-contract.md)). + + The same `Resumed` (or `Lagged`, §6) signal thus drives two different adapter responses. This + is where the WS and REST transports meet in `oath-adapter-ibkr`, exactly as ADR-0029 foresaw. + IBKR's 10-minute `smd` self-termination gives resubscription a **second trigger** beyond + reconnect: a periodic refresh timer the adapter owns. + +### 6. Backpressure: a uniform, no-silent-drop guarantee at the transport; per-stream policy in the adapter + +"What happens when a `WsSource` is not drained" is a property the adapter codes against, so it +is **contract**, not a resilience detail. A WS subscription is push-based — IBKR sets the rate, +we set the consume rate — so frames can queue. Two facts fix the guarantee: + +- **Grammar-blindness (§1)** means the transport *cannot* apply a per-stream policy — it can't + tell an MD frame from an order frame. So the transport guarantee is necessarily **uniform**. +- A grammar-blind transport that silently dropped could discard an **execution report** — the + WS analogue of the duplicate-order incident `Retry` was designed around. So **silent drop is + out**. + +The guarantee: **the transport never silently discards; on overflow it drops oldest data frames +and emits `Lagged{count}` (§4).** The **per-stream drop/keep policy lives adapter-side, after +demux** (MD → `LatestValue` drop-to-latest, ADR-0020; orders → reliable handling + +reconcile-on-`Lagged`). Control frames bypass the buffer (§3). Two distinct "latest" mechanisms +at two layers — coarse, grammar-blind drop-oldest at the transport (hence the signal); semantic +per-`conid` overwrite downstream. + +Rationale to record: with IBKR specifically, overflow from *broker* volume is unlikely +(conflation @500ms × ~100 lines ≈ ~200 small msgs/sec, trivially drained). `Lagged` exists for +**consumer-side stall correctness** — if *our* demux stalls (a blocked downstream, a scheduler +hiccup), we must not silently lose an order frame — not as throughput defence. (TCP-backpressure +— refusing to read the socket — is rejected: it stalls *all* subscriptions and the Pong, and is +wrong for a market feed that wants freshest-wins.) The buffer *mechanism* is ADR-0033; the +*guarantee* is here. + +### 7. Leaf seam and backend: `WsConnector` over tokio-tungstenite + rustls + +The named dependency-inversion seam the adapter codes against — the `HttpClient` analogue — is +the `connect` of §2, exposed as a `WsConnector` trait. The WS upgrade *is* an HTTP GET, so the +handshake is an `http::Request` (reusing the `http` crate, consistent with `net-http-api`). The +first leaf backend is **`oath-adapter-net-ws-tungstenite`** (tokio-tungstenite over rustls) — the +analogue of `net-http-hyper` — whose `WebSocketStream::split()` is the near-identity source of +`WsSink`/`WsSource`. Per ADR-0029 §5 it is a **compile-time `impl WsConnector` seam**, not `dyn`. + +### 8. Auth: a per-transport `AuthSource` trait, one shared impl, re-pulled per (re)connect + +`AuthSource` is the seam that lets venue- and scheme-neutral net layers apply *current* +credentials to an outgoing request/handshake without learning the scheme — gateway mode stamps a +`Cookie`; OAuth 1.0a stamps a signed `Authorization` header; swapping them is swapping the +`impl`. It operates on `http::request::Parts` (method + uri + headers — body-agnostic, so the +same shape serves HTTP's `Request` and the WS `Request<()>`). + +`AuthSource` is **not** hoisted to the kernel: it touches `http` (forbidden in the std-only +kernel, ADR-0029 §3), and it is shared by HTTP + WS but **not universal** (a future FIX/multicast +transport authenticates differently) — the same category as `Service`, which ADR-0029 kept out +of the kernel. So the small trait is **declared per-transport** (in `net-http-api` and +`net-ws-api`), and IBKR's single `IbkrAuthSource` (one gateway session, one `/tickle` loop) +implements both — no extra crate. The WS reconnect layer calls it **per (re)connect** (never +caching at first connect), so a session refreshed by `/tickle` between drop and reconnect is +picked up — the streaming analogue of ADR-0031 §1's per-attempt re-stamp. + +## Considered options + +- *Reuse `Service` for WS* — rejected by ADR-0029 §2; request→one-reply cannot model a + subscription's many frames. +- *Typed per-subscription stream API (transport owns demux)* — rejected: demux-by-topic is venue + grammar; owning it breaches the §1/ADR-0003 boundary and couples `net-ws-api` to IBKR. +- *Hand-rolled RPITIT pull `recv()` for uniformity with `Service`* — rejected: it reads as "stay + free of streaming-contract deps," but `net-http-api` is *not* free of them (it took + `http-body` + `pin-project-lite`); the blessed pattern is the ecosystem streaming trait, and + `Stream` honours ADR-0029 §5 just as `Body` does. (A `recv()` would also make the leaf + hand-crank an iterator over tungstenite's `Stream`, forfeiting the identity wrap.) +- *`Sink` for the send half* — rejected: its `poll_ready`/`start_send`/`poll_flush` is the + poll-handshake the contract avoids; one-shot `send` suffices for low-volume control traffic. +- *Opaque `Bytes` frames* — rejected: erases the `Text`/`Binary` distinction and forces + protocol `Ping`/`Pong`/`Close` up into the adapter. +- *Widen the recv item to `Frame | Resumed | …` for lifecycle* — rejected: smears the control + plane into the data plane and re-widens the §2 stream; and a `Resumed`-only widening would + emit the "all clear" while staying silent on the safety-critical feed-down edge. +- *Transport replays subscriptions on reconnect* — rejected: requires grammar (which `umd-` + cancels which `smd+`), the §1 leak; replay is the adapter's because only it holds the + subscription set and the order-vs-MD recovery distinction. +- *Silent drop-to-latest at the transport, or terminal `WsError` on overflow* — rejected: + silent drop can lose an execution report (grammar-blind); a terminal error tears down *every* + subscription on a transient consumer hiccup (reconnect storm). Drop-oldest + `Lagged` keeps the + connection and signals the one thing the adapter needs. +- *`AuthSource` in the kernel, or in a shared `net-auth-api` crate* — rejected: kernel is + std-only and `AuthSource` needs `http` and is non-universal; a dedicated crate is unwanted + ceremony for a one-method trait. Per-transport declaration with one shared impl gives "one + session feeds both" without either. + +## Consequences + +- **New crates:** `oath-adapter-net-ws-api` (contract; deps `http`, `bytes`, `futures-core`, + `futures-util`, `pin-project-lite`, `thiserror`, `tracing` — the mirror of `net-http-api`'s + set; still zero-I/O, zero-runtime) and the leaf `oath-adapter-net-ws-tungstenite` (the only + `tokio`/`tokio-tungstenite`/`rustls` dependency, owning the `tungstenite::Error → WsError` and + `Message → Frame` mappings). `WsError` implements `HasErrorKind` once (close codes / connection + failures → `ErrorKind`). +- **`AuthSource` is declared in both `net-http-api` and `net-ws-api`** (identical one-method + trait); the README graph already lists `oath-adapter-net-api` as "HTTP/WS composition + primitives" and is updated to show the per-transport crates when they are built (deferred, as + for ADR-0029–0031). +- **The adapter (`oath-adapter-ibkr`) owns:** the subscription grammar, JSON/frame parsing, + demux, subscription replay + the periodic 10-min `smd` refresh, the differential + `Resumed`/`Lagged` recovery (MD resubscribe vs. order REST-reconcile, ADR-0006), the per-stream + backpressure policy, and the single `IbkrAuthSource`. +- **The lifecycle channel becomes a first-class input to risk/order control** (ADR-0004 / + ADR-0022), not merely an internal reconnect detail. +- **Recv-side backpressure is settled here** (§6), so ADR-0033 implements the buffer mechanism + with no contract change. + +## Relationships + +Fills the WebSocket contract **ADR-0029** deferred, on its kernel (`Layer`/`ServiceBuilder`, +`ErrorKind`, `Timer` unchanged). Mirrors **ADR-0030** (the HTTP sibling) and reuses its +`AuthSource` seam (**ADR-0031** §1 per-attempt re-stamp). Rests on **ADR-0007** (in-process ⇒ +compile-time `impl` seam, no `dyn`) and **ADR-0003** (anti-corruption: grammar/typing in the +adapter). Recovery defers to **ADR-0006** (reconciliation) and reads delivery semantics from +**ADR-0020** / **ADR-0002**; the lifecycle channel feeds **ADR-0004** / **ADR-0022**. Is the +base for **ADR-0033** (the WS resilience stack: reconnect, heartbeat, the §6 buffer layer, and +the default layer order). Glossary unchanged — `Frame`, `WsSource`, `Lifecycle`, `WsConnector`, +`AuthSource` are implementation vocabulary, and [CONTEXT.md](../../CONTEXT.md) is domain-only; +IBKR WS values are reference data for the adapter. From 6aec463cfbd1616d37c7082ee77065cdfa235282 Mon Sep 17 00:00:00 2001 From: NotAProfDev <84450364+NotAProfDev@users.noreply.github.com> Date: Wed, 1 Jul 2026 18:13:25 +0000 Subject: [PATCH 2/4] docs(adr): WebSocket resilience stack (ADR-0033) Add ADR-0033, the WS analogue of ADR-0031 (HTTP resilience), completing the pair ADR-0032 deferred. Grounded in IBKR's Client Portal WS and cross-checked against Binance and Coinbase so the generic transport carries no IBKR-shaped assumption. Decisions: two-seam composition (uniform WsConnector inside / richer ReconnectingConnection out, the tower Layer->Service split); reconnect as a spawned actor over a new runtime-neutral Spawn seam (mirrors Timer); a two-axis layer stack (connect-time chain + per-frame recv/send pipelines) with the 0031 ordering invariants; transport-liveness vs adapter session-keepalive split (mandatory auto-Pong, passive idle + active keepalive-when-idle probe); lifecycle as a watch of an epoch-stamped LifecycleSnapshot (lossless for the feed-down edge, never blocks the actor); dual count+byte drop-oldest buffer; a circuit breaker that retries transient loss forever but surfaces permanent failure as Unrecoverable; a send-axis rate limit; and force_reconnect on a control handle. Amends ADR-0032 in place (same PR, unmerged): smd 10->15min silent per-topic expiry; Unrecoverable added to ConnState; watch/LifecycleSnapshot delivery form and cumulative total_lagged resolved. Co-Authored-By: Claude Opus 4.8 (1M context) --- CHANGELOG.md | 7 + ...nsport-contract-duplex-frames-lifecycle.md | 29 +- ...ilience-reconnect-actor-watch-lifecycle.md | 351 ++++++++++++++++++ 3 files changed, 380 insertions(+), 7 deletions(-) create mode 100644 docs/adr/0033-websocket-resilience-reconnect-actor-watch-lifecycle.md diff --git a/CHANGELOG.md b/CHANGELOG.md index 87a38e5..c8657d2 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -19,6 +19,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added +- WebSocket transport design: ADR-0032 (contract — untyped duplex frame channel, + asymmetric `Stream`/RPITIT split, epoch-stamped lifecycle, `WsConnector` leaf, + per-transport `AuthSource`) and ADR-0033 (resilience — reconnect actor over a + runtime-neutral `Spawn` seam, two-axis layer stack, `watch`-of-`LifecycleSnapshot`, + dual-bound drop-oldest buffer, send-side rate limit, and a circuit breaker that + retries transient loss forever but surfaces permanent failure as `Unrecoverable`). + Validated against IBKR, Binance, and Coinbase WebSocket semantics. - `oath-model` numeric primitives — the root contract's first real content: `Price` (signed fixed-point `i128`), `Quantity` (unsigned `u128` magnitude), `Side` (`Buy`/`Sell`), and `ArithmeticError`, with checked `const fn` add/sub that error diff --git a/docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md b/docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md index 82cedd1..6842d4f 100644 --- a/docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md +++ b/docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md @@ -21,7 +21,11 @@ gateway session — the `set-cookie` cookies (gateway mode, e.g. `x-sess-uuid`) unless `/tickle` is called (~every 60s). So the WS rides **on top of** the REST session, which is why HTTP (ADR-0030) was built first. Market data is **conflated server-side to ~500ms per instrument** over ~100 subscription lines; `smd` subscriptions **self-terminate -after 10 minutes**. These are reference data for the adapter, not domain terms. +after ~15 minutes** (raised from 10 by IBKR ~2026-04; the server does not auto-unsubscribe on +expiry, so the client `umd+`s then `smd+`s to refresh). Expiry is **silent and per-topic**: +inbound ticks for the affected `conid` simply stop — no close frame, no error — while the +connection, session, and system heartbeat all stay healthy. These are reference data for the +adapter, not domain terms. ## Decision @@ -97,10 +101,20 @@ IBKR from dropping us. Connection health is a **third handle**, not an item interleaved into the data stream: ```rust -Lifecycle: impl Stream // (a watch-style last-value is an ADR-0033 sub-choice) -enum ConnState { Connected { epoch: u64 }, Stale, Reconnecting, Resumed { epoch: u64 }, Lagged { count: u64 } } +Lifecycle: a last-value channel of ConnState // watch-style; delivery form resolved in ADR-0033 +enum ConnState { + Connected { epoch: u64 }, Stale, Reconnecting, Resumed { epoch: u64 }, + Lagged { count: u64 }, // buffer overflow (§6) + Unrecoverable, // a classified non-transient failure — will not self-heal (ADR-0033 §7) +} ``` +ADR-0033 resolves the delivery form (a `watch` of an epoch-stamped `LifecycleSnapshot`, not a +transition stream — its §5 explains why, and why `Lagged`'s count is carried as a +monotonic cumulative total under last-value semantics). `Unrecoverable` is emitted by the +resilience layer when it classifies a permanent failure rather than retrying it forever +(ADR-0033 §7); it is the one terminal state — every other variant is transient. + - **The data stream stays `Result` (§2), uncontaminated by control variants.** - **The feed-*down* edge is first-class.** For a trading system the safety-critical event is the feed going *stale*, not its recovery: a stale order/exec stream means we may be blind on @@ -137,7 +151,7 @@ subscriptions. So: The same `Resumed` (or `Lagged`, §6) signal thus drives two different adapter responses. This is where the WS and REST transports meet in `oath-adapter-ibkr`, exactly as ADR-0029 foresaw. - IBKR's 10-minute `smd` self-termination gives resubscription a **second trigger** beyond + IBKR's ~15-minute `smd` self-termination gives resubscription a **second trigger** beyond reconnect: a periodic refresh timer the adapter owns. ### 6. Backpressure: a uniform, no-silent-drop guarantee at the transport; per-stream policy in the adapter @@ -236,13 +250,14 @@ picked up — the streaming analogue of ADR-0031 §1's per-attempt re-stamp. primitives" and is updated to show the per-transport crates when they are built (deferred, as for ADR-0029–0031). - **The adapter (`oath-adapter-ibkr`) owns:** the subscription grammar, JSON/frame parsing, - demux, subscription replay + the periodic 10-min `smd` refresh, the differential + demux, subscription replay + the periodic ~15-min `smd` refresh, the differential `Resumed`/`Lagged` recovery (MD resubscribe vs. order REST-reconcile, ADR-0006), the per-stream backpressure policy, and the single `IbkrAuthSource`. - **The lifecycle channel becomes a first-class input to risk/order control** (ADR-0004 / ADR-0022), not merely an internal reconnect detail. -- **Recv-side backpressure is settled here** (§6), so ADR-0033 implements the buffer mechanism - with no contract change. +- **Recv-side backpressure is settled here** (§6): the *guarantee* (drop-oldest data + `Lagged`, + control bypasses) is unchanged by ADR-0033, which only refines how the count is carried under + the §4 last-value channel (cumulative `total_lagged`) and adds the dual count+byte bound. ## Relationships diff --git a/docs/adr/0033-websocket-resilience-reconnect-actor-watch-lifecycle.md b/docs/adr/0033-websocket-resilience-reconnect-actor-watch-lifecycle.md new file mode 100644 index 0000000..2db58fa --- /dev/null +++ b/docs/adr/0033-websocket-resilience-reconnect-actor-watch-lifecycle.md @@ -0,0 +1,351 @@ +# WebSocket resilience: reconnect actor over a `Spawn` seam, two-axis layer stack, watch-lifecycle, and a circuit breaker that inverts for transient loss but survives for permanent failure + +[ADR-0032](0032-websocket-transport-contract-duplex-frames-lifecycle.md) fixed the +**`oath-adapter-net-ws-api` contract** — the untyped duplex frame channel, the asymmetric +`Stream` recv / RPITIT send split, the epoch-stamped lifecycle channel, the uniform +no-silent-drop backpressure *guarantee*, the `WsConnector` leaf seam, and the per-transport +`AuthSource`. It deferred the **resilience stack that wraps that contract** — reconnect, +heartbeat, the §6 buffer *mechanism*, and the default layer order — to this ADR. This is the +WebSocket analogue of [ADR-0031](0031-http-resilience-venue-pacing.md) (the HTTP sibling), +driven by the first [Broker](../../CONTEXT.md)/[Data Provider](../../CONTEXT.md), IBKR's +Client Portal WebSocket, and — because this crate is one of the ADR-0029 series' *generic*, +reusable transports — **cross-checked against Binance and Coinbase**, whose keepalive and +loss models differ from IBKR's in ways that would otherwise leak an IBKR-shaped assumption +into a shared crate. + +Every timing layer is generic over +[`net-api::Timer`](0029-network-adapter-stack-transport-split-compile-time-composition.md); +unlike the HTTP stack, the WS reconnect supervisor is a long-lived task, so this ADR adds a +second runtime-neutral seam, `Spawn`, alongside `Timer`. + +## The grounding cases + +The transport is **grammar-blind** (ADR-0032 §1): it cannot tell a market-data frame from an +order frame, nor a subscription from a keepalive. That single fact forces the split that +recurs throughout this ADR — **transport liveness** (is the socket alive?) is generic and +lives in the layers; **session keepalive** and **loss recovery** (is the *venue* about to +idle-drop us? which *stream* lost data?) are venue grammar and live in the adapter. + +| Concern | IBKR CP | Binance spot | Coinbase Advanced Trade | +|---|---|---|---| +| Server keepalive | app `{"topic":"system","hb":…}` Text; **not guaranteed on an idle/unsubscribed socket** | **server sends a protocol `Ping` every 20s**; no `Pong` within 60s → dropped | app `heartbeats` channel (1/s); channels close in 60–90s idle | +| Client keepalive | `tic` Text ~q10s + REST `/tickle` (5-min session) | auto-`Pong`; user streams keep `listenKey` alive via REST | subscribe `heartbeats`; **JWT expires 2 min** → re-auth | +| Inbound send limit | low volume | **~5 msgs/s** (ping/pong/subscribe all count) → disconnect; *"IPs repeatedly disconnected may be banned"*; 300 conns / 5 min / IP | disconnected if no subscribe within 5s | +| Silent partial loss | `smd` self-terminates ~15 min: **ticks for a `conid` stop, socket/session/`hb` all healthy** | listenKey expiry stops the user stream, socket alive | channel idle-close; `sequence`/`heartbeat_counter` gaps | +| Large frames | small JSON | depth diffs small | **level2 BTC-USD snapshot overflows a 100 KB buffer** (clients raise to ~100 MiB) | + +These are reference data for the adapters, not domain terms. Two facts do real work below: +Binance proves **auto-`Pong` is mandatory and inbound sends must be paced**; Coinbase proves a +**generic transport can receive multi-MB frames**, so a frame-count-only buffer bound is an +IBKR-shaped assumption that OOMs on venue #2. + +## Decision + +### 1. Composition: a uniform `WsConnector` *inside*, a richer `ReconnectingConnection` *out* + +The kernel's `Layer`/`ServiceBuilder` (ADR-0029 §3, deliberately `Service`-bound-free) is +reused to *assemble* the stack — but that reuse is **assembly ergonomics, not the abstraction +doing the resilience work.** Industry hand-assembles the frame half of a resilient socket; it +does not model reconnect/heartbeat/buffer as a generic middleware stack. What is load-bearing +is the seam split, and there are **two** seams, which ADR-0032's single `WsConnector` conflated: + +- **Composition seam** — `WsConnector::connect(handshake) -> (WsSink, WsSource, Lifecycle)`, + the ADR-0032 §2 triple. Internal; the leaf and every inner layer implement it. This is how + layers stack; it never reaches an adapter. +- **Usage seam** — a new type `ReconnectingConnection { sink, source, lifecycle, control }`, + produced only at the assembly boundary and handed to the adapter exactly once. The control + handle (`WsControl`: `force_reconnect`, `shutdown`) exists **only here**. + +```rust +trait WsConnector { // composition — ADR-0032 §2/§4 unchanged + fn connect(&self, h: http::Request<()>) + -> impl Future> + Send; +} + +struct ReconnectingConnection { // usage — new here; what build() yields + sink: WsSink, // {send, close} — minimal, as landed + source: WsSource, // Stream> + lifecycle: Lifecycle, // last-value watch of LifecycleSnapshot (§5) + control: WsControl, // force_reconnect(), shutdown() +} +``` + +The reconnect layer is the exact **`Layer → Service` analogue** from tower: you compose +`Layer`s but *hold* a `Service` (`ServiceBuilder` yields a `Buffer>` used as a +`Service`, never as a `Layer`). Composition unit ≠ product type. So the leaf never grows a +control handle it cannot honour — `force_reconnect` on a raw socket would be a silent no-op, +the class of dishonest seam this crate's charter forbids. An adapter that genuinely wants a +raw connection opts in with an explicit `PassthroughReconnect` layer supplying a trivial +`WsControl` in **one** quarantined place, not a triviality forced onto every leaf. This is the +managed-handle pattern industry ships (gRPC's `ManagedChannel` carries `resetConnectBackoff()` +/ `shutdown()`; the raw transport carries none; likewise `ezsockets::Client` vs. a raw +`WebSocketStream`). + +### 2. Reconnect is a spawned actor over a runtime-neutral `Spawn` seam + +The reconnect supervisor owns the single socket, drains a **channel-backed** `WsSink`, forwards +to the live connection, and on a break rebuilds it, re-injects auth, bumps the epoch, and emits +`Resumed{epoch}`. It is a long-lived task that outlives any single send or poll and coordinates +the two independently-owned halves (ADR-0032 §2) — so, unlike every HTTP layer (which is purely +caller-driven inside `call`), it must **spawn**. + +`net-ws-api` is zero-runtime (ADR-0032 Consequences). The reconciliation is the same one +ADR-0029 §4 made for `Timer` and the net-http construction surface made for `async-lock`: a +**`Spawn` trait is an abstraction, not a runtime.** So this ADR declares a minimal `Spawn` +seam in `net-ws-api`; the backend (`net-ws-tungstenite`) provides the tokio impl. The actor +lives in `net-ws-api`, keeping the whole resilience stack — as in ADR-0031 — in the contract +crate, mock-testable and backend-reusable. + +- **Backoff** is capped exponential (§7), and also honours a + **connection-attempt rate** (Binance: 300 conns / 5 min / IP) so reconnect itself cannot + storm into a ban. +- **Auth is re-pulled per (re)connect** (ADR-0032 §8), never cached at first connect, so a + session refreshed by `/tickle` between drop and reconnect is picked up — the streaming + analogue of ADR-0031 §1's per-attempt re-stamp. + +### 3. The default stack: two axes, not one line + +ADR-0031's single line (`Tracing → CircuitBreaker → Retry → RateLimit → Timeout → +BufferOrStream → Auth → leaf`) does not transliterate, because WS concerns split across two +axes rather than one per-request `call`. (First `.layer()` is outermost — ADR-0029's +`ServiceBuilder` invariant.) + +```text +Connect-time (per (re)connect): Tracing → Reconnect(backoff, epoch) → ConnectTimeout → Auth → leaf.connect +Recv per-frame (socket→adapter): socket → Heartbeat(auto-Pong, absorb control, idle→Stale) → Buffer(drop-oldest, Lagged) → WsSource +Send per-frame (adapter→socket): WsSink{send, close} → SendRateLimit(token bucket) → socket +Control plane: WsControl.force_reconnect() → actor +``` + +The ordering **invariants** (the reason assembly lives once, over an arbitrary leaf, so a +`stack(MockWsConnector, …)` can regression-test them — mirroring ADR-0031's rationale): + +- **`Auth` innermost at connect-time, re-stamped per (re)connect** — the WS analogue of + ADR-0031's *Auth-inside-Retry*: **`Reconnect` is the `Retry`-analogue**, retrying the + *connection*, with `Auth` re-stamping inside each attempt. +- **`ConnectTimeout` inside `Reconnect`** — each attempt gets a fresh timeout; a hung handshake + cannot wedge the backoff loop (ADR-0031's *Retry-outside-Timeout*). +- **`Heartbeat` socket-side of `Buffer`** — control frames are handled *before* the data ring + (ADR-0032 §3/§6), so auto-`Pong` is never queued behind a slow data consumer. +- **`Tracing` outermost** — one span over all reconnects (ADR-0031 §6), a Telemetry source + (ADR-0014), secret-safe (auth material is injected below it). + +### 4. Heartbeat/liveness: transport liveness in the layer, session keepalive in the adapter + +The layer is grammar-blind, so it can only ever send a **protocol `Ping`** — which probes the +*socket* but does not satisfy any venue's *session* keepalive (`tic`, `heartbeats`-subscribe, +`listenKey` PUT are all venue grammar). The split is therefore hard: + +- **Layer (generic) owns transport liveness:** auto-`Pong` on protocol `Ping` — **mandatory** + (Binance drops a connection that misses it); swallow `Pong`; map `Close` to a lifecycle + transition; a **passive idle-read timeout** (`Timer`-driven) → `Stale`; and an **active + protocol-`Ping` probe when idle** (*keepalive-when-idle*, not flat-off — IBKR gives no + guaranteed heartbeat on an idle/unsubscribed socket, so a purely passive detector could + starve on a fresh idle connect). The active probe is a config knob (interval + idle + threshold). +- **Adapter owns session keepalive** (venue grammar): IBKR `tic`; Coinbase `heartbeats` + + per-message JWT; Binance `listenKey` — the ADR-0003 boundary carried into liveness. + +### 5. Lifecycle: a `watch` of an epoch-stamped snapshot, not a transition stream + +ADR-0032 §4 deferred the delivery form. It is resolved here as a **last-value `watch` of a rich +snapshot**, *not* a transition stream and *not* a naive watch of bare `ConnState`: + +```rust +struct LifecycleSnapshot { + phase: ConnState, // level: Connected/Stale/Reconnecting/Resumed/Unrecoverable + epoch: u64, // monotonic: bumped on every completed down-cycle + down_since: Option, + attempts: u64, // monotonic + total_lagged: u64, // monotonic cumulative — NOT a per-event delta (see §6) +} +``` + +- A **transition stream is rejected**: its emitter is the socket-owning actor (§2); a bounded + stream with a blocking sender couples the actor's liveness to a slow risk consumer — the + actor stalls, stops answering `Ping`, and *causes* the disconnect it is trying to report — + worst exactly under the stress that generates down-edges. Unbounded trades that for unbounded + memory; drop-on-full reintroduces the lost edge. +- A **naive watch of bare `ConnState` is also rejected**: a fast `Stale → Reconnecting → + Resumed` coalesces to `Resumed`, hiding the safety-critical feed-*down* edge (ADR-0032 §4) + from a slow risk loop. +- The **epoch resolves both**. The risk loop `select!`s `changed()`, and on wake `borrow()`s + and diffs `epoch`. Losslessness is by construction: **{ currently-down (`phase`) ∨ + epoch-advanced }** is total — either the consumer reads an in-progress down phase, or a + fully-coalesced cycle is recovered from the epoch delta ("epoch jumped 5→9 ⇒ four down-cycles, + regardless of what I witnessed"). Only transient edge *ordering/timing* is lost, which is + telemetry, not risk logic; even a cancel-all-on-down action keys on epoch-advance, so it is + idempotent and lossless. The `watch` sender **never blocks** (overwrite semantics), so the + actor is never backpressured. With the epoch in the level, the watch is the **safe and cheap** + choice and the transition stream is the one whose safety degrades under load — the inverse of + the naive framing. (Prior art: `epoll` level-triggered mode, TCP sequence numbers, sticky + hardware fault registers — level + monotonic version deliver a must-not-miss fact to a slow + consumer without blocking the producer.) +- **Discipline:** every snapshot field must be **level or monotonic-cumulative, never a + per-event delta**, or overwrite loses it — which is exactly why `Lagged`'s per-event `count` + becomes the cumulative `total_lagged` the consumer diffs. +- The `watch` primitive is **runtime-neutral** (`async-watch`, extracted from + `tokio::sync::watch`, `event-listener` family) — *not* `tokio::sync::watch`, keeping tokio out + of `net-ws-api`'s graph exactly as the net-http surface chose `async-lock` over `tokio::sync`. +- An **explicitly lossy edge feed** off the actor serves audit/telemetry consumers that want the + ordered transition trail (the ADR-0014 Telemetry plane), kept *out* of the safety channel so + the safety channel carries no never-drop obligation for the audit log's sake. + +### 6. Backpressure: the §6 buffer mechanism — a dual-bound drop-oldest ring + +ADR-0032 §6 fixed the *guarantee* (never silently discard; drop **oldest data** on overflow; +emit `Lagged`; control bypasses; per-stream policy adapter-side). The mechanism: + +- **Control-bypass is structural.** The actor's read loop handles `Ping`/`Pong`/`Close`/liveness + inline and pushes only `Text`/`Binary` into the ring; it **always drains the socket** (§6 + rejects TCP-backpressure) and absorbs pressure on the data side by dropping, never by refusing + to read. Source wakeups use `event-listener`; an `mpsc` cannot drop-oldest. +- **Dual bound `min(count, bytes)`.** A frame-count-only bound bakes in IBKR's small-JSON + assumption; a generic transport receives multi-MB frames (Coinbase level2 snapshot), so + `N × frame_size` OOMs on venue #2. Byte-accounting is one `usize` (`frame.len()` is already in + hand), so the ring caps small-frame floods by count *and* memory by bytes, whichever trips + first; the byte default is generous (a few MB, per-venue tunable) so IBKR never touches it. A + single frame exceeding the byte cap is **kept** (older dropped, lag incremented) — never + discard the newest (§6). This is the standard slow-consumer shape (Redis + `client-output-buffer-limit`, Kafka `buffer.memory`, Netty `WriteBufferWaterMark`). +- **`Lagged` is a blunt, grammar-blind instrument — recorded as a consequence, not a gap.** A + single global `total_lagged` cannot attribute drops to a stream (per-stream rings would need + demux = venue grammar in the transport, the forbidden leak; and the dropped frames are gone). + So any increment forces the adapter to the **conservative union**: if any order stream is live, + reconcile orders *and* resnapshot MD. Where a venue carries sequence numbers (Coinbase + `heartbeat_counter`, Binance depth `U`/`u`), the *adapter* gets precise per-stream loss from + its own sequence tracking after demux; `total_lagged` is only a coarse "something dropped" hint. + ADR-0032 §6's "reconcile-on-`Lagged` (order) / drop-to-latest (MD)" must not be read as the + transport distinguishing the two. + +### 7. The circuit breaker inverts for transient loss and survives for permanent failure + +ADR-0031's `CircuitBreaker` *stops* calling a failing venue. For a market/order feed, going dark +is the emergency (ADR-0004 risk is blindest when the feed is down), so the breaker **inverts** +for transient loss — but only for transient loss: + +- **Transient / unknown** (`ErrorKind::Connection`/`Timeout`) → **retry forever, capped backoff.** + The "break" relocates to a **risk-layer trading halt** ([ADR-0022](0022-reliable-order-path-graduated-failure.md), + fed by the watch's `down_since`/`attempts`), not a transport give-up — the market-data / + OTP-supervisor standard: *the transport never stops trying to see; Core decides when to stop + acting.* +- **Permanent** (`ErrorKind::Auth`, protocol-version rejection) → a few retries (a session expiry + may re-auth away), then a **terminal `Unrecoverable` phase** — stop. Retrying a permanent + failure forever cannot succeed, and **worsens** the outage: *"IPs repeatedly disconnected may be + banned"* is ADR-0031's original "stop hammering a failing dependency" reappearing, and a ban + takes down healthy connections too. `Unrecoverable` also disambiguates two operational states a + climbing counter conflates — "a human must rotate the key" vs. "the network is flaky." + +Classification is by `ErrorKind` (grammar-free; `WsError: HasErrorKind`, ADR-0032 Consequences), +adapter-refinable via a hook (venue grammar: which close-code is permanent, à la gRPC +`UNAUTHENTICATED` vs. `UNAVAILABLE`). An optional `max_attempts → Failed{}` cap stays +**orthogonal** — voluntary give-up on a *non-critical* stream, a different axis from involuntary +permanent failure. + +### 8. Send-axis `RateLimit`, the control handle, and expiry ≠ death + +- **`SendRateLimit`** restores ADR-0031's *proactive* guard on the axis where sends happen: a + generic token bucket on `WsSink`, adapter-configured with the venue's inbound limit, **default + off/generous** so IBKR never notices. Without it, a reconnect resubscribe-burst (~100 lines > + Binance's ~5/s) floods → disconnect → reconnect → flood — a storm reconnect-backoff cannot stop + (backoff paces *connects*, not *sends*). `send()` awaiting a token is *backpressure-inside-`call`*, + consistent with ADR-0032 §2 (which rejected `Sink`'s `poll_ready` *handshake*, not all + backpressure). This pairs with `Reconnect` exactly as ADR-0031's `RateLimit` (never hit the + limit) pairs with `CircuitBreaker` (recover if you do). +- **`force_reconnect` is a control-handle verb, not a sink method** — keeping the data plane + (`sink.send`) and control plane separate, the same discipline the lifecycle read-channel + follows. It serves the *proactive* reconnect the venues demand (Binance 24h connection lifetime, + Coinbase JWT 2-min re-auth) as well as the reactive one. +- **Expiry ≠ death (adapter, verified).** IBKR `smd` self-termination is a **silent per-`conid` + death on a healthy socket** — every transport liveness detector stays green. So `force_reconnect` + is the *wrong* remedy (it would tear down ~100 healthy subscriptions); the adapter owns a + per-topic staleness timer → `umd+`/`smd+` **resubscribe**, and on `Resumed{epoch}` the adapter + replays subscriptions (ADR-0032 §5). Reconnect is escalated **only** when the adapter concludes + the *socket* is dead. + +### 9. Construction and mock-testability + +Mirroring the net-http `stack()`/`build()` split, for the same reason (the ordering invariants of +§3 are testable only over a deterministic leaf and clock): + +```rust +// net-ws-api — assembles the canonical stack over ANY leaf +pub fn stack(leaf: S, cfg: WsConfig, timer: T, auth: A, spawn: Sp) + -> impl ReconnectingConnector +where S: WsConnector + …, T: Timer, A: AuthSource, Sp: Spawn; + +// net-ws-tungstenite — builds the tungstenite leaf, then delegates to stack() +pub fn build(cfg: WsConfig, timer: T, auth: A, spawn: Sp, conn: ConnConfig) + -> impl ReconnectingConnector; +``` + +- **`WsConfig` is non-generic plain data** (timeouts, backoff + attempt-rate cap, buffer bounds + count+bytes, heartbeat/idle intervals, send-rate limit, permanent-error policy). **No `K`-generic** + — a WS send-limit is per-connection (one pipe), not per-endpoint-keyed, so there is no `RateKey` + and no boot-time coverage check (a genuine reduction versus the net-http surface). +- **New dev-only crate `oath-adapter-net-ws-mock`** (mirror of `net-http-mock`): `MockWsConnector` + (scripted frames + injectable disconnects and `ErrorKind`s), `MockTimer`, and **`MockSpawn` — a + test-controlled, single-threaded, manually-pumped executor**, not a tokio spawner. This is the + point of the `Spawn` seam: only a deterministic executor lets a test drive the actor step by step + and assert the invariants ("Auth re-stamps inside each reconnect," "auto-`Pong` below a full + buffer," "permanent vs. transient classification") without racing a background task — the + `Timer`-style "controllable, not a no-op" discipline applied to spawning. + +## Considered options + +- *Reconnect in the backend, not `net-ws-api`* — rejected: it removes the mock-clock/mock-spawn + testability that justifies the whole seam, and a second backend would rewrite the actor. The + `Spawn` seam keeps the resilience logic in the contract crate (as ADR-0031 keeps HTTP's). +- *Poll-driven reconnect (no spawn, `unfold` + shared cell)* — tenable, but the channel-backed + actor cleanly owns both halves (auto-`Pong` needs the sink from the recv path) and matches the + industry actor model; a shared-cell design gets gnarly once heartbeat-`Pong` and send-during-gap + are in play. +- *Control verbs on the sink, or a 4-tuple `connect()`* — rejected: the sink is the data plane; + and a 4-tuple would amend ADR-0032 §2's arity and force a meaningless control handle onto every + raw leaf. The usage type ≠ composition type (tower `Service` vs. `Layer`), so the richer handle + belongs only at the assembly boundary. +- *Transition-stream lifecycle, or naive watch of bare `ConnState`* — rejected (§5): the first + couples actor liveness to a slow consumer; the second coalesces away the feed-down edge. Watch of + an epoch-stamped snapshot is lossless for the safety fact and never blocks the actor. +- *Frame-count-only buffer bound* — rejected (§6): OOMs on a multi-MB-frame venue that the crate's + generality already includes. +- *Retry every failure forever, or give up after a cap* — rejected (§7): the first hammers a + permanent failure into a ban; the second blinds a critical feed on a transient hiccup. Classify: + retry transient forever, surface permanent as `Unrecoverable`. +- *No transport send limit* — rejected (§8): a resubscribe burst floods a venue with an inbound + cap into a disconnect/ban that reconnect-backoff cannot prevent. + +## Consequences + +- **New seam:** `Spawn` in `net-ws-api` (runtime-neutral, mirrors `Timer`); the backend supplies + the tokio impl. **New dep:** `async-watch` (runtime-neutral last-value channel, `event-listener` + family) on top of ADR-0032's set; still zero-runtime, zero-I/O. **New dev-only crate** + `oath-adapter-net-ws-mock` (`MockWsConnector`/`MockTimer`/`MockSpawn`, consumed only via + `[dev-dependencies]`, mirroring the net-http-mock production-reachability discipline). +- **`net-ws-api` gains** the reconnect actor, heartbeat/liveness, the dual-bound drop-oldest + buffer, `SendRateLimit`, `Tracing`, the `stack()` assembler, `ReconnectingConnection` + + `WsControl`, and `LifecycleSnapshot`; `WsConfig` (non-generic). `net-ws-tungstenite` owns `build()` + and the tokio `Spawn`/`Timer`. +- **The adapter (`oath-adapter-ibkr`) owns:** session keepalive (`tic`, `/tickle`), the per-topic + `smd` staleness timer + `umd+`/`smd+` refresh, subscription replay on `Resumed`, the + conservative reconcile-on-`Lagged`, sequence-gap detection where the venue offers it, the + `ErrorKind`→permanent classification refinement, and the `SendRateLimit`/backoff/timeout config + values. +- **Amends ADR-0032 (in place, same PR — not landed):** §"grounding case" (`smd` ~15 min, silent + per-topic expiry); §4 (`Unrecoverable` added to `ConnState`; the deferred watch-vs-stream + sub-choice resolved as a `watch` of `LifecycleSnapshot`; `Lagged` carried as cumulative + `total_lagged`). +- **ADR numbering:** this pair keeps 0032/0033; the net-http construction-surface amendments + (a separate, unmerged workstream) take **ADR-0034**, not 0032. + +## Relationships + +Completes the WS resilience stack **ADR-0032** deferred, on the ADR-0029 kernel (`Layer`, +`ErrorKind`, `Timer`) plus the new `Spawn` seam. Mirrors **ADR-0031** (the HTTP sibling) and +inherits its per-attempt-auth and proactive/reactive pacing shape, inverting the circuit breaker +for a must-maintain feed. Feeds the lifecycle channel to **ADR-0004** (risk) and **ADR-0022** +(graduated failure); defers subscription replay and order recovery to the adapter per **ADR-0006** +/ **ADR-0003**; routes `Tracing` to the **ADR-0014** Telemetry plane; rests on **ADR-0007** +(compile-time `impl` seams, no `dyn`). Glossary unchanged — `Spawn`, `ReconnectingConnection`, +`WsControl`, `LifecycleSnapshot` are implementation vocabulary; [CONTEXT.md](../../CONTEXT.md) is +domain-only, and IBKR/Binance/Coinbase values are reference data for the adapters. From 7dbfd34413a6afefc2bdcd32f4f4356dc0ac158c Mon Sep 17 00:00:00 2001 From: NotAProfDev <84450364+NotAProfDev@users.noreply.github.com> Date: Wed, 1 Jul 2026 18:44:12 +0000 Subject: [PATCH 3/4] docs(adr): address review on ADR-0032/0033 WS transport MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Review + CodeRabbit fixes on the unmerged pair: - ADR-0032 §4: drop `Lagged { count }` from the `ConnState` enum — lag is not a connection phase (a socket can be `Connected` *and* lagging); it is carried as the cumulative `total_lagged` field on the ADR-0033 §5 `LifecycleSnapshot`. Reconciles §4 with §5 (which already omitted it) and fixes the §6 `Lagged{count}` literal to reference the cumulative signal. - ADR-0033 §1/§9: define the `ReconnectingConnector` trait `stack()`/`build()` return (`impl ReconnectingConnector`) — previously undefined and easily read as a typo of `ReconnectingConnection`; add it to the glossary. - ADR-0033 §7: stop referencing an undefined `Failed{}` state; describe the optional `max_attempts` cap as a distinct optional terminal outcome not added to the core `ConnState`. - ADR-0033 §5: mark the top-level snapshot `epoch` canonical (the epoch echoed in `Connected`/`Resumed` is the same value) — one source of truth. - CodeRabbit: `WsSink::close(self)` (terminal, one-way shutdown) instead of `close(&mut self)`; reword the §6 buffer bound as a soft backlog budget with a never-drop-newest exception, not a hard memory cap. Co-Authored-By: Claude Opus 4.8 (1M context) --- ...nsport-contract-duplex-frames-lifecycle.md | 11 +++-- ...ilience-reconnect-actor-watch-lifecycle.md | 47 +++++++++++++------ 2 files changed, 40 insertions(+), 18 deletions(-) diff --git a/docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md b/docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md index 6842d4f..9ac6bc4 100644 --- a/docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md +++ b/docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md @@ -58,8 +58,10 @@ fn connect(&self, handshake: http::Request<()>) WsSource: impl Stream> + Send // send half — sending one frame is request-shaped (one-shot), NOT a Sink: +// `close` takes `self` by value — shutdown is one-way and terminal, so the sink +// cannot be `send`-used after close is requested (enforced by the type system). trait WsSink { fn send(&mut self, f: Frame) -> impl Future> + Send; - fn close(&mut self) -> impl Future> + Send; } + fn close(self) -> impl Future> + Send; } ``` - **recv is `futures_core::Stream`**, not a hand-rolled pull iterator. This is the @@ -104,9 +106,12 @@ Connection health is a **third handle**, not an item interleaved into the data s Lifecycle: a last-value channel of ConnState // watch-style; delivery form resolved in ADR-0033 enum ConnState { Connected { epoch: u64 }, Stale, Reconnecting, Resumed { epoch: u64 }, - Lagged { count: u64 }, // buffer overflow (§6) Unrecoverable, // a classified non-transient failure — will not self-heal (ADR-0033 §7) } +// Buffer overflow (§6) is NOT a connection phase — it is orthogonal to `ConnState` +// (a connection can be `Connected` *and* lagging). The `Lagged` signal is carried as +// the monotonic cumulative `total_lagged` field on the ADR-0033 §5 `LifecycleSnapshot`, +// not as a variant here — see the delivery-form note below and §6. ``` ADR-0033 resolves the delivery form (a `watch` of an epoch-stamped `LifecycleSnapshot`, not a @@ -167,7 +172,7 @@ we set the consume rate — so frames can queue. Two facts fix the guarantee: out**. The guarantee: **the transport never silently discards; on overflow it drops oldest data frames -and emits `Lagged{count}` (§4).** The **per-stream drop/keep policy lives adapter-side, after +and signals the drop by advancing the cumulative `total_lagged` (the `Lagged` signal, §4).** The **per-stream drop/keep policy lives adapter-side, after demux** (MD → `LatestValue` drop-to-latest, ADR-0020; orders → reliable handling + reconcile-on-`Lagged`). Control frames bypass the buffer (§3). Two distinct "latest" mechanisms at two layers — coarse, grammar-blind drop-oldest at the transport (hence the signal); semantic diff --git a/docs/adr/0033-websocket-resilience-reconnect-actor-watch-lifecycle.md b/docs/adr/0033-websocket-resilience-reconnect-actor-watch-lifecycle.md index 2db58fa..c1d4e9a 100644 --- a/docs/adr/0033-websocket-resilience-reconnect-actor-watch-lifecycle.md +++ b/docs/adr/0033-websocket-resilience-reconnect-actor-watch-lifecycle.md @@ -62,14 +62,24 @@ trait WsConnector { // composition — ADR-0032 -> impl Future> + Send; } -struct ReconnectingConnection { // usage — new here; what build() yields +struct ReconnectingConnection { // usage — new here; what a ReconnectingConnector yields sink: WsSink, // {send, close} — minimal, as landed source: WsSource, // Stream> lifecycle: Lifecycle, // last-value watch of LifecycleSnapshot (§5) control: WsControl, // force_reconnect(), shutdown() } + +trait ReconnectingConnector { // usage seam — the assembled-stack analogue of WsConnector + fn connect(&self, h: http::Request<()>) // (note: -or the factory trait, -ion the product struct) + -> impl Future> + Send; +} ``` +`ReconnectingConnector` is to `ReconnectingConnection` what `WsConnector` is to the ADR-0032 §2 +triple: the factory trait `stack()`/`build()` return (§9), whose `connect` hands the adapter the +richer product exactly once. It is the *usage-seam* peer of the internal *composition-seam* +`WsConnector`. + The reconnect layer is the exact **`Layer → Service` analogue** from tower: you compose `Layer`s but *hold* a `Service` (`ServiceBuilder` yields a `Buffer>` used as a `Service`, never as a `Layer`). Composition unit ≠ product type. So the leaf never grows a @@ -154,7 +164,9 @@ snapshot**, *not* a transition stream and *not* a naive watch of bare `ConnState ```rust struct LifecycleSnapshot { phase: ConnState, // level: Connected/Stale/Reconnecting/Resumed/Unrecoverable - epoch: u64, // monotonic: bumped on every completed down-cycle + epoch: u64, // monotonic: bumped on every completed down-cycle. Canonical: the epoch + // echoed inside Connected{epoch}/Resumed{epoch} is this same value + // (both set from one counter) — consumers diff THIS field. down_since: Option, attempts: u64, // monotonic total_lagged: u64, // monotonic cumulative — NOT a per-event delta (see §6) @@ -200,14 +212,17 @@ emit `Lagged`; control bypasses; per-stream policy adapter-side). The mechanism: inline and pushes only `Text`/`Binary` into the ring; it **always drains the socket** (§6 rejects TCP-backpressure) and absorbs pressure on the data side by dropping, never by refusing to read. Source wakeups use `event-listener`; an `mpsc` cannot drop-oldest. -- **Dual bound `min(count, bytes)`.** A frame-count-only bound bakes in IBKR's small-JSON - assumption; a generic transport receives multi-MB frames (Coinbase level2 snapshot), so - `N × frame_size` OOMs on venue #2. Byte-accounting is one `usize` (`frame.len()` is already in - hand), so the ring caps small-frame floods by count *and* memory by bytes, whichever trips - first; the byte default is generous (a few MB, per-venue tunable) so IBKR never touches it. A - single frame exceeding the byte cap is **kept** (older dropped, lag incremented) — never - discard the newest (§6). This is the standard slow-consumer shape (Redis - `client-output-buffer-limit`, Kafka `buffer.memory`, Netty `WriteBufferWaterMark`). +- **Dual bound — a soft `min(count, bytes)` *backlog* budget, not a hard memory cap.** A + frame-count-only bound bakes in IBKR's small-JSON assumption; a generic transport receives + multi-MB frames (Coinbase level2 snapshot), so `N × frame_size` OOMs on venue #2. Byte-accounting + is one `usize` (`frame.len()` is already in hand), so the ring evicts oldest frames once *either* + the count *or* the accumulated bytes trips its bound, whichever comes first; the byte default is + generous (a few MB, per-venue tunable) so IBKR never touches it. The bound governs the *backlog*, + not a single in-flight frame: because the newest is never dropped (§6), a lone frame larger than + the whole byte budget is still **admitted** (older evicted, lag incremented) — so the effective + peak is `budget + one max frame`, a soft ceiling, not a strict one. This is the standard + slow-consumer shape (Redis `client-output-buffer-limit`, Kafka `buffer.memory`, Netty + `WriteBufferWaterMark`), all of which bound backlog rather than guaranteeing a hard ceiling. - **`Lagged` is a blunt, grammar-blind instrument — recorded as a consequence, not a gap.** A single global `total_lagged` cannot attribute drops to a stream (per-stream rings would need demux = venue grammar in the transport, the forbidden leak; and the dropped frames are gone). @@ -238,9 +253,11 @@ for transient loss — but only for transient loss: Classification is by `ErrorKind` (grammar-free; `WsError: HasErrorKind`, ADR-0032 Consequences), adapter-refinable via a hook (venue grammar: which close-code is permanent, à la gRPC -`UNAUTHENTICATED` vs. `UNAVAILABLE`). An optional `max_attempts → Failed{}` cap stays -**orthogonal** — voluntary give-up on a *non-critical* stream, a different axis from involuntary -permanent failure. +`UNAUTHENTICATED` vs. `UNAVAILABLE`). An optional `max_attempts` cap stays **orthogonal** — +voluntary give-up on a *non-critical* stream, a different axis from involuntary permanent failure. +This ADR does not add it to the core `ConnState` (ADR-0032 §4): if a deployment enables it, the +cap surfaces as its own terminal outcome, deliberately distinct from `Unrecoverable` (which is +*involuntary* — a classified permanent failure, not a give-up). ### 8. Send-axis `RateLimit`, the control handle, and expiry ≠ death @@ -346,6 +363,6 @@ inherits its per-attempt-auth and proactive/reactive pacing shape, inverting the for a must-maintain feed. Feeds the lifecycle channel to **ADR-0004** (risk) and **ADR-0022** (graduated failure); defers subscription replay and order recovery to the adapter per **ADR-0006** / **ADR-0003**; routes `Tracing` to the **ADR-0014** Telemetry plane; rests on **ADR-0007** -(compile-time `impl` seams, no `dyn`). Glossary unchanged — `Spawn`, `ReconnectingConnection`, -`WsControl`, `LifecycleSnapshot` are implementation vocabulary; [CONTEXT.md](../../CONTEXT.md) is +(compile-time `impl` seams, no `dyn`). Glossary unchanged — `Spawn`, `ReconnectingConnector`, +`ReconnectingConnection`, `WsControl`, `LifecycleSnapshot` are implementation vocabulary; [CONTEXT.md](../../CONTEXT.md) is domain-only, and IBKR/Binance/Coinbase values are reference data for the adapters. From cd92798ef1ff06061b632bce00519b73b8208de2 Mon Sep 17 00:00:00 2001 From: NotAProfDev <84450364+NotAProfDev@users.noreply.github.com> Date: Wed, 1 Jul 2026 18:52:53 +0000 Subject: [PATCH 4/4] docs(adr): tighten ADR-0032/0033 review prose MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Quality pass over the review-fix commit — no design change, only trims redundancy the fixes introduced: - ADR-0033 §1: drop the ReconnectingConnector explanatory paragraph (the inline comment + the following Layer->Service paragraph already carry the usage-vs-composition-seam distinction); fold the -or/-ion note into the block. - ADR-0033 §6: drop the closing clause restating "not a hard ceiling". - ADR-0033 §7: drop the parenthetical re-defining Unrecoverable. - ADR-0033 §5: tighten the canonical-epoch comment (no all-caps / mechanism noise). - ADR-0032 §4: tighten the Lagged-not-a-phase comment; make line 131 say the buffer layer "advances total_lagged (the Lagged signal)" rather than "emits Lagged", so it no longer reads as a ConnState variant parallel to Resumed/Stale. Co-Authored-By: Claude Opus 4.8 (1M context) --- ...nsport-contract-duplex-frames-lifecycle.md | 10 +++---- ...ilience-reconnect-actor-watch-lifecycle.md | 26 +++++++------------ 2 files changed, 15 insertions(+), 21 deletions(-) diff --git a/docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md b/docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md index 9ac6bc4..a73569c 100644 --- a/docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md +++ b/docs/adr/0032-websocket-transport-contract-duplex-frames-lifecycle.md @@ -109,9 +109,8 @@ enum ConnState { Unrecoverable, // a classified non-transient failure — will not self-heal (ADR-0033 §7) } // Buffer overflow (§6) is NOT a connection phase — it is orthogonal to `ConnState` -// (a connection can be `Connected` *and* lagging). The `Lagged` signal is carried as -// the monotonic cumulative `total_lagged` field on the ADR-0033 §5 `LifecycleSnapshot`, -// not as a variant here — see the delivery-form note below and §6. +// (a connection can be `Connected` *and* lagging), so the `Lagged` signal rides the +// cumulative `total_lagged` field of ADR-0033 §5's `LifecycleSnapshot`, not a variant here. ``` ADR-0033 resolves the delivery form (a `watch` of an epoch-stamped `LifecycleSnapshot`, not a @@ -128,8 +127,9 @@ resilience layer when it classifies a permanent failure rather than retrying it loop) and [ADR-0022](0022-reliable-order-path-graduated-failure.md) (graduated failure) — not just `Resumed`. - **It is the shared signal plane for the ADR-0033 layers** — the reconnect layer emits - `Resumed`, the heartbeat layer emits `Stale`, the buffer layer emits `Lagged` (§6). This is - the WS analogue of what `ErrorKind`/Telemetry is to the HTTP stack. + `Resumed`, the heartbeat layer emits `Stale`, the buffer layer advances `total_lagged` + (the `Lagged` signal, §6). This is the WS analogue of what `ErrorKind`/Telemetry is to the + HTTP stack. - **Epoch-stamped.** `Resumed{epoch}` bounds the adapter's reconcile window ("reconcile since epoch N") and disambiguates an in-flight `umd-` queued against the dead connection vs. the new one. Ordering is free: a fresh session is silent until resubscribe, so `Resumed` strictly diff --git a/docs/adr/0033-websocket-resilience-reconnect-actor-watch-lifecycle.md b/docs/adr/0033-websocket-resilience-reconnect-actor-watch-lifecycle.md index c1d4e9a..ce4fdcc 100644 --- a/docs/adr/0033-websocket-resilience-reconnect-actor-watch-lifecycle.md +++ b/docs/adr/0033-websocket-resilience-reconnect-actor-watch-lifecycle.md @@ -69,17 +69,12 @@ struct ReconnectingConnection { // usage — new here; what control: WsControl, // force_reconnect(), shutdown() } -trait ReconnectingConnector { // usage seam — the assembled-stack analogue of WsConnector - fn connect(&self, h: http::Request<()>) // (note: -or the factory trait, -ion the product struct) +trait ReconnectingConnector { // usage seam — the factory `stack()`/`build()` return (§9); + fn connect(&self, h: http::Request<()>) // the assembled-stack analogue of the leaf `WsConnector` -> impl Future> + Send; -} +} // (-or = the factory trait, -ion = its product struct) ``` -`ReconnectingConnector` is to `ReconnectingConnection` what `WsConnector` is to the ADR-0032 §2 -triple: the factory trait `stack()`/`build()` return (§9), whose `connect` hands the adapter the -richer product exactly once. It is the *usage-seam* peer of the internal *composition-seam* -`WsConnector`. - The reconnect layer is the exact **`Layer → Service` analogue** from tower: you compose `Layer`s but *hold* a `Service` (`ServiceBuilder` yields a `Buffer>` used as a `Service`, never as a `Layer`). Composition unit ≠ product type. So the leaf never grows a @@ -164,9 +159,9 @@ snapshot**, *not* a transition stream and *not* a naive watch of bare `ConnState ```rust struct LifecycleSnapshot { phase: ConnState, // level: Connected/Stale/Reconnecting/Resumed/Unrecoverable - epoch: u64, // monotonic: bumped on every completed down-cycle. Canonical: the epoch - // echoed inside Connected{epoch}/Resumed{epoch} is this same value - // (both set from one counter) — consumers diff THIS field. + epoch: u64, // monotonic: bumped on every completed down-cycle. Canonical source of + // truth — the value echoed in Connected{epoch}/Resumed{epoch} is this + // field; consumers diff it. down_since: Option, attempts: u64, // monotonic total_lagged: u64, // monotonic cumulative — NOT a per-event delta (see §6) @@ -222,7 +217,7 @@ emit `Lagged`; control bypasses; per-stream policy adapter-side). The mechanism: the whole byte budget is still **admitted** (older evicted, lag incremented) — so the effective peak is `budget + one max frame`, a soft ceiling, not a strict one. This is the standard slow-consumer shape (Redis `client-output-buffer-limit`, Kafka `buffer.memory`, Netty - `WriteBufferWaterMark`), all of which bound backlog rather than guaranteeing a hard ceiling. + `WriteBufferWaterMark`). - **`Lagged` is a blunt, grammar-blind instrument — recorded as a consequence, not a gap.** A single global `total_lagged` cannot attribute drops to a stream (per-stream rings would need demux = venue grammar in the transport, the forbidden leak; and the dropped frames are gone). @@ -253,11 +248,10 @@ for transient loss — but only for transient loss: Classification is by `ErrorKind` (grammar-free; `WsError: HasErrorKind`, ADR-0032 Consequences), adapter-refinable via a hook (venue grammar: which close-code is permanent, à la gRPC -`UNAUTHENTICATED` vs. `UNAVAILABLE`). An optional `max_attempts` cap stays **orthogonal** — +`UNAUTHENTICATED` vs. `UNAVAILABLE`). An optional `max_attempts` cap stays **orthogonal** — a voluntary give-up on a *non-critical* stream, a different axis from involuntary permanent failure. -This ADR does not add it to the core `ConnState` (ADR-0032 §4): if a deployment enables it, the -cap surfaces as its own terminal outcome, deliberately distinct from `Unrecoverable` (which is -*involuntary* — a classified permanent failure, not a give-up). +This ADR does not add it to the core `ConnState` (ADR-0032 §4): if enabled, it surfaces as its own +terminal outcome, distinct from `Unrecoverable`. ### 8. Send-axis `RateLimit`, the control handle, and expiry ≠ death