richat: add subscribe handshake observability #207

Merged: fanatid merged 5 commits into lamports-dev:master from mindrunner:feat/subscribe-observability on May 19, 2026

@mindrunner

Collaborator

Add three metrics to diagnose 3-5s zero-byte stalls observed on new gRPC subscribers in EWR/LAX that cause clients to send RST_STREAM before any data reaches them. No behavioral fix yet — we first want one deploy's worth of data to tell us which latency actually dominates.

Metrics added:

  • grpc_subscribe_filter_parse_seconds (histogram, labeled by x_subscription_id): seconds from subscribe2() entry until the SubscribeRequest is read off the wire and parsed into a Filter.
  • grpc_subscribe_time_to_first_message_seconds (histogram, labeled by x_subscription_id): seconds from the filter being applied until the worker loop pushes the first data message to the client. This is the latency the client actually perceives as "time to first byte of data".
  • grpc_subscribe_handshake_abandoned_total (counter, labeled by x_subscription_id): incremented when the client's request stream ends before a filter is ever set, i.e. the client disconnected mid-handshake. Tells us how many stalled subscriptions are abandoned before the filter is set rather than after it.

Plus a one-shot warn! log inside the ping task when a client has been connected for >3s without a filter set. Threshold matches the observed client timeout window.
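
A minimal sketch of the one-shot behaviour described above, not the code from this PR: `StallWarning`, `should_warn`, and `STALL_WARN_THRESHOLD` are hypothetical names, and the ping task is assumed to call `should_warn` on every tick with the connection age and whether a filter has been set.

```rust
use std::time::Duration;

/// Assumed threshold, taken from the ">3s" in the description.
pub const STALL_WARN_THRESHOLD: Duration = Duration::from_secs(3);

/// Hypothetical per-connection state owned by the ping task.
#[derive(Debug, Default)]
pub struct StallWarning {
    fired: bool,
}

impl StallWarning {
    /// Returns true at most once: the first time the connection is older
    /// than the threshold while no filter has been set.
    pub fn should_warn(&mut self, connected_for: Duration, filter_set: bool) -> bool {
        if self.fired || filter_set || connected_for <= STALL_WARN_THRESHOLD {
            return false;
        }
        self.fired = true;
        true
    }
}
```

In this sketch the caller would emit the `warn!` line only when `should_warn` returns true, so a connection that stays stalled across many ping ticks logs once rather than on every tick.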

Only the initial unset -> set transition is recorded for both histograms; subsequent filter updates (commitment change, etc.) are not part of the subscribe handshake and would skew the tails.
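
The measurement points for the three metrics, together with the first-transition rule, can be sketched as a small state tracker. This is an illustration under assumptions, not richat's implementation: `HandshakeTracker` and its method names are hypothetical, timestamps are passed in explicitly to keep the logic testable, and the returned values stand in for the histogram observations and counter increment.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-subscription tracker (not a richat type).
#[derive(Debug)]
pub struct HandshakeTracker {
    entered_at: Instant,            // subscribe2() entry
    filter_set_at: Option<Instant>, // first unset -> set transition
    first_message_seen: bool,
}

impl HandshakeTracker {
    pub fn new(entered_at: Instant) -> Self {
        Self { entered_at, filter_set_at: None, first_message_seen: false }
    }

    /// Value for grpc_subscribe_filter_parse_seconds.
    /// Only the first unset -> set transition yields an observation;
    /// later filter updates return None.
    pub fn on_filter_set(&mut self, now: Instant) -> Option<Duration> {
        if self.filter_set_at.is_some() {
            return None;
        }
        self.filter_set_at = Some(now);
        Some(now.duration_since(self.entered_at))
    }

    /// Value for grpc_subscribe_time_to_first_message_seconds.
    /// Yields an observation once, for the first data message pushed
    /// after the filter was set.
    pub fn on_data_message(&mut self, now: Instant) -> Option<Duration> {
        let set_at = self.filter_set_at?;
        if self.first_message_seen {
            return None;
        }
        self.first_message_seen = true;
        Some(now.duration_since(set_at))
    }

    /// True when the request stream ended before any filter was set,
    /// i.e. grpc_subscribe_handshake_abandoned_total should be incremented.
    pub fn on_request_stream_end(&self) -> bool {
        self.filter_set_at.is_none()
    }
}
```

Returning `Option<Duration>` keeps the "record once" decision in one place: the caller observes the histogram only when it receives `Some`, so a later commitment change cannot add a second sample.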

@mindrunner mindrunner requested a review from fanatid April 21, 2026 11:38
@mindrunner mindrunner force-pushed the feat/subscribe-observability branch from e4a15ba to 3151a73 Compare April 28, 2026 09:52
mindrunner and others added 5 commits May 19, 2026 10:29
@fanatid fanatid force-pushed the feat/subscribe-observability branch from 675782f to a74ab39 Compare May 19, 2026 15:30
@fanatid fanatid merged commit 97e2e59 into lamports-dev:master May 19, 2026
2 checks passed
@fanatid

fanatid commented May 19, 2026

Member

released as richat@10.1.0 - https://github.com/lamports-dev/richat/releases/tag/richat-v10.1.0
