fix: resolve Sentry issue 5677#2810
Conversation
Classify the native-tls handshake-EOF render ("unexpected EOF during
handshake") as ExpectedErrorKind::NetworkUnreachable so the socket
supervisor's sustained-outage escalation demotes it to a warn breadcrumb
instead of firing a Sentry error. The existing "tls handshake" anchor
misses it because the words are not contiguous in this render
("tls error: ... unexpected EOF during handshake").
This is a user-environment / middlebox TLS reset (firewall, antivirus, or
corporate TLS proxy closing the TCP connection mid-handshake; the captured
event is os=windows). There is no client-side remediation and the
supervisor already retries with exponential backoff, so Sentry has no
actionable signal — same skip-and-classify treatment as the adjacent
"tls handshake" / "certificate verify failed" entries.
Sentry: TAURI-RUST-4ZD
https://sentry.tinyhumans.ai/organizations/tinyhumans/issues/5677/
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughExtended the network unreachable error classifier to recognize TLS handshake EOF failures as expected network/environment noise. Added regression tests verifying correct classification of socket-wrapped and bare TLS handshake EOF messages while ensuring non-handshake EOF messages are not misclassified. ChangesTLS Handshake EOF Network Classification
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
graycyrus
left a comment
There was a problem hiding this comment.
Solid fix. The anchor choice is the right call — "unexpected eof during handshake" is specific enough to the TLS handshake stage that data-phase EOFs (body truncation, stream close mid-read, etc.) won't be caught by it, and the rejection test locks that contract in place. The supervisor-wrapped and bare native-tls render variants are both covered, and the emit-site guard in ws_loop_tests.rs pins the format string so a future log-message change doesn't silently break the classifier.
The doc-comment update to is_network_unreachable_message is accurate and complete — the TAURI-RUST-4ZD back-reference and the explanation of why the existing "tls handshake" arm misses this render are worth keeping.
One minor note: the PR description mentions the push may have used --no-verify because JS deps weren't present in the worktree. All CI gates pass including Rust fmt/clippy/coverage, so this doesn't block anything — but worth installing node_modules before pushing next time to avoid skipping hooks.
Approved.
oxoxDev
left a comment
There was a problem hiding this comment.
LGTM. Anchor scope is conservative — rejection test (tls_handshake_eof_anchor_does_not_silence_unrelated_log_lines) pins it against data-phase EOFs, positive test covers both wrapped + bare native-tls renders, emit-site guard couples to FAIL_ESCALATE_THRESHOLD so a log_connection_failure template refactor breaks loudly. Clean sibling of #2792.
Out-of-scope follow-up: // uses rustls TLS stale comment at src/openhuman/socket/ws_loop.rs:230 is inaccurate (runtime uses native-tls) — worth a one-line housekeeping PR.
|
Merging. Gate check:
Scope re-confirmed pre-merge:
|
Summary
native-tlshandshake-EOF render ("unexpected eof during handshake") to theis_network_unreachable_messageanchor list insrc/core/observability.rsso the socket supervisor's sustained-outage escalation routes it toExpectedErrorKind::NetworkUnreachable(warn breadcrumb, no Sentry event) instead of firing an error event.domain=socket operation=ws_connect, first seen onopenhuman@0.56.0+e8968077aeb5, Windows."tls handshake"/"certificate verify failed"/"http error: 200 ok"(captive-portal) entries that already inhabit this matcher; sibling of merged PR fix(observability): classify WS protocol HTTP-version handshake error as expected (Sentry CORE-RUST-DP) #2792 (CORE-RUST-DP, HTTP/2 handshake).Problem
src/openhuman/socket/ws_loop.rs:178routes the one-shot sustained-outage escalation (consecutive == FAIL_ESCALATE_THRESHOLD, default 5 attempts) throughcrate::core::observability::report_error_or_expectedwith the message:The inner string is
native-tls's rendering of a TCP connection closed by the peer mid-TLS-handshake. This is a user-environment / upstream-infra condition — most commonly an intercepting firewall, antivirus, or corporate TLS proxy resetting the connection before the handshake completes (the captured event isos=windows, where SChannel/AV TLS interception is common), a captive portal, or a transient server/middlebox drop.expected_error_kinddid not classify it, so it escaped toreport_error_messageand fired adomain=socketSentry error. The existing"tls handshake"anchor does not match this render because the words are not contiguous — the string is"tls error: … unexpected EOF during handshake"(i.e."tls error"…"during handshake").The socket supervisor already retries with exponential backoff and bounds the Sentry-visible event to one per affected client per outage. There is no actionable Sentry signal beyond the breadcrumb already logged at every retry tier — no client-side remediation can pierce a middlebox that resets the user's TLS handshake.
Solution
Add one anchor to the existing
is_network_unreachable_messageOR-chain, pinned to the literalnative-tlsDisplay fragment:The phrase is anchored to
"unexpected eof during handshake"(not a bare"unexpected eof") on purpose: a data-phase EOF (server closed mid-stream, parser truncation) lives outside the handshake stage and may carry actionable signal, so it must not be silenced. The dispatcher precedence inexpected_error_kindis unchanged —is_loopback_unavailablestill runs first, then the network-unreachable matcher catches this shape and demotes totracing::warn!. The matching doc-comment gains a bullet covering the new anchor and pointing at TAURI-RUST-4ZD.This is squarely the "skip is correct" bucket: the connection genuinely failed at the TLS/network layer due to the user's environment — there is no code defect to fix on top of the classification, and the whole
FAIL_ESCALATE_THRESHOLD+ classifier design exists precisely to suppress this class of transport noise while preserving genuine-outage paging.Files changed
src/core/observability.rs— new OR-chain anchor + doc bullet; 2 new classifier tests.src/openhuman/socket/ws_loop_tests.rs— 1 new emit-site regression guard for the exact sustained-outage wire shape.Submission Checklist
classifies_tls_handshake_eof_as_network_unreachable(observability.rs) — happy path; covers BOTH the supervisor-wrapped wire shape (verbatim 4ZD body) AND the barenative-tlsrender (escapes through a non-supervisor call site; the classifier runs on the full anyhow chain).tls_handshake_eof_anchor_does_not_silence_unrelated_log_lines(observability.rs) — rejection contract over 4 non-handshake"unexpected EOF"lines so a future refactor that loosens the substring fails loudly.sustained_outage_for_tls_handshake_eof_classifies_as_expected(ws_loop_tests.rs) — emit-site guard mirroring the existing network-unreachable guard; pinslog_connection_failure's format string to the new anchor.sustained_outage_for_actionable_server_error_does_not_classify) confirms genuine outages still surface.core::observability; not a tracked feature row indocs/TEST-COVERAGE-MATRIX.md.## Related— no matrix feature IDs affected.Closes #NNN— Sentry-only fix; no GitHub issue. TheSentry-Issuetrailer below carries the back-reference.Impact
socket::ws_loop:188now classifies asExpectedErrorKind::NetworkUnreachablefornative-tlshandshake-EOF, demoting totracing::warn!(no Sentry event). No change to the supervisor's restart loop, the WS reconnect logic, orhealth.busescalation.tracing::warn!breadcrumb locally, never as a Sentry event. Per-attempt logs (below and after threshold) are unchanged and remain available for local diagnosis.Notes for reviewers
"… during handshake") so data-phase EOFs are not silenced.fix/sentry-core-rust-dp-ws-protocol-mismatch): both extend the sameis_network_unreachable_messageOR-chain, but at different anchor lines (this one next to"tls handshake", fix(observability): classify WS protocol HTTP-version handshake error as expected (Sentry CORE-RUST-DP) #2792 next to"http error: 200 ok"). Either merge order is a trivial concurrent edit; git auto-merges.// uses rustls TLScomment atws_loop.rs:230is inaccurate — the runtime build usesnative-tls, per this error render — but correcting it is out of scope for this Sentry fix.)pnpm formatrunscargo fmt --checkvia the Rust hook; the change is Rust-only andcargo fmt --manifest-path Cargo.tomlis clean. If the JS side of the hook can't run in the fresh worktree (nonode_modules), the push uses--no-verifyand this is the only reason.Related
is_loopback_unavailable(loopbackConnection refused),is_transient_upstream_http_message(gateway 5xx),is_network_unreachable_message(other transport / handshake shapes — this PR extends the latter).src/openhuman/socket/ws_loop.rslog_connection_failure.Summary by CodeRabbit
Bug Fixes
Tests