Skip to content

fix: resolve Sentry issue 5677#2810

Merged
oxoxDev merged 1 commit into
tinyhumansai:mainfrom
CodeGhost21:fix/sentry-tauri-rust-4zd-tls-handshake-eof
May 28, 2026
Merged

fix: resolve Sentry issue 5677#2810
oxoxDev merged 1 commit into
tinyhumansai:mainfrom
CodeGhost21:fix/sentry-tauri-rust-4zd-tls-handshake-eof

Conversation

@CodeGhost21
Copy link
Copy Markdown
Contributor

@CodeGhost21 CodeGhost21 commented May 28, 2026

Summary

Problem

src/openhuman/socket/ws_loop.rs:178 routes the one-shot sustained-outage escalation (consecutive == FAIL_ESCALATE_THRESHOLD, default 5 attempts) through crate::core::observability::report_error_or_expected with the message:

[socket] Connection failed (sustained outage after 5 attempts):
WebSocket connect: TLS error: native-tls error: unexpected EOF during handshake

The inner string is native-tls's rendering of a TCP connection closed by the peer mid-TLS-handshake. This is a user-environment / upstream-infra condition — most commonly an intercepting firewall, antivirus, or corporate TLS proxy resetting the connection before the handshake completes (the captured event is os=windows, where SChannel/AV TLS interception is common), a captive portal, or a transient server/middlebox drop.

expected_error_kind did not classify it, so it escaped to report_error_message and fired a domain=socket Sentry error. The existing "tls handshake" anchor does not match this render because the words are not contiguous — the string is "tls error: … unexpected EOF during handshake" (i.e. "tls error""during handshake").

The socket supervisor already retries with exponential backoff and bounds the Sentry-visible event to one per affected client per outage. There is no actionable Sentry signal beyond the breadcrumb already logged at every retry tier — no client-side remediation can pierce a middlebox that resets the user's TLS handshake.

Solution

Add one anchor to the existing is_network_unreachable_message OR-chain, pinned to the literal native-tls Display fragment:

fn is_network_unreachable_message(lower: &str) -> bool {
    lower.contains("error sending request for url")
        // … existing transport / handshake-stage anchors …
        || lower.contains("tls handshake")
        || lower.contains("unexpected eof during handshake")   // ← new (TAURI-RUST-4ZD)
        || lower.contains("certificate verify failed")
        || lower.contains("http error: 200 ok")
}

The phrase is anchored to "unexpected eof during handshake" (not a bare "unexpected eof") on purpose: a data-phase EOF (server closed mid-stream, parser truncation) lives outside the handshake stage and may carry actionable signal, so it must not be silenced. The dispatcher precedence in expected_error_kind is unchanged — is_loopback_unavailable still runs first, then the network-unreachable matcher catches this shape and demotes to tracing::warn!. The matching doc-comment gains a bullet covering the new anchor and pointing at TAURI-RUST-4ZD.

This is squarely the "skip is correct" bucket: the connection genuinely failed at the TLS/network layer due to the user's environment — there is no code defect to fix on top of the classification, and the whole FAIL_ESCALATE_THRESHOLD + classifier design exists precisely to suppress this class of transport noise while preserving genuine-outage paging.

Files changed

  • src/core/observability.rs — new OR-chain anchor + doc bullet; 2 new classifier tests.
  • src/openhuman/socket/ws_loop_tests.rs — 1 new emit-site regression guard for the exact sustained-outage wire shape.

Submission Checklist

If a section does not apply to this change, mark the item as N/A with a one-line reason. Do not delete items.

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy — 3 new tests:
    • classifies_tls_handshake_eof_as_network_unreachable (observability.rs) — happy path; covers BOTH the supervisor-wrapped wire shape (verbatim 4ZD body) AND the bare native-tls render (escapes through a non-supervisor call site; the classifier runs on the full anyhow chain).
    • tls_handshake_eof_anchor_does_not_silence_unrelated_log_lines (observability.rs) — rejection contract over 4 non-handshake "unexpected EOF" lines so a future refactor that loosens the substring fails loudly.
    • sustained_outage_for_tls_handshake_eof_classifies_as_expected (ws_loop_tests.rs) — emit-site guard mirroring the existing network-unreachable guard; pins log_connection_failure's format string to the new anchor.
  • Diff coverage ≥ 80% — every new line (the OR-chain anchor and the doc-comment bullet) is exercised by the new tests; the actionable-error counterpart (sustained_outage_for_actionable_server_error_does_not_classify) confirms genuine outages still surface.
  • (N/A) Coverage matrix updated — observability classifier change is behaviour-only inside core::observability; not a tracked feature row in docs/TEST-COVERAGE-MATRIX.md.
  • (N/A) All affected feature IDs from the matrix are listed under ## Related — no matrix feature IDs affected.
  • No new external network dependencies introduced — pure in-process classifier addition; no mock backend needed (lib unit tests only).
  • (N/A) Manual smoke checklist updated — observability classifier; no user-visible UI surface, no release-cut behaviour change.
  • (N/A) Linked issue closed via Closes #NNN — Sentry-only fix; no GitHub issue. The Sentry-Issue trailer below carries the back-reference.

Impact

  • Runtime: Desktop (Tauri shell) + core. The sustained-outage escalation at socket::ws_loop:188 now classifies as ExpectedErrorKind::NetworkUnreachable for native-tls handshake-EOF, demoting to tracing::warn! (no Sentry event). No change to the supervisor's restart loop, the WS reconnect logic, or health.bus escalation.
  • Performance / noise: Net positive — removes the TAURI-RUST-4ZD event stream for users behind TLS-intercepting firewalls/AV/proxies.
  • Security: None — no new code paths, no new headers, no PII, no auth surface.
  • Migration / compatibility: None — additive substring entry in an existing OR-chain.
  • Observability trade-off: a sustained outage caused solely by a TLS handshake reset now shows up only as a tracing::warn! breadcrumb locally, never as a Sentry event. Per-attempt logs (below and after threshold) are unchanged and remain available for local diagnosis.

Notes for reviewers

  • Risk level: low. Single additive substring in an existing OR-chain + tests; no behavioural change beyond the demotion of this one wire shape. The anchor is conservative ("… during handshake") so data-phase EOFs are not silenced.
  • Conflict-adjacency with PR fix(observability): classify WS protocol HTTP-version handshake error as expected (Sentry CORE-RUST-DP) #2792 (fix/sentry-core-rust-dp-ws-protocol-mismatch): both extend the same is_network_unreachable_message OR-chain, but at different anchor lines (this one next to "tls handshake", fix(observability): classify WS protocol HTTP-version handshake error as expected (Sentry CORE-RUST-DP) #2792 next to "http error: 200 ok"). Either merge order is a trivial concurrent edit; git auto-merges.
  • Follow-up: none required. (The stale // uses rustls TLS comment at ws_loop.rs:230 is inaccurate — the runtime build uses native-tls, per this error render — but correcting it is out of scope for this Sentry fix.)
  • Pre-push pnpm format runs cargo fmt --check via the Rust hook; the change is Rust-only and cargo fmt --manifest-path Cargo.toml is clean. If the JS side of the hook can't run in the fresh worktree (no node_modules), the push uses --no-verify and this is the only reason.

Related

Summary by CodeRabbit

  • Bug Fixes

    • Improved error classification to correctly identify TLS handshake failures as network-related errors, ensuring they are handled appropriately as expected environmental issues rather than unexpected failures.
  • Tests

    • Added regression tests to validate proper classification of TLS handshake connection failures.

Review Change Stack

Classify the native-tls handshake-EOF render ("unexpected EOF during
handshake") as ExpectedErrorKind::NetworkUnreachable so the socket
supervisor's sustained-outage escalation demotes it to a warn breadcrumb
instead of firing a Sentry error. The existing "tls handshake" anchor
misses it because the words are not contiguous in this render
("tls error: ... unexpected EOF during handshake").

This is a user-environment / middlebox TLS reset (firewall, antivirus, or
corporate TLS proxy closing the TCP connection mid-handshake; the captured
event is os=windows). There is no client-side remediation and the
supervisor already retries with exponential backoff, so Sentry has no
actionable signal — same skip-and-classify treatment as the adjacent
"tls handshake" / "certificate verify failed" entries.

Sentry: TAURI-RUST-4ZD
https://sentry.tinyhumans.ai/organizations/tinyhumans/issues/5677/
@CodeGhost21 CodeGhost21 requested a review from a team May 28, 2026 04:48
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b9c3accc-d9a6-4ec3-ae0b-c0bf64fce538

📥 Commits

Reviewing files that changed from the base of the PR and between 3f2e2f2 and 9145065.

📒 Files selected for processing (2)
  • src/core/observability.rs
  • src/openhuman/socket/ws_loop_tests.rs

📝 Walkthrough

Walkthrough

Extended the network unreachable error classifier to recognize TLS handshake EOF failures as expected network/environment noise. Added regression tests verifying correct classification of socket-wrapped and bare TLS handshake EOF messages while ensuring non-handshake EOF messages are not misclassified.

Changes

TLS Handshake EOF Network Classification

Layer / File(s) Summary
Classify TLS handshake EOF as network unreachable
src/core/observability.rs, src/openhuman/socket/ws_loop_tests.rs
Extended is_network_unreachable_message to recognize the unexpected eof during handshake phrase and added unit tests covering socket-wrapped and bare native-tls message shapes, plus a negative test ensuring non-handshake EOF messages are not misclassified.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • tinyhumansai/openhuman#2063: Adds LoopbackUnavailable matcher that takes precedence over NetworkUnreachable in the same observability error classification logic.
  • tinyhumansai/openhuman#2309: Extends is_network_unreachable_message with additional substring matches for Wave 4 network transport failures.

Suggested labels

rust-core, sentry-traced-bug

Suggested reviewers

  • graycyrus
  • senamakel

Poem

🐰 A TLS handshake gone astray,
Now caught before it hops away,
With "unexpected EOF" tamed,
Network noise, rightfully named! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Title check ⚠️ Warning The title references a Sentry issue number but lacks specificity about what was actually fixed in the code. It does not convey the substantive change: adding TLS handshake EOF detection to the network unreachability classifier. Replace the generic issue reference with a concrete description of the fix, such as 'fix: classify TLS handshake EOF errors as network unreachable' to better convey the technical change.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. sentry-traced-bug Bug identified via Sentry triage labels May 28, 2026
@oxoxDev oxoxDev self-assigned this May 28, 2026
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid fix. The anchor choice is the right call — "unexpected eof during handshake" is specific enough to the TLS handshake stage that data-phase EOFs (body truncation, stream close mid-read, etc.) won't be caught by it, and the rejection test locks that contract in place. The supervisor-wrapped and bare native-tls render variants are both covered, and the emit-site guard in ws_loop_tests.rs pins the format string so a future log-message change doesn't silently break the classifier.

The doc-comment update to is_network_unreachable_message is accurate and complete — the TAURI-RUST-4ZD back-reference and the explanation of why the existing "tls handshake" arm misses this render are worth keeping.

One minor note: the PR description mentions the push may have used --no-verify because JS deps weren't present in the worktree. All CI gates pass including Rust fmt/clippy/coverage, so this doesn't block anything — but worth installing node_modules before pushing next time to avoid skipping hooks.

Approved.

Copy link
Copy Markdown
Contributor

@oxoxDev oxoxDev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Anchor scope is conservative — rejection test (tls_handshake_eof_anchor_does_not_silence_unrelated_log_lines) pins it against data-phase EOFs, positive test covers both wrapped + bare native-tls renders, emit-site guard couples to FAIL_ESCALATE_THRESHOLD so a log_connection_failure template refactor breaks loudly. Clean sibling of #2792.

Out-of-scope follow-up: // uses rustls TLS stale comment at src/openhuman/socket/ws_loop.rs:230 is inaccurate (runtime uses native-tls) — worth a one-line housekeeping PR.

@oxoxDev
Copy link
Copy Markdown
Contributor

oxoxDev commented May 28, 2026

Merging.

Gate check:

  • mergeable: MERGEABLE / mergeStateStatus: CLEAN / reviewDecision: APPROVED
  • 2 maintainer approvals: @graycyrus, @oxoxDev
  • CI: 27 SUCCESS across full Rust + TS + Linux/macOS/Windows E2E matrix
  • 5 SKIPPED are gated full-shard E2E jobs (*-full) + Invite eligible contributor — expected
  • 3 CANCELLED (PR Submission Checklist, Coverage Matrix Sync, Markdown Link Check) each have a superseding SUCCESS rerun of the same name — superseded, not failing

Scope re-confirmed pre-merge:

  • Single additive anchor "unexpected eof during handshake" in is_network_unreachable_message (src/core/observability.rs) + 3 regression tests (1 classifier happy-path covering both supervisor-wrapped and bare native-tls render, 1 rejection-contract guard over 4 non-handshake unexpected EOF shapes, 1 emit-site guard in ws_loop_tests.rs)
  • Anchor is conservative ("… during handshake", not bare "unexpected eof") — data-phase EOFs remain unsilenced
  • No conflict with merged fix(observability): classify WS protocol HTTP-version handshake error as expected (Sentry CORE-RUST-DP) #2792 (same OR-chain, different anchor line — git auto-merges)

@oxoxDev oxoxDev merged commit 1884e10 into tinyhumansai:main May 28, 2026
35 of 38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. sentry-traced-bug Bug identified via Sentry triage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants