fix(observability): demote channel supervisor restart noise (Sentry TAURI-RUST-15)#2691
fix(observability): demote channel supervisor restart noise (Sentry TAURI-RUST-15)#2691oxoxDev wants to merge 4 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds ChangesChannel Supervisor Restart Classification and Demotion
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
graycyrus
left a comment
There was a problem hiding this comment.
Clean observability fix — well-scoped, well-tested, additive only.
What this does: Adds ChannelSupervisorRestart classifier tier to demote noisy per-restart Sentry events (TAURI-RUST-15 ~11.4k events, TAURI-RUST-BB ~815 events) to info! breadcrumbs. Supervisor wrapper format is language-agnostic, catching both English Discord gateway bodies and OS-localized variants (Chinese WSAETIMEDOUT) that slip through the English-only is_network_unreachable_message anchors.
| Area | Verdict |
|---|---|
| Correctness | Three-anchor predicate (starts_with("channel ") + contains(" error:") + contains("; restarting")) is tight enough to avoid false positives while catching all supervisor-wrapped bodies regardless of locale. Precedence before NetworkUnreachable/LoopbackUnavailable is correct and well-commented. |
| Tests | 6 new tests: English body, Chinese WSAETIMEDOUT, multi-channel names, precedence pinning, negative cases (4 rejection patterns), smoke test through report_error_or_expected. Solid coverage. |
| Safety | health.bus / FAIL_ESCALATE_THRESHOLD path for sustained outages is untouched — no observability blind spot introduced. Existing demotion paths preserved for non-supervisor call sites. |
| Performance | Net positive — eliminates ~12k Sentry events/14d. |
098856e to
43b04ae
Compare
|
Actionable comments posted: 0 |
bab1991 to
34bd215
Compare
|
Actionable comments posted: 0 |
…AURI-RUST-15) Self-hosted Sentry's #1 unresolved tauri-rust issue by event count (`Channel discord error: error sending request for url ...; restarting`, ~11.4 k events / 14d) and its Chinese-Windows WSAETIMEDOUT variant (TAURI-RUST-BB, ~815 events) both originate from the channel supervisor loop in `channels::runtime::supervision::spawn_supervised_listener`. The supervisor already restarts the listener with its own exponential backoff; sustained outages still surface through `health.bus` / `FAIL_ESCALATE_THRESHOLD`. Per-restart messages carry no actionable Sentry signal. Previous path: `expected_error_kind` matched the English Discord body against `is_network_unreachable_message`, which demotes to `tracing::warn!` — still a Sentry event (just at lower severity). The Chinese-Windows variant escaped the English-only anchors entirely and emitted as a full Sentry error. Fix: add a new `ChannelSupervisorRestart` classifier tier anchored on the Rust supervisor wrapper format (`"Channel <name> error: <inner>; restarting"`) — language-agnostic so it covers OS-localized inner errors. Precedence is checked BEFORE `is_loopback_unavailable` and `is_network_unreachable_message` so the supervisor wrap always wins. Demotes to `tracing::info!` (breadcrumb only — no Sentry event). Tests cover: English Discord gateway shape, Chinese WSAETIMEDOUT variant, four additional channel names (slack/telegram/whatsapp/ gmessages), precedence over `NetworkUnreachable`, rejection of generic non-supervisor restart notes (`systemd: docker.service; restarting`), and a smoke test routing the verbatim Sentry body through `report_error_or_expected`. Sentry-Issue: TAURI-RUST-15 Sentry-Issue: TAURI-RUST-BB
CI workflows (test, build, coverage, etc.) trigger on default pull_request types (opened/synchronize/reopened); they did not fire when this PR was first opened. Force a synchronize event so the full matrix runs.
…rvisorRestart The new `is_channel_supervisor_restart_message` classifier added in this PR takes precedence over `is_network_unreachable_message` in `expected_error_kind`. The pre-existing supervision test `supervision_discord_gateway_reqwest_failure_classifies_as_expected` asserted `NetworkUnreachable` — update it to assert `ChannelSupervisorRestart`, matching the new precedence + the broader language-agnostic anchor introduced for TAURI-RUST-15/-BB. Sentry-Issue: TAURI-RUST-15
34bd215 to
8fc2116
Compare
|
Actionable comments posted: 0 |
…SupervisorRestart After rebase onto current `upstream/main`, the existing test `channel_supervisor_operation_timed_out_classifies_as_expected` (added in a sibling PR before this rebase landed) now hits the new `ChannelSupervisorRestart` precedence path instead of `NetworkUnreachable`. The new classifier is the broader anchor — it covers every ETIMEDOUT / WSAETIMEDOUT / hyper-prose supervisor-wrap shape the old test pinned, plus OS-localized variants the English-only `NetworkUnreachable` would have missed. Update the assertion + comment to reflect the new precedence and tier difference (`ChannelSupervisorRestart` demotes to `info!`, vs `warn!` for `NetworkUnreachable`). Sentry-Issue: TAURI-RUST-15
8fc2116 to
e4f553f
Compare
|
The CI failure here is a stale-base issue — Opened #2879 with the 3 functional commits rebased on current |
|
Closing in favour of #2879 (rebased on current main, stale-base CI failure resolved). |
Summary
ExpectedErrorKind::ChannelSupervisorRestartclassifier tier insrc/core/observability.rs, demoting per-restart messages fromchannels::runtime::supervision::spawn_supervised_listenerto atracing::info!breadcrumb (no Sentry event).is_channel_supervisor_restart_messagepredicate anchors on the Rust supervisor wrapper format ("Channel <name> error: <inner>; restarting") — language-agnostic, so it catches OS-localized inner errors that the English-onlyis_network_unreachable_messageanchors miss.is_loopback_unavailableandis_network_unreachable_messageso the supervisor wrap always wins over its inner-error substrings.tauri-rusttop-1 by event count (TAURI-RUST-15, ~11.4 k events / 14d) and its Chinese-Windows WSAETIMEDOUT variant (TAURI-RUST-BB, ~815 events).Problem
The supervised-listener loop at
src/openhuman/channels/runtime/supervision.rs:60reports every transient channel restart throughreport_error_or_expected:The supervisor restarts the listener with its own exponential backoff and sustained outages surface via
health.bus/FAIL_ESCALATE_THRESHOLD(a separate path). The per-restart message itself carries no actionable Sentry signal.Two failure modes today:
expected_error_kind, matchesis_network_unreachable_message(because the inner reqwest error contains"error sending request for url"), and demotes totracing::warn!. The Sentry tracing layer still captureswarn!as a Sentry event (just at lower severity). Across multi-user × Discord-channel × intermittent-network this produces 11,408 events / 14d on self-hostedtauri-rust.IO error: 由于连接方在一段时间后没有正确答复或连接的主机没有反应... (os error 10060)) doesn't match any English-only anchor inis_network_unreachable_message, so it escapes classification entirely and emits as a full Sentry error (TAURI-RUST-BB, ~815 events).Solution
Add a new
ChannelSupervisorRestartclassifier tier that:"channel "prefix +" error:"separator +"; restarting"trailer) so it covers OS-localized variants without false-positiving on generic restart logs.NetworkUnreachableandLoopbackUnavailableso the supervisor wrap always wins over its inner-error substrings.tracing::info!— breadcrumb only, no Sentry event. Sustained outages still page via the separatehealth.bus/FAIL_ESCALATE_THRESHOLDpath, which is unaffected by this change.Existing demotion paths (
NetworkUnreachable→warn!,LoopbackUnavailable→debug!) are preserved for non-supervisor call sites.Submission Checklist
src/core/observability.rs(English Discord shape, Chinese WSAETIMEDOUT, four additional channel names, precedence overNetworkUnreachable, rejection of generic non-supervisor restart logs, smoke test routing throughreport_error_or_expected).report_error_or_expectedsmoke test covers the dispatch path end-to-end.core::observabilitymodule; not a tracked feature row indocs/TEST-COVERAGE-MATRIX.md.## Related— no matrix feature IDs affected (observability internals).Closes #NNNin the## Relatedsection — Sentry-only fix; no GitHub issue. Sentry-Issue trailers in## Relatedprovide the back-reference.Impact
info!instead ofwarn!/error!. No behavioural change to the supervisor's restart loop or tohealth.busescalation; only the Sentry/log severity tier changes.tauri-rust, reducing dashboard noise and event-ingestion cost.health.bus/FAIL_ESCALATE_THRESHOLDpath. The demoted breadcrumbs are still captured in the local tracing log and are available for correlation when investigating a sustained outage manually.Related
.claude/rules/phases/07-cleanup.md) currently matchesOPENHUMAN-(TAURI|REACT|CORE)-[A-Z0-9]+/BACKEND-ALPHAHUMAN-[A-Z0-9]+; widen to accept self-hostedTAURI-RUST-*/TAURI-REACT-*/CORE-RUST-*/BACKEND-NODEJS-*prefixes so the cleanup hook can auto-flip these IDs to resolved post-merge. Separate workflow-tooling PR.AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
fix/sentry-15-channel-supervisor-restartValidation Run
pnpm --filter openhuman-app format:check— no frontend changes.pnpm typecheck— no TypeScript changes.cargo test --lib core::observability→ 94 passed (6 new), 0 failed.cargo fmt --checkclean;cargo check --manifest-path Cargo.tomlclean (only pre-existingslack_backfilldead-code warnings unrelated to this change).cargo clippy --libwarning count = 168 (matches main baseline).app/src-tauri/changes.Validation Blocked
command:N/Aerror:N/Aimpact:N/ABehavior Changes
channels::runtime::supervision::spawn_supervised_listenernow classify asExpectedErrorKind::ChannelSupervisorRestartand emit attracing::info!(breadcrumb only) instead of routing throughNetworkUnreachable(warn!— still a Sentry event) or escaping classification (full error event for OS-localized inner errors).health.busescalation are all unchanged. Effect is observability-only (Sentry dashboard noise reduction).Parity Contract
NetworkUnreachableandLoopbackUnavailabledemotion paths are unchanged for non-supervisor call sites. The new tier intercepts only the canonical supervisor wrapper format ("Channel <name> error: <inner>; restarting") — bodies lacking the wrapper prefix continue to flow through the existing matchers exactly as before.channel_supervisor_restart_does_not_classify_unrelated_restart_notespins the rejection contract (generic restart logs without the"Channel <name> error:"preamble must NOT classify).channel_supervisor_restart_precedence_over_network_unreachablepins that supervisor-wrap bodies that ALSO matchis_network_unreachable_messageroute to the new tier, not the old one.Duplicate / Superseded PR Handling
Summary by CodeRabbit
New Features
Bug Fixes
Tests