Improve gateway and Feishu runtime observability#1180

Merged
chumyin merged 2 commits into eastreams:dev from chumyin:feat/runtime-reliability-20260410
Apr 17, 2026

Conversation

@chumyin
Collaborator

@chumyin chumyin commented Apr 10, 2026

Summary

  • Problem: loong gateway run and Feishu channel runtime startup could look idle even when the runtime had started or inbound events were arriving, which left operators without enough evidence to tell whether the bot was listening, deduplicating, or dispatching work.
  • Why it matters: issue #1083 ("[Feature]: Improve log output for better observability and debug") reports a real blocked debugging loop in which chat and outbound send both worked, but inbound Feishu traffic produced no actionable logs.
  • What changed: added privacy-safe structured tracing for gateway runtime startup/control-surface readiness and for Feishu webhook/websocket startup, ingress, deduplication, and successful inbound processing milestones.
  • What did not change (scope boundary): this does not add a new telemetry framework, does not change runtime behavior, and does not log message bodies, tokens, or approval payloads.
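To make the "metadata-only" boundary concrete, here is a minimal stdlib-only sketch of a log record that cannot carry a message body. It is not the code in this PR, which uses `tracing` macros directly; `InboundEventMeta`, its fields, and `from_payload` are hypothetical names chosen for illustration.

```rust
use std::fmt;

/// Hypothetical metadata for one inbound event.
/// There is deliberately no field that could hold the message body.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InboundEventMeta {
    pub transport: &'static str,
    pub account_id: String,
    pub event_id: String,
    pub event_type: String,
    pub body_len: usize,
}

impl InboundEventMeta {
    /// Builds the record from a raw payload, keeping only its length.
    pub fn from_payload(
        transport: &'static str,
        account_id: &str,
        event_id: &str,
        event_type: &str,
        body: &[u8],
    ) -> Self {
        Self {
            transport,
            account_id: account_id.to_owned(),
            event_id: event_id.to_owned(),
            event_type: event_type.to_owned(),
            body_len: body.len(),
        }
    }
}

impl fmt::Display for InboundEventMeta {
    /// Renders the record as space-separated key=value pairs.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "transport={} account_id={} event_id={} event_type={} body_len={}",
            self.transport, self.account_id, self.event_id, self.event_type, self.body_len
        )
    }
}
```

Because the body is reduced to `body_len` at construction time, a later change to the log call cannot leak content by accident; that is one way to enforce the scope boundary above, assuming a dedicated record type is acceptable.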

Linked Issues

Change Type

  • Bug fix
  • Feature
  • Refactor
  • Documentation
  • Security hardening
  • CI / workflow / release

Touched Areas

  • Kernel / policy / approvals
  • Contracts / protocol / spec
  • Daemon / CLI / install
  • Providers / routing
  • Tools
  • Browser automation
  • Channels / integrations
  • ACP / conversation / session runtime
  • Memory / context assembly
  • Config / migration / onboarding
  • Docs / contributor workflow
  • CI / release / workflows

Risk Track

  • Track A (routine / low-risk)
  • Track B (higher-risk / policy-impacting)

Validation

  • cargo fmt --all -- --check
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo test --workspace --locked
  • cargo test --workspace --all-features --locked
  • Relevant architecture / dep-graph / docs checks for touched areas
  • Additional scenario, benchmark, or manual checks when behavior changed
  • If this changes config/env fallback, limits, or defaults: include before/after behavior and regression coverage for explicit path, fallback path, and boundary values
  • If tests mutate process-global env: document how state is restored or serialized

Commands and evidence:

cargo fmt --all -- --check
./scripts/check_architecture_boundaries.sh
./scripts/check_dep_graph.sh
CARGO_TARGET_DIR=<redacted-target-dir> cargo test -p loongclaw-app runtime_backend_supports_local_abort_for_running_prompt --lib -- --nocapture
CARGO_TARGET_DIR=<redacted-target-dir> cargo test -p loongclaw-app ensure_session_falls_back_to_sessions_new_when_ensure_has_no_identifiers --lib -- --nocapture
CARGO_TARGET_DIR=<redacted-target-dir> cargo test -p loongclaw-app runtime_backend_executes_session_turn_and_controls --lib -- --nocapture
CARGO_TARGET_DIR=<redacted-target-dir> cargo test -p loongclaw-app doctor_accepts_path_discovered_fake_version_command --lib -- --nocapture
CARGO_TARGET_DIR=<redacted-target-dir> cargo test -p loongclaw-app browser_companion_session_start_reports_balanced_execution_tier --lib -- --nocapture
CARGO_TARGET_DIR=<redacted-target-dir> cargo test -p loongclaw gateway_owner_state --test integration -- --nocapture
CARGO_TARGET_DIR=<redacted-target-dir> cargo test --workspace --locked -j 1
CARGO_TARGET_DIR=<redacted-target-dir> cargo clippy --workspace --all-targets --all-features -- -D warnings
CARGO_TARGET_DIR=<redacted-target-dir> cargo test --workspace --all-features --locked -j 1 -- --test-threads=1

Notes:

  • An earlier all-features run with default test scheduling surfaced timing-sensitive noise in unrelated shell/browser companion tests; rerunning with serialized test threads passed cleanly and no changed-area regressions remained.
  • The changed logging paths are metadata-only. No message body, token, or secret payload logging was introduced.

User-visible / Operator-visible Changes

  • Operators now get explicit structured logs when the gateway runtime starts and when the control surface becomes ready.
  • Feishu webhook and websocket runtimes now log startup metadata, accepted inbound/card callback events, deduplication outcomes, and successful inbound processing.
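A deduplication outcome that can be logged as a plain field might look like the following sketch. It assumes events carry a string ID and uses an unbounded in-memory set; `SeenEvents` and `DedupeOutcome` are hypothetical names, and the real handlers may track duplicates differently (for example with expiry).

```rust
use std::collections::HashSet;

/// Hypothetical result of checking an inbound event against prior events.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DedupeOutcome {
    Accepted,
    Duplicate,
}

impl DedupeOutcome {
    /// Stable label suitable for a structured log field.
    pub fn as_str(self) -> &'static str {
        match self {
            DedupeOutcome::Accepted => "accepted",
            DedupeOutcome::Duplicate => "duplicate",
        }
    }
}

/// Hypothetical tracker of event IDs seen so far (unbounded; no expiry).
#[derive(Debug, Default)]
pub struct SeenEvents {
    seen: HashSet<String>,
}

impl SeenEvents {
    /// Records the ID and reports whether it was new.
    pub fn observe(&mut self, event_id: &str) -> DedupeOutcome {
        // HashSet::insert returns true only when the value was not present.
        if self.seen.insert(event_id.to_owned()) {
            DedupeOutcome::Accepted
        } else {
            DedupeOutcome::Duplicate
        }
    }
}
```

Logging `outcome.as_str()` next to the event ID lets an operator tell "nothing arrived" apart from "arrived but was dropped as a duplicate" without seeing any payload.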

Failure Recovery

  • Fast rollback or disable path: revert commit e25d497ee to restore prior logging behavior.
  • Observable failure symptoms reviewers should watch for: overly noisy logs on busy Feishu channels, or any accidental exposure of request contents beyond the added metadata fields.

Reviewer Focus

  • crates/daemon/src/gateway/service.rs: startup vs control-surface-ready log boundaries and field selection.
  • crates/app/src/channel/feishu/webhook.rs: metadata-only logging and dedupe/process-success points.
  • crates/app/src/channel/feishu/websocket.rs: websocket session ingress logging without altering dispatch semantics.

Summary by CodeRabbit

  • Chores
    • Enhanced structured logging and observability across Feishu channel handlers (webhook and websocket modes) and gateway runtime components. Added startup event logs, request/event processing logs, and control surface readiness indicators to improve monitoring and debugging capabilities.

Gateway-run and Feishu channel sessions could appear idle even when the
runtime had started and was receiving inbound traffic. This adds
privacy-safe structured tracing at the startup, control-surface-ready,
and inbound-event milestones so operators can distinguish between
startup, ingress, deduplication, and successful dispatch without
inspecting message bodies or secrets.

Constraint: Observability must improve without logging message bodies, tokens, or private endpoints beyond existing configured bind/url metadata
Rejected: Add a broader telemetry/event framework | too broad for this bounded runtime reliability slice
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep future Feishu/gateway logs metadata-only unless the repository adopts a reviewed redaction policy first
Tested: cargo fmt --all -- --check
Tested: ./scripts/check_architecture_boundaries.sh
Tested: ./scripts/check_dep_graph.sh
Tested: CARGO_TARGET_DIR=/Users/chum/.cache/loongclaw-runtime-reliability-target cargo test -p loongclaw-app runtime_backend_supports_local_abort_for_running_prompt --lib -- --nocapture
Tested: CARGO_TARGET_DIR=/Users/chum/.cache/loongclaw-runtime-reliability-target cargo test -p loongclaw-app ensure_session_falls_back_to_sessions_new_when_ensure_has_no_identifiers --lib -- --nocapture
Tested: CARGO_TARGET_DIR=/Users/chum/.cache/loongclaw-runtime-reliability-target cargo test -p loongclaw-app runtime_backend_executes_session_turn_and_controls --lib -- --nocapture
Tested: CARGO_TARGET_DIR=/Users/chum/.cache/loongclaw-runtime-reliability-target cargo test -p loongclaw-app doctor_accepts_path_discovered_fake_version_command --lib -- --nocapture
Tested: CARGO_TARGET_DIR=/Users/chum/.cache/loongclaw-runtime-reliability-target cargo test -p loongclaw-app browser_companion_session_start_reports_balanced_execution_tier --lib -- --nocapture
Tested: CARGO_TARGET_DIR=/Users/chum/.cache/loongclaw-runtime-reliability-target cargo test -p loongclaw gateway_owner_state --test integration -- --nocapture
Tested: CARGO_TARGET_DIR=/Users/chum/.cache/loongclaw-runtime-reliability-target cargo test --workspace --locked -j 1
Tested: CARGO_TARGET_DIR=/Users/chum/.cache/loongclaw-runtime-reliability-target cargo clippy --workspace --all-targets --all-features -- -D warnings
Tested: CARGO_TARGET_DIR=/Users/chum/.cache/loongclaw-runtime-reliability-target cargo test --workspace --all-features --locked -j 1 -- --test-threads=1
Not-tested: End-to-end Feishu delivery against a live tenant or gateway deployment outside local verification
@coderabbitai

coderabbitai Bot commented Apr 10, 2026

📝 Walkthrough

This PR adds structured logging throughout the Feishu channel implementation (webhook and websocket modes) and gateway service runtime to improve observability. New logs capture startup information, webhook requests, event processing outcomes, and gateway runtime status. Two accessor methods are added to FeishuWebhookState to expose internal string fields for logging purposes.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Feishu Webhook Handler: `crates/app/src/channel/feishu/webhook.rs` | Added two `pub(super)` accessor methods (`configured_account_id()`, `account_id()`) to `FeishuWebhookState`. Extended handlers with debug/info logs for incoming webhook requests, parsed actions (`UrlVerification`, `CardCallback`, `Inbound`), deduplication outcomes, and successful inbound event processing. |
| Feishu Channel Startup: `crates/app/src/channel/feishu/mod.rs`, `crates/app/src/channel/feishu/websocket.rs` | Added info logs with runtime context (config path, account identifiers, selection defaults, transport mode) at webhook/websocket channel initialization and session startup. Added a debug log in websocket frame handling for decoded event payload metadata. |
| Gateway Service Runtime: `crates/daemon/src/gateway/service.rs` | Added an `as_str()` method to the `GatewayRuntimeEntryPoint` enum. Added info logs before gateway runtime acquisition (with surface count and metadata) and after control surface binding (with address, port, and token path details). |
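The `as_str()` addition follows a common pattern for giving an enum a stable log label. The sketch below shows that pattern on a stand-in type: `EntryPoint` and its variants `Cli` and `Service` are invented for illustration and are not the actual variants of `GatewayRuntimeEntryPoint`.

```rust
/// Stand-in for an entry-point enum; the variants are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntryPoint {
    Cli,
    Service,
}

impl EntryPoint {
    /// Returns a fixed lowercase label, independent of the Debug output,
    /// so renaming a variant does not silently change log field values.
    pub const fn as_str(self) -> &'static str {
        match self {
            EntryPoint::Cli => "cli",
            EntryPoint::Service => "service",
        }
    }
}
```

A log call can then pass `entry_point.as_str()` as a field value, which keeps the field a plain string for downstream filters.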

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Suggested labels

daemon, channels, size: S

Suggested reviewers

  • gh-xj

Poem

🐰 A rabbit hops through logs so bright,
With telemetry beaming in the night,
Webhooks and sockets, now they sing,
Of startup states and gateway rings!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Improve gateway and Feishu runtime observability' accurately and clearly summarizes the main objective of the PR, which is adding structured tracing logs for gateway and Feishu channel runtime startup and operational milestones. |
| Linked Issues check | ✅ Passed | The PR successfully addresses #1083 by adding structured tracing logs at runtime startup, control-surface readiness, webhook/websocket ingress, deduplication, and inbound processing milestones, enabling operators to distinguish idle states from active inbound traffic handling. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to improving observability through structured logging in gateway and Feishu channel runtimes; no unrelated functionality modifications or behavioral changes outside the stated objectives were introduced. |


@chumyin chumyin self-assigned this Apr 10, 2026
@github-actions github-actions Bot added daemon Daemon binary, CLI entrypoints, and install flow. channels Channel adapters and external integration surfaces. size: S Small pull request: 51-200 changed lines. labels Apr 10, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/app/src/channel/feishu/websocket.rs`:
- Around line 270-277: The tracing::info call is logging the raw websocket URL
(variable url) which can contain ephemeral auth tokens; update the code so it
does not log sensitive query params or fragments — compute a redacted_url (e.g.,
using url::Url to strip query and fragment or log only the
scheme+host+path/origin) and use that variable instead of %url in the
tracing::info call (keep configured_account_id and account_id as-is). Ensure the
redaction removes any query string/ticket/token information before logging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 381d42d7-3ad8-4d1a-a0a3-34a1ad2e9ff1

📥 Commits

Reviewing files that changed from the base of the PR and between e4734ce and e25d497.

📒 Files selected for processing (4)
  • crates/app/src/channel/feishu/mod.rs
  • crates/app/src/channel/feishu/webhook.rs
  • crates/app/src/channel/feishu/websocket.rs
  • crates/daemon/src/gateway/service.rs

Comment on lines +270 to +277
tracing::info!(
    target: "loongclaw.channel.feishu",
    transport = "websocket",
    configured_account_id = %state.configured_account_id(),
    account_id = %state.account_id(),
    url = %url,
    "connecting feishu websocket session"
);

⚠️ Potential issue | 🟠 Major

Avoid logging raw websocket endpoint URLs.

url may include ephemeral auth material in query params (tickets/tokens). Logging it verbatim can leak secrets to logs.

🔐 Minimal redaction fix
 async fn run_feishu_websocket_session(
     state: &FeishuWebhookState,
     url: &str,
     ws_config: &FeishuWsEndpointClientConfig,
     stop: ChannelServeStopHandle,
 ) -> CliResult<()> {
+    let redacted_url = url.split('?').next().unwrap_or(url);
     tracing::info!(
         target: "loongclaw.channel.feishu",
         transport = "websocket",
         configured_account_id = %state.configured_account_id(),
         account_id = %state.account_id(),
-        url = %url,
+        url = %redacted_url,
         "connecting feishu websocket session"
     );
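The suggested `split('?')` removes the query string but would keep a fragment when the URL has no query. A minimal stdlib-only sketch that cuts at whichever of `?` or `#` comes first is shown below. `redact_url` is a hypothetical helper name, and the sketch does not handle credentials embedded in the authority part (`user:pass@host`); parsing with a URL library would be needed for that.

```rust
/// Returns the URL truncated before the first `?` or `#`, so that neither
/// query parameters nor fragments reach the log. Hypothetical helper.
pub fn redact_url(url: &str) -> &str {
    // Find the first query or fragment delimiter; keep the whole string if none.
    let end = url
        .find(|c: char| c == '?' || c == '#')
        .unwrap_or(url.len());
    // Both delimiters are ASCII, so `end` is always on a char boundary.
    &url[..end]
}
```

Used in place of the raw value, the log field would become `url = %redact_url(url)`, keeping scheme, host, and path for diagnosis while dropping any ticket or token.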

@github-actions github-actions Bot added size: M Medium pull request: 201-500 changed lines. and removed size: S Small pull request: 51-200 changed lines. labels Apr 17, 2026
@chumyin chumyin merged commit c561ed3 into eastreams:dev Apr 17, 2026
18 checks passed
@chumyin chumyin deleted the feat/runtime-reliability-20260410 branch April 17, 2026 05:01
Labels

channels Channel adapters and external integration surfaces. daemon Daemon binary, CLI entrypoints, and install flow. size: M Medium pull request: 201-500 changed lines.

Development

Successfully merging this pull request may close these issues.

[Feature]: Improve log output for better observability and debug

1 participant