
[reliability] Observability: sync-health metrics & signals #88

@pvg13


Problem

Sync failures are currently invisible until a user notices missing data. There are no metrics for sync lag, catch-up rounds, per-peer ack age, relay bytes, or divergence repairs, so neither operators nor the consuming app can tell whether sync is healthy or quietly broken.

Proposed approach (Phase 3; pairs with the relay-cost telemetry of #TELEMETRY)

Add counters/gauges and expose them via NetworkEvent / diagnostics:

  • replication/sync lag and time-since-last-converged (per peer)
  • catch-up + RBSR round counts, divergence-repair count
  • per-peer last-acked cursor age
  • relay-byte trend (shared with the relay-cost telemetry issue)
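To make the list above concrete, here is a minimal sketch of what a metrics holder could look like. Every name in it (`SyncMetrics`, `PeerSyncState`, `PeerHealth`, the `record_*` methods, and `PeerId` as a plain `String`) is an assumption for illustration, not existing wavesyncdb API. The caller passes timestamps in explicitly so the ages can be tested deterministically.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical peer identifier; the real crate presumably has its own type.
pub type PeerId = String;

/// Per-peer raw state (hypothetical). Timestamps are `None` until first observed.
#[derive(Debug, Clone, Default)]
pub struct PeerSyncState {
    pub last_converged: Option<Instant>,
    pub last_ack: Option<Instant>,
    pub acked_cursor: u64,
}

/// Counters plus per-peer state, covering the signals listed above.
#[derive(Debug, Default)]
pub struct SyncMetrics {
    pub catch_up_rounds: u64,
    pub rbsr_rounds: u64,
    pub divergence_repairs: u64,
    pub relay_bytes: u64,
    peers: HashMap<PeerId, PeerSyncState>,
}

/// Point-in-time view of one peer, suitable for a diagnostics dump.
#[derive(Debug, Clone, PartialEq)]
pub struct PeerHealth {
    pub peer: PeerId,
    pub since_converged: Option<Duration>,
    pub ack_age: Option<Duration>,
    pub acked_cursor: u64,
}

impl SyncMetrics {
    pub fn record_catch_up_round(&mut self) {
        self.catch_up_rounds += 1;
    }

    pub fn record_rbsr_round(&mut self) {
        self.rbsr_rounds += 1;
    }

    pub fn record_divergence_repair(&mut self) {
        self.divergence_repairs += 1;
    }

    /// Accumulates relayed bytes; a trend can be derived by sampling this total.
    pub fn record_relay_bytes(&mut self, n: u64) {
        self.relay_bytes = self.relay_bytes.saturating_add(n);
    }

    /// Records an ack from `peer`. The cursor never moves backwards,
    /// but the ack timestamp is refreshed on every call.
    pub fn record_ack(&mut self, peer: &str, cursor: u64, now: Instant) {
        let st = self.peers.entry(peer.to_string()).or_default();
        st.acked_cursor = st.acked_cursor.max(cursor);
        st.last_ack = Some(now);
    }

    /// Marks `peer` as converged with the local state at `now`.
    pub fn record_converged(&mut self, peer: &str, now: Instant) {
        let st = self.peers.entry(peer.to_string()).or_default();
        st.last_converged = Some(now);
    }

    /// Computes ages relative to `now`; returns `None` for an unknown peer.
    pub fn peer_health(&self, peer: &str, now: Instant) -> Option<PeerHealth> {
        let st = self.peers.get(peer)?;
        Some(PeerHealth {
            peer: peer.to_string(),
            since_converged: st.last_converged.map(|t| now.saturating_duration_since(t)),
            ack_age: st.last_ack.map(|t| now.saturating_duration_since(t)),
            acked_cursor: st.acked_cursor,
        })
    }
}
```

A `PeerHealth` snapshot like this could be carried in a NetworkEvent variant or returned from diagnostics; how that fits the existing NetworkEvent shape is left open here. If the engine updates metrics from several tasks, the plain `u64` fields would need to become atomics or sit behind a lock.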

Files

wavesyncdb/src/engine/mod.rs, wavesyncdb/src/network_status.rs (NetworkEvent), wavesyncdb/src/diagnostics.rs (or a new metrics.rs).

Ref: docs/research/sync-reliability.md §6 P2 / §5 (convergence verification & observability).
