Skip to content

fix(web): wire up the full version-vector catch-up path (#57, #59)#58

Merged
pvg13 merged 2 commits into
mainfrom
fix/web-version-vector-catchup
May 13, 2026
Merged

fix(web): wire up the full version-vector catch-up path (#57, #59)#58
pvg13 merged 2 commits into
mainfrom
fix/web-version-vector-catchup

Conversation

@pvg13

@pvg13 pvg13 commented May 13, 2026

Copy link
Copy Markdown
Owner

Closes #57 and #58 issue followup #59.

The fix lands in three commits, in the order the bugs surface:

  1. fix(web): trigger version-vector catch-up on swarm ConnectionEstablished (#57) — Adds the queue-based trigger so the web actually asks peers for catch-up data on connect. Without this, the web only ever saw real-time Push traffic.
  2. chore(fmt): run rustfmt on parse_table_name — CI fmt nit from the earlier round.
  3. fix(web): apply ChangesetResponse + bump snapshot timeout to 30s (#59) — Handles the responses (the previous arm was a no-op so the catch-up data was dropped on the floor) and matches the native side's 30s with_request_timeout so catch-up against a populated peer doesn't time out at 10s.

Together these make a fresh WebSyncClient converge with native peers within one round-trip on connect, even with months of historical data.

#57 — version-vector trigger

run_swarm now triggers a VersionVector catch-up request when a non-relay peer's ConnectionEstablished fires, paralleling run_loopback's offline→online edge. The trigger lives on a new pending_version_vectors: Vec<LibPeerId> queue drained at the top of each loop iteration (mirrors pending_announces) — direct synchronous call from handle_event would re-enter wasm_bindgen_futures::Inner::run while holding the swarm borrow.

#59 sub-bug 1 — apply the response

handle_snapshot_event's Message::Response { .. } arm was a no-op marked "PushAck etc. — no-op for now", so ChangesetResponse arrived and was silently dropped. The loopback path doesn't hit this because its responder sends a fresh Push request through out_tx instead of send_response, which routes through the existing Push handler that already calls apply_remote_changeset.

The new handler mirrors the native engine's handle_changeset_response (engine/sync_handler.rs:52-93): verify topic, verify HMAC against bytes that include the real your_last_db_version value, reconstruct the same SyncChangeset shape the inbound-Push path builds, broadcast on inbound_tx, then call apply_remote_changeset — which already persists each winning change, broadcasts on resolved_tx, and updates peer_versions[peer].

#59 sub-bug 2 — 30s snapshot timeout

request_response::Config::default() carries a 10s request timeout. Field repro: against a phone with weeks of synced data, the responder's get_changes_since(0) scan + JSON serialization + HMAC + circuit-relay hop exceeded 10s and surfaced as OutboundFailure: Timeout while waiting for a response. The native side already uses with_request_timeout(Duration::from_secs(30)) on its snapshot behaviour (engine/behaviour.rs:72); web now matches.

Invariants preserved

  • HMAC mandatory on ALL message paths. The new outbound request is signed when group_key is configured; the new inbound response handler drops unauthenticated ChangesetResponse silently. Mirrors what the native engine does in both directions.
  • db_version=0 semantics. get_peer_version returns 0 for peers never seen before, so a fresh tab's first VersionVector carries your_last_db_version=0 and asks for full history — the only mechanism for initial state transfer.
  • Swarm not held across awaits. The drain phase takes &mut swarm for the synchronous send_request, but get_peer_version().await and db_version.lock().await happen before any swarm borrow. The response handler awaits apply_remote_changeset but only with &state + &peer, releasing the swarm borrow first.

Test plan

Automated:

  • cargo check --target wasm32-unknown-unknown -p wavesyncdb --features web,dioxus — clean
  • cargo check --target wasm32-unknown-unknown -p example-qr-pairing -p wavesync-website — clean
  • cargo clippy -p wavesyncdb -p wavesyncdb_derive -p wavesync_relay -- -D warnings — clean
  • cargo test -p wavesyncdb --lib — 167 unit tests pass
  • cargo fmt --check — clean

Manual repro (requires a browser + relay + at least one peer holding history):

  1. Boot wavesync_relay.
  2. Boot a native test peer (tests-e2e/test-peer) joined to the same relay+topic+passphrase. Submit a few rows so the peer has history.
  3. Open a browser tab on clean IndexedDB and run WebSyncClient::connect_via_relay(...).
  4. Before: use_synced_table::<E>(handle) stays empty.
  5. After: within one round-trip of relay_connected = true, the console logs swarm: requesting catch-up from <peer> since db_version=0 (we are at 0) followed by WebSyncClient: received ChangesetResponse from <peer> with N changes, and the rows materialize.

Known followup (deliberately not in this PR)

  • Reconnect-storm dedup: the queue can fire multiple VersionVectors for the same peer if ConnectionClosed+ConnectionEstablished cycle inside a single loop iteration. Wasteful, not incorrect.
  • Response chunking: if real-data catch-ups ever exceed even 30s, split the response into multiple Push request fans instead of one fat ChangesetResponse. Avoids a RAM cliff on very large histories.

…hed (#57)

A fresh `WebSyncClient` joining a relay-mediated mesh was receiving
only *future* writes from peers. Historical state already on the
peers (rows written before this tab existed) never landed in
IndexedDB because the swarm path never sent a `VersionVector`
request when a peer connected. The loopback path already did this
correctly on every offline→online transition (`send_version_vector`,
called from `run_loopback`'s `was_online` edge), but `run_swarm`'s
`ConnectionEstablished` arm only inserted the peer into `connected`
and pushed a status update.

The persistence side was always ready — `BrowserStore` writes
`peer_versions[peer]` on every successful inbound Push and exposes
`get_peer_version()` — but nothing read it. After this commit, every
non-relay `ConnectionEstablished` enqueues the peer into a new
`pending_version_vectors: Vec<LibPeerId>` queue that's drained at
the top of each `run_swarm` loop iteration, paralleling the
existing `pending_announces` pattern. The drain calls the new
`send_version_vector_swarm` helper which mirrors the loopback
variant but routes the request via
`swarm.behaviour_mut().snapshot.send_request(&peer, req)`.

Why deferred via queue instead of called synchronously from
`handle_event`: same wasm_bindgen_futures executor-reentrancy
hazard the relay branch already documents at length — calling
`request_response.send_request` synchronously inside
`ConnectionEstablished` can wake a task whose poll re-enters
`Inner::run` while we still hold the outer borrow from
`swarm.select_next_some()`. The queue → drain pattern keeps all
async send sites at the top of the loop, between selects.

Invariants preserved:

- HMAC mandatory on ALL request paths. When `group_key` is configured
  the new request is signed with `gk.mac(&bytes)` before being sent,
  identical to the loopback variant. Peers silently drop
  unauthenticated `VersionVector` requests.
- `db_version=0` is the new-peer onboarding signal. `get_peer_version`
  returns 0 for peers never seen before, so a fresh tab's first
  `VersionVector` to each peer carries `your_last_db_version=0`,
  asking for full history. This is the only mechanism for initial
  state transfer over the swarm path.
- Swarm cannot be held across awaits. The drain phase takes
  `&mut swarm` for `send_version_vector_swarm`'s own
  `swarm.behaviour_mut().snapshot.send_request` call but the
  `get_peer_version().await` and `state.db_version.lock().await`
  happen before any swarm borrow — the swarm is only touched at
  the final synchronous `send_request` line.

Validation:

- `cargo check --target wasm32-unknown-unknown -p wavesyncdb --features web,dioxus`: clean
- `cargo check --target wasm32-unknown-unknown -p example-qr-pairing -p wavesync-website`: clean
- `cargo clippy -p wavesyncdb -p wavesyncdb_derive -p wavesync_relay -- -D warnings`: clean
- `cargo test -p wavesyncdb --lib`: 167 unit tests pass
- `cargo fmt --check`: clean

No automated test for this path — it requires a real browser dialing
a real relay with at least one connected peer holding history. Manual
repro plan is in the PR body.

Known followup (deliberately not in this PR): on rapid reconnect
storms the queue can fire multiple `VersionVector`s for the same
peer if `ConnectionClosed`+`ConnectionEstablished` cycle inside a
single loop iteration. The receiver responds idempotently so this
is wasteful rather than incorrect. Switch the queue to a `HashSet`
or add a per-peer in-flight tracker if production traffic shows
this matters.

Closes #57
@pvg13 pvg13 force-pushed the fix/web-version-vector-catchup branch from a1f8dfc to 522ea5a Compare May 13, 2026 12:43
Follow-up to the version-vector trigger from #57. With the trigger in
place the web sends the request and the native peer responds, but two
issues remained that left the catch-up still broken in the field.

## Sub-bug 1: response was dropped on the floor

`handle_snapshot_event`'s `Message::Response { .. }` arm was a no-op
("PushAck etc. — no-op for now"), so the `ChangesetResponse` carrying
the actual catch-up data ran straight to `/dev/null`. The loopback
path doesn't hit this because its responder sends a fresh `Push`
request through `out_tx` instead of a `send_response`, which routes
through the regular Push handler that already calls
`apply_remote_changeset`. The swarm path had neither apply logic nor
a "Push back" fallback.

Fix: the response arm now matches on `SyncResponse::ChangesetResponse`,
mirrors the native engine's `handle_changeset_response`
(`engine/sync_handler.rs:52-93`):

- verify the response's `topic` against ours,
- verify the HMAC against bytes that include the *real*
  `your_last_db_version` value (the sender computed the tag over the
  real value; using a placeholder would always fail verification),
- log the count + responder's db_version for diagnostic visibility,
- reconstruct the same `SyncChangeset` shape the inbound-Push path
  builds, broadcast it on `inbound_tx`, and call
  `apply_remote_changeset` — which already persists each winning
  change, broadcasts on `resolved_tx`, and updates
  `peer_versions[peer]`. No separate `set_peer_version` is needed.

`PushAck` and `IdentityAck` keep their no-op semantics.

## Sub-bug 2: 10s default timeout was too short for full history

`request_response::Config::default()` carries a 10s request timeout.
For a fresh tab pulling months of history against a populated peer
over a circuit relay, the responder's `get_changes_since(0)` scan +
JSON serialization + HMAC + circuit-relay hop frequently exceeded
10s, surfacing as
`OutboundFailure: Timeout while waiting for a response`.

The native side already uses `with_request_timeout(Duration::from_secs(30))`
on its snapshot behaviour (`engine/behaviour.rs:72`). The web was
asymmetrically slower than the native side's tolerance; matching the
30s value keeps the two ends symmetric. If real-world catch-ups ever
need more, chunking the response into multiple Pushes is the next
lever (followup, not this PR).

Invariants preserved (same as #57):

- HMAC mandatory on ALL message paths — the response handler drops
  unauthenticated `ChangesetResponse` silently.
- `db_version=0` semantics still drive full-history onboarding.
- Swarm not held across awaits — `apply_remote_changeset` is awaited
  but only `&state` (and `&peer`) are passed; the swarm borrow ends
  before the await.

Closes #59
@pvg13 pvg13 changed the title fix(web): trigger version-vector catch-up on swarm ConnectionEstablished (#57) fix(web): wire up the full version-vector catch-up path (#57, #59) May 13, 2026
@pvg13 pvg13 merged commit e2a166c into main May 13, 2026
1 check passed
@pvg13 pvg13 deleted the fix/web-version-vector-catchup branch May 13, 2026 13:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Web client misses historical data — run_swarm doesn't trigger version-vector catch-up on ConnectionEstablished

1 participant