fix(web): wire up the full version-vector catch-up path (#57, #59)#58
Merged
Conversation
…hed (#57) A fresh `WebSyncClient` joining a relay-mediated mesh was receiving only *future* writes from peers. Historical state already on the peers (rows written before this tab existed) never landed in IndexedDB because the swarm path never sent a `VersionVector` request when a peer connected. The loopback path already did this correctly on every offline→online transition (`send_version_vector`, called from `run_loopback`'s `was_online` edge), but `run_swarm`'s `ConnectionEstablished` arm only inserted the peer into `connected` and pushed a status update. The persistence side was always ready — `BrowserStore` writes `peer_versions[peer]` on every successful inbound Push and exposes `get_peer_version()` — but nothing read it. After this commit, every non-relay `ConnectionEstablished` enqueues the peer into a new `pending_version_vectors: Vec<LibPeerId>` queue that's drained at the top of each `run_swarm` loop iteration, paralleling the existing `pending_announces` pattern. The drain calls the new `send_version_vector_swarm` helper which mirrors the loopback variant but routes the request via `swarm.behaviour_mut().snapshot.send_request(&peer, req)`. Why deferred via queue instead of called synchronously from `handle_event`: same wasm_bindgen_futures executor-reentrancy hazard the relay branch already documents at length — calling `request_response.send_request` synchronously inside `ConnectionEstablished` can wake a task whose poll re-enters `Inner::run` while we still hold the outer borrow from `swarm.select_next_some()`. The queue → drain pattern keeps all async send sites at the top of the loop, between selects. Invariants preserved: - HMAC mandatory on ALL request paths. When `group_key` is configured the new request is signed with `gk.mac(&bytes)` before being sent, identical to the loopback variant. Peers silently drop unauthenticated `VersionVector` requests. - `db_version=0` is the new-peer onboarding signal. `get_peer_version` returns 0 for peers never seen before, so a fresh tab's first `VersionVector` to each peer carries `your_last_db_version=0`, asking for full history. This is the only mechanism for initial state transfer over the swarm path. - Swarm cannot be held across awaits. The drain phase takes `&mut swarm` for `send_version_vector_swarm`'s own `swarm.behaviour_mut().snapshot.send_request` call but the `get_peer_version().await` and `state.db_version.lock().await` happen before any swarm borrow — the swarm is only touched at the final synchronous `send_request` line. Validation: - `cargo check --target wasm32-unknown-unknown -p wavesyncdb --features web,dioxus`: clean - `cargo check --target wasm32-unknown-unknown -p example-qr-pairing -p wavesync-website`: clean - `cargo clippy -p wavesyncdb -p wavesyncdb_derive -p wavesync_relay -- -D warnings`: clean - `cargo test -p wavesyncdb --lib`: 167 unit tests pass - `cargo fmt --check`: clean No automated test for this path — it requires a real browser dialing a real relay with at least one connected peer holding history. Manual repro plan is in the PR body. Known followup (deliberately not in this PR): on rapid reconnect storms the queue can fire multiple `VersionVector`s for the same peer if `ConnectionClosed`+`ConnectionEstablished` cycle inside a single loop iteration. The receiver responds idempotently so this is wasteful rather than incorrect. Switch the queue to a `HashSet` or add a per-peer in-flight tracker if production traffic shows this matters. Closes #57
a1f8dfc to
522ea5a
Compare
Follow-up to the version-vector trigger from #57. With the trigger in place the web sends the request and the native peer responds, but two issues remained that left the catch-up still broken in the field. ## Sub-bug 1: response was dropped on the floor `handle_snapshot_event`'s `Message::Response { .. }` arm was a no-op ("PushAck etc. — no-op for now"), so the `ChangesetResponse` carrying the actual catch-up data ran straight to `/dev/null`. The loopback path doesn't hit this because its responder sends a fresh `Push` request through `out_tx` instead of a `send_response`, which routes through the regular Push handler that already calls `apply_remote_changeset`. The swarm path had neither apply logic nor a "Push back" fallback. Fix: the response arm now matches on `SyncResponse::ChangesetResponse`, mirrors the native engine's `handle_changeset_response` (`engine/sync_handler.rs:52-93`): - verify the response's `topic` against ours, - verify the HMAC against bytes that include the *real* `your_last_db_version` value (the sender computed the tag over the real value; using a placeholder would always fail verification), - log the count + responder's db_version for diagnostic visibility, - reconstruct the same `SyncChangeset` shape the inbound-Push path builds, broadcast it on `inbound_tx`, and call `apply_remote_changeset` — which already persists each winning change, broadcasts on `resolved_tx`, and updates `peer_versions[peer]`. No separate `set_peer_version` is needed. `PushAck` and `IdentityAck` keep their no-op semantics. ## Sub-bug 2: 10s default timeout was too short for full history `request_response::Config::default()` carries a 10s request timeout. For a fresh tab pulling months of history against a populated peer over a circuit relay, the responder's `get_changes_since(0)` scan + JSON serialization + HMAC + circuit-relay hop frequently exceeded 10s, surfacing as `OutboundFailure: Timeout while waiting for a response`. The native side already uses `with_request_timeout(Duration::from_secs(30))` on its snapshot behaviour (`engine/behaviour.rs:72`). The web was asymmetrically slower than the native side's tolerance; matching the 30s value keeps the two ends symmetric. If real-world catch-ups ever need more, chunking the response into multiple Pushes is the next lever (followup, not this PR). Invariants preserved (same as #57): - HMAC mandatory on ALL message paths — the response handler drops unauthenticated `ChangesetResponse` silently. - `db_version=0` semantics still drive full-history onboarding. - Swarm not held across awaits — `apply_remote_changeset` is awaited but only `&state` (and `&peer`) are passed; the swarm borrow ends before the await. Closes #59
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #57 and #58 issue followup #59.
The fix lands in three commits, in the order the bugs surface:
fix(web): trigger version-vector catch-up on swarm ConnectionEstablished (#57)— Adds the queue-based trigger so the web actually asks peers for catch-up data on connect. Without this, the web only ever saw real-timePushtraffic.chore(fmt): run rustfmt on parse_table_name— CI fmt nit from the earlier round.fix(web): apply ChangesetResponse + bump snapshot timeout to 30s (#59)— Handles the responses (the previous arm was a no-op so the catch-up data was dropped on the floor) and matches the native side's 30swith_request_timeoutso catch-up against a populated peer doesn't time out at 10s.Together these make a fresh
WebSyncClientconverge with native peers within one round-trip on connect, even with months of historical data.#57 — version-vector trigger
run_swarmnow triggers aVersionVectorcatch-up request when a non-relay peer'sConnectionEstablishedfires, parallelingrun_loopback's offline→online edge. The trigger lives on a newpending_version_vectors: Vec<LibPeerId>queue drained at the top of each loop iteration (mirrorspending_announces) — direct synchronous call fromhandle_eventwould re-enterwasm_bindgen_futures::Inner::runwhile holding the swarm borrow.#59 sub-bug 1 — apply the response
handle_snapshot_event'sMessage::Response { .. }arm was a no-op marked "PushAck etc. — no-op for now", soChangesetResponsearrived and was silently dropped. The loopback path doesn't hit this because its responder sends a freshPushrequest throughout_txinstead ofsend_response, which routes through the existing Push handler that already callsapply_remote_changeset.The new handler mirrors the native engine's
handle_changeset_response(engine/sync_handler.rs:52-93): verify topic, verify HMAC against bytes that include the realyour_last_db_versionvalue, reconstruct the sameSyncChangesetshape the inbound-Push path builds, broadcast oninbound_tx, then callapply_remote_changeset— which already persists each winning change, broadcasts onresolved_tx, and updatespeer_versions[peer].#59 sub-bug 2 — 30s snapshot timeout
request_response::Config::default()carries a 10s request timeout. Field repro: against a phone with weeks of synced data, the responder'sget_changes_since(0)scan + JSON serialization + HMAC + circuit-relay hop exceeded 10s and surfaced asOutboundFailure: Timeout while waiting for a response. The native side already useswith_request_timeout(Duration::from_secs(30))on its snapshot behaviour (engine/behaviour.rs:72); web now matches.Invariants preserved
group_keyis configured; the new inbound response handler drops unauthenticatedChangesetResponsesilently. Mirrors what the native engine does in both directions.db_version=0semantics.get_peer_versionreturns 0 for peers never seen before, so a fresh tab's firstVersionVectorcarriesyour_last_db_version=0and asks for full history — the only mechanism for initial state transfer.&mut swarmfor the synchronoussend_request, butget_peer_version().awaitanddb_version.lock().awaithappen before any swarm borrow. The response handler awaitsapply_remote_changesetbut only with&state+&peer, releasing the swarm borrow first.Test plan
Automated:
cargo check --target wasm32-unknown-unknown -p wavesyncdb --features web,dioxus— cleancargo check --target wasm32-unknown-unknown -p example-qr-pairing -p wavesync-website— cleancargo clippy -p wavesyncdb -p wavesyncdb_derive -p wavesync_relay -- -D warnings— cleancargo test -p wavesyncdb --lib— 167 unit tests passcargo fmt --check— cleanManual repro (requires a browser + relay + at least one peer holding history):
wavesync_relay.tests-e2e/test-peer) joined to the same relay+topic+passphrase. Submit a few rows so the peer has history.WebSyncClient::connect_via_relay(...).use_synced_table::<E>(handle)stays empty.relay_connected = true, the console logsswarm: requesting catch-up from <peer> since db_version=0 (we are at 0)followed byWebSyncClient: received ChangesetResponse from <peer> with N changes, and the rows materialize.Known followup (deliberately not in this PR)
VersionVectors for the same peer ifConnectionClosed+ConnectionEstablishedcycle inside a single loop iteration. Wasteful, not incorrect.Pushrequest fans instead of one fatChangesetResponse. Avoids a RAM cliff on very large histories.