Skip to content

fix(web): _withInnerWebClient re-entrancy — depth-tracked inline run#152

Open
WiktorStarczewski wants to merge 3 commits into
mainfrom
wiktor/fix-withinnerwebclient-deadlock
Open

fix(web): _withInnerWebClient re-entrancy — depth-tracked inline run#152
WiktorStarczewski wants to merge 3 commits into
mainfrom
wiktor/fix-withinnerwebclient-deadlock

Conversation

@WiktorStarczewski
Copy link
Copy Markdown
Collaborator

Summary

_withInnerWebClient(fn) deadlocks any time fn calls back into a proxied async wasm method on the same inner client — which is exactly the entire point of the escape hatch. Every Miden Wallet usage of the local-prove (offscreen-doc) pipeline on Chrome MV3 hangs indefinitely because of this.

Fix is a depth-tracked re-entrancy on _serializeWasmCall: while _withInnerWebClient's callback is running, the inner WebClient's _withInnerLockDepth is > 0 and _serializeWasmCall runs its callback inline instead of enqueueing it on the same chain that the outer slot is already awaiting. External callers still queue behind the outer slot — no behavior change for any other code path.

The deadlock

_withInnerWebClient(fn) wraps fn in _serializeWasmCall(() => fn(inner)). The intent: hold the chain slot for the entire callback so all inner wasm calls run atomically.

Inside fn, any call to a proxied async wasm method — inner.getInputNote(...) (proxy-fallback wraps it), inner.executeTransaction(...) (wrapper class method wraps it), inner.submitProvenTransaction(...), inner.applyTransaction(...) — also calls _serializeWasmCall(fnB). Since the chain is currently held by the outer slot (fnA), fnB enqueues after fnA. But fnA is awaiting fnB's result. fnB is waiting for fnA to settle. Classic re-entrant-lock deadlock with no timeout.

Diagnostic trace from the wallet's proveLocallyViaOffscreen reproducing it:

t=35.7s  proveLocallyViaOffscreen inside SDK lock; building exec args
t=35.7s  consumeNoteId buildExecuteArgs: calling getInputNote
         ← never returns; test times out at 5 min

getInputNote never returns. Downstream, the wallet's SW SyncManager times out on withWasmClientLock while the consume holds the wallet's outer mutex, and surfaces a misleading "cannot reach the miden node" banner.

Fix

_serializeWasmCall(fn) {
  if (this._withInnerLockDepth > 0) {
    return Promise.resolve().then(fn);  // run inline; outer slot already held
  }
  const result = this._wasmCallChain.catch(() => {}).then(fn);
  this._wasmCallChain = result.catch(() => {});
  return result;
}

_withInnerWebClient(fn) {
  return this.#inner._serializeWasmCall(async () => {
    this.#inner._withInnerLockDepth = (this.#inner._withInnerLockDepth || 0) + 1;
    try {
      return await fn(this.#inner);
    } finally {
      this.#inner._withInnerLockDepth--;
    }
  });
}

External callers still queue behind the outer slot — they hit _serializeWasmCall with _withInnerLockDepth === 0 and go through the chain. The chain still serializes against them; the outer slot only resolves after fn (including every inline re-entry) settles.

Safety contract (documented inline)

Re-entrancy is safe only if the caller of _withInnerWebClient holds an external mutex preventing concurrent access to this same client instance via other code paths during fn. The chain serializes against external callers but if an external task runs during one of fn's awaits and calls into the SDK, it will see _withInnerLockDepth > 0 and run inline, racing wasm-bindgen's borrow check. The wallet's own withWasmClientLock discipline satisfies this; the method is @internal so callers should already be aware.

Tests

Added crates/web-client/test/with_inner_web_client_reentrancy.test.ts with four cases against the live SDK:

  1. Proxy-fallback read inside fn (inner.getInputNote(...)) doesn't deadlock.
  2. Wrapper-class method inside fn (inner.syncState()) doesn't deadlock — exercises the re-entry path through wrapper-defined methods.
  3. External SDK callers still queue behind an in-flight _withInnerWebClient slot (ordering invariant).
  4. Rejection in fn releases the depth counter so subsequent calls work.

Each test wraps the call in Promise.race with an explicit timeout so the deadlock case fails loud with TIMEOUT instead of hanging the suite.

Empirical verification with Miden Wallet stress harness

Patched the Miden Wallet to consume this branch via file: deps, enabled local proving on both stress wallets, ran the 5-tx stress against testnet.

Pre-fix: every consume hung at getInputNote; zero sends completed; 5-minute Playwright timeout.

Post-fix: 5 of 5 sends completed end-to-end through the offscreen-doc prove pipeline:

min p50 p95 max successful
sendLatencyMs 10.8 s 15.7 s 17.7 s 17.7 s 5

10–18 s/send is the expected offscreen-doc MT-WASM prove window. [mt-offscreen-prove] tx_completed prove_ms=… fires for every op.

Remaining stress failures are downstream of miden-vm#3144 (BasicFungibleFaucet procedure-digest drift across miden-assembly patch versions) — the test faucet renders as "Unknown" instead of "TST", which the stress harness's getByText('TST', exact: true) selector can't match. Unrelated to this fix.

Test plan

  • CI green
  • Manual: link this branch into Miden Wallet via scripts/dev-with-web-sdk-pr.sh, enable local proving in Settings, send a tx → should complete in ~10 s instead of hanging
  • Manual: same flow with delegated proving → unchanged from baseline (no regression on the delegate path)

Copy link
Copy Markdown
Collaborator

@SantiagoPittella SantiagoPittella left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a "trusted handle" instead of a flag? Essentially replacing the "flip a global flag while fn runs" logic with "hand fn a different object whose methods don't go through _serializeWasmCall at all":

  // In MidenClient
  _withInnerWebClient(fn) {
    return this.#inner._serializeWasmCall(() => {
      const trusted = createTrustedHandle(this.#inner); // bypasses chain
      return fn(trusted);
    });
  }

createTrustedHandle(inner) would return an object that exposes the same method surface as the proxy-wrapped inner, but each method calls the underlying wasm-bindgen client directly, skipping _serializeWasmCall entirely. The serialization is provided once, by the outer _serializeWasmCall slot that wraps the whole fn.

With this approach we can remove the shared counter and global flag, the external callers don't need to know about this.

If you prefer to keep the existing approach, I think that the new tests aren't running in the CI, would you mind to fix that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants