Skip to content

Fix SMB streaming-write deadlock; bump smb2 to 0.9.0#16

Merged
vdavid merged 15 commits into
mainfrom
investigate/smb-deadlock
May 17, 2026
Merged

Fix SMB streaming-write deadlock; bump smb2 to 0.9.0#16
vdavid merged 15 commits into
mainfrom
investigate/smb-deadlock

Conversation

@vdavid
Copy link
Copy Markdown
Owner

@vdavid vdavid commented May 17, 2026

Summary

  • Fixes a reproducible deadlock in SmbVolume::write_from_stream that hung large concurrent copies to SMB destinations (originally surfaced as a 30+ min freeze in the copy dialog against a real QNAP NAS, 15369 files × 7–8 MB on OverwriteSmaller).
  • Root cause: the streaming-fallback write path held the Arc<Mutex<Option<SmbClient>>> for the entire upload duration because smb2::FileWriter<'a> borrowed &'a mut Connection from the client. Under sustained concurrent pressure (writes interleaved with the conflict pre-flight get_metadata stats), the nested two-phase lock pattern starved waiters.
  • Architectural fix in smb2 0.9.0: FileWriter now owns its Connection (cheap Arc::clone) and Arc<Tree>, no lifetime parameter. cmdr's write_from_stream drives both the compound fast-path and the streaming fallback on a single owned Connection clone obtained via clone_session. No mutex is held across SMB I/O. N concurrent streaming writes on one SmbVolume now pipeline N WRITE chains over a single SMB session, multiplexed by MessageId in smb2's receiver task.
  • Ships a regression test (smb_integration_concurrent_streaming_writes_no_deadlock) that reproduces the original bug shape (200 files × 1 MB through smb-maxreadsize's 64 KB max_write, OverwriteSmaller with 140 same-size conflicts + 60 actual copies, concurrency 8). Verified 5/5 hangs at the 120 s timeout pre-fix and 5/5 pass in 5–14 s post-fix.
  • Net diff drops dead code (acquire_client, LoggedClientGuard) and keeps the lightweight clone_session ticket logging behind RUST_LOG=cmdr_lib::file_system::volume::smb=debug for future SMB diagnostics.

Test plan

  • Repo-wide rust check green locally (29 integration tests pass; the new regression test gracefully skips when the smb-maxreadsize fixture isn't running — internal-smb-maxreadsize isn't in cmdr's vendored .compose/ set yet, follow-up to promote it).
  • Regression test verified against a fresh container (docker compose -f ~/projects-git/vdavid/smb2/tests/docker/internal/docker-compose.yml up -d smb-maxreadsize): 5/5 pass post-fix, 5/5 hang pre-fix.
  • Manual end-to-end against the user's real QNAP, OverwriteSmaller 200 × 7 MB copy: pre-fix hangs at 300 s; post-fix completes in ~107 s.
  • smb2 0.9.0 release gates green (just check-all, just test-docker, just test-consumer); 5 pre-existing live-integration failures (Pi/Kerberos infrastructure) are unrelated to the writer change.

Follow-up

  • Promote smb-maxreadsize from smb2's internal/ fixtures to its consumer-class set, then re-vendor cmdr's .compose/ and add to start.sh core, so the regression test runs in CI without manual fixture setup.

vdavid added 15 commits May 16, 2026 21:24
Bumped clippy 1.95 flags three pre-existing sort_by(|a,b| ...
cmp(...)) calls in volumes_linux/mod.rs as unnecessary_sort_by.
Switched to sort_by_key(|v| v.name.to_lowercase()).

Cfg-gated for Linux so verified by CI on Linux, not locally.
`mtp.spec.ts: copies file from MTP to local` was the lone outlier among the MTP→local / local→MTP copy tests that skipped the established belt-and-suspenders pattern. Failed once under the slow-suite load profile (rust-tests-linux + Docker SMB containers + everything else running concurrently): BE copy completed in 26 ms (50-byte `report.txt` landed at `/right/`), FE got the `Copy complete` log 1.2 s later, but the destination pane's PaneStateStore stayed at `files: []` for the full 15 s `mcpAwaitItem` budget — both safety nets raced.

What happened:

- The FilePane's per-pane notify-rs watcher (200 ms debounce) apparently missed the FSEvents add for `/right/report.txt`. The drive indexer's separate reconciler did see it (`live_heartbeat events=1` at 13:09:56), so the file genuinely was on disk; the per-pane debouncer just didn't deliver. macOS FSEvents under disk-and-CPU pressure occasionally drops events between debouncer batches.
- The `refreshPanesAfterTransfer` → `refreshPaneListing(rightPaneRef)` IPC fired from `handleTransferComplete` either didn't fire or didn't reach the BE (no `stall_probe::listing_task` for `/right` for 14 s after BE complete). Tauri event-loop saturation under the slow-suite load profile is the likely cause.

The TransferProgressDialog also visibly stayed in "Scanning 0/0/0" at the last captured recording frame even though the `Copy complete` log fired — same root cause (Svelte reactive update starved by the event loop), but cosmetic for the user (next frame would have closed it; in practice the user never sees the timeout the test hits).

This is not a prod bug — a user always has at least one rescuer (watcher or post-op refresh) on the happy path, and can press F5 if both race. It's a test robustness issue: the 15 s budget plus zero fallback path can't survive the worst-case load profile that `--include-slow` produces.

Fix: bring this test in line with tests 11 (`copies file from local to MTP`, mtp.spec.ts:360) and 27 (`copies 50 MB file from MTP to local`, mtp.spec.ts:1073), which already use the defensive pattern:

1. `pollFs` for the destination file on disk (the authoritative truth, independent of UI plumbing).
2. Explicit `mcpCall('refresh', {})` to force the pane re-list deterministically.
3. `mcpAwaitItem` against the fresh PaneStateStore.

The test now passes regardless of whether the watcher or the auto-refresh raced. Verified locally in isolation (4.3 s).
- Viewer and Settings windows open with `focus: false` when running under `CMDR_E2E_MODE=1`. The plugin drives the DOM via the Unix socket and doesn't need OS focus.
- Settings re-open path skips the `focus-self` event in E2E too, so reopening an already-open Settings window doesn't grab focus either.
- macOS `show_main_window` uses `[NSWindow orderFront:]` instead of `makeKeyAndOrderFront:` in E2E, so the initial main-window appearance doesn't steal focus from whatever the user is working in.
- Prod/dev behavior unchanged.
- New typed variants `VolumeError::DeletePending` and `WriteOperationError::DeletePending` so the listing and write-error paths can carry the state end-to-end instead of falling through to the generic `IoError`.
- `map_smb_error` dispatches on `err.status() == Some(NtStatus::DELETE_PENDING)` before the kind match (smb2 currently classifies it as `ErrorKind::Other`).
- `kinds::delete_pending` gives users a transient "File is being removed" message that explains an open handle is keeping the file alive, plus a wait/restart suggestion. Replaces the misleading "disk needs attention" copy.
- Tests in `smb.rs`, `volume_copy_tests.rs`, and `friendly_error/mod.rs` pin the mapping, the friendly copy shape, and the no-"error"/"failed" rule for the new variant.
Pure observability for the production deadlock where the client mutex
appears to be held indefinitely after a streaming-fallback write.
No behavior changes — locks acquire and release exactly as before.

- Process-global `CLIENT_LOCK_TICKET` (AtomicU64) assigns a unique id to
  every acquire of the SmbClient mutex.
- `LoggedClientGuard<'a>` wraps `MutexGuard<Option<SmbClient>>` with
  Deref/DerefMut + Drop, so existing callers (`guard.as_mut()...`) are
  unchanged and the release log fires automatically.
- `acquire_client` and `clone_session` log `waiting` / `acquired` / `released`
  with ticket, caller name, share, wait time, and held duration.
- Bail paths (no-session) emit a matching release log so we don't see
  ghost "waiting" entries without resolution.
- Workspace `[patch.crates-io] smb2 = { path = "../smb2-smb-deadlock" }` so
  the worktree picks up local smb2 instrumentation. Investigation-only;
  not for main.

Grep pattern: `grep client-mutex: cmdr.log | sort -k4` groups by ticket.
- Bumps `smb2` to 0.9.0, which rebuilt `FileWriter` to own its `Connection` + `Arc<Tree>` instead of borrowing `&'a mut Connection`. See smb2's `investigate/smb-deadlock` commit for the API shape.
- `SmbVolume::write_from_stream` is now lock-free for the entire upload. Both the compound fast-path and the streaming fallback drive their write on a single `clone_session()` result (cheap `Arc::clone` of `Connection` + `Arc::clone` of `Arc<Tree>`). The client mutex is held only for the few microseconds of the clone itself, never across I/O. N concurrent `write_from_stream` calls on one `SmbVolume` now pipeline N WRITE chains over a single SMB session, multiplexed by `MessageId` in smb2's receiver task.
- Cancellation in the streaming path falls back to a second `clone_session()` for the partial-file delete (the writer's connection is gone by then). Best-effort; unchanged semantics.
- Marks `acquire_client` `#[allow(dead_code)]` and updates its doc comment — it's no longer called from any hot path. Kept in the file as a candidate for follow-up removal once we're confident no reconnect/diagnostic caller is coming back; `SmbClient` is `!Clone` so the guarded `&mut SmbClient` shape might still be useful.
- Updates `volume/CLAUDE.md`: replaces the "SMB write streaming fallback still holds the client mutex" Gotcha with a forward pointer ("no longer applicable as of smb2 0.9"), adds a Decision capturing the new design and the deadlock-fix Why, and amends the session-split Decision to reflect that the streaming-write path now uses the same clone-and-release shape as the read paths.
- Why: under sustained concurrent pressure on QNAP (Phase C, 200 × 7 MB OverwriteSmaller), the old two-phase pattern (brief `clone_session` fast-path probe → drop → long `acquire_client` for the streaming fallback) deadlocked reliably. The fallback was forced to hold the client mutex for the upload's duration because the old `FileWriter<'a>` borrowed `&'a mut Connection`. Phase C `smb_integration_phase_c_deadlock_repro_qnap` now PASSES in ~108 s (versus the pre-fix 300 s+ hang).
Replaces the throwaway Phase C / Docker-variants experimental block with a single, scoped regression test for the SMB streaming-write deadlock fixed in commit `efb15479`.

- Adds `smb_integration_concurrent_streaming_writes_no_deadlock`, gated behind `#[ignore]` and the smb2 `smb-maxreadsize` Docker fixture on port 10454. 200 × 1 MB files, 140 OverwriteSmaller conflicts + 60 actual copies, concurrency=8 — the production shape that originally surfaced the bug.
- Helpers (`MutexCaptureLogger`, `install_mutex_capture_logger`, `cleanup_test_prefix`, `connect_docker_smb_volume`, `run_concurrent_write_pass`) live alongside the test and reuse the existing `client-mutex:` / smb2 `recv:` ring-buffer dump on hang.
- Cleanup is parameterized on a `TEST_PREFIX_ROOT = "_test/cmdr-regression-"` constant. Two `#[test]` units lock down the safety assert and the prefix shape.
- Removed: QNAP-specific test + smb2 `.env` loader (touched the user's NAS), and the flaky `v2_slow`, `v4_concurrency32`, `v5_loop` Docker variants.
- Verified: 3 consecutive runs against post-fix code passed in 4.8 s / 5.6 s / 10.6 s.

Follow-up: promote `smb-maxreadsize` from smb2's `internal` fixtures to consumer-class so cmdr's `.compose/` vendors it. That would let this test run in CI without manual fixture setup.
`acquire_client` had its sole hot-path caller removed in `efb15479` —
the streaming write path now drives on an owned `Connection` clone via
smb2 0.9's owned `FileWriter`. Nothing else called it. `LoggedClientGuard`
existed only to instrument that one method's release point, so it goes
with it.

`clone_session` keeps its ticketed `client-mutex:` debug logs: that's
the one remaining lock acquire on the SMB session and a useful future
diagnostic. The mutex is held for microseconds per call now (just to
clone the underlying `Arc<Connection>`), so the logs sit at `debug`
where they fire only when explicitly enabled.

Also updates the colocated CLAUDE.md and the historical comment in
`write_from_stream` to point at the regression test that pins the
no-deadlock shape, not the deleted Phase C QNAP test.
The workspace-level `[patch.crates-io] smb2 = { path = ... }` was the
sidecar for developing the FileWriter rewrite against a local smb2
worktree. With smb2 v0.9.0 published, the desktop crate resolves the
real artifact via its normal `smb2 = "0.9.0"` dep.
The repo-wide rust check (run via nextest with `--run-ignored only`)
discovers all `smb_integration_*` tests. The streaming-write regression
needs smb-maxreadsize on port 10454, which cmdr's vendored `.compose/`
set doesn't include yet (it lives in smb2's `internal/` fixtures). Until
the fixture is promoted to smb2's consumer-class set and re-vendored,
the test now TCP-probes port 10454 with a 1 s deadline and logs a
warning + returns clean if the server isn't there.

When the fixture IS up — manual run, future CI once vendored — the
test proceeds and continues to verify the no-deadlock invariant.
smb2 0.9.1 promoted the chunked-IO fixture from its `internal/` set
to its consumer-class set. Re-vendor `.compose/` to pick up the new
`smb-consumer-maxreadsize` Dockerfile + smb.conf, bump the dep, and
add the container to `start.sh core` so the local rust check (and CI)
spin it up automatically. Provides a small 64 KB max_read/max_write
fixture for any test that needs to force the streaming-fallback path
without bringing in the smb2 internal-fixture compose stack.
Drop the TCP-probe graceful-skip and point the test at port 10494,
which `./apps/desktop/test/smb-servers/start.sh core` (and CI) now
start automatically via the vendored `smb-consumer-maxreadsize`
container. The test now actually runs on every Rust check, guarding
the no-deadlock invariant for real instead of skipping.

Port resolves from `SMB_CONSUMER_MAXREADSIZE_PORT` with a default of
10494, mirroring the pattern used elsewhere in the smb_integration_*
suite (the `smb2::testing::*_port()` helpers live behind the smb-e2e
feature, so plain integration tests hardcode the default).
The two `Escape closes the X window (production binding)` tests in
settings.spec.ts and viewer.spec.ts went through `keyboard.press('Escape')`,
which delivers the keystroke via the OS. Under Xvfb on Linux CI the
keystroke routinely lands on the main window (the previous test left
focus there, or the open-animation hasn't transferred it yet), the
scoped webview never sees it, and the test sits waiting for a
window-close that doesn't come. The existing two-attempt focus poll
helped but didn't eliminate the race — both tests flaked on the most
recent PR run.

Switch both to a synthetic `document.dispatchEvent(new KeyboardEvent('keydown',
{ key: 'Escape', bubbles: true }))` via `evaluate`. That bubbles through
the `<svelte:window on:keydown>` listener regardless of OS focus, which
is what the test is actually meant to exercise (the handler's close
logic). The Tauri/webkit2gtk OS → webview event pipeline isn't cmdr's
responsibility to test.

Pattern matches `network-toggle.spec.ts` and `file-operations.spec.ts`,
which already use synthetic Escape dispatch for the same reason. Drops
the helper `tryEscape` / focus-poll machinery; the test is now linear.
…ndor

Two related touch-ups around the vendored SMB compose set:

- Add `apps/desktop/test/smb-servers/.compose/` to `.oxfmtrc.json`'s
  `ignorePatterns`. Without it, `pnpm oxfmt .` (and CI's `check oxfmt`)
  rewrites YAML quoting in the vendored `docker-compose.yml`, which
  diverges from smb2's source of truth and creates churn on every
  re-vendor. The .compose/ files belong to smb2; cmdr just ships them.
- Replace the `rm -rf` / `cp -r` re-vendor recipe with an
  `rsync --exclude=VENDORED.md`. The old recipe deleted the very
  VENDORED.md it told the reader to consult, which made step 2 destroy
  step 1's docs. Also add a short section explaining the formatter
  exclusion so the next person bumping smb2 knows why it's there.
@vdavid vdavid merged commit 1a1b304 into main May 17, 2026
10 checks passed
@vdavid vdavid deleted the investigate/smb-deadlock branch May 17, 2026 09:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant