perf(motion): ~12x faster motion estimation (dask dedup + cheaper kernel) by daharoni · Pull Request #320 · miniscope/minian

daharoni · 2026-06-03T15:22:01Z

Summary

Speeds up motion estimation (minian/motion_correction.py) by ~12-13x per core on realistic movie lengths, with no meaningful change in output. Two independent, compounding gains:

b1 - dask task-duplication fix. est_motion_part exposed each chunk's (template, shifts) result as two separate dask graph nodes. After the dask >=2025 modernization, custom_arr_optimize became a pass-through, so the optimizer fused each leaf kernel into both consumers and emitted it under two keys, running est_motion_chunk once per branch. The recursion now keeps each chunk's result as a single delayed tuple and reduces via a _est_motion_reduce adapter, splitting only at the very end. (The old behavior was also subtly wrong: duplicated tasks re-ran the in-place-mutating kernel on shared views.)
b2 - cheaper per-frame kernel. The rigid path now warps with cv2.warpAffine instead of sitk.Resample, and registers via a direct cross-correlation with parabolic sub-pixel refinement (_xcorr_subpixel) reusing per-frame rfft2 results cached once per chunk. The non-rigid (BSpline/mesh_size) path is unchanged.

Verified results

Baseline (pre-perf) vs this branch, demo movie 480x752, single worker (synchronous scheduler = total work):

frames	blocks	old	new	speedup	dedup
250	1	27.2s	10.3s	2.7x	1.0x
500	2	102.5s	16.0s	6.4x	2.0x
1000	4	367.5s	31.6s	11.6x	4.0x
2000	8	763.7s	60.6s	12.6x	4.0x

Per-op micro-benchmarks (best of 5): transform sitk.Resample 4.1ms -> cv2.warpAffine 0.47ms (8.8x); registration skimage phase_cross_correlation(upsample=100) 65ms/pair -> cached parabolic xcorr 12ms/pair (5.3x).

The two axes decompose cleanly: the 1-block row is pure b2 (no duplication possible there); the dedup column is the b1 multiplier, which plateaus at ~4x (it scales with reduction-tree depth, ~log in block count, not linearly with length).

Accuracy

Registration shifts vs the old method: maxΔ 0.043 px, meanΔ 0.009 px.
cv2 vs sitk warp: <1.7/255 max, ~0.06 mean intensity difference (interior), well below imaging noise.
est_motion_chunk output is bit-for-bit identical to the pre-cleanup commit across 3D/4D x alt-error cases (the loop consolidation in this branch is behavior-preserving).

Parallel scaling

These are single-worker (total-work) numbers; multi-worker speedup stacks on top. Measured on the new code (2000 frames / 8 leaf blocks, threaded scheduler, inner libs pinned to 1 thread): 2 workers 1.75x (88%), 4 workers 2.63x (66%), 8 workers 3.83x (48%). Efficiency is bounded by leaf-block count + the reduction tail, so it improves on longer recordings.

Notes

Base is modernize-minian (this branch was cut from it; it depends on the dask >=2025 modernization).
Profiling/benchmark harnesses used to develop and verify this were intentionally kept local (untracked; .gitignore lists them) rather than shipped.

est_motion_part exposed each chunk kernel's tuple output as two separate dask nodes (res[0] -> template branch, res[1] -> shift branch). On dask >=2025 custom_arr_optimize became a no-op pass-through, so its keep_patterns no longer kept est_motion_chunk unfused; the default optimizer fused each leaf kernel with its dask-array input and, reached through both branches, emitted it under two path-dependent keys. Each kernel therefore ran once per branch and the redundancy compounded with recursion depth (a ~chunk-count blow-up), making motion estimation scale superlinearly with recording length. Keep each chunk's result as a single delayed (template, shifts) tuple and feed the whole tuple into the next reduction via _est_motion_reduce, which stacks templates and gathers shifts inside one delayed. The tuple is split only once, at the end, for the two returned arrays. No da.optimize and no stack/concatenate/.blocks round-tripping. Demo data (480x752): 1000 frames / 4 blocks 373s -> 83s synchronous (4.5x), 167s -> 26s threaded (6.5x); scaling is now linear and the win grows with recording length. Verified bitwise-identical to the prior algorithm (maxdiff 0) and exec==expected kernel calls up to 20 blocks. The old code was also subtly incorrect: duplicated tasks re-ran the in-place-mutating kernel on already-mutated shared input views. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…arnesses scripts/profile_motion_correction.py: scaling study (frames vs time) plus a cProfile/line_profiler breakdown of the est_motion_chunk kernel. scripts/diagnose_mc_scaling.py: isolates orchestration cost (graph-build vs compute), counts kernel executions vs expected to detect task duplication, and samples peak RSS. Both run against the demo movies. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The rigid path of transform_perframe used sitk.Resample for every frame at every pyramid level (template building) and for the final apply_transform output. cv2.warpAffine with INTER_LINEAR is the same bilinear translation warp but ~7x faster per call, cutting estimate_motion time ~1.7x on demo data (500 frames: 43.7s -> 25.6s). cv2 samples sub-pixel offsets on a 1/32 fixed-point grid, so resampled intensities differ from sitk by <~0.06 mean / ~1.5 max-at-edge on a 0-255 scale (exact at 0.0/0.5 offsets), well below imaging noise. Downstream: the estimate_motion shifts move by <=0.037px and the corrected movie by 0.048 mean intensity. The BSpline (non-rigid) path is unchanged. Shift direction verified to match the previous sitk.TranslationTransform convention. scripts/bench_mc_kernel.py: micro-benchmark comparing warp and registration candidates for speed and accuracy-vs-current on real demo frames. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ation The rigid path of est_motion_perframe called skimage.phase_cross_correlation with upsample_factor=100, which recomputes the rfft2 of both frames every call and refines the peak with an expensive matrix-DFT upsampling. Replace it with a direct cross-correlation (matching normalization=None) whose sub-pixel peak is refined by a 1D parabolic fit, and precompute each frame's rfft2 once per est_motion_chunk call so it is reused across the two registrations the frame participates in (as src and as a neighbour's dst). Adds ~1.2-1.6x on top of the cv2 warp change (full b2 ~1.5x over the pre-b2 baseline on demo data, measured conservatively under thermal load). The BSpline (non-rigid) path still uses phase_cross_correlation for its initialization, so the upsample parameter is retained for it. Accuracy: estimated shifts move <=0.08px mean / 0.61px max vs upsample=100 on the full 2000-frame demo; the end-to-end pipeline motion_sum stays within the test tolerance ([391,-252] -> [404,-243], abs=30). Parabolic subpixel agrees with upsample=100 to ~0.04px on raw frames. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Collapse the mirrored i<mid / i>mid branches in est_motion_chunk into a single direction decision plus one branch on frame representation; the FFT-threading and slc/didx logic are no longer duplicated. Output is bit-for-bit identical (verified across 3D/4D x alt-error cases). Trim three comments/docstrings that re-explained the dask-dedup and cv2-warp rationale in design-doc prose, fix two doc references anchored to deleted code, and drop a dead total= computation in the profiling script. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The scripts/ profiling and scaling-diagnosis harnesses were dev tools used to develop and justify the motion-correction perf work; they reference private internals and aren't part of the library or test suite. Untrack them (kept on disk locally) and gitignore scripts/ so they stay out of the PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

daharoni · 2026-06-03T15:25:29Z

It was really bothering me how slow the motion correction was and it seemed to hit an increased slowdown after switching to modern dask. This PR corrects for inefficiencies that were occuring in modern dask while also speeding up some of the internals.

codecov · 2026-06-03T15:55:22Z

Codecov Report

❌ Patch coverage is 3.37079% with 86 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
minian/motion_correction.py	3.37%	86 Missing ⚠️

📢 Thoughts on this report? Let us know!

daharoni · 2026-06-03T17:43:25Z

motion correction like other methods are lacking unit tests. I think we can hold off on that until I make a PR for my minian simulation module which can generate small test data for these individual steps in minian.

daharoni · 2026-06-03T18:10:02Z

17 seconds instead of the previous 4 to 8 minutes! And the motion corrected output metrics and video look rock solid.

sneakers-the-rat · 2026-06-03T18:57:45Z

I wonder if we should cap off the "modernize minian" work at just getting it to run, release 1.3.0, and then put all improvements in >2.0?

daharoni · 2026-06-03T20:00:28Z

My concern is it looks like the changes in >=2025 dask have likely other impacts on performance. I would lean towards not doing the 1.3 release of modernize-minian and instead jump to a v2.0 with it being a reasonably performant pipeline along with easier install and testing.

If we put out a v1.3 with modern dependencies but leave performance as is, I feel like those that grab v1.3 minian will be in a worse position than just keeping the current newest release and then jumping to v2 whenever they decide to do an update.

But I don't know best practices for these sorts of changes and releases.

sneakers-the-rat · 2026-06-03T20:32:53Z

The purpose of a 1.3 release is to have it be a near-exact drop-in for the cases where people have built lots of stuff around the workflow functioning exactly as is, which we have no visibility into. I don't understand why we wouldn't do that, since its pretty much done as is.

Everyone else who doesn't have some built out stuff that expects the API to remain unchanged (and this will probably be most people) can just jump straight to 2 when its released.

It would be pretty much exactly the same amount of work to do two releases at two different points in the commit history as it would be to do one, and by doing two we add value of easier installs with a more gradual path to future versions while also getting the benefits of releasing new features with a clearly marked version bump.

daharoni · 2026-06-04T19:51:12Z

@sneakers-the-rat, let me know if you have any major concerns here. otherwise we can merge into the new v2-integration branch to start collecting v2 improvements.

daharoni and others added 6 commits June 2, 2026 23:44

Merge branch 'modernize-minian' into perf/motion-correction-dask-dedup

36c75af

daharoni requested a review from sneakers-the-rat June 3, 2026 17:49

Merge branch 'modernize-minian' into perf/motion-correction-dask-dedup

e0b6b8b

daharoni mentioned this pull request Jun 4, 2026

KilledWorker during save motion #120

Closed

daharoni changed the base branch from modernize-minian to v2-integration June 4, 2026 19:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(motion): ~12x faster motion estimation (dask dedup + cheaper kernel)#320

perf(motion): ~12x faster motion estimation (dask dedup + cheaper kernel)#320
daharoni wants to merge 8 commits into
v2-integrationfrom
perf/motion-correction-dask-dedup

daharoni commented Jun 3, 2026

Uh oh!

daharoni commented Jun 3, 2026

Uh oh!

codecov Bot commented Jun 3, 2026

Uh oh!

daharoni commented Jun 3, 2026

Uh oh!

daharoni commented Jun 3, 2026 •

edited

Loading

Uh oh!

sneakers-the-rat commented Jun 3, 2026

Uh oh!

daharoni commented Jun 3, 2026

Uh oh!

sneakers-the-rat commented Jun 3, 2026

Uh oh!

daharoni commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

daharoni commented Jun 3, 2026

Summary

Verified results

Accuracy

Parallel scaling

Notes

Uh oh!

daharoni commented Jun 3, 2026

Uh oh!

codecov Bot commented Jun 3, 2026

Codecov Report

Uh oh!

daharoni commented Jun 3, 2026

Uh oh!

daharoni commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sneakers-the-rat commented Jun 3, 2026

Uh oh!

daharoni commented Jun 3, 2026

Uh oh!

sneakers-the-rat commented Jun 3, 2026

Uh oh!

daharoni commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

daharoni commented Jun 3, 2026 •

edited

Loading