perf(motion): ~12x faster motion estimation (dask dedup + cheaper kernel)#320
perf(motion): ~12x faster motion estimation (dask dedup + cheaper kernel)#320daharoni wants to merge 8 commits into
Conversation
est_motion_part exposed each chunk kernel's tuple output as two separate dask nodes (res[0] -> template branch, res[1] -> shift branch). On dask >=2025 custom_arr_optimize became a no-op pass-through, so its keep_patterns no longer kept est_motion_chunk unfused; the default optimizer fused each leaf kernel with its dask-array input and, reached through both branches, emitted it under two path-dependent keys. Each kernel therefore ran once per branch and the redundancy compounded with recursion depth (a ~chunk-count blow-up), making motion estimation scale superlinearly with recording length. Keep each chunk's result as a single delayed (template, shifts) tuple and feed the whole tuple into the next reduction via _est_motion_reduce, which stacks templates and gathers shifts inside one delayed. The tuple is split only once, at the end, for the two returned arrays. No da.optimize and no stack/concatenate/.blocks round-tripping. Demo data (480x752): 1000 frames / 4 blocks 373s -> 83s synchronous (4.5x), 167s -> 26s threaded (6.5x); scaling is now linear and the win grows with recording length. Verified bitwise-identical to the prior algorithm (maxdiff 0) and exec==expected kernel calls up to 20 blocks. The old code was also subtly incorrect: duplicated tasks re-ran the in-place-mutating kernel on already-mutated shared input views. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…arnesses scripts/profile_motion_correction.py: scaling study (frames vs time) plus a cProfile/line_profiler breakdown of the est_motion_chunk kernel. scripts/diagnose_mc_scaling.py: isolates orchestration cost (graph-build vs compute), counts kernel executions vs expected to detect task duplication, and samples peak RSS. Both run against the demo movies. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The rigid path of transform_perframe used sitk.Resample for every frame at every pyramid level (template building) and for the final apply_transform output. cv2.warpAffine with INTER_LINEAR is the same bilinear translation warp but ~7x faster per call, cutting estimate_motion time ~1.7x on demo data (500 frames: 43.7s -> 25.6s). cv2 samples sub-pixel offsets on a 1/32 fixed-point grid, so resampled intensities differ from sitk by <~0.06 mean / ~1.5 max-at-edge on a 0-255 scale (exact at 0.0/0.5 offsets), well below imaging noise. Downstream: the estimate_motion shifts move by <=0.037px and the corrected movie by 0.048 mean intensity. The BSpline (non-rigid) path is unchanged. Shift direction verified to match the previous sitk.TranslationTransform convention. scripts/bench_mc_kernel.py: micro-benchmark comparing warp and registration candidates for speed and accuracy-vs-current on real demo frames. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ation The rigid path of est_motion_perframe called skimage.phase_cross_correlation with upsample_factor=100, which recomputes the rfft2 of both frames every call and refines the peak with an expensive matrix-DFT upsampling. Replace it with a direct cross-correlation (matching normalization=None) whose sub-pixel peak is refined by a 1D parabolic fit, and precompute each frame's rfft2 once per est_motion_chunk call so it is reused across the two registrations the frame participates in (as src and as a neighbour's dst). Adds ~1.2-1.6x on top of the cv2 warp change (full b2 ~1.5x over the pre-b2 baseline on demo data, measured conservatively under thermal load). The BSpline (non-rigid) path still uses phase_cross_correlation for its initialization, so the upsample parameter is retained for it. Accuracy: estimated shifts move <=0.08px mean / 0.61px max vs upsample=100 on the full 2000-frame demo; the end-to-end pipeline motion_sum stays within the test tolerance ([391,-252] -> [404,-243], abs=30). Parabolic subpixel agrees with upsample=100 to ~0.04px on raw frames. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collapse the mirrored i<mid / i>mid branches in est_motion_chunk into a single direction decision plus one branch on frame representation; the FFT-threading and slc/didx logic are no longer duplicated. Output is bit-for-bit identical (verified across 3D/4D x alt-error cases). Trim three comments/docstrings that re-explained the dask-dedup and cv2-warp rationale in design-doc prose, fix two doc references anchored to deleted code, and drop a dead total= computation in the profiling script. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The scripts/ profiling and scaling-diagnosis harnesses were dev tools used to develop and justify the motion-correction perf work; they reference private internals and aren't part of the library or test suite. Untrack them (kept on disk locally) and gitignore scripts/ so they stay out of the PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
It was really bothering me how slow the motion correction was and it seemed to hit an increased slowdown after switching to modern dask. This PR corrects for inefficiencies that were occuring in modern dask while also speeding up some of the internals. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
motion correction like other methods are lacking unit tests. I think we can hold off on that until I make a PR for my minian simulation module which can generate small test data for these individual steps in minian. |
|
I wonder if we should cap off the "modernize minian" work at just getting it to run, release 1.3.0, and then put all improvements in >2.0? |
|
My concern is it looks like the changes in >=2025 dask have likely other impacts on performance. I would lean towards not doing the 1.3 release of modernize-minian and instead jump to a v2.0 with it being a reasonably performant pipeline along with easier install and testing. If we put out a v1.3 with modern dependencies but leave performance as is, I feel like those that grab v1.3 minian will be in a worse position than just keeping the current newest release and then jumping to v2 whenever they decide to do an update. But I don't know best practices for these sorts of changes and releases. |
|
The purpose of a 1.3 release is to have it be a near-exact drop-in for the cases where people have built lots of stuff around the workflow functioning exactly as is, which we have no visibility into. I don't understand why we wouldn't do that, since its pretty much done as is. Everyone else who doesn't have some built out stuff that expects the API to remain unchanged (and this will probably be most people) can just jump straight to 2 when its released. It would be pretty much exactly the same amount of work to do two releases at two different points in the commit history as it would be to do one, and by doing two we add value of easier installs with a more gradual path to future versions while also getting the benefits of releasing new features with a clearly marked version bump. |
|
@sneakers-the-rat, let me know if you have any major concerns here. otherwise we can merge into the new |


Summary
Speeds up motion estimation (
minian/motion_correction.py) by ~12-13x per core on realistic movie lengths, with no meaningful change in output. Two independent, compounding gains:est_motion_partexposed each chunk's(template, shifts)result as two separate dask graph nodes. After the dask >=2025 modernization,custom_arr_optimizebecame a pass-through, so the optimizer fused each leaf kernel into both consumers and emitted it under two keys, runningest_motion_chunkonce per branch. The recursion now keeps each chunk's result as a single delayed tuple and reduces via a_est_motion_reduceadapter, splitting only at the very end. (The old behavior was also subtly wrong: duplicated tasks re-ran the in-place-mutating kernel on shared views.)cv2.warpAffineinstead ofsitk.Resample, and registers via a direct cross-correlation with parabolic sub-pixel refinement (_xcorr_subpixel) reusing per-framerfft2results cached once per chunk. The non-rigid (BSpline/mesh_size) path is unchanged.Verified results
Baseline (pre-perf) vs this branch, demo movie 480x752, single worker (synchronous scheduler = total work):
Per-op micro-benchmarks (best of 5): transform
sitk.Resample4.1ms ->cv2.warpAffine0.47ms (8.8x); registration skimagephase_cross_correlation(upsample=100)65ms/pair -> cached parabolic xcorr 12ms/pair (5.3x).The two axes decompose cleanly: the 1-block row is pure b2 (no duplication possible there); the
dedupcolumn is the b1 multiplier, which plateaus at ~4x (it scales with reduction-tree depth, ~log in block count, not linearly with length).Accuracy
est_motion_chunkoutput is bit-for-bit identical to the pre-cleanup commit across 3D/4D x alt-error cases (the loop consolidation in this branch is behavior-preserving).Parallel scaling
These are single-worker (total-work) numbers; multi-worker speedup stacks on top. Measured on the new code (2000 frames / 8 leaf blocks, threaded scheduler, inner libs pinned to 1 thread): 2 workers 1.75x (88%), 4 workers 2.63x (66%), 8 workers 3.83x (48%). Efficiency is bounded by leaf-block count + the reduction tail, so it improves on longer recordings.
Notes
modernize-minian(this branch was cut from it; it depends on the dask >=2025 modernization)..gitignorelists them) rather than shipped.