Skip to content

perf(motion): ~12x faster motion estimation (dask dedup + cheaper kernel)#320

Open
daharoni wants to merge 8 commits into
v2-integrationfrom
perf/motion-correction-dask-dedup
Open

perf(motion): ~12x faster motion estimation (dask dedup + cheaper kernel)#320
daharoni wants to merge 8 commits into
v2-integrationfrom
perf/motion-correction-dask-dedup

Conversation

@daharoni
Copy link
Copy Markdown
Contributor

@daharoni daharoni commented Jun 3, 2026

Summary

Speeds up motion estimation (minian/motion_correction.py) by ~12-13x per core on realistic movie lengths, with no meaningful change in output. Two independent, compounding gains:

  • b1 - dask task-duplication fix. est_motion_part exposed each chunk's (template, shifts) result as two separate dask graph nodes. After the dask >=2025 modernization, custom_arr_optimize became a pass-through, so the optimizer fused each leaf kernel into both consumers and emitted it under two keys, running est_motion_chunk once per branch. The recursion now keeps each chunk's result as a single delayed tuple and reduces via a _est_motion_reduce adapter, splitting only at the very end. (The old behavior was also subtly wrong: duplicated tasks re-ran the in-place-mutating kernel on shared views.)
  • b2 - cheaper per-frame kernel. The rigid path now warps with cv2.warpAffine instead of sitk.Resample, and registers via a direct cross-correlation with parabolic sub-pixel refinement (_xcorr_subpixel) reusing per-frame rfft2 results cached once per chunk. The non-rigid (BSpline/mesh_size) path is unchanged.

Verified results

Baseline (pre-perf) vs this branch, demo movie 480x752, single worker (synchronous scheduler = total work):

frames blocks old new speedup dedup
250 1 27.2s 10.3s 2.7x 1.0x
500 2 102.5s 16.0s 6.4x 2.0x
1000 4 367.5s 31.6s 11.6x 4.0x
2000 8 763.7s 60.6s 12.6x 4.0x

Per-op micro-benchmarks (best of 5): transform sitk.Resample 4.1ms -> cv2.warpAffine 0.47ms (8.8x); registration skimage phase_cross_correlation(upsample=100) 65ms/pair -> cached parabolic xcorr 12ms/pair (5.3x).

The two axes decompose cleanly: the 1-block row is pure b2 (no duplication possible there); the dedup column is the b1 multiplier, which plateaus at ~4x (it scales with reduction-tree depth, ~log in block count, not linearly with length).

Accuracy

  • Registration shifts vs the old method: maxΔ 0.043 px, meanΔ 0.009 px.
  • cv2 vs sitk warp: <1.7/255 max, ~0.06 mean intensity difference (interior), well below imaging noise.
  • est_motion_chunk output is bit-for-bit identical to the pre-cleanup commit across 3D/4D x alt-error cases (the loop consolidation in this branch is behavior-preserving).

Parallel scaling

These are single-worker (total-work) numbers; multi-worker speedup stacks on top. Measured on the new code (2000 frames / 8 leaf blocks, threaded scheduler, inner libs pinned to 1 thread): 2 workers 1.75x (88%), 4 workers 2.63x (66%), 8 workers 3.83x (48%). Efficiency is bounded by leaf-block count + the reduction tail, so it improves on longer recordings.

Notes

  • Base is modernize-minian (this branch was cut from it; it depends on the dask >=2025 modernization).
  • Profiling/benchmark harnesses used to develop and verify this were intentionally kept local (untracked; .gitignore lists them) rather than shipped.

daharoni and others added 6 commits June 2, 2026 23:44
est_motion_part exposed each chunk kernel's tuple output as two separate
dask nodes (res[0] -> template branch, res[1] -> shift branch). On dask
>=2025 custom_arr_optimize became a no-op pass-through, so its
keep_patterns no longer kept est_motion_chunk unfused; the default
optimizer fused each leaf kernel with its dask-array input and, reached
through both branches, emitted it under two path-dependent keys. Each
kernel therefore ran once per branch and the redundancy compounded with
recursion depth (a ~chunk-count blow-up), making motion estimation scale
superlinearly with recording length.

Keep each chunk's result as a single delayed (template, shifts) tuple and
feed the whole tuple into the next reduction via _est_motion_reduce, which
stacks templates and gathers shifts inside one delayed. The tuple is split
only once, at the end, for the two returned arrays. No da.optimize and no
stack/concatenate/.blocks round-tripping.

Demo data (480x752): 1000 frames / 4 blocks 373s -> 83s synchronous (4.5x),
167s -> 26s threaded (6.5x); scaling is now linear and the win grows with
recording length. Verified bitwise-identical to the prior algorithm
(maxdiff 0) and exec==expected kernel calls up to 20 blocks. The old code
was also subtly incorrect: duplicated tasks re-ran the in-place-mutating
kernel on already-mutated shared input views.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…arnesses

scripts/profile_motion_correction.py: scaling study (frames vs time) plus a
cProfile/line_profiler breakdown of the est_motion_chunk kernel.
scripts/diagnose_mc_scaling.py: isolates orchestration cost (graph-build vs
compute), counts kernel executions vs expected to detect task duplication,
and samples peak RSS. Both run against the demo movies.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The rigid path of transform_perframe used sitk.Resample for every frame at
every pyramid level (template building) and for the final apply_transform
output. cv2.warpAffine with INTER_LINEAR is the same bilinear translation
warp but ~7x faster per call, cutting estimate_motion time ~1.7x on demo
data (500 frames: 43.7s -> 25.6s).

cv2 samples sub-pixel offsets on a 1/32 fixed-point grid, so resampled
intensities differ from sitk by <~0.06 mean / ~1.5 max-at-edge on a 0-255
scale (exact at 0.0/0.5 offsets), well below imaging noise. Downstream: the
estimate_motion shifts move by <=0.037px and the corrected movie by 0.048
mean intensity. The BSpline (non-rigid) path is unchanged. Shift direction
verified to match the previous sitk.TranslationTransform convention.

scripts/bench_mc_kernel.py: micro-benchmark comparing warp and registration
candidates for speed and accuracy-vs-current on real demo frames.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ation

The rigid path of est_motion_perframe called skimage.phase_cross_correlation
with upsample_factor=100, which recomputes the rfft2 of both frames every
call and refines the peak with an expensive matrix-DFT upsampling. Replace
it with a direct cross-correlation (matching normalization=None) whose
sub-pixel peak is refined by a 1D parabolic fit, and precompute each frame's
rfft2 once per est_motion_chunk call so it is reused across the two
registrations the frame participates in (as src and as a neighbour's dst).

Adds ~1.2-1.6x on top of the cv2 warp change (full b2 ~1.5x over the
pre-b2 baseline on demo data, measured conservatively under thermal load).
The BSpline (non-rigid) path still uses phase_cross_correlation for its
initialization, so the upsample parameter is retained for it.

Accuracy: estimated shifts move <=0.08px mean / 0.61px max vs upsample=100
on the full 2000-frame demo; the end-to-end pipeline motion_sum stays within
the test tolerance ([391,-252] -> [404,-243], abs=30). Parabolic subpixel
agrees with upsample=100 to ~0.04px on raw frames.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collapse the mirrored i<mid / i>mid branches in est_motion_chunk into a
single direction decision plus one branch on frame representation; the
FFT-threading and slc/didx logic are no longer duplicated. Output is
bit-for-bit identical (verified across 3D/4D x alt-error cases).

Trim three comments/docstrings that re-explained the dask-dedup and
cv2-warp rationale in design-doc prose, fix two doc references anchored to
deleted code, and drop a dead total= computation in the profiling script.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The scripts/ profiling and scaling-diagnosis harnesses were dev tools used
to develop and justify the motion-correction perf work; they reference
private internals and aren't part of the library or test suite. Untrack them
(kept on disk locally) and gitignore scripts/ so they stay out of the PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@daharoni
Copy link
Copy Markdown
Contributor Author

daharoni commented Jun 3, 2026

It was really bothering me how slow the motion correction was and it seemed to hit an increased slowdown after switching to modern dask. This PR corrects for inefficiencies that were occuring in modern dask while also speeding up some of the internals.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 3, 2026

Codecov Report

❌ Patch coverage is 3.37079% with 86 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
minian/motion_correction.py 3.37% 86 Missing ⚠️

📢 Thoughts on this report? Let us know!

@daharoni
Copy link
Copy Markdown
Contributor Author

daharoni commented Jun 3, 2026

motion correction like other methods are lacking unit tests. I think we can hold off on that until I make a PR for my minian simulation module which can generate small test data for these individual steps in minian.

@daharoni daharoni requested a review from sneakers-the-rat June 3, 2026 17:49
@daharoni
Copy link
Copy Markdown
Contributor Author

daharoni commented Jun 3, 2026

17 seconds instead of the previous 4 to 8 minutes! And the motion corrected output metrics and video look rock solid.

image image

@sneakers-the-rat
Copy link
Copy Markdown
Contributor

I wonder if we should cap off the "modernize minian" work at just getting it to run, release 1.3.0, and then put all improvements in >2.0?

@daharoni
Copy link
Copy Markdown
Contributor Author

daharoni commented Jun 3, 2026

My concern is it looks like the changes in >=2025 dask have likely other impacts on performance. I would lean towards not doing the 1.3 release of modernize-minian and instead jump to a v2.0 with it being a reasonably performant pipeline along with easier install and testing.

If we put out a v1.3 with modern dependencies but leave performance as is, I feel like those that grab v1.3 minian will be in a worse position than just keeping the current newest release and then jumping to v2 whenever they decide to do an update.

But I don't know best practices for these sorts of changes and releases.

@sneakers-the-rat
Copy link
Copy Markdown
Contributor

The purpose of a 1.3 release is to have it be a near-exact drop-in for the cases where people have built lots of stuff around the workflow functioning exactly as is, which we have no visibility into. I don't understand why we wouldn't do that, since its pretty much done as is.

Everyone else who doesn't have some built out stuff that expects the API to remain unchanged (and this will probably be most people) can just jump straight to 2 when its released.

It would be pretty much exactly the same amount of work to do two releases at two different points in the commit history as it would be to do one, and by doing two we add value of easier installs with a more gradual path to future versions while also getting the benefits of releasing new features with a clearly marked version bump.

@daharoni daharoni changed the base branch from modernize-minian to v2-integration June 4, 2026 19:34
@daharoni
Copy link
Copy Markdown
Contributor Author

daharoni commented Jun 4, 2026

@sneakers-the-rat, let me know if you have any major concerns here. otherwise we can merge into the new v2-integration branch to start collecting v2 improvements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants