Skip to content

Modernize minian - packaging and api updates#301

Merged
sneakers-the-rat merged 131 commits into
masterfrom
modernize-minian
Jun 4, 2026
Merged

Modernize minian - packaging and api updates#301
sneakers-the-rat merged 131 commits into
masterfrom
modernize-minian

Conversation

@sneakers-the-rat
Copy link
Copy Markdown
Contributor

@sneakers-the-rat sneakers-the-rat commented May 29, 2026

This is a combined PR, where we are collecting smaller changes in this branch and reviewing them as we go. We are trying to update minian to make it runnable on more recent versions of python, and make use of more recent versions of its dependent packages. This should resolve most outstanding issues with installation!

It is currently WIP.

Contains PRs:

TODO:

wmau and others added 30 commits May 24, 2022 12:13
param dropped the lowercase Parameter-type aliases (param.list,
param.boolean, etc.) some time ago; only the PEP-8 capitalized forms
remain. Triggered an AttributeError when CNMFViewer(...) was
constructed (cell 102 of pipeline.ipynb).
- NetworkX 2.7 renamed `from_numpy_matrix` -> `from_numpy_array` and
  `from_scipy_sparse_matrix` -> `from_scipy_sparse_array`; the old names
  were removed in NetworkX 3.0.
- NumPy 1.20 deprecated the dtype aliases `np.int`, `np.float`, `np.bool`,
  etc.; NumPy 1.24 removed them. Use Python builtins (`int`, `float`)
  instead, which give the same int64/float64 on 64-bit platforms.

Surfaced at cell 51 of pipeline.ipynb (seeds_merge -> label_connected)
on a modern NetworkX/NumPy stack.
… with pd.concat

pandas 2.0 (April 2023) removed `DataFrame.append()` after deprecating it in
1.4. Fold the one-row append into the surrounding pd.concat call by wrapping
the new row's dict in a single-row DataFrame.

Triggered an AttributeError at cell 52 of pipeline.ipynb (initA -> seed
correlation construction).
…m_delay_optimize

Modern xarray passes `flush=True` (and may pass other kwargs in future)
through the dask optimization chain, but custom_arr_optimize's signature
rejects them. The sibling function custom_delay_optimize already takes
`**kwargs` for the same reason; this brings the two to parity.

Triggered `TypeError: custom_arr_optimize() got an unexpected keyword
argument 'flush'` when save_minian(...) called ds.to_zarr() (cell 9 of
pipeline.ipynb).
…gufunc output

`update_spatial_block` is decorated `@darr.as_gufunc(output_dtypes=float)`
but its body always returns a `sparse.COO`. The meta mismatch was tolerated
by older dask but breaks in modern dask, which dispatches by chunk meta:
`darr.block(A_new.tolist())` sees a mix of sparse-meta chunks (the `darr.array(sparse.zeros(...))`
fallback) and dense-meta chunks (from the gufunc), and the sparse backend
rejects the dense ones with `ValueError: All arrays must be instances of
SparseArray`.

Force sparse meta on each `update_spatial_block` result via `map_blocks(sparse.COO, ...)`.
The downstream code already calls `.map_blocks(lambda a: a.todense(), ...)`
on the assembled array (line ~413), confirming the assembled result is
expected to be sparse-backed.

Triggered at cell 61-62 of pipeline.ipynb (update_spatial(...)).
…atrix(coo)

`A_inter` and `A_usum` are pydata/sparse arrays, so the expression
`A_inter / (A_usum - A_inter) > jac_thres` produces a `sparse.COO` of bool.
Passing that to `scipy.sparse.csc_matrix(...)` makes scipy fall back to
`np.asarray(coo)`, which triggers pydata sparse's `__array__`. Pydata
sparse refuses to auto-densify by default (`AUTO_DENSIFY=False`) — a
guardrail against accidental memory blow-ups — so scipy can't construct
the matrix and raises:

  RuntimeError: Cannot convert a sparse array to dense automatically.
  To manually densify, use the todense method.

Use pydata sparse's `.tocsc()` method instead — it returns a scipy
`csc_matrix` directly, no dense intermediate.

Triggered at cell 71 of pipeline.ipynb (update_temporal(...)).
xarray tightened up boolean-indexer handling: using a *dask* boolean array
as an indexer produces an output of unknown shape, which dask/xarray can't
model cleanly, so xarray now refuses outright with:

  KeyError: 'Indexing with a boolean dask array is not allowed.
  This will result in a dask array of unknown shape. Such arrays
  are unsupported by Xarray. Please compute the indexer first
  using .compute()'

`umask` is per-(height, width) here — small enough that .compute() is
essentially free. Add a brief comment explaining the rationale.

Surfaces inside CNMFViewer.update_AC when the callback fires (cell 102+,
during interactive use of the demo pipeline).
…xed children

Modern HoloViews (>= 1.16) enforces type uniformity inside NdLayout. The
two children here are heterogeneous — `hv_pul` is an NdOverlay (the
datashaded simulated-pulse curves) and `hv_A` is an Image (the spatial
footprint). Older HoloViews tolerated mixed types; recent versions raise:

  AssertionError: NdLayout must only contain one type of object,
  not both Image and NdOverlay.

Replace the inner NdLayout with a permissive `Layout` (built via `+`)
and preserve the per-panel labels via `.relabel()`. The `Current Unit`
kdim is dropped because it was only used as a header label, not as a
navigation dimension.

Verified that the other two `hv.NdLayout` call sites in visualization.py
have type-uniform children (AdjointLayouts in VArrayViewer; Overlay in
visualize_spatial_update) and don't need this change.

Surfaced at cell 71 of pipeline.ipynb during rendering of
visualize_temporal_update output.
Commit b2372ce flattened the .opts(group=dict(...)) constructions in
visualization.py, including the `opts_pts = dict(plot=..., style=...)`
build in visualize_seeds. The downstream code one branch below still
expects the old nested structure:

  opts_pts["style"]["color"] = "white"   # KeyError: 'style'

Once opts_pts is flat, the equivalent is a flat assignment:

  opts_pts["color"] = "white"

Triggers when visualize_seeds is called with no mask (the `else:` branch
at line ~1760).
HoloViews 1.16+ removed the .opts(style=..., plot=..., norm=...)
options-group form. The library source was flattened in commit b2372ce,
but four code cells in the canonical demo notebook still use the legacy
form via the spread-dict spelling:

  opts = dict(
      plot=dict(height=..., width=..., colorbar=True),
      style=dict(cmap="Viridis"),
  )
  ...
  .opts(**opts)

Same removed wire format, just spelled indirectly. Flattening produces:

  opts = dict(
      height=..., width=..., colorbar=True,
      cmap="Viridis",
  )

Cells affected: 182, 185, 236, 238 (JSON cell indices).
…ple data

`cnmfviewer.unit_labels` is itself an `xr.DataArray`. Modern xarray rejects
DataArrays as the data half of a `(dims, data)` coord tuple because it's
ambiguous whether to use the array's own coords/indexes or the new tuple's
dims, raising:

  TypeError: Variable 'unit_labels': Using a DataArray object to construct
  a variable is ambiguous, please extract the data using the .data property.

Use `.values` (numpy array) to make the intent explicit. Applied to all
five assign_coords sites in this cell (A, C, S, c0, b0).

Surfaces at cell 110 of pipeline.ipynb.
pymetis has zero Windows wheels on PyPI for any version (confirmed across
cp310-cp314), and its from-source build requires MSVC + the METIS C library
which pip can't get on its own. This blocks `pip install minian` on Windows
entirely, which is a problem for the upcoming PyPI publish.

Make the dependency optional:
- Move `pymetis` to a new `[project.optional-dependencies]` `performance` extra.
- Lazy-import in cnmf.py with an `_HAS_PYMETIS` flag.
- Add a `_partition_graph(G, n_parts)` helper that uses pymetis when
  available and otherwise falls back to NetworkX's
  `greedy_modularity_communities` (pure Python, ships with networkx).
  Slower for very large graphs but produces a valid per-node partition
  label, which is all `graph_optimize_corr` needs — the partition quality
  affects compute chunking, not result correctness.
- Defensive try/except around the NetworkX call to handle the
  `n_parts > number_of_connected_components` edge case.

Install paths now:
- `pip install minian` — works everywhere, uses NetworkX fallback.
- `pip install 'minian[performance]'` — opt-in pymetis, faster on
  large graphs, requires platform with a working pymetis install
  (Linux/macOS via PyPI; Windows via conda-forge).
The cluster cell hard-codes `memory_limit="2GB"` per worker. This was
fine on older dask (lax memory enforcement, would spill to disk and keep
going), but modern dask kills any worker that exceeds the limit. Cells
that force monolithic chunks — e.g. `xr.concat([...], "width")
.chunk({"width": -1})` followed by `.max().compute()` — need a single
chunk of the full width-slab in memory, which exceeds 2GB on the demo
data and KilledWorkers the cluster out after 4 retries.

Two changes:

- Make `memory_limit` env-driven via `MINIAN_MEM_LIMIT`, mirroring the
  existing `MINIAN_NWORKERS` pattern. Docker images, workshop launchers,
  and individual users can override without editing the notebook.

- Bump the default from "2GB" to "4GB". With `n_workers=4` this targets
  a 16GB-RAM laptop, which runs the demo pipeline top-to-bottom on
  modern dask. Smaller machines override the env var down.

Memory is per-worker, not pooled — each dask worker is a separate
process. The doc comment in the parameter cell calls this out so users
configuring smaller machines understand the relationship between
`MINIAN_NWORKERS` and `MINIAN_MEM_LIMIT`.
Plain string "msCam[0-9]+\.avi$" contains `\.`, which Python 3.12+
treats as `SyntaxWarning: invalid escape sequence`. Future Python
versions will upgrade this to SyntaxError. The intent is clearly a
regex literal, so prefix with `r` to make it explicit.

The markdown cell documenting the parameter already uses `r"..."` in
its example, so this brings the actual code into sync with the docs.

(Same change should land in cross-registration.ipynb if it has the
same pattern; verified it doesn't on this branch.)
…tashade()

The modern HoloViews rasterizer doesn't consume `normalization` on
`datashade()`; the call site has been emitting

  WARNING:param.datashade: Parameter(s) [normalization] not consumed by
  any element rasterizer.

since at least HoloViews 1.16. The value passed ("linear") was the
historical default in any case, so removing the kwarg is behavior-
neutral and silences the warning.

If a future call site needs non-linear normalization, the modern API is
`shade(..., cnorm="log"|"eq_hist"|...)` rather than a kwarg on
`datashade`.
…nstead

dask added aggressive memory enforcement in distributed 2022.10. By
default, when a worker exceeds 95% of `memory_limit` it is terminated;
after 4 retries the task is marked failed (`KilledWorker`). This broke
minian's canonical pipeline because operations like update_background
and write_video transiently materialize the residual movie (Y - A·C) at
full size — easily several GB even on the demo dataset — which exceeds
the historical 2GB-per-worker default.

Older dask (pre-2022.10) handled the same situation by spilling chunks
to disk under memory pressure and continuing. The pipeline finished;
it was just slower when memory-constrained. That's the behavior the
canonical demo was originally tested against.

Restore it by disabling the terminate threshold:

  dask.config.set({"distributed.worker.memory.terminate": False})

set just before the LocalCluster is created. Workers still pause and
spill at the existing thresholds, they just don't get killed. The
practical effect: bulk operations (update_background, write_video,
etc.) complete on small per-worker memory budgets at the cost of disk
I/O, exactly as the original pipeline was designed for.

Combined with the env-driven memory_limit knob (previous commit),
users with larger machines can opt in to less spilling by bumping
MINIAN_MEM_LIMIT; users on workshop laptops get the original
pipeline-finishes-no-matter-what behavior.
daharoni and others added 16 commits June 1, 2026 19:39
dep cooldowns in lockfile resolution!
The new lockfile-current check installed PDM unpinned, but exclude-newer
in [tool.pdm.resolution] is only honored on PDM >= 2.26.9 (older versions
silently ignore it), matching the floor already pinned in testandcov.yml.
Also align checkout@v6 and Python 3.14 with the lint job, drop the no-op
pip cache, and add the missing trailing newline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ensure the lockfile is up to date with the current pyproject.toml
Rework update_meta() for the zarr-only API on modernize-minian:
- drop the removed netcdf/backend branching and call open_minian(f_path)
  with its current signature
- write each variable to a temp .zarr store and swap, instead of
  overwriting in place while it is still lazily backed by that store
- preserve dataset-level attrs and correct the meta_dict docstring
  (keys are dim names, values are directory levels)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cover the two behaviors of update_meta: assigning metadata coordinates
from the directory hierarchy without altering the underlying data, and
only touching directories that match the pattern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The os.walk unpacking no longer uses the file-name list, so bind it to
`_` to make the intent explicit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The keys are dimension names and the values are directory levels, not
the reverse, as the code (and the toy_data tests) have always used. Fix
the parameter description and the example to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Strengthen the coverage beyond "the function runs":
- save several variables into one minian dir (as a real dataset does)
  and confirm all of them get the coords and reopen/merge cleanly
- assert update_meta after the fact is identical to saving with
  meta_dict up front (data, coords, and attrs), via assert_identical
- carry a non-empty variable attr through to prove it is preserved

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review feedback: instead of reading each variable and rewriting
its entire zarr store to attach scalar metadata coordinates, append a
coords-only dataset with `to_zarr(mode="a")`. This writes just the small
coordinate arrays and leaves the data chunks in place.

Side benefits: the read-while-overwrite hazard disappears (we never
rewrite existing chunks), and variable attrs are preserved for free
since the data variables are never touched. The multi-var equivalence
test confirms the result is identical to save_minian(meta_dict=...).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sneakers-the-rat
Copy link
Copy Markdown
Contributor Author

sneakers-the-rat commented Jun 3, 2026

So let's set a closing scope on this - what else do we want in a 1.3.0 release that is focused on "the same minian as before, just runnable on more recent python versions" ?

Once we complete those tasks, we can merge this and make a v1.3.0 tag and split that off into a v1 branch in case we want to backport future features (I am not sure how necessary that will be, sorta depends on if we start getting issues requesting it/backwards compatibility issues with a 2.0) and then resume a normal PR workflow releasing >=2.0 that may have breaking changes. That can happen before we get control of the PyPI project, and we can make the PyPI release whenever we do, the tag will still be there.

Trying to prevent us from wandering too far making (good! desirable!) changes in this branch that go beyond the scope of a function-neutral update

@daharoni
Copy link
Copy Markdown
Contributor

daharoni commented Jun 3, 2026

I think we maybe are good to go? Should we merge the pypi PR before merging to master?

daharoni and others added 4 commits June 3, 2026 14:51
Drop the one-time PyPI setup reminders (trusted-publisher registration,
pypi environment creation) and the first-release version-decision notes now
that minian is claimed on PyPI and the publish workflow is wired. Keep the
durable bits: SCM-tag versioning, how to cut a release (cz bump / manual
tag), and the workflow_dispatch dry run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Register minian on PyPI: publishing workflow + metadata
chore: merge master into modernize-minian (resolve README conflict)
@daharoni daharoni marked this pull request as ready for review June 4, 2026 00:26
@daharoni daharoni self-requested a review June 4, 2026 02:11
Copy link
Copy Markdown
Contributor

@daharoni daharoni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are possibly ready to merge to master and set this as v1.3

@sneakers-the-rat
Copy link
Copy Markdown
Contributor Author

Let's go baby

@sneakers-the-rat sneakers-the-rat merged commit e76fd20 into master Jun 4, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment