Modernize minian - packaging and api updates#301
Conversation
param dropped the lowercase Parameter-type aliases (param.list, param.boolean, etc.) some time ago; only the PEP-8 capitalized forms remain. Triggered an AttributeError when CNMFViewer(...) was constructed (cell 102 of pipeline.ipynb).
- NetworkX 2.7 renamed `from_numpy_matrix` -> `from_numpy_array` and `from_scipy_sparse_matrix` -> `from_scipy_sparse_array`; the old names were removed in NetworkX 3.0. - NumPy 1.20 deprecated the dtype aliases `np.int`, `np.float`, `np.bool`, etc.; NumPy 1.24 removed them. Use Python builtins (`int`, `float`) instead, which give the same int64/float64 on 64-bit platforms. Surfaced at cell 51 of pipeline.ipynb (seeds_merge -> label_connected) on a modern NetworkX/NumPy stack.
… with pd.concat pandas 2.0 (April 2023) removed `DataFrame.append()` after deprecating it in 1.4. Fold the one-row append into the surrounding pd.concat call by wrapping the new row's dict in a single-row DataFrame. Triggered an AttributeError at cell 52 of pipeline.ipynb (initA -> seed correlation construction).
…m_delay_optimize Modern xarray passes `flush=True` (and may pass other kwargs in future) through the dask optimization chain, but custom_arr_optimize's signature rejects them. The sibling function custom_delay_optimize already takes `**kwargs` for the same reason; this brings the two to parity. Triggered `TypeError: custom_arr_optimize() got an unexpected keyword argument 'flush'` when save_minian(...) called ds.to_zarr() (cell 9 of pipeline.ipynb).
…gufunc output `update_spatial_block` is decorated `@darr.as_gufunc(output_dtypes=float)` but its body always returns a `sparse.COO`. The meta mismatch was tolerated by older dask but breaks in modern dask, which dispatches by chunk meta: `darr.block(A_new.tolist())` sees a mix of sparse-meta chunks (the `darr.array(sparse.zeros(...))` fallback) and dense-meta chunks (from the gufunc), and the sparse backend rejects the dense ones with `ValueError: All arrays must be instances of SparseArray`. Force sparse meta on each `update_spatial_block` result via `map_blocks(sparse.COO, ...)`. The downstream code already calls `.map_blocks(lambda a: a.todense(), ...)` on the assembled array (line ~413), confirming the assembled result is expected to be sparse-backed. Triggered at cell 61-62 of pipeline.ipynb (update_spatial(...)).
…atrix(coo) `A_inter` and `A_usum` are pydata/sparse arrays, so the expression `A_inter / (A_usum - A_inter) > jac_thres` produces a `sparse.COO` of bool. Passing that to `scipy.sparse.csc_matrix(...)` makes scipy fall back to `np.asarray(coo)`, which triggers pydata sparse's `__array__`. Pydata sparse refuses to auto-densify by default (`AUTO_DENSIFY=False`) — a guardrail against accidental memory blow-ups — so scipy can't construct the matrix and raises: RuntimeError: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method. Use pydata sparse's `.tocsc()` method instead — it returns a scipy `csc_matrix` directly, no dense intermediate. Triggered at cell 71 of pipeline.ipynb (update_temporal(...)).
xarray tightened up boolean-indexer handling: using a *dask* boolean array as an indexer produces an output of unknown shape, which dask/xarray can't model cleanly, so xarray now refuses outright with: KeyError: 'Indexing with a boolean dask array is not allowed. This will result in a dask array of unknown shape. Such arrays are unsupported by Xarray. Please compute the indexer first using .compute()' `umask` is per-(height, width) here — small enough that .compute() is essentially free. Add a brief comment explaining the rationale. Surfaces inside CNMFViewer.update_AC when the callback fires (cell 102+, during interactive use of the demo pipeline).
…xed children Modern HoloViews (>= 1.16) enforces type uniformity inside NdLayout. The two children here are heterogeneous — `hv_pul` is an NdOverlay (the datashaded simulated-pulse curves) and `hv_A` is an Image (the spatial footprint). Older HoloViews tolerated mixed types; recent versions raise: AssertionError: NdLayout must only contain one type of object, not both Image and NdOverlay. Replace the inner NdLayout with a permissive `Layout` (built via `+`) and preserve the per-panel labels via `.relabel()`. The `Current Unit` kdim is dropped because it was only used as a header label, not as a navigation dimension. Verified that the other two `hv.NdLayout` call sites in visualization.py have type-uniform children (AdjointLayouts in VArrayViewer; Overlay in visualize_spatial_update) and don't need this change. Surfaced at cell 71 of pipeline.ipynb during rendering of visualize_temporal_update output.
Commit b2372ce flattened the .opts(group=dict(...)) constructions in visualization.py, including the `opts_pts = dict(plot=..., style=...)` build in visualize_seeds. The downstream code one branch below still expects the old nested structure: opts_pts["style"]["color"] = "white" # KeyError: 'style' Once opts_pts is flat, the equivalent is a flat assignment: opts_pts["color"] = "white" Triggers when visualize_seeds is called with no mask (the `else:` branch at line ~1760).
HoloViews 1.16+ removed the .opts(style=..., plot=..., norm=...) options-group form. The library source was flattened in commit b2372ce, but four code cells in the canonical demo notebook still use the legacy form via the spread-dict spelling: opts = dict( plot=dict(height=..., width=..., colorbar=True), style=dict(cmap="Viridis"), ) ... .opts(**opts) Same removed wire format, just spelled indirectly. Flattening produces: opts = dict( height=..., width=..., colorbar=True, cmap="Viridis", ) Cells affected: 182, 185, 236, 238 (JSON cell indices).
…ple data `cnmfviewer.unit_labels` is itself an `xr.DataArray`. Modern xarray rejects DataArrays as the data half of a `(dims, data)` coord tuple because it's ambiguous whether to use the array's own coords/indexes or the new tuple's dims, raising: TypeError: Variable 'unit_labels': Using a DataArray object to construct a variable is ambiguous, please extract the data using the .data property. Use `.values` (numpy array) to make the intent explicit. Applied to all five assign_coords sites in this cell (A, C, S, c0, b0). Surfaces at cell 110 of pipeline.ipynb.
pymetis has zero Windows wheels on PyPI for any version (confirmed across cp310-cp314), and its from-source build requires MSVC + the METIS C library which pip can't get on its own. This blocks `pip install minian` on Windows entirely, which is a problem for the upcoming PyPI publish. Make the dependency optional: - Move `pymetis` to a new `[project.optional-dependencies]` `performance` extra. - Lazy-import in cnmf.py with an `_HAS_PYMETIS` flag. - Add a `_partition_graph(G, n_parts)` helper that uses pymetis when available and otherwise falls back to NetworkX's `greedy_modularity_communities` (pure Python, ships with networkx). Slower for very large graphs but produces a valid per-node partition label, which is all `graph_optimize_corr` needs — the partition quality affects compute chunking, not result correctness. - Defensive try/except around the NetworkX call to handle the `n_parts > number_of_connected_components` edge case. Install paths now: - `pip install minian` — works everywhere, uses NetworkX fallback. - `pip install 'minian[performance]'` — opt-in pymetis, faster on large graphs, requires platform with a working pymetis install (Linux/macOS via PyPI; Windows via conda-forge).
The cluster cell hard-codes `memory_limit="2GB"` per worker. This was
fine on older dask (lax memory enforcement, would spill to disk and keep
going), but modern dask kills any worker that exceeds the limit. Cells
that force monolithic chunks — e.g. `xr.concat([...], "width")
.chunk({"width": -1})` followed by `.max().compute()` — need a single
chunk of the full width-slab in memory, which exceeds 2GB on the demo
data and KilledWorkers the cluster out after 4 retries.
Two changes:
- Make `memory_limit` env-driven via `MINIAN_MEM_LIMIT`, mirroring the
existing `MINIAN_NWORKERS` pattern. Docker images, workshop launchers,
and individual users can override without editing the notebook.
- Bump the default from "2GB" to "4GB". With `n_workers=4` this targets
a 16GB-RAM laptop, which runs the demo pipeline top-to-bottom on
modern dask. Smaller machines override the env var down.
Memory is per-worker, not pooled — each dask worker is a separate
process. The doc comment in the parameter cell calls this out so users
configuring smaller machines understand the relationship between
`MINIAN_NWORKERS` and `MINIAN_MEM_LIMIT`.
Plain string "msCam[0-9]+\.avi$" contains `\.`, which Python 3.12+ treats as `SyntaxWarning: invalid escape sequence`. Future Python versions will upgrade this to SyntaxError. The intent is clearly a regex literal, so prefix with `r` to make it explicit. The markdown cell documenting the parameter already uses `r"..."` in its example, so this brings the actual code into sync with the docs. (Same change should land in cross-registration.ipynb if it has the same pattern; verified it doesn't on this branch.)
…tashade()
The modern HoloViews rasterizer doesn't consume `normalization` on
`datashade()`; the call site has been emitting
WARNING:param.datashade: Parameter(s) [normalization] not consumed by
any element rasterizer.
since at least HoloViews 1.16. The value passed ("linear") was the
historical default in any case, so removing the kwarg is behavior-
neutral and silences the warning.
If a future call site needs non-linear normalization, the modern API is
`shade(..., cnorm="log"|"eq_hist"|...)` rather than a kwarg on
`datashade`.
…nstead
dask added aggressive memory enforcement in distributed 2022.10. By
default, when a worker exceeds 95% of `memory_limit` it is terminated;
after 4 retries the task is marked failed (`KilledWorker`). This broke
minian's canonical pipeline because operations like update_background
and write_video transiently materialize the residual movie (Y - A·C) at
full size — easily several GB even on the demo dataset — which exceeds
the historical 2GB-per-worker default.
Older dask (pre-2022.10) handled the same situation by spilling chunks
to disk under memory pressure and continuing. The pipeline finished;
it was just slower when memory-constrained. That's the behavior the
canonical demo was originally tested against.
Restore it by disabling the terminate threshold:
dask.config.set({"distributed.worker.memory.terminate": False})
set just before the LocalCluster is created. Workers still pause and
spill at the existing thresholds, they just don't get killed. The
practical effect: bulk operations (update_background, write_video,
etc.) complete on small per-worker memory budgets at the cost of disk
I/O, exactly as the original pipeline was designed for.
Combined with the env-driven memory_limit knob (previous commit),
users with larger machines can opt in to less spilling by bumping
MINIAN_MEM_LIMIT; users on workshop laptops get the original
pipeline-finishes-no-matter-what behavior.
dep cooldowns in lockfile resolution!
The new lockfile-current check installed PDM unpinned, but exclude-newer in [tool.pdm.resolution] is only honored on PDM >= 2.26.9 (older versions silently ignore it), matching the floor already pinned in testandcov.yml. Also align checkout@v6 and Python 3.14 with the lint job, drop the no-op pip cache, and add the missing trailing newline. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ensure the lockfile is up to date with the current pyproject.toml
Automatic weekly update of lockfile
Rework update_meta() for the zarr-only API on modernize-minian: - drop the removed netcdf/backend branching and call open_minian(f_path) with its current signature - write each variable to a temp .zarr store and swap, instead of overwriting in place while it is still lazily backed by that store - preserve dataset-level attrs and correct the meta_dict docstring (keys are dim names, values are directory levels) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cover the two behaviors of update_meta: assigning metadata coordinates from the directory hierarchy without altering the underlying data, and only touching directories that match the pattern. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The os.walk unpacking no longer uses the file-name list, so bind it to `_` to make the intent explicit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The keys are dimension names and the values are directory levels, not the reverse, as the code (and the toy_data tests) have always used. Fix the parameter description and the example to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Strengthen the coverage beyond "the function runs": - save several variables into one minian dir (as a real dataset does) and confirm all of them get the coords and reopen/merge cleanly - assert update_meta after the fact is identical to saving with meta_dict up front (data, coords, and attrs), via assert_identical - carry a non-empty variable attr through to prove it is preserved Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review feedback: instead of reading each variable and rewriting its entire zarr store to attach scalar metadata coordinates, append a coords-only dataset with `to_zarr(mode="a")`. This writes just the small coordinate arrays and leaves the data chunks in place. Side benefits: the read-while-overwrite hazard disappears (we never rewrite existing chunks), and variable attrs are preserved for free since the data variables are never touched. The multi-var equivalence test confirms the result is identical to save_minian(meta_dict=...). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
updated update_meta()
|
So let's set a closing scope on this - what else do we want in a 1.3.0 release that is focused on "the same minian as before, just runnable on more recent python versions" ? Once we complete those tasks, we can merge this and make a v1.3.0 tag and split that off into a Trying to prevent us from wandering too far making (good! desirable!) changes in this branch that go beyond the scope of a function-neutral update |
|
I think we maybe are good to go? Should we merge the pypi PR before merging to master? |
Drop the one-time PyPI setup reminders (trusted-publisher registration, pypi environment creation) and the first-release version-decision notes now that minian is claimed on PyPI and the publish workflow is wired. Keep the durable bits: SCM-tag versioning, how to cut a release (cz bump / manual tag), and the workflow_dispatch dry run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Register minian on PyPI: publishing workflow + metadata
# Conflicts: # README.md
chore: merge master into modernize-minian (resolve README conflict)
daharoni
left a comment
There was a problem hiding this comment.
I think we are possibly ready to merge to master and set this as v1.3
|
Let's go baby |
This is a combined PR, where we are collecting smaller changes in this branch and reviewing them as we go. We are trying to update minian to make it runnable on more recent versions of python, and make use of more recent versions of its dependent packages. This should resolve most outstanding issues with installation!
It is currently WIP.
Contains PRs:
TODO: