
v3.0.0: N² → N-linear memory optimizations + GPU backend + vendor SAM#174

Merged
atarashansky merged 19 commits into main from optimize-v3
Mar 12, 2026
Conversation

@atarashansky
Owner

Summary

Eliminates all N² memory intermediates from the SAMap iteration loop, adds optional GPU support via cupy/FAISS/rapids-singlecell, and vendors SAM (sc-sam) into the package.

Estimated scaling improvement: ~300-500k cells (v2.x) → ~5-10M cells (v3.0 on 256GB RAM + A100).

Breaking changes

  • sc-sam removed as a dependency. SAM is now samap.sam.SAM (vendored subset, ~1300 LOC).
  • _smart_expand now uses BFS instead of matrix powers by default — ~1% different neighbor selection (no self-loop artifact). legacy=True available for exact reproduction.

Added

N² → N-linear memory rewrites

| Module | Before | After | Equivalence |
| --- | --- | --- | --- |
| projection.py | `Xtr = X @ gnnm_corr` (N×G, ~30% dense) | Precomposed G·diag(W/σ)·PCs; σ via quadratic form on precomputed XᵀX | 1e-12 vs sklearn σ |
| coarsening.py | `D = B @ nnm_internal.T` (N², ~0.5% dense) | Stream mutual-NN per species-pair block | 1e-12, 10 tests |
| expand.py | Iterated matrix powers `nnm^(i+1)` | Per-cell BFS with budget cap | Jaccard 0.9898 |
| correlation.py | `Xavg = nnms @ Xs` (N×G_active, 10-60% dense) | Batched gene tiles, auto-selected by memory estimate | 1e-10 |

GPU support (pip install sc-samap[gpu])

  • Backend class dispatches numpy/scipy ↔ cupy/cupyx.scipy.sparse
  • SAMAP(backend="auto"|"cpu"|"cuda") — auto detects CUDA
  • FAISS-GPU brute-force kNN (replaces HNSW on GPU — index rebuilt every iteration, so build cost was pure waste)
  • Randomized SVD with implicit mean centering (~9 SpMM vs hundreds of ARPACK matvecs)
  • rapids-singlecell UMAP/Leiden wrappers with degenerate-clustering fallback

Infrastructure

  • Golden regression test pinning stitched kNN, homology graphs, gene weights (deterministic via hnswlib monkeypatch)
  • Benchmark suite: phase-level legacy-vs-optimized timing + memory profiling
  • docs/performance.md: memory model, tuning, scaling estimates
  • mapping.py split 1856→831 LOC into 5 focused modules

Fixed

Pre-existing bugs discovered and fixed:

  • Dead proj random-walk (result discarded at next line — pure compute waste)
  • thr kwarg misrouted to p-value threshold in FunctionalEnrichment
  • Stale setup.py coexisting with pyproject.toml
  • __version__ mismatch vs pyproject.toml
  • Stray deprecated .A attribute
  • _q helper duplicated 4×
  • Broken samalg.gui.SAMGUI import
  • SAM: obsm["X_processed"] stored every iteration, never read (memory leak)

Benchmark (macOS CPU, 3k cells)

phase         legacy_s      opt_s  speedup  legacy_mb     opt_mb mem_ratio
expand           5.611      1.160     4.8×       62.8       40.3      1.6×
projection      11.791      6.330     1.9×     1179.9      573.3      2.1×
correlation      0.366      1.332    0.27×      307.4       71.1      4.3×

(correlation speedup < 1×: streaming trades time for memory at this small scale; batch_size='auto' selects the materialised path here)

End-to-end golden regression: 116s → 89s (−23%) from precompute reuse alone.

Not in this PR

  • _corr_kernel CUDA kernel (numba is dict-free and structurally portable; porting notes in module)
  • GPU path untested on real hardware (all behind @pytest.mark.skipif)
  • Tuning params not all plumbed to public API

See CHANGELOG.md and docs/performance.md for full details.

- Delete dead random-walk computation in mapping.py::_mapper (result
  written to lil matrix then immediately discarded by reassignment).
  Preserve the omp.data[:]=1 binarization side effect.
- Delete mdata['xsim'] store — never read anywhere
- Fix thr→align_thr kwarg misroute in enrichment.py (was falling
  through to unrelated p-value threshold)
- Fix deprecated .A attr → np.asarray() in mapping.py
- Delete stale root setup.py (pyproject.toml is authoritative)
- Make __version__ dynamic via importlib.metadata with fallback
- Consolidate duplicated _q helper into utils.q(), import as _q
  in the 4 callers (mapping, scores, gene_pairs, plotting)
- Delete broken SAMGUI import and gui() method (module doesn't exist)
Vendored from sc-sam v2.0.2 (samalg module). Only the methods SAMap
actually uses, split into clean submodules:

  sam/
    __init__.py  - exports SAM class
    core.py      - SAM class: __init__, load_data, preprocess_data,
                   run, calculate_nnm, dispersion_ranking_NN,
                   leiden_clustering, get_labels, scatter, save_anndata
    pca.py       - _pca_with_sparse (LinearOperator implicit-center),
                   weighted_PCA
    knn.py       - calc_nnm, nearest_neighbors_hnsw, gen_sparse_knn
    utils.py     - convert_annotations, extract_annotation

Dropped (not used by SAMap):
  - run_umap, run_tsne, run_diff_map, run_diff_umap
  - kmeans_clustering, hdbscan_clustering, louvain_clustering
  - identify_marker_genes_*
  - dill-based save/load, pickle (.p) loading

Fixes applied during vendor:
  - Deleted obsm['X_processed'] = D_sub write (n_cells x n_genes stored
    every iteration, never read back)
  - Replaced 3x .tolil() + .setdiag() + .tocsr() cycles with direct CSR
    setdiag (no format round-trip, diagonal already in sparsity structure
    since hnswlib returns self as first neighbor)
  - Dropped numba import + NumbaWarning filter (SAM has no @jit functions)
  - Rewrote gen_sparse_knn: direct COO->CSR construction instead of
    lil_matrix fancy-index scatter (~3.5x faster)

Verified: numerically equivalent to original samalg on planarian.h5ad
(weight max abs diff ~6e-8, top genes identical).

Existing SAMap imports not yet updated; that follows after golden test.
Captures current behavior of the full pipeline (hydra/planarian/schistosome)
in a compressed .npz fixture. Pins:
  - Stitched cross-species kNN graph (obsp['connectivities'])
  - Refined gene-homology graph (sm.gnnm_refined)
  - Original BLAST homology graph (sm.gnnm)
  - Per-species SAM gene weights

Determinism: hnswlib's multi-threaded add_items is the only source of
run-to-run variation in the core algorithm. The test monkeypatches
samap.core.mapping.hnswlib with a single-threaded wrapper (random_seed
pinned, num_threads=1). Verified reproducible across 3 consecutive runs.

Regenerate: pytest tests/regression/ -m slow --regenerate-golden
Run:        pytest tests/regression/ -m slow
Completes Phase 0.3 by wiring the vendored SAM into SAMap itself.

Import changes:
  - src/samap/core/mapping.py: samalg.SAM -> samap.sam.SAM;
    samalg.utilities.convert_annotations -> samap.sam.utils
  - src/samap/analysis/scores.py: samalg.SAM -> samap.sam.SAM
  - src/samap/analysis/gene_pairs.py: samalg.SAM -> samap.sam.SAM (TYPE_CHECKING)
  - src/samap/utils.py: samalg.SAM -> samap.sam.SAM (TYPE_CHECKING);
    samalg.utilities.extract_annotation -> samap.sam.utils
  - tests/integration: samalg.SAM -> samap.sam.SAM
  - tests/regression (golden): samalg.SAM -> samap.sam.SAM;
    updated hnswlib-patch comment (vendored SAM's hnswlib is in
    samap.sam.knn but not reached by golden pipeline — input SAMs
    are pre-computed)
  - tests/unit/test_utils.py: dropped importorskip('samalg') guard
    (to_vo now uses vendored extract_annotation unconditionally)

Deps:
  - Removed sc-sam>=2.0.0 from pyproject.toml dependencies
  - All transitive deps (hnswlib, leidenalg, anndata via scanpy,
    sklearn via scanpy, umap via scanpy) already listed or provided

Verified:
  - Golden regression test passes (1 passed in 68s)
  - All 27 unit tests pass
  - ruff check src/ tests/ clean
  - rg samalg src/ tests/ -> only docstring provenance notes remain
…pyx (Phase 2.1)

- Backend class: device='cpu'|'cuda'|'auto' with xp/sp/spla namespace
  access and gpu flag
- Compat shims for scipy/cupy divergence: nonzero (cupy sparse has no
  .nonzero()), sparse_from_coo, setdiag (no LIL round-trip), svds (strips
  solver=/v0=/random_state= on GPU), LinearOperator
- Data movement: to_device/to_host handle dense + CSR/CSC/COO,
  asfortran_if_gpu pre-converts for cuSPARSE SpMM, free_pool flushes
  cupy memory pools
- COOBuilder: host-side triplet accumulator that finalizes to CSR/CSC on
  the active backend — replaces lil_matrix fancy-assign pattern (no LIL
  on cupy)
- cupy is optional: module imports cleanly without it, HAS_CUPY flag
  exported, Backend('cuda') raises clear error if unavailable
- 33 CPU unit tests + 18 GPU tests behind skipif guard
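A minimal sketch of the dispatch idea (constructor internals are assumed — the real Backend also carries spla, the compat shims, and the data-movement helpers listed above):

```python
import numpy as np
import scipy.sparse

class Backend:
    """Dispatch numpy/scipy vs cupy/cupyx.scipy.sparse by device (sketch)."""
    def __init__(self, device="auto"):
        self.gpu = False
        if device in ("auto", "cuda"):
            try:
                import cupy  # optional: module must import cleanly without it
                # (the real 'auto' also checks that a CUDA device is present)
                self.gpu = True
            except ImportError:
                if device == "cuda":
                    raise RuntimeError("device='cuda' requested but cupy is unavailable")
        if self.gpu:
            import cupy
            import cupyx.scipy.sparse
            self.xp, self.sp = cupy, cupyx.scipy.sparse
        else:
            self.xp, self.sp = np, scipy.sparse

# same call sites work on either device via bk.xp / bk.sp
bk = Backend("cpu")
row_sums = np.asarray(bk.sp.eye(3).sum(axis=1)).ravel()
print(row_sums)  # → [1. 1. 1.]
```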
Pure refactor — extract ~20 private module-level functions from the
1856-line mapping.py into five focused modules. No algorithm changes.

  mapping.py     1856 → 831 lines  (SAMAP class, _Samap_Iter, _avg_as)
  homology.py    new · 272 lines  (BLAST graph build/coarsen/filter, _tanh_scale)
  correlation.py new · 320 lines  (numba refinement kernels, _refine_corr)
  projection.py  new · 243 lines  (hnswlib _united_proj, _mapping_window, prepare_SAMap_loadings)
  coarsening.py  new · 265 lines  (_mapper, _concatenate_sam, coclustering)
  expand.py      new ·  61 lines  (_smart_expand, _sparse_knn_ks)

Cross-module deps:
  - _tanh_scale lives in homology.py; imported by projection + coarsening
  - coarsening.py imports from correlation, expand, homology, projection
  - mapping.py re-imports public-ish names it needs for SAMAP.__init__
    and _Samap_Iter.run (unchanged signatures)

Dropped from mapping.py imports (no longer needed there):
  hnswlib, numba.{njit,prange}, Numba*Warning, StandardScaler,
  df_to_dict, to_vn, sparse_knn, warnings.filterwarnings(Numba...)
  — numba warning filters now live in correlation.py alongside the jitted kernels.

test_golden_output.py: update hnswlib monkeypatch target from
  samap.core.mapping → samap.core.projection (the _united_proj
  call site moved). Golden fixture data unchanged.

Verified:
  pytest tests/regression/ -m slow    PASS (bit-identical)
  pytest tests/unit/ tests/integration/  66 passed, 18 skipped
  ruff check src/                     clean
The original _smart_expand materializes nnm^i sparse matrix powers to
build hop-i rings, then trims each ring to a per-cell budget. This
densifies geometrically at scale.

New _smart_expand_bfs does a per-cell budget-capped BFS over the CSR
adjacency arrays directly. Numba-parallel over cells, O(budget * k)
working set per cell, never materializes matrix powers.
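A single-cell, pure-Python sketch of that budget-capped BFS (the shipped kernel is numba-parallel over all cells; names here are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

def bfs_expand(indptr, indices, src, n_hops, budget):
    """Collect up to `budget` neighbours within n_hops of src, excluding src."""
    visited = {src}          # self excluded by construction — no self-loop slot wasted
    frontier = [src]
    out = []
    for _ in range(n_hops):
        nxt = []
        for u in frontier:
            for v in indices[indptr[u]:indptr[u + 1]]:  # CSR row of u
                if v not in visited:
                    visited.add(v)
                    nxt.append(v)
                    out.append(int(v))
                    if len(out) >= budget:              # per-cell budget cap
                        return out
        frontier = nxt
    return out

# 4-cycle graph: 0-1, 1-2, 2-3, 3-0
A = csr_matrix(np.array([[0, 1, 0, 1],
                         [1, 0, 1, 0],
                         [0, 1, 0, 1],
                         [1, 0, 1, 0]]))
print(bfs_expand(A.indptr, A.indices, src=0, n_hops=2, budget=10))  # → [1, 3, 2]
```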

Equivalence: exact match when budget >= reachable-within-NH-hops; under
truncation, Jaccard ~0.99 on synthetic data (n=500, k=20, NH=3). The
residual divergence is from in-ring ranking (path-sum vs max-edge weight)
and matpow's spurious self-loops — nnm^2 has nonzero diagonal that
survives ring subtraction, so matpow wastes ~1 budget slot/cell on self.
BFS excludes self by construction and collects ~1 more useful neighbour.

Exposed as _smart_expand(..., legacy: bool). Default legacy=True keeps
the matpow path so golden regression passes byte-identically. Flip the
default once BFS is validated end-to-end.

Tests cover: exact-match regime, truncation regime (Jaccard > 0.9),
structural invariants (budget, self-loop, binarized output), edge cases
(zero budget, disconnected, empty graph), dispatch routing.
…ase 2.2/2.3)

SAMAP.__init__ now accepts backend='auto'|'cpu'|'cuda', creates a
Backend instance, and passes it down to _Samap_Iter. The iterator
stores self._bk for use by hot-path functions once those rewrites
land (call sites intentionally not changed yet — other agents own
projection/coarsening/expand/correlation).

LIL removal for GPU compat:

Added utils.coo_to_csr_overwrite() — builds CSR from COO triplets
with last-write-wins semantics (matches LIL fancy-index assignment,
unlike standard COO sum-duplicates). Required for exact golden match
since BLAST output and cross-species pairs can have duplicate
(row, col) entries.
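The semantics can be illustrated with a small stand-in (the helper name is from this commit; this implementation is a simplified sketch, not the shipped one):

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

def coo_to_csr_overwrite(rows, cols, vals, shape):
    """CSR from COO triplets with last-write-wins on duplicate (row, col)."""
    last = {}
    for i, rc in enumerate(zip(rows, cols)):
        last[rc] = i                                   # later triplet wins
    keep = np.fromiter(last.values(), dtype=np.int64)
    r = np.asarray(rows)[keep]
    c = np.asarray(cols)[keep]
    v = np.asarray(vals)[keep]
    return csr_matrix((v, (r, c)), shape=shape)

# duplicate (0, 1): standard COO sums to 7.0; LIL-style overwrite keeps 5.0
m = coo_to_csr_overwrite([0, 0], [1, 1], [2.0, 5.0], (2, 2))
summed = coo_matrix(([2.0, 5.0], ([0, 0], [1, 1])), shape=(2, 2)).tocsr()
print(m[0, 1], summed[0, 1])  # → 5.0 7.0
```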

- homology.py: _calculate_blast_graph gnnm1/gnnm2 and _filter_gnnm
  symmetrization — use coo_to_csr_overwrite
- scores.py: convert_eggnog_to_homologs (plain COO sufficient —
  only nonzero pattern matters downstream), CellTypeTriangles,
  GeneTriangles — use coo_to_csr_overwrite

Deliberate deviation from task spec: COOBuilder's finalize() uses
sum-duplicates semantics, which changes numerical output when
duplicates exist. The dedup helper approach preserves exact output
(golden passes) and achieves the same GPU-compat goal (no LIL).

Verification:
- pytest tests/regression/ -m slow: PASS (golden match)
- pytest tests/unit/: 71 passed, 18 skipped (GPU)
- ruff check (my files): clean
Eliminates the N² dense intermediate D = B @ nnm_internal.T by computing
per-species-pair blocks D[a,b] = B[a,b] @ nnm_b.T and mutualising each
pair independently. B is block-off-diagonal and nnm_internal is
block-diagonal, so the per-pair decomposition is exact.

Peak memory drops from O(N²) to O(chunk × N_b) per species-pair.

- _compute_mutual_graph: core streaming helper; chunks over source-species
  rows, computes left = B_ab[chunk] @ nnm_b.T and right = nnm_a[chunk] @
  B_ba.T, mutualises sqrt(left ⊙ right), accumulates into COOBuilder
- _scale_by_corr: per-chunk correlation rescaling (row-maxes are exact
  since each chunk holds all column-blocks for its rows)
- neigh_from_keys (coclustering) path: keeps factored M @ M.T form, uses
  a precomputed M.T @ B_ba.T per partner (n_clusters × N_b, tiny) reused
  across chunks
- Preserves the original asymmetry: 0.1 threshold only when no species
  uses coclustering

_mapper signature unchanged. Numerically bit-identical to the old path —
golden regression passes.
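A toy check of why the per-pair decomposition is exact (shapes and names illustrative): with B block-off-diagonal and nnm_internal block-diagonal, each block of the full product reduces to exactly one species-pair product.

```python
import numpy as np
from scipy.sparse import random as sparse_random, block_diag, bmat

na, nb = 60, 40
B_ab = sparse_random(na, nb, density=0.1, random_state=0, format="csr")
B_ba = sparse_random(nb, na, density=0.1, random_state=1, format="csr")
nnm_a = sparse_random(na, na, density=0.1, random_state=2, format="csr")
nnm_b = sparse_random(nb, nb, density=0.1, random_state=3, format="csr")

# full matrices: B block-off-diagonal, nnm_internal block-diagonal
B = bmat([[None, B_ab], [B_ba, None]], format="csr")
nnm = block_diag([nnm_a, nnm_b], format="csr")

D_full = (B @ nnm.T).toarray()            # the N² intermediate being eliminated
D_ab = (B_ab @ nnm_b.T).toarray()         # one streamed species-pair block
print(np.allclose(D_full[:na, na:], D_ab))  # → True
```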

Also: add RUF002/003/SIM108 to ruff ignores (math notation in docstrings,
mathematical if/else clearer than ternary).
… intermediate (Phase 3.1)

Precompose the projection G · diag(W/σ) · PCs so the per-iteration work is
a single SpMM (ss_i @ P) producing an N × npcs result directly, instead of
materialising the ~30%-dense N × G translated feature matrix Xtr.

Algebra
-------
Old path per species pair (i → s):
  Xtr_is  = ss_i @ G_is                    -- N_i × G_s, ~30% dense  ← ELIMINATED
  Xscaled = Xtr_is / σ_is  (StandardScaler with_mean=False)
  wpca    = (Xscaled * W_s) @ PCs_s

New path:
  σ_is²   = diag(G_isᵀ · XtX_i · G_is) / n_i − (μ_i · G_is)²   [quadratic form]
          where XtX_i = ss_iᵀ @ ss_i and μ_i = ss_i.mean(0) are
          precomputed once in _projection_precompute (iteration-invariant)
  P_is    = G_is · diag(W_s / σ_is) · PCs_s                    -- G_i × npcs, ~few MB
  wpca    = ss_i @ P_is                                        -- ONE SpMM

The diag of the quadratic form is extracted as ((XtX @ G) ⊙ G).sum(0) —
one SpGEMM + one elementwise product, never forming G_s × G_s.
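A numeric sanity check of the σ identity on random data, comparing against the per-column biased variance of the materialised Xtr (which is what StandardScaler(with_mean=False) computes; the real _compute_sigma additionally maps zero-variance columns to 1.0):

```python
import numpy as np
from scipy.sparse import random as sparse_random

X = sparse_random(200, 50, density=0.3, random_state=0, format="csr")  # ss_i
G = sparse_random(50, 30, density=0.2, random_state=1, format="csr")   # G_is

# Reference: materialise Xtr and take its per-column biased variance (ddof=0)
Xtr = (X @ G).toarray()
var_ref = Xtr.var(axis=0, ddof=0)

# Quadratic form: diag(Gᵀ·XtX·G)/n − (μ·G)², never forming Xtr
XtX = (X.T @ X).toarray()
mu = np.asarray(X.mean(axis=0)).ravel()
Gd = G.toarray()
# diag(Gᵀ A G)_j = Σ_i (A G)_ij · G_ij — one product + one elementwise sum
var = np.einsum("ij,ij->j", XtX @ Gd, Gd) / X.shape[0] - (mu @ Gd) ** 2
print(np.allclose(var, var_ref, rtol=1e-8, atol=1e-12))  # → True
```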

New public surface
------------------
  _projection_precompute(sams, gns, bk)  — build iteration-invariant state:
    ss[sid], XtX[sid], mu_ss[sid], wpca_own[sid], M_own[sid], indexers.
    Own-species projection ss_i @ PCs_i is fully iteration-invariant;
    cached here, referenced (not recomputed) per iteration.
  _compute_sigma(XtX, mu, G, n, bk)      — σ via quadratic form.
    Matches sklearn StandardScaler(with_mean=False).scale_ exactly:
    biased variance (ddof=0), zero-variance columns → 1.0.
  _mapping_window_fast(gnnm, precompute, K, pairwise)  — per-iteration worker.

Backend-agnostic via samap.core._backend.Backend (xp/sp dispatch);
defaults to CPU. hnswlib kNN step stays on host (no GPU hnswlib).

Backward compat
---------------
_mapping_window(sams, gnnm, gns, K, pairwise) keeps its old signature as a
wrapper: builds precompute on-the-fly, delegates to _mapping_window_fast.
coarsening._mapper still calls _mapping_window unchanged — the
orchestration layer will swap in precompute caching in a later phase.
gnnm=None path (own-species bootstrap) unchanged.

Verification
------------
  tests/unit/test_projection.py  -- 11 tests:
    · σ quadratic-form vs sklearn at rtol=1e-12 (4 seeds)
    · σ zero-variance → 1.0 edge case
    · wpca equivalence: 2- and 3-species × {pairwise, all-to-all} at rtol=1e-6
      (3-species all-to-all is non-trivial: global ≠ per-pair normalisation)
    · precompute is iteration-invariant (same dict, two different gnnms)
    · wrapper ≡ fast path
  pytest tests/regression/ -m slow       PASS (golden bit-identical)
  pytest tests/unit/ tests/integration/  98 passed, 18 skipped
  ruff check projection.py test_projection.py  clean
… (Phase 3 finalization)

The precomposed feature-translation path from Phase 3.1 now actually pays
off: _projection_precompute runs ONCE in _Samap_Iter.__init__, and the
cached state (ss, XtX, mu_ss, wpca_own) is reused across all NUMITERS
iterations via _mapping_window_fast. Previously the backward-compat
wrapper rebuilt the precompute on every iteration.

Observed: golden regression test dropped from ~116s to ~89s (3-species,
3 iterations) — ~23% wall-clock reduction on this workload.

mapping.py
----------
  _Samap_Iter.__init__  + build self._gns = concat(gns_dict.values())
                        + build self._proj_cache = _projection_precompute(...)
  _Samap_Iter.run()     - gns rebuilt every call → reuse self._gns
                        + pass proj_cache=self._proj_cache, bk=self._bk
                          to _mapper
  imports               + _projection_precompute from .projection

coarsening.py
-------------
  _mapper               + proj_cache: dict | None = None
                        + bk: Backend | None = None
                        · if proj_cache is provided, route to
                          _mapping_window_fast instead of the wrapper
                        · thread bk → _compute_mutual_graph, _smart_expand
  _compute_mutual_graph + bk: Backend | None = None
                        · replaces hardcoded Backend("cpu") — now honours
                          upstream backend choice
  imports               + _mapping_window_fast from .projection

expand.py
---------
  _smart_expand         + bk kwarg (unused; threaded for Phase 4 GPU)
                        + TODO: flip legacy default → BFS after golden regen

The _mapping_window wrapper stays as a working deprecated shim — still
referenced by test_projection.py parity test and the proj_cache=None
fallback in _mapper.

Verified
--------
  pytest tests/regression/ -m slow   PASS (bit-identical)
  pytest tests/unit/                 92 passed, 18 skipped
  ruff check mapping/coarsening/expand/projection.py  clean
Two independent changes landing together:

1. **Dict-free numba kernel** (GPU-compat prep)
   - _refine_corr_kernel → _corr_kernel: species lookup now uses integer
     sp_starts/sp_lens arrays instead of a Python dict inside the hot loop
   - String species IDs → integer IDs (indexed by np.unique position)
   - np.append → preallocated np.empty + sliced assignment (cleaner for CUDA)
   - corr_mode string comparison moved out of kernel (pearson: bool arg)
   - Numerically bit-identical to original (same Pearson formula, same Xi)
   - Contiguity sanity check in _refine_corr_parallel guards the assumption

2. **Batched Xavg computation** (memory win)
   - New _compute_pair_corrs dispatches materialised (batch_size=None) vs
     streaming (batch_size=int)
   - Streaming: process pairs in batches; per batch compute Xavg only for
     the ≤ 2·batch_size genes actually referenced (nnms @ Xs[:, needed]),
     correlate, discard
   - Peak memory: O(N × G_active) → O(N × 2·batch_size)
   - Uses searchsorted on np.unique output to map global→local gene cols
   - Some recomputation when a gene spans batches; cheap (single-column SpMV)
   - Default batch_size=None → materialised path → golden passes unchanged

_refine_corr and _refine_corr_parallel both gain batch_size kwarg
(default None, backward-compatible; mapping.py unchanged).

14 equivalence tests: kernel vs pure-numpy reference (Pearson+Xi),
streaming vs materialised across batch_size ∈ {1,7,32,50,64,200,256,500,10k},
2-species and 3-species (shuffled pairs), empty-pairs edge case.
All match to rtol=1e-10.
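A toy sketch of the streaming idea in (2): per batch, smooth only the gene columns that batch's pairs actually reference, rather than all of Xavg at once (names and shapes illustrative; the real path dispatches via _compute_pair_corrs):

```python
import numpy as np
from scipy.sparse import random as sparse_random, identity

n, g = 300, 40
Xs = sparse_random(n, g, density=0.2, random_state=0, format="csc")
nnms = identity(n, format="csr")               # stand-in for the kNN smoother
pairs = np.array([[0, 5], [1, 7], [2, 5]])     # (gene_a, gene_b) index pairs

def pair_corrs_streaming(pairs, batch_size):
    out = np.empty(len(pairs))
    for s in range(0, len(pairs), batch_size):
        batch = pairs[s:s + batch_size]
        needed = np.unique(batch)               # ≤ 2·batch_size gene columns
        Xavg = (nnms @ Xs[:, needed]).toarray() # smooth only those columns
        loc = np.searchsorted(needed, batch)    # map global → local gene cols
        for i, (a, b) in enumerate(loc):
            out[s + i] = np.corrcoef(Xavg[:, a], Xavg[:, b])[0, 1]
    return out

full = pair_corrs_streaming(pairs, batch_size=len(pairs))   # "materialised"
stream = pair_corrs_streaming(pairs, batch_size=1)          # streaming
print(np.allclose(full, stream))  # → True
```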
Adds randomized_svd_implicit_center() and wires it into _pca_with_sparse
via svd_solver='randomized'. ARPACK remains the default — no behaviour
change for existing callers.

Algorithm: Halko-Martinsson-Tropp randomized range-finding adapted for
implicit mean-centering. Every (X - 1·μ) @ M is computed as
X @ M - 1·(μ @ M) — one SpMM plus a rank-1 broadcast correction, never
materialising the dense centered matrix. With 4 power iterations this
replaces ARPACK's hundreds of serial matvecs with ~9 block SpMM passes.
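The centering identity itself is easy to verify in isolation — one SpMM plus a rank-1 broadcast correction, never densifying the centered matrix:

```python
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)
X = sparse_random(500, 80, density=0.1, random_state=0, format="csr")
mu = np.asarray(X.mean(axis=0)).ravel()    # column means μ, shape (80,)
M = rng.standard_normal((80, 10))          # one block of the random sketch

implicit = X @ M - mu @ M                  # SpMM + rank-1 broadcast correction
explicit = (X.toarray() - mu) @ M          # densified centered matrix (avoided)
print(np.allclose(implicit, explicit))     # → True
```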

GPU-friendly via Backend dispatch: uses bk.xp/bk.sp throughout, calls
bk.asfortran_if_gpu before each SpMM (cuSPARSE prefers F-order RHS).
Random sketch is seeded for determinism; svd_flip applied for sign
consistency with ARPACK path.

Tests (tests/unit/test_pca.py, 14 pass + 2 GPU-skipped):
  - ARPACK vs randomized: explained variance <1% rel err on top-10,
    subspace angles <0.1 rad within signal rank, reconstruction <5%
  - Both vs sklearn dense PCA: variance and subspace agreement
  - API: arpack-is-default, mu_axis=1 rejected, seeded determinism,
    n_power monotonicity
  - GPU tests behind HAS_CUPY skipif

Verified: unit tests pass; golden regression passes (arpack default
untouched); ruff clean.
- New src/samap/core/knn.py: approximate_knn() dispatches between CPU
  HNSW (hnswlib) and GPU brute-force (FAISS GpuIndexFlatIP) based on
  Backend.gpu. faiss/faiss-gpu are optional; graceful fallback to hnswlib
  with warning if GPU backend has no faiss-gpu.
- _hnswlib_knn: extracted from projection._united_proj with identical
  params (ef=200, M=48, space='cosine'). num_threads exposed for
  golden-test determinism.
- _faiss_gpu_knn: L2-normalise on GPU, GpuIndexFlatIP (exact), convert
  IP -> cosine distance. Results returned on host. TODO note for
  GpuIndexIVFFlat at >1M points.
- Backend: add faiss_gpu_resources() lazy cache (StandardGpuResources
  is expensive to create, safe to reuse).
- projection._united_proj: now calls approximate_knn; bk parameter
  threaded through from _mapping_window_fast. Also replaced LIL
  fancy-index with COO construction (equivalent: HNSW returns distinct
  neighbours so no duplicates).
- Golden test monkeypatch retargeted: projection.hnswlib -> knn.hnswlib.
- Tests: 9 CPU pass (recall vs brute-force >95%, output format, single-
  thread determinism), 6 GPU-gated (FAISS exact vs brute-force, cupy
  input, resource cache).
- Golden regression: passes bit-exact.
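The IP→cosine conversion in _faiss_gpu_knn rests on the identity that, for L2-normalised vectors, cosine distance equals 1 minus the inner product — a quick numpy check:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(16), rng.standard_normal(16)
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)

# cosine distance from raw vectors vs 1 − IP of the normalised pair
cos_dist = 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(1 - an @ bn, cos_dist))  # → True
```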
…4.3)

Three pieces:

1. _replace_vectorized(X, xi, yi, bk, batch_size=None)
   - Backend-agnostic per-pair Pearson over dense rows: gather 2·batch rows,
     centre, dot, normalise — one reduction kernel on GPU via bk.xp
   - Upcasts to float64 internally (wPCA often stored float32)
   - Optional batch_size caps working set at 2·batch_size·d floats; default
     None processes all pairs in one shot
   - Zero-variance rows → NaN (matches _replace; callers already handle)
   - np.errstate(invalid='ignore') silences the expected 0/0 warning

2. replace_corr(X, xi, yi, bk=None, batch_size=None) dispatcher
   - bk=None or not bk.gpu → numba _replace (CPU fast path: fused loop,
     no (n_pairs × d) allocation)
   - bk.gpu → _replace_vectorized (one large kernel launch)
   - Drop-in for callers that have a Backend; existing direct _replace
     callers unchanged

3. CUDA-porting notes for _corr_kernel (comment block)
   - Blockers enumerated: dynamic alloc (np.zeros/empty inside kernel),
     fancy-index scatter, .mean/.std method calls, _xicorr's argsort
   - Prescribed fix: don't densify — iterate CSC nonzeros directly,
     test each index against the two species ranges, accumulate Pearson
     sums in scalar registers (two-pass: mean then centred products)
   - Result: O(1) per-thread state, no shared memory needed
   - Xi correlation has no @cuda.jit equivalent; gate GPU kernel to
     pearson=True, fall back to CPU for Xi

12 new tests (26 total in test_correlation.py):
- _replace_vectorized vs np.corrcoef and vs numba _replace (rtol=1e-12)
- Batched vs full across batch_size ∈ {1,7,100,500,2000} — bit-identical
- float32 input upcast; zero-variance NaN equivalence with numba
- Dispatcher: CPU backend → numba (bit-identical), bk=None → numba,
  mock GPU backend → vectorised path
Three independent benchmarks, one per Phase-3 optimization target:

  expand      — matpow vs BFS _smart_expand (legacy= toggle)
  projection  — per-iter _mapping_window rebuild vs precompute-once +
                _mapping_window_fast (proj_cache toggle)
  correlation — materialised Xavg vs streaming _compute_pair_corrs
                (batch_size toggle)

Phase-level (direct function calls) because the toggles aren't plumbed
through _mapper yet — gives clean per-optimization attribution.

bench_samap.py: parameterised scale sweep (1k/3k/10k/30k cells),
tracemalloc peak memory, CSV output + summary table. plot_bench.py:
log-log scaling curves (time + memory, per phase).

Verified on macOS CPU, --max-cells 3000:

  expand:     3.5x→4.8x speedup (grows with n — matpow densification)
  projection: ~2x speedup, ~2x memory (precompute reuse)
  correlation: 4.3x memory win; streaming overhead at small n, breaks
               even when materialised Xavg approaches RAM limits
…e 5 finalization)

All the optimized paths landed in Phases 3-4 now default-on. Golden
regenerated to reflect BFS neighbourhood expansion (slightly different
marginal neighbours — ~1% edge difference, no self-loop artefact).

Default flips
-------------
  _smart_expand:        legacy=True  → legacy=False  (BFS)
  _refine_corr:         batch_size=None → batch_size=512
  _refine_corr_parallel: batch_size=None → batch_size=512

BFS is a pure win (~5× at 3k cells, memory-bounded). Streaming correlation
is a memory-for-speed trade: ~4× less peak memory, ~3-5× slower on toy
data (<10k cells) where the full Xavg would fit anyway. Tuned for
million-cell scale where Xavg would OOM. Pass batch_size=None explicitly
for the fast materialised path on small data.

Golden regenerated
------------------
  tests/regression/fixtures/golden_3species.npz — regenerated with BFS +
  batch_size=512 defaults. BFS changes the neighbour selection; batched
  correlation is bit-identical to materialised.

New files
---------
  docs/performance.md — memory model, backend selection, tuning, scaling
    estimates, benchmark results
  CHANGELOG.md        — 3.0.0 section: breaking (sc-sam vendored, BFS
    default), added (GPU backend, N²→N-linear rewrites, randomized SVD,
    benchmark suite), fixed (Phase 0.2 bugs), changed (module split,
    defaults)

Version bump
------------
  pyproject.toml: 2.0.2 → 3.0.0

Verified
--------
  pytest tests/regression/ -m slow        PASS (new golden)
  pytest tests/unit/                      146 passed, 26 skipped
  pytest tests/integration/               6 passed (~160s — streaming
                                          overhead at toy scale)
  ruff check src/                         clean
  benchmarks/bench_samap.py --max-cells 1000  ran cleanly; summary matches docs
…emory

Changes the default batch_size from 512 to 'auto'. The streaming path is
necessary at million-cell scale but 3-5× slower on toy data (~3k cells)
where the full Xavg fits in memory — auto-selection picks the right path.

- _resolve_batch_size(batch_size, nnms, Xs, mem_threshold_gb=2.0):
  estimates materialised Xavg size as n_cells × n_genes × out_density × 12
  bytes (CSC data+indices), where out_density ≈ 1 - (1-expr_density)^k
  (kNN-smoothed sparse expression fills in; output entry is zero iff all
  k contributing inputs are zero). Under threshold → None (materialise);
  over → 512 (stream). Explicit batch_size (int or None) passes through
  unchanged. Decision logged at INFO.

- _refine_corr / _refine_corr_parallel: batch_size default 512 → 'auto';
  new correlation_mem_threshold param (default 2.0 GB). Resolution happens
  in _refine_corr_parallel right after nnms/Xs are built (shapes known).

- mapping.py: correlation_mem_threshold exposed on both
  SAMAP.refine_homology_graph and _Samap_Iter.refine_homology_graph.

- docs/performance.md: documents auto-selection and the tuning knob.

5 new tests (31 total): explicit passthrough (None/32/9999 untouched);
toy 500-cell data → materialise; mock million-cell data → stream (uses
duck-typed .shape/.nnz mock, no real alloc); threshold boundary flip;
zero-sized degenerate inputs → materialise.
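The decision rule can be restated as a small sketch (function shape assumed from the description above; the real helper takes nnms/Xs and reads shapes and nnz rather than explicit density/k arguments):

```python
def estimate_xavg_gb(n_cells, n_genes, expr_density, k):
    """Materialised Xavg estimate: n·g·out_density·12 bytes (CSC data+indices)."""
    out_density = 1.0 - (1.0 - expr_density) ** k   # zero iff all k inputs zero
    return n_cells * n_genes * out_density * 12 / 1e9

def resolve_batch_size(batch_size, n_cells, n_genes, expr_density, k,
                       mem_threshold_gb=2.0):
    if batch_size != "auto":                 # explicit int or None passes through
        return batch_size
    gb = estimate_xavg_gb(n_cells, n_genes, expr_density, k)
    return None if gb < mem_threshold_gb else 512   # materialise vs stream

print(resolve_batch_size("auto", 3_000, 20_000, 0.05, 20))      # toy → None
print(resolve_batch_size("auto", 2_000_000, 20_000, 0.05, 20))  # huge → 512
```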

Golden passes (auto selects materialised on toy data → same numerics as
batch_size=None). Integration tests back to materialised speed.
@atarashansky atarashansky merged commit 929d5e9 into main Mar 12, 2026
2 of 3 checks passed