
v3.0.0: N² → N-linear memory optimizations + GPU backend + vendor SAM#174

Merged
atarashansky merged 19 commits into main from optimize-v3
Mar 12, 2026
Conversation

@atarashansky
Owner

Summary

Eliminates all N² memory intermediates from the SAMap iteration loop, adds optional GPU support via cupy/FAISS/rapids-singlecell, and vendors SAM (sc-sam) into the package.

Estimated scaling improvement: ~300-500k cells (v2.x) → ~5-10M cells (v3.0 on 256GB RAM + A100).

Breaking changes

  • sc-sam removed as a dependency. SAM is now samap.sam.SAM (vendored subset, ~1300 LOC).
  • _smart_expand now uses BFS instead of matrix powers by default — ~1% different neighbor selection (no self-loop artifact). legacy=True available for exact reproduction.

Added

N² → N-linear memory rewrites

| Module | Before | After | Equivalence |
| --- | --- | --- | --- |
| projection.py | `Xtr = X @ gnnm_corr` (N×G, ~30% dense) | Precomposed G·diag(W/σ)·PCs; σ via quadratic form on precomputed XᵀX | 1e-12 vs sklearn σ |
| coarsening.py | `D = B @ nnm_internal.T` (N², ~0.5% dense) | Stream mutual-NN per species-pair block | 1e-12, 10 tests |
| expand.py | Iterated matrix powers `nnm^(i+1)` | Per-cell BFS with budget cap | Jaccard 0.9898 |
| correlation.py | `Xavg = nnms @ Xs` (N×G_active, 10-60% dense) | Batched gene tiles, auto-selected by memory estimate | 1e-10 |

GPU support (pip install sc-samap[gpu])

  • Backend class dispatches numpy/scipy ↔ cupy/cupyx.scipy.sparse
  • SAMAP(backend="auto"|"cpu"|"cuda") — auto detects CUDA
  • FAISS-GPU brute-force kNN (replaces HNSW on GPU — index rebuilt every iteration, so build cost was pure waste)
  • Randomized SVD with implicit mean centering (~9 SpMM vs hundreds of ARPACK matvecs)
  • rapids-singlecell UMAP/Leiden wrappers with degenerate-clustering fallback

Infrastructure

  • Golden regression test pinning stitched kNN, homology graphs, gene weights (deterministic via hnswlib monkeypatch)
  • Benchmark suite: phase-level legacy-vs-optimized timing + memory profiling
  • docs/performance.md: memory model, tuning, scaling estimates
  • mapping.py split 1856→831 LOC into 5 focused modules

Fixed

Pre-existing bugs discovered and fixed:

  • Dead proj random-walk (result discarded at next line — pure compute waste)
  • thr kwarg misrouted to p-value threshold in FunctionalEnrichment
  • Stale setup.py coexisting with pyproject.toml
  • __version__ mismatch vs pyproject.toml
  • Stray deprecated .A attribute
  • _q helper duplicated 4×
  • Broken samalg.gui.SAMGUI import
  • SAM: obsm["X_processed"] stored every iteration, never read (memory leak)

Benchmark (macOS CPU, 3k cells)

phase         legacy_s      opt_s  speedup  legacy_mb     opt_mb mem_ratio
expand           5.611      1.160     4.8×       62.8       40.3      1.6×
projection      11.791      6.330     1.9×     1179.9      573.3      2.1×
correlation      0.366      1.332    0.27×      307.4       71.1      4.3×

(correlation speedup < 1×: streaming trades time for memory at this small scale; batch_size='auto' selects the materialised path here)

End-to-end golden regression: 116s → 89s (−23%) from precompute reuse alone.

Not in this PR

  • _corr_kernel CUDA kernel (numba is dict-free and structurally portable; porting notes in module)
  • GPU path untested on real hardware (all behind @pytest.mark.skipif)
  • Tuning params not all plumbed to public API

See CHANGELOG.md and docs/performance.md for full details.

- Delete dead random-walk computation in mapping.py::_mapper (result
  written to lil matrix then immediately discarded by reassignment).
  Preserve the omp.data[:]=1 binarization side effect.
- Delete mdata['xsim'] store — never read anywhere
- Fix thr→align_thr kwarg misroute in enrichment.py (was falling
  through to unrelated p-value threshold)
- Fix deprecated .A attr → np.asarray() in mapping.py
- Delete stale root setup.py (pyproject.toml is authoritative)
- Make __version__ dynamic via importlib.metadata with fallback
- Consolidate duplicated _q helper into utils.q(), import as _q
  in the 4 callers (mapping, scores, gene_pairs, plotting)
- Delete broken SAMGUI import and gui() method (module doesn't exist)
Vendored from sc-sam v2.0.2 (samalg module). Only the methods SAMap
actually uses, split into clean submodules:

  sam/
    __init__.py  - exports SAM class
    core.py      - SAM class: __init__, load_data, preprocess_data,
                   run, calculate_nnm, dispersion_ranking_NN,
                   leiden_clustering, get_labels, scatter, save_anndata
    pca.py       - _pca_with_sparse (LinearOperator implicit-center),
                   weighted_PCA
    knn.py       - calc_nnm, nearest_neighbors_hnsw, gen_sparse_knn
    utils.py     - convert_annotations, extract_annotation

Dropped (not used by SAMap):
  - run_umap, run_tsne, run_diff_map, run_diff_umap
  - kmeans_clustering, hdbscan_clustering, louvain_clustering
  - identify_marker_genes_*
  - dill-based save/load, pickle (.p) loading

Fixes applied during vendor:
  - Deleted obsm['X_processed'] = D_sub write (n_cells x n_genes stored
    every iteration, never read back)
  - Replaced 3x .tolil() + .setdiag() + .tocsr() cycles with direct CSR
    setdiag (no format round-trip, diagonal already in sparsity structure
    since hnswlib returns self as first neighbor)
  - Dropped numba import + NumbaWarning filter (SAM has no @jit functions)
  - Rewrote gen_sparse_knn: direct COO->CSR construction instead of
    lil_matrix fancy-index scatter (~3.5x faster)

Verified: numerically equivalent to original samalg on planarian.h5ad
(weight max abs diff ~6e-8, top genes identical).

Existing SAMap imports not yet updated; that follows after golden test.
Captures current behavior of the full pipeline (hydra/planarian/schistosome)
in a compressed .npz fixture. Pins:
  - Stitched cross-species kNN graph (obsp['connectivities'])
  - Refined gene-homology graph (sm.gnnm_refined)
  - Original BLAST homology graph (sm.gnnm)
  - Per-species SAM gene weights

Determinism: hnswlib's multi-threaded add_items is the only source of
run-to-run variation in the core algorithm. The test monkeypatches
samap.core.mapping.hnswlib with a single-threaded wrapper (random_seed
pinned, num_threads=1). Verified reproducible across 3 consecutive runs.

Regenerate: pytest tests/regression/ -m slow --regenerate-golden
Run:        pytest tests/regression/ -m slow
Completes Phase 0.3 by wiring the vendored SAM into SAMap itself.

Import changes:
  - src/samap/core/mapping.py: samalg.SAM -> samap.sam.SAM;
    samalg.utilities.convert_annotations -> samap.sam.utils
  - src/samap/analysis/scores.py: samalg.SAM -> samap.sam.SAM
  - src/samap/analysis/gene_pairs.py: samalg.SAM -> samap.sam.SAM (TYPE_CHECKING)
  - src/samap/utils.py: samalg.SAM -> samap.sam.SAM (TYPE_CHECKING);
    samalg.utilities.extract_annotation -> samap.sam.utils
  - tests/integration: samalg.SAM -> samap.sam.SAM
  - tests/regression (golden): samalg.SAM -> samap.sam.SAM;
    updated hnswlib-patch comment (vendored SAM's hnswlib is in
    samap.sam.knn but not reached by golden pipeline — input SAMs
    are pre-computed)
  - tests/unit/test_utils.py: dropped importorskip('samalg') guard
    (to_vo now uses vendored extract_annotation unconditionally)

Deps:
  - Removed sc-sam>=2.0.0 from pyproject.toml dependencies
  - All transitive deps (hnswlib, leidenalg, anndata via scanpy,
    sklearn via scanpy, umap via scanpy) already listed or provided

Verified:
  - Golden regression test passes (1 passed in 68s)
  - All 27 unit tests pass
  - ruff check src/ tests/ clean
  - rg samalg src/ tests/ -> only docstring provenance notes remain
…pyx (Phase 2.1)

- Backend class: device='cpu'|'cuda'|'auto' with xp/sp/spla namespace
  access and gpu flag
- Compat shims for scipy/cupy divergence: nonzero (cupy sparse has no
  .nonzero()), sparse_from_coo, setdiag (no LIL round-trip), svds (strips
  solver=/v0=/random_state= on GPU), LinearOperator
- Data movement: to_device/to_host handle dense + CSR/CSC/COO,
  asfortran_if_gpu pre-converts for cuSPARSE SpMM, free_pool flushes
  cupy memory pools
- COOBuilder: host-side triplet accumulator that finalizes to CSR/CSC on
  the active backend — replaces lil_matrix fancy-assign pattern (no LIL
  on cupy)
- cupy is optional: module imports cleanly without it, HAS_CUPY flag
  exported, Backend('cuda') raises clear error if unavailable
- 33 CPU unit tests + 18 GPU tests behind skipif guard
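A minimal sketch of the dispatch idea (constructor internals are assumed — the real Backend also carries spla, the compat shims, and the data-movement helpers listed above):

```python
import numpy as np
import scipy.sparse

class Backend:
    """Dispatch numpy/scipy vs cupy/cupyx.scipy.sparse by device (sketch)."""
    def __init__(self, device="auto"):
        self.gpu = False
        if device in ("auto", "cuda"):
            try:
                import cupy  # optional: module must import cleanly without it
                # (the real 'auto' also checks that a CUDA device is present)
                self.gpu = True
            except ImportError:
                if device == "cuda":
                    raise RuntimeError("device='cuda' requested but cupy is unavailable")
        if self.gpu:
            import cupy
            import cupyx.scipy.sparse
            self.xp, self.sp = cupy, cupyx.scipy.sparse
        else:
            self.xp, self.sp = np, scipy.sparse

# same call sites work on either device via bk.xp / bk.sp
bk = Backend("cpu")
row_sums = np.asarray(bk.sp.eye(3).sum(axis=1)).ravel()
print(row_sums)  # → [1. 1. 1.]
```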
Pure refactor — extract ~20 private module-level functions from the
1856-line mapping.py into five focused modules. No algorithm changes.

  mapping.py     1856 → 831 lines  (SAMAP class, _Samap_Iter, _avg_as)
  homology.py    new · 272 lines  (BLAST graph build/coarsen/filter, _tanh_scale)
  correlation.py new · 320 lines  (numba refinement kernels, _refine_corr)
  projection.py  new · 243 lines  (hnswlib _united_proj, _mapping_window, prepare_SAMap_loadings)
  coarsening.py  new · 265 lines  (_mapper, _concatenate_sam, coclustering)
  expand.py      new ·  61 lines  (_smart_expand, _sparse_knn_ks)

Cross-module deps:
  - _tanh_scale lives in homology.py; imported by projection + coarsening
  - coarsening.py imports from correlation, expand, homology, projection
  - mapping.py re-imports public-ish names it needs for SAMAP.__init__
    and _Samap_Iter.run (unchanged signatures)

Dropped from mapping.py imports (no longer needed there):
  hnswlib, numba.{njit,prange}, Numba*Warning, StandardScaler,
  df_to_dict, to_vn, sparse_knn, warnings.filterwarnings(Numba...)
  — numba warning filters now live in correlation.py alongside the jitted kernels.

test_golden_output.py: update hnswlib monkeypatch target from
  samap.core.mapping → samap.core.projection (the _united_proj
  call site moved). Golden fixture data unchanged.

Verified:
  pytest tests/regression/ -m slow    PASS (bit-identical)
  pytest tests/unit/ tests/integration/  66 passed, 18 skipped
  ruff check src/                     clean
The original _smart_expand materializes nnm^i sparse matrix powers to
build hop-i rings, then trims each ring to a per-cell budget. This
densifies geometrically at scale.

New _smart_expand_bfs does a per-cell budget-capped BFS over the CSR
adjacency arrays directly. Numba-parallel over cells, O(budget * k)
working set per cell, never materializes matrix powers.
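A single-cell, pure-Python sketch of that budget-capped BFS (the shipped kernel is numba-parallel over all cells; names here are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

def bfs_expand(indptr, indices, src, n_hops, budget):
    """Collect up to `budget` neighbours within n_hops of src, excluding src."""
    visited = {src}          # self excluded by construction — no self-loop slot wasted
    frontier = [src]
    out = []
    for _ in range(n_hops):
        nxt = []
        for u in frontier:
            for v in indices[indptr[u]:indptr[u + 1]]:  # CSR row of u
                if v not in visited:
                    visited.add(v)
                    nxt.append(v)
                    out.append(int(v))
                    if len(out) >= budget:              # per-cell budget cap
                        return out
        frontier = nxt
    return out

# 4-cycle graph: 0-1, 1-2, 2-3, 3-0
A = csr_matrix(np.array([[0, 1, 0, 1],
                         [1, 0, 1, 0],
                         [0, 1, 0, 1],
                         [1, 0, 1, 0]]))
print(bfs_expand(A.indptr, A.indices, src=0, n_hops=2, budget=10))  # → [1, 3, 2]
```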

Equivalence: exact match when budget >= reachable-within-NH-hops; under
truncation, Jaccard ~0.99 on synthetic data (n=500, k=20, NH=3). The
residual divergence is from in-ring ranking (path-sum vs max-edge weight)
and matpow's spurious self-loops — nnm^2 has nonzero diagonal that
survives ring subtraction, so matpow wastes ~1 budget slot/cell on self.
BFS excludes self by construction and collects ~1 more useful neighbour.

Exposed as _smart_expand(..., legacy: bool). Default legacy=True keeps
the matpow path so golden regression passes byte-identically. Flip the
default once BFS is validated end-to-end.

Tests cover: exact-match regime, truncation regime (Jaccard > 0.9),
structural invariants (budget, self-loop, binarized output), edge cases
(zero budget, disconnected, empty graph), dispatch routing.
…ase 2.2/2.3)

SAMAP.__init__ now accepts backend='auto'|'cpu'|'cuda', creates a
Backend instance, and passes it down to _Samap_Iter. The iterator
stores self._bk for use by hot-path functions once those rewrites
land (call sites intentionally not changed yet — other agents own
projection/coarsening/expand/correlation).

LIL removal for GPU compat:

Added utils.coo_to_csr_overwrite() — builds CSR from COO triplets
with last-write-wins semantics (matches LIL fancy-index assignment,
unlike standard COO sum-duplicates). Required for exact golden match
since BLAST output and cross-species pairs can have duplicate
(row, col) entries.
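The semantics can be illustrated with a small stand-in (the helper name is from this commit; this implementation is a simplified sketch, not the shipped one):

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

def coo_to_csr_overwrite(rows, cols, vals, shape):
    """CSR from COO triplets with last-write-wins on duplicate (row, col)."""
    last = {}
    for i, rc in enumerate(zip(rows, cols)):
        last[rc] = i                                   # later triplet wins
    keep = np.fromiter(last.values(), dtype=np.int64)
    r = np.asarray(rows)[keep]
    c = np.asarray(cols)[keep]
    v = np.asarray(vals)[keep]
    return csr_matrix((v, (r, c)), shape=shape)

# duplicate (0, 1): standard COO sums to 7.0; LIL-style overwrite keeps 5.0
m = coo_to_csr_overwrite([0, 0], [1, 1], [2.0, 5.0], (2, 2))
summed = coo_matrix(([2.0, 5.0], ([0, 0], [1, 1])), shape=(2, 2)).tocsr()
print(m[0, 1], summed[0, 1])  # → 5.0 7.0
```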

- homology.py: _calculate_blast_graph gnnm1/gnnm2 and _filter_gnnm
  symmetrization — use coo_to_csr_overwrite
- scores.py: convert_eggnog_to_homologs (plain COO sufficient —
  only nonzero pattern matters downstream), CellTypeTriangles,
  GeneTriangles — use coo_to_csr_overwrite

Deliberate deviation from task spec: COOBuilder's finalize() uses
sum-duplicates semantics, which changes numerical output when
duplicates exist. The dedup helper approach preserves exact output
(golden passes) and achieves the same GPU-compat goal (no LIL).

Verification:
- pytest tests/regression/ -m slow: PASS (golden match)
- pytest tests/unit/: 71 passed, 18 skipped (GPU)
- ruff check (my files): clean
Eliminates the N² dense intermediate D = B @ nnm_internal.T by computing
per-species-pair blocks D[a,b] = B[a,b] @ nnm_b.T and mutualising each
pair independently. B is block-off-diagonal and nnm_internal is
block-diagonal, so the per-pair decomposition is exact.

Peak memory drops from O(N²) to O(chunk × N_b) per species-pair.

- _compute_mutual_graph: core streaming helper; chunks over source-species
  rows, computes left = B_ab[chunk] @ nnm_b.T and right = nnm_a[chunk] @
  B_ba.T, mutualises sqrt(left ⊙ right), accumulates into COOBuilder
- _scale_by_corr: per-chunk correlation rescaling (row-maxes are exact
  since each chunk holds all column-blocks for its rows)
- neigh_from_keys (coclustering) path: keeps factored M @ M.T form, uses
  a precomputed M.T @ B_ba.T per partner (n_clusters × N_b, tiny) reused
  across chunks
- Preserves the original asymmetry: 0.1 threshold only when no species
  uses coclustering

_mapper signature unchanged. Numerically bit-identical to the old path —
golden regression passes.
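A toy check of why the per-pair decomposition is exact (shapes and names illustrative): with B block-off-diagonal and nnm_internal block-diagonal, each block of the full product reduces to exactly one species-pair product.

```python
import numpy as np
from scipy.sparse import random as sparse_random, block_diag, bmat

na, nb = 60, 40
B_ab = sparse_random(na, nb, density=0.1, random_state=0, format="csr")
B_ba = sparse_random(nb, na, density=0.1, random_state=1, format="csr")
nnm_a = sparse_random(na, na, density=0.1, random_state=2, format="csr")
nnm_b = sparse_random(nb, nb, density=0.1, random_state=3, format="csr")

# full matrices: B block-off-diagonal, nnm_internal block-diagonal
B = bmat([[None, B_ab], [B_ba, None]], format="csr")
nnm = block_diag([nnm_a, nnm_b], format="csr")

D_full = (B @ nnm.T).toarray()            # the N² intermediate being eliminated
D_ab = (B_ab @ nnm_b.T).toarray()         # one streamed species-pair block
print(np.allclose(D_full[:na, na:], D_ab))  # → True
```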

Also: add RUF002/003/SIM108 to ruff ignores (math notation in docstrings,
mathematical if/else clearer than ternary).
… intermediate (Phase 3.1)

Precompose the projection G · diag(W/σ) · PCs so the per-iteration work is
a single SpMM (ss_i @ P) producing an N × npcs result directly, instead of
materialising the ~30%-dense N × G translated feature matrix Xtr.

Algebra
-------
Old path per species pair (i → s):
  Xtr_is  = ss_i @ G_is                    -- N_i × G_s, ~30% dense  ← ELIMINATED
  Xscaled = Xtr_is / σ_is  (StandardScaler with_mean=False)
  wpca    = (Xscaled * W_s) @ PCs_s

New path:
  σ_is²   = diag(G_isᵀ · XtX_i · G_is) / n_i − (μ_i · G_is)²   [quadratic form]
          where XtX_i = ss_iᵀ @ ss_i and μ_i = ss_i.mean(0) are
          precomputed once in _projection_precompute (iteration-invariant)
  P_is    = G_is · diag(W_s / σ_is) · PCs_s                    -- G_i × npcs, ~few MB
  wpca    = ss_i @ P_is                                        -- ONE SpMM

The diag of the quadratic form is extracted as ((XtX @ G) ⊙ G).sum(0) —
one SpGEMM + one elementwise product, never forming G_s × G_s.
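A numeric sanity check of the σ identity on random data, comparing against the per-column biased variance of the materialised Xtr (which is what StandardScaler(with_mean=False) computes; the real _compute_sigma additionally maps zero-variance columns to 1.0):

```python
import numpy as np
from scipy.sparse import random as sparse_random

X = sparse_random(200, 50, density=0.3, random_state=0, format="csr")  # ss_i
G = sparse_random(50, 30, density=0.2, random_state=1, format="csr")   # G_is

# Reference: materialise Xtr and take its per-column biased variance (ddof=0)
Xtr = (X @ G).toarray()
var_ref = Xtr.var(axis=0, ddof=0)

# Quadratic form: diag(Gᵀ·XtX·G)/n − (μ·G)², never forming Xtr
XtX = (X.T @ X).toarray()
mu = np.asarray(X.mean(axis=0)).ravel()
Gd = G.toarray()
# diag(Gᵀ A G)_j = Σ_i (A G)_ij · G_ij — one product + one elementwise sum
var = np.einsum("ij,ij->j", XtX @ Gd, Gd) / X.shape[0] - (mu @ Gd) ** 2
print(np.allclose(var, var_ref, rtol=1e-8, atol=1e-12))  # → True
```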

New public surface
------------------
  _projection_precompute(sams, gns, bk)  — build iteration-invariant state:
    ss[sid], XtX[sid], mu_ss[sid], wpca_own[sid], M_own[sid], indexers.
    Own-species projection ss_i @ PCs_i is fully iteration-invariant;
    cached here, referenced (not recomputed) per iteration.
  _compute_sigma(XtX, mu, G, n, bk)      — σ via quadratic form.
    Matches sklearn StandardScaler(with_mean=False).scale_ exactly:
    biased variance (ddof=0), zero-variance columns → 1.0.
  _mapping_window_fast(gnnm, precompute, K, pairwise)  — per-iteration worker.

Backend-agnostic via samap.core._backend.Backend (xp/sp dispatch);
defaults to CPU. hnswlib kNN step stays on host (no GPU hnswlib).

Backward compat
---------------
_mapping_window(sams, gnnm, gns, K, pairwise) keeps its old signature as a
wrapper: builds precompute on-the-fly, delegates to _mapping_window_fast.
coarsening._mapper still calls _mapping_window unchanged — the
orchestration layer will swap in precompute caching in a later phase.
gnnm=None path (own-species bootstrap) unchanged.

Verification
------------
  tests/unit/test_projection.py  -- 11 tests:
    · σ quadratic-form vs sklearn at rtol=1e-12 (4 seeds)
    · σ zero-variance → 1.0 edge case
    · wpca equivalence: 2- and 3-species × {pairwise, all-to-all} at rtol=1e-6
      (3-species all-to-all is non-trivial: global ≠ per-pair normalisation)
    · precompute is iteration-invariant (same dict, two different gnnms)
    · wrapper ≡ fast path
  pytest tests/regression/ -m slow       PASS (golden bit-identical)
  pytest tests/unit/ tests/integration/  98 passed, 18 skipped
  ruff check projection.py test_projection.py  clean
… (Phase 3 finalization)

The precomposed feature-translation path from Phase 3.1 now actually pays
off: _projection_precompute runs ONCE in _Samap_Iter.__init__, and the
cached state (ss, XtX, mu_ss, wpca_own) is reused across all NUMITERS
iterations via _mapping_window_fast. Previously the backward-compat
wrapper rebuilt the precompute on every iteration.

Observed: golden regression test dropped from ~116s to ~89s (3-species,
3 iterations) — ~23% wall-clock reduction on this workload.

mapping.py
----------
  _Samap_Iter.__init__  + build self._gns = concat(gns_dict.values())
                        + build self._proj_cache = _projection_precompute(...)
  _Samap_Iter.run()     - gns rebuilt every call → reuse self._gns
                        + pass proj_cache=self._proj_cache, bk=self._bk
                          to _mapper
  imports               + _projection_precompute from .projection

coarsening.py
-------------
  _mapper               + proj_cache: dict | None = None
                        + bk: Backend | None = None
                        · if proj_cache is provided, route to
                          _mapping_window_fast instead of the wrapper
                        · thread bk → _compute_mutual_graph, _smart_expand
  _compute_mutual_graph + bk: Backend | None = None
                        · replaces hardcoded Backend("cpu") — now honours
                          upstream backend choice
  imports               + _mapping_window_fast from .projection

expand.py
---------
  _smart_expand         + bk kwarg (unused; threaded for Phase 4 GPU)
                        + TODO: flip legacy default → BFS after golden regen

The _mapping_window wrapper stays as a working deprecated shim — still
referenced by test_projection.py parity test and the proj_cache=None
fallback in _mapper.

Verified
--------
  pytest tests/regression/ -m slow   PASS (bit-identical)
  pytest tests/unit/                 92 passed, 18 skipped
  ruff check mapping/coarsening/expand/projection.py  clean
Two independent changes landing together:

1. **Dict-free numba kernel** (GPU-compat prep)
   - _refine_corr_kernel → _corr_kernel: species lookup now uses integer
     sp_starts/sp_lens arrays instead of a Python dict inside the hot loop
   - String species IDs → integer IDs (indexed by np.unique position)
   - np.append → preallocated np.empty + sliced assignment (cleaner for CUDA)
   - corr_mode string comparison moved out of kernel (pearson: bool arg)
   - Numerically bit-identical to original (same Pearson formula, same Xi)
   - Contiguity sanity check in _refine_corr_parallel guards the assumption

2. **Batched Xavg computation** (memory win)
   - New _compute_pair_corrs dispatches materialised (batch_size=None) vs
     streaming (batch_size=int)
   - Streaming: process pairs in batches; per batch compute Xavg only for
     the ≤ 2·batch_size genes actually referenced (nnms @ Xs[:, needed]),
     correlate, discard
   - Peak memory: O(N × G_active) → O(N × 2·batch_size)
   - Uses searchsorted on np.unique output to map global→local gene cols
   - Some recomputation when a gene spans batches; cheap (single-column SpMV)
   - Default batch_size=None → materialised path → golden passes unchanged

_refine_corr and _refine_corr_parallel both gain batch_size kwarg
(default None, backward-compatible; mapping.py unchanged).

14 equivalence tests: kernel vs pure-numpy reference (Pearson+Xi),
streaming vs materialised across batch_size ∈ {1,7,32,50,64,200,256,500,10k},
2-species and 3-species (shuffled pairs), empty-pairs edge case.
All match to rtol=1e-10.
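A toy sketch of the streaming idea in (2): per batch, smooth only the gene columns that batch's pairs actually reference, rather than all of Xavg at once (names and shapes illustrative; the real path dispatches via _compute_pair_corrs):

```python
import numpy as np
from scipy.sparse import random as sparse_random, identity

n, g = 300, 40
Xs = sparse_random(n, g, density=0.2, random_state=0, format="csc")
nnms = identity(n, format="csr")               # stand-in for the kNN smoother
pairs = np.array([[0, 5], [1, 7], [2, 5]])     # (gene_a, gene_b) index pairs

def pair_corrs_streaming(pairs, batch_size):
    out = np.empty(len(pairs))
    for s in range(0, len(pairs), batch_size):
        batch = pairs[s:s + batch_size]
        needed = np.unique(batch)               # ≤ 2·batch_size gene columns
        Xavg = (nnms @ Xs[:, needed]).toarray() # smooth only those columns
        loc = np.searchsorted(needed, batch)    # map global → local gene cols
        for i, (a, b) in enumerate(loc):
            out[s + i] = np.corrcoef(Xavg[:, a], Xavg[:, b])[0, 1]
    return out

full = pair_corrs_streaming(pairs, batch_size=len(pairs))   # "materialised"
stream = pair_corrs_streaming(pairs, batch_size=1)          # streaming
print(np.allclose(full, stream))  # → True
```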
Adds randomized_svd_implicit_center() and wires it into _pca_with_sparse
via svd_solver='randomized'. ARPACK remains the default — no behaviour
change for existing callers.

Algorithm: Halko-Martinsson-Tropp randomized range-finding adapted for
implicit mean-centering. Every (X - 1·μ) @ M is computed as
X @ M - 1·(μ @ M) — one SpMM plus a rank-1 broadcast correction, never
materialising the dense centered matrix. With 4 power iterations this
replaces ARPACK's hundreds of serial matvecs with ~9 block SpMM passes.
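The centering identity itself is easy to verify in isolation — one SpMM plus a rank-1 broadcast correction, never densifying the centered matrix:

```python
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)
X = sparse_random(500, 80, density=0.1, random_state=0, format="csr")
mu = np.asarray(X.mean(axis=0)).ravel()    # column means μ, shape (80,)
M = rng.standard_normal((80, 10))          # one block of the random sketch

implicit = X @ M - mu @ M                  # SpMM + rank-1 broadcast correction
explicit = (X.toarray() - mu) @ M          # densified centered matrix (avoided)
print(np.allclose(implicit, explicit))     # → True
```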

GPU-friendly via Backend dispatch: uses bk.xp/bk.sp throughout, calls
bk.asfortran_if_gpu before each SpMM (cuSPARSE prefers F-order RHS).
Random sketch is seeded for determinism; svd_flip applied for sign
consistency with ARPACK path.

Tests (tests/unit/test_pca.py, 14 pass + 2 GPU-skipped):
  - ARPACK vs randomized: explained variance <1% rel err on top-10,
    subspace angles <0.1 rad within signal rank, reconstruction <5%
  - Both vs sklearn dense PCA: variance and subspace agreement
  - API: arpack-is-default, mu_axis=1 rejected, seeded determinism,
    n_power monotonicity
  - GPU tests behind HAS_CUPY skipif

Verified: unit tests pass; golden regression passes (arpack default
untouched); ruff clean.
- New src/samap/core/knn.py: approximate_knn() dispatches between CPU
  HNSW (hnswlib) and GPU brute-force (FAISS GpuIndexFlatIP) based on
  Backend.gpu. faiss/faiss-gpu are optional; graceful fallback to hnswlib
  with warning if GPU backend has no faiss-gpu.
- _hnswlib_knn: extracted from projection._united_proj with identical
  params (ef=200, M=48, space='cosine'). num_threads exposed for
  golden-test determinism.
- _faiss_gpu_knn: L2-normalise on GPU, GpuIndexFlatIP (exact), convert
  IP -> cosine distance. Results returned on host. TODO note for
  GpuIndexIVFFlat at >1M points.
- Backend: add faiss_gpu_resources() lazy cache (StandardGpuResources
  is expensive to create, safe to reuse).
- projection._united_proj: now calls approximate_knn; bk parameter
  threaded through from _mapping_window_fast. Also replaced LIL
  fancy-index with COO construction (equivalent: HNSW returns distinct
  neighbours so no duplicates).
- Golden test monkeypatch retargeted: projection.hnswlib -> knn.hnswlib.
- Tests: 9 CPU pass (recall vs brute-force >95%, output format, single-
  thread determinism), 6 GPU-gated (FAISS exact vs brute-force, cupy
  input, resource cache).
- Golden regression: passes bit-exact.
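The IP→cosine conversion in _faiss_gpu_knn rests on the identity that, for L2-normalised vectors, cosine distance equals 1 minus the inner product — a quick numpy check:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(16), rng.standard_normal(16)
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)

# cosine distance from raw vectors vs 1 − IP of the normalised pair
cos_dist = 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(1 - an @ bn, cos_dist))  # → True
```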
…4.3)

Three pieces:

1. _replace_vectorized(X, xi, yi, bk, batch_size=None)
   - Backend-agnostic per-pair Pearson over dense rows: gather 2·batch rows,
     centre, dot, normalise — one reduction kernel on GPU via bk.xp
   - Upcasts to float64 internally (wPCA often stored float32)
   - Optional batch_size caps working set at 2·batch_size·d floats; default
     None processes all pairs in one shot
   - Zero-variance rows → NaN (matches _replace; callers already handle)
   - np.errstate(invalid='ignore') silences the expected 0/0 warning

2. replace_corr(X, xi, yi, bk=None, batch_size=None) dispatcher
   - bk=None or not bk.gpu → numba _replace (CPU fast path: fused loop,
     no (n_pairs × d) allocation)
   - bk.gpu → _replace_vectorized (one large kernel launch)
   - Drop-in for callers that have a Backend; existing direct _replace
     callers unchanged

3. CUDA-porting notes for _corr_kernel (comment block)
   - Blockers enumerated: dynamic alloc (np.zeros/empty inside kernel),
     fancy-index scatter, .mean/.std method calls, _xicorr's argsort
   - Prescribed fix: don't densify — iterate CSC nonzeros directly,
     test each index against the two species ranges, accumulate Pearson
     sums in scalar registers (two-pass: mean then centred products)
   - Result: O(1) per-thread state, no shared memory needed
   - Xi correlation has no @cuda.jit equivalent; gate GPU kernel to
     pearson=True, fall back to CPU for Xi

12 new tests (26 total in test_correlation.py):
- _replace_vectorized vs np.corrcoef and vs numba _replace (rtol=1e-12)
- Batched vs full across batch_size ∈ {1,7,100,500,2000} — bit-identical
- float32 input upcast; zero-variance NaN equivalence with numba
- Dispatcher: CPU backend → numba (bit-identical), bk=None → numba,
  mock GPU backend → vectorised path
Three independent benchmarks, one per Phase-3 optimization target:

  expand      — matpow vs BFS _smart_expand (legacy= toggle)
  projection  — per-iter _mapping_window rebuild vs precompute-once +
                _mapping_window_fast (proj_cache toggle)
  correlation — materialised Xavg vs streaming _compute_pair_corrs
                (batch_size toggle)

Phase-level (direct function calls) because the toggles aren't plumbed
through _mapper yet — gives clean per-optimization attribution.

bench_samap.py: parameterised scale sweep (1k/3k/10k/30k cells),
tracemalloc peak memory, CSV output + summary table. plot_bench.py:
log-log scaling curves (time + memory, per phase).

Verified on macOS CPU, --max-cells 3000:

  expand:     3.5x→4.8x speedup (grows with n — matpow densification)
  projection: ~2x speedup, ~2x memory (precompute reuse)
  correlation: 4.3x memory win; streaming overhead at small n, breaks
               even when materialised Xavg approaches RAM limits
…e 5 finalization)

All the optimized paths landed in Phases 3-4 now default-on. Golden
regenerated to reflect BFS neighbourhood expansion (slightly different
marginal neighbours — ~1% edge difference, no self-loop artefact).

Default flips
-------------
  _smart_expand:        legacy=True  → legacy=False  (BFS)
  _refine_corr:         batch_size=None → batch_size=512
  _refine_corr_parallel: batch_size=None → batch_size=512

BFS is a pure win (~5× at 3k cells, memory-bounded). Streaming correlation
is a memory-for-speed trade: ~4× less peak memory, ~3-5× slower on toy
data (<10k cells) where the full Xavg would fit anyway. Tuned for
million-cell scale where Xavg would OOM. Pass batch_size=None explicitly
for the fast materialised path on small data.

Golden regenerated
------------------
  tests/regression/fixtures/golden_3species.npz — regenerated with BFS +
  batch_size=512 defaults. BFS changes the neighbour selection; batched
  correlation is bit-identical to materialised.

New files
---------
  docs/performance.md — memory model, backend selection, tuning, scaling
    estimates, benchmark results
  CHANGELOG.md        — 3.0.0 section: breaking (sc-sam vendored, BFS
    default), added (GPU backend, N²→N-linear rewrites, randomized SVD,
    benchmark suite), fixed (Phase 0.2 bugs), changed (module split,
    defaults)

Version bump
------------
  pyproject.toml: 2.0.2 → 3.0.0

Verified
--------
  pytest tests/regression/ -m slow        PASS (new golden)
  pytest tests/unit/                      146 passed, 26 skipped
  pytest tests/integration/               6 passed (~160s — streaming
                                          overhead at toy scale)
  ruff check src/                         clean
  benchmarks/bench_samap.py --max-cells 1000  ran cleanly; summary matches docs
…emory

Changes the default batch_size from 512 to 'auto'. The streaming path is
necessary at million-cell scale but 3-5× slower on toy data (~3k cells)
where the full Xavg fits in memory — auto-selection picks the right path.

- _resolve_batch_size(batch_size, nnms, Xs, mem_threshold_gb=2.0):
  estimates materialised Xavg size as n_cells × n_genes × out_density × 12
  bytes (CSC data+indices), where out_density ≈ 1 - (1-expr_density)^k
  (kNN-smoothed sparse expression fills in; output entry is zero iff all
  k contributing inputs are zero). Under threshold → None (materialise);
  over → 512 (stream). Explicit batch_size (int or None) passes through
  unchanged. Decision logged at INFO.

- _refine_corr / _refine_corr_parallel: batch_size default 512 → 'auto';
  new correlation_mem_threshold param (default 2.0 GB). Resolution happens
  in _refine_corr_parallel right after nnms/Xs are built (shapes known).

- mapping.py: correlation_mem_threshold exposed on both
  SAMAP.refine_homology_graph and _Samap_Iter.refine_homology_graph.

- docs/performance.md: documents auto-selection and the tuning knob.

5 new tests (31 total): explicit passthrough (None/32/9999 untouched);
toy 500-cell data → materialise; mock million-cell data → stream (uses
duck-typed .shape/.nnz mock, no real alloc); threshold boundary flip;
zero-sized degenerate inputs → materialise.
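The decision rule can be restated as a small sketch (function shape assumed from the description above; the real helper takes nnms/Xs and reads shapes and nnz rather than explicit density/k arguments):

```python
def estimate_xavg_gb(n_cells, n_genes, expr_density, k):
    """Materialised Xavg estimate: n·g·out_density·12 bytes (CSC data+indices)."""
    out_density = 1.0 - (1.0 - expr_density) ** k   # zero iff all k inputs zero
    return n_cells * n_genes * out_density * 12 / 1e9

def resolve_batch_size(batch_size, n_cells, n_genes, expr_density, k,
                       mem_threshold_gb=2.0):
    if batch_size != "auto":                 # explicit int or None passes through
        return batch_size
    gb = estimate_xavg_gb(n_cells, n_genes, expr_density, k)
    return None if gb < mem_threshold_gb else 512   # materialise vs stream

print(resolve_batch_size("auto", 3_000, 20_000, 0.05, 20))      # toy → None
print(resolve_batch_size("auto", 2_000_000, 20_000, 0.05, 20))  # huge → 512
```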

Golden passes (auto selects materialised on toy data → same numerics as
batch_size=None). Integration tests back to materialised speed.
@atarashansky atarashansky merged commit 929d5e9 into main Mar 12, 2026
2 of 3 checks passed