v3.0.0: N² → N-linear memory optimizations + GPU backend + vendor SAM#174
Merged: atarashansky merged 19 commits into main, Mar 12, 2026
- Delete dead random-walk computation in mapping.py::_mapper (result written to LIL matrix, then immediately discarded by reassignment). Preserve the omp.data[:] = 1 binarization side effect.
- Delete mdata['xsim'] store — never read anywhere.
- Fix thr→align_thr kwarg misroute in enrichment.py (was falling through to an unrelated p-value threshold).
- Fix deprecated .A attribute → np.asarray() in mapping.py.
- Delete stale root setup.py (pyproject.toml is authoritative).
- Make __version__ dynamic via importlib.metadata, with a fallback.
- Consolidate the duplicated _q helper into utils.q(); import as _q in the 4 callers (mapping, scores, gene_pairs, plotting).
- Delete broken SAMGUI import and gui() method (module doesn't exist).
Vendored from sc-sam v2.0.2 (samalg module). Only the methods SAMap
actually uses, split into clean submodules:
sam/
__init__.py - exports SAM class
core.py - SAM class: __init__, load_data, preprocess_data,
run, calculate_nnm, dispersion_ranking_NN,
leiden_clustering, get_labels, scatter, save_anndata
pca.py - _pca_with_sparse (LinearOperator implicit-center),
weighted_PCA
knn.py - calc_nnm, nearest_neighbors_hnsw, gen_sparse_knn
utils.py - convert_annotations, extract_annotation
Dropped (not used by SAMap):
- run_umap, run_tsne, run_diff_map, run_diff_umap
- kmeans_clustering, hdbscan_clustering, louvain_clustering
- identify_marker_genes_*
- dill-based save/load, pickle (.p) loading
Fixes applied during vendor:
- Deleted obsm['X_processed'] = D_sub write (n_cells x n_genes stored
every iteration, never read back)
- Replaced 3x .tolil() + .setdiag() + .tocsr() cycles with direct CSR
setdiag (no format round-trip, diagonal already in sparsity structure
since hnswlib returns self as first neighbor)
- Dropped numba import + NumbaWarning filter (SAM has no @jit functions)
- Rewrote gen_sparse_knn: direct COO->CSR construction instead of
lil_matrix fancy-index scatter (~3.5x faster)
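The direct-CSR setdiag change can be illustrated with plain scipy — a minimal toy sketch, not the vendored code:

```python
import numpy as np
import scipy.sparse as sp

# kNN graph where each point's first neighbour is itself (as hnswlib
# returns), so the diagonal is already in the CSR sparsity structure.
knn = sp.csr_matrix(np.array([[0.5, 1.0, 0.0],
                              [0.0, 0.5, 1.0],
                              [1.0, 0.0, 0.5]]))

# Old pattern: knn.tolil(); knn.setdiag(0); knn.tocsr() — two format copies.
# Direct CSR setdiag rewrites the existing stored entries in place:
knn.setdiag(0)

print(knn.diagonal())  # [0. 0. 0.]
```

Because no new nonzeros are introduced, this stays cheap; the zeroed entries remain as explicit zeros unless eliminate_zeros() is called.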
Verified: numerically equivalent to original samalg on planarian.h5ad
(weight max abs diff ~6e-8, top genes identical).
Existing SAMap imports not yet updated; that follows after golden test.
Captures current behavior of the full pipeline (hydra/planarian/schistosome) in a compressed .npz fixture. Pins:
- Stitched cross-species kNN graph (obsp['connectivities'])
- Refined gene-homology graph (sm.gnnm_refined)
- Original BLAST homology graph (sm.gnnm)
- Per-species SAM gene weights
Determinism: hnswlib's multi-threaded add_items is the only source of run-to-run variation in the core algorithm. The test monkeypatches samap.core.mapping.hnswlib with a single-threaded wrapper (random_seed pinned, num_threads=1). Verified reproducible across 3 consecutive runs.
Regenerate: pytest tests/regression/ -m slow --regenerate-golden
Run: pytest tests/regression/ -m slow
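The single-threaded determinism shim can be sketched like this (names and exact behavior are hypothetical — the actual wrapper lives in the test suite):

```python
class DeterministicHNSW:
    """Wraps an hnswlib.Index-like object so index construction is
    reproducible: pinned random_seed, single-threaded add_items."""

    def __init__(self, inner):
        self._inner = inner

    def init_index(self, **kwargs):
        kwargs["random_seed"] = 100      # pin the level-assignment RNG
        return self._inner.init_index(**kwargs)

    def add_items(self, data, num_threads=-1):
        # multi-threaded insertion order is nondeterministic → force 1 thread
        return self._inner.add_items(data, num_threads=1)

    def __getattr__(self, name):         # everything else passes through
        return getattr(self._inner, name)
```

In a test this would be applied via monkeypatch over the module attribute that holds hnswlib, so the pipeline code is untouched.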
Completes Phase 0.3 by wiring the vendored SAM into SAMap itself.
Import changes:
- src/samap/core/mapping.py: samalg.SAM -> samap.sam.SAM;
samalg.utilities.convert_annotations -> samap.sam.utils
- src/samap/analysis/scores.py: samalg.SAM -> samap.sam.SAM
- src/samap/analysis/gene_pairs.py: samalg.SAM -> samap.sam.SAM (TYPE_CHECKING)
- src/samap/utils.py: samalg.SAM -> samap.sam.SAM (TYPE_CHECKING);
samalg.utilities.extract_annotation -> samap.sam.utils
- tests/integration: samalg.SAM -> samap.sam.SAM
- tests/regression (golden): samalg.SAM -> samap.sam.SAM;
updated hnswlib-patch comment (vendored SAM's hnswlib is in
samap.sam.knn but not reached by golden pipeline — input SAMs
are pre-computed)
- tests/unit/test_utils.py: dropped importorskip('samalg') guard
(to_vo now uses vendored extract_annotation unconditionally)
Deps:
- Removed sc-sam>=2.0.0 from pyproject.toml dependencies
- All transitive deps (hnswlib, leidenalg, anndata via scanpy,
sklearn via scanpy, umap via scanpy) already listed or provided
Verified:
- Golden regression test passes (1 passed in 68s)
- All 27 unit tests pass
- ruff check src/ tests/ clean
- rg samalg src/ tests/ -> only docstring provenance notes remain
…pyx (Phase 2.1)
- Backend class: device='cpu'|'cuda'|'auto' with xp/sp/spla namespace
access and gpu flag
- Compat shims for scipy/cupy divergence: nonzero (cupy sparse has no
.nonzero()), sparse_from_coo, setdiag (no LIL round-trip), svds (strips
solver=/v0=/random_state= on GPU), LinearOperator
- Data movement: to_device/to_host handle dense + CSR/CSC/COO,
asfortran_if_gpu pre-converts for cuSPARSE SpMM, free_pool flushes
cupy memory pools
- COOBuilder: host-side triplet accumulator that finalizes to CSR/CSC on
the active backend — replaces lil_matrix fancy-assign pattern (no LIL
on cupy)
- cupy is optional: module imports cleanly without it, HAS_CUPY flag
exported, Backend('cuda') raises clear error if unavailable
- 33 CPU unit tests + 18 GPU tests behind skipif guard
Pure refactor — extract ~20 private module-level functions from the
1856-line mapping.py into five focused modules. No algorithm changes.
mapping.py 1856 → 831 lines (SAMAP class, _Samap_Iter, _avg_as)
homology.py new · 272 lines (BLAST graph build/coarsen/filter, _tanh_scale)
correlation.py new · 320 lines (numba refinement kernels, _refine_corr)
projection.py new · 243 lines (hnswlib _united_proj, _mapping_window, prepare_SAMap_loadings)
coarsening.py new · 265 lines (_mapper, _concatenate_sam, coclustering)
expand.py new · 61 lines (_smart_expand, _sparse_knn_ks)
Cross-module deps:
- _tanh_scale lives in homology.py; imported by projection + coarsening
- coarsening.py imports from correlation, expand, homology, projection
- mapping.py re-imports public-ish names it needs for SAMAP.__init__
and _Samap_Iter.run (unchanged signatures)
Dropped from mapping.py imports (no longer needed there):
hnswlib, numba.{njit,prange}, Numba*Warning, StandardScaler,
df_to_dict, to_vn, sparse_knn, warnings.filterwarnings(Numba...)
— numba warning filters now live in correlation.py alongside the jitted kernels.
test_golden_output.py: update hnswlib monkeypatch target from
samap.core.mapping → samap.core.projection (the _united_proj
call site moved). Golden fixture data unchanged.
Verified:
pytest tests/regression/ -m slow PASS (bit-identical)
pytest tests/unit/ tests/integration/ 66 passed, 18 skipped
ruff check src/ clean
The original _smart_expand materializes nnm^i sparse matrix powers to build hop-i rings, then trims each ring to a per-cell budget. This densifies geometrically at scale. The new _smart_expand_bfs does a per-cell, budget-capped BFS over the CSR adjacency arrays directly: numba-parallel over cells, O(budget × k) working set per cell, never materializing matrix powers.
Equivalence: exact match when budget >= reachable-within-NH-hops; under truncation, Jaccard ~0.99 on synthetic data (n=500, k=20, NH=3). The residual divergence comes from in-ring ranking (path-sum vs max-edge weight) and matpow's spurious self-loops — nnm^2 has a nonzero diagonal that survives ring subtraction, so matpow wastes ~1 budget slot per cell on self. BFS excludes self by construction and collects ~1 more useful neighbour.
Exposed as _smart_expand(..., legacy: bool). The default legacy=True keeps the matpow path so the golden regression passes byte-identically; flip the default once BFS is validated end-to-end.
Tests cover: exact-match regime, truncation regime (Jaccard > 0.9), structural invariants (budget, self-loop, binarized output), edge cases (zero budget, disconnected, empty graph), dispatch routing.
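The budget-capped BFS idea can be sketched in plain Python over CSR arrays — a simplified, unweighted, single-cell version of what the numba kernel does per cell (names hypothetical):

```python
import scipy.sparse as sp

def bfs_ring_expand(adj: sp.csr_matrix, seed: int, n_hops: int, budget: int):
    """Collect up to `budget` neighbours within n_hops of `seed`,
    breadth-first over the CSR arrays, excluding the seed itself."""
    indptr, indices = adj.indptr, adj.indices
    seen = {seed}                      # seed excluded by construction
    frontier, out = [seed], []
    for _ in range(n_hops):
        nxt = []
        for u in frontier:
            for v in indices[indptr[u]:indptr[u + 1]]:
                if v not in seen:
                    seen.add(v)
                    out.append(int(v))
                    nxt.append(v)
                    if len(out) == budget:
                        return out     # budget cap: stop immediately
        frontier = nxt
    return out

# chain graph 0–1–2–3–4: two hops from 0 reach {1, 2}
chain = sp.csr_matrix(
    ([1] * 8, ([0, 1, 1, 2, 2, 3, 3, 4], [1, 0, 2, 1, 3, 2, 4, 3])),
    shape=(5, 5),
)
print(bfs_ring_expand(chain, seed=0, n_hops=2, budget=10))  # [1, 2]
```

Unlike nnm^i, nothing here ever allocates beyond the visited set, and the seed can never re-enter via a length-2 cycle.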
…ase 2.2/2.3)
SAMAP.__init__ now accepts backend='auto'|'cpu'|'cuda', creates a Backend instance, and passes it down to _Samap_Iter. The iterator stores self._bk for use by hot-path functions once those rewrites land (call sites intentionally not changed yet — other agents own projection/coarsening/expand/correlation).
LIL removal for GPU compat: added utils.coo_to_csr_overwrite() — builds CSR from COO triplets with last-write-wins semantics (matches LIL fancy-index assignment, unlike standard COO sum-duplicates). Required for an exact golden match, since BLAST output and cross-species pairs can have duplicate (row, col) entries.
- homology.py: _calculate_blast_graph gnnm1/gnnm2 and _filter_gnnm symmetrization — use coo_to_csr_overwrite
- scores.py: convert_eggnog_to_homologs (plain COO sufficient — only the nonzero pattern matters downstream), CellTypeTriangles, GeneTriangles — use coo_to_csr_overwrite
Deliberate deviation from the task spec: COOBuilder's finalize() uses sum-duplicates semantics, which changes numerical output when duplicates exist. The dedup-helper approach preserves exact output (golden passes) and achieves the same GPU-compat goal (no LIL).
Verification:
- pytest tests/regression/ -m slow: PASS (golden match)
- pytest tests/unit/: 71 passed, 18 skipped (GPU)
- ruff check (my files): clean
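A minimal version of the last-write-wins helper — illustrative only; the real coo_to_csr_overwrite may differ in detail:

```python
import numpy as np
from scipy.sparse import csr_matrix

def coo_to_csr_overwrite(rows, cols, vals, shape):
    """CSR from COO triplets where the LAST duplicate wins, matching
    LIL fancy-index assignment (plain COO would sum duplicates)."""
    rows, cols = np.asarray(rows), np.asarray(cols)
    vals = np.asarray(vals)
    key = rows.astype(np.int64) * shape[1] + cols
    # first occurrence in the reversed array == last occurrence overall
    _, first_in_rev = np.unique(key[::-1], return_index=True)
    keep = len(key) - 1 - first_in_rev
    return csr_matrix((vals[keep], (rows[keep], cols[keep])), shape=shape)

m = coo_to_csr_overwrite([0, 0, 1], [0, 0, 1], [1.0, 5.0, 7.0], (2, 2))
print(m[0, 0])  # 5.0 — not the 6.0 that COO duplicate-summing would give
```

The dedup happens on the host before construction, so the resulting arrays can be handed to either scipy or cupyx.scipy.sparse without any LIL involvement.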
Eliminates the N² dense intermediate D = B @ nnm_internal.T by computing per-species-pair blocks D[a,b] = B[a,b] @ nnm_b.T and mutualising each pair independently. B is block-off-diagonal and nnm_internal is block-diagonal, so the per-pair decomposition is exact. Peak memory drops from O(N²) to O(chunk × N_b) per species pair.
- _compute_mutual_graph: core streaming helper; chunks over source-species rows, computes left = B_ab[chunk] @ nnm_b.T and right = nnm_a[chunk] @ B_ba.T, mutualises sqrt(left ⊙ right), accumulates into a COOBuilder
- _scale_by_corr: per-chunk correlation rescaling (row maxes are exact, since each chunk holds all column blocks for its rows)
- neigh_from_keys (coclustering) path: keeps the factored M @ M.T form; uses a precomputed M.T @ B_ba.T per partner (n_clusters × N_b, tiny), reused across chunks
- Preserves the original asymmetry: the 0.1 threshold applies only when no species uses coclustering
_mapper's signature is unchanged. Numerically bit-identical to the old path — golden regression passes.
Also: add RUF002/003/SIM108 to ruff ignores (math notation in docstrings; mathematical if/else is clearer than a ternary).
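The per-pair streaming step can be sketched with scipy — illustrative only; the real helper accumulates into a COOBuilder and applies thresholds:

```python
import numpy as np
import scipy.sparse as sp

def mutual_pair_block(B_ab, B_ba, nnm_a, nnm_b, chunk=256):
    """Stream sqrt((B_ab @ nnm_b.T) ⊙ (nnm_a @ B_ba.T)) in row chunks
    so the full n_a × n_b product is never held at once."""
    blocks = []
    for s in range(0, B_ab.shape[0], chunk):
        left = B_ab[s:s + chunk] @ nnm_b.T   # chunk's edges, smoothed on b
        right = nnm_a[s:s + chunk] @ B_ba.T  # reverse edges, smoothed on a
        # elementwise product keeps only mutual entries; sqrt = geometric mean
        blocks.append(left.multiply(right).sqrt())
    return sp.vstack(blocks).tocsr()

rng = np.random.default_rng(0)
B_ab = sp.random(40, 30, density=0.05, random_state=1, format="csr")
B_ba = sp.random(30, 40, density=0.05, random_state=2, format="csr")
nnm_a = sp.random(40, 40, density=0.1, random_state=3, format="csr")
nnm_b = sp.random(30, 30, density=0.1, random_state=4, format="csr")

streamed = mutual_pair_block(B_ab, B_ba, nnm_a, nnm_b, chunk=7)
full = (B_ab @ nnm_b.T).multiply(nnm_a @ B_ba.T).sqrt()
print(abs(streamed - full).max())  # 0.0 — chunking is exact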
… intermediate (Phase 3.1)
Precompose the projection G · diag(W/σ) · PCs so the per-iteration work is
a single SpMM (ss_i @ P) producing an N × npcs result directly, instead of
materialising the ~30%-dense N × G translated feature matrix Xtr.
Algebra
-------
Old path per species pair (i → s):
Xtr_is = ss_i @ G_is -- N_i × G_s, ~30% dense ← ELIMINATED
Xscaled = Xtr_is / σ_is (StandardScaler with_mean=False)
wpca = (Xscaled * W_s) @ PCs_s
New path:
σ_is² = diag(G_isᵀ · XtX_i · G_is) / n_i − (μ_i · G_is)² [quadratic form]
where XtX_i = ss_iᵀ @ ss_i and μ_i = ss_i.mean(0) are
precomputed once in _projection_precompute (iteration-invariant)
P_is = G_is · diag(W_s / σ_is) · PCs_s -- G_i × npcs, ~few MB
wpca = ss_i @ P_is -- ONE SpMM
The diag of the quadratic form is extracted as ((XtX @ G) ⊙ G).sum(0) —
one SpGEMM + one elementwise product, never forming G_s × G_s.
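The σ identity is easy to check numerically — a dense toy sketch with stand-in names (ss_i is sparse in SAMap, but the algebra is the same):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 30))       # stand-in for ss_i (n_i × G_i expression)
G = rng.random((30, 8))        # stand-in for G_is translation weights

XtX = X.T @ X                  # precomputed once per species
mu = X.mean(axis=0)            # precomputed once per species

# σ² = diag(Gᵀ·XtX·G)/n − (μ·G)², with the diagonal extracted as
# ((XtX @ G) ⊙ G).sum(0) — never forming the G_s × G_s matrix.
var_q = ((XtX @ G) * G).sum(axis=0) / X.shape[0] - (mu @ G) ** 2

# reference: biased (ddof=0) variance of the translated features,
# i.e. what StandardScaler(with_mean=False).scale_² is built from
var_ref = (X @ G).var(axis=0, ddof=0)
print(np.allclose(var_q, var_ref))  # True
```

Because XtX and μ are iteration-invariant, only the cheap (G-sized) part of this is recomputed per iteration.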
New public surface
------------------
_projection_precompute(sams, gns, bk) — build iteration-invariant state:
ss[sid], XtX[sid], mu_ss[sid], wpca_own[sid], M_own[sid], indexers.
Own-species projection ss_i @ PCs_i is fully iteration-invariant;
cached here, referenced (not recomputed) per iteration.
_compute_sigma(XtX, mu, G, n, bk) — σ via quadratic form.
Matches sklearn StandardScaler(with_mean=False).scale_ exactly:
biased variance (ddof=0), zero-variance columns → 1.0.
_mapping_window_fast(gnnm, precompute, K, pairwise) — per-iteration worker.
Backend-agnostic via samap.core._backend.Backend (xp/sp dispatch);
defaults to CPU. hnswlib kNN step stays on host (no GPU hnswlib).
Backward compat
---------------
_mapping_window(sams, gnnm, gns, K, pairwise) keeps its old signature as a
wrapper: builds precompute on-the-fly, delegates to _mapping_window_fast.
coarsening._mapper still calls _mapping_window unchanged — the
orchestration layer will swap in precompute caching in a later phase.
gnnm=None path (own-species bootstrap) unchanged.
Verification
------------
tests/unit/test_projection.py -- 11 tests:
· σ quadratic-form vs sklearn at rtol=1e-12 (4 seeds)
· σ zero-variance → 1.0 edge case
· wpca equivalence: 2- and 3-species × {pairwise, all-to-all} at rtol=1e-6
(3-species all-to-all is non-trivial: global ≠ per-pair normalisation)
· precompute is iteration-invariant (same dict, two different gnnms)
· wrapper ≡ fast path
pytest tests/regression/ -m slow PASS (golden bit-identical)
pytest tests/unit/ tests/integration/ 98 passed, 18 skipped
ruff check projection.py test_projection.py clean
… (Phase 3 finalization)
The precomposed feature-translation path from Phase 3.1 now actually pays
off: _projection_precompute runs ONCE in _Samap_Iter.__init__, and the
cached state (ss, XtX, mu_ss, wpca_own) is reused across all NUMITERS
iterations via _mapping_window_fast. Previously the backward-compat
wrapper rebuilt the precompute on every iteration.
Observed: golden regression test dropped from ~116s to ~89s (3-species,
3 iterations) — ~23% wall-clock reduction on this workload.
mapping.py
----------
_Samap_Iter.__init__ + build self._gns = concat(gns_dict.values())
+ build self._proj_cache = _projection_precompute(...)
_Samap_Iter.run() - gns rebuilt every call → reuse self._gns
+ pass proj_cache=self._proj_cache, bk=self._bk
to _mapper
imports + _projection_precompute from .projection
coarsening.py
-------------
_mapper + proj_cache: dict | None = None
+ bk: Backend | None = None
· if proj_cache is provided, route to
_mapping_window_fast instead of the wrapper
· thread bk → _compute_mutual_graph, _smart_expand
_compute_mutual_graph + bk: Backend | None = None
· replaces hardcoded Backend("cpu") — now honours
upstream backend choice
imports + _mapping_window_fast from .projection
expand.py
---------
_smart_expand + bk kwarg (unused; threaded for Phase 4 GPU)
+ TODO: flip legacy default → BFS after golden regen
The _mapping_window wrapper stays as a working deprecated shim — still
referenced by test_projection.py parity test and the proj_cache=None
fallback in _mapper.
Verified
--------
pytest tests/regression/ -m slow PASS (bit-identical)
pytest tests/unit/ 92 passed, 18 skipped
ruff check mapping/coarsening/expand/projection.py clean
Two independent changes landing together:
1. **Dict-free numba kernel** (GPU-compat prep)
- _refine_corr_kernel → _corr_kernel: species lookup now uses integer
sp_starts/sp_lens arrays instead of a Python dict inside the hot loop
- String species IDs → integer IDs (indexed by np.unique position)
- np.append → preallocated np.empty + sliced assignment (cleaner for CUDA)
- corr_mode string comparison moved out of kernel (pearson: bool arg)
- Numerically bit-identical to original (same Pearson formula, same Xi)
- Contiguity sanity check in _refine_corr_parallel guards the assumption
2. **Batched Xavg computation** (memory win)
- New _compute_pair_corrs dispatches materialised (batch_size=None) vs
streaming (batch_size=int)
- Streaming: process pairs in batches; per batch compute Xavg only for
the ≤ 2·batch_size genes actually referenced (nnms @ Xs[:, needed]),
correlate, discard
- Peak memory: O(N × G_active) → O(N × 2·batch_size)
- Uses searchsorted on np.unique output to map global→local gene cols
- Some recomputation when a gene spans batches; cheap (single-column SpMV)
- Default batch_size=None → materialised path → golden passes unchanged
_refine_corr and _refine_corr_parallel both gain batch_size kwarg
(default None, backward-compatible; mapping.py unchanged).
14 equivalence tests: kernel vs pure-numpy reference (Pearson+Xi),
streaming vs materialised across batch_size ∈ {1,7,32,50,64,200,256,500,10k},
2-species and 3-species (shuffled pairs), empty-pairs edge case.
All match to rtol=1e-10.
Adds randomized_svd_implicit_center() and wires it into _pca_with_sparse
via svd_solver='randomized'. ARPACK remains the default — no behaviour
change for existing callers.
Algorithm: Halko-Martinsson-Tropp randomized range-finding adapted for
implicit mean-centering. Every (X - 1·μ) @ M is computed as
X @ M - 1·(μ @ M) — one SpMM plus a rank-1 broadcast correction, never
materialising the dense centered matrix. With 4 power iterations this
replaces ARPACK's hundreds of serial matvecs with ~9 block SpMM passes.
GPU-friendly via Backend dispatch: uses bk.xp/bk.sp throughout, calls
bk.asfortran_if_gpu before each SpMM (cuSPARSE prefers F-order RHS).
Random sketch is seeded for determinism; svd_flip applied for sign
consistency with ARPACK path.
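The implicit-centering trick in isolation — a compact Halko-style sketch (hypothetical helper, not the library function; omits svd_flip and Backend dispatch):

```python
import numpy as np

def randomized_svd_centered(X, mu, k, n_power=4, oversample=10, seed=0):
    """Top-k SVD of (X - 1·μ) without forming the dense centered matrix:
    every product is X @ M minus a rank-1 correction."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    def cdot(M):    # (X - 1·μ) @ M  =  X @ M − 1·(μ @ M)
        return X @ M - np.outer(np.ones(n), mu @ M)

    def cdot_t(M):  # (X - 1·μ).T @ M  =  X.T @ M − μ·(1ᵀ M)
        return X.T @ M - np.outer(mu, M.sum(axis=0))

    Y = cdot(rng.standard_normal((X.shape[1], k + oversample)))
    for _ in range(n_power):           # power iterations sharpen the sketch
        Q, _ = np.linalg.qr(Y)
        Y = cdot(cdot_t(Q))
    Q, _ = np.linalg.qr(Y)
    Ub, s, Vt = np.linalg.svd(cdot_t(Q).T, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# sanity check on a low-rank matrix: singular values match dense centering
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))
mu = X.mean(axis=0)
_, s, _ = randomized_svd_centered(X, mu, k=4)
s_ref = np.linalg.svd(X - mu, compute_uv=False)[:4]
print(np.allclose(s, s_ref))  # True
```

With a sparse X the two `cdot` helpers are exactly the "one SpMM plus a rank-1 broadcast correction" the commit describes.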
Tests (tests/unit/test_pca.py, 14 pass + 2 GPU-skipped):
- ARPACK vs randomized: explained variance <1% rel err on top-10,
subspace angles <0.1 rad within signal rank, reconstruction <5%
- Both vs sklearn dense PCA: variance and subspace agreement
- API: arpack-is-default, mu_axis=1 rejected, seeded determinism,
n_power monotonicity
- GPU tests behind HAS_CUPY skipif
Verified: unit tests pass; golden regression passes (arpack default
untouched); ruff clean.
- New src/samap/core/knn.py: approximate_knn() dispatches between CPU HNSW (hnswlib) and GPU brute-force (FAISS GpuIndexFlatIP) based on Backend.gpu. faiss/faiss-gpu are optional; graceful fallback to hnswlib with a warning if the GPU backend has no faiss-gpu.
- _hnswlib_knn: extracted from projection._united_proj with identical params (ef=200, M=48, space='cosine'). num_threads exposed for golden-test determinism.
- _faiss_gpu_knn: L2-normalise on GPU, GpuIndexFlatIP (exact), convert IP → cosine distance. Results returned on host. TODO note for GpuIndexIVFFlat at >1M points.
- Backend: add faiss_gpu_resources() lazy cache (StandardGpuResources is expensive to create, safe to reuse).
- projection._united_proj: now calls approximate_knn; bk parameter threaded through from _mapping_window_fast. Also replaced LIL fancy-index with COO construction (equivalent: HNSW returns distinct neighbours, so no duplicates).
- Golden test monkeypatch retargeted: projection.hnswlib → knn.hnswlib.
- Tests: 9 CPU pass (recall vs brute-force >95%, output format, single-thread determinism), 6 GPU-gated (FAISS exact vs brute-force, cupy input, resource cache).
- Golden regression: passes bit-exact.
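The IP → cosine conversion works because inner product on L2-normalised vectors equals cosine similarity — a brute-force reference sketch in numpy (toy code, not the FAISS path):

```python
import numpy as np

def cosine_knn_via_ip(X, queries, k):
    """Exact kNN in cosine distance via inner product on L2-normalised
    rows — what an IP index (e.g. GpuIndexFlatIP) effectively computes."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Qn = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    ip = Qn @ Xn.T                    # inner product == cosine similarity
    idx = np.argsort(-ip, axis=1)[:, :k]
    dist = 1.0 - np.take_along_axis(ip, idx, axis=1)  # cosine distance
    return idx, dist

X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
idx, dist = cosine_knn_via_ip(X, X[:1], k=2)
print(idx[0])  # the query itself first, then the diagonal vector
```

A GPU index returns similarities rather than distances, so the `1 - ip` conversion on the host is all that's needed to match hnswlib's 'cosine' space output.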
…4.3)
Three pieces:
1. _replace_vectorized(X, xi, yi, bk, batch_size=None)
- Backend-agnostic per-pair Pearson over dense rows: gather 2·batch rows,
centre, dot, normalise — one reduction kernel on GPU via bk.xp
- Upcasts to float64 internally (wPCA often stored float32)
- Optional batch_size caps working set at 2·batch_size·d floats; default
None processes all pairs in one shot
- Zero-variance rows → NaN (matches _replace; callers already handle)
- np.errstate(invalid='ignore') silences the expected 0/0 warning
2. replace_corr(X, xi, yi, bk=None, batch_size=None) dispatcher
- bk=None or not bk.gpu → numba _replace (CPU fast path: fused loop,
no (n_pairs × d) allocation)
- bk.gpu → _replace_vectorized (one large kernel launch)
- Drop-in for callers that have a Backend; existing direct _replace
callers unchanged
3. CUDA-porting notes for _corr_kernel (comment block)
- Blockers enumerated: dynamic alloc (np.zeros/empty inside kernel),
fancy-index scatter, .mean/.std method calls, _xicorr's argsort
- Prescribed fix: don't densify — iterate CSC nonzeros directly,
test each index against the two species ranges, accumulate Pearson
sums in scalar registers (two-pass: mean then centred products)
- Result: O(1) per-thread state, no shared memory needed
- Xi correlation has no @cuda.jit equivalent; gate GPU kernel to
pearson=True, fall back to CPU for Xi
12 new tests (26 total in test_correlation.py):
- _replace_vectorized vs np.corrcoef and vs numba _replace (rtol=1e-12)
- Batched vs full across batch_size ∈ {1,7,100,500,2000} — bit-identical
- float32 input upcast; zero-variance NaN equivalence with numba
- Dispatcher: CPU backend → numba (bit-identical), bk=None → numba,
mock GPU backend → vectorised path
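The vectorised per-pair Pearson reduces to gather → centre → dot — a numpy sketch of the shape of the computation (names hypothetical; the real version dispatches through bk.xp):

```python
import numpy as np

def pairwise_pearson(X, xi, yi, batch_size=None):
    """Pearson r between row pairs (X[xi[k]], X[yi[k]]), batched so the
    gathered working set stays at ~2·batch_size·d floats."""
    X = np.asarray(X, dtype=np.float64)   # upcast float32 wPCA internally
    n = len(xi)
    out = np.empty(n)
    step = batch_size or n
    for s in range(0, n, step):
        a = X[xi[s:s + step]]
        b = X[yi[s:s + step]]
        a = a - a.mean(axis=1, keepdims=True)
        b = b - b.mean(axis=1, keepdims=True)
        num = (a * b).sum(axis=1)
        den = np.sqrt((a * a).sum(axis=1) * (b * b).sum(axis=1))
        with np.errstate(invalid="ignore", divide="ignore"):
            out[s:s + step] = num / den   # zero-variance rows → NaN
    return out

rng = np.random.default_rng(0)
X = rng.random((20, 8)).astype(np.float32)
xi, yi = np.array([0, 3, 5]), np.array([1, 4, 6])
r = pairwise_pearson(X, xi, yi, batch_size=2)
ref = [np.corrcoef(X[i].astype(np.float64), X[j].astype(np.float64))[0, 1]
       for i, j in zip(xi, yi)]
print(np.allclose(r, ref))  # True
```

On GPU, each batch becomes a single fused reduction over the gathered rows, which is why the one-big-kernel path beats a per-pair loop there.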
Three independent benchmarks, one per Phase-3 optimization target:
expand — matpow vs BFS _smart_expand (legacy= toggle)
projection — per-iter _mapping_window rebuild vs precompute-once +
_mapping_window_fast (proj_cache toggle)
correlation — materialised Xavg vs streaming _compute_pair_corrs
(batch_size toggle)
Phase-level (direct function calls) because the toggles aren't plumbed
through _mapper yet — gives clean per-optimization attribution.
bench_samap.py: parameterised scale sweep (1k/3k/10k/30k cells),
tracemalloc peak memory, CSV output + summary table. plot_bench.py:
log-log scaling curves (time + memory, per phase).
Verified on macOS CPU, --max-cells 3000:
expand: 3.5x→4.8x speedup (grows with n — matpow densification)
projection: ~2x speedup, ~2x memory (precompute reuse)
correlation: 4.3x memory win; streaming overhead at small n, breaks
even when materialised Xavg approaches RAM limits
…e 5 finalization)
All the optimized paths landed in Phases 3-4 now default-on. Golden
regenerated to reflect BFS neighbourhood expansion (slightly different
marginal neighbours — ~1% edge difference, no self-loop artefact).
Default flips
-------------
_smart_expand: legacy=True → legacy=False (BFS)
_refine_corr: batch_size=None → batch_size=512
_refine_corr_parallel: batch_size=None → batch_size=512
BFS is a pure win (~5× at 3k cells, memory-bounded). Streaming correlation
is a memory-for-speed trade: ~4× less peak memory, ~3-5× slower on toy
data (<10k cells) where the full Xavg would fit anyway. Tuned for
million-cell scale where Xavg would OOM. Pass batch_size=None explicitly
for the fast materialised path on small data.
Golden regenerated
------------------
tests/regression/fixtures/golden_3species.npz — regenerated with BFS +
batch_size=512 defaults. BFS changes the neighbour selection; batched
correlation is bit-identical to materialised.
New files
---------
docs/performance.md — memory model, backend selection, tuning, scaling
estimates, benchmark results
CHANGELOG.md — 3.0.0 section: breaking (sc-sam vendored, BFS
default), added (GPU backend, N²→N-linear rewrites, randomized SVD,
benchmark suite), fixed (Phase 0.2 bugs), changed (module split,
defaults)
Version bump
------------
pyproject.toml: 2.0.2 → 3.0.0
Verified
--------
pytest tests/regression/ -m slow PASS (new golden)
pytest tests/unit/ 146 passed, 26 skipped
pytest tests/integration/ 6 passed (~160s — streaming
overhead at toy scale)
ruff check src/ clean
benchmarks/bench_samap.py --max-cells 1000 ran cleanly; summary matches docs
…emory
Changes the default batch_size from 512 to 'auto'. The streaming path is necessary at million-cell scale but 3-5× slower on toy data (~3k cells), where the full Xavg fits in memory — auto-selection picks the right path.
- _resolve_batch_size(batch_size, nnms, Xs, mem_threshold_gb=2.0): estimates the materialised Xavg size as n_cells × n_genes × out_density × 12 bytes (CSC data + indices), where out_density ≈ 1 − (1 − expr_density)^k (kNN-smoothed sparse expression fills in; an output entry is zero iff all k contributing inputs are zero). Under threshold → None (materialise); over → 512 (stream). An explicit batch_size (int or None) passes through unchanged. The decision is logged at INFO.
- _refine_corr / _refine_corr_parallel: batch_size default 512 → 'auto'; new correlation_mem_threshold param (default 2.0 GB). Resolution happens in _refine_corr_parallel right after nnms/Xs are built (shapes known).
- mapping.py: correlation_mem_threshold exposed on both SAMAP.refine_homology_graph and _Samap_Iter.refine_homology_graph.
- docs/performance.md: documents auto-selection and the tuning knob.
5 new tests (31 total): explicit passthrough (None/32/9999 untouched); toy 500-cell data → materialise; mock million-cell data → stream (uses a duck-typed .shape/.nnz mock, no real alloc); threshold boundary flip; zero-sized degenerate inputs → materialise.
Golden passes (auto selects materialise on toy data → same numerics as batch_size=None). Integration tests are back to materialised speed.
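The sizing heuristic in isolation — a hypothetical stand-alone version taking scalar inputs rather than the real nnms/Xs objects:

```python
def resolve_batch_size(n_cells, n_genes, expr_density, k, threshold_gb=2.0):
    """Pick materialise (None) vs stream (512) from an estimate of the
    materialised Xavg footprint."""
    # an output entry is zero only if all k contributing inputs are zero
    out_density = 1.0 - (1.0 - expr_density) ** k
    est_bytes = n_cells * n_genes * out_density * 12  # CSC data + indices
    return None if est_bytes <= threshold_gb * 1024 ** 3 else 512

print(resolve_batch_size(3_000, 20_000, 0.1, 20))      # None — toy scale
print(resolve_batch_size(1_000_000, 20_000, 0.1, 20))  # 512 — would OOM
```

Note how quickly out_density saturates: at 10% expression density and k=20 neighbours, the smoothed matrix is already ~88% dense, which is why cell count dominates the decision.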
Summary
Eliminates all N² memory intermediates from the SAMap iteration loop, adds optional GPU support via cupy/FAISS/rapids-singlecell, and vendors SAM (sc-sam) into the package.
Estimated scaling improvement: ~300-500k cells (v2.x) → ~5-10M cells (v3.0 on 256GB RAM + A100).
Breaking changes
- sc-sam removed as a dependency. SAM is now samap.sam.SAM (vendored subset, ~1300 LOC).
- _smart_expand now uses BFS instead of matrix powers by default — ~1% different neighbor selection (no self-loop artifact). legacy=True available for exact reproduction.
Added
N² → N-linear memory rewrites:
- projection.py: Xtr = X @ gnnm_corr (N×G, ~30% dense) → precomposed G·diag(W/σ)·PCs; σ via quadratic form on precomputed XᵀX
- coarsening.py: D = B @ nnm_internal.T (N², ~0.5% dense)
- expand.py: nnm^(i+1) matrix powers
- correlation.py: Xavg = nnms @ Xs (N×G_active, 10-60% dense)
GPU support (pip install sc-samap[gpu]):
- Backend class dispatches numpy/scipy ↔ cupy/cupyx.scipy.sparse
- SAMAP(backend="auto"|"cpu"|"cuda") — auto-detects CUDA
Infrastructure:
- docs/performance.md: memory model, tuning, scaling estimates
- mapping.py split 1856 → 831 LOC into 5 focused modules
Fixed
Pre-existing bugs discovered and fixed:
- proj random-walk (result discarded at the next line — pure compute waste)
- thr kwarg misrouted to the p-value threshold in FunctionalEnrichment
- Stale setup.py coexisting with pyproject.toml
- __version__ mismatch vs pyproject.toml
- Deprecated .A attribute
- _q helper duplicated 4×
- Broken samalg.gui.SAMGUI import
- obsm["X_processed"] stored every iteration, never read (memory leak)
Benchmark (macOS CPU, 3k cells)
End-to-end golden regression: 116s → 89s (−23%) from precompute reuse alone.
Not in this PR
- _corr_kernel CUDA kernel (numba is dict-free and structurally portable; porting notes in module)
- GPU tests in CI (gated behind @pytest.mark.skipif)
See CHANGELOG.md and docs/performance.md for full details.