# annslicer

Out-of-core sharding and merging of large AnnData files with minimal memory usage.

Large single-cell datasets stored as `.h5ad` or `.zarr` files can easily exceed available RAM. annslicer slices them into manageable shards, and merges them back, without loading full matrices into memory. It consolidates best practices from anndata into a simple command-line tool, with a few small speed improvements for random shuffling.
```bash
annslicer slice input.h5ad output_prefix
annslicer merge output.h5ad shard_0.h5ad shard_1.h5ad
```

## Features

- Shards and merges `X`, all `layers`, `obs`, `var`, `obsm`, and `uns`
- Handles both dense and sparse (CSR) matrices
- Constant, low memory footprint regardless of file size
- Input supports both `.h5ad` and `.zarr` formats for slicing
- Merge output supports both `.h5ad` and `.zarr` formats
- Optional cell shuffling (`--shuffle`) for representative shards without loading the full matrix
- Simple CLI and Python API
## Installation

```bash
pip install annslicer
```

For Zarr input/output support (optional):

```bash
pip install annslicer[zarr]
```

## Usage

annslicer provides two subcommands: `slice` and `merge`.
### `slice`

```bash
annslicer slice input.h5ad output_prefix --size 10000
```

Both `.h5ad` and `.zarr` inputs are supported.
| Argument | Description |
|---|---|
| `input.h5ad` or `input.zarr` | Path to the source file |
| `output_prefix` | Prefix for output files (e.g. `atlas` → `atlas_shard_0.h5ad`, …) |
| `--size N` | Number of cells per shard (default: 10000) |
| `--shuffle` | Randomly assign cells to shards (each shard is a representative draw) |
| `--seed N` | Random seed for reproducible shuffling (requires `--shuffle`) |
| `--compression FILTER` | HDF5 compression filter for shard files (e.g. `gzip`, `lzf`); default: no compression |
Example: basic sharding

```bash
annslicer slice /data/large_atlas.h5ad /outputs/atlas --size 20000
```

Example: shuffled sharding from a large `.h5ad`

```bash
annslicer slice /data/large_atlas.h5ad /outputs/atlas --size 10000 --shuffle --seed 0
```

Example: gzip-compressed shards

```bash
annslicer slice /data/large_atlas.h5ad /outputs/atlas --size 10000 --compression gzip
```

Each run produces `atlas_shard_0.h5ad`, `atlas_shard_1.h5ad`, …
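The number of shards a run produces follows directly from the cell count and `--size`; a quick sanity check (the counts here are illustrative, not from a real file):

```python
import math

n_obs = 200_000     # cells in the input file (illustrative)
shard_size = 20_000  # value passed as --size

n_shards = math.ceil(n_obs / shard_size)  # last shard may be smaller
print(n_shards)  # 10
```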
### `merge`

```bash
annslicer merge output.h5ad shard_0.h5ad shard_1.h5ad shard_2.h5ad
```

Output format is inferred from the extension; use `.zarr` for Zarr output (requires `annslicer[zarr]`):

```bash
annslicer merge output.zarr shard_0.h5ad shard_1.h5ad shard_2.h5ad
```

Input files can also be specified as glob patterns (expanded lexicographically):

```bash
annslicer merge output.h5ad "shards/atlas_shard_*.h5ad"
```

| Argument | Description |
|---|---|
| `output_file` | Path for the merged output file (`.h5ad` or `.zarr`) |
| `input_files` | One or more shard paths or glob patterns, in order |
| `--join {inner,outer}` | How to join `var` (gene) axes when files differ (default: `outer`) |
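The lexicographic expansion of quoted glob patterns matches what Python's standard library produces; a sketch (not annslicer's actual code), including the usual caveat about unpadded shard indices:

```python
import glob

# Deterministic, lexicographic expansion of a quoted pattern (sketch).
paths = sorted(glob.glob("shards/atlas_shard_*.h5ad"))

# Caveat: lexicographic order is not numeric order for unpadded indices.
names = ["atlas_shard_2.h5ad", "atlas_shard_10.h5ad", "atlas_shard_1.h5ad"]
print(sorted(names))
# ['atlas_shard_1.h5ad', 'atlas_shard_10.h5ad', 'atlas_shard_2.h5ad']
```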
When shards have different gene sets, `--join outer` (the default) takes the union of all genes and fills missing entries with zeros; `--join inner` keeps only genes present in every shard. Layers absent from any shard are always dropped.
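The gene-axis semantics can be illustrated with plain set operations (gene names here are illustrative; this sketches the behaviour, not annslicer's implementation):

```python
# Gene sets of two hypothetical shards.
genes_a = ["g1", "g2", "g3"]
genes_b = ["g2", "g3", "g4"]

outer = sorted(set(genes_a) | set(genes_b))  # union; absent genes filled with 0
inner = sorted(set(genes_a) & set(genes_b))  # genes present in every shard

print(outer)  # ['g1', 'g2', 'g3', 'g4']
print(inner)  # ['g2', 'g3']
```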
### Global flags

| Flag | Description |
|---|---|
| `--debug` | Enable verbose debug-level logging |
## Python API

```python
from annslicer import shard_h5ad, merge_out_of_core

# Basic sharding (h5ad or zarr input)
shard_h5ad("large_atlas.h5ad", "atlas", shard_size=20000)
shard_h5ad("large_atlas.zarr", "atlas", shard_size=20000)  # requires annslicer[zarr]

# Shuffled sharding: cells are randomly distributed across shards
shard_h5ad("large_atlas.h5ad", "atlas", shard_size=20000, shuffle=True, seed=0)

# Gzip-compressed shards: smaller files at the cost of write speed
shard_h5ad("large_atlas.h5ad", "atlas", shard_size=20000, compression="gzip")

# Custom output filenames: provide explicit paths instead of auto-generated names
shard_h5ad(
    "large_atlas.h5ad",
    "atlas",  # ignored when output_filenames is provided
    shard_size=20000,
    output_filenames=["batch_0.h5ad", "batch_1.h5ad", "batch_2.h5ad"],
)

# Merge shards back into one file (identical-var fast path used automatically)
merge_out_of_core(["atlas_shard_0.h5ad", "atlas_shard_1.h5ad"], "merged.h5ad")

# Merge shards with different gene sets: outer join (union, fills absent genes with 0)
merge_out_of_core(["shard_a.h5ad", "shard_b.h5ad"], "merged.h5ad", join="outer")

# Merge shards with different gene sets: inner join (intersection only)
merge_out_of_core(["shard_a.h5ad", "shard_b.h5ad"], "merged.h5ad", join="inner")
```

## How it works

### Slicing

- Opens the input file (backed `AnnData` for `.h5ad`; `anndata.io.sparse_dataset` for `.zarr`).
- If `shuffle=True`, generates a global cell permutation upfront using `numpy.random.default_rng`.
- For each shard, reads only the relevant rows from `X` and each layer via sorted fancy indexing; no full matrix is ever loaded into memory.
- When shuffling, rows are read in sorted index order (maximising sequential I/O) and then reordered in memory to the desired shuffled order.
- Reassembles a valid `AnnData` object per shard and writes it to disk.
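The sorted-read-then-reorder step for shuffled shards can be sketched with NumPy (an in-memory stand-in for the on-disk matrix; illustrative, not the library's code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(100).reshape(100, 1)  # stand-in for the on-disk matrix (row i holds i)

perm = rng.permutation(100)         # global cell permutation
want = perm[:10]                    # rows destined for the first shard, shuffled order

order = np.argsort(want)
block = X[np.sort(want)]            # single ascending-index read: sequential-friendly
shard = block[np.argsort(order)]    # in-memory reorder back to the shuffled order

assert (shard[:, 0] == want).all()  # identical result to a direct X[want] read
```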
### Merging

- Reads `obs`, `var`, and `uns` from all shards to build a skeleton output file.
- Computes the merged `var` index: union (outer join) or intersection (inner join) of gene sets across all shards. If every shard shares an identical `var`, remapping is skipped entirely (fast path).
- Scans shards to calculate total non-zero sizes for pre-allocation (for an inner join, entries for excluded genes are filtered during the scan).
- Streams `X`, layers, and `obsm` data shard by shard directly into the pre-allocated output arrays, remapping column indices on the fly where needed.
- Layers absent from any shard are dropped so every cell has consistent layer coverage.
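The pre-allocate-then-stream idea for CSR data can be sketched in a few lines. This uses toy in-memory matrices and made-up gene names; annslicer applies the same pattern against on-disk arrays:

```python
import numpy as np
import scipy.sparse as sp

# Two toy CSR shards over different gene sets; outer join -> union of genes.
shard_genes = [["g1", "g2"], ["g2", "g3"]]
shards = [
    sp.csr_matrix(np.array([[1, 0], [0, 2], [3, 0]])),
    sp.csr_matrix(np.array([[4, 5], [0, 6]])),
]

merged_genes = sorted({g for gs in shard_genes for g in gs})  # ['g1', 'g2', 'g3']
col_of = {g: j for j, g in enumerate(merged_genes)}

n_rows = sum(s.shape[0] for s in shards)
nnz = sum(s.nnz for s in shards)

# Pre-allocate the output CSR buffers once, then stream each shard in.
data = np.empty(nnz, dtype=np.int64)
indices = np.empty(nnz, dtype=np.int64)
indptr = np.zeros(n_rows + 1, dtype=np.int64)

pos = row = 0
for genes, s in zip(shard_genes, shards):
    remap = np.array([col_of[g] for g in genes])    # shard column -> merged column
    data[pos:pos + s.nnz] = s.data
    indices[pos:pos + s.nnz] = remap[s.indices]     # remap column ids on the fly
    indptr[row + 1:row + 1 + s.shape[0]] = pos + s.indptr[1:]
    pos += s.nnz
    row += s.shape[0]

merged = sp.csr_matrix((data, indices, indptr),
                       shape=(n_rows, len(merged_genes)))
print(merged.toarray())
# [[1 0 0]
#  [0 2 0]
#  [3 0 0]
#  [0 4 5]
#  [0 0 6]]
```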
**Note:** CSC (compressed sparse column) matrices are not supported for out-of-core row slicing. Convert to CSR before sharding.
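The conversion itself is a single SciPy call; with anndata you would assign the converted matrix back to `adata.X` and rewrite the file before sharding (a scipy-only sketch; converting this way requires the matrix to fit in memory):

```python
import numpy as np
import scipy.sparse as sp

X_csc = sp.csc_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))
X_csr = X_csc.tocsr()  # row-sliceable format suitable for sharding

print(X_csr.format)  # csr
```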
## Benchmarks

Benchmarks were run on a dummy sparse AnnData object with 200k cells and 10k genes.
**`.h5ad` input**

| Slicing method | Mean runtime (s) | Peak memory (MB) |
|---|---|---|
| `annslicer slice` | 0.584 | 211.4 |
| `anndata` backed | 0.601 | 203.7 |
| `annslicer slice --shuffle` | 1.731 | 221.8 |
| `anndata` backed with shuffle | 3.830 | 209.1 |
**`.zarr` input**

| Slicing method | Mean runtime (s) | Peak memory (MB) |
|---|---|---|
| `annslicer slice` | 1.050 | 62.1 |
| `anndata` backed | 0.799 | 54.4 |
| `annslicer slice --shuffle` | 5.544 | 142.9 |
| `anndata` backed with shuffle | 6.591 | 151.4 |
Based on these benchmarks, we recommend `annslicer slice --shuffle` on an `.h5ad` file for producing randomly shuffled data shards.
## License

BSD 3-Clause
