annslicer

Out-of-core sharding and merging of large AnnData files with minimal memory usage.

Large single-cell datasets stored as .h5ad or .zarr files can easily exceed available RAM. annslicer slices them into manageable shards, and merges them back, without loading full matrices into memory. It consolidates anndata best practices into a simple command-line tool and adds a few small speed improvements for randomly shuffled sharding.

annslicer slice input.h5ad output_prefix

annslicer merge output.h5ad shard_0.h5ad shard_1.h5ad

Features

Shards and merges X, all layers, obs, var, obsm, and uns
Handles both dense and sparse (CSR) matrices
Constant, low memory footprint regardless of file size
Input supports both .h5ad and .zarr formats for slicing
Merge output supports both .h5ad and .zarr formats
Optional cell shuffling (--shuffle) for representative shards without loading the full matrix
Simple CLI and Python API

Installation

pip install annslicer

For Zarr input/output support (optional):

pip install "annslicer[zarr]"

CLI Usage

annslicer provides two subcommands: slice and merge.

Sharding a large file

annslicer slice input.h5ad output_prefix --size 10000

Both .h5ad and .zarr inputs are supported.

| Argument | Description |
| --- | --- |
| `input.h5ad` or `input.zarr` | Path to the source file |
| `output_prefix` | Prefix for output files (e.g. `atlas` → `atlas_shard_0.h5ad`, `atlas_shard_1.h5ad`, …) |
| `--size N` | Number of cells per shard (default: `10000`) |
| `--shuffle` | Randomly assign cells to shards (each shard is a representative draw) |
| `--seed N` | Random seed for reproducible shuffling (requires `--shuffle`) |
| `--compression FILTER` | HDF5 compression filter for shard files (e.g. `gzip`, `lzf`); default: no compression |

Example — basic sharding:

annslicer slice /data/large_atlas.h5ad /outputs/atlas --size 20000

Example — shuffled sharding from a large h5ad:

annslicer slice /data/large_atlas.h5ad /outputs/atlas --size 10000 --shuffle --seed 0

Example — gzip-compressed shards:

annslicer slice /data/large_atlas.h5ad /outputs/atlas --size 10000 --compression gzip

Produces: atlas_shard_0.h5ad, atlas_shard_1.h5ad, …
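
To make the relationship between `--size` and the number of output files concrete, here is a minimal sketch of a shard plan. `shard_plan` is a hypothetical helper, not part of the annslicer API; the file naming follows the "Produces" line above, and the assumption that the final shard simply holds the remaining cells is ours.

```python
import math


def shard_plan(n_cells: int, shard_size: int, prefix: str) -> list[tuple[str, int, int]]:
    """Return (filename, start_row, stop_row) per shard; stop_row is exclusive."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    n_shards = math.ceil(n_cells / shard_size)
    plan = []
    for i in range(n_shards):
        start = i * shard_size
        # Assumption: the last shard is shorter when n_cells is not a multiple of shard_size.
        stop = min(start + shard_size, n_cells)
        plan.append((f"{prefix}_shard_{i}.h5ad", start, stop))
    return plan


plan = shard_plan(n_cells=45_000, shard_size=20_000, prefix="atlas")
for name, start, stop in plan:
    print(name, start, stop)
```

With 45,000 cells and `--size 20000`, this plan has three entries, the last covering rows 40000 to 45000.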

Merging shards back into one file

annslicer merge output.h5ad shard_0.h5ad shard_1.h5ad shard_2.h5ad

Output format is inferred from the extension — use .zarr for Zarr output (requires annslicer[zarr]):

annslicer merge output.zarr shard_0.h5ad shard_1.h5ad shard_2.h5ad

Input files can also be specified as glob patterns (expanded lexicographically):

annslicer merge output.h5ad "shards/atlas_shard_*.h5ad"
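
One consequence of lexicographic expansion is worth checking when there are more than ten shards with unpadded indices: `atlas_shard_10.h5ad` sorts before `atlas_shard_2.h5ad`. The sketch below contrasts the two orderings; `natural_key` is a hypothetical helper written for this illustration, not something annslicer provides.

```python
import re


def natural_key(path: str) -> list:
    """Split a path into text and integer chunks so 'shard_10' sorts after 'shard_2'."""
    # re.split with a capture group alternates text, digits, text, ...
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", path)]


paths = [f"atlas_shard_{i}.h5ad" for i in (0, 1, 2, 10, 11)]

lexicographic = sorted(paths)              # plain string order, as a glob would expand
natural = sorted(paths, key=natural_key)   # numeric order of the shard index

print(lexicographic)
print(natural)
```

If the cell order of the merged file matters, one option is to list the shard paths explicitly, or to sort them this way in Python and pass the list to `merge_out_of_core`.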

| Argument | Description |
| --- | --- |
| `output_file` | Path for the merged output file (`.h5ad` or `.zarr`) |
| `input_files` | One or more shard paths or glob patterns, in order |
| `--join {inner,outer}` | How to join var (gene) axes when files differ (default: `outer`) |

When shards have different gene sets, --join outer (default) takes the union of all genes and fills missing entries with zeros; --join inner keeps only genes present in every shard. Layers absent from any shard are always dropped.
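
The two join modes can be illustrated on plain Python lists. This is only a sketch of the semantics described above: `merged_var` and `align_row` are hypothetical names, and the first-seen ordering of genes in the merged result is an assumption, not documented annslicer behaviour.

```python
def merged_var(shard_genes: list[list[str]], join: str = "outer") -> list[str]:
    """Union (outer) or intersection (inner) of gene names, in first-seen order."""
    seen: dict[str, None] = {}
    for genes in shard_genes:
        for g in genes:
            seen.setdefault(g, None)
    if join == "outer":
        return list(seen)
    if join == "inner":
        common = set(shard_genes[0]).intersection(*shard_genes[1:])
        return [g for g in seen if g in common]
    raise ValueError(f"unknown join: {join!r}")


def align_row(values: dict[str, float], var: list[str]) -> list[float]:
    """Place one cell's values into the merged column order, filling gaps with 0."""
    return [values.get(g, 0.0) for g in var]


shard_a = ["GeneA", "GeneB", "GeneC"]
shard_b = ["GeneB", "GeneC", "GeneD"]

outer = merged_var([shard_a, shard_b], "outer")  # all four genes
inner = merged_var([shard_a, shard_b], "inner")  # only GeneB and GeneC

cell_from_a = {"GeneA": 3.0, "GeneB": 1.0, "GeneC": 2.0}
print(align_row(cell_from_a, outer))  # GeneD is absent from shard A, so it is filled with 0
print(align_row(cell_from_a, inner))  # GeneA is dropped
```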

Global options

| Flag | Description |
| --- | --- |
| `--debug` | Enable verbose debug-level logging |

Python API

from annslicer import shard_h5ad, merge_out_of_core

# Basic sharding (h5ad or zarr input)
shard_h5ad("large_atlas.h5ad", "atlas", shard_size=20000)
shard_h5ad("large_atlas.zarr", "atlas", shard_size=20000)  # requires annslicer[zarr]

# Shuffled sharding — cells are randomly distributed across shards
shard_h5ad("large_atlas.h5ad", "atlas", shard_size=20000, shuffle=True, seed=0)

# Gzip-compressed shards — smaller files at the cost of write speed
shard_h5ad("large_atlas.h5ad", "atlas", shard_size=20000, compression="gzip")

# Custom output filenames — provide explicit paths instead of auto-generated names
shard_h5ad(
    "large_atlas.h5ad",
    "atlas",  # ignored when output_filenames is provided
    shard_size=20000,
    output_filenames=["batch_0.h5ad", "batch_1.h5ad", "batch_2.h5ad"],
)

# Merge shards back into one file (identical-var fast path used automatically)
merge_out_of_core(["atlas_shard_0.h5ad", "atlas_shard_1.h5ad"], "merged.h5ad")

# Merge shards with different gene sets — outer join (union, fills absent genes with 0)
merge_out_of_core(["shard_a.h5ad", "shard_b.h5ad"], "merged.h5ad", join="outer")

# Merge shards with different gene sets — inner join (intersection only)
merge_out_of_core(["shard_a.h5ad", "shard_b.h5ad"], "merged.h5ad", join="inner")

How it works

Slicing

1. Opens the input file ("backed" AnnData for .h5ad; anndata.io.sparse_dataset for .zarr).
2. If shuffle=True, generates a global cell permutation upfront using numpy.random.default_rng.
3. For each shard, reads only the relevant rows from X and each layer via sorted fancy indexing, so no full matrix is ever loaded into memory.
4. When shuffling, rows are read in sorted index order (maximising sequential I/O) and then reordered in memory to the desired shuffled order.
5. Reassembles a valid AnnData object per shard and writes it to disk.
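
The shuffle steps above can be sketched with NumPy alone. This is an illustration of the sort-read-reorder idea under simplifying assumptions, not annslicer's actual code: an in-memory array stands in for the on-disk matrix, and `shuffled_shards` is a hypothetical helper.

```python
import numpy as np


def shuffled_shards(n_cells: int, shard_size: int, seed: int):
    """Yield (sorted_rows, restore) per shard for one global permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_cells)              # global cell permutation, built once
    for start in range(0, n_cells, shard_size):
        wanted = perm[start:start + shard_size]  # rows in the desired shuffled order
        order = np.argsort(wanted)               # positions that sort those row indices
        sorted_rows = wanted[order]              # increasing row indices for the read
        restore = np.argsort(order)              # inverse permutation of `order`
        yield sorted_rows, restore


# Stand-in for an on-disk matrix; a real run would index a backed dataset instead.
X = np.arange(10 * 3).reshape(10, 3)

shards = []
for sorted_rows, restore in shuffled_shards(n_cells=10, shard_size=4, seed=0):
    block = X[sorted_rows]          # one read using sorted indices
    shards.append(block[restore])   # reorder in memory to the shuffled order
```

Because the permutation is drawn once from a seeded generator, the same seed reproduces the same assignment of cells to shards.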

Merging

1. Reads obs, var, and uns from all shards to build a skeleton output file.
2. Computes the merged var index: union (outer join) or intersection (inner join) of gene sets across all shards. If every shard shares the identical var, remapping is skipped entirely (fast path).
3. Scans shards to calculate total non-zero sizes for pre-allocation (for an inner join, entries for excluded genes are filtered during the scan).
4. Streams X, layers, and obsm data shard-by-shard directly into the pre-allocated output arrays, remapping column indices on the fly where needed.
5. Layers absent from any shard are dropped so every cell has consistent layer coverage.
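
The scan, pre-allocate, and stream steps can be sketched on raw CSR arrays. This is a simplified in-memory illustration, not annslicer's implementation: `merge_csr` is a hypothetical function, each shard is a plain `(data, indices, indptr)` triple, and the output arrays live in memory rather than in an HDF5 or Zarr file.

```python
import numpy as np


def merge_csr(shards, shard_genes, merged_genes):
    """Stack CSR shards row-wise into pre-allocated arrays, remapping columns."""
    col_of = {g: j for j, g in enumerate(merged_genes)}
    # Per shard: old column -> new column, or -1 if the gene is excluded (inner join).
    remaps = [np.array([col_of.get(g, -1) for g in genes], dtype=np.int64)
              for genes in shard_genes]

    # Pass 1: count surviving non-zeros and rows so the output is allocated once.
    nnz = sum(int((remap[idx] >= 0).sum()) for (_, idx, _), remap in zip(shards, remaps))
    n_rows = sum(len(ptr) - 1 for _, _, ptr in shards)
    data = np.empty(nnz, dtype=np.float64)
    indices = np.empty(nnz, dtype=np.int64)
    indptr = np.zeros(n_rows + 1, dtype=np.int64)

    # Pass 2: stream each shard into its slice of the output arrays.
    row, pos = 0, 0
    for (d, idx, ptr), remap in zip(shards, remaps):
        new_cols = remap[idx]
        keep = new_cols >= 0
        kept_before = np.concatenate(([0], np.cumsum(keep)))
        row_counts = kept_before[ptr[1:]] - kept_before[ptr[:-1]]  # survivors per row
        n = int(keep.sum())
        data[pos:pos + n] = d[keep]
        indices[pos:pos + n] = new_cols[keep]
        indptr[row + 1:row + 1 + len(row_counts)] = pos + np.cumsum(row_counts)
        row += len(row_counts)
        pos += n
    return data, indices, indptr


# Shard A: 2 cells over (GeneA, GeneB); shard B: 1 cell over (GeneB, GeneC).
a = (np.array([1.0, 2.0]), np.array([0, 1]), np.array([0, 1, 2]))
b = (np.array([3.0, 4.0]), np.array([0, 1]), np.array([0, 2]))
genes = [["GeneA", "GeneB"], ["GeneB", "GeneC"]]

outer = merge_csr([a, b], genes, ["GeneA", "GeneB", "GeneC"])
inner = merge_csr([a, b], genes, ["GeneB"])
```

In this sketch the outer join keeps all four stored values, while the inner join keeps only the two that fall in the shared GeneB column.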

Note: CSC (column-compressed) sparse matrices are not supported for out-of-core row-slicing. Convert to CSR before sharding.

Benchmarks

Benchmarks were run on a synthetic sparse AnnData object with 200k cells and 10k genes.

For h5ad format

| Slicing method | Mean runtime (s) | Peak memory (MB) |
| --- | --- | --- |
| `annslicer slice` | 0.584 | 211.4 |
| `anndata` backed | 0.601 | 203.7 |
| `annslicer slice --shuffle` | 1.731 | 221.8 |
| `anndata` backed with shuffle | 3.830 | 209.1 |

For zarr format

| Slicing method | Mean runtime (s) | Peak memory (MB) |
| --- | --- | --- |
| `annslicer slice` | 1.050 | 62.1 |
| `anndata` backed | 0.799 | 54.4 |
| `annslicer slice --shuffle` | 5.544 | 142.9 |
| `anndata` backed with shuffle | 6.591 | 151.4 |

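Based on these benchmarks, we recommend running annslicer slice --shuffle on an .h5ad file to produce randomly shuffled shards: in the h5ad results it was about 2.2 times faster than shuffled backed anndata (1.731 s vs 3.830 s) at a similar peak memory.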

License

BSD 3-clause
