
Memory issues running Bigraph algorithm on large datasets #2

@yutong-zhu

Description


Hi,

I’m really impressed by the power and flexibility of the Bigraph algorithm for spatial data and have been excited to apply it to my much larger dataset—some 800 patients with an average of 13,000 cells each—but I’ve encountered a major roadblock:

When I run the Bigraph algorithm on the entire dataset (i.e., generating and storing a full adjacency matrix for each patient), memory usage rockets to around 500 GB after processing roughly 200 patients, causing the kernel to crash.


To work around this, I’ve been thinking about a sequential workflow:

1. construct each patient’s cell graph,
2. compute the subtree embeddings and save them to disk,
3. immediately clear the graph from memory before moving on to the next patient,
4. finally, reload only the embeddings for clustering across all patients.
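To make the idea concrete, here is a minimal sketch of that loop. The `build_cell_graph` and `subtree_embedding` helpers below are placeholders I made up for illustration, not the actual Bigraph API; the point is just that only a small fixed-size embedding per patient ever touches the disk, while the large adjacency matrix is freed each iteration.

```python
import os
import tempfile

import numpy as np


def build_cell_graph(cells: np.ndarray) -> np.ndarray:
    """Placeholder: dense radius-graph adjacency (the large object)."""
    dists = np.linalg.norm(cells[:, None, :] - cells[None, :, :], axis=-1)
    return (dists < 1.0).astype(np.float32)


def subtree_embedding(adj: np.ndarray, dim: int = 8) -> np.ndarray:
    """Placeholder: a tiny fixed-size summary (here, a degree histogram)."""
    degrees = adj.sum(axis=1)
    hist, _ = np.histogram(degrees, bins=dim, range=(0, dim))
    return hist.astype(np.float32)


outdir = tempfile.mkdtemp()
rng = np.random.default_rng(0)

for pid in range(3):  # stands in for the ~800 patients
    cells = rng.random((100, 2)) * 5.0       # stands in for one patient's cells
    adj = build_cell_graph(cells)            # large, exists only inside the loop
    emb = subtree_embedding(adj)             # small, fixed-size
    np.save(os.path.join(outdir, f"patient_{pid}.npy"), emb)
    del adj, cells                           # free the graph before the next patient

# Later: reload only the small embeddings for clustering across all patients.
embs = np.stack(
    [np.load(os.path.join(outdir, f"patient_{p}.npy")) for p in range(3)]
)
print(embs.shape)
```

With this shape, peak memory is dominated by a single patient's graph rather than all of them at once, which is why I hope it could scale to the full cohort.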

However, I’m concerned about compatibility with the existing codebase and whether there might be more efficient built-in options—such as sparse or on-disk graph representations, chunked or streaming processing, or other memory-saving strategies—that could better accommodate large-scale datasets. Any advice or best practices you could share would be really helpful!
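For reference, this is the kind of saving I would expect from a sparse representation alone, using SciPy rather than anything in the Bigraph codebase (synthetic coordinates, and the radius of 1.0 is an arbitrary choice for the example):

```python
import numpy as np
from scipy import sparse
from scipy.spatial import cKDTree

n = 13_000  # roughly one patient's cell count, per the numbers above
rng = np.random.default_rng(0)
coords = rng.random((n, 2)) * 100.0

# Radius graph built via a KD-tree: only neighbor pairs are materialized.
tree = cKDTree(coords)
pairs = tree.query_pairs(r=1.0, output_type="ndarray")
rows = np.concatenate([pairs[:, 0], pairs[:, 1]])  # symmetrize
cols = np.concatenate([pairs[:, 1], pairs[:, 0]])
adj = sparse.csr_matrix(
    (np.ones(len(rows), dtype=np.float32), (rows, cols)), shape=(n, n)
)

dense_bytes = n * n * 4  # a float32 dense adjacency for the same graph
sparse_bytes = adj.data.nbytes + adj.indices.nbytes + adj.indptr.nbytes
print(f"dense: {dense_bytes / 1e6:.0f} MB, sparse: {sparse_bytes / 1e6:.1f} MB")
```

If the codebase could consume CSR matrices like this (or memory-mapped arrays via `numpy.load(..., mmap_mode="r")`), that might avoid the blow-up without restructuring the whole pipeline, but I don't know whether the downstream steps assume dense inputs.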

Best,
Irina
