
Memory issues running Bigraph algorithm on large datasets #2

@yutong-zhu

Description


Hi,

I’m really impressed by the power and flexibility of the Bigraph algorithm for spatial data and have been excited to apply it to my much larger dataset—some 800 patients with an average of 13,000 cells each—but I’ve encountered a major roadblock:

When I run the Bigraph algorithm on the entire dataset (i.e., generating and storing a full adjacency matrix for each patient), memory usage rockets to around 500 GB after processing roughly 200 patients, causing the kernel to crash.


To work around this, I’ve been thinking about a sequential workflow:

1. construct each patient’s cell graph,
2. compute the subtree embeddings and save them to disk,
3. immediately clear the graph from memory before moving on to the next patient,
4. finally, reload only the embeddings for clustering across all patients.
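To make the idea concrete, here is a minimal sketch of that loop. The `build_cell_graph` and `subtree_embedding` helpers below are placeholders I made up for illustration, not the actual Bigraph API; the point is just that only a small fixed-size embedding per patient ever touches the disk, while the large adjacency matrix is freed each iteration.

```python
import os
import tempfile

import numpy as np


def build_cell_graph(cells: np.ndarray) -> np.ndarray:
    """Placeholder: dense radius-graph adjacency (the large object)."""
    dists = np.linalg.norm(cells[:, None, :] - cells[None, :, :], axis=-1)
    return (dists < 1.0).astype(np.float32)


def subtree_embedding(adj: np.ndarray, dim: int = 8) -> np.ndarray:
    """Placeholder: a tiny fixed-size summary (here, a degree histogram)."""
    degrees = adj.sum(axis=1)
    hist, _ = np.histogram(degrees, bins=dim, range=(0, dim))
    return hist.astype(np.float32)


outdir = tempfile.mkdtemp()
rng = np.random.default_rng(0)

for pid in range(3):  # stands in for the ~800 patients
    cells = rng.random((100, 2)) * 5.0       # stands in for one patient's cells
    adj = build_cell_graph(cells)            # large, exists only inside the loop
    emb = subtree_embedding(adj)             # small, fixed-size
    np.save(os.path.join(outdir, f"patient_{pid}.npy"), emb)
    del adj, cells                           # free the graph before the next patient

# Later: reload only the small embeddings for clustering across all patients.
embs = np.stack(
    [np.load(os.path.join(outdir, f"patient_{p}.npy")) for p in range(3)]
)
print(embs.shape)
```

With this shape, peak memory is dominated by a single patient's graph rather than all of them at once, which is why I hope it could scale to the full cohort.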

However, I’m concerned about compatibility with the existing codebase and whether there might be more efficient built-in options—such as sparse or on-disk graph representations, chunked or streaming processing, or other memory-saving strategies—that could better accommodate large-scale datasets. Any advice or best practices you could share would be really helpful!
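For reference, this is the kind of saving I would expect from a sparse representation alone, using SciPy rather than anything in the Bigraph codebase (synthetic coordinates, and the radius of 1.0 is an arbitrary choice for the example):

```python
import numpy as np
from scipy import sparse
from scipy.spatial import cKDTree

n = 13_000  # roughly one patient's cell count, per the numbers above
rng = np.random.default_rng(0)
coords = rng.random((n, 2)) * 100.0

# Radius graph built via a KD-tree: only neighbor pairs are materialized.
tree = cKDTree(coords)
pairs = tree.query_pairs(r=1.0, output_type="ndarray")
rows = np.concatenate([pairs[:, 0], pairs[:, 1]])  # symmetrize
cols = np.concatenate([pairs[:, 1], pairs[:, 0]])
adj = sparse.csr_matrix(
    (np.ones(len(rows), dtype=np.float32), (rows, cols)), shape=(n, n)
)

dense_bytes = n * n * 4  # a float32 dense adjacency for the same graph
sparse_bytes = adj.data.nbytes + adj.indices.nbytes + adj.indptr.nbytes
print(f"dense: {dense_bytes / 1e6:.0f} MB, sparse: {sparse_bytes / 1e6:.1f} MB")
```

If the codebase could consume CSR matrices like this (or memory-mapped arrays via `numpy.load(..., mmap_mode="r")`), that might avoid the blow-up without restructuring the whole pipeline, but I don't know whether the downstream steps assume dense inputs.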

Best,
Irina
