Biodyn-AI/manifold

Discovering and Extracting a Hematopoietic Algorithm from scGPT

Code, data, and reproducibility materials for the paper:

Discovering and Extracting a Hematopoietic Algorithm from scGPT: From Internal Manifold to Deployable Compact Operator

Overview

We discover a compact hematopoietic developmental manifold inside the frozen scGPT single-cell foundation model, validate it on external datasets in strict zero-shot mode, and extract it as a standalone algorithm. The extracted operator significantly outperforms established trajectory-inference methods on pseudotime-depth alignment (orientation-independent |rho| = 0.439 vs 0.331 for the next-best alternative) while being 34.5x faster and requiring ~1,000x fewer trainable parameters. Multi-stage compaction reduces the operator to a single attention head (5.9 MB) and further to a rank-64 surrogate (0.73 MB) whose four-factor mechanistic core resolves into explicit hematopoietic gene programs.
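As a rough illustration of what "orientation-independent |rho|" measures, the sketch below takes the absolute value of a rank correlation between inferred pseudotime and a reference depth, so a trajectory ordered root-to-leaf scores the same as one ordered leaf-to-root. It assumes rho is a Spearman correlation, which this README does not state explicitly. The function names and toy values are hypothetical and not taken from the repository.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def abs_spearman(pseudotime, depth):
    """Absolute Spearman correlation (assumed form of |rho|); the sign is discarded."""
    if len(pseudotime) != len(depth):
        raise ValueError("inputs must have the same length")
    rx, ry = average_ranks(pseudotime), average_ranks(depth)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return abs(cov / (var_x * var_y) ** 0.5)


# Toy data (illustrative only): five cells with increasing developmental depth.
depth = [0, 1, 2, 3, 4]
forward = [0.1, 0.2, 0.5, 0.7, 0.9]
flipped = [1.0 - t for t in forward]  # same ordering, opposite orientation

print(abs_spearman(forward, depth), abs_spearman(flipped, depth))
```

Both calls give the same score because reversing the pseudotime axis only flips the sign of the correlation, which the absolute value removes.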

Repository Structure

.
├── paper/                          # Manuscript and figures
│   ├── scgpt_hematopoiesis_manifold_study_v2.tex   # Main paper + supplements
│   ├── figures/                    # Generated figures for the paper
│   └── results/                    # Benchmark result tables
│
├── iterations/
│   ├── iter_0011/                  # H65: Primary hematopoietic manifold study
│   │   ├── *.py                    # 50+ analysis scripts (h65_*.py)
│   │   ├── reviewer_kit/           # Organized reproducibility package
│   │   │   ├── 01_core_h65/        #   Core manifold discovery
│   │   │   ├── 02_trajectory_pseudotime/
│   │   │   ├── 03_let_transfer/    #   Latent embedding transfer
│   │   │   ├── 04_visual_lenses/   #   Visualization methods
│   │   │   ├── 05_dimension_flatness_ripple/
│   │   │   ├── 06_phase2_deepening/
│   │   │   ├── 07_external_validation/
│   │   │   ├── 08_head_attribution_and_ablations/
│   │   │   ├── 09_process_and_provenance/
│   │   │   ├── MANIFEST.csv        #   Complete file inventory
│   │   │   └── PAPER_CODE_MAP.md   #   Figure-to-code mapping
│   │   ├── artifacts/              # Intermediate outputs
│   │   │   ├── figures/            #   Generated plots (referenced by paper)
│   │   │   ├── summaries/          #   JSON/MD summaries
│   │   │   └── notes/              #   Analysis notes
│   │   ├── scripts_by_section/     # Scripts organized by paper section
│   │   ├── ops/                    # Operational utilities
│   │   ├── sota_comparison_20260303/
│   │   │   ├── scripts/            # SOTA benchmark scripts (included)
│   │   │   ├── configs/            # Benchmark configurations
│   │   │   └── logs/               # Execution logs
│   │   └── *.csv, *.json, *.md     # Result data files
│   │
│   └── iter_0014/                  # H38: Intercellular communication manifold
│       ├── *.py                    # H38 rescue/validation scripts
│       └── *.csv, *.json           # H38 result artifacts
│
├── loop/                           # Autonomous research loop configuration
│   ├── config_*.json               # Loop configuration
│   └── *.sh                        # Start/stop/status scripts
│
├── prompts/                        # AI agent prompts for the research loop
├── reports/                        # Loop execution logs and live results
└── planning/                       # Project charter

Quick Start for Reviewers

  1. Paper: paper/scgpt_hematopoiesis_manifold_study_v2.tex (compile with pdflatex)
  2. Code-to-figure mapping: iterations/iter_0011/reviewer_kit/PAPER_CODE_MAP.md
  3. Reproducibility package: iterations/iter_0011/reviewer_kit/ (9 topical directories)
  4. Main orchestration script: iterations/iter_0011/run_manifold_experiments.py
  5. H38 secondary manifold: iterations/iter_0014/
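
Before starting, a reviewer may want to confirm that the entry points listed above are present in their checkout. The following is a minimal sketch, not a script shipped with the repository; the helper name `missing_entry_points` is hypothetical, and the paths are the ones named in this README.

```python
from pathlib import Path

# Entry points named in the Quick Start list and the repository tree.
REVIEWER_ENTRY_POINTS = [
    "paper/scgpt_hematopoiesis_manifold_study_v2.tex",
    "iterations/iter_0011/reviewer_kit/PAPER_CODE_MAP.md",
    "iterations/iter_0011/reviewer_kit/MANIFEST.csv",
    "iterations/iter_0011/run_manifold_experiments.py",
    "iterations/iter_0014",
]


def missing_entry_points(repo_root):
    """Return the listed paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REVIEWER_ENTRY_POINTS if not (root / rel).exists()]


if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_entry_points(".")
    if missing:
        print("Missing entry points:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All reviewer entry points found.")
```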

Excluded from Repository (Reproducible from Scripts)

The following large directories are excluded via .gitignore and can be regenerated:

| Directory | Size | Contents |
|---|---|---|
| iter_0011/sota_comparison_20260303/outputs/ | 12 GB | Raw SOTA method outputs |
| iter_0011/sota_comparison_20260303/methods_src/ | 633 MB | Downloaded SOTA implementations |
| iter_0011/external_downloads/ | 517 MB | External validation datasets |
| iter_0011/artifacts/viewers/ | 54 MB | Interactive HTML visualizations |
| iter_0011/artifacts/tables/ | 31 MB | Full numerical result tables |
| Non-primary iterations (1-10, 12-13) | 33 MB | Exploratory iterations |

Dependencies

  • Python 3.10+ with conda environment bio_mech_interp
  • Key packages: scgpt, scanpy, anndata, torch, scikit-learn, scipy
  • LaTeX: pdflatex with standard packages (amsmath, booktabs, graphicx, hyperref)
  • Data: Tabula Sapiens immune subset, processed through the frozen scGPT whole-human checkpoint
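
A quick way to see whether the active environment covers the list above is to probe for each package without importing it. This is a sketch under assumptions: the import names in the mapping are the conventional ones (scikit-learn is imported as `sklearn`), and the helper names are hypothetical rather than part of the repository.

```python
import sys
from importlib.util import find_spec

# Package name from the README -> module name assumed for import.
KEY_PACKAGES = {
    "scgpt": "scgpt",
    "scanpy": "scanpy",
    "anndata": "anndata",
    "torch": "torch",
    "scikit-learn": "sklearn",
    "scipy": "scipy",
}


def missing_packages(packages=KEY_PACKAGES):
    """Return the package names whose module cannot be located."""
    return sorted(
        name for name, module in packages.items() if find_spec(module) is None
    )


def python_version_ok(minimum=(3, 10)):
    """Check the interpreter against the README's Python 3.10+ requirement."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python 3.10+:", python_version_ok())
    print("Missing packages:", missing_packages() or "none")
```

Using `find_spec` avoids the cost of actually importing heavy packages such as torch just to confirm they are installed.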

Autonomous Research Loop

The study proceeded in two phases: an autonomous research loop, followed by manual investigation by the authors.

  • Phase 1 (AI-driven broad search): An AI executor-reviewer pair explored dozens of manifold hypotheses under pre-registered quality gates (trustworthiness >= 0.80, holdout correlation >= 0.20, blocked-permutation p <= 0.001). The key output was the identification of H65 (hematopoietic developmental manifold) as the first branch to pass all gates.

  • Phase 2 (Author-directed investigation): Once H65 was identified, the research transitioned to manual investigation by the authors: methodological closure, external validation, extraction, benchmarking, compaction, and mechanistic interpretability.
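
The Phase 1 quality gates can be sketched as a simple threshold check. The thresholds below are the ones quoted above; the class and field names are hypothetical, and the sketch compares the raw holdout correlation because this README does not say whether the gate applies to the signed or absolute value.

```python
from dataclasses import dataclass

# Pre-registered thresholds as stated in this README.
MIN_TRUSTWORTHINESS = 0.80
MIN_HOLDOUT_CORRELATION = 0.20
MAX_PERMUTATION_P = 0.001


@dataclass(frozen=True)
class GateMetrics:
    """Metrics for one manifold hypothesis (field names are illustrative)."""
    trustworthiness: float
    holdout_correlation: float
    permutation_p: float


def failed_gates(metrics):
    """Return the names of the gates that the hypothesis does not pass."""
    failures = []
    if metrics.trustworthiness < MIN_TRUSTWORTHINESS:
        failures.append("trustworthiness")
    if metrics.holdout_correlation < MIN_HOLDOUT_CORRELATION:
        failures.append("holdout_correlation")
    if metrics.permutation_p > MAX_PERMUTATION_P:
        failures.append("permutation_p")
    return failures


def passes_all_gates(metrics):
    """A hypothesis advances only if no gate fails."""
    return not failed_gates(metrics)


# Made-up values for illustration; these are not results from the study.
candidate = GateMetrics(trustworthiness=0.85, holdout_correlation=0.25, permutation_p=0.0005)
print(passes_all_gates(candidate), failed_gates(candidate))
```

Returning the list of failed gates, rather than a single boolean, makes it easier to log why a given hypothesis was rejected.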

Loop infrastructure is in loop/, prompts/, and reports/.

License

This repository accompanies a manuscript under review. Code is provided for reproducibility and review purposes.
