Discovering and Extracting a Hematopoietic Algorithm from scGPT

Code, data, and reproducibility materials for the paper:

Discovering and Extracting a Hematopoietic Algorithm from scGPT: From Internal Manifold to Deployable Compact Operator

Overview

We discover a compact hematopoietic developmental manifold inside the frozen scGPT single-cell foundation model, validate it on external datasets in strict zero-shot mode, and extract it as a standalone algorithm. The extracted operator significantly outperforms established trajectory-inference methods on pseudotime-depth alignment (orientation-independent |rho| = 0.439 vs 0.331 for the next-best alternative) while being 34.5x faster and requiring ~1,000x fewer trainable parameters. Multi-stage compaction reduces the operator to a single attention head (5.9 MB) and further to a rank-64 surrogate (0.73 MB) whose four-factor mechanistic core resolves into explicit hematopoietic gene programs.
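The headline metric above is an orientation-independent |rho|. The sketch below illustrates why taking the absolute value makes the score invariant to the direction in which pseudotime is assigned. It assumes rho is a Spearman rank correlation between inferred pseudotime and a per-cell differentiation-depth label; this README does not state the correlation type, so treat that as an assumption. The function names and toy values are hypothetical and are not taken from the repository.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def abs_spearman(pseudotime, depth):
    """Absolute Spearman correlation (assumed form of the |rho| metric)."""
    return abs(pearson(average_ranks(pseudotime), average_ranks(depth)))


# Toy, made-up values: depth is an ordinal differentiation-stage label per cell.
depth = [0, 0, 1, 1, 2, 3]
pseudotime = [0.05, 0.10, 0.30, 0.45, 0.70, 0.95]
flipped = [1.0 - t for t in pseudotime]  # same ordering, opposite orientation

forward_score = abs_spearman(pseudotime, depth)
reversed_score = abs_spearman(flipped, depth)
```

Both scores agree because reversing pseudotime only flips the sign of rho. This matters when comparing trajectory methods that do not fix a root cell and may therefore return pseudotime in either direction.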

Repository Structure

.
├── paper/                          # Manuscript and figures
│   ├── scgpt_hematopoiesis_manifold_study_v2.tex   # Main paper + supplements
│   ├── figures/                    # Generated figures for the paper
│   └── results/                    # Benchmark result tables
│
├── iterations/
│   ├── iter_0011/                  # H65: Primary hematopoietic manifold study
│   │   ├── *.py                    # 50+ analysis scripts (h65_*.py)
│   │   ├── reviewer_kit/           # Organized reproducibility package
│   │   │   ├── 01_core_h65/        #   Core manifold discovery
│   │   │   ├── 02_trajectory_pseudotime/
│   │   │   ├── 03_let_transfer/    #   Latent embedding transfer
│   │   │   ├── 04_visual_lenses/   #   Visualization methods
│   │   │   ├── 05_dimension_flatness_ripple/
│   │   │   ├── 06_phase2_deepening/
│   │   │   ├── 07_external_validation/
│   │   │   ├── 08_head_attribution_and_ablations/
│   │   │   ├── 09_process_and_provenance/
│   │   │   ├── MANIFEST.csv        #   Complete file inventory
│   │   │   └── PAPER_CODE_MAP.md   #   Figure-to-code mapping
│   │   ├── artifacts/              # Intermediate outputs
│   │   │   ├── figures/            #   Generated plots (referenced by paper)
│   │   │   ├── summaries/          #   JSON/MD summaries
│   │   │   └── notes/              #   Analysis notes
│   │   ├── scripts_by_section/     # Scripts organized by paper section
│   │   ├── ops/                    # Operational utilities
│   │   ├── sota_comparison_20260303/
│   │   │   ├── scripts/            # SOTA benchmark scripts (included)
│   │   │   ├── configs/            # Benchmark configurations
│   │   │   └── logs/               # Execution logs
│   │   └── *.csv, *.json, *.md     # Result data files
│   │
│   └── iter_0014/                  # H38: Intercellular communication manifold
│       ├── *.py                    # H38 rescue/validation scripts
│       └── *.csv, *.json           # H38 result artifacts
│
├── loop/                           # Autonomous research loop configuration
│   ├── config_*.json               # Loop configuration
│   └── *.sh                        # Start/stop/status scripts
│
├── prompts/                        # AI agent prompts for the research loop
├── reports/                        # Loop execution logs and live results
└── planning/                       # Project charter

Quick Start for Reviewers

Paper: paper/scgpt_hematopoiesis_manifold_study_v2.tex (compile with pdflatex)
Code-to-figure mapping: iterations/iter_0011/reviewer_kit/PAPER_CODE_MAP.md
Reproducibility package: iterations/iter_0011/reviewer_kit/ (9 topical directories)
Main orchestration script: iterations/iter_0011/run_manifold_experiments.py
H38 secondary manifold: iterations/iter_0014/

Excluded from Repository (Reproducible from Scripts)

The following large directories are excluded via .gitignore and can be regenerated:

| Directory | Size | Contents |
|---|---|---|
| `iter_0011/sota_comparison_20260303/outputs/` | 12 GB | Raw SOTA method outputs |
| `iter_0011/sota_comparison_20260303/methods_src/` | 633 MB | Downloaded SOTA implementations |
| `iter_0011/external_downloads/` | 517 MB | External validation datasets |
| `iter_0011/artifacts/viewers/` | 54 MB | Interactive HTML visualizations |
| `iter_0011/artifacts/tables/` | 31 MB | Full numerical result tables |
| Non-primary iterations (1-10, 12-13) | 33 MB | Exploratory iterations |
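Before running scripts that read these outputs, it can help to see which of the excluded directories exist in a local checkout. The following is a minimal sketch, not a script shipped with the repository. The helper name `missing_excluded_dirs` is hypothetical, and resolving the listed paths under `iterations/` is an assumption based on the repository structure shown earlier.

```python
from pathlib import Path

# Paths copied from the list of excluded directories.
# Assumption: they are relative to the iterations/ directory.
EXCLUDED_DIRS = [
    "iter_0011/sota_comparison_20260303/outputs",
    "iter_0011/sota_comparison_20260303/methods_src",
    "iter_0011/external_downloads",
    "iter_0011/artifacts/viewers",
    "iter_0011/artifacts/tables",
]


def missing_excluded_dirs(repo_root):
    """Return the excluded directories that are absent under repo_root/iterations."""
    base = Path(repo_root) / "iterations"
    return [rel for rel in EXCLUDED_DIRS if not (base / rel).is_dir()]


if __name__ == "__main__":
    # Run from the repository root.
    for rel in missing_excluded_dirs("."):
        print(f"not present (regenerate before use): iterations/{rel}")
```

On a fresh clone every entry is expected to be reported as missing, since all of them are ignored by git.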

Dependencies

Python 3.10+ with conda environment bio_mech_interp
Key packages: scgpt, scanpy, anndata, torch, scikit-learn, scipy
LaTeX: pdflatex with standard packages (amsmath, booktabs, graphicx, hyperref)
Data: Tabula Sapiens immune subset, processed through frozen scGPT whole-human checkpoint
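A quick way to confirm that an activated environment satisfies the Python and package requirements above is sketched below. It only checks that each package can be located, not its version, and it is not part of the repository. The function name `check_environment` is hypothetical. Note that scikit-learn is imported as `sklearn`.

```python
import sys
from importlib.util import find_spec

# Import names for the key packages listed above.
# scikit-learn installs the module "sklearn"; the rest import under their listed names.
KEY_MODULES = ["scgpt", "scanpy", "anndata", "torch", "sklearn", "scipy"]


def check_environment(min_python=(3, 10)):
    """Report whether the interpreter is new enough and which modules cannot be found."""
    return {
        "python_ok": sys.version_info[:2] >= min_python,
        # find_spec locates a top-level module without importing it.
        "missing": [name for name in KEY_MODULES if find_spec(name) is None],
    }


report = check_environment()
print(report)
```

An empty `missing` list together with `python_ok` set to `True` suggests the environment is ready. Heavy packages such as torch are not imported by this check, so it returns quickly.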

Autonomous Research Loop

The study was conducted through a two-phase autonomous research loop:

Phase 1 (AI-driven broad search): An AI executor-reviewer pair explored dozens of manifold hypotheses under pre-registered quality gates (trustworthiness >= 0.80, holdout correlation >= 0.20, blocked-permutation p <= 0.001). The key output was the identification of H65 (hematopoietic developmental manifold) as the first branch to pass all gates.
Phase 2 (Author-directed investigation): Once H65 was identified, the research transitioned to manual investigation by the authors: methodological closure, external validation, extraction, benchmarking, compaction, and mechanistic interpretability.
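The Phase 1 quality gates can be read as a simple conjunction of three threshold checks. The sketch below encodes the thresholds quoted above. The class and function names are hypothetical, the example metric values are made up, and the actual gate implementation in loop/ may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    # Thresholds as stated in the Phase 1 description.
    min_trustworthiness: float = 0.80
    min_holdout_correlation: float = 0.20
    max_permutation_p: float = 0.001


def evaluate_gates(trustworthiness, holdout_correlation, permutation_p,
                   thresholds=GateThresholds()):
    """Return a per-gate pass/fail map plus an overall verdict."""
    results = {
        "trustworthiness": trustworthiness >= thresholds.min_trustworthiness,
        "holdout_correlation": holdout_correlation >= thresholds.min_holdout_correlation,
        "blocked_permutation_p": permutation_p <= thresholds.max_permutation_p,
    }
    # A hypothesis advances only if every individual gate passes.
    results["all_passed"] = all(results.values())
    return results


# Made-up metric values for two hypothetical candidate manifolds.
passing = evaluate_gates(trustworthiness=0.85, holdout_correlation=0.31, permutation_p=0.0005)
failing = evaluate_gates(trustworthiness=0.85, holdout_correlation=0.12, permutation_p=0.0005)
```

Returning the per-gate results alongside the overall verdict makes it easy to log which gate rejected a hypothesis, which is useful when dozens of candidates are screened.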

Loop infrastructure is in loop/, prompts/, and reports/.

License

This repository accompanies a manuscript under review. Code is provided for reproducibility and review purposes.
