Code, data, and reproducibility materials for the paper:
Discovering and Extracting a Hematopoietic Algorithm from scGPT: From Internal Manifold to Deployable Compact Operator
We discover a compact hematopoietic developmental manifold inside the frozen scGPT single-cell foundation model, validate it on external datasets in strict zero-shot mode, and extract it as a standalone algorithm. The extracted operator significantly outperforms established trajectory-inference methods on pseudotime-depth alignment (orientation-independent |rho| = 0.439 vs 0.331 for the next-best alternative) while being 34.5x faster and requiring ~1,000x fewer trainable parameters. Multi-stage compaction reduces the operator to a single attention head (5.9 MB) and further to a rank-64 surrogate (0.73 MB) whose four-factor mechanistic core resolves into explicit hematopoietic gene programs.
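The orientation-independent |rho| reported above can be illustrated with a minimal sketch. It assumes rho is a Spearman rank correlation between inferred pseudotime and a per-cell developmental depth label, which the summary does not state explicitly. The function names and the toy values are hypothetical and are not taken from the repository or the paper.

```python
def ranks(values):
    """Return 1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def abs_spearman(pseudotime, depth):
    """Absolute Spearman correlation: unchanged if the pseudotime axis is flipped."""
    return abs(pearson(ranks(pseudotime), ranks(depth)))


if __name__ == "__main__":
    # Toy data (hypothetical): depth labels and a pseudotime that runs "backwards".
    depth = [0, 1, 1, 2, 3]
    pseudotime = [0.9, 0.7, 0.6, 0.3, 0.1]
    flipped = [1.0 - t for t in pseudotime]
    print(abs_spearman(pseudotime, depth), abs_spearman(flipped, depth))
```

Taking the absolute value means a method is not penalized for ordering cells from mature to immature rather than the reverse, since the direction of a pseudotime axis is arbitrary for many trajectory-inference methods.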
.
├── paper/                                  # Manuscript and figures
│   ├── scgpt_hematopoiesis_manifold_study_v2.tex  # Main paper + supplements
│   ├── figures/                            # Generated figures for the paper
│   └── results/                            # Benchmark result tables
│
├── iterations/
│   ├── iter_0011/                          # H65: Primary hematopoietic manifold study
│   │   ├── *.py                            # 50+ analysis scripts (h65_*.py)
│   │   ├── reviewer_kit/                   # Organized reproducibility package
│   │   │   ├── 01_core_h65/                # Core manifold discovery
│   │   │   ├── 02_trajectory_pseudotime/
│   │   │   ├── 03_let_transfer/            # Latent embedding transfer
│   │   │   ├── 04_visual_lenses/           # Visualization methods
│   │   │   ├── 05_dimension_flatness_ripple/
│   │   │   ├── 06_phase2_deepening/
│   │   │   ├── 07_external_validation/
│   │   │   ├── 08_head_attribution_and_ablations/
│   │   │   ├── 09_process_and_provenance/
│   │   │   ├── MANIFEST.csv                # Complete file inventory
│   │   │   └── PAPER_CODE_MAP.md           # Figure-to-code mapping
│   │   ├── artifacts/                      # Intermediate outputs
│   │   │   ├── figures/                    # Generated plots (referenced by paper)
│   │   │   ├── summaries/                  # JSON/MD summaries
│   │   │   └── notes/                      # Analysis notes
│   │   ├── scripts_by_section/             # Scripts organized by paper section
│   │   ├── ops/                            # Operational utilities
│   │   ├── sota_comparison_20260303/
│   │   │   ├── scripts/                    # SOTA benchmark scripts (included)
│   │   │   ├── configs/                    # Benchmark configurations
│   │   │   └── logs/                       # Execution logs
│   │   └── *.csv, *.json, *.md             # Result data files
│   │
│   └── iter_0014/                          # H38: Intercellular communication manifold
│       ├── *.py                            # H38 rescue/validation scripts
│       └── *.csv, *.json                   # H38 result artifacts
│
├── loop/                                   # Autonomous research loop configuration
│   ├── config_*.json                       # Loop configuration
│   └── *.sh                                # Start/stop/status scripts
│
├── prompts/                                # AI agent prompts for the research loop
├── reports/                                # Loop execution logs and live results
└── planning/                               # Project charter
- Paper: paper/scgpt_hematopoiesis_manifold_study_v2.tex (compile with pdflatex)
- Code-to-figure mapping: iterations/iter_0011/reviewer_kit/PAPER_CODE_MAP.md
- Reproducibility package: iterations/iter_0011/reviewer_kit/ (9 topical directories)
- Main orchestration script: iterations/iter_0011/run_manifold_experiments.py
- H38 secondary manifold: iterations/iter_0014/
The following large directories are excluded via .gitignore and can be regenerated:
| Directory | Size | Contents |
|---|---|---|
| iter_0011/sota_comparison_20260303/outputs/ | 12 GB | Raw SOTA method outputs |
| iter_0011/sota_comparison_20260303/methods_src/ | 633 MB | Downloaded SOTA implementations |
| iter_0011/external_downloads/ | 517 MB | External validation datasets |
| iter_0011/artifacts/viewers/ | 54 MB | Interactive HTML visualizations |
| iter_0011/artifacts/tables/ | 31 MB | Full numerical result tables |
| Non-primary iterations (1-10, 12-13) | 33 MB | Exploratory iterations |
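
Before running scripts that depend on these regenerable directories, it can help to check which ones are absent from a fresh clone. The following is a minimal sketch, assuming the table's paths are relative to iterations/. The helper name `missing_dirs` is hypothetical and is not part of the repository.

```python
from pathlib import Path

# Directories listed in the table above (assumed relative to iterations/).
EXCLUDED_DIRS = [
    "iter_0011/sota_comparison_20260303/outputs",
    "iter_0011/sota_comparison_20260303/methods_src",
    "iter_0011/external_downloads",
    "iter_0011/artifacts/viewers",
    "iter_0011/artifacts/tables",
]


def missing_dirs(iterations_root, relative_dirs=EXCLUDED_DIRS):
    """Return the listed directories that do not exist under iterations_root."""
    root = Path(iterations_root)
    return [d for d in relative_dirs if not (root / d).is_dir()]


if __name__ == "__main__":
    for d in missing_dirs("iterations"):
        print(f"not present (needs regeneration or download): {d}")
```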
- Python 3.10+ with conda environment bio_mech_interp
- Key packages: scgpt, scanpy, anndata, torch, scikit-learn, scipy
- LaTeX: pdflatex with standard packages (amsmath, booktabs, graphicx, hyperref)
- Data: Tabula Sapiens immune subset, processed through frozen scGPT whole-human checkpoint
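
A quick way to confirm that the active environment satisfies the requirements above is to check the Python version and look up each package without importing it. This is a minimal sketch and not a script shipped with the repository. The function names are hypothetical, and note that scikit-learn is imported under the module name sklearn.

```python
import importlib.util
import sys

# Top-level import names for the key packages listed above.
REQUIRED_MODULES = ["scgpt", "scanpy", "anndata", "torch", "sklearn", "scipy"]


def missing_modules(names):
    """Return the top-level module names that cannot be located in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def python_ok(minimum=(3, 10)):
    """True if the running interpreter is at least the given (major, minor) version."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Missing packages:", missing_modules(REQUIRED_MODULES) or "none")
```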
The study was conducted through a two-phase autonomous research loop:
- Phase 1 (AI-driven broad search): An AI executor-reviewer pair explored dozens of manifold hypotheses under pre-registered quality gates (trustworthiness >= 0.80, holdout correlation >= 0.20, blocked-permutation p <= 0.001). The key output was the identification of H65 (hematopoietic developmental manifold) as the first branch to pass all gates.
- Phase 2 (Author-directed investigation): Once H65 was identified, the research transitioned to manual investigation by the authors: methodological closure, external validation, extraction, benchmarking, compaction, and mechanistic interpretability.
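
The Phase 1 quality gates amount to three threshold checks that a hypothesis must pass together. The sketch below shows one way to express them. The thresholds come from the description above, while the class, function, and field names are hypothetical and do not reflect the actual loop implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    """Pre-registered thresholds as stated in the Phase 1 description."""
    min_trustworthiness: float = 0.80
    min_holdout_correlation: float = 0.20
    max_permutation_p: float = 0.001


def failed_gates(trustworthiness, holdout_correlation, permutation_p,
                 thresholds=GateThresholds()):
    """Return the names of the gates a hypothesis fails; an empty list means it passes all."""
    failures = []
    if trustworthiness < thresholds.min_trustworthiness:
        failures.append("trustworthiness")
    if holdout_correlation < thresholds.min_holdout_correlation:
        failures.append("holdout_correlation")
    if permutation_p > thresholds.max_permutation_p:
        failures.append("permutation_p")
    return failures


if __name__ == "__main__":
    # Illustrative values only; these are not results from the study.
    print(failed_gates(0.85, 0.25, 0.0005))
    print(failed_gates(0.79, 0.25, 0.01))
```

Returning the list of failed gates, rather than a single boolean, makes it easier to log why a candidate branch was rejected.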
Loop infrastructure is in loop/, prompts/, and reports/.
This repository accompanies a manuscript under review. Code is provided for reproducibility and review purposes.