# A Reproducible Baseline for Multi-Omics Drug-Response Prediction
**Goal.** Provide a clean, end-to-end baseline pipeline and API scaffold for predicting drug response from multi-omics data, with explicit stress-tests for missing modalities.
- Most “SOTA” papers are hard to reproduce. This is a usable starting point others can extend.
- Focus on robustness: how badly do models degrade as modalities go missing, and which simple strategies help?
- Minimal FastAPI service (`/predict`) and Dockerfile
- Experiment matrix (`configs/experiments.yaml`) with missingness regimes (MCAR, block-missing)
- Baseline models (logistic regression, random forest, simple NN), to be filled in the `src/pipeline/` modules
- Unit tests and CI (GitHub Actions)
- Makefile with reproducible environment setup (pip, conda, Docker)
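The two missingness regimes named above are easy to simulate. A minimal sketch (these helper names are illustrative, not from the repo):

```python
import numpy as np

def mcar_mask(shape, frac_missing, rng):
    """MCAR: each entry is dropped independently with probability frac_missing."""
    return rng.random(shape) >= frac_missing  # True = observed

def block_mask(n_samples, block_sizes, frac_samples_missing, rng):
    """Block-missing: for a fraction of samples, one whole modality
    (a contiguous block of columns) is dropped at once."""
    mask = np.ones((n_samples, sum(block_sizes)), dtype=bool)
    edges = np.cumsum([0] + list(block_sizes))
    for i in range(n_samples):
        if rng.random() < frac_samples_missing:
            b = rng.integers(len(block_sizes))       # pick one modality
            mask[i, edges[b]:edges[b + 1]] = False   # drop its whole block
    return mask

rng = np.random.default_rng(0)
m1 = mcar_mask((100, 30), 0.2, rng)       # ~20% of entries missing
m2 = block_mask(100, [10, 20], 0.5, rng)  # ~50% of samples lose one modality
```

MCAR stresses per-feature imputation, while block-missing is the realistic case where a whole assay (e.g. the proteome) is absent for a patient.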
```sh
make env   # create/update conda env (auto-generates environment.yml)
make run   # run baseline pipeline

# or via Docker
make docker-build-train && make docker-run-train
```

Example request:

```sh
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features":{"g1":1.0,"g2":-2.0,"g3":0.5},"masks":{"g1":1,"g2":1,"g3":0}}'
```

Or install with pip and run the training script directly:

```sh
pip install -r requirements.txt
python scripts/train.py --config configs/experiments.yaml
```

Make targets:

```sh
make env          # create/update env (pip inside conda)
make env-update   # update from environment.yml
make run          # run baseline
make freeze       # export pinned deps to requirements-lock.txt
make clean        # remove caches
```

Optional Conda extras (CUDA, MKL):

```sh
make env-conda CONDA_DEPS="pytorch pytorch-cuda=12.1" CONDA_CHANNELS="pytorch nvidia conda-forge"
```

Pure Conda convenience (no YAML/pip):

```sh
make conda-env
```

Docker:

```sh
make docker-build-train
make docker-run-train

# after adding FastAPI app/main.py:
make docker-build-api
make docker-run-api
```
- Download TCGA data:

  ```sh
  make data-tcga
  ```

  Downloads into:

  ```
  data/raw/tcga/tcga_RSEM_gene_tpm.gz
  data/raw/supplemental/Survival_SupplementalTable_S1_20171025_xena_sp
  ```

- Ingest and preprocess:

  ```sh
  make data-real
  ```

  Extracts the `.tsv`, skips duplicate copies, and organizes:

  ```
  data/raw/
  ├── tcga/
  │   ├── tcga_RSEM_gene_tpm.gz
  │   └── tcga_RSEM_gene_tpm.tsv
  └── supplemental/
      └── Survival_SupplementalTable_S1_20171025_xena_sp
  ```

- Full reset (clean and reload end-to-end):

  ```sh
  make data-reset
  ```
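The downloaded TPM matrix follows the Xena convention of genes as rows and samples as columns, so ingestion typically transposes it to samples × genes. A self-contained sketch on a toy stand-in file (the real path and gene IDs come from the download above):

```python
import gzip
import io
import pandas as pd

# Toy stand-in for data/raw/tcga/tcga_RSEM_gene_tpm.gz: tab-separated,
# genes as rows, samples as columns, gene IDs in the first column.
raw = "sample\tTCGA-01\tTCGA-02\ng1\t1.0\t2.0\ng2\t-2.0\t0.5\n"
buf = io.BytesIO(gzip.compress(raw.encode()))

expr = pd.read_csv(buf, sep="\t", index_col=0, compression="gzip")
expr = expr.T  # transpose so rows are samples, columns are genes
print(expr.shape)  # (2, 2): 2 samples x 2 genes
```

For the real file, replace `buf` with the path to `tcga_RSEM_gene_tpm.gz`; pandas infers gzip compression from the extension automatically.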
- Phase 1: Baselines under complete vs missing data (MCAR, block-missing). Metrics: AUROC/AUPRC.
- Phase 2: Simple robustness methods: mean/zero + mask, matrix completion (softImpute), cross-modal ridge (RNA→pseudo-proteome).
- Phase 3: Productionization: clean configs, reproducible runs, API/dashboard.
- Phase 4: Critical analysis vs frontier methods (GNNs, Transformers, CODE-AE, diffusion) — discussion only in the public version.
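The simplest Phase 2 strategy, mean imputation plus appending the mask as extra indicator features, fits in a few lines. A sketch on synthetic data, not the repo implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels
mask = rng.random(X.shape) >= 0.3        # MCAR: ~30% of entries missing

# Mean-impute missing entries from observed values only, then feed the
# mask itself to the model so it can learn "this value was imputed".
X_obs = np.where(mask, X, np.nan)
col_means = np.nanmean(X_obs, axis=0)
X_imp = np.where(mask, X, col_means)
X_aug = np.hstack([X_imp, mask.astype(float)])  # features + mask indicators

clf = LogisticRegression(max_iter=1000).fit(X_aug, y)
```

The mask indicators let a linear model down-weight imputed coordinates, which is exactly the effect the Phase 1 vs Phase 2 comparison is meant to measure.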
- Missing-modality robustness is often under-reported. Publish curves of performance vs % missing.
- Advanced architectures add value only if they explicitly handle masks and domain shift; otherwise they overfit.
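The robustness curve can be produced by sweeping the missing fraction at test time and recording AUROC at each level. A self-contained sketch on synthetic data (the repo's version would run over the real experiment matrix):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X @ rng.normal(size=10) > 0).astype(int)
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]

clf = LogisticRegression().fit(X_tr, y_tr)  # trained on complete data

curve = []
for frac in [0.0, 0.2, 0.4, 0.6, 0.8]:
    mask = rng.random(X_te.shape) >= frac            # MCAR at test time
    X_deg = np.where(mask, X_te, X_tr.mean(axis=0))  # impute training means
    auc = roc_auc_score(y_te, clf.predict_proba(X_deg)[:, 1])
    curve.append((frac, auc))

for frac, auc in curve:
    print(f"{frac:.0%} missing -> AUROC {auc:.3f}")
```

Plotting `curve` for each model and regime gives the performance-vs-%-missing figure argued for above.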
- Drop in alternative models in `src/pipeline/`
- Add new missingness regimes in `configs/experiments.yaml`
- PRs for better baselines welcome
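One lightweight way to make models drop-in is a small registry of scikit-learn-compatible constructors. This is a hypothetical pattern, not the repo's actual interface:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical registry: any entry just needs fit/predict_proba, so a
# contributed model slots in without touching the rest of the pipeline.
MODEL_REGISTRY = {
    "logistic": lambda: LogisticRegression(max_iter=1000),
    "random_forest": lambda: RandomForestClassifier(n_estimators=100),
}

def build_model(name):
    """Instantiate a fresh model by its registry name."""
    return MODEL_REGISTRY[name]()
```

A config entry such as `model: random_forest` would then map directly to `build_model`, keeping model choice in `configs/experiments.yaml` rather than in code.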