# A Reproducible Baseline for Multi-Omics Drug-Response Prediction
**Goal.** Provide a clean, end-to-end baseline pipeline and API scaffold for predicting drug response from multi-omics data, with explicit stress-tests for missing modalities.
- Most “SOTA” papers are hard to reproduce. This is a usable starting point others can extend.
- Focus on robustness: how badly do models degrade as modalities go missing, and which simple strategies help?
- Minimal FastAPI service (`/predict`) and Dockerfile
- Experiment matrix (`configs/experiments.yaml`) with missingness regimes (MCAR, block-missing)
- Baseline models (logistic regression, random forest, simple NN), to be filled in the `src/pipeline/` modules
- Unit tests and CI (GitHub Actions)
- Makefile with reproducible environment setup (pip, conda, Docker)
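The two missingness regimes named above are easy to simulate. A minimal sketch (these helper names are illustrative, not from the repo):

```python
import numpy as np

def mcar_mask(shape, frac_missing, rng):
    """MCAR: each entry is dropped independently with probability frac_missing."""
    return rng.random(shape) >= frac_missing  # True = observed

def block_mask(n_samples, block_sizes, frac_samples_missing, rng):
    """Block-missing: for a fraction of samples, one whole modality
    (a contiguous block of columns) is dropped at once."""
    mask = np.ones((n_samples, sum(block_sizes)), dtype=bool)
    edges = np.cumsum([0] + list(block_sizes))
    for i in range(n_samples):
        if rng.random() < frac_samples_missing:
            b = rng.integers(len(block_sizes))       # pick one modality
            mask[i, edges[b]:edges[b + 1]] = False   # drop its whole block
    return mask

rng = np.random.default_rng(0)
m1 = mcar_mask((100, 30), 0.2, rng)       # ~20% of entries missing
m2 = block_mask(100, [10, 20], 0.5, rng)  # ~50% of samples lose one modality
```

MCAR stresses per-feature imputation, while block-missing is the realistic case where a whole assay (e.g. the proteome) is absent for a patient.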
```sh
make env   # create/update conda env (auto-generates environment.yml)
make run   # run baseline pipeline

# or via Docker
make docker-build-train && make docker-run-train
```

Example request:

```sh
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features":{"g1":1.0,"g2":-2.0,"g3":0.5},"masks":{"g1":1,"g2":1,"g3":0}}'
```

Or install with pip and run the training script directly:

```sh
pip install -r requirements.txt
python scripts/train.py --config configs/experiments.yaml
```

Make targets:

```sh
make env          # create/update env (pip inside conda)
make env-update   # update from environment.yml
make run          # run baseline
make freeze       # export pinned deps to requirements-lock.txt
make clean        # remove caches
```

Optional Conda extras (CUDA, MKL):

```sh
make env-conda CONDA_DEPS="pytorch pytorch-cuda=12.1" CONDA_CHANNELS="pytorch nvidia conda-forge"
```

Pure Conda convenience (no YAML/pip):

```sh
make conda-env
```

Docker:

```sh
make docker-build-train
make docker-run-train

# after adding FastAPI app/main.py:
make docker-build-api
make docker-run-api
```
- Download TCGA data:

  ```sh
  make data-tcga
  ```

  Downloads into:

  ```
  data/raw/tcga/tcga_RSEM_gene_tpm.gz
  data/raw/supplemental/Survival_SupplementalTable_S1_20171025_xena_sp
  ```

- Ingest and preprocess:

  ```sh
  make data-real
  ```

  Extracts the `.tsv`, skips duplicate copies, and organizes:

  ```
  data/raw/
  ├── tcga/
  │   ├── tcga_RSEM_gene_tpm.gz
  │   └── tcga_RSEM_gene_tpm.tsv
  └── supplemental/
      └── Survival_SupplementalTable_S1_20171025_xena_sp
  ```

- Full reset (clean and reload end-to-end):

  ```sh
  make data-reset
  ```
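The downloaded TPM matrix follows the Xena convention of genes as rows and samples as columns, so ingestion typically transposes it to samples × genes. A self-contained sketch on a toy stand-in file (the real path and gene IDs come from the download above):

```python
import gzip
import io
import pandas as pd

# Toy stand-in for data/raw/tcga/tcga_RSEM_gene_tpm.gz: tab-separated,
# genes as rows, samples as columns, gene IDs in the first column.
raw = "sample\tTCGA-01\tTCGA-02\ng1\t1.0\t2.0\ng2\t-2.0\t0.5\n"
buf = io.BytesIO(gzip.compress(raw.encode()))

expr = pd.read_csv(buf, sep="\t", index_col=0, compression="gzip")
expr = expr.T  # transpose so rows are samples, columns are genes
print(expr.shape)  # (2, 2): 2 samples x 2 genes
```

For the real file, replace `buf` with the path to `tcga_RSEM_gene_tpm.gz`; pandas infers gzip compression from the extension automatically.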
- Phase 1: Baselines under complete vs missing data (MCAR, block-missing). Metrics: AUROC/AUPRC.
- Phase 2: Simple robustness methods: mean/zero + mask, matrix completion (softImpute), cross-modal ridge (RNA→pseudo-proteome).
- Phase 3: Productionization: clean configs, reproducible runs, API/dashboard.
- Phase 4: Critical analysis vs frontier methods (GNNs, Transformers, CODE-AE, diffusion) — discussion only in the public version.
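The simplest Phase 2 strategy, mean imputation plus appending the mask as extra indicator features, fits in a few lines. A sketch on synthetic data, not the repo implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels
mask = rng.random(X.shape) >= 0.3        # MCAR: ~30% of entries missing

# Mean-impute missing entries from observed values only, then feed the
# mask itself to the model so it can learn "this value was imputed".
X_obs = np.where(mask, X, np.nan)
col_means = np.nanmean(X_obs, axis=0)
X_imp = np.where(mask, X, col_means)
X_aug = np.hstack([X_imp, mask.astype(float)])  # features + mask indicators

clf = LogisticRegression(max_iter=1000).fit(X_aug, y)
```

The mask indicators let a linear model down-weight imputed coordinates, which is exactly the effect the Phase 1 vs Phase 2 comparison is meant to measure.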
- Missing-modality robustness is often under-reported. Publish curves of performance vs % missing.
- Advanced architectures add value only if they explicitly handle masks and domain shift; otherwise they overfit.
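The robustness curve can be produced by sweeping the missing fraction at test time and recording AUROC at each level. A self-contained sketch on synthetic data (the repo's version would run over the real experiment matrix):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X @ rng.normal(size=10) > 0).astype(int)
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]

clf = LogisticRegression().fit(X_tr, y_tr)  # trained on complete data

curve = []
for frac in [0.0, 0.2, 0.4, 0.6, 0.8]:
    mask = rng.random(X_te.shape) >= frac            # MCAR at test time
    X_deg = np.where(mask, X_te, X_tr.mean(axis=0))  # impute training means
    auc = roc_auc_score(y_te, clf.predict_proba(X_deg)[:, 1])
    curve.append((frac, auc))

for frac, auc in curve:
    print(f"{frac:.0%} missing -> AUROC {auc:.3f}")
```

Plotting `curve` for each model and regime gives the performance-vs-%-missing figure argued for above.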
- Drop in alternative models in `src/pipeline/`
- Add new missingness regimes in `configs/experiments.yaml`
- PRs for better baselines welcome
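One lightweight way to make models drop-in is a small registry of scikit-learn-compatible constructors. This is a hypothetical pattern, not the repo's actual interface:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical registry: any entry just needs fit/predict_proba, so a
# contributed model slots in without touching the rest of the pipeline.
MODEL_REGISTRY = {
    "logistic": lambda: LogisticRegression(max_iter=1000),
    "random_forest": lambda: RandomForestClassifier(n_estimators=100),
}

def build_model(name):
    """Instantiate a fresh model by its registry name."""
    return MODEL_REGISTRY[name]()
```

A config entry such as `model: random_forest` would then map directly to `build_model`, keeping model choice in `configs/experiments.yaml` rather than in code.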