Autophagy is the cell's internal recycling system. Under normal conditions, cells use it to clean up damaged structures and keep things running smoothly. However, bladder cancer cells hijack this very mechanism. In low-nutrient, high-stress tumor environments, cancer cells use autophagy as a shield to survive, multiply, and resist chemotherapy.
Diagnosing these changes early is difficult. Finding the subtle genetic signatures of this cellular hijack among thousands of active genes is like looking for a needle in a haystack.
This project provides an end-to-end clinical machine learning pipeline and interactive dashboard. It acts as a genomic detective, isolating the top Autophagy-Related Genes (ARGs) to predict bladder cancer risk and explain the biological drivers behind each prediction in real time.
This workflow is engineered to solve clinical-grade data challenges with rigorous, industry-standard methodologies:
- Strict Zero-Leakage Pipeline: In genomics, small cohorts are prone to overfitting. To prevent this, our pipeline sets aside a stratified holdout test set before any feature selection, normalization, or oversampling is fitted. These steps are then calculated and applied strictly within training folds, so that validation and test results reflect performance on unseen patients.
- Clinical-First Decision Thresholds: In oncology, a False Negative (missing a patient's cancer) is dangerous. We adjust model decision thresholds to prioritize Sensitivity and Negative Predictive Value (NPV). This design ensures that patients predicted as "low risk" can be cleared with high confidence, reducing diagnostic error.
- Addressing Imbalance in Genomic Cohorts: Faced with an imbalanced dataset (very few normal samples vs. many tumor samples), the pipeline uses synthetic class balancing (SMOTE) paired with strong regularization (such as an L1 Lasso penalty to prune redundant genes, and high weight-decay penalties in neural networks) to prevent majority-class bias.
- Physician-Friendly Interpretability (SHAP): Instead of presenting clinicians with a "black box" prediction, the portal integrates SHAP explainability. This generates patient-specific waterfall plots showing how much each gene's expression level pushed the risk prediction toward normal or cancer.
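
To make the synthetic class balancing mentioned in the list above more concrete, here is a minimal NumPy sketch of the core SMOTE idea: each synthetic sample is an interpolation between a minority-class sample and one of its nearest minority-class neighbours. The function name `smote_sketch`, the toy matrix, and the value `k=3` are assumptions for illustration only; this is a simplified stand-in, not the library implementation the project may use.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest per sample

    base = rng.integers(0, len(X_min), size=n_new)             # starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour for each
    gap = rng.random((n_new, 1))                               # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical toy data: 5 "normal" samples measured on 4 genes.
X_normal = np.random.default_rng(42).normal(size=(5, 4))
X_synth = smote_sketch(X_normal, n_new=20)
print(X_synth.shape)
```

Because every synthetic row lies on a segment between two real minority samples, oversampling has to happen after the train/validation split; otherwise information from validation samples would be blended into the training data.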
flowchart TD
A[Tumor Samples Excel] & B[Normal Samples Excel] --> C(preprocess.py <br/>Merge & Transpose)
C --> D[KNN Imputer <br/>Impute Missing Gene Values]
D --> E[Shuffled Preprocessed Data <br/>preprocessed_data.xlsx]
E --> F[train.py <br/>Stratified Holdout Split]
F --> G[Training Pool]
F --> H[Isolated Holdout Test Set]
G --> I[5-Fold Stratified CV Loop]
subgraph KFold_Isolation [Fold-Level Leakage Prevention]
I --> J[SelectKBest Feature Selection]
J --> K[StandardScaler Fitting]
K --> L[SMOTE Class Balancing]
end
L --> M[Hyperparameter Tuning & Training <br/>SVM, Lasso, RF, XGBoost, MLP]
H & M --> O(visualize_metrics.py <br/>Compare Models on Holdout Set)
O --> P[Generate Performance Plots <br/>roc_curves_comparison.png, metrics_comparison_bar.png]
O --> N[Save Best Model & Scalers <br/>scaler.pkl, feature_names.pkl, best_model.pkl]
N --> Q(app.py <br/>Streamlit Diagnostic App)
Q --> R[User Uploads CSV]
R --> S[Clinical Prediction & Risk Bands]
S --> T[Patient-Level SHAP Waterfall Explanations]
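
The fold-level isolation shown in the flowchart can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions: the data is synthetic (generated with `make_classification`), the values `k=20` and `test_size=0.2` are placeholders, and a plain logistic regression stands in for the tuned models. SMOTE is left out here to keep the dependencies to scikit-learn; with imbalanced-learn installed it would typically be added as one more step in that library's own `Pipeline` class so that it also runs only on training folds.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix (NOT the project's data).
X, y = make_classification(n_samples=400, n_features=200, n_informative=10,
                           weights=[0.1, 0.9], random_state=0)

# 1. Set the stratified holdout aside before anything is fitted.
X_pool, X_hold, y_pool, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 2. Bundle selection, scaling, and the model so that every CV fold refits
#    all of them on its own training portion only.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 3. Stratified 5-fold cross-validation on the training pool.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_auc = cross_val_score(pipe, X_pool, y_pool, cv=cv, scoring="roc_auc")
print("Per-fold ROC AUC:", fold_auc.round(3))

# 4. Refit on the whole pool and touch the holdout once, at the very end.
pipe.fit(X_pool, y_pool)
print("Holdout accuracy:", pipe.score(X_hold, y_hold))
```

Wrapping the steps in one pipeline object is what prevents leakage: `cross_val_score` clones and refits the whole chain per fold, so the selector and scaler never see the validation rows.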
Evaluating the trained models on the isolated holdout test set (86 samples: 4 Normal, 82 Tumor) using clinical, NPV-optimized decision thresholds yields the following performance metrics:
| Classifier Model | Decision Threshold | Recall (Sensitivity) | Specificity | NPV | F2-Score | ROC AUC | Confusion Matrix (TN / FP / FN / TP) |
|---|---|---|---|---|---|---|---|
| Regularized MLP (DL) | 0.10 | 1.0000 | 0.5000 | 1.0000 | 0.9951 | 0.9085 | TN=2, FP=2, FN=0, TP=82 |
| SVM | 0.15 | 1.0000 | 0.5000 | 1.0000 | 0.9951 | 0.8415 | TN=2, FP=2, FN=0, TP=82 |
| Logistic Regression | 0.05 | 1.0000 | 0.5000 | 1.0000 | 0.9951 | 0.8384 | TN=2, FP=2, FN=0, TP=82 |
| XGBoost | 0.10 | 0.9878 | 0.7500 | 0.7500 | 0.9878 | 0.9543 | TN=3, FP=1, FN=1, TP=81 |
| Random Forest | 0.10 | 0.9756 | 0.7500 | 0.6000 | 0.9780 | 0.8902 | TN=3, FP=1, FN=2, TP=80 |
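
The derived columns in the table follow directly from the confusion-matrix counts. The short sketch below recomputes them using the standard definitions (the F2-score weights recall four times as heavily as precision); the helper name `clinical_metrics` is an assumption for illustration.

```python
def clinical_metrics(tn, fp, fn, tp):
    """Recompute the table's derived metrics from raw confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),              # sensitivity: tumors correctly flagged
        "specificity": tn / (tn + fp),         # normals correctly cleared
        "npv": tn / (tn + fn),                 # how trustworthy a "low risk" call is
        "f2": 5 * tp / (5 * tp + 4 * fn + fp)  # F-beta with beta = 2
    }

# Counts copied from the table above as (TN, FP, FN, TP).
rows = {
    "Regularized MLP (DL)": (2, 2, 0, 82),
    "XGBoost": (3, 1, 1, 81),
    "Random Forest": (3, 1, 2, 80),
}

for name, counts in rows.items():
    metrics = clinical_metrics(*counts)
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

With only four normal samples in the holdout set, each normal sample moves specificity by 0.25, which is why the specificity column only takes the values 0.5000 and 0.7500.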
- Regularized MLP (DL) & SVM: Miss no cancer cases on the holdout set (Recall of 1.0000, NPV of 1.0000), at the cost of two false positives among the four normal samples (Specificity of 0.5000).
- XGBoost: Delivers the highest overall discriminative performance (ROC AUC of 0.9543) and higher specificity (0.7500), but misses one tumor case (Recall of 0.9878, NPV of 0.7500).
- Logistic Regression: Matches the MLP and SVM confusion matrix. Its L1 (Lasso) penalty prunes redundant gene features by forcing their coefficients to zero, highlighting the most informative diagnostic signals.
- Random Forest: Aggregates predictions across a forest of decision trees to reduce model variance, but has the lowest recall here (0.9756, two missed tumor cases).
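
The source does not describe how the decision thresholds in the table were chosen. One plausible approach, sketched below in plain Python, is to scan a grid of thresholds on validation predictions and keep the highest one that still meets a target sensitivity. The function name `pick_threshold`, the 0.05 grid, and the toy scores are all assumptions for illustration.

```python
def pick_threshold(y_true, y_prob, min_recall=1.0):
    """Return the highest grid threshold whose tumor recall is at least min_recall.

    Labels use 1 = tumor, 0 = normal; y_prob is the predicted tumor probability.
    Returns None if no grid threshold reaches the target.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one tumor sample to compute recall")
    best = None
    for step in range(1, 20):              # grid: 0.05, 0.10, ..., 0.95
        thr = step / 20
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= thr)
        if tp / n_pos >= min_recall:
            best = thr                     # recall only falls as thr rises, so keep the last hit
    return best

# Hypothetical validation predictions: four tumors, two normals.
y_true = [1, 1, 1, 1, 0, 0]
y_prob = [0.90, 0.80, 0.37, 0.22, 0.30, 0.05]

print(pick_threshold(y_true, y_prob))                   # strictest setting: miss no tumor
print(pick_threshold(y_true, y_prob, min_recall=0.75))  # tolerate one missed tumor
```

Choosing the threshold on validation folds rather than on the holdout set keeps the holdout metrics an honest estimate, in line with the zero-leakage design described earlier.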
The interactive web portal brings the pipeline to life for clinicians:
- Batch Patient Upload: Drag-and-drop a CSV file containing patient gene expression values.
- Interactive Patient Analysis: Select any patient from a dropdown to trigger a local SHAP explanation.
- Genomic Driver Summaries: View customized oncology literature descriptions explaining the biological roles of top contributing genes (e.g. SYNPO2 for cytoskeletal regulation, FOS for proto-oncogene activation, DCN for tumor suppression).
- Cohort Global Insights: Toggle interactive global plots to see which genes are the overall drivers of cancer risk across the entire cohort.
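
To give a feel for what a patient-level waterfall plot displays, the sketch below computes per-gene contributions by hand for the simplest case: a linear model in log-odds space with features treated as independent, where each contribution is the weight times the patient's deviation from the background mean. All weights, means, and expression values are hypothetical numbers chosen for illustration; they are not taken from the trained models, and the portal itself relies on the SHAP library rather than this hand calculation.

```python
# Hypothetical model parameters in log-odds units (illustrative only).
weights = {"SYNPO2": -0.8, "FOS": 0.5, "DCN": -0.6}
background_mean = {"SYNPO2": 0.0, "FOS": 0.0, "DCN": 0.0}  # scaled cohort averages
intercept = 1.2

# Hypothetical scaled expression values for one patient.
patient = {"SYNPO2": -1.5, "FOS": 0.4, "DCN": -0.5}

def linear_contributions(x):
    """Per-gene contribution: weight times deviation from the background mean."""
    return {gene: weights[gene] * (x[gene] - background_mean[gene]) for gene in weights}

# Base value: the model output for an "average" patient.
base_value = intercept + sum(weights[g] * background_mean[g] for g in weights)
phi = linear_contributions(patient)
model_output = intercept + sum(weights[g] * patient[g] for g in weights)

# Waterfall order: largest absolute contribution first.
for gene, value in sorted(phi.items(), key=lambda item: -abs(item[1])):
    print(f"{gene:7s} {value:+.2f}")
print(f"base {base_value:+.2f} -> output {model_output:+.2f}")
```

The contributions always sum to the difference between the patient's output and the base value, which is the additivity property that makes a waterfall plot read as a step-by-step path from the cohort average to the individual prediction.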
- Verify Raw Data Files: Ensure the following raw data files are in your working directory:
  - Tumor_samples_408_filtered_quantile_0.25_last_step (1).xlsx
  - Normal_samples_19_filtered_quantile_0.25_last_step (2).xlsx
- Install Required Packages: Install the required dependencies from the project root:

  `pip install -r requirements.txt`
Align datasets, transpose patients to rows, and impute missing genes.

`python preprocess.py`

Generated output: preprocessed_data.xlsx
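
The preprocessing step can be pictured with the following minimal sketch. It assumes the raw tables store genes as rows and samples as columns (which is why a transpose is needed) and uses small hypothetical in-memory tables instead of the Excel files; the sample names, values, and `n_neighbors=2` are illustrative assumptions, not the contents of `preprocess.py`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy tables: genes as rows, samples as columns, with gaps.
tumor = pd.DataFrame(
    {"T1": [5.1, np.nan, 2.0], "T2": [4.8, 7.2, 2.4], "T3": [5.5, 6.9, np.nan]},
    index=["FOS", "DCN", "SYNPO2"],
)
normal = pd.DataFrame(
    {"N1": [3.0, 9.1, 6.2], "N2": [3.3, np.nan, 6.0]},
    index=["DCN", "FOS", "SYNPO2"],  # different gene order on purpose
)

# Align on shared genes, then transpose so that each row is one patient.
merged = pd.concat([tumor, normal], axis=1, join="inner").T
labels = pd.Series([1] * tumor.shape[1] + [0] * normal.shape[1],
                   index=merged.index, name="label")  # 1 = tumor, 0 = normal

# Fill each missing value from the most similar patients.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(merged),
                       index=merged.index, columns=merged.columns)

# Shuffle rows so tumor and normal samples are interleaved.
shuffled = imputed.join(labels).sample(frac=1.0, random_state=0)
print(shuffled)
```

Concatenating on the gene index rather than on row position guards against the two spreadsheets listing genes in a different order.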
Execute nested cross-validation, feature selection, scaling, class balancing, and model training.

`python train.py`

Generated outputs: feature_names.pkl, scaler.pkl, best_model.pkl, XGBoost_best_model.pkl, model_metadata.json, holdout_test_patients.csv, and holdout_test_labels.csv.
Evaluate models on the holdout test set under optimized decision thresholds and output performance charts.

`python visualize_metrics.py`

Generated outputs: roc_curves_comparison.png and metrics_comparison_bar.png
`streamlit run app.py`

Upload the generated holdout_test_patients.csv (or any patient gene expression profile CSV) to test predictions and review patient SHAP waterfalls.