ML and DL based prediction of bladder cancer

The Autophagy Paradox & Diagnostic Challenge

Autophagy is the cell's internal recycling system. Under normal conditions, cells use it to clean up damaged structures and keep things running smoothly. However, bladder cancer cells hijack this very mechanism. In low-nutrient, high-stress tumor environments, cancer cells use autophagy as a shield to survive, multiply, and resist chemotherapy.

Diagnosing these changes early is difficult. Finding the subtle genetic signatures of this cellular hijack among thousands of active genes is like looking for a needle in a haystack.

This project provides an end-to-end clinical machine learning pipeline and an interactive dashboard. It acts as a genomic detective, isolating the top Autophagy-Related Genes (ARGs) to predict bladder cancer risk and explaining which genes drove each prediction in real time.

Key Pipeline & Architectural Highlights

The workflow is built around four design decisions that address common pitfalls of small clinical genomics cohorts:

- Strict Zero-Leakage Pipeline: Small genomic cohorts overfit easily, and any statistic computed on the full dataset leaks information into the evaluation. To prevent this, the pipeline sets aside a stratified holdout test set before feature selection, normalization, or class-imbalance oversampling is fitted. These three steps are then calculated and applied strictly within training folds, so that validation and test results reflect performance on unseen patients.
- Clinical-First Decision Thresholds: In oncology, a false negative (missing a patient's cancer) is the most dangerous error. Model decision thresholds are adjusted to prioritize sensitivity and Negative Predictive Value (NPV), so that a "low risk" prediction can be trusted with high confidence.
- Addressing Imbalance in Genomic Cohorts: The dataset is heavily imbalanced (very few normal samples vs. many tumor samples). The pipeline pairs synthetic class balancing (SMOTE) with strong regularization (an L1/Lasso penalty that prunes redundant genes, and high weight decay in the neural network) to prevent majority-class bias.
- Physician-Friendly Interpretability (SHAP): Instead of presenting clinicians with a "black box" prediction, the portal integrates SHAP explanations. For each patient it generates a waterfall plot showing how much each gene's expression level pushed the risk prediction toward normal or cancer.
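
As an illustration of the clinical-first threshold idea, the following is a minimal sketch rather than the project's actual code. The function name `pick_threshold`, the 0.05 step grid, and the convention that label 1 means tumor are all assumptions made for this example. It returns the highest threshold that still meets a target sensitivity on a set of validation predictions.

```python
import numpy as np

def pick_threshold(y_true, p_tumor, min_sensitivity=1.0):
    """Return the highest threshold whose sensitivity meets the target.

    y_true: 1 = tumor (positive), 0 = normal. p_tumor: predicted P(tumor).
    A sample is called tumor when p_tumor >= threshold.
    """
    y_true = np.asarray(y_true)
    p_tumor = np.asarray(p_tumor, dtype=float)
    n_pos = int((y_true == 1).sum())
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    best = 0.0  # a threshold of 0 flags everyone, so sensitivity is 1
    for t in np.round(np.arange(0.05, 1.0, 0.05), 2):
        tp = int(((p_tumor >= t) & (y_true == 1)).sum())
        if tp / n_pos >= min_sensitivity:
            best = float(t)  # sensitivity only falls as t rises
    return best

# Toy validation predictions (invented numbers, not project results).
print(pick_threshold([1, 1, 1, 0, 0], [0.9, 0.4, 0.12, 0.3, 0.05]))  # 0.1
```

In this sketch the threshold would be chosen on training-fold or validation predictions, so that the holdout set is only used once, for the final report.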

System Pipeline Architecture

```mermaid
flowchart TD
    A[Tumor Samples Excel] & B[Normal Samples Excel] --> C(preprocess.py <br/>Merge & Transpose)
    C --> D[KNN Imputer <br/>Impute Missing Gene Values]
    D --> E[Shuffled Preprocessed Data <br/>preprocessed_data.xlsx]

    E --> F[train.py <br/>Stratified Holdout Split]
    F --> G[Training Pool]
    F --> H[Isolated Holdout Test Set]

    G --> I[5-Fold Stratified CV Loop]
    subgraph KFold_Isolation [Fold-Level Leakage Prevention]
        I --> J[SelectKBest Feature Selection]
        J --> K[StandardScaler Fitting]
        K --> L[SMOTE Class Balancing]
    end

    L --> M[Hyperparameter Tuning & Training <br/>SVM, Lasso, RF, XGBoost, MLP]

    H & M --> O(visualize_metrics.py <br/>Compare Models on Holdout Set)
    O --> P[Generate Performance Plots <br/>roc_curves_comparison.png, metrics_comparison_bar.png]
    O --> N[Save Best Model & Scalers <br/>scaler.pkl, feature_names.pkl, best_model.pkl]

    N --> Q(app.py <br/>Streamlit Diagnostic App)
    Q --> R[User Uploads CSV]
    R --> S[Clinical Prediction & Risk Bands]
    S --> T[Patient-Level SHAP Waterfall Explanations]
```
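
The fold-level isolation in the diagram can be sketched as follows. This is an illustrative outline under stated assumptions, not the contents of train.py: the data is synthetic, `run_cv` and `oversample_minority` are hypothetical helpers, plain random oversampling stands in for SMOTE (which lives in the separate imbalanced-learn package), and a default logistic regression stands in for the tuned models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

def oversample_minority(X, y, rng):
    """Random oversampling: a simple stand-in for SMOTE in this sketch."""
    classes, counts = np.unique(y, return_counts=True)
    keep = [np.arange(len(y))]
    for c, n in zip(classes, counts):
        deficit = counts.max() - n
        if deficit > 0:
            members = np.flatnonzero(y == c)
            keep.append(rng.choice(members, size=deficit, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

def run_cv(X, y, k_features=20, n_splits=5, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Set the holdout aside before anything is fitted.
    X_pool, X_hold, y_pool, y_hold = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    recalls = []
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, va in cv.split(X_pool, y_pool):
        # 2. Fit the selector and scaler on the training fold only.
        selector = SelectKBest(f_classif, k=k_features).fit(X_pool[tr], y_pool[tr])
        scaler = StandardScaler().fit(selector.transform(X_pool[tr]))
        X_tr = scaler.transform(selector.transform(X_pool[tr]))
        X_va = scaler.transform(selector.transform(X_pool[va]))
        # 3. Balance the training fold; the validation fold stays untouched.
        X_bal, y_bal = oversample_minority(X_tr, y_pool[tr], rng)
        model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
        recalls.append(recall_score(y_pool[va], model.predict(X_va)))
    return recalls, (X_hold, y_hold)

# Synthetic imbalanced data (about 10% class 0), for demonstration only.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=100, n_informative=10,
    weights=[0.1, 0.9], random_state=0)
fold_recalls, holdout = run_cv(X_demo, y_demo)
```

The point of the ordering is that the validation fold never influences which features are kept, how they are scaled, or which synthetic samples are created.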

Model Evaluation and Holdout Set Performance

Evaluating the trained models on the isolated holdout test set (86 samples: 4 Normal, 82 Tumor) using clinical, NPV-optimized decision thresholds yields the following performance metrics:

| Classifier Model | Decision Threshold | Recall (Sensitivity) | Specificity | NPV | F2-Score | ROC AUC | Confusion Matrix (TN / FP / FN / TP) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Regularized MLP (DL) | 0.10 | 1.0000 | 0.5000 | 1.0000 | 0.9951 | 0.9085 | TN=2, FP=2, FN=0, TP=82 |
| SVM | 0.15 | 1.0000 | 0.5000 | 1.0000 | 0.9951 | 0.8415 | TN=2, FP=2, FN=0, TP=82 |
| Logistic Regression | 0.05 | 1.0000 | 0.5000 | 1.0000 | 0.9951 | 0.8384 | TN=2, FP=2, FN=0, TP=82 |
| XGBoost | 0.10 | 0.9878 | 0.7500 | 0.7500 | 0.9878 | 0.9543 | TN=3, FP=1, FN=1, TP=81 |
| Random Forest | 0.10 | 0.9756 | 0.7500 | 0.6000 | 0.9780 | 0.8902 | TN=3, FP=1, FN=2, TP=80 |
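
The threshold-dependent columns follow directly from the confusion matrix counts. The short sketch below recomputes them; the function name `clinical_metrics` is hypothetical and chosen only for this example.

```python
def clinical_metrics(tn, fp, fn, tp):
    """Recompute the table's threshold-dependent metrics from raw counts."""
    recall = tp / (tp + fn)        # sensitivity: tumors correctly flagged
    specificity = tn / (tn + fp)   # normals correctly cleared
    # NPV is undefined when nothing is predicted normal.
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    # F-beta with beta = 2, which weights recall more heavily than precision.
    f2 = 5 * tp / (5 * tp + 4 * fn + fp)
    return {
        "recall": round(recall, 4),
        "specificity": round(specificity, 4),
        "npv": round(npv, 4),
        "f2": round(f2, 4),
    }

# XGBoost row: TN=3, FP=1, FN=1, TP=81
print(clinical_metrics(tn=3, fp=1, fn=1, tp=81))
# {'recall': 0.9878, 'specificity': 0.75, 'npv': 0.75, 'f2': 0.9878}
```

With only 4 normal samples in the holdout set, specificity and NPV move in large steps: a single reclassified normal sample shifts specificity by 0.25.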

Qualitative Highlights

- Regularized MLP (DL) & SVM: Leave no cancer case in the holdout set undetected (Recall of 1.0000, NPV of 1.0000), at the cost of a Specificity of 0.5000 (2 of the 4 normal samples flagged as tumor).
- XGBoost: Delivers robust tree-based classification with the highest overall discriminative performance (ROC AUC of 0.9543), combining a Recall of 0.9878 with a Specificity of 0.7500.
- Logistic Regression: With an L1 (Lasso) penalty, prunes redundant gene features by forcing their coefficients to zero, highlighting the most informative diagnostic signals. It matches the MLP and SVM on Recall and NPV.
- Random Forest: Aggregates predictions across a forest of decision trees to reduce model variance.

Streamlit Portal Features

The interactive web portal brings the pipeline to life for clinicians:

- Batch Patient Upload: Drag-and-drop a CSV file containing patient gene expression values.
- Interactive Patient Analysis: Select any patient from a dropdown to trigger a local SHAP explanation.
- Genomic Driver Summaries: View customized oncology literature descriptions explaining the biological roles of top contributing genes (e.g., SYNPO2 cytoskeletal regulation, FOS proto-oncogene activation, DCN tumor suppression).
- Cohort Global Insights: Toggle interactive global plots to see which genes are the overall drivers of cancer risk across the entire cohort.
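
To show what a patient-level SHAP waterfall encodes, here is a hand-computed sketch for the simplest case, a linear model with features treated as independent. In that case each gene's contribution is its coefficient times the patient's deviation from the background mean, and the contributions plus the base value add up to the model's raw output. The function `linear_shap`, the gene labels, and all numbers are invented for illustration; the portal itself relies on the SHAP library, whose exact usage is not shown in this README.

```python
import numpy as np

def linear_shap(coef, intercept, background, x):
    """SHAP values for a linear model f(x) = intercept + coef . x.

    Assumes independent features, so phi_i = coef_i * (x_i - mean_i).
    Returns (base_value, contributions) in the model's raw output space
    (log-odds for logistic regression).
    """
    coef = np.asarray(coef, dtype=float)
    mean = np.asarray(background, dtype=float).mean(axis=0)
    base_value = float(intercept + coef @ mean)
    contributions = coef * (np.asarray(x, dtype=float) - mean)
    return base_value, contributions

# Invented example: two placeholder genes, two background patients.
genes = ["GENE_A", "GENE_B"]
base, phi = linear_shap(
    coef=[0.8, -0.5], intercept=0.2,
    background=[[1.0, 2.0], [3.0, 4.0]], x=[4.0, 1.0])
for gene, value in zip(genes, phi):
    direction = "toward cancer" if value > 0 else "toward normal"
    print(f"{gene}: {value:+.2f} ({direction})")
# Additivity: the base value plus all contributions equals the raw output.
assert np.isclose(base + phi.sum(), 0.2 + 0.8 * 4.0 - 0.5 * 1.0)
```

A waterfall plot draws exactly these quantities: it starts at the base value and stacks one bar per gene until it reaches the patient's prediction.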

Installation & Setup

Verify Raw Data Files: Ensure the following raw data files are in your working directory:
- Tumor_samples_408_filtered_quantile_0.25_last_step (1).xlsx
- Normal_samples_19_filtered_quantile_0.25_last_step (2).xlsx

Install Required Packages: Install the required dependencies from the project root:
```
pip install -r requirements.txt
```

Step-by-Step Running Guide

Step 1: Preprocess the Genomic Matrices

Align the datasets, transpose them so that patients are rows, and impute missing gene values.

```
python preprocess.py
```

Generated output: preprocessed_data.xlsx
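
The preprocessing step can be pictured with the sketch below. It is not the contents of preprocess.py: the function `build_matrix`, the `label` column (1 = tumor, 0 = normal), the neighbor count, and the assumption that the raw tables hold genes as rows and patients as columns are all illustrative choices. Small in-memory frames take the place of the Excel files.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def build_matrix(tumor, normal, n_neighbors=5, seed=42):
    """Merge, transpose, impute, and shuffle two expression tables.

    tumor / normal: DataFrames with genes as rows and patients as
    columns (an assumed layout).
    """
    # Align: keep only genes present in both tables.
    genes = tumor.index.intersection(normal.index)
    # Transpose so that each patient becomes one row.
    t, n = tumor.loc[genes].T, normal.loc[genes].T
    X = pd.concat([t, n])
    # Genes with no observed value at all cannot be imputed.
    X = X.dropna(axis=1, how="all")
    labels = pd.Series([1] * len(t) + [0] * len(n), index=X.index)
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(X)
    out = pd.DataFrame(imputed, index=X.index, columns=X.columns)
    out["label"] = labels
    # Shuffle rows so tumor and normal samples are interleaved.
    return out.sample(frac=1.0, random_state=seed)

# Toy tables with invented values.
tumor_df = pd.DataFrame(
    {"T1": [1.0, 2.0, 3.0], "T2": [1.5, np.nan, 3.5], "T3": [0.5, 2.5, 2.5]},
    index=["G1", "G2", "G3"])
normal_df = pd.DataFrame(
    {"N1": [5.0, 6.0, 7.0, 1.0], "N2": [5.5, 6.5, np.nan, 2.0]},
    index=["G1", "G2", "G3", "G4"])
matrix = build_matrix(tumor_df, normal_df, n_neighbors=2)
```

KNN imputation does not use the labels, but in this sketch it still sees every sample. A stricter variant would fit the imputer on the training pool only and apply it to the holdout set afterwards.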

Step 2: Run the Training & CV Pipeline

Execute nested cross-validation, feature selection, scaling, class balancing, and model training.

```
python train.py
```

Generated outputs: feature_names.pkl, scaler.pkl, best_model.pkl, XGBoost_best_model.pkl, model_metadata.json, holdout_test_patients.csv, and holdout_test_labels.csv.

Step 3: Compare Models & Save Comparison Charts

Evaluate models on the holdout test set under optimized decision thresholds and output performance charts.

```
python visualize_metrics.py
```

Generated outputs: roc_curves_comparison.png and metrics_comparison_bar.png

Step 4: Launch the Streamlit Interactive Portal

```
streamlit run app.py
```

Upload the generated holdout_test_patients.csv (or any patient gene expression profile CSV) to test predictions and review patient SHAP waterfalls.
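
When uploading a custom CSV, its columns need to match the genes the model was trained on. The sketch below shows one way such a check could look. It is a hypothetical helper, not code from app.py, and it assumes that feature_names.pkl holds an ordered list of gene names.

```python
import pandas as pd

def align_to_training_features(uploaded, feature_names):
    """Reorder an uploaded expression table to the training gene order.

    Raises ValueError if any expected gene column is missing. Extra
    columns in the upload are dropped.
    """
    missing = [g for g in feature_names if g not in uploaded.columns]
    if missing:
        raise ValueError(f"uploaded CSV is missing genes: {missing}")
    return uploaded.loc[:, list(feature_names)]

# Toy upload with invented values and placeholder gene names.
upload = pd.DataFrame({"GENE_B": [1.2, 0.7], "GENE_A": [3.4, 2.9], "notes": ["x", "y"]})
aligned = align_to_training_features(upload, ["GENE_A", "GENE_B"])
print(list(aligned.columns))  # ['GENE_A', 'GENE_B']
```

Checking the columns before scaling and prediction gives a clear error message instead of a shape mismatch deep inside the model.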
