A machine learning system for early sepsis risk prediction using ICU clinical data from the PhysioNet Challenge 2019. The model achieves ROC-AUC 0.8891 with full SHAP explainability — identifying which clinical features drove each individual patient's risk score.
- Overview
- Dataset
- Feature Engineering
- Model Performance
- SHAP Explainability
- Model Comparison
- Known Limitations
- How to Reproduce
- Repository Structure
- Technologies Used
- Disclaimer
Sepsis is a life-threatening condition affecting millions of ICU patients annually. Early detection is critical — every hour of delayed treatment increases mortality risk significantly.
This project builds a binary classifier to predict sepsis risk from aggregated ICU vital signs and laboratory values. The pipeline includes:
- Feature engineering from raw hourly time-series PSV files
- Leakage detection and removal
- XGBoost training with early stopping and class imbalance handling
- Threshold tuning using F2 score (prioritises recall for clinical screening)
- SHAP explainability including individual patient waterfall plots
- 3-model comparison against clinical baselines
Source: PhysioNet Computing in Cardiology Challenge 2019 — Early Prediction of Sepsis from Clinical Data
Reyna M A et al. Early prediction of sepsis from clinical data: the PhysioNet/Computing in Cardiology Challenge 2019. Critical Care Medicine, 2020. https://physionet.org/content/challenge-2019/
| Property | Value |
|---|---|
| Total patients | 40,336 |
| Sepsis patients | 2,932 (7.3%) |
| Non-sepsis patients | 37,404 (92.7%) |
| Class imbalance ratio | 12.8 : 1 |
| Missing values (after engineering) | 0 |
| Format | Hourly PSV files, one per patient |
The 7.3% sepsis prevalence reflects realistic ICU conditions — the dataset was not artificially balanced.
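A patient file in this format can be read with pandas by setting the separator to a pipe. The sketch below uses a tiny in-memory stand-in rather than a real file. The column subset (`HR`, `Temp`, `SepsisLabel`), the helper name `load_patient`, and the rule that a patient counts as septic if any hourly label is 1 are assumptions for illustration, not necessarily what the notebook does.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for one patient's .psv file (illustrative values only).
psv_text = "HR|Temp|SepsisLabel\n88|36.9|0\n95|NaN|0\n104|38.4|1\n"


def load_patient(source) -> pd.DataFrame:
    """Read one pipe-separated patient file into an hourly DataFrame."""
    return pd.read_csv(source, sep="|")


hourly = load_patient(io.StringIO(psv_text))

# Assumed patient-level label: septic if any hour is labelled 1.
patient_label = int(hourly["SepsisLabel"].max())
```

Passing a file path instead of the `StringIO` object works the same way.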
Each patient's hourly time-series was collapsed into a single feature vector capturing statistical summaries and clinical indicators.
Vital signs processed: HR, O2Sat, Temperature, SBP, DBP, MAP, Respiratory Rate
For each vital sign, the following were extracted:
| Feature type | Description |
|---|---|
| `_mean` | Average over ICU stay |
| `_std` | Variability |
| `_min` / `_max` | Extremes |
| `_last` | Most recent recorded value |
| `_trend` | Linear slope (improving or deteriorating) |
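As a minimal sketch of how one vital sign might be collapsed into these summaries, the function below computes each statistic from an hourly series indexed by hour. The helper name `summarise_vital`, the use of a least-squares slope for `_trend`, the sample standard deviation, and the handling of empty or single-reading series are all assumptions; the notebook may differ on each point.

```python
import numpy as np
import pandas as pd

STATS = ("mean", "std", "min", "max", "last", "trend")


def summarise_vital(series: pd.Series, name: str) -> dict:
    """Collapse one hourly vital-sign series into per-stay summary features."""
    values = series.dropna()
    if values.empty:
        # No readings at all: leave every summary missing for later imputation.
        return {f"{name}_{stat}": np.nan for stat in STATS}
    if len(values) >= 2:
        # Least-squares slope against the hour index (units per hour).
        hours = values.index.to_numpy(dtype=float)
        trend = float(np.polyfit(hours, values.to_numpy(dtype=float), 1)[0])
    else:
        trend = 0.0  # assumption: a single reading has no trend
    return {
        f"{name}_mean": float(values.mean()),
        f"{name}_std": float(values.std()),  # pandas default: sample std (ddof=1)
        f"{name}_min": float(values.min()),
        f"{name}_max": float(values.max()),
        f"{name}_last": float(values.iloc[-1]),
        f"{name}_trend": trend,
    }


# Hour 3 is missing; the remaining readings rise by 10 bpm per hour.
hr = pd.Series([80.0, 90.0, 100.0, np.nan, 120.0])
features = summarise_vital(hr, "HR")
```

Fitting the slope against the hour index rather than the reading count keeps the trend in units per hour even when some hours are missing.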
Derived clinical features:
| Feature | Clinical meaning |
|---|---|
| `ShockIndex_mean` | HR / SBP, elevated in circulatory shock |
| `Temp_fever_frac` | Fraction of time with fever (>38°C) |
| `Temp_low_frac` | Fraction of time with hypothermia (<36°C) |
| `MAP_low_frac` | Fraction of time MAP < 65 mmHg |
| `SBP_low_frac` | Fraction of time SBP < 90 mmHg |
| `O2_below_92_frac` | Fraction of time O2Sat < 92% |
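The derived features in the table could be computed along the following lines. This is a sketch under assumptions: the input column names (`HR`, `SBP`, `MAP`, `Temp`, `O2Sat`) follow the raw PSV naming, the helper names are hypothetical, and each fraction is taken over hours where the vital was actually recorded, which may not match the notebook's choice of denominator.

```python
import numpy as np
import pandas as pd


def frac_of_observed(condition: pd.Series, readings: pd.Series) -> float:
    """Fraction of non-missing hours on which the condition holds."""
    n_observed = int(readings.notna().sum())
    # Comparisons against NaN are False, so missing hours never count as hits.
    return float(condition.sum() / n_observed) if n_observed else 0.0


def derived_features(hourly: pd.DataFrame) -> dict:
    """Compute the mean shock index and threshold-based fractions for one patient."""
    shock_index = hourly["HR"] / hourly["SBP"]
    return {
        "ShockIndex_mean": float(shock_index.mean()),
        "Temp_fever_frac": frac_of_observed(hourly["Temp"] > 38.0, hourly["Temp"]),
        "Temp_low_frac": frac_of_observed(hourly["Temp"] < 36.0, hourly["Temp"]),
        "MAP_low_frac": frac_of_observed(hourly["MAP"] < 65.0, hourly["MAP"]),
        "SBP_low_frac": frac_of_observed(hourly["SBP"] < 90.0, hourly["SBP"]),
        "O2_below_92_frac": frac_of_observed(hourly["O2Sat"] < 92.0, hourly["O2Sat"]),
    }


# Four illustrative hours for one patient (not real data).
hourly = pd.DataFrame({
    "HR": [90.0, 100.0, 110.0, 120.0],
    "SBP": [120.0, 100.0, 88.0, 80.0],
    "MAP": [70.0, 64.0, 60.0, 66.0],
    "Temp": [37.0, 38.5, np.nan, 39.0],
    "O2Sat": [95.0, 91.0, 93.0, 90.0],
})
derived = derived_features(hourly)
```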
Lab values: WBC, Creatinine, Glucose, Platelets, Fibrinogen
Columns removed before training:
| Column | Reason |
|---|---|
| `ICULOS_last`, `ICULOS_bucket` | Data leakage: ICU stay length is caused by sepsis, not a predictor |
| `Age_group` | Redundant: derived directly from Age |
| `Unit1`, `Unit2` | Administrative bias: hospital-specific ICU routing, not physiology |
| `Temp_missing_frac`, `PulsePressure_mean`, `Platelets_last`, `Platelets_measured` | Near-zero correlation with sepsis label |
Final feature count: 50
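Removing the columns listed above is a one-line operation in pandas. The sketch below shows one way to do it; the constant `EXCLUDED_COLUMNS` and the function `drop_excluded` are hypothetical names, and the toy frame stands in for the engineered feature table.

```python
import pandas as pd

# Column names taken from the removal table above.
EXCLUDED_COLUMNS = [
    "ICULOS_last", "ICULOS_bucket",  # leakage
    "Age_group",                     # redundant with Age
    "Unit1", "Unit2",                # administrative routing
    "Temp_missing_frac", "PulsePressure_mean",
    "Platelets_last", "Platelets_measured",  # near-zero correlation
]


def drop_excluded(features: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the feature table without the excluded columns."""
    present = [col for col in EXCLUDED_COLUMNS if col in features.columns]
    return features.drop(columns=present)


# Toy feature table with two excluded columns and two kept ones.
toy = pd.DataFrame({
    "Age": [61, 74],
    "Age_group": ["60-70", "70-80"],
    "ICULOS_last": [40, 12],
    "Temp_max": [38.6, 37.1],
})
cleaned = drop_excluded(toy)
```

Filtering to the columns that are present means the function can be re-run on an already cleaned table without raising an error.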
| Metric | Value |
|---|---|
| Test ROC-AUC | 0.8891 |
| Validation ROC-AUC | 0.8931 |
| Threshold (F2-optimised) | 0.3355 |
| Recall — Sepsis | 72.7% |
| Precision — Sepsis | 35.6% |
| False alarm rate | 10.3% |
| Best iteration (early stopping) | 416 |
Threshold tuning: The default threshold of 0.5 is poorly suited to medical screening on imbalanced data. The threshold was instead chosen to maximise the F2 score (F-beta with β = 2, which treats recall as twice as important as precision), reflecting the clinical priority of catching sepsis cases over minimising false alarms.
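One way to carry out this kind of F2-based threshold search is to sweep the candidate thresholds returned by scikit-learn's `precision_recall_curve` and keep the one with the highest F-beta. The sketch below runs on synthetic scores, not the project's predictions; the function name `f2_optimal_threshold` is hypothetical, and the notebook may search differently (for example over a fixed grid).

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def f2_optimal_threshold(y_true, y_score, beta=2.0):
    """Return (threshold, f_beta) maximising F-beta over candidate thresholds."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final precision/recall pair has no matching threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    b2 = beta ** 2
    denom = b2 * precision + recall
    # Guard against 0/0 when both precision and recall are zero.
    safe_denom = np.where(denom > 0, denom, 1.0)
    f_beta = np.where(denom > 0, (1 + b2) * precision * recall / safe_denom, 0.0)
    best = int(np.argmax(f_beta))
    return float(thresholds[best]), float(f_beta[best])


# Synthetic, illustrative scores only (roughly 7% positives).
rng = np.random.default_rng(0)
y_true = (rng.random(5000) < 0.073).astype(int)
y_score = np.clip(0.15 + 0.35 * y_true + rng.normal(0, 0.15, 5000), 0, 1)

threshold, f2 = f2_optimal_threshold(y_true, y_score)
```

A patient is then flagged when the predicted probability is at or above `threshold`. Tuning on the validation split rather than the test split keeps the reported test metrics unbiased.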
Class imbalance: Handled using scale_pos_weight = 12.76 (ratio of negative to positive cases in training set).
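The weight is simply the count of negatives divided by the count of positives. As a worked check, the sketch below applies that ratio to the dataset-level counts from the table above (37,404 non-sepsis, 2,932 sepsis). The reported 12.76 was computed on the training split, which the stratified split keeps at nearly the same ratio. The function name is hypothetical.

```python
import numpy as np


def negative_to_positive_ratio(labels) -> float:
    """Ratio of negative to positive cases, as used for scale_pos_weight."""
    labels = np.asarray(labels)
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive cases: ratio is undefined")
    return n_neg / n_pos


# Dataset-level counts from the dataset table (not the training split itself).
labels = np.concatenate([np.ones(2932, dtype=int), np.zeros(37404, dtype=int)])
weight = negative_to_positive_ratio(labels)

# The value would then be passed to the classifier, for example
# xgboost.XGBClassifier(scale_pos_weight=weight).
```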
SHAP (SHapley Additive exPlanations) was used to explain individual patient predictions. For medical AI, explainability is essential — clinicians need to know why a patient was flagged, not just that they were.

Top predictors by mean |SHAP value|:
| Rank | Feature | Clinical meaning |
|---|---|---|
| 1 | `Temp_max` | Peak temperature; fever is a core sepsis criterion |
| 2 | `Temp_last` | Most recent temperature reading |
| 3 | `Creatinine_missing_frac` | Informative missingness; ordered less for stable patients |
| 4 | `Creatinine_last` | Kidney function; deteriorates in sepsis |
| 5 | `WBC_last` | White blood cell count; immune response marker |
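A ranking by mean |SHAP value| averages the absolute contribution of each feature across patients. The sketch below does this with NumPy on a toy matrix. In the real pipeline the matrix would come from a SHAP explainer (for instance `shap.TreeExplainer(model).shap_values(X_test)`, an assumption about how the notebook obtains it), and the function name `rank_by_mean_abs_shap` is hypothetical.

```python
import numpy as np


def rank_by_mean_abs_shap(shap_values, feature_names, top_k=5):
    """Rank features by mean absolute SHAP value across patients."""
    # shap_values has shape (n_patients, n_features).
    importance = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    order = np.argsort(importance)[::-1][:top_k]
    return [(feature_names[i], float(importance[i])) for i in order]


# Toy SHAP matrix: 3 patients x 3 features (illustrative values only).
toy_shap = [
    [0.5, -0.1, 0.0],
    [-0.7, 0.2, 0.1],
    [0.6, -0.3, -0.1],
]
names = ["Temp_max", "WBC_last", "HR_mean"]
ranking = rank_by_mean_abs_shap(toy_shap, names)
```

Taking the absolute value first matters: a feature that pushes risk strongly up for some patients and strongly down for others would otherwise average out to near zero.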
Case study — highest risk patient (predicted probability: 0.999):
The model correctly flagged a sepsis patient. The prediction was driven by elevated temperature (Temp_last = 38.6°C, +1.31 SHAP), peak temperature (Temp_max = 38.6°C, +0.85), a declining MAP trend (+0.36), and low O2_min (+0.43). Every one of these features pointed in the same direction: a textbook sepsis presentation.
Case study — missed patient (predicted probability: 0.204):
SHAP analysis indicates this patient presented with hypothermic sepsis: Temp_last = 35.9°C pushed risk down (-0.26), HR_max = 66 pushed risk down (-0.28), and a declining respiratory-rate trend pushed risk down (-0.33). The model most likely missed this case because typical fever-driven presentations dominate the training data. Cold sepsis in elderly or immunocompromised patients is a known clinical challenge.
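To give a feel for the size of these contributions, the back-of-the-envelope sketch below asks what the predicted probability would have been without the three negative contributions. It assumes the SHAP values are in log-odds units and add directly to the model's raw margin (the usual default for tree explainers on a logistic objective, but not confirmed for this notebook), and it holds every other contribution fixed. It is an illustration, not a causal statement.

```python
import math


def logit(p: float) -> float:
    """Convert a probability to log-odds."""
    return math.log(p / (1 - p))


def sigmoid(z: float) -> float:
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))


THRESHOLD = 0.3355  # F2-optimised threshold from the performance table
predicted = 0.204   # predicted probability for the missed patient

# Reported SHAP contributions: Temp_last, HR_max, respiratory-rate trend.
negative_contributions = [-0.26, -0.28, -0.33]

# Assumption: contributions are additive in log-odds space.
margin = logit(predicted)
margin_without = margin - sum(negative_contributions)
prob_without = sigmoid(margin_without)  # roughly 0.38
flagged_without = prob_without >= THRESHOLD
```

Under these assumptions the three features alone account for the gap between the prediction and the decision threshold.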
All three models were trained on the identical split with F2-optimised thresholds.
| Model | ROC-AUC | Recall | Precision |
|---|---|---|---|
| Logistic Regression | 0.8053 | 70.5% | 20.1% |
| Random Forest | 0.8480 | 75.0% | 23.8% |
| XGBoost (selected) | 0.8891 | 72.7% | 35.6% |
Why XGBoost over Random Forest: Random Forest achieves recall 2.3 percentage points higher (75.0% vs 72.7%), but XGBoost's precision is 11.8 percentage points higher (35.6% vs 23.8%), a relative improvement of about 50%. In a clinical ICU setting, precision matters: unnecessary interventions and false alarms contribute to alarm fatigue, which is a documented patient safety risk. XGBoost offers the best overall balance.
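One way to make the precision gap concrete is to convert precision into false alarms per correctly flagged patient, which is (1 - precision) / precision. The sketch below applies that identity to the precision values in the comparison table; the function name is hypothetical.

```python
def false_alarms_per_true_positive(precision: float) -> float:
    """Number of false alarms raised for each correctly flagged patient."""
    return (1 - precision) / precision


# Precision values from the model comparison table.
precision_by_model = {
    "Logistic Regression": 0.201,
    "Random Forest": 0.238,
    "XGBoost": 0.356,
}

alarm_burden = {
    model: round(false_alarms_per_true_positive(p), 1)
    for model, p in precision_by_model.items()
}
# alarm_burden == {"Logistic Regression": 4.0, "Random Forest": 3.2, "XGBoost": 1.8}
```

At these operating points, XGBoost raises about 1.8 false alarms per true detection, compared with about 3.2 for Random Forest.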
- Patient-level aggregation: collapsing hourly time-series into statistical summaries loses temporal deterioration patterns. A patient worsening over 6 hours looks identical to one improving, if their means are the same.
- Informative missingness: Creatinine_missing_frac is the third most important feature. This captures a real clinical signal (stable patients have fewer tests ordered) but may not generalise across hospitals with different lab ordering practices.
- Cold sepsis: sepsis patients presenting with low or normal temperature are harder to detect. SHAP analysis of the false negative case suggests the model misses atypical presentations that don't show the classic fever pattern.
- Single dataset validation: the model was trained and evaluated on PhysioNet 2019 only. External validation on MIMIC-III or eICU is required before any real-world use.
- No temporal modelling: an LSTM or Transformer architecture operating on the raw hourly sequences would better capture deterioration trajectories.

Install the dependencies:
`pip install numpy pandas matplotlib seaborn scikit-learn xgboost shap joblib`

1. Get the dataset
Request access to the PhysioNet Challenge 2019 dataset: https://physionet.org/content/challenge-2019/
Download both training sets (Training_SetA and Training_SetB). Each patient is one .psv file with hourly clinical readings.
2. Upload dataset to Google Drive
Place all .psv files in a folder on your Google Drive, for example:
3. Open the notebook in Google Colab
4. Run all cells in order
The notebook handles:
- Feature engineering from raw PSV files
- Data cleaning and leakage removal
- Train/validation/test split (70/15/15, stratified)
- XGBoost training with early stopping
- Threshold tuning via F2 score
- SHAP explainability plots
- 3-model comparison
- Model saving to Google Drive
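A stratified 70/15/15 split can be produced with two calls to scikit-learn's `train_test_split`: first hold out 30%, then halve the held-out part. The sketch below is one way to do it on toy data; the function name `split_70_15_15` and the seed are assumptions, and the notebook's exact calls may differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_70_15_15(X, y, seed=42):
    """Stratified train/validation/test split in 70/15/15 proportions."""
    # First hold out 30% of the data, preserving the class ratio.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed
    )
    # Then split the held-out 30% evenly into validation and test.
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Toy data: 1,000 patients, 7% positive (illustrative only).
X = np.arange(2000).reshape(1000, 2)
y = np.array([1] * 70 + [0] * 930)
(X_train, y_train), (X_val, y_val), (X_test, y_test) = split_70_15_15(X, y)
```

Stratifying both splits keeps the sepsis prevalence close to the overall rate in all three sets, which matters with a minority class this small.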
Expected outputs saved to Drive:
- sepsis_xgb_final.pkl: trained model + threshold + SHAP explainer
- sepsis_model_evaluation.png: ROC curve, PR curve, confusion matrix, feature importance
- shap_beeswarm.png: SHAP summary across all test patients
- shap_bar.png: global SHAP feature importance
- shap_waterfall.png: individual explanation (highest risk patient)
- shap_false_negative.png: individual explanation (missed sepsis patient)
- model_comparison.png: LR vs RF vs XGBoost bar chart
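Bundling the model with its tuned threshold in one file means the threshold cannot be lost or mismatched at inference time. The sketch below shows the idea with joblib. The dictionary keys, the placeholder standing in for the trained model, and the helper `flag_patient` are hypothetical; the actual layout of `sepsis_xgb_final.pkl` is not documented here.

```python
import os
import tempfile

import joblib

# Hypothetical bundle layout; a placeholder stands in for the trained model.
bundle = {
    "model": {"placeholder": "trained classifier goes here"},
    "threshold": 0.3355,
}

# Round-trip through a temporary directory instead of Google Drive.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sepsis_xgb_final.pkl")
    joblib.dump(bundle, path)
    restored = joblib.load(path)


def flag_patient(probability: float, threshold: float) -> bool:
    """Apply the stored decision threshold to a predicted probability."""
    return probability >= threshold


high_risk_flagged = flag_patient(0.999, restored["threshold"])
missed_case_flagged = flag_patient(0.204, restored["threshold"])
```

With the stored threshold, the highest-risk case study (0.999) is flagged and the missed case (0.204) is not, matching the outcomes described earlier.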
Early-Sepsis-Detection-Using-ML/
│
├── Early_Sepsis_Detection_XGBoost.ipynb ← main notebook
├── requirements.txt ← dependencies
├── README.md ← this file
├── LICENSE ← MIT
└── .gitignore
| Tool | Purpose |
|---|---|
| Python 3.10 | Core language |
| Pandas / NumPy | Data processing |
| Scikit-learn | Preprocessing, baselines, metrics |
| XGBoost | Primary classifier |
| SHAP | Model explainability |
| Matplotlib / Seaborn | Visualisation |
| Google Colab | Development environment |
| Google Drive | Dataset and model storage |
| Joblib | Model serialisation |
This project is for academic and research purposes only. It is not validated for clinical use and must not be used for medical decision-making. All predictions should be interpreted by qualified medical professionals.
Niteesh — Data Science Student
GitHub: @Niteesh014

