Predict whether a service incident will occur within the next H time steps given the previous W steps of multivariate server metrics, using a stacked-ensemble sliding-window classifier.
```bash
pip install -r requirements.txt
jupyter notebook incident_prediction.ipynb   # run all cells
```

```text
incident_prediction.ipynb   Main notebook (end-to-end pipeline)
requirements.txt            Python dependencies
src/
  data_generation.py        Synthetic 5-metric time-series generator
  features.py               122-dim sliding-window feature extraction
  model.py                  Stacked ensemble + Optuna HP tuning
  evaluation.py             Metrics, threshold calibration, plotting
```
```mermaid
flowchart LR
    A[Synthetic Data<br/>5 metrics, 20k steps] --> B[Sliding Window<br/>W=30, H=15]
    B --> C[Feature Extraction<br/>122 features]
    C --> D[Temporal Split<br/>60 / 20 / 20]
    D --> E[Optuna HP Search<br/>30 trials on val]
    E --> F[Train Ensemble<br/>XGB + RF + HGB]
    F --> G[Meta-Learner<br/>Logistic Regression]
    G --> H[Isotonic Calibration]
    H --> I[Evaluation<br/>PR-AUC, incident-level]
```
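The temporal split deserves emphasis: windows are ordered in time, so a shuffled split would leak future information into training. A minimal sketch of the chronological 60/20/20 split (function and variable names are illustrative, not the repo's actual API):

```python
import numpy as np

def temporal_split(X, y, train=0.6, val=0.2):
    """Chronological 60/20/20 split -- no shuffling, so the model never
    trains on windows that overlap the future it is evaluated on."""
    n = len(X)
    i, j = int(n * train), int(n * (train + val))
    return (X[:i], y[:i]), (X[i:j], y[i:j]), (X[j:], y[j:])
```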
```mermaid
flowchart TB
    subgraph base [Base Learners -- trained on 60% train split]
        XGB[XGBoost<br/><i>Optuna-tuned, early stopping</i>]
        RF[Random Forest<br/><i>400 trees, sqrt features</i>]
        HGB[Hist Gradient Boosting<br/><i>histogram splits, L2 reg</i>]
    end
    subgraph meta [Meta-Learner -- fitted on 20% val split]
        LR[Logistic Regression<br/><i>combines base predictions</i>]
    end
    subgraph calib [Calibration]
        ISO[Isotonic Regression<br/><i>maps score to true P</i>]
    end
    XGB -- P_xgb --> LR
    RF -- P_rf --> LR
    HGB -- P_hgb --> LR
    LR -- raw P --> ISO
    ISO -- calibrated P --> OUT[Alert Decision<br/><i>threshold on PR curve</i>]
```
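A condensed sketch of the stacking and calibration steps, assuming feature matrices `X_tr, y_tr` (train) and `X_val, y_val` (validation) from the temporal split; the hyperparameters here are placeholders for the Optuna-tuned values in `src/model.py`:

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

# Base learners: fitted on the 60% train split.
bases = [
    XGBClassifier(n_estimators=300, eval_metric="logloss"),
    RandomForestClassifier(n_estimators=400, max_features="sqrt"),
    HistGradientBoostingClassifier(l2_regularization=1.0),
]
for m in bases:
    m.fit(X_tr, y_tr)

# Meta-learner: logistic regression over base probabilities, fitted on the
# 20% validation split so it only sees out-of-sample base predictions.
P_val = np.column_stack([m.predict_proba(X_val)[:, 1] for m in bases])
meta = LogisticRegression().fit(P_val, y_val)

# Isotonic calibration maps the raw meta score onto true incident probability.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(meta.predict_proba(P_val)[:, 1], y_val)

def predict_proba(X):
    P = np.column_stack([m.predict_proba(X)[:, 1] for m in bases])
    return iso.predict(meta.predict_proba(P)[:, 1])
```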
Given M = 5 server metrics sampled at 1-minute intervals, at each step t the model receives the window [t-W+1, t] and outputs

p̂_t = P(y_t = 1 | x_{t-W+1}, ..., x_t),

the calibrated probability that an incident begins within the next H steps. The label is y_t = 1 iff any incident step falls within the horizon (t, t+H]. A threshold chosen on the precision-recall curve converts the probability into a binary alert.
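A minimal sketch of the window and label construction under these definitions (array names are illustrative, not the repo's actual API):

```python
import numpy as np

W, H = 30, 15  # window length and prediction horizon

def make_windows(metrics: np.ndarray, incident: np.ndarray):
    """metrics: (T, M) array of server metrics; incident: (T,) binary array."""
    X, y = [], []
    for t in range(W - 1, len(metrics) - H):
        X.append(metrics[t - W + 1 : t + 1])              # window [t-W+1, t]
        y.append(int(incident[t + 1 : t + H + 1].any()))  # 1 iff incident in (t, t+H]
    return np.stack(X), np.array(y)
```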
```mermaid
mindmap
  root((122 Features))
    Per-Metric x 5
      Basic Stats
        last, mean, std, min, max
      Trend
        OLS slope, z-score, delta
      Multi-Scale
        mean/std at W/2 and W/4
      Dynamics
        velocity, acceleration, diff std
      Distribution
        skewness, kurtosis, IQR, p10, p90
      EWM
        exp-weighted mean and std
    Cross-Metric
      Pairwise Correlations x 10
      Max and Mean abs z-score
```
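The count works out: 22 per-metric features x 5 metrics = 110, plus 10 pairwise correlations and 2 aggregate z-scores = 122. A sketch of a few representative per-metric features (choices such as the EWM span are illustrative; the full set lives in `src/features.py`):

```python
import numpy as np
import pandas as pd
from scipy import stats

def per_metric_features(x: np.ndarray) -> dict[str, float]:
    """A handful of the 22 per-metric features for one window column (length W)."""
    s = pd.Series(x)
    t = np.arange(len(x))
    return {
        "last": x[-1], "mean": x.mean(), "std": x.std(),   # basic stats
        "slope": np.polyfit(t, x, 1)[0],                   # OLS trend slope
        "zscore": (x[-1] - x.mean()) / (x.std() + 1e-9),   # last-point z-score
        "half_mean": x[len(x) // 2 :].mean(),              # multi-scale (W/2)
        "velocity": x[-1] - x[-2],                         # first difference
        "accel": x[-1] - 2 * x[-2] + x[-3],                # second difference
        "skew": stats.skew(x), "kurt": stats.kurtosis(x),  # distribution shape
        "iqr": np.percentile(x, 75) - np.percentile(x, 25),
        "ewm_mean": s.ewm(span=10).mean().iloc[-1],        # exp-weighted mean
    }
```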
| Metric | Value |
|---|---|
| PR-AUC | 0.759 |
| ROC-AUC | 0.891 |
| F1 (optimal threshold) | 0.771 |
| Incident detection rate | 100% (21/21) |
| Mean early warning | 8.4 steps before onset |
Operating points (configurable precision-recall trade-off):
| Mode | Threshold | Precision | Recall | Use case |
|---|---|---|---|---|
| F1-optimal | 0.30 | 0.88 | 0.69 | Balanced default |
| High-precision | 0.41 | 0.90 | 0.66 | Pager alerts |
| High-recall | 0.01 | 0.15 | 0.90 | Dashboard warnings |
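Operating points like these can be read off the validation precision-recall curve. A sketch, assuming labels `y_val` and calibrated scores `p_val` (hypothetical names):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

prec, rec, thr = precision_recall_curve(y_val, p_val)

# precision_recall_curve returns one more (prec, rec) pair than thresholds.
f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-9)
t_f1 = thr[np.argmax(f1)]                       # F1-optimal operating point

mask = prec[:-1] >= 0.90                        # high-precision mode
t_hp = thr[mask].min() if mask.any() else None  # loosest threshold meeting the bar
```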
- Synthetic data -- real incidents are more heterogeneous; production deployment needs retraining on operational labels.
- Stationarity -- feature statistics assume a stable distribution; concept drift requires periodic retraining.
- Single-host scope -- fleet-wide and cross-service features would strengthen a production model.
- Fixed threshold -- adaptive thresholds tied to the local noise floor would further reduce false alerts.
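As an illustration of the last point, a minimal sketch that gates alerts on a rolling quantile of recent calibrated scores (window, quantile, and margin values are arbitrary):

```python
import pandas as pd

def adaptive_alerts(scores: pd.Series, window: int = 720, q: float = 0.995,
                    margin: float = 0.05) -> pd.Series:
    """Alert only when a score clears the recent noise floor by a fixed margin."""
    floor = scores.rolling(window, min_periods=60).quantile(q)
    return scores > (floor + margin)
```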
```mermaid
flowchart LR
    S[Streaming Metrics<br/>Flink / Spark] --> FE[Real-Time<br/>Feature Store]
    FE --> INF[Model Serving<br/>REST / gRPC]
    INF --> AR[Alert Router<br/>per-team thresholds]
    AR --> PD[PagerDuty /<br/>OpsGenie]
    PD --> FB[Feedback Loop<br/>ack / dismiss]
    FB --> RT[Scheduled<br/>Retraining]
    RT --> INF
```
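For the Model Serving box, a toy sketch, assuming the calibrated ensemble is serialized with joblib as `model.pkl` and exposes `predict_proba` (both assumptions; the diagram leaves the serving stack open):

```python
import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.pkl")   # assumption: calibrated ensemble pickled here
THRESHOLD = 0.41                   # e.g. the high-precision operating point

class FeatureVector(BaseModel):
    features: list[float]          # the 122-dim feature vector for one window

@app.post("/predict")
def predict(req: FeatureVector):
    p = float(model.predict_proba(np.asarray([req.features]))[:, 1][0])
    return {"probability": p, "alert": p >= THRESHOLD}
```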