Predict whether a service incident will occur within the next H time steps given the previous W steps of multivariate server metrics, using a stacked-ensemble sliding-window classifier.
```bash
pip install -r requirements.txt
jupyter notebook incident_prediction.ipynb   # run all cells
```

```text
incident_prediction.ipynb   Main notebook (end-to-end pipeline)
requirements.txt            Python dependencies
src/
  data_generation.py        Synthetic 5-metric time-series generator
  features.py               122-dim sliding-window feature extraction
  model.py                  Stacked ensemble + Optuna HP tuning
  evaluation.py             Metrics, threshold calibration, plotting
```
```mermaid
flowchart LR
    A[Synthetic Data<br/>5 metrics, 20k steps] --> B[Sliding Window<br/>W=30, H=15]
    B --> C[Feature Extraction<br/>122 features]
    C --> D[Temporal Split<br/>60 / 20 / 20]
    D --> E[Optuna HP Search<br/>30 trials on val]
    E --> F[Train Ensemble<br/>XGB + RF + HGB]
    F --> G[Meta-Learner<br/>Logistic Regression]
    G --> H[Isotonic Calibration]
    H --> I[Evaluation<br/>PR-AUC, incident-level]
```
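The temporal split deserves emphasis: windows are ordered in time, so a shuffled split would leak future information into training. A minimal sketch of the chronological 60/20/20 split (function and variable names are illustrative, not the repo's actual API):

```python
import numpy as np

def temporal_split(X, y, train=0.6, val=0.2):
    """Chronological 60/20/20 split -- no shuffling, so the model never
    trains on windows that overlap the future it is evaluated on."""
    n = len(X)
    i, j = int(n * train), int(n * (train + val))
    return (X[:i], y[:i]), (X[i:j], y[i:j]), (X[j:], y[j:])
```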
```mermaid
flowchart TB
    subgraph base [Base Learners -- trained on 60% train split]
        XGB[XGBoost<br/><i>Optuna-tuned, early stopping</i>]
        RF[Random Forest<br/><i>400 trees, sqrt features</i>]
        HGB[Hist Gradient Boosting<br/><i>histogram splits, L2 reg</i>]
    end
    subgraph meta [Meta-Learner -- fitted on 20% val split]
        LR[Logistic Regression<br/><i>combines base predictions</i>]
    end
    subgraph calib [Calibration]
        ISO[Isotonic Regression<br/><i>maps score to true P</i>]
    end
    XGB -- P_xgb --> LR
    RF -- P_rf --> LR
    HGB -- P_hgb --> LR
    LR -- raw P --> ISO
    ISO -- calibrated P --> OUT[Alert Decision<br/><i>threshold on PR curve</i>]
```
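A condensed sketch of the stacking and calibration steps, assuming feature matrices `X_tr, y_tr` (train) and `X_val, y_val` (validation) from the temporal split; the hyperparameters here are placeholders for the Optuna-tuned values in `src/model.py`:

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

# Base learners: fitted on the 60% train split.
bases = [
    XGBClassifier(n_estimators=300, eval_metric="logloss"),
    RandomForestClassifier(n_estimators=400, max_features="sqrt"),
    HistGradientBoostingClassifier(l2_regularization=1.0),
]
for m in bases:
    m.fit(X_tr, y_tr)

# Meta-learner: logistic regression over base probabilities, fitted on the
# 20% validation split so it only sees out-of-sample base predictions.
P_val = np.column_stack([m.predict_proba(X_val)[:, 1] for m in bases])
meta = LogisticRegression().fit(P_val, y_val)

# Isotonic calibration maps the raw meta score onto true incident probability.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(meta.predict_proba(P_val)[:, 1], y_val)

def predict_proba(X):
    P = np.column_stack([m.predict_proba(X)[:, 1] for m in bases])
    return iso.predict(meta.predict_proba(P)[:, 1])
```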
Given M = 5 server metrics sampled at 1-minute intervals, at each step t the model receives the window [t-W+1, t] and outputs

p̂_t = P(y_t = 1 | x_{t-W+1}, ..., x_t),

the calibrated probability that an incident begins within the next H steps. The label is y_t = 1 iff any incident step falls within the horizon (t, t+H]. A threshold chosen on the precision-recall curve converts the probability into a binary alert.
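A minimal sketch of the window and label construction under these definitions (array names are illustrative, not the repo's actual API):

```python
import numpy as np

W, H = 30, 15  # window length and prediction horizon

def make_windows(metrics: np.ndarray, incident: np.ndarray):
    """metrics: (T, M) array of server metrics; incident: (T,) binary array."""
    X, y = [], []
    for t in range(W - 1, len(metrics) - H):
        X.append(metrics[t - W + 1 : t + 1])              # window [t-W+1, t]
        y.append(int(incident[t + 1 : t + H + 1].any()))  # 1 iff incident in (t, t+H]
    return np.stack(X), np.array(y)
```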
```mermaid
mindmap
  root((122 Features))
    Per-Metric x 5
      Basic Stats
        last, mean, std, min, max
      Trend
        OLS slope, z-score, delta
      Multi-Scale
        mean/std at W/2 and W/4
      Dynamics
        velocity, acceleration, diff std
      Distribution
        skewness, kurtosis, IQR, p10, p90
      EWM
        exp-weighted mean and std
    Cross-Metric
      Pairwise Correlations x 10
      Max and Mean abs z-score
```
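The count works out: 22 per-metric features x 5 metrics = 110, plus 10 pairwise correlations and 2 aggregate z-scores = 122. A sketch of a few representative per-metric features (choices such as the EWM span are illustrative; the full set lives in `src/features.py`):

```python
import numpy as np
import pandas as pd
from scipy import stats

def per_metric_features(x: np.ndarray) -> dict[str, float]:
    """A handful of the 22 per-metric features for one window column (length W)."""
    s = pd.Series(x)
    t = np.arange(len(x))
    return {
        "last": x[-1], "mean": x.mean(), "std": x.std(),   # basic stats
        "slope": np.polyfit(t, x, 1)[0],                   # OLS trend slope
        "zscore": (x[-1] - x.mean()) / (x.std() + 1e-9),   # last-point z-score
        "half_mean": x[len(x) // 2 :].mean(),              # multi-scale (W/2)
        "velocity": x[-1] - x[-2],                         # first difference
        "accel": x[-1] - 2 * x[-2] + x[-3],                # second difference
        "skew": stats.skew(x), "kurt": stats.kurtosis(x),  # distribution shape
        "iqr": np.percentile(x, 75) - np.percentile(x, 25),
        "ewm_mean": s.ewm(span=10).mean().iloc[-1],        # exp-weighted mean
    }
```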
| Metric | Value |
|---|---|
| PR-AUC | 0.759 |
| ROC-AUC | 0.891 |
| F1 (optimal threshold) | 0.771 |
| Incident detection rate | 100% (21/21) |
| Mean early warning | 8.4 steps before onset |
Operating points (configurable precision-recall trade-off):
| Mode | Threshold | Precision | Recall | Use case |
|---|---|---|---|---|
| F1-optimal | 0.30 | 0.88 | 0.69 | Balanced default |
| High-precision | 0.41 | 0.90 | 0.66 | Pager alerts |
| High-recall | 0.01 | 0.15 | 0.90 | Dashboard warnings |
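Operating points like these can be read off the validation precision-recall curve. A sketch, assuming labels `y_val` and calibrated scores `p_val` (hypothetical names):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

prec, rec, thr = precision_recall_curve(y_val, p_val)

# precision_recall_curve returns one more (prec, rec) pair than thresholds.
f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-9)
t_f1 = thr[np.argmax(f1)]                       # F1-optimal operating point

mask = prec[:-1] >= 0.90                        # high-precision mode
t_hp = thr[mask].min() if mask.any() else None  # loosest threshold meeting the bar
```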
- Synthetic data -- real incidents are more heterogeneous; production deployment needs retraining on operational labels.
- Stationarity -- feature statistics assume a stable distribution; concept drift requires periodic retraining.
- Single-host scope -- fleet-wide and cross-service features would strengthen a production model.
- Fixed threshold -- adaptive thresholds tied to the local noise floor would further reduce false alerts.
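As an illustration of the last point, a minimal sketch that gates alerts on a rolling quantile of recent calibrated scores (window, quantile, and margin values are arbitrary):

```python
import pandas as pd

def adaptive_alerts(scores: pd.Series, window: int = 720, q: float = 0.995,
                    margin: float = 0.05) -> pd.Series:
    """Alert only when a score clears the recent noise floor by a fixed margin."""
    floor = scores.rolling(window, min_periods=60).quantile(q)
    return scores > (floor + margin)
```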
```mermaid
flowchart LR
    S[Streaming Metrics<br/>Flink / Spark] --> FE[Real-Time<br/>Feature Store]
    FE --> INF[Model Serving<br/>REST / gRPC]
    INF --> AR[Alert Router<br/>per-team thresholds]
    AR --> PD[PagerDuty /<br/>OpsGenie]
    PD --> FB[Feedback Loop<br/>ack / dismiss]
    FB --> RT[Scheduled<br/>Retraining]
    RT --> INF
```
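For the Model Serving box, a toy sketch, assuming the calibrated ensemble is serialized with joblib as `model.pkl` and exposes `predict_proba` (both assumptions; the diagram leaves the serving stack open):

```python
import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.pkl")   # assumption: calibrated ensemble pickled here
THRESHOLD = 0.41                   # e.g. the high-precision operating point

class FeatureVector(BaseModel):
    features: list[float]          # the 122-dim feature vector for one window

@app.post("/predict")
def predict(req: FeatureVector):
    p = float(model.predict_proba(np.asarray([req.features]))[:, 1][0])
    return {"probability": p, "alert": p >= THRESHOLD}
```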