xaiqo/IncidentPrediction

Incident Prediction from Time-Series Metrics

Predict whether a service incident will occur within the next H time steps given the previous W steps of multivariate server metrics, using a stacked-ensemble sliding-window classifier.

Quick Start

pip install -r requirements.txt
jupyter notebook incident_prediction.ipynb   # run all cells

Project Layout

incident_prediction.ipynb   Main notebook (end-to-end pipeline)
requirements.txt
src/
  data_generation.py        Synthetic 5-metric time-series generator
  features.py               122-dim sliding-window feature extraction
  model.py                  Stacked ensemble + Optuna HP tuning
  evaluation.py             Metrics, threshold calibration, plotting

Pipeline

flowchart LR
    A[Synthetic Data<br/>5 metrics, 20k steps] --> B[Sliding Window<br/>W=30, H=15]
    B --> C[Feature Extraction<br/>122 features]
    C --> D[Temporal Split<br/>60 / 20 / 20]
    D --> E[Optuna HP Search<br/>30 trials on val]
    E --> F[Train Ensemble<br/>XGB + RF + HGB]
    F --> G[Meta-Learner<br/>Logistic Regression]
    G --> H[Isotonic Calibration]
    H --> I[Evaluation<br/>PR-AUC, incident-level]
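
The temporal split stage of the pipeline can be sketched as below. This is a minimal illustration, not the repository's code: `temporal_split` and its integer-percentage parameters are hypothetical names, and the only facts taken from the diagram are the 60 / 20 / 20 proportions and the chronological ordering.

```python
def temporal_split(n_samples, train_pct=60, val_pct=20):
    """Split sample indices chronologically into train / val / test ranges.

    Hypothetical helper: samples are assumed to be ordered by time, and no
    shuffling is performed, so every validation sample is later than every
    training sample and every test sample is later than both.
    """
    if train_pct <= 0 or val_pct <= 0 or train_pct + val_pct >= 100:
        raise ValueError("percentages must be positive and leave room for a test split")
    # Integer arithmetic avoids floating-point surprises at the boundaries.
    train_end = n_samples * train_pct // 100
    val_end = n_samples * (train_pct + val_pct) // 100
    return range(0, train_end), range(train_end, val_end), range(val_end, n_samples)


train_idx, val_idx, test_idx = temporal_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because adjacent sliding windows overlap, samples close to a split boundary share raw time steps. Dropping a margin of roughly W + H samples at each boundary is one way to limit that leakage; whether the notebook does so is not stated here.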

Model Architecture

flowchart TB
    subgraph base [Base Learners -- trained on 60% train split]
        XGB[XGBoost<br/><i>Optuna-tuned, early stopping</i>]
        RF[Random Forest<br/><i>400 trees, sqrt features</i>]
        HGB[Hist Gradient Boosting<br/><i>histogram splits, L2 reg</i>]
    end

    subgraph meta [Meta-Learner -- fitted on 20% val split]
        LR[Logistic Regression<br/><i>combines base predictions</i>]
    end

    subgraph calib [Calibration]
        ISO[Isotonic Regression<br/><i>maps score to true P</i>]
    end

    XGB -- P_xgb --> LR
    RF  -- P_rf  --> LR
    HGB -- P_hgb --> LR
    LR  -- raw P --> ISO
    ISO -- calibrated P --> OUT[Alert Decision<br/><i>threshold on PR curve</i>]
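
The calibration stage in the diagram maps the meta-learner's raw score to a probability with a monotone function. The sketch below shows the idea with a from-scratch pool-adjacent-violators (PAV) fit in plain Python. It is an illustration under stated assumptions, not the repository's implementation: `fit_isotonic` and `calibrate` are hypothetical names, and the lookup is a step function, whereas a library implementation such as scikit-learn's `IsotonicRegression` interpolates between fitted points.

```python
from itertools import groupby


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw scores to observed positive rates.

    Returns a list of (upper_score, calibrated_value) blocks sorted by score.
    """
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block: [sum_of_labels, count, largest_score_in_block]
    for score, group in groupby(pairs, key=lambda p: p[0]):
        ys = [y for _, y in group]  # tied scores are pooled up front
        blocks.append([float(sum(ys)), len(ys), score])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def calibrate(model, score):
    """Look up the calibrated value for a raw score (clamped at both ends)."""
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]


model = fit_isotonic([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
print(model)
print(calibrate(model, 0.25))
```

In the toy call, the out-of-order pair at scores 0.2 and 0.3 is pooled into a single block with value 0.5, which is what keeps the fitted mapping monotone.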

Problem Formulation

Given M = 5 server metrics sampled at 1-minute intervals, at each step t the model receives a window [t-W+1, t] and outputs:

$$P(\text{incident in } (t,\; t+H])$$

The label is y_t = 1 if and only if at least one incident step falls within the horizon (t, t+H]. A threshold chosen on the precision-recall curve then converts the predicted probability into a binary alert.
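
The labeling rule above can be sketched in a few lines. This is a minimal illustration rather than the repository's code: `make_labels` is a hypothetical helper, `incident` is assumed to be a per-step list of 0/1 flags, and steps without a full window behind them or a full horizon ahead of them are assumed to be dropped.

```python
def make_labels(incident, W=30, H=15):
    """Return (t, y_t) pairs where y_t = 1 iff an incident step lies in (t, t+H]."""
    n = len(incident)
    # Prefix sums make each horizon lookup O(1): prefix[i] = sum(incident[:i]).
    prefix = [0]
    for flag in incident:
        prefix.append(prefix[-1] + flag)

    labels = []
    # t needs W steps of history (t >= W-1) and H steps of future (t+H <= n-1).
    for t in range(W - 1, n - H):
        # Incident steps at indices t+1 .. t+H; the current step t is excluded.
        hits = prefix[t + H + 1] - prefix[t + 1]
        labels.append((t, 1 if hits > 0 else 0))
    return labels


flags = [0] * 10
flags[6] = 1  # a single incident step at t = 6
print(make_labels(flags, W=3, H=2))
```

With W = 3 and H = 2 in the toy call, only t = 4 and t = 5 are labeled positive: the incident at step 6 lies inside their horizons, while t = 6 itself is negative because the horizon interval is open on the left.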

Feature Summary (122 dimensions)

mindmap
  root((122 Features))
    Per-Metric x 5
      Basic Stats
        last, mean, std, min, max
      Trend
        OLS slope, z-score, delta
      Multi-Scale
        mean/std at W/2 and W/4
      Dynamics
        velocity, acceleration, diff std
      Distribution
        skewness, kurtosis, IQR, p10, p90
      EWM
        exp-weighted mean and std
    Cross-Metric
      Pairwise Correlations x 10
      Max and Mean abs z-score
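
The 122-dimension total can be checked by counting the groups in the mindmap. The snippet below is only an arithmetic sketch: the feature names are hypothetical labels for the items listed above, and it assumes each listed item contributes exactly one value per metric.

```python
from math import comb

N_METRICS = 5

# One entry per item in the mindmap; the identifiers themselves are illustrative.
PER_METRIC_GROUPS = {
    "basic_stats": ["last", "mean", "std", "min", "max"],
    "trend": ["ols_slope", "z_score", "delta"],
    "multi_scale": ["mean_half", "std_half", "mean_quarter", "std_quarter"],
    "dynamics": ["velocity", "acceleration", "diff_std"],
    "distribution": ["skewness", "kurtosis", "iqr", "p10", "p90"],
    "ewm": ["ewm_mean", "ewm_std"],
}

per_metric = sum(len(names) for names in PER_METRIC_GROUPS.values())
pairwise_corr = comb(N_METRICS, 2)   # unordered metric pairs
cross_metric = pairwise_corr + 2     # plus max and mean absolute z-score
total = N_METRICS * per_metric + cross_metric

print(per_metric, cross_metric, total)
```

Under that reading, each metric contributes 22 features (110 across five metrics) and the cross-metric block adds 12, which matches the stated 122.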

Results

| Metric | Score |
| --- | --- |
| PR-AUC | 0.759 |
| ROC-AUC | 0.891 |
| F1 (optimal threshold) | 0.771 |
| Incident detection rate | 100% (21/21) |
| Mean early warning | 8.4 steps before onset |
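
The two incident-level figures (detection rate and mean early warning) are computed per incident rather than per time step. The README does not spell out the exact rule, so the sketch below assumes one plausible definition: an incident counts as detected if any alert fires in the H steps before its onset, and its lead time is the distance from the first such alert to the onset. `incident_level_metrics`, `alerts`, and `onsets` are hypothetical names.

```python
def incident_level_metrics(alerts, onsets, H=15):
    """Return (detection_rate, mean_lead_time) over a list of incident onsets.

    alerts: per-step 0/1 alert decisions.
    onsets: step index at which each incident starts.
    Assumed rule: an alert at step t covers onsets in (t, t+H], so the alerts
    relevant to an onset are those at steps onset-H .. onset-1.
    """
    lead_times = []
    for onset in onsets:
        start = max(0, onset - H)
        # Earliest alert inside the pre-onset window, if any.
        lead = next((onset - t for t in range(start, onset) if alerts[t]), None)
        if lead is not None:
            lead_times.append(lead)

    detected = len(lead_times)
    rate = detected / len(onsets) if onsets else 0.0
    mean_lead = sum(lead_times) / detected if detected else None
    return rate, mean_lead


alerts = [0] * 40
alerts[7] = alerts[8] = 1   # warns 3 steps before the onset at t = 10
alerts[20] = 1              # too early to count for the onset at t = 30 when H = 5
print(incident_level_metrics(alerts, onsets=[10, 30], H=5))
```

Under this definition an incident can be detected even when many of its individual positive steps are missed, which is why an incident-level detection rate can be higher than step-level recall.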

Operating points (configurable precision-recall trade-off):

| Mode | Threshold | Precision | Recall | Use case |
| --- | --- | --- | --- | --- |
| F1-optimal | 0.30 | 0.88 | 0.69 | Balanced default |
| High-precision | 0.41 | 0.90 | 0.66 | Pager alerts |
| High-recall | 0.01 | 0.15 | 0.90 | Dashboard warnings |
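
One way to derive operating points like these from validation scores is sketched below. It is an assumption-laden illustration, not the repository's calibration code: `pick_threshold` and its `mode` / `target` parameters are hypothetical, and the 0.90 targets for the high-precision and high-recall modes are inferred from the table rather than stated in the source.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when alerting on every score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def pick_threshold(scores, labels, mode="f1", target=0.90):
    """Return (threshold, precision, recall) for the chosen mode, or None.

    mode="f1":        maximize F1.
    mode="precision": maximize recall subject to precision >= target.
    mode="recall":    maximize precision subject to recall >= target.
    """
    best = None
    for thr in sorted(set(scores)):
        p, r = precision_recall_at(scores, labels, thr)
        if mode == "f1":
            key = f1(p, r)
        elif mode == "precision":
            if p < target:
                continue
            key = r
        elif mode == "recall":
            if r < target:
                continue
            key = p
        else:
            raise ValueError(f"unknown mode: {mode}")
        if best is None or key > best[0]:
            best = (key, thr, p, r)
    return None if best is None else best[1:]


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
for mode in ("f1", "precision", "recall"):
    print(mode, pick_threshold(scores, labels, mode=mode))
print(round(f1(0.88, 0.69), 2))
```

As a consistency check, plugging the F1-optimal row (precision 0.88, recall 0.69) into `f1` gives roughly 0.77, in line with the reported F1 of 0.771 up to rounding of the table entries.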

Limitations

  • Synthetic data -- real incidents are more heterogeneous; production deployment needs retraining on operational labels.
  • Stationarity -- feature statistics assume a stable distribution; concept drift requires periodic retraining.
  • Single-host scope -- fleet-wide and cross-service features would strengthen a production model.
  • Fixed threshold -- adaptive thresholds tied to the local noise floor would further reduce false alerts.

Production Adaptation

flowchart LR
    S[Streaming Metrics<br/>Flink / Spark] --> FE[Real-Time<br/>Feature Store]
    FE --> INF[Model Serving<br/>REST / gRPC]
    INF --> AR[Alert Router<br/>per-team thresholds]
    AR --> PD[PagerDuty /<br/>OpsGenie]
    PD --> FB[Feedback Loop<br/>ack / dismiss]
    FB --> RT[Scheduled<br/>Retraining]
    RT --> INF
