What happens when you combine ML · Reinforcement Learning · Real-time Alerting into one industrial pipeline
Most ML repositories show you a model trained on a dataset.
A regression. A classifier. A confusion matrix. End of story.
ShiftSentinel asks a different question:
What does it look like when you chain all of it together — unsupervised learning, supervised learning, reinforcement learning, survival analysis — and connect the output to a real operator's phone?
The answer is this: an intelligent Andon system with prescriptive maintenance.
Andon is a core concept from Toyota's production system — a signal that alerts workers the moment something goes wrong on the line. Traditionally it's a light, a cord, a buzzer. A human detects the problem and triggers it.
ShiftSentinel automates the entire chain:
Traditional Andon: Human notices problem → pulls cord → light turns on
ShiftSentinel: Sensor data → 5 ML/RL models → Telegram alert
"CNC-03 · Risk 72% · INSPECT within 24h · $18,700 if ignored"
No one has to notice anything. No one has to be on the floor.
The system detects, classifies, prioritizes, and tells you exactly what to do — with the cost of doing nothing included.
The gap between a data science exercise and an industrial AI system is not the algorithm.
It's the integration.
Anyone can train an XGBoost model on a CSV. The interesting problem is:
- How do you combine 5 models from different ML paradigms so they don't contradict each other?
- How do you turn a floating-point prediction into a decision a factory operator can act on?
- How do you make the system push information to the human — not the other way around?
- What does Reinforcement Learning actually look like when applied to a real maintenance problem?
This project is a working answer to those questions.
It's a proof of concept — but everything here is grounded in how these systems actually work in industry.
┌─────────────────────────────────────────────────────────────────┐
│ CNC SENSOR STREAM │
│ CNC-01 · CNC-02 · CNC-03 │
│ vibration · temp · deviation · force · current · energy │
└────────────────────────┬────────────────────────────────────────┘
│
┌────────────────▼────────────────────────────┐
│ LAYER 1 — Health Index │
│ Variational Autoencoder (Unsupervised) │
│ Learns what "normal" looks like │
│ Reconstruction error → Health 0-100 │
└────────────────┬────────────────────────────┘
│
┌────────────────▼────────────────────────────┐
│ LAYER 2 — Anomaly Detection │
│ Ensemble: Isolation Forest + LSTM-AE │
│ + Z-Score (Unsupervised) │
│ 2-of-3 vote → anomaly + which sensors │
└────────────────┬────────────────────────────┘
│
┌────────────────▼────────────────────────────┐
│ LAYER 3 — Failure Probability │
│ XGBoost + Platt Scaling + SHAP (Supervised)│
│ P(failure) per sensor — fully explainable │
└────────────────┬────────────────────────────┘
│
┌────────────────▼────────────────────────────┐
│ LAYER 3.5 — Remaining Useful Life │
│ Weibull Survival Analysis │
│ How many cycles until failure (+ 90% CI) │
└────────────────┬────────────────────────────┘
│
┌────────────────▼────────────────────────────┐
│ LAYER 4 — Prescriptive Engine │
│ Deep Q-Network (Reinforcement Learning) │
│ Agent learns cost-optimal maintenance │
│ policy — no hand-coded rules │
│ Computes 30-day downtime cost per action │
└────────────────┬────────────────────────────┘
│
┌────────────────▼────────────────────────────┐
│ COMPOSITE RISK SIGNAL (CR) │
│ CR = 40% Failure + 40% Anomaly + 20% Wear │
│ Unifies all model outputs — one number │
│ No contradictions between layers │
└────────────────┬────────────────────────────┘
│
┌────────────────▼────────────────────────────┐
│ TELEGRAM BOT — Intelligent Andon │
│ Real-time risk alerts to operator's phone │
│ Button-first UX — no commands to memorize │
│ Scheduled shift reports (06:00/14:00/22:00)│
│ Action plan: INSPECT · REPLACE · MONITOR │
└─────────────────────────────────────────────┘
Most ML projects use one learning paradigm. This one uses three, connected in sequence.
| Paradigm | Models Used | What it contributes |
|---|---|---|
| Unsupervised | VAE + Isolation Forest + LSTM-AE + Z-Score | Learns normality without labels. Detects deviation and which sensors are responsible. |
| Supervised | XGBoost + Platt Scaling + SHAP | Calibrated failure probability per sensor. Fully explainable — operator sees exactly why the alert fired. |
| Reinforcement Learning | Deep Q-Network (DQN) | Agent trained on simulated maintenance episodes. Learns when to IGNORE vs MONITOR vs INSPECT vs REPLACE by maximizing long-term reward (avoided downtime cost). No hand-coded thresholds. |
Plus Weibull survival analysis — the industry standard for RUL estimation used in aerospace, automotive, and industrial equipment.
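As a rough illustration of how a Weibull fit turns into a remaining-life estimate, the sketch below computes conditional remaining-life quantiles in plain Python. It is not the project's Lifelines code: the function name `remaining_life_quantile` and the shape/scale values are hypothetical, and the percentile band shown here may differ from how the repo derives its 90% interval.

```python
import math


def remaining_life_quantile(age: float, scale: float, shape: float, p: float) -> float:
    """Cycles x such that P(failure within x more cycles | survived to `age`) = p.

    Uses the Weibull survival function S(t) = exp(-(t / scale) ** shape)
    and solves S(age + x) / S(age) = 1 - p for x.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    cumulative_hazard_now = (age / scale) ** shape
    target = cumulative_hazard_now - math.log(1.0 - p)
    return scale * target ** (1.0 / shape) - age


# Hypothetical parameters: shape k = 2.5 (> 1, wear-out), scale = 5,000 cycles.
K, LAMBDA = 2.5, 5000.0
age = 3000.0  # cycles already run (made-up value)

median_rul = remaining_life_quantile(age, LAMBDA, K, 0.50)
low = remaining_life_quantile(age, LAMBDA, K, 0.05)
high = remaining_life_quantile(age, LAMBDA, K, 0.95)
print(f"Median RUL: {median_rul:.0f} cycles (5th-95th percentile: {low:.0f}-{high:.0f})")
```

With a shape parameter above 1 the hazard rises with age, so the same function returns a shorter remaining life for an older machine.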
The notebooks cover each model in depth.
What makes this project different is what happens after the models run.
Every sensor cycle goes through the full pipeline and produces one number: Composite Risk (CR).
CR = 40% × Failure Probability
+ 40% × Anomaly Score
+ 20% × Machine Wear
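The weighted sum above can be sketched in a few lines of Python. The weights and the 65% alert threshold come from this README; the `MachineState` container, the function names, and the example inputs are hypothetical, and all three components are assumed to be scaled to the range 0 to 1.

```python
from dataclasses import dataclass

# Weights and threshold as stated in this README.
W_FAILURE, W_ANOMALY, W_WEAR = 0.40, 0.40, 0.20
ALERT_THRESHOLD = 0.65


@dataclass
class MachineState:
    # Hypothetical container; inputs are assumed to be scaled to [0, 1].
    failure_prob: float
    anomaly_score: float
    wear: float


def composite_risk(state: MachineState) -> float:
    """Weighted sum of the three component scores, clamped to [0, 1]."""
    cr = (W_FAILURE * state.failure_prob
          + W_ANOMALY * state.anomaly_score
          + W_WEAR * state.wear)
    return min(max(cr, 0.0), 1.0)


def should_alert(state: MachineState) -> bool:
    """True when the composite risk is above the alert threshold."""
    return composite_risk(state) > ALERT_THRESHOLD


# Made-up example values, not taken from the dataset.
state = MachineState(failure_prob=0.80, anomaly_score=0.70, wear=0.60)
print(f"CR = {composite_risk(state):.0%}, alert = {should_alert(state)}")
# CR = 72%, alert = True
```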
When CR crosses 65%, the operator's Telegram receives:
🔴 RISK ALERT — CNC-03
Cycle #1,247 | 06:23 AM
Risk: 72%
🔴 Vibration: 91% failure risk
🔴 Cut Force: 88% failure risk
🔴 Dim Deviation: 84% failure risk
🟡 Spindle Temp: 67% failure risk
🔧 Action: REPLACE — Before next shift
💸 Downtime cost if ignored (30d): $18,768
Not a dashboard they have to open.
Not a log they have to check.
A message, on their phone, telling them what to do and what it costs to ignore it.
| Layer | Algorithm | Why this over alternatives |
|---|---|---|
| Health Index | Variational Autoencoder | Captures non-linear degradation patterns; latent space encodes mechanical health state; reconstruction error is interpretable. PCA is linear and misses sensor interaction effects. |
| Anomaly Detection | Ensemble: IF + LSTM-AE + Z-Score | Isolation Forest handles global density anomalies; LSTM-AE captures temporal drift; Z-Score provides per-sensor attribution. 2-of-3 voting reduces false alarm rate to acceptable industrial levels. |
| Failure Probability | XGBoost + Platt Scaling + SHAP | Native imbalanced class handling; calibrated probabilities (not raw scores); SHAP attribution in every alert so operators understand why, not just what. |
| RUL | Weibull Survival Analysis | Industry standard (GE, Siemens, aerospace); shape parameter k>1 confirms wear-out failure mode for CNC spindles; confidence intervals included. |
| Prescriptive | Deep Q-Network (DQN) | RL agent learns maintenance policy from simulated cost episodes without hand-coded rules. Reward signal = avoided downtime. Computes 30-day cost exposure per recommended action. |
| Unified Signal | Composite Risk (CR) | Eliminates contradiction between model layers. Single operator-facing score drives alerts, action plan, and all bot displays consistently. |
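The 2-of-3 vote in the anomaly layer can be sketched as below. This is a minimal illustration, not the repo's `anomaly.py`: the Isolation Forest and LSTM-AE verdicts are taken as already-computed booleans, the Z-score cut-off of 3.0 is an assumption (the README does not state one), and the function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

Z_THRESHOLD = 3.0  # assumed cut-off; not stated in the source


def z_score_flags(reading: Dict[str, float],
                  baseline: Dict[str, List[float]]) -> List[str]:
    """Return sensors whose reading is more than Z_THRESHOLD sigmas from baseline."""
    flagged = []
    for sensor, value in reading.items():
        mu, sigma = mean(baseline[sensor]), stdev(baseline[sensor])
        if sigma > 0 and abs(value - mu) / sigma > Z_THRESHOLD:
            flagged.append(sensor)
    return flagged


def ensemble_vote(iforest_flag: bool, lstm_ae_flag: bool,
                  reading: Dict[str, float],
                  baseline: Dict[str, List[float]]) -> Tuple[bool, List[str]]:
    """2-of-3 majority vote; the Z-score detector also names the sensors involved."""
    sensors = z_score_flags(reading, baseline)
    votes = int(iforest_flag) + int(lstm_ae_flag) + int(bool(sensors))
    return votes >= 2, sensors


# Made-up baseline and reading for two of the six sensor channels.
baseline = {"vibration": [2.0, 2.1, 1.9, 2.0, 2.2],
            "spindle_temp": [60.0, 61.0, 59.0, 60.0, 62.0]}
reading = {"vibration": 4.5, "spindle_temp": 60.5}

is_anomaly, sensors = ensemble_vote(True, False, reading, baseline)
print(is_anomaly, sensors)
```

Requiring two detectors to agree means a single noisy detector cannot fire an alert on its own, while the Z-score step still reports which channels drifted.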
Persistent keyboard — no commands to memorize, no dashboard to open.
Every button shows live risk icons (🔴🟡🟢) for each machine.
| Button | What it shows |
|---|---|
| 🏭 Fleet Status | Risk Score for all 3 machines + Failure / Anomaly / Wear breakdown |
| Action Plan | What to do right now: INSPECT / REPLACE / MONITOR + cost |
| 📊 Risk Breakdown | Component bars (40% Failure · 40% Anomaly · 20% Wear) + top sensors |
| 📈 CR Trend | ASCII chart of Risk Score over last 30 readings with timestamps |
| 🔬 Parameters | Raw sensor readings + cumulative wear % |
| 📋 Report | On-demand shift summary with pending actions and cost exposure |
| ⚙️ CR Ranking | All machines ranked highest risk first |
| ❓ Help | Plain-English explanation of every number and every button |
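For the CR Trend button, a text-only chart can be produced with a simple character ramp. The sketch below is one possible approach, not the bot's actual renderer: the `sparkline` name, the ramp characters, and the sample series are assumptions, and timestamps are left out.

```python
from typing import Sequence

LEVELS = "_.:-=+*#%@"  # 10 ASCII levels, lowest to highest risk


def sparkline(values: Sequence[float], window: int = 30) -> str:
    """Render the last `window` risk scores (each in [0, 1]) as one line of ASCII."""
    recent = list(values)[-window:]
    chars = []
    for v in recent:
        clamped = min(max(v, 0.0), 1.0)
        chars.append(LEVELS[round(clamped * (len(LEVELS) - 1))])
    return "".join(chars)


# Made-up series: risk climbing steadily over 40 readings.
history = [i / 39 for i in range(40)]
print(sparkline(history))
```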
06:00 every day 🏭 Morning Report — Risk overview + pending actions + cost exposure
14:00 / 22:00 🔄 Shift Change — Anomalies in shift + top priority action
Real-time 🔴 Risk Alert — CR > 65% · max 1 per machine per 15 min
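The real-time rule (CR above 65%, at most one alert per machine per 15 minutes) amounts to a per-machine cooldown. A minimal sketch, assuming CR is expressed as a fraction; the `AlertThrottle` class and its method names are hypothetical and not taken from the repo:

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

CR_THRESHOLD = 0.65
COOLDOWN = timedelta(minutes=15)


class AlertThrottle:
    """Allows at most one alert per machine per cooldown window."""

    def __init__(self, cooldown: timedelta = COOLDOWN) -> None:
        self.cooldown = cooldown
        self._last_sent: Dict[str, datetime] = {}

    def should_send(self, machine_id: str, cr: float,
                    now: Optional[datetime] = None) -> bool:
        now = now or datetime.now()
        if cr <= CR_THRESHOLD:
            return False  # below the alert threshold
        last = self._last_sent.get(machine_id)
        if last is not None and now - last < self.cooldown:
            return False  # still inside this machine's cooldown window
        self._last_sent[machine_id] = now
        return True


throttle = AlertThrottle()
t0 = datetime(2024, 1, 1, 6, 0)  # arbitrary timestamp for the example
print(throttle.should_send("CNC-03", 0.72, t0))                         # True
print(throttle.should_send("CNC-03", 0.80, t0 + timedelta(minutes=5)))  # False
```

Keeping the cooldown per machine means a noisy CNC-03 cannot suppress a first alert from CNC-01.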
Source: CNC machining center production floor simulation
Scope: 6,000 production cycles · 3 machines · 6 sensor channels · 8% anomaly rate
| Machine | Profile |
|---|---|
| CNC-01 | Low degradation — stable operating conditions |
| CNC-02 | Moderate degradation — mid-life wear indicators |
| CNC-03 | Accelerated wear — high utilization, replacement pending |
Sensors: vibration (mm/s) · spindle temperature (°C) · dimensional deviation (μm) · cutting force (N) · servo current (A) · energy (kWh)
| Model | Metric | Result |
|---|---|---|
| VAE Health Index | Anomaly separation | Anomalies: score≈0 · Normal: mean=57.9 |
| Ensemble Anomaly | Recall | 1.00 — zero missed anomalies |
| Ensemble Anomaly | Precision | 0.48: about half of flagged cycles are true anomalies, a trade-off accepted to keep recall at 1.00 |
| XGBoost Failure | AUC | 1.00 — strong multi-sensor class separation |
| Weibull RUL | Failure mode confirmed | k>1 — wear-out pattern for CNC spindles |
| DQN Policy | Critical machine | REPLACE ✓ |
| DQN Policy | Healthy machine | IGNORE ✓ |
Scenario: 3 CNC machines · $850/hr downtime cost · 8hr MTTR
| Reactive maintenance | ShiftSentinel |
|---|---|
| Failure detected after it happens | Risk alert before failure — CR > 65% |
| $6,800 average unplanned failure cost | $200–1,500 planned intervention |
| No visibility into remaining life | Weibull RUL with 90% confidence interval |
| Operator checks dashboard manually | Alert arrives on operator's phone |
| Maintenance scheduled by calendar | Maintenance triggered by actual machine state |
30-day cost exposure per critical machine: $8,400–$18,768
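The $6,800 figure in the table follows directly from the scenario parameters. The short calculation below reproduces it and the per-event saving, under the simplifying assumption that a planned intervention fully prevents the failure. The 30-day exposure range is produced by the prescriptive engine and is not derived here.

```python
DOWNTIME_COST_PER_HOUR = 850      # USD, from the scenario above
MTTR_HOURS = 8                    # mean time to repair, from the scenario above
PLANNED_COST_RANGE = (200, 1500)  # USD, planned intervention range from the table

unplanned_cost = DOWNTIME_COST_PER_HOUR * MTTR_HOURS
savings_low = unplanned_cost - PLANNED_COST_RANGE[1]
savings_high = unplanned_cost - PLANNED_COST_RANGE[0]

print(f"Unplanned failure: ${unplanned_cost:,}")                         # $6,800
print(f"Saved per avoided failure: ${savings_low:,}-${savings_high:,}")  # $5,300-$6,600
```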
# 1. Install dependencies
pip install -r requirements.txt
# 2. Configure Telegram (.env.example → .env)
# Add TELEGRAM_TOKEN and TELEGRAM_CHAT_ID
# 3. Train all models (~5 min)
python main.py --train
# 4. Run
python main.py # real-time (60 cycles/hr)
python main.py --demo # x10 speed demo
# 5. Smoke test (no Telegram needed)
python main.py --test
Get your Telegram credentials:
Token → @BotFather → /newbot
Chat ID → send any message to @userinfobot
Each notebook is a standalone technical deep-dive into one model layer:
| Notebook | Content |
|---|---|
| 01_EDA_raw_data | Fleet dataset exploration · sensor correlations · degradation profiles per machine |
| 02_VAE_health_index | VAE architecture · latent space visualization · reconstruction error as health signal |
| 03_anomaly_detection | IF vs LSTM-AE vs Z-Score comparison · ensemble advantage · false alarm analysis |
| 04_failure_probability | XGBoost training · calibration curve · SHAP force plots per sensor |
| 05_prescriptive_engine | DQN training · policy analysis · cost sensitivity · RL reward design |
ShiftSentinel/
├── data/historical/ # 6,000-cycle fleet dataset
├── src/
│ ├── simulator/cnc_generator.py # CNC stream (3 machines · 6 sensors)
│ ├── models/
│ │ ├── health_index.py # VAE health score
│ │ ├── anomaly.py # Ensemble anomaly detector
│ │ ├── failure_prob.py # XGBoost + SHAP
│ │ ├── survival.py # Weibull RUL
│ │ └── prescriptive.py # DQN policy + cost
│ ├── bot/
│ │ ├── alerts.py # Telegram message formatters
│ │ ├── commands.py # Button handlers + keyboard UX
│ │ └── scheduler.py # ML pipeline + streaming + JobQueue
│ └── utils/
│ ├── config.py # YAML config + .env loader
│ └── logger.py # Rotating file logger
├── notebooks/ # 5 technical deep-dives
├── models/saved/ # Trained artifacts (git-ignored)
├── main.py # CLI entrypoint
├── config.yaml # All system parameters
└── .env.example # Secrets template
Python · PyTorch · XGBoost · SHAP · Lifelines · python-telegram-bot v22 · PTB JobQueue
ShiftSentinel v1.0 · ML + RL Industrial Portfolio
Unsupervised · Supervised · Reinforcement Learning · Survival Analysis · Real-time Alerting