LozanoLsa/ShiftSentinel

ShiftSentinel

Intelligent Andon System + Prescriptive Maintenance

What happens when you combine ML · Reinforcement Learning · Real-time Alerting into one industrial pipeline


The Idea

Most ML repositories show you a model trained on a dataset.
A regression. A classifier. A confusion matrix. End of story.

ShiftSentinel asks a different question:

What does it look like when you chain all of it together — unsupervised learning, supervised learning, reinforcement learning, survival analysis — and connect the output to a real operator's phone?

The answer is this: an intelligent Andon system with prescriptive maintenance.

What is Andon?

Andon is a core concept from Toyota's production system — a signal that alerts workers the moment something goes wrong on the line. Traditionally it's a light, a cord, a buzzer. A human detects the problem and triggers it.

ShiftSentinel automates the entire chain:

Traditional Andon:   Human notices problem → pulls cord → light turns on

ShiftSentinel:       Sensor data → 5 ML/RL models → Telegram alert
                     "CNC-03 · Risk 72% · INSPECT within 24h · $18,700 if ignored"

No one has to notice anything. No one has to be on the floor.
The system detects, classifies, prioritizes, and tells you exactly what to do — with the cost of doing nothing included.


Why This Project Exists

The gap between a data science exercise and an industrial AI system is not the algorithm.
It's the integration.

Anyone can train an XGBoost model on a CSV. The interesting problem is:

  • How do you combine 5 models from different ML paradigms so they don't contradict each other?
  • How do you turn a floating-point prediction into a decision a factory operator can act on?
  • How do you make the system push information to the human — not the other way around?
  • What does Reinforcement Learning actually look like when applied to a real maintenance problem?

This project is a working answer to those questions.
It's a proof of concept — but everything here is grounded in how these systems actually work in industry.


What the Pipeline Does

┌─────────────────────────────────────────────────────────────────┐
│                    CNC SENSOR STREAM                            │
│              CNC-01 · CNC-02 · CNC-03                           │
│    vibration · temp · deviation · force · current · energy      │
└────────────────────────┬────────────────────────────────────────┘
                         │
        ┌────────────────▼────────────────────────────┐
        │  LAYER 1 — Health Index                     │
        │  Variational Autoencoder (Unsupervised)     │
        │  Learns what "normal" looks like            │
        │  Reconstruction error → Health 0-100        │
        └────────────────┬────────────────────────────┘
                         │
        ┌────────────────▼────────────────────────────┐
        │  LAYER 2 — Anomaly Detection                │
        │  Ensemble: Isolation Forest + LSTM-AE       │
        │             + Z-Score (Unsupervised)        │
        │  2-of-3 vote → anomaly + which sensors      │
        └────────────────┬────────────────────────────┘
                         │
        ┌────────────────▼────────────────────────────┐
        │  LAYER 3 — Failure Probability              │
        │  XGBoost + Platt Scaling + SHAP (Supervised)│
        │  P(failure) per sensor — fully explainable  │
        └────────────────┬────────────────────────────┘
                         │
        ┌────────────────▼────────────────────────────┐
        │  LAYER 3.5 — Remaining Useful Life          │
        │  Weibull Survival Analysis                  │
        │  How many cycles until failure (+ 90% CI)  │
        └────────────────┬────────────────────────────┘
                         │
        ┌────────────────▼────────────────────────────┐
        │  LAYER 4 — Prescriptive Engine              │
        │  Deep Q-Network (Reinforcement Learning)    │
        │  Agent learns cost-optimal maintenance      │
        │  policy — no hand-coded rules               │
        │  Computes 30-day downtime cost per action   │
        └────────────────┬────────────────────────────┘
                         │
        ┌────────────────▼────────────────────────────┐
        │  COMPOSITE RISK SIGNAL (CR)                 │
        │  CR = 40% Failure + 40% Anomaly + 20% Wear  │
        │  Unifies all model outputs — one number     │
        │  No contradictions between layers           │
        └────────────────┬────────────────────────────┘
                         │
        ┌────────────────▼────────────────────────────┐
        │  TELEGRAM BOT — Intelligent Andon           │
        │  Real-time risk alerts to operator's phone  │
        │  Button-first UX — no commands to memorize  │
        │  Scheduled shift reports (06:00/14:00/22:00)│
        │  Action plan: INSPECT · REPLACE · MONITOR  │
        └─────────────────────────────────────────────┘

The Three Paradigms Combined

Most ML projects use one. This one uses three — connected in sequence.

| Paradigm | Models Used | What it contributes |
| --- | --- | --- |
| Unsupervised | VAE + Isolation Forest + LSTM-AE + Z-Score | Learns normality without labels. Detects deviation and which sensors are responsible. |
| Supervised | XGBoost + Platt Scaling + SHAP | Calibrated failure probability per sensor. Fully explainable: the operator sees exactly why the alert fired. |
| Reinforcement Learning | Deep Q-Network (DQN) | Agent trained on simulated maintenance episodes. Learns when to IGNORE vs MONITOR vs INSPECT vs REPLACE by maximizing long-term reward (avoided downtime cost). No hand-coded thresholds. |

Plus Weibull survival analysis, a standard reliability-engineering method for RUL estimation that is used in aerospace, automotive, and industrial equipment.
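
To make the reinforcement learning row above more concrete, here is a minimal sketch of the update that Q-learning is built on. It uses a plain lookup table where a DQN would use a neural network, so it is a simplified stand-in rather than the agent in this repository. The four action names come from the table above; the state names, reward values, discount factor `gamma`, and learning rate `alpha` are illustrative assumptions.

```python
# Tabular stand-in for a DQN update. A real DQN replaces q_table with a neural network.
ACTIONS = ["IGNORE", "MONITOR", "INSPECT", "REPLACE"]


def td_target(reward, next_q_values, gamma=0.95, done=False):
    """Bellman target: immediate reward plus the discounted best future value."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.95, done=False):
    """Move Q(state, action) a step of size alpha toward the TD target."""
    next_q = [q_table[next_state][a] for a in ACTIONS]
    target = td_target(reward, next_q, gamma, done)
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


# Hypothetical two-state example; the rewards are negative costs (assumed values).
q = {s: {a: 0.0 for a in ACTIONS} for s in ("healthy", "critical")}
q_update(q, "critical", "IGNORE", reward=-100.0, next_state="critical")
q_update(q, "critical", "REPLACE", reward=-10.0, next_state="healthy")
```

After these two updates, ignoring a critical machine already carries a lower estimated value than replacing it. Repeating such updates over many simulated episodes is how a cost-driven policy can emerge without hand-coded thresholds.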


From Model Output to Phone Notification

The notebooks cover each model in depth.
What makes this project different is what happens after the models run.

Every sensor cycle goes through the full pipeline and produces one number: Composite Risk (CR).

CR = 40% × Failure Probability
   + 40% × Anomaly Score
   + 20% × Machine Wear
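
The weighted sum above can be sketched in a few lines of Python. The weights and the 65% alert threshold come from this README; the function names, the assumption that every input is already scaled to the range 0 to 1, and the example input values are hypothetical and not taken from the repository code.

```python
# Weights and threshold as stated in this README.
WEIGHTS = {"failure": 0.40, "anomaly": 0.40, "wear": 0.20}
ALERT_THRESHOLD = 0.65


def composite_risk(failure_prob, anomaly_score, wear):
    """Combine the three signals into one score; inputs are assumed to lie in [0, 1]."""
    inputs = {"failure": failure_prob, "anomaly": anomaly_score, "wear": wear}
    for name, value in inputs.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * value for name, value in inputs.items())


def should_alert(cr):
    """An alert fires only when CR is strictly above the threshold."""
    return cr > ALERT_THRESHOLD


# Illustrative inputs only: 0.4*0.80 + 0.4*0.70 + 0.2*0.60
cr = composite_risk(failure_prob=0.80, anomaly_score=0.70, wear=0.60)
print(f"CR = {cr:.0%}, alert = {should_alert(cr)}")
```

Because the weights sum to 1, CR stays in the same 0 to 1 range as its inputs, which is what lets a single threshold drive every alert.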

When CR crosses 65%, the operator's Telegram receives:

🔴 RISK ALERT — CNC-03
Cycle #1,247 | 06:23 AM
Risk: 72%

🔴 Vibration: 91% failure risk
🔴 Cut Force: 88% failure risk
🔴 Dim Deviation: 84% failure risk
🟡 Spindle Temp: 67% failure risk

🔧 Action: REPLACE — Before next shift
💸 Downtime cost if ignored (30d): $18,768

Not a dashboard they have to open.
Not a log they have to check.
A message, on their phone, telling them what to do and what it costs to ignore it.


ML Models — Technical Decisions

| Layer | Algorithm | Why this over alternatives |
| --- | --- | --- |
| Health Index | Variational Autoencoder | Captures non-linear degradation patterns; latent space encodes mechanical health state; reconstruction error is interpretable. PCA is linear and misses sensor interaction effects. |
| Anomaly Detection | Ensemble: IF + LSTM-AE + Z-Score | Isolation Forest handles global density anomalies; LSTM-AE captures temporal drift; Z-Score provides per-sensor attribution. 2-of-3 voting filters out alarms raised by only one detector, which lowers the false alarm rate. |
| Failure Probability | XGBoost + Platt Scaling + SHAP | Native imbalanced class handling; calibrated probabilities (not raw scores); SHAP attribution in every alert so operators understand why, not just what. |
| RUL | Weibull Survival Analysis | Industry standard (GE, Siemens, aerospace); a shape parameter k>1 indicates a wear-out failure mode for CNC spindles; confidence intervals included. |
| Prescriptive | Deep Q-Network (DQN) | RL agent learns maintenance policy from simulated cost episodes without hand-coded rules. Reward signal = avoided downtime. Computes 30-day cost exposure per recommended action. |
| Unified Signal | Composite Risk (CR) | Eliminates contradiction between model layers. Single operator-facing score drives alerts, action plan, and all bot displays consistently. |
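
The 2-of-3 vote in the anomaly detection row can be illustrated with a small sketch. Only the voting rule and the role of the Z-Score as per-sensor attribution come from this README. The function names, the 3-sigma cutoff, and the sample readings are assumptions, and the Isolation Forest and LSTM-AE verdicts are passed in as plain booleans rather than computed here.

```python
from statistics import mean, stdev


def zscore_flags(baseline, reading, threshold=3.0):
    """Return the sensors whose reading is more than `threshold` sigmas from baseline."""
    flagged = []
    for sensor, history in baseline.items():
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(reading[sensor] - mu) / sigma > threshold:
            flagged.append(sensor)
    return flagged


def ensemble_vote(iforest_flag, lstm_ae_flag, zscore_sensors, min_votes=2):
    """Declare an anomaly when at least `min_votes` of the three detectors agree."""
    votes = int(iforest_flag) + int(lstm_ae_flag) + int(bool(zscore_sensors))
    return {"anomaly": votes >= min_votes, "votes": votes, "sensors": zscore_sensors}


# Hypothetical baseline history and one new reading.
baseline = {
    "vibration": [2.0, 2.1, 1.9, 2.0, 2.2],
    "spindle_temp": [60.0, 61.0, 59.0, 60.0, 62.0],
}
reading = {"vibration": 4.5, "spindle_temp": 60.5}

sensors = zscore_flags(baseline, reading)
result = ensemble_vote(iforest_flag=True, lstm_ae_flag=False, zscore_sensors=sensors)
```

In this example the Z-Score detector and the Isolation Forest flag agree, so the vote passes even though the LSTM-AE flag is false, and the result names the vibration channel as the sensor responsible.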

Bot Interface

Persistent keyboard — no commands to memorize, no dashboard to open.
Every button shows live risk icons (🔴🟡🟢) for each machine.

| Button | What it shows |
| --- | --- |
| 🏭 Fleet Status | Risk Score for all 3 machines + Failure / Anomaly / Wear breakdown |
| ⚠️ Actions | What to do right now: INSPECT / REPLACE / MONITOR + cost |
| 📊 Risk Breakdown | Component bars (40% Failure · 40% Anomaly · 20% Wear) + top sensors |
| 📈 CR Trend | ASCII chart of Risk Score over last 30 readings with timestamps |
| 🔬 Parameters | Raw sensor readings + cumulative wear % |
| 📋 Report | On-demand shift summary with pending actions and cost exposure |
| ⚙️ CR Ranking | All machines ranked highest risk first |
| Help | Plain-English explanation of every number and every button |

Automated Alerts

06:00 every day     🏭 Morning Report — Risk overview + pending actions + cost exposure
14:00 / 22:00       🔄 Shift Change — Anomalies in shift + top priority action
Real-time           🔴 Risk Alert — CR > 65% · max 1 per machine per 15 min
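
The real-time rule above (CR above 65%, at most one alert per machine every 15 minutes) amounts to a per-machine cooldown. The sketch below shows one way to express it using only the standard library. The class and method names are hypothetical, and the repository's scheduler may implement this differently.

```python
from datetime import datetime, timedelta


class AlertThrottle:
    """Allow an alert only above the CR threshold and outside the per-machine cooldown."""

    def __init__(self, threshold=0.65, cooldown=timedelta(minutes=15)):
        self.threshold = threshold
        self.cooldown = cooldown
        self._last_sent = {}  # machine_id -> time of the last alert that was sent

    def should_send(self, machine_id, cr, now):
        if cr <= self.threshold:
            return False
        last = self._last_sent.get(machine_id)
        if last is not None and now - last < self.cooldown:
            return False
        self._last_sent[machine_id] = now
        return True


throttle = AlertThrottle()
t0 = datetime(2024, 1, 1, 6, 23)
first = throttle.should_send("CNC-03", 0.72, t0)
repeat = throttle.should_send("CNC-03", 0.80, t0 + timedelta(minutes=5))
later = throttle.should_send("CNC-03", 0.80, t0 + timedelta(minutes=15))
```

Passing `now` in explicitly, instead of reading the clock inside the method, keeps the rule easy to test and lets a sped-up demo mode supply simulated time.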

Dataset

Source: CNC machining center production floor simulation
Scope: 6,000 production cycles · 3 machines · 6 sensor channels · 8% anomaly rate

| Machine | Profile |
| --- | --- |
| CNC-01 | Low degradation: stable operating conditions |
| CNC-02 | Moderate degradation: mid-life wear indicators |
| CNC-03 | Accelerated wear: high utilization, replacement pending |

Sensors: vibration (mm/s) · spindle temperature (°C) · dimensional deviation (μm) · cutting force (N) · servo current (A) · energy (kWh)


Key Results

| Model | Metric | Result |
| --- | --- | --- |
| VAE Health Index | Anomaly separation | Anomalies: score≈0 · Normal: mean=57.9 |
| Ensemble Anomaly | Recall | 1.00 (zero missed anomalies) |
| Ensemble Anomaly | Precision | 0.48 (acceptable for industrial PdM) |
| XGBoost Failure | AUC | 1.00 (strong multi-sensor class separation) |
| Weibull RUL | Failure mode confirmed | k>1 (wear-out pattern for CNC spindles) |
| DQN Policy | Critical machine | REPLACE ✓ |
| DQN Policy | Healthy machine | IGNORE ✓ |
Business Value

Scenario: 3 CNC machines · $850/hr downtime cost · 8hr MTTR

| Reactive maintenance | ShiftSentinel |
| --- | --- |
| Failure detected after it happens | Risk alert before failure (CR > 65%) |
| $6,800 average unplanned failure cost | $200–1,500 planned intervention |
| No visibility into remaining life | Weibull RUL with 90% confidence interval |
| Operator checks dashboard manually | Alert arrives on operator's phone |
| Maintenance scheduled by calendar | Maintenance triggered by actual machine state |

30-day cost exposure per critical machine: $8,400–$18,768
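
The $6,800 unplanned failure cost follows directly from the scenario figures: $850 per hour of downtime times an 8-hour MTTR. The short sketch below reproduces that arithmetic and the resulting saving per avoided failure. The function names are hypothetical. The 30-day exposure figures are not reproduced here, since this README does not spell out how they are computed.

```python
# Scenario figures as stated in this README.
DOWNTIME_COST_PER_HOUR = 850   # USD per hour
MTTR_HOURS = 8                 # mean time to repair
PLANNED_COST_RANGE = (200, 1500)  # USD, planned intervention


def unplanned_failure_cost(rate=DOWNTIME_COST_PER_HOUR, hours=MTTR_HOURS):
    """Cost of one unplanned failure: hourly downtime cost times repair time."""
    return rate * hours


def savings_per_avoided_failure(planned_range=PLANNED_COST_RANGE):
    """Return (smallest, largest) saving when a planned intervention replaces a failure."""
    failure = unplanned_failure_cost()
    cheapest, priciest = planned_range
    return failure - priciest, failure - cheapest


print(unplanned_failure_cost())        # 850 * 8
print(savings_per_avoided_failure())   # saving range in USD
```

Under these figures, each failure caught early saves between $5,300 and $6,600, before counting any secondary costs that the scenario does not model.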


Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Configure Telegram (.env.example → .env)
# Add TELEGRAM_TOKEN and TELEGRAM_CHAT_ID

# 3. Train all models (~5 min)
python main.py --train

# 4. Run
python main.py          # real-time (60 cycles/hr)
python main.py --demo   # x10 speed demo

# 5. Smoke test (no Telegram needed)
python main.py --test

Get your Telegram credentials:
Token → open a chat with @BotFather and send /newbot
Chat ID → send any message to @userinfobot
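
Before starting the bot, it can help to confirm that both credentials are actually set. The sketch below checks for the two variable names mentioned in step 2. It is an illustration only, not the loader in `src/utils/config.py`, and the function name is hypothetical.

```python
import os

# Variable names as given in the Quick Start above.
REQUIRED_VARS = ("TELEGRAM_TOKEN", "TELEGRAM_CHAT_ID")


def load_telegram_credentials(env=None):
    """Return both credentials, or raise with a list of whichever are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Example with a plain dict standing in for the process environment.
creds = load_telegram_credentials(
    {"TELEGRAM_TOKEN": "placeholder-token", "TELEGRAM_CHAT_ID": "placeholder-id"}
)
```

Failing fast with the names of the missing variables is usually easier to debug than a connection error raised later by the Telegram client.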


Notebooks

Each notebook is a standalone technical deep-dive into one model layer:

| Notebook | Content |
| --- | --- |
| 01_EDA_raw_data | Fleet dataset exploration · sensor correlations · degradation profiles per machine |
| 02_VAE_health_index | VAE architecture · latent space visualization · reconstruction error as health signal |
| 03_anomaly_detection | IF vs LSTM-AE vs Z-Score comparison · ensemble advantage · false alarm analysis |
| 04_failure_probability | XGBoost training · calibration curve · SHAP force plots per sensor |
| 05_prescriptive_engine | DQN training · policy analysis · cost sensitivity · RL reward design |

Project Structure

ShiftSentinel/
├── data/historical/              # 6,000-cycle fleet dataset
├── src/
│   ├── simulator/cnc_generator.py     # CNC stream (3 machines · 6 sensors)
│   ├── models/
│   │   ├── health_index.py            # VAE health score
│   │   ├── anomaly.py                 # Ensemble anomaly detector
│   │   ├── failure_prob.py            # XGBoost + SHAP
│   │   ├── survival.py                # Weibull RUL
│   │   └── prescriptive.py            # DQN policy + cost
│   ├── bot/
│   │   ├── alerts.py                  # Telegram message formatters
│   │   ├── commands.py                # Button handlers + keyboard UX
│   │   └── scheduler.py               # ML pipeline + streaming + JobQueue
│   └── utils/
│       ├── config.py                  # YAML config + .env loader
│       └── logger.py                  # Rotating file logger
├── notebooks/                    # 5 technical deep-dives
├── models/saved/                 # Trained artifacts (git-ignored)
├── main.py                       # CLI entrypoint
├── config.yaml                   # All system parameters
└── .env.example                  # Secrets template

Stack

Python · PyTorch · XGBoost · SHAP · Lifelines · python-telegram-bot v22 · PTB JobQueue


ShiftSentinel v1.0 · ML + RL Industrial Portfolio
Unsupervised · Supervised · Reinforcement Learning · Survival Analysis · Real-time Alerting