ShiftSentinel

Intelligent Andon System + Prescriptive Maintenance

What happens when you combine ML · Reinforcement Learning · Real-time Alerting into one industrial pipeline

The Idea

Most ML repositories show you a model trained on a dataset.
A regression. A classifier. A confusion matrix. End of story.

ShiftSentinel asks a different question:

What does it look like when you chain all of it together — unsupervised learning, supervised learning, reinforcement learning, survival analysis — and connect the output to a real operator's phone?

The answer is this: an intelligent Andon system with prescriptive maintenance.

What is Andon?

Andon is a core concept from Toyota's production system — a signal that alerts workers the moment something goes wrong on the line. Traditionally it's a light, a cord, a buzzer. A human detects the problem and triggers it.

ShiftSentinel automates the entire chain:

Traditional Andon:   Human notices problem → pulls cord → light turns on

ShiftSentinel:       Sensor data → 5 ML/RL models → Telegram alert
                     "CNC-03 · Risk 72% · REPLACE before next shift · $18,768 if ignored"

No one has to notice anything. No one has to be on the floor.
The system detects, classifies, prioritizes, and tells you exactly what to do — with the cost of doing nothing included.

Why This Project Exists

The gap between a data science exercise and an industrial AI system is not the algorithm.
It's the integration.

Anyone can train an XGBoost model on a CSV. The interesting problem is:

How do you combine 5 models from different ML paradigms so they don't contradict each other?
How do you turn a floating-point prediction into a decision a factory operator can act on?
How do you make the system push information to the human — not the other way around?
What does Reinforcement Learning actually look like when applied to a real maintenance problem?

This project is a working answer to those questions.
It's a proof of concept — but everything here is grounded in how these systems actually work in industry.

What the Pipeline Does

┌─────────────────────────────────────────────────────────────────┐
│                    CNC SENSOR STREAM                            │
│              CNC-01 · CNC-02 · CNC-03                           │
│    vibration · temp · deviation · force · current · energy      │
└────────────────────────┬────────────────────────────────────────┘
                         │
        ┌────────────────▼────────────────────────────┐
        │  LAYER 1 — Health Index                     │
        │  Variational Autoencoder (Unsupervised)     │
        │  Learns what "normal" looks like            │
        │  Reconstruction error → Health 0-100        │
        └────────────────┬────────────────────────────┘
                         │
        ┌────────────────▼────────────────────────────┐
        │  LAYER 2 — Anomaly Detection                │
        │  Ensemble: Isolation Forest + LSTM-AE       │
        │             + Z-Score (Unsupervised)        │
        │  2-of-3 vote → anomaly + which sensors      │
        └────────────────┬────────────────────────────┘
                         │
        ┌────────────────▼────────────────────────────┐
        │  LAYER 3 — Failure Probability              │
        │  XGBoost + Platt Scaling + SHAP (Supervised)│
        │  P(failure) per sensor — fully explainable  │
        └────────────────┬────────────────────────────┘
                         │
        ┌────────────────▼────────────────────────────┐
        │  LAYER 3.5 — Remaining Useful Life          │
        │  Weibull Survival Analysis                  │
        │  How many cycles until failure (+ 90% CI)   │
        └────────────────┬────────────────────────────┘
                         │
        ┌────────────────▼────────────────────────────┐
        │  LAYER 4 — Prescriptive Engine              │
        │  Deep Q-Network (Reinforcement Learning)    │
        │  Agent learns cost-optimal maintenance      │
        │  policy — no hand-coded rules               │
        │  Computes 30-day downtime cost per action   │
        └────────────────┬────────────────────────────┘
                         │
        ┌────────────────▼────────────────────────────┐
        │  COMPOSITE RISK SIGNAL (CR)                 │
        │  CR = 40% Failure + 40% Anomaly + 20% Wear  │
        │  Unifies all model outputs — one number     │
        │  No contradictions between layers           │
        └────────────────┬────────────────────────────┘
                         │
        ┌────────────────▼────────────────────────────┐
        │  TELEGRAM BOT — Intelligent Andon           │
        │  Real-time risk alerts to operator's phone  │
        │  Button-first UX — no commands to memorize  │
        │  Scheduled shift reports (06:00/14:00/22:00)│
        │  Action plan: INSPECT · REPLACE · MONITOR   │
        └─────────────────────────────────────────────┘
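
The Layer 1 step from reconstruction error to a 0-100 health score can be sketched as below. The exact mapping is not specified in this README, so the exponential decay used here is an assumption, and `health_score` and `ref_error` are hypothetical names chosen for illustration.

```python
import math


def health_score(recon_error: float, ref_error: float) -> float:
    """Map a non-negative reconstruction error to a 0-100 health score.

    ref_error is a hypothetical reference level, for example a high
    percentile of the reconstruction error on known-normal cycles.
    """
    if recon_error < 0 or ref_error <= 0:
        raise ValueError("recon_error must be >= 0 and ref_error must be > 0")
    # Zero error gives 100; larger errors decay smoothly toward 0.
    return 100.0 * math.exp(-recon_error / ref_error)
```

Any monotonically decreasing mapping would serve the same purpose; the point is that a machine whose readings the VAE reconstructs poorly ends up with a low score.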

The Three Paradigms Combined

Most ML projects use one. This one uses three — connected in sequence.

Paradigm	Models Used	What it contributes
Unsupervised	VAE + Isolation Forest + LSTM-AE + Z-Score	Learns normality without labels. Detects deviation and which sensors are responsible.
Supervised	XGBoost + Platt Scaling + SHAP	Calibrated failure probability per sensor. Fully explainable — operator sees exactly why the alert fired.
Reinforcement Learning	Deep Q-Network (DQN)	Agent trained on simulated maintenance episodes. Learns when to IGNORE vs MONITOR vs INSPECT vs REPLACE by maximizing long-term reward (avoided downtime cost). No hand-coded thresholds.

Plus Weibull survival analysis, a widely used method for remaining useful life (RUL) estimation in aerospace, automotive, and industrial equipment.
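
To make the RUL idea concrete, here is a minimal sketch that assumes the Weibull shape `k` and scale `lambda` have already been fitted. It uses the closed-form conditional survival function S(t + x | t) = exp((t/λ)^k − ((t + x)/λ)^k) to get the median remaining life and a central 90% interval. The function names are hypothetical, and reading the "90% CI" as quantiles of the remaining-life distribution is an assumption; the project may derive its interval from parameter uncertainty instead.

```python
import math


def weibull_rul_quantile(age: float, shape: float, scale: float,
                         survive_prob: float) -> float:
    """Remaining cycles x such that P(T > age + x | T > age) = survive_prob."""
    if age < 0 or shape <= 0 or scale <= 0:
        raise ValueError("age must be >= 0; shape and scale must be > 0")
    if not 0.0 < survive_prob < 1.0:
        raise ValueError("survive_prob must be strictly between 0 and 1")
    base = (age / scale) ** shape
    # Solve ((age + x) / scale)^shape = base - ln(survive_prob) for x.
    return scale * (base - math.log(survive_prob)) ** (1.0 / shape) - age


def rul_summary(age: float, shape: float, scale: float) -> dict:
    """Median remaining life plus a central 90% interval, in cycles."""
    return {
        "ci90_low": weibull_rul_quantile(age, shape, scale, 0.95),
        "median": weibull_rul_quantile(age, shape, scale, 0.50),
        "ci90_high": weibull_rul_quantile(age, shape, scale, 0.05),
    }
```

With `shape > 1` the median remaining life shrinks as `age` grows, which is the wear-out behaviour the shape parameter is meant to capture; with `shape == 1` it stays constant regardless of age.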

From Model Output to Phone Notification

The notebooks cover each model in depth.
What makes this project different is what happens after the models run.

Every sensor cycle goes through the full pipeline and produces one number: Composite Risk (CR).

CR = 40% × Failure Probability
   + 40% × Anomaly Score
   + 20% × Machine Wear
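
As a quick sketch of the weighting above: the weights come from the formula, while the function name and the assumption that all three inputs are already scaled to the range 0 to 1 are illustrative.

```python
def composite_risk(failure_prob: float, anomaly_score: float, wear: float) -> float:
    """Weighted Composite Risk (CR) in [0, 1]; inputs are assumed to be in [0, 1]."""
    for name, value in (("failure_prob", failure_prob),
                        ("anomaly_score", anomaly_score),
                        ("wear", wear)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    # 40% failure probability + 40% anomaly score + 20% machine wear.
    return 0.40 * failure_prob + 0.40 * anomaly_score + 0.20 * wear
```

For example, a failure probability of 0.9, an anomaly score of 0.7, and wear of 0.4 combine to about 0.72, which would be reported as a risk of 72%.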

When CR exceeds 65%, the operator receives a Telegram message like this:

🔴 RISK ALERT — CNC-03
Cycle #1,247 | 06:23 AM
Risk: 72%

🔴 Vibration: 91% failure risk
🔴 Cut Force: 88% failure risk
🔴 Dim Deviation: 84% failure risk
🟡 Spindle Temp: 67% failure risk

🔧 Action: REPLACE — Before next shift
💸 Downtime cost if ignored (30d): $18,768

Not a dashboard they have to open.
Not a log they have to check.
A message, on their phone, telling them what to do and what it costs to ignore it.

ML Models — Technical Decisions

Layer	Algorithm	Why this over alternatives
Health Index	Variational Autoencoder	Captures non-linear degradation patterns; latent space encodes mechanical health state; reconstruction error is interpretable. PCA is linear and misses sensor interaction effects.
Anomaly Detection	Ensemble: IF + LSTM-AE + Z-Score	Isolation Forest handles global density anomalies; LSTM-AE captures temporal drift; Z-Score provides per-sensor attribution. Requiring 2 of 3 detectors to agree suppresses alarms raised by a single detector.
Failure Probability	XGBoost + Platt Scaling + SHAP	Native imbalanced class handling; calibrated probabilities (not raw scores); SHAP attribution in every alert so operators understand why, not just what.
RUL	Weibull Survival Analysis	Industry standard (GE, Siemens, aerospace); shape parameter k>1 confirms wear-out failure mode for CNC spindles; confidence intervals included.
Prescriptive	Deep Q-Network (DQN)	RL agent learns maintenance policy from simulated cost episodes without hand-coded rules. Reward signal = avoided downtime. Computes 30-day cost exposure per recommended action.
Unified Signal	Composite Risk (CR)	Eliminates contradiction between model layers. Single operator-facing score drives alerts, action plan, and all bot displays consistently.
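
The 2-of-3 vote in the anomaly layer can be illustrated roughly as follows. This is a sketch under assumptions: the two model detectors are reduced to boolean flags, the Z-Score cutoff of 3.0 is a placeholder rather than the project's configured value, and `ensemble_vote` is a hypothetical name.

```python
def ensemble_vote(if_flag: bool, lstm_flag: bool, z_scores: dict,
                  z_threshold: float = 3.0):
    """Return (is_anomaly, flagged_sensors) using a 2-of-3 majority vote.

    if_flag / lstm_flag: decisions of the Isolation Forest and LSTM-AE.
    z_scores: mapping of sensor name to its z-score for the current cycle.
    """
    # The Z-Score detector votes "anomaly" if any sensor exceeds the cutoff,
    # and the offending sensors double as the attribution.
    flagged = sorted(name for name, z in z_scores.items() if abs(z) > z_threshold)
    votes = int(if_flag) + int(lstm_flag) + int(bool(flagged))
    return votes >= 2, flagged
```

A single detector firing on its own is not enough to raise an anomaly, which is the mechanism behind the lower false alarm rate.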

Bot Interface

Persistent keyboard — no commands to memorize, no dashboard to open.
Every button shows live risk icons (🔴🟡🟢) for each machine.

Button	What it shows
🏭 Fleet Status	Risk Score for all 3 machines + Failure / Anomaly / Wear breakdown
⚠️ Actions	What to do right now — INSPECT / REPLACE / MONITOR + cost
📊 Risk Breakdown	Component bars (40% Failure · 40% Anomaly · 20% Wear) + top sensors
📈 CR Trend	ASCII chart of Risk Score over last 30 readings with timestamps
🔬 Parameters	Raw sensor readings + cumulative wear %
📋 Report	On-demand shift summary with pending actions and cost exposure
⚙️ CR Ranking	All machines ranked highest risk first
❓ Help	Plain-English explanation of every number and every button
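
For a sense of what the CR Trend button might render, here is a small text-chart sketch. The layout (rows of `#` columns, five bands) is an assumption made for illustration and not the bot's actual output; only the 30-reading window comes from the table above.

```python
def ascii_trend(values, height: int = 5, window: int = 30) -> str:
    """Render the last `window` risk scores (0-100) as a column chart of '#'."""
    recent = list(values)[-window:]
    if not recent:
        return ""
    rows = []
    for level in range(height, 0, -1):
        # A column is filled in this row when the value exceeds the
        # lower edge of the row's band.
        cutoff = 100.0 * (level - 1) / height
        rows.append("".join("#" if v > cutoff else " " for v in recent))
    return "\n".join(rows)
```

Each character column is one reading, oldest on the left, so a rising risk shows up as columns that grow taller toward the right.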

Automated Alerts

06:00 every day     🏭 Morning Report — Risk overview + pending actions + cost exposure
14:00 / 22:00       🔄 Shift Change — Anomalies in shift + top priority action
Real-time           🔴 Risk Alert — CR > 65% · max 1 per machine per 15 min
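
The real-time rule (CR above 65%, at most one alert per machine every 15 minutes) could be enforced with a small gate like the one below. `AlertGate` is a hypothetical helper written for this README, not the code in `scheduler.py`, and it assumes CR is expressed as a fraction between 0 and 1.

```python
from datetime import datetime, timedelta


class AlertGate:
    """Decide whether a risk alert should be sent for a machine right now."""

    def __init__(self, threshold: float = 0.65,
                 cooldown: timedelta = timedelta(minutes=15)):
        self.threshold = threshold
        self.cooldown = cooldown
        self._last_sent = {}  # machine id -> time of the last alert sent

    def should_alert(self, machine_id: str, cr: float, now: datetime) -> bool:
        if cr <= self.threshold:
            return False
        last = self._last_sent.get(machine_id)
        if last is not None and now - last < self.cooldown:
            return False  # still inside this machine's cooldown window
        self._last_sent[machine_id] = now
        return True
```

Tracking the cooldown per machine means a noisy CNC-03 cannot suppress a first alert from CNC-01.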

Dataset

Source: CNC machining center production floor simulation
Scope: 6,000 production cycles · 3 machines · 6 sensor channels · 8% anomaly rate

Machine	Profile
CNC-01	Low degradation — stable operating conditions
CNC-02	Moderate degradation — mid-life wear indicators
CNC-03	Accelerated wear — high utilization, replacement pending

Sensors: vibration (mm/s) · spindle temperature (°C) · dimensional deviation (μm) · cutting force (N) · servo current (A) · energy (kWh)

Key Results

Model	Metric	Result
VAE Health Index	Anomaly separation	Anomalies: score≈0 · Normal: mean=57.9
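Ensemble Anomaly	Recall	1.00 (no missed anomalies on the simulated dataset)
Ensemble Anomaly	Precision	0.48, i.e. about half of flagged cycles are false alarms; considered acceptable for industrial predictive maintenance (PdM)
XGBoost Failure	AUC	1.00 on the simulated dataset (strong multi-sensor class separation)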
Weibull RUL	Failure mode confirmed	k>1 — wear-out pattern for CNC spindles
DQN Policy	Critical machine	REPLACE ✓
DQN Policy	Healthy machine	IGNORE ✓
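
How a trained DQN's outputs turn into one of the four actions can be sketched with epsilon-greedy selection over a vector of Q-values. The Q-values in the usage note are invented for illustration, and `select_action` is a hypothetical helper; in the project the Q-values would come from the network in `prescriptive.py`.

```python
import random

ACTIONS = ["IGNORE", "MONITOR", "INSPECT", "REPLACE"]


def select_action(q_values, epsilon: float = 0.0, rng=random) -> str:
    """Pick an action from Q-values: explore with probability epsilon, else greedy."""
    if len(q_values) != len(ACTIONS):
        raise ValueError("expected one Q-value per action")
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # exploration step, used during training
    best = max(range(len(ACTIONS)), key=lambda i: q_values[i])
    return ACTIONS[best]  # greedy step, used at inference time
```

With rewards defined as negative cost, a machine whose Q-values look like `[-18768.0, -9000.0, -4000.0, -1500.0]` maps to REPLACE, while `[0.0, -50.0, -300.0, -1500.0]` maps to IGNORE.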

Business Value

Scenario: 3 CNC machines · $850/hr downtime cost · 8hr MTTR

Reactive maintenance	ShiftSentinel
Failure detected after it happens	Risk alert before failure — CR > 65%
$6,800 average unplanned failure cost	$200–$1,500 planned intervention
No visibility into remaining life	Weibull RUL with 90% confidence interval
Operator checks dashboard manually	Alert arrives on operator's phone
Maintenance scheduled by calendar	Maintenance triggered by actual machine state

30-day cost exposure per critical machine: $8,400–$18,768
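
The $6,800 figure is consistent with multiplying the scenario's hourly downtime cost by the MTTR. The sketch below reproduces that arithmetic; treating the failure cost as exactly rate times repair time is an assumption, and the per-event savings range is derived here rather than stated in the scenario.

```python
DOWNTIME_COST_PER_HOUR = 850   # USD, from the scenario
MTTR_HOURS = 8                 # mean time to repair, from the scenario


def unplanned_failure_cost(rate: float = DOWNTIME_COST_PER_HOUR,
                           mttr_hours: float = MTTR_HOURS) -> float:
    """Downtime cost of one unplanned failure: hourly rate x repair time."""
    return rate * mttr_hours


def savings_range(unplanned: float, planned_low: float = 200,
                  planned_high: float = 1500):
    """(min, max) saving per event when a planned intervention replaces a failure."""
    return unplanned - planned_high, unplanned - planned_low
```

Under these assumptions, each failure avoided saves between $5,300 and $6,600.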

Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Configure Telegram: copy the template, then edit .env
cp .env.example .env
# Add TELEGRAM_TOKEN and TELEGRAM_CHAT_ID

# 3. Train all models (~5 min)
python main.py --train

# 4. Run
python main.py          # real-time (60 cycles/hr)
python main.py --demo   # 10x speed demo

# 5. Smoke test (no Telegram needed)
python main.py --test

Get your Telegram credentials:
Token → @BotFather → /newbot
Chat ID → send any message to @userinfobot
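
Before starting the bot, it can help to check that both credentials are actually visible to the process. The snippet below is a minimal sketch that assumes the variables have already been loaded into the environment (the project's own loader lives in `config.py`); `load_telegram_credentials` is a hypothetical helper.

```python
import os

REQUIRED_VARS = ("TELEGRAM_TOKEN", "TELEGRAM_CHAT_ID")


def load_telegram_credentials(env=os.environ) -> dict:
    """Return the Telegram credentials, or raise if any variable is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing early with the names of the missing variables is easier to debug than an authentication error from Telegram later on.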

Notebooks

Each notebook is a standalone technical deep-dive into one model layer:

Notebook	Content
`01_EDA_raw_data`	Fleet dataset exploration · sensor correlations · degradation profiles per machine
`02_VAE_health_index`	VAE architecture · latent space visualization · reconstruction error as health signal
`03_anomaly_detection`	IF vs LSTM-AE vs Z-Score comparison · ensemble advantage · false alarm analysis
`04_failure_probability`	XGBoost training · calibration curve · SHAP force plots per sensor
`05_prescriptive_engine`	DQN training · policy analysis · cost sensitivity · RL reward design

Project Structure

ShiftSentinel/
├── data/historical/              # 6,000-cycle fleet dataset
├── src/
│   ├── simulator/cnc_generator.py     # CNC stream (3 machines · 6 sensors)
│   ├── models/
│   │   ├── health_index.py            # VAE health score
│   │   ├── anomaly.py                 # Ensemble anomaly detector
│   │   ├── failure_prob.py            # XGBoost + SHAP
│   │   ├── survival.py                # Weibull RUL
│   │   └── prescriptive.py            # DQN policy + cost
│   ├── bot/
│   │   ├── alerts.py                  # Telegram message formatters
│   │   ├── commands.py                # Button handlers + keyboard UX
│   │   └── scheduler.py               # ML pipeline + streaming + JobQueue
│   └── utils/
│       ├── config.py                  # YAML config + .env loader
│       └── logger.py                  # Rotating file logger
├── notebooks/                    # 5 technical deep-dives
├── models/saved/                 # Trained artifacts (git-ignored)
├── main.py                       # CLI entrypoint
├── config.yaml                   # All system parameters
└── .env.example                  # Secrets template

Stack

Python · PyTorch · XGBoost · SHAP · Lifelines · python-telegram-bot v22 · PTB JobQueue

ShiftSentinel v1.0 · ML + RL Industrial Portfolio
Unsupervised · Supervised · Reinforcement Learning · Survival Analysis · Real-time Alerting
