🔄 Project Pipeline — Simple Explanation

AI-Driven Manufacturing Intelligence — Track A

This document explains the complete pipeline of the project in simple language.
What happens first, what happens next, and why — step by step.


🗺️ The Big Picture

YOUR DATA (2 Excel Files)
        │
        ▼
STEP 1: Setup & Load Data
        │
        ▼
STEP 2: Clean & Understand Data (EDA)
        │
        ▼
STEP 3: Feature Engineering
        │  (create missing columns, simulate sensors)
        ▼
STEP 4: Train Prediction Models
        │  (XGBoost + RF + MLP + Stacking)
        ▼
STEP 5: Train Anomaly Detection
        │  (Isolation Forest + LSTM Autoencoder)
        ▼
STEP 6: Add Explainability (SHAP)
        │
        ▼
STEP 7: Carbon Footprint Module
        │
        ▼
STEP 8: Build the API (FastAPI)
        │
        ▼
STEP 9: Build the Dashboard (Next.js)
        │
        ▼
    FINAL SYSTEM ✅

STEP 1 — Setup & Load Data

What You Do

  • Create the project folder structure
  • Install all Python libraries
  • Load both Excel files into Python

Files Involved

data/raw/_h_batch_process_data.xlsx      ← sensor log (211 rows, 1 batch, T001)
data/raw/_h_batch_production_data.xlsx   ← batch records (60 rows, 60 batches)

Code That Runs

src/config.py          ← set all paths and constants FIRST
src/preprocessing.py   ← load_data() and validate_data()

What You Get After This Step

df_process  →  211 rows × 11 columns  (sensor readings for T001)
df_prod     →  60 rows  × 15 columns  (batch settings + quality outcomes)

How to Know It Worked

✅ No error when loading files
✅ Shape prints correctly: (211, 11) and (60, 15)
✅ "Validation passed: no missing values" message appears

STEP 2 — Clean & Understand Data (EDA)

What You Do

  • Plot graphs of every column
  • Find which features are strongly related to quality targets
  • Understand the 8 phases and how energy is distributed across them
  • Detect any outliers

Files Involved

notebooks/01_EDA.ipynb   ← run this notebook

What You Look For

Correlation between features and targets:

Compression_Force vs Hardness       →  0.99  (very strong)
Moisture_Content  vs Dissolution    →  -0.99 (very strong, negative)
Lubricant_Conc    vs Friability     →  0.99  (very strong)

This tells you: these features are the key drivers. Models will work well.

Energy per phase:

Compression phase  →  38.69 kWh  (50.4% of total energy) 🔴
Drying phase       →  10.09 kWh  (13.1%)
Milling phase      →   9.00 kWh  (11.7%)
All others         →  18.96 kWh  (24.8%)

This tells you: Compression is where you save the most energy.

Vibration per phase:

Milling phase      →  max 9.79 mm/s  ⚠️  highest — bearing wear risk
Compression phase  →  max 6.69 mm/s  — second highest
Preparation phase  →  max 0.20 mm/s  — almost zero (expected)

What You Get After This Step

✅ Correlation heatmap saved
✅ Phase energy breakdown chart saved
✅ Distribution plots for all 60 batches saved
✅ Clear understanding of which features matter most

STEP 3 — Feature Engineering

What You Do

This step has 3 parts:


Part A — Derive Missing Energy Column

File 2 (batch_production_data) has no energy column. You create it using physics:

Energy depends on:
  - How fast the machine runs  (Machine_Speed)
  - How hot the dryer is       (Drying_Temp)
  - How hard you compress      (Compression_Force)

Formula:
  Energy_kWh = 76.74 × (Machine_Speed/169.17)^1.5
                     × (Drying_Temp/59.40)^0.8
                     × (Compression_Force/11.60)^1.2
                     + small random noise

  Carbon_kgCO2e = Energy_kWh × 0.716
  (0.716 = India grid emission factor from CEA government data)
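The derivation can be sketched in Python. The base constants and exponents come from the formula above; the ±2% noise level and the function signature are assumptions:

```python
import random

BASE_KWH = 76.74           # total energy of reference batch T001
EMISSION_FACTOR = 0.716    # kg CO2e per kWh, India grid (CEA)

def derive_energy_and_carbon(machine_speed, drying_temp, compression_force,
                             noise_pct=0.02, rng=None):
    """Estimate energy for batches that have no sensor log.

    Reference settings (169.17, 59.40, 11.60) are T001's values; the
    exponents model how strongly each setting drives power draw.
    """
    rng = rng or random.Random(0)
    energy = (BASE_KWH
              * (machine_speed / 169.17) ** 1.5
              * (drying_temp / 59.40) ** 0.8
              * (compression_force / 11.60) ** 1.2)
    energy *= 1 + rng.uniform(-noise_pct, noise_pct)   # small random noise
    return energy, energy * EMISSION_FACTOR

# At the reference settings the formula reproduces ~76.74 kWh (plus noise):
e, c = derive_energy_and_carbon(169.17, 59.40, 11.60)
```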

Part B — Simulate Sensor Data for T002–T060

Only T001 has minute-by-minute sensor readings. You simulate the rest:

Take T001's sensor profile as the base template
For each batch T002–T060:
  Scale the power UP or DOWN based on their settings:
    Higher Machine_Speed  →  scale power UP
    Higher Drying_Temp    →  scale drying phase power UP
    Higher Compression    →  scale compression phase power UP
  Add small random noise (±5%) to make it realistic
  For 6 batches (10%): inject a fake fault:
    Type A: spike vibration in Milling   (simulates bearing wear)
    Type B: spike power in Compression   (simulates motor overload)
    Type C: elevate power in Drying      (simulates damp raw material)

Why inject faults? So the anomaly detection model has examples of bad patterns to learn from.
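A minimal sketch of the scale-and-inject logic above (the 1.5 exponent reuses Part A's speed sensitivity; the 1.8× spike magnitude and the function name are hypothetical):

```python
import random

def simulate_batch(base_power, speed_ratio, fault=None, rng=None):
    """Scale T001's power profile for one batch, optionally injecting a fault.

    base_power:  list of T001 power readings (one per timestep)
    speed_ratio: batch Machine_Speed / T001 Machine_Speed
    fault:       None or "power_spike" (Type B, motor overload) in this sketch
    """
    rng = rng or random.Random(42)
    power = [p * speed_ratio ** 1.5 * (1 + rng.uniform(-0.05, 0.05))  # ±5% noise
             for p in base_power]
    if fault == "power_spike":
        power[rng.randrange(len(power))] *= 1.8   # assumed spike size
    return power
```

The real simulate_all_batches() would additionally scale the drying and compression phases by their own settings and handle fault types A and C.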


Part C — Extract Phase-Level Features from Sensor Data

Instead of using 211 raw timesteps, you summarize each phase into statistics:

For each batch, for each phase:
  power_mean, power_max, power_std
  vibration_mean, vibration_max

This gives you 32 features per batch (8 phases × 4 stats)

Plus 6 extra derived features:
  total_energy_kwh
  compression_energy_share (%)
  power_vibration_ratio
  rolling_5min_power_slope
  fft_dominant_vibration_frequency
  phase_transition_sharpness
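Part C boils down to a small aggregation loop. A sketch with illustrative key names (the real extract_phase_features() works on the simulated sensor data):

```python
from statistics import mean, pstdev

def phase_stats(readings):
    """readings: dict mapping phase name -> list of (power, vibration) rows."""
    feats = {}
    for phase, rows in readings.items():
        power = [p for p, _ in rows]
        vib = [v for _, v in rows]
        feats[f"{phase}_power_mean"] = mean(power)
        feats[f"{phase}_power_max"] = max(power)
        feats[f"{phase}_power_std"] = pstdev(power)
        feats[f"{phase}_vibration_mean"] = mean(vib)
        feats[f"{phase}_vibration_max"] = max(vib)
    return feats
```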

Merge Everything Together

df_prod (60 rows × 15 cols)
    +
phase_features (60 rows × 38 cols)
    +
Energy_kWh, Carbon_kgCO2e (derived)
    =
merged_dataset.csv  (60 rows × ~22 cols)  ← This is your final ML input

Files Involved

src/preprocessing.py       ← derive_energy_and_carbon()
src/simulate_sensors.py    ← simulate_all_batches()
src/feature_engineering.py ← extract_phase_features(), merge_datasets()

What You Get After This Step

data/processed/merged_dataset.csv     ← 60 rows × ~22 features, ready for ML
data/simulated/simulated_sensors.csv  ← sensor data for all 60 batches

STEP 4 — Train Prediction Models

What You Do

Split data and train 4 models:

merged_dataset.csv (60 rows)
        │
        ├── 48 rows → TRAINING SET (80%)
        └── 12 rows → TEST SET     (20%)

Input features (X):

Granulation_Time, Binder_Amount, Drying_Temp, Drying_Time,
Compression_Force, Machine_Speed, Lubricant_Conc, Moisture_Content
+ phase energy features from sensor simulation

Output targets (Y) — all predicted at once:

Hardness, Friability, Dissolution_Rate, Content_Uniformity,
Disintegration_Time, Tablet_Weight, Energy_kWh

Model 1 — XGBoost (Primary Model)

What it is:  A powerful decision tree algorithm
Why use it:  Best performance on small tabular datasets like ours (n=60)
Tuning:      Optuna runs 50 trials to find best hyperparameters (takes ~5 min)

Internally:  Trains one XGBRegressor per target (7 targets = 7 models under the hood)

Model 2 — Random Forest (Backup Model)

What it is:  300 decision trees, each trained on a random subset
Why use it:  More stable than XGBoost, good for diversity in ensemble
Tuning:      Fixed params (n_estimators=300, max_depth=8)

Model 3 — MLP Neural Network (Deep Learning Model)

What it is:  Small neural network
Architecture:
  Input layer  (8 features)
      ↓
  Dense(128) + BatchNorm + Dropout(0.3)
      ↓
  Dense(64)  + BatchNorm + Dropout(0.2)
      ↓
  Dense(32)
      ↓
  Output layer (7 targets simultaneously)

Why use it:  Captures correlations between targets
             e.g. knows that Hardness up usually means Friability down

Model 4 — Stacking Ensemble (Final Model)

What it is:  A "model of models"
How it works:
  1. Train XGBoost + RF using 5-fold cross validation
  2. Collect their out-of-fold predictions
  3. Train a Ridge regression on those predictions
  4. Ridge learns: "when XGBoost says X and RF says Y, the real answer is Z"

Why use it:  Combines the strengths of all 3 models
             Usually more accurate than any single model alone
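The four numbered steps map onto scikit-learn directly. A toy sketch for one target, with GradientBoostingRegressor standing in for XGBoost so the example needs only scikit-learn, and synthetic data standing in for merged_dataset.csv:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))                                # 60 batches, 8 settings
y = X[:, 0] * 3 + X[:, 1] + rng.normal(scale=0.1, size=60)  # toy target

base = [RandomForestRegressor(n_estimators=50, random_state=0),
        GradientBoostingRegressor(random_state=0)]

# Steps 1-2: out-of-fold predictions from each base model (5-fold CV)
oof = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base])

# Step 3: Ridge meta-learner maps (pred_rf, pred_gb) -> final answer
meta = Ridge().fit(oof, y)

# Refit the base models on all data so they can serve new predictions
for m in base:
    m.fit(X, y)
```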

Evaluate All Models

For each target, compute:
  R²   (should be ≥ 0.90, we expect ≥ 0.97 for most)
  MAE  (mean absolute error)
  RMSE (root mean squared error)
  MAPE (mean absolute percentage error)

Print result like:
  ✅ Hardness            | R²=0.97 | MAE=3.1  | RMSE=4.2  | MAPE=3.4%
  ✅ Friability          | R²=0.96 | MAE=0.04 | RMSE=0.06 | MAPE=5.1%
  ✅ Dissolution_Rate    | R²=0.97 | MAE=0.9  | RMSE=1.2  | MAPE=1.0%
  ❌ Tablet_Weight       | R²=0.58 | MAE=1.9  | RMSE=2.4  | MAPE=0.9%
     (this one is expected to be weak — mention in gap analysis)

Files Involved

src/multi_target_model.py   ← train_xgboost(), train_rf(), train_mlp(), train_stacking(), evaluate()

What You Get After This Step

models/xgb_multitarget.pkl    ← saved XGBoost model
models/rf_multitarget.pkl     ← saved Random Forest model
models/mlp_model.keras        ← saved MLP model (native Keras format)
models/stacking_meta.pkl      ← saved Ridge meta-learner
models/scaler.pkl             ← saved data scaler (needed for predictions later)

STEP 5 — Train Anomaly Detection

What You Do

This step watches the sensor time-series (power + vibration) and detects when something is wrong.


Method 1 — Isolation Forest (Fast Check)

Input:   Phase-level features per batch (32 features — power/vibration stats per phase)
How:     Randomly splits data into trees
         Anomalies = points that get isolated very quickly (few splits needed)
         Normal points = need many splits to isolate
Output:  Score between -1 and +1
         Score < 0 = anomaly
         Score > 0 = normal

Speed:   ~5ms per batch
Best for: Quick real-time screening
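In scikit-learn this is a few lines, and the sign convention matches the description above (predict() returns -1 for anomalies). Synthetic numbers stand in for the real phase features:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
normal = rng.normal(loc=20.0, scale=1.0, size=(54, 32))   # 54 normal batches

clf = IsolationForest(contamination=0.1, random_state=0).fit(normal)

odd_batch = np.full((1, 32), 45.0)    # wildly off-profile power/vibration stats
label = clf.predict(odd_batch)[0]     # -1 = anomaly, +1 = normal
```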

Method 2 — LSTM Autoencoder (Deep Check)

Input:   Raw time-series per batch (211 timesteps × 2 channels: Power, Vibration)

Training (only on NORMAL batches):
  Show the model 54 normal batches
  It learns to compress the signal → then reconstruct it
  After training: it can reproduce normal patterns very accurately

Testing (on any new batch):
  Feed the new batch's time-series
  Model tries to reconstruct it
  If reconstruction error is HIGH → pattern is unfamiliar → ANOMALY

Architecture:
  211 timesteps → LSTM(64) → LSTM(32) → Dense(16)  [compress]
  Dense(16) → LSTM(32) → LSTM(64) → 211 timesteps  [reconstruct]

Threshold:
  threshold = mean(train errors) + 3 × std(train errors)
  Any batch above threshold = flagged as anomaly

Speed:   ~50ms per batch
Best for: Deeper investigation of flagged batches
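The threshold rule itself is model-agnostic and easy to sketch with NumPy; only the reconstruction errors come from the trained autoencoder (the error values below are illustrative):

```python
import numpy as np

def compute_threshold(train_errors):
    """mean + 3*std over per-batch reconstruction errors of NORMAL batches."""
    return float(np.mean(train_errors) + 3 * np.std(train_errors))

def is_anomalous(original, reconstructed, threshold):
    """Flag a batch whose mean absolute reconstruction error exceeds the bar."""
    err = float(np.mean(np.abs(original - reconstructed)))
    return err > threshold, err

# Illustrative errors from the 54 normal training batches:
train_errors = np.array([0.02, 0.03, 0.025, 0.028, 0.022])
thr = compute_threshold(train_errors)
```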

Root Cause Rules

After either model flags a batch as anomaly:
  Check WHICH phase caused it:

  IF  Milling vibration_max  > 8.5 mm/s
  →   "⚠️ Bearing wear in milling unit — schedule inspection"

  IF  Compression power_max  > 58.0 kW
  →   "⚡ Motor overload in compression — check tooling"

  IF  Drying power_mean      > 28.0 kW
  →   "💧 Raw material moisture too high — check intake specs"

  IF  No threshold exceeded
  →   "✅ Normal operation"
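These rules translate into a plain lookup function. A sketch with hypothetical feature-key names (the thresholds are copied from the rules above):

```python
def get_root_cause(feats):
    """feats: dict of phase-level stats for one flagged batch."""
    causes = []
    if feats.get("Milling_vibration_max", 0) > 8.5:
        causes.append("Bearing wear in milling unit: schedule inspection")
    if feats.get("Compression_power_max", 0) > 58.0:
        causes.append("Motor overload in compression: check tooling")
    if feats.get("Drying_power_mean", 0) > 28.0:
        causes.append("Raw material moisture too high: check intake specs")
    return causes or ["Normal operation"]
```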

Files Involved

src/anomaly_detector.py   ← train_isolation_forest(), build_lstm_autoencoder(),
                             compute_threshold(), get_root_cause()

What You Get After This Step

models/isolation_forest.pkl     ← saved Isolation Forest
models/lstm_autoencoder.keras   ← saved LSTM Autoencoder (Keras format)
threshold value saved to:       models/lstm_threshold.json

STEP 6 — Add Explainability (SHAP)

What You Do

After training XGBoost, use SHAP to explain every prediction:

For the GLOBAL view (all 60 batches):
  Compute SHAP values for all predictions
  Plot beeswarm chart:
    Each dot = one batch-feature combination
    X position = how much that feature pushed the prediction
    Color = was the feature value high or low?
  Plot bar chart:
    Mean |SHAP| per feature = simple ranking of importance

For a LOCAL view (one specific batch):
  Plot waterfall chart:
    Starts at average prediction (e.g. 90.9% dissolution)
    Each bar shows how one feature moved it up or down
    Ends at the actual prediction for that batch
    
Example output:
  "Dissolution Rate = 87.2% because:
    Base average:           90.9%
    Compression_Force high: -2.8%
    Moisture_Content high:  -1.5%
    Drying_Temp high:       +1.2%  ← partial save
    Machine_Speed low:      -0.6%
    Final prediction:       87.2%"
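A SHAP waterfall is just the base value plus the per-feature contributions. Recomputing the example above:

```python
# Numbers copied from the example output; a real run reads them from the
# precomputed SHAP values for this batch and target.
base = 90.9
contributions = {
    "Compression_Force": -2.8,
    "Moisture_Content": -1.5,
    "Drying_Temp": +1.2,
    "Machine_Speed": -0.6,
}
prediction = base + sum(contributions.values())
print(f"Dissolution Rate = {prediction:.1f}%")   # prints "Dissolution Rate = 87.2%"
```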

Files Involved

src/shap_explainer.py   ← compute_shap_values(), plot_beeswarm(), plot_waterfall()

What You Get After This Step

Beeswarm plots saved for each target
Waterfall plots available on demand via API
SHAP values precomputed and saved: models/shap_values.pkl

STEP 7 — Carbon Footprint Module

What You Do

For every batch:
  Carbon_kgCO2e = predicted Energy_kWh × 0.716
  (0.716 = India's official grid emission factor, CEA FY 2022-23)

Adaptive Target Algorithm:
  Collect last 20 batches' carbon values
  Find the 10th percentile (the boundary of the best 10% of recent batches)
  Stretch target = best_10p × 0.95  (5% better than your best)
  Final target = max(stretch_target, regulatory_floor)

  This means: as your operations improve, the target tightens automatically
  It's always achievable (based on what you've already done) but always pushing forward

Output per batch:
  carbon_kg_co2e:    how much CO₂ this batch emitted
  dynamic_target:    what you should aim for
  on_target:         are you hitting it? ✅ or ❌
  trend:             📉 Improving or 📈 Worsening
  gap_to_target:     how far off you are
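The adaptive target algorithm fits in a few lines. Window size, percentile, and stretch factor come from the text; regulatory_floor is supplied by the caller:

```python
import statistics

def adaptive_target(recent_carbon, regulatory_floor):
    """Tighten the CO2e target as recent batches improve."""
    window = recent_carbon[-20:]                       # last 20 batches
    best_10p = statistics.quantiles(window, n=10)[0]   # ~10th percentile
    stretch = best_10p * 0.95                          # 5% better than your best
    return max(stretch, regulatory_floor)              # never below the floor
```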

Files Involved

src/carbon_calculator.py   ← calculate_carbon(), adaptive_target()

STEP 8 — Build the API (FastAPI)

What You Do

The API loads all your saved models and makes them accessible via HTTP endpoints. Think of it as a service window where the dashboard can ask for predictions.

On startup:
  Load xgb_multitarget.pkl
  Load isolation_forest.pkl
  Load lstm_autoencoder.keras
  Load scaler.pkl
  Load shap_values.pkl
  Print "✅ All models loaded, API ready"

Then listen for requests on port 8000

8 Endpoints:

POST /api/predict
  Receives:  8 process parameters (JSON)
  Does:      normalizes input → runs stacking model → computes carbon
  Returns:   7 quality predictions + energy + carbon + quality score

POST /api/anomaly
  Receives:  batch_id (e.g. "T045")
  Does:      loads that batch's sensor features →
             runs Isolation Forest → applies root cause rules
  Returns:   is_anomaly, severity, root causes, scores

GET /api/explain/{batch_id}
  Receives:  batch_id + target name (e.g. "Dissolution_Rate")
  Does:      looks up precomputed SHAP values for that batch + target
  Returns:   feature contributions dict + top driver sentence

GET /api/carbon/{batch_id}
  Receives:  batch_id
  Does:      looks up that batch's energy → calculates carbon →
             runs adaptive target algorithm
  Returns:   carbon, target, on_target, trend, gap

GET /api/batches
  Returns:   list of all 60 batch IDs (T001–T060)

GET /api/carbon_history
  Receives:  grid (India/EU/US/Renewable)
  Returns:   CO₂e for every batch adjusted to selected grid

GET /api/model_metrics
  Returns:   full benchmark — R², MAE, RMSE, MAPE per model & target
             + anomaly detector Precision, Recall, F1, AUC-ROC

GET /api/health
  Returns:   load status of every model (xgb, rf, lstm, scaler, shap...)
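The /api/predict flow can be illustrated without FastAPI itself. A stdlib sketch with a stub model (field names follow the Step 4 inputs; the real handler loads the stacking ensemble and scaler from models/):

```python
from dataclasses import dataclass

EMISSION_FACTOR = 0.716   # India grid, kg CO2e per kWh

@dataclass
class PredictRequest:
    Granulation_Time: float
    Binder_Amount: float
    Drying_Temp: float
    Drying_Time: float
    Compression_Force: float
    Machine_Speed: float
    Lubricant_Conc: float
    Moisture_Content: float

def predict_handler(req, model):
    """model: anything with .predict(req) -> dict of quality targets."""
    preds = model.predict(req)    # stacking ensemble in the real API
    preds["Carbon_kgCO2e"] = round(preds["Energy_kWh"] * EMISSION_FACTOR, 3)
    return preds

class _StubModel:                 # stands in for the saved models
    def predict(self, req):
        return {"Energy_kWh": 80.0, "Hardness": 85.0}

out = predict_handler(PredictRequest(45, 12, 60, 35, 12, 170, 1.0, 2.5),
                      _StubModel())
```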

Files Involved

api/main.py      ← all route handlers
api/schemas.py   ← Pydantic models for request/response validation

How to Start It

uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload

What You Get After This Step

API running at:      http://localhost:8000
Swagger docs at:     http://localhost:8000/docs   ← auto-generated, test here
All endpoints return JSON in < 100ms

STEP 9 — Build the Dashboard (Next.js + React)

What You Do

Build 6 interactive tabs in a Next.js 16 + React 19 + TypeScript application that consume the FastAPI backend. Run with npm run dev — no Python required for the frontend.


Tab 1 — 🔮 Batch Predictor

User sees:   8 sliders for all process parameters
User does:   adjusts sliders → clicks "Predict"
System does: sends parameters to POST /api/predict
             sends batch to GET /api/explain
User gets:   metric cards (Hardness, Friability, Dissolution Rate, Energy, Carbon)
             Composite Quality Score gauge (0-100)
             SHAP waterfall chart explaining WHY

Tab 2 — ⚡ Energy Pattern Monitor

User sees:   dropdown to select any batch (T001–T060)
User does:   selects a batch
System does: loads that batch's simulated sensor data
             calls POST /api/anomaly
User gets:   line chart of Power + Vibration colored by phase
             stacked bar chart of energy per phase
             anomaly score gauge (green/yellow/red)
             root cause alert box if anomaly found

Tab 3 — 📊 Batch Comparison

User sees:   two dropdowns — pick Batch A and Batch B
User does:   selects two batches
User gets:   two radar charts side by side
             (7 targets normalized 0-1, polygon shape)
             golden fingerprint overlay (best-ever batch)
             delta table showing which targets improved/worsened

Tab 4 — 🌍 Carbon Footprint

User sees:   trend chart of all 60 batches' CO₂e
             dynamic target line on the chart
             grid selector (India / EU / US / Renewable)
User does:   changes grid → chart recalculates live
User gets:   are we on track? ✅ or ❌
             best 10% performance, stretch goal, regulatory floor
             📉 Improving or 📈 Worsening indicator

Tab 5 — 🎛️ What-If Optimizer

User sees:   all 8 sliders
User does:   moves ANY slider
System does: immediately calls POST /api/predict  (<100ms)
User gets:   all predictions update in real time as sliders move

Tab 6 — 📈 Benchmark (NEW)

User sees:   full model performance report
System does: calls GET /api/model_metrics
User gets:   R², MAE, RMSE, MAPE per target for all 4 models
             anomaly detector metrics (Precision, Recall, F1, AUC-ROC)
             dataset metadata (batch count, feature count, CV folds)

Files Involved

dashboard/src/app/ClientLayout.tsx                    ← navigation
dashboard/src/components/tabs/PredictionsTab.tsx      ← Tab 1
dashboard/src/components/tabs/EnergyTab.tsx           ← Tab 2
dashboard/src/components/tabs/ComparisonTab.tsx       ← Tab 3
dashboard/src/components/tabs/CarbonTab.tsx           ← Tab 4
dashboard/src/components/tabs/WhatIfTab.tsx           ← Tab 5
dashboard/src/components/tabs/BenchmarkTab.tsx        ← Tab 6

How to Start It

cd dashboard
npm install
npm run dev

What You Get After This Step

Dashboard running at:   http://localhost:3000
All 6 tabs working
Predictions in < 2 seconds
Real-time What-If updates in < 100ms

🔗 How Everything Connects

Excel Files
    │
    │  preprocessing.py reads them
    ▼
Raw DataFrames (df_process, df_prod)
    │
    │  simulate_sensors.py + feature_engineering.py transform them
    ▼
merged_dataset.csv  +  simulated_sensors.csv
    │
    │  multi_target_model.py trains on merged_dataset
    │  anomaly_detector.py trains on simulated_sensors
    ▼
Saved Model Files (models/*.pkl, models/*.keras)
    │
    │  api/main.py loads ALL models at startup
    ▼
FastAPI (port 8000)
  /predict  /anomaly  /explain  /carbon  /batches  /carbon_history  /model_metrics  /health
    │
    │  Next.js dashboard calls these endpoints
    ▼
Next.js Dashboard (port 3000)
  Tab 1 (Predictions)  Tab 2 (Energy)  Tab 3 (Comparison)
  Tab 4 (Carbon)       Tab 5 (What-If) Tab 6 (Benchmark)
    │
    ▼
OPERATOR uses the system ✅

Dashboard URL: http://localhost:3000 (cd dashboard && npm run dev)
API URL: http://localhost:8000 (uvicorn api.main:app --port 8000 --reload)


📋 Master Checklist (tick as you go)

Environment

  • Project folders created
  • Virtual environment created and activated
  • All libraries installed (pip install -r requirements.txt)
  • Both Excel files in data/raw/

Step 1 — Data Loading

  • src/config.py created and filled
  • src/preprocessing.py created and filled
  • python test_run.py prints shapes correctly with no errors

Step 2 — EDA

  • notebooks/01_EDA.ipynb run completely
  • Correlation matrix plotted and saved
  • Phase energy breakdown chart saved
  • No surprises in the data (no weird distributions)

Step 3 — Feature Engineering

  • src/simulate_sensors.py created and filled
  • src/feature_engineering.py created and filled
  • data/processed/merged_dataset.csv generated (60 rows × ~22 cols)
  • data/simulated/simulated_sensors.csv generated (60 batches × 211 timesteps)

Step 4 — Prediction Models

  • src/multi_target_model.py created and filled
  • XGBoost trained and evaluated
  • Random Forest trained and evaluated
  • MLP trained and evaluated
  • Stacking ensemble trained and evaluated
  • All primary targets show R² ≥ 0.90
  • All 4 model files saved in models/

Step 5 — Anomaly Detection

  • src/anomaly_detector.py created and filled
  • Isolation Forest trained and saved
  • LSTM Autoencoder trained and saved
  • Threshold computed and saved
  • Root cause rules tested on a sample anomaly batch

Step 6 — Explainability

  • src/shap_explainer.py created and filled
  • SHAP values computed for all 60 batches
  • Beeswarm plots generated for primary targets
  • Waterfall plot works for a sample batch

Step 7 — Carbon Module

  • src/carbon_calculator.py created and filled
  • Carbon calculated for all 60 batches
  • Adaptive target algorithm tested

Step 8 — API

  • api/schemas.py created
  • api/main.py created with all 8 endpoints
  • API starts without errors
  • All endpoints tested via Swagger at /docs
  • /api/predict returns correct JSON
  • /api/batches returns list of batch IDs
  • /api/model_metrics returns benchmark data
  • /api/health returns 200 OK

Step 9 — Dashboard (Next.js)

  • dashboard/ Next.js project set up (npm install)
  • Tab 1 (Predictions) works end-to-end
  • Tab 2 (Energy Monitor) shows charts + anomaly
  • Tab 3 (Batch Comparison) shows radar charts
  • Tab 4 (Carbon Tracker) shows trend + targets
  • Tab 5 (What-If) updates in real time
  • Tab 6 (Benchmark) shows model metrics from /api/model_metrics
  • Dashboard loads at http://localhost:3000 in < 5 seconds

Final

  • src/run_pipeline.py runs everything in one command
  • Full end-to-end demo tested (slider → prediction → SHAP chart)
  • Presentation slides ready
  • Gap analysis slide written
  • Live demo rehearsed

⚡ Order of Priority (if time is short)

🔴 MUST BUILD FIRST (core of the submission):
   Step 1 → Step 2 → Step 3 → Step 4

🟡 BUILD SECOND (makes it competitive):
   Step 5 → Step 6 → Step 8 (API) → Dashboard Tab 1 only

🟢 BUILD IF TIME ALLOWS (makes it strong):
   Step 7 → Dashboard Tabs 2–5 → CUSUM drift → Batch fingerprinting

Pipeline version: 2.0 | AI-Driven Manufacturing Intelligence Hackathon