This document explains the project's complete pipeline in simple language: what happens first, what happens next, and why, step by step.
YOUR DATA (2 Excel Files)
│
▼
STEP 1: Setup & Load Data
│
▼
STEP 2: Clean & Understand Data (EDA)
│
▼
STEP 3: Feature Engineering
│ (create missing columns, simulate sensors)
▼
STEP 4: Train Prediction Models
│ (XGBoost + RF + MLP + Stacking)
▼
STEP 5: Train Anomaly Detection
│ (Isolation Forest + LSTM Autoencoder)
▼
STEP 6: Add Explainability (SHAP)
│
▼
STEP 7: Carbon Footprint Module
│
▼
STEP 8: Build the API (FastAPI)
│
▼
STEP 9: Build the Dashboard (Next.js)
│
▼
FINAL SYSTEM ✅
STEP 1: Setup & Load Data
- Create the project folder structure
- Install all Python libraries
- Load both Excel files into Python
data/raw/_h_batch_process_data.xlsx ← sensor log (211 rows, 1 batch, T001)
data/raw/_h_batch_production_data.xlsx ← batch records (60 rows, 60 batches)
src/config.py ← set all paths and constants FIRST
src/preprocessing.py ← load_data() and validate_data()
df_process → 211 rows × 11 columns (sensor readings for T001)
df_prod → 60 rows × 15 columns (batch settings + quality outcomes)
✅ No error when loading files
✅ Shape prints correctly: (211, 11) and (60, 15)
✅ "Validation passed: no missing values" message appears
STEP 2: Clean & Understand Data (EDA)
- Plot graphs of every column
- Find which features are strongly related to quality targets
- Understand the 8 phases and how energy is distributed across them
- Detect any outliers
notebooks/01_EDA.ipynb ← run this notebook
Correlation between features and targets:
Compression_Force vs Hardness → 0.99 (very strong)
Moisture_Content vs Dissolution → -0.99 (very strong, negative)
Lubricant_Conc vs Friability → 0.99 (very strong)
This tells you: these features are the key drivers. Models will work well.
Energy per phase:
Compression phase → 38.69 kWh (50.4% of total energy) 🔴
Drying phase → 10.09 kWh (13.1%)
Milling phase → 9.00 kWh (11.7%)
All others → 18.96 kWh (24.8%)
This tells you: Compression is where you save the most energy.
Vibration per phase:
Milling phase → max 9.79 mm/s ⚠️ highest — bearing wear risk
Compression phase → max 6.69 mm/s — second highest
Preparation phase → max 0.20 mm/s — almost zero (expected)
✅ Correlation heatmap saved
✅ Phase energy breakdown chart saved
✅ Distribution plots for all 60 batches saved
✅ Clear understanding of which features matter most
STEP 3: Feature Engineering
This step has 3 parts:
Part 1: Derive energy and carbon
File 2 (batch_production_data) has no energy column. You create it using physics:
Energy depends on:
- How fast the machine runs (Machine_Speed)
- How hot the dryer is (Drying_Temp)
- How hard you compress (Compression_Force)
Formula:
Energy_kWh = 76.74 × (Machine_Speed/169.17)^1.5
× (Drying_Temp/59.40)^0.8
× (Compression_Force/11.60)^1.2
+ small random noise
Carbon_kgCO2e = Energy_kWh × 0.716
(0.716 = India grid emission factor from CEA government data)
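In code, the derivation above might look like this (the constants are exactly the ones in the formula; the noise scale and random seed are assumptions):

```python
import numpy as np

EMISSION_FACTOR_INDIA = 0.716  # kgCO2e per kWh (CEA grid factor)

def derive_energy_and_carbon(df, rng=None):
    """Sketch of derive_energy_and_carbon() in src/preprocessing.py."""
    if rng is None:
        rng = np.random.default_rng(42)
    energy = (
        76.74
        * (df["Machine_Speed"] / 169.17) ** 1.5
        * (df["Drying_Temp"] / 59.40) ** 0.8
        * (df["Compression_Force"] / 11.60) ** 1.2
    )
    energy = energy + rng.normal(0.0, 1.0, len(df))  # small additive noise (scale assumed)
    df["Energy_kWh"] = energy
    df["Carbon_kgCO2e"] = df["Energy_kWh"] * EMISSION_FACTOR_INDIA
    return df
```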
Part 2: Simulate sensors for batches T002–T060
Only T001 has minute-by-minute sensor readings. You simulate the rest:
Take T001's sensor profile as the base template
For each batch T002–T060:
Scale the power UP or DOWN based on their settings:
Higher Machine_Speed → scale power UP
Higher Drying_Temp → scale drying phase power UP
Higher Compression → scale compression phase power UP
Add small random noise (±5%) to make it realistic
For 6 batches (10%): inject a fake fault:
Type A: spike vibration in Milling (simulates bearing wear)
Type B: spike power in Compression (simulates motor overload)
Type C: elevate power in Drying (simulates damp raw material)
Why inject faults? So the anomaly detection model has examples of bad patterns to learn from.
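A sketch of the simulation and fault-injection logic. It assumes the sensor table has Phase, Power_kW, and Vibration_mms columns, that each *_ratio is the batch's setting divided by T001's, and that the scaling exponents mirror the energy formula; all of those are assumptions, not the project's exact code:

```python
def simulate_batch(template, speed_ratio, temp_ratio, force_ratio, rng):
    """Scale T001's sensor profile to another batch's settings
    (sketch of the idea in src/simulate_sensors.py)."""
    sim = template.copy()                          # 211 rows of T001 data
    sim["Power_kW"] *= speed_ratio ** 1.5          # faster machine -> more power
    sim.loc[sim["Phase"] == "Drying", "Power_kW"] *= temp_ratio ** 0.8
    sim.loc[sim["Phase"] == "Compression", "Power_kW"] *= force_ratio ** 1.2
    sim["Power_kW"] *= 1 + rng.normal(0, 0.05, len(sim))   # ±5% noise
    return sim

def inject_fault(sim, fault_type, rng):
    """Inject one of the three synthetic fault types (severity ranges assumed)."""
    if fault_type == "A":    # bearing wear: vibration spike in Milling
        sim.loc[sim["Phase"] == "Milling", "Vibration_mms"] *= rng.uniform(1.5, 2.5)
    elif fault_type == "B":  # motor overload: power spike in Compression
        sim.loc[sim["Phase"] == "Compression", "Power_kW"] *= rng.uniform(1.3, 1.8)
    elif fault_type == "C":  # damp raw material: elevated Drying power
        sim.loc[sim["Phase"] == "Drying", "Power_kW"] *= rng.uniform(1.2, 1.4)
    return sim
```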
Part 3: Phase-level feature extraction
Instead of using 211 raw timesteps, you summarize each phase into statistics:
For each batch, for each phase:
power_mean, power_max
vibration_mean, vibration_max
This gives you 32 features per batch (8 phases × 4 stats)
Plus 6 extra derived features:
total_energy_kwh
compression_energy_share (%)
power_vibration_ratio
rolling_5min_power_slope
fft_dominant_vibration_frequency
phase_transition_sharpness
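A compact pandas sketch of the per-phase statistics (column names assumed as in Part 2; the 6 derived features are omitted for brevity):

```python
def extract_phase_features(sensor_df):
    """Collapse one batch's 211 timesteps into per-phase stats
    (sketch of extract_phase_features() in src/feature_engineering.py)."""
    stats = sensor_df.groupby("Phase").agg(
        power_mean=("Power_kW", "mean"),
        power_max=("Power_kW", "max"),
        vibration_mean=("Vibration_mms", "mean"),
        vibration_max=("Vibration_mms", "max"),
    )
    # Flatten to one row per batch: e.g. Milling_vibration_max, Drying_power_mean, ...
    flat = {f"{phase}_{stat}": value
            for phase, row in stats.iterrows()
            for stat, value in row.items()}
    return flat  # 8 phases × 4 stats = 32 features
```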
df_prod (60 rows × 15 cols)
+
phase_features (60 rows × 38 cols)
+
Energy_kWh, Carbon_kgCO2e (derived)
=
merged_dataset.csv (60 rows × ~22 model-ready cols after column selection) ← This is your final ML input
src/preprocessing.py ← derive_energy_and_carbon()
src/simulate_sensors.py ← simulate_all_batches()
src/feature_engineering.py ← extract_phase_features(), merge_datasets()
data/processed/merged_dataset.csv ← 60 rows × ~22 features, ready for ML
data/simulated/simulated_sensors.csv ← sensor data for all 60 batches
STEP 4: Train Prediction Models
Split data and train 4 models:
merged_dataset.csv (60 rows)
│
├── 48 rows → TRAINING SET (80%)
└── 12 rows → TEST SET (20%)
Input features (X):
Granulation_Time, Binder_Amount, Drying_Temp, Drying_Time,
Compression_Force, Machine_Speed, Lubricant_Conc, Moisture_Content
+ phase energy features from sensor simulation
Output targets (Y) — all predicted at once:
Hardness, Friability, Dissolution_Rate, Content_Uniformity,
Disintegration_Time, Tablet_Weight, Energy_kWh
Model 1: XGBoost
What it is: Gradient-boosted decision trees
Why use it: Best performance on small tabular datasets like ours (n=60)
Tuning: Optuna runs 50 trials to find best hyperparameters (takes ~5 min)
Internally: Trains one XGBRegressor per target (7 targets = 7 models under the hood)
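The Optuna search could be as simple as the sketch below. The search space and seed are assumptions, but the 50-trial loop, the 80/20 split, and the one-XGBRegressor-per-target structure match the description:

```python
import optuna
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

FEATURES = ["Granulation_Time", "Binder_Amount", "Drying_Temp", "Drying_Time",
            "Compression_Force", "Machine_Speed", "Lubricant_Conc", "Moisture_Content"]
TARGETS = ["Hardness", "Friability", "Dissolution_Rate", "Content_Uniformity",
           "Disintegration_Time", "Tablet_Weight", "Energy_kWh"]

df = pd.read_csv("data/processed/merged_dataset.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df[FEATURES], df[TARGETS], test_size=0.2, random_state=42)  # 48 / 12 rows

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 600),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
    }
    model = MultiOutputRegressor(XGBRegressor(**params, random_state=42))
    return cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)   # ~5 minutes at this dataset size
```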
Model 2: Random Forest
What it is: 300 decision trees, each trained on a random subset
Why use it: More stable than XGBoost, good for diversity in ensemble
Tuning: Fixed params (n_estimators=300, max_depth=8)
Model 3: MLP
What it is: A small feed-forward neural network
Architecture:
Input layer (8 features)
↓
Dense(128) + BatchNorm + Dropout(0.3)
↓
Dense(64) + BatchNorm + Dropout(0.2)
↓
Dense(32)
↓
Output layer (7 targets simultaneously)
Why use it: Captures correlations between targets
e.g. knows that Hardness up usually means Friability down
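The same architecture in Keras (a sketch; activation and optimizer choices are assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(n_features=8, n_targets=7):
    """The MLP described above; all 7 targets share one output layer."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(128, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.2),
        layers.Dense(32, activation="relu"),
        layers.Dense(n_targets),   # 7 targets predicted simultaneously
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# After fitting on scaled inputs: model.save("models/mlp_model.keras")
```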
What it is: A "model of models"
How it works:
1. Train XGBoost + RF using 5-fold cross validation
2. Collect their out-of-fold predictions
3. Train a Ridge regression on those predictions
4. Ridge learns: "when XGBoost says X and RF says Y, the real answer is Z"
Why use it: Combines the strengths of all 3 models
Typically more accurate than any single model alone
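A sketch of steps 1-3 above (Ridge handles all 7 targets natively, so no extra wrapper is needed):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def train_stacking(xgb_model, rf_model, X_train, y_train):
    """Out-of-fold predictions from both base models become the
    meta-learner's inputs (sketch of train_stacking())."""
    oof_xgb = cross_val_predict(xgb_model, X_train, y_train, cv=5)
    oof_rf = cross_val_predict(rf_model, X_train, y_train, cv=5)
    meta_X = np.hstack([oof_xgb, oof_rf])   # 14 cols: 7 targets × 2 models
    meta = Ridge(alpha=1.0)
    meta.fit(meta_X, y_train)               # learns "XGB says X, RF says Y -> Z"
    return meta
```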
For each target, compute:
R² (should be ≥ 0.90, we expect ≥ 0.97 for most)
MAE (mean absolute error)
RMSE (root mean squared error)
MAPE (mean absolute percentage error)
Print results like:
✅ Hardness | R²=0.97 | MAE=3.1 | RMSE=4.2 | MAPE=3.4%
✅ Friability | R²=0.96 | MAE=0.04 | RMSE=0.06 | MAPE=5.1%
✅ Dissolution_Rate | R²=0.97 | MAE=0.9 | RMSE=1.2 | MAPE=1.0%
❌ Tablet_Weight | R²=0.58 | MAE=1.9 | RMSE=2.4 | MAPE=0.9%
(this one is expected to be weak — mention in gap analysis)
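The evaluation loop behind that printout might look like this (a sketch; the ✅/❌ cutoff uses the R² ≥ 0.90 rule from above):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred, target_names):
    """Per-target metrics, printed in the format shown above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    for i, name in enumerate(target_names):
        t, p = y_true[:, i], y_pred[:, i]
        r2 = r2_score(t, p)
        rmse = np.sqrt(mean_squared_error(t, p))
        mape = np.mean(np.abs((t - p) / t)) * 100
        flag = "✅" if r2 >= 0.90 else "❌"
        print(f"{flag} {name:<18} | R²={r2:.2f} | MAE={mean_absolute_error(t, p):.2f} "
              f"| RMSE={rmse:.2f} | MAPE={mape:.1f}%")
```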
src/multi_target_model.py ← train_xgboost(), train_rf(), train_mlp(), train_stacking(), evaluate()
models/xgb_multitarget.pkl ← saved XGBoost model
models/rf_multitarget.pkl ← saved Random Forest model
models/mlp_model.keras ← saved MLP model (native Keras format)
models/stacking_meta.pkl ← saved Ridge meta-learner
models/scaler.pkl ← saved data scaler (needed for predictions later)
STEP 5: Train Anomaly Detection
This step watches the sensor time-series (power + vibration) and detects when something is wrong.
Detector 1: Isolation Forest
Input: Phase-level features per batch (32 features — power/vibration stats per phase)
How: Builds many random trees, each recursively splitting the data
Anomalies = points that get isolated very quickly (few splits needed)
Normal points = need many splits to isolate
Output: Score between -1 and +1
Score < 0 = anomaly
Score > 0 = normal
Speed: ~5ms per batch
Best for: Quick real-time screening
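In scikit-learn this is a few lines (a sketch; n_estimators is an assumption, while contamination matches the 10% injected-fault rate):

```python
from sklearn.ensemble import IsolationForest

def train_isolation_forest(phase_features):
    """Sketch of train_isolation_forest() in src/anomaly_detector.py."""
    iso = IsolationForest(n_estimators=200, contamination=0.10, random_state=42)
    iso.fit(phase_features)            # 60 batches × 32 features
    return iso

# scores = iso.decision_function(phase_features)  # < 0 -> anomaly, > 0 -> normal
# labels = iso.predict(phase_features)            # -1 anomaly, +1 normal
```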
Detector 2: LSTM Autoencoder
Input: Raw time-series per batch (211 timesteps × 2 channels: Power, Vibration)
Training (only on NORMAL batches):
Show the model 54 normal batches
It learns to compress the signal → then reconstruct it
After training: it can reproduce normal patterns very accurately
Testing (on any new batch):
Feed the new batch's time-series
Model tries to reconstruct it
If reconstruction error is HIGH → pattern is unfamiliar → ANOMALY
Architecture:
211 timesteps → LSTM(64) → LSTM(32) → Dense(16) [compress]
Dense(16) → LSTM(32) → LSTM(64) → 211 timesteps [reconstruct]
Threshold:
threshold = mean(train errors) + 3 × std(train errors)
Any batch above threshold = flagged as anomaly
Speed: ~50ms per batch
Best for: Deeper investigation of flagged batches
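A Keras sketch of the autoencoder and threshold rule above (RepeatVector bridges the 16-dim bottleneck back to 211 timesteps; activations are assumptions):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm_autoencoder(timesteps=211, channels=2):
    """Compress then reconstruct the power/vibration sequence."""
    model = keras.Sequential([
        keras.Input(shape=(timesteps, channels)),
        layers.LSTM(64, return_sequences=True),
        layers.LSTM(32),
        layers.Dense(16, activation="relu"),           # compressed code
        layers.RepeatVector(timesteps),
        layers.LSTM(32, return_sequences=True),
        layers.LSTM(64, return_sequences=True),
        layers.TimeDistributed(layers.Dense(channels)),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

def compute_threshold(model, X_normal):
    """mean + 3·std of reconstruction error over the normal training batches."""
    recon = model.predict(X_normal)
    errors = np.mean((X_normal - recon) ** 2, axis=(1, 2))  # one error per batch
    return errors.mean() + 3 * errors.std()
```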
After either model flags a batch as anomaly:
Check WHICH phase caused it:
IF Milling vibration_max > 8.5 mm/s
→ "⚠️ Bearing wear in milling unit — schedule inspection"
IF Compression power_max > 58.0 kW
→ "⚡ Motor overload in compression — check tooling"
IF Drying power_mean > 28.0 kW
→ "💧 Raw material moisture too high — check intake specs"
IF No threshold exceeded
→ "✅ Normal operation"
src/anomaly_detector.py ← train_isolation_forest(), build_lstm_autoencoder(),
compute_threshold(), get_root_cause()
models/isolation_forest.pkl ← saved Isolation Forest
models/lstm_autoencoder.keras ← saved LSTM Autoencoder (Keras format)
threshold value saved to: models/lstm_threshold.json
STEP 6: Add Explainability (SHAP)
After training XGBoost, use SHAP to explain every prediction:
For the GLOBAL view (all 60 batches):
Compute SHAP values for all predictions
Plot beeswarm chart:
Each dot = one batch-feature combination
X position = how much that feature pushed the prediction
Color = was the feature value high or low?
Plot bar chart:
Mean |SHAP| per feature = simple ranking of importance
For a LOCAL view (one specific batch):
Plot waterfall chart:
Starts at average prediction (e.g. 90.9% dissolution)
Each bar shows how one feature moved it up or down
Ends at the actual prediction for that batch
Example output:
"Dissolution Rate = 87.2% because:
Base average: 90.9%
Compression_Force high: -2.8%
Moisture_Content high: -1.5%
Drying_Temp high: +1.2% ← partial save
Machine_Speed low: -0.6%
Final prediction: 87.2%"
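With the shap library the whole step is short (a sketch; it assumes the per-target XGBoost models are reachable through the wrapper's estimators_ list):

```python
import shap

def compute_shap_values(multi_model, X, target_index=0):
    """Sketch of compute_shap_values(): TreeExplainer is exact and fast
    for tree models; one explainer per per-target XGBRegressor (see Step 4)."""
    booster = multi_model.estimators_[target_index]   # e.g. the Dissolution_Rate model
    explainer = shap.TreeExplainer(booster)
    return explainer(X)                                # shap.Explanation object

# Global view:  shap.plots.beeswarm(shap_values)
# Ranking:      shap.plots.bar(shap_values)
# One batch:    shap.plots.waterfall(shap_values[batch_index])
```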
src/shap_explainer.py ← compute_shap_values(), plot_beeswarm(), plot_waterfall()
Beeswarm plots saved for each target
Waterfall plots available on demand via API
SHAP values precomputed and saved: models/shap_values.pkl
STEP 7: Carbon Footprint Module
For every batch:
Carbon_kgCO2e = predicted Energy_kWh × 0.716
(0.716 = India's official grid emission factor, CEA FY 2022-23)
Adaptive Target Algorithm:
Collect last 20 batches' carbon values
Find the best 10th percentile (best 10% of recent batches)
Stretch target = best_10p × 0.95 (5% better than your best)
Final target = max(stretch_target, regulatory_floor)
This means: as your operations improve, the target tightens automatically
It's always achievable (based on what you've already done) but always pushing forward
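The algorithm in code (a sketch of adaptive_target() in src/carbon_calculator.py; the default regulatory floor is a placeholder):

```python
import numpy as np

def adaptive_target(recent_carbon, regulatory_floor=0.0):
    """Lower carbon is better, so the 10th percentile marks the best 10%."""
    last_20 = np.asarray(recent_carbon[-20:])
    best_10p = np.percentile(last_20, 10)     # boundary of the best 10%
    stretch = best_10p * 0.95                 # 5% better than your best
    return max(stretch, regulatory_floor)     # never below the regulatory floor
```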
Output per batch:
carbon_kg_co2e: how much CO₂ this batch emitted
dynamic_target: what you should aim for
on_target: are you hitting it? ✅ or ❌
trend: 📉 Improving or 📈 Worsening
gap_to_target: how far off you are
src/carbon_calculator.py ← calculate_carbon(), adaptive_target()
STEP 8: Build the API (FastAPI)
The API loads all your saved models and makes them accessible via HTTP endpoints. Think of it as a service window where the dashboard can ask for predictions.
On startup:
Load xgb_multitarget.pkl
Load isolation_forest.pkl
Load lstm_autoencoder.keras
Load scaler.pkl
Load shap_values.pkl
Print "✅ All models loaded, API ready"
Then listen for requests on port 8000
8 Endpoints:
POST /api/predict
Receives: 8 process parameters (JSON)
Does: normalizes input → runs stacking model → computes carbon
Returns: 7 quality predictions + energy + carbon + quality score
POST /api/anomaly
Receives: batch_id (e.g. "T045")
Does: loads that batch's sensor features →
runs Isolation Forest → applies root cause rules
Returns: is_anomaly, severity, root causes, scores
GET /api/explain/{batch_id}
Receives: batch_id + target name (e.g. "Dissolution_Rate")
Does: looks up precomputed SHAP values for that batch + target
Returns: feature contributions dict + top driver sentence
GET /api/carbon/{batch_id}
Receives: batch_id
Does: looks up that batch's energy → calculates carbon →
runs adaptive target algorithm
Returns: carbon, target, on_target, trend, gap
GET /api/batches
Returns: list of all 60 batch IDs (T001–T060)
GET /api/carbon_history
Receives: grid (India/EU/US/Renewable)
Returns: CO₂e for every batch adjusted to selected grid
GET /api/model_metrics
Returns: full benchmark — R², MAE, RMSE, MAPE per model & target
+ anomaly detector Precision, Recall, F1, AUC-ROC
GET /api/health
Returns: load status of every model (xgb, rf, lstm, scaler, shap...)
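A trimmed sketch of the startup hook and one endpoint (XGBoost path only; the real handler runs the full stacking ensemble, SHAP lookup, and quality score):

```python
# api/main.py — illustrative sketch
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ProcessParams(BaseModel):   # lives in api/schemas.py in the project
    Granulation_Time: float
    Binder_Amount: float
    Drying_Temp: float
    Drying_Time: float
    Compression_Force: float
    Machine_Speed: float
    Lubricant_Conc: float
    Moisture_Content: float

@app.on_event("startup")
def load_models():
    app.state.scaler = joblib.load("models/scaler.pkl")
    app.state.model = joblib.load("models/xgb_multitarget.pkl")
    print("✅ All models loaded, API ready")

@app.post("/api/predict")
def predict(params: ProcessParams):
    X = app.state.scaler.transform([list(params.model_dump().values())])
    y = app.state.model.predict(X)[0]           # 7 targets, in training order
    return {"predictions": y.tolist(),
            "carbon_kg_co2e": y[-1] * 0.716}    # Energy_kWh × India grid factor
```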
api/main.py ← all route handlers
api/schemas.py ← Pydantic models for request/response validation
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
API running at: http://localhost:8000
Swagger docs at: http://localhost:8000/docs ← auto-generated, test here
All endpoints return JSON in < 100ms
STEP 9: Build the Dashboard (Next.js)
Build 6 interactive tabs in a Next.js 16 + React 19 + TypeScript application that consume the FastAPI backend. Run with npm run dev — no Python required for the frontend.
Tab 1: Predictions
User sees: 8 sliders for all process parameters
User does: adjusts sliders → clicks "Predict"
System does: sends parameters to POST /api/predict
sends batch to GET /api/explain
User gets: metric cards (Hardness, Friability, Dissolution Rate, Energy, Carbon)
Composite Quality Score gauge (0-100)
SHAP waterfall chart explaining WHY
Tab 2: Energy Monitor
User sees: dropdown to select any batch (T001–T060)
User does: selects a batch
System does: loads that batch's simulated sensor data
calls POST /api/anomaly
User gets: line chart of Power + Vibration colored by phase
stacked bar chart of energy per phase
anomaly score gauge (green/yellow/red)
root cause alert box if anomaly found
Tab 3: Batch Comparison
User sees: two dropdowns — pick Batch A and Batch B
User does: selects two batches
User gets: two radar charts side by side
(7 targets normalized 0-1, polygon shape)
golden fingerprint overlay (best-ever batch)
delta table showing which targets improved/worsened
Tab 4: Carbon Tracker
User sees: trend chart of all 60 batches' CO₂e
dynamic target line on the chart
grid selector (India / EU / US / Renewable)
User does: changes grid → chart recalculates live
User gets: are we on track? ✅ or ❌
best 10% performance, stretch goal, regulatory floor
📉 Improving or 📈 Worsening indicator
Tab 5: What-If
User sees: all 8 sliders
User does: moves ANY slider
System does: immediately calls POST /api/predict (<100ms)
User gets: all predictions update in real time as sliders move
Tab 6: Benchmark
User sees: full model performance report
System does: calls GET /api/model_metrics
User gets: R², MAE, RMSE, MAPE per target for all 4 models
anomaly detector metrics (Precision, Recall, F1, AUC-ROC)
dataset metadata (batch count, feature count, CV folds)
dashboard/src/app/ClientLayout.tsx ← navigation
dashboard/src/components/tabs/PredictionsTab.tsx ← Tab 1
dashboard/src/components/tabs/EnergyTab.tsx ← Tab 2
dashboard/src/components/tabs/ComparisonTab.tsx ← Tab 3
dashboard/src/components/tabs/CarbonTab.tsx ← Tab 4
dashboard/src/components/tabs/WhatIfTab.tsx ← Tab 5
dashboard/src/components/tabs/BenchmarkTab.tsx ← Tab 6
cd dashboard
npm install
npm run dev
Dashboard running at: http://localhost:3000
All 6 tabs working
Predictions in < 2 seconds
Real-time What-If updates in < 100ms
End-to-end data flow:
Excel Files
│
│ preprocessing.py reads them
▼
Raw DataFrames (df_process, df_prod)
│
│ simulate_sensors.py + feature_engineering.py transform them
▼
merged_dataset.csv + simulated_sensors.csv
│
│ multi_target_model.py trains on merged_dataset
│ anomaly_detector.py trains on simulated_sensors
▼
Saved Model Files (models/*.pkl, models/*.keras)
│
│ api/main.py loads ALL models at startup
▼
FastAPI (port 8000)
/predict /anomaly /explain /carbon /batches /carbon_history /model_metrics /health
│
│ Next.js dashboard calls these endpoints
▼
Next.js Dashboard (port 3000)
Tab 1 (Predictions) Tab 2 (Energy) Tab 3 (Comparison)
Tab 4 (Carbon) Tab 5 (What-If) Tab 6 (Benchmark)
│
▼
OPERATOR uses the system ✅
Dashboard URL: http://localhost:3000 (cd dashboard && npm run dev)
API URL: http://localhost:8000 (uvicorn api.main:app --port 8000 --reload)
CHECKLISTS

Step 1: Setup & Load Data
- Project folders created
- Virtual environment created and activated
- All libraries installed (pip install -r requirements.txt)
- Both Excel files in data/raw/
- src/config.py created and filled
- src/preprocessing.py created and filled
- python test_run.py prints shapes correctly with no errors

Step 2: Clean & Understand Data (EDA)
- notebooks/01_EDA.ipynb run completely
- Correlation matrix plotted and saved
- Phase energy breakdown chart saved
- No surprises in the data (no weird distributions)

Step 3: Feature Engineering
- src/simulate_sensors.py created and filled
- src/feature_engineering.py created and filled
- data/processed/merged_dataset.csv generated (60 rows × ~22 cols)
- data/simulated/simulated_sensors.csv generated (60 batches × 211 rows each)

Step 4: Train Prediction Models
- src/multi_target_model.py created and filled
- XGBoost trained and evaluated
- Random Forest trained and evaluated
- MLP trained and evaluated
- Stacking ensemble trained and evaluated
- All primary targets show R² ≥ 0.90
- All 4 model files saved in models/

Step 5: Train Anomaly Detection
- src/anomaly_detector.py created and filled
- Isolation Forest trained and saved
- LSTM Autoencoder trained and saved
- Threshold computed and saved
- Root cause rules tested on a sample anomaly batch

Step 6: Add Explainability (SHAP)
- src/shap_explainer.py created and filled
- SHAP values computed for all 60 batches
- Beeswarm plots generated for primary targets
- Waterfall plot works for a sample batch

Step 7: Carbon Footprint Module
- src/carbon_calculator.py created and filled
- Carbon calculated for all 60 batches
- Adaptive target algorithm tested

Step 8: Build the API (FastAPI)
- api/schemas.py created
- api/main.py created with all 8 endpoints
- API starts without errors
- All endpoints tested via Swagger at /docs
- /api/predict returns correct JSON
- /api/batches returns list of batch IDs
- /api/model_metrics returns benchmark data
- /api/health returns 200 OK

Step 9: Build the Dashboard (Next.js)
- dashboard/ Next.js project set up (npm install)
- Tab 1 (Predictions) works end-to-end
- Tab 2 (Energy Monitor) shows charts + anomaly
- Tab 3 (Batch Comparison) shows radar charts
- Tab 4 (Carbon Tracker) shows trend + targets
- Tab 5 (What-If) updates in real time
- Tab 6 (Benchmark) shows model metrics from /api/model_metrics
- Dashboard loads at http://localhost:3000 in < 5 seconds

Final System
- src/run_pipeline.py runs everything in one command
- Full end-to-end demo tested (slider → prediction → SHAP chart)
- Presentation slides ready
- Gap analysis slide written
- Live demo rehearsed
🔴 MUST BUILD FIRST (core of the submission):
Step 1 → Step 2 → Step 3 → Step 4
🟡 BUILD SECOND (makes it competitive):
Step 5 → Step 6 → Step 8 (API) → Dashboard Tab 1 only
🟢 BUILD IF TIME ALLOWS (makes it strong):
Step 7 → Dashboard Tabs 2–5 → CUSUM drift → Batch fingerprinting
Pipeline version: 2.0 | AI-Driven Manufacturing Intelligence Hackathon