Track: A — Predictive Modelling Specialization
Domain: Pharmaceutical Tablet Manufacturing
Data: _h_batch_process_data.xlsx + _h_batch_production_data.xlsx
- High-Level Architecture
- Layer-by-Layer Breakdown
- Module Designs
- Full Pipeline Flow
- Project Folder Structure
- API Design
- Dashboard Design
- Data Flow Between Files
┌──────────────────────────────────────────────────────────────────────────────────┐
│ DATA INGESTION LAYER │
│ │
│ ┌─────────────────────────────┐ ┌──────────────────────────────────────┐ │
│ │ _h_batch_process_data.xlsx │ │ _h_batch_production_data.xlsx │ │
│ │ (Time-Series Sensors) │ │ (Batch Outcome Records) │ │
│ │ │ │ │ │
│ │ • 211 rows, 1 batch │ │ • 60 batches (T001–T060) │ │
│ │ • 8 manufacturing phases │ │ • 8 input process features │ │
│ │ • Power + Vibration sigs │ │ • 6 quality/yield/perf targets │ │
│ └──────────────┬──────────────┘ └──────────────────┬───────────────────┘ │
└─────────────────┼────────────────────────────────────────┼─────────────────────┘
│ │
▼ ▼
┌──────────────────────────────────────────────────────────────────────────────────┐
│ PREPROCESSING & FEATURE ENGINEERING │
│ │
│ ┌────────────────────────────────┐ ┌──────────────────────────────────────┐ │
│ │ Time-Series Processing │ │ Batch-Level Engineering │ │
│ │ │ │ │ │
│ │ • Phase segmentation │ │ • MinMax normalization │ │
│ │ • Phase-wise aggregation │ │ • IQR outlier detection │ │
│ │ • Rolling statistics (5-min) │ │ • Physics-based energy simulation │ │
│ │ • FFT vibration features │ │ • Carbon footprint derivation │ │
│ │ • Anomaly injection (sim) │ │ • Feature correlation analysis │ │
│ └───────────────┬────────────────┘ └──────────────────────┬───────────────┘ │
│ └──────────────────────────┬─────────────────┘ │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ MERGED FEATURE MATRIX │ │
│ │ 60 batches × ~22 features │ │
│ └──────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────────┐
│ ML MODEL LAYER │
│ │
│ ┌──────────────────────────────────┐ ┌────────────────────────────────────┐ │
│ │ MODULE 1 │ │ MODULE 2 │ │
│ │ Multi-Target Regressor │ │ Energy Pattern Analyser │ │
│ │ │ │ │ │
│ │ Input: 8 process params │ │ Input: Power_kW, Vibration_mm_s │ │
│ │ Output: 7 simultaneous targets │ │ Time_Minutes, Phase │ │
│ │ │ │ │ │
│ │ Stack: │ │ Stack: │ │
│ │ • XGBoost MultiOutputRegressor │ │ • Isolation Forest (batch-level) │ │
│ │ • Random Forest │ │ • LSTM Autoencoder (time-series) │ │
│ │ • MLP Neural Network │ │ • Phase-wise z-score rules │ │
│ │ • Ridge Stacking Meta-Learner │ │ │ │
│ │ │ │ Output: │ │
│ │ Target: R² ≥ 0.90 all targets │ │ • Anomaly score (0–1) │ │
│ └────────────────┬─────────────────┘ │ • Root cause attribution │ │
│ │ │ • Phase health flags │ │
│ │ └──────────────────┬─────────────────┘ │
│ └──────────────────────────┬─────────────┘ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ MODULE 3: SHAP Explainability Engine │ │
│ │ • Per-target feature importance │ │
│ │ • Per-batch waterfall explanations │ │
│ │ • Beeswarm + bar summary plots │ │
│ └──────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────────────────┐ │
│ │ MODULE 4: Carbon Footprint Tracker │ │
│ │ • CO₂e per batch (Energy × 0.716) │ │
│ │ • Adaptive target setting │ │
│ │ • Regulatory compliance tracking │ │
│ └──────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────────┐
│ SERVING & VISUALIZATION LAYER │
│ │
│ ┌──────────────────────────────────┐ ┌────────────────────────────────────┐ │
│ │ FastAPI REST Backend │ │ Next.js Web Dashboard │ │
│ │ │ │ (React 19 + TypeScript) │ │
│ │ POST /api/predict │ │ Tab 1: Predictions │ │
│ │ POST /api/anomaly │ │ Tab 2: Energy Monitor │ │
│ │ GET /api/explain/{batch_id} │ │ Tab 3: Batch Comparison │ │
│ │ GET /api/carbon/{batch_id} │ │ Tab 4: Carbon Footprint │ │
│ │ GET /api/batches │ │ Tab 5: What-If Optimizer │ │
│ │ GET /api/carbon_history │ │ Tab 6: Benchmark Report │ │
│ │ GET /api/model_metrics │ │ │ │
│ │ GET /api/health │ │ Recharts visualizations │ │
│ │ < 100ms inference time │ │ Real-time slider predictions │ │
│ │ Swagger auto-docs at /docs │ │ http://localhost:3000 │ │
│ └──────────────────────────────────┘ └────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────────┘
Two source files feed the system. They join on Batch_ID, and T001 is the only batch present in both. The join drives sensor simulation: T001's sensor profile becomes the template that is scaled out to all 60 batches.
| File | What It Represents | Primary Use |
|---|---|---|
| _h_batch_process_data.xlsx | Minute-by-minute sensor log of one batch (T001) | Energy pattern module + sensor simulation template |
| _h_batch_production_data.xlsx | Summary records of 60 batches | Multi-target prediction + target derivation |
Two parallel pipelines that merge before model training:
Time-Series Pipeline (from File 1):
Raw sensor data (211 rows)
→ Segment by Phase (8 phases)
→ Aggregate per phase: mean, max, std of Power and Vibration
→ Compute rolling 5-min power slope (trend indicator)
→ FFT dominant frequency of vibration (motor frequency signature)
→ Result: 1 row of 32 features representing T001's phase profile
→ Simulate for T002–T060 using physics-based scaling
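The phase aggregation, rolling slope, and FFT steps above can be sketched with pandas/numpy. The frame below is a tiny synthetic stand-in for the T001 sensor log; the column names (`Phase`, `Power_kW`, `Vibration_mm_s`) are taken from the module descriptions, not verified against the actual file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_process (two phases instead of eight)
df_process = pd.DataFrame({
    "Phase": ["Granulation"] * 10 + ["Compression"] * 10,
    "Power_kW": np.r_[np.linspace(10, 12, 10), np.linspace(40, 55, 10)],
    "Vibration_mm_s": np.r_[np.full(10, 2.0), np.full(10, 6.0)],
})

# Phase-wise mean/max/std, flattened into one feature row for the batch
agg = df_process.groupby("Phase")[["Power_kW", "Vibration_mm_s"]].agg(["mean", "max", "std"])
phase_features = agg.to_numpy().ravel()   # 2 phases x 2 signals x 3 stats = 12 values

# Rolling 5-minute power slope (linear fit over each 5-sample window)
slope = df_process["Power_kW"].rolling(5).apply(
    lambda w: np.polyfit(np.arange(5), w.to_numpy(), 1)[0]
)

# FFT dominant frequency of the mean-removed vibration signal
vib = (df_process["Vibration_mm_s"] - df_process["Vibration_mm_s"].mean()).to_numpy()
spectrum = np.abs(np.fft.rfft(vib))
dominant_freq = np.fft.rfftfreq(len(vib))[spectrum.argmax()]
```

With the real 8-phase log the same aggregation yields 8 × 2 × 3 = 48 base stats, which the pipeline trims to the 32-feature phase profile.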
Batch-Level Pipeline (from File 2):
Raw batch records (60 rows)
→ Validate: no nulls, check value ranges
→ Detect outliers using IQR (flag, do not drop)
→ Derive Energy_kWh using physics formula
→ Derive Carbon_kgCO2e = Energy_kWh × 0.716
→ Normalize with MinMaxScaler (fit on train only)
→ Result: 60 rows × 10 engineered features
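The IQR flag-don't-drop rule and the carbon derivation reduce to a few lines; a minimal sketch, with a toy `Energy_kWh` column standing in for the physics-derived values:

```python
import pandas as pd

# Toy batch records: one extreme energy value to trigger the flag
df = pd.DataFrame({"Energy_kWh": [70.0, 72.0, 71.0, 69.0, 120.0, 73.0]})

# IQR outlier detection — flag extremes, never drop them
q1, q3 = df["Energy_kWh"].quantile([0.25, 0.75])
iqr = q3 - q1
df["outlier_flag"] = (df["Energy_kWh"] < q1 - 1.5 * iqr) | (df["Energy_kWh"] > q3 + 1.5 * iqr)

# Carbon derivation with the India grid emission factor
df["Carbon_kgCO2e"] = df["Energy_kWh"] * 0.716
```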
Merge:
Join both pipelines on Batch_ID
→ Final matrix: 60 rows × ~22 features
→ 80/20 split: 48 train, 12 test
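The merge and leakage-safe scaling could look like the sketch below. The toy frames are placeholders for the real `phase_features.csv` and `batch_outcomes.csv`; the `Batch_ID` key, 80/20 split, and fit-on-train-only rule follow the text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

ids = [f"T{i:03d}" for i in range(1, 61)]
phase = pd.DataFrame({"Batch_ID": ids, "phase_power_compression": range(60)})
outcomes = pd.DataFrame({"Batch_ID": ids, "Compression_Force": range(60),
                         "Hardness": range(60)})

# Join both pipelines on Batch_ID
merged = outcomes.merge(phase, on="Batch_ID", how="inner")

# 80/20 split: 48 train / 12 test
train, test = train_test_split(merged, test_size=0.2, random_state=42)

# MinMaxScaler fit on train only, then applied to both splits
scaler = MinMaxScaler().fit(train[["Compression_Force"]])
train_s = scaler.transform(train[["Compression_Force"]])
test_s = scaler.transform(test[["Compression_Force"]])
```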
Four modules, each independently trained and serialized:
| Module | Algorithm(s) | Input | Output |
|---|---|---|---|
| 1 — Prediction | XGBoost + RF + MLP → Ridge Stacking | 8 process params + 4 sim features | 7 targets (6 quality + Energy_kWh) |
| 2 — Anomaly | Isolation Forest + LSTM Autoencoder | Phase-aggregated sensor features / raw time-series | Anomaly score + root cause |
| 3 — Explainability | SHAP TreeExplainer | XGBoost model + input features | Feature contribution scores |
| 4 — Carbon | Rule-based formula | Predicted Energy_kWh | CO₂e + dynamic targets |
FastAPI (port 8000):
→ Loads all serialized models at startup
→ Handles prediction, anomaly, explanation, carbon requests
→ All responses in JSON
Next.js Dashboard (port 3000):
→ Calls FastAPI endpoints via REST
→ Renders Recharts visualizations for all charts
→ Real-time slider updates trigger /api/predict calls (<100ms)
Input (8 features):
Granulation_Time, Binder_Amount, Drying_Temp, Drying_Time,
Compression_Force, Machine_Speed, Lubricant_Conc, Moisture_Content
+ sim features: phase_power_compression, phase_vibration_milling, etc.
┌─────────────────┐
│ 5-Fold CV │
└────────┬────────┘
│ out-of-fold predictions
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ XGBoost │ │ Random │ │ MLP │
│ MultiOut │ │ Forest │ │ 7-output │
└──────┬─────┘ └─────┬──────┘ └──────┬─────┘
│ │ │
└────────────────┼──────────────────┘
▼
┌─────────────────┐
│ Ridge Stacking │ ← meta-learner
│ Meta-Learner │
└────────┬────────┘
▼
Output (7 targets):
Hardness, Friability, Dissolution_Rate, Content_Uniformity,
Disintegration_Time, Tablet_Weight, Energy_kWh
| Model | Strength | Weakness | Role in Ensemble |
|---|---|---|---|
| XGBoost | Best single-model performance; handles non-linearity | Sensitive to hyperparameters | Primary base learner |
| Random Forest | Stable, low-variance; bagging diversity | Slightly lower accuracy than XGBoost | Diversity provider |
| MLP | Captures inter-target correlations via shared layers | Needs more data; slower to train | Non-linear interaction catcher |
| Ridge Stacking | Learns optimal weights for blending | Linear only | Final combiner |
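A minimal single-target sketch of the OOF stacking scheme above. GradientBoosting stands in for XGBoost so the example needs only scikit-learn, and the 7-target `MultiOutputRegressor` wrapping is omitted for brevity; the structure (out-of-fold base predictions feeding a Ridge meta-learner) is the same.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

# 60 batches x 8 process params, single synthetic target
X, y = make_regression(n_samples=60, n_features=8, noise=5.0, random_state=0)

bases = [GradientBoostingRegressor(random_state=0),
         RandomForestRegressor(n_estimators=100, random_state=0)]

# 5-fold out-of-fold predictions become the meta-learner's features,
# so the Ridge combiner never sees a base model's in-sample fit
oof = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in bases])
meta = Ridge().fit(oof, y)

for m in bases:                      # refit bases on all data for inference
    m.fit(X, y)
stacked = meta.predict(np.column_stack([m.predict(X) for m in bases]))
```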
Batch sensor data (211 timesteps × 2 channels: Power, Vibration)
│
┌──────────────┴──────────────┐
▼ ▼
┌───────────────┐ ┌──────────────────────┐
│ Layer 1: │ │ Layer 2: │
│ Isolation │ │ LSTM Autoencoder │
│ Forest │ │ │
│ │ │ Encoder: │
│ Input: 32 │ │ LSTM(64) → LSTM(32) │
│ phase-agg │ │ → Dense(16) [z] │
│ features │ │ │
│ │ │ Decoder: │
│ Output: │ │ Dense(32) → │
│ anomaly_flag │ │ LSTM(32) → │
│ score (0-1) │ │ LSTM(64) → │
│ │ │ TimeDistributed(2) │
│ Speed: ~5ms │ │ │
└───────┬───────┘ │ Output: │
│ │ reconstruction_error │
│ │ per timestep │
│ │ │
│ │ Speed: ~50ms │
│ └──────────┬───────────┘
└──────────────────────────┬─────┘
▼
┌──────────────────────┐
│ Root Cause Engine │
│ (Rules + ML fusion) │
│ │
│ IF vib_milling > 8.5 │
│ → bearing wear │
│ │
│ IF pwr_comp > 58.0 │
│ → motor overload │
│ │
│ IF pwr_dry > 28.0 │
│ → damp raw material │
└──────────────────────┘
| Anomaly Type | Affected Phase | Signal Pattern | Root Cause Message |
|---|---|---|---|
| Bearing wear | Milling | Vibration > 8.5 mm/s | "Bearing wear suspected — schedule inspection" |
| Motor overload | Compression | Power > 58.0 kW | "Motor stress — check tooling and die fill" |
| Process drift | Drying | Power > 28.0 kW consistently | "High moisture in raw material — check intake specs" |
| Gradual degradation | Any | Batch-to-batch power increase | "CUSUM drift detected — maintenance review needed" |
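The batch-level Isolation Forest layer can be sketched as follows, with synthetic phase-aggregated features standing in for the real 32-feature matrix (the contamination and n_estimators values match the pipeline settings given later):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(54, 32))   # normal phase-aggregated batches
X_anom = rng.normal(5.0, 1.0, size=(6, 32))      # injected anomalous batches
X = np.vstack([X_normal, X_anom])

iso = IsolationForest(n_estimators=200, contamination=0.05,
                      random_state=0).fit(X)
flags = iso.predict(X)            # -1 = anomaly, +1 = normal
scores = -iso.score_samples(X)    # higher = more anomalous
```

The flagged batches then pass to the rule-based root cause engine, which maps phase-level threshold breaches to messages like those in the table above.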
Trained XGBoost models (one per target)
│
▼
shap.TreeExplainer(xgb_model)
│
┌──────────┴──────────────────────┐
▼ ▼
Global view Local view
(all 60 batches) (1 specific batch)
Beeswarm plot: Waterfall plot:
Feature vs SHAP value Base value
spread across batches + Compression_Force (+18N)
+ Moisture_Content (-7N)
Bar chart: + Machine_Speed (+2N)
Mean |SHAP| ranking = Final prediction: 95N
per target
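The waterfall arithmetic is SHAP's additivity property: the base value plus all per-feature contributions equals the model's prediction. The numbers below mirror the Dissolution_Rate example output shown in this section; they are illustrative, not live SHAP values.

```python
# SHAP additivity check: base_value + sum(contributions) == prediction
base_value = 90.93
contributions = {
    "Compression_Force": -2.8, "Moisture_Content": -1.5, "Machine_Speed": 0.6,
    "Binder_Amount": -0.3, "Drying_Temp": 1.2, "Granulation_Time": -0.9,
    "Drying_Time": -0.1, "Lubricant_Conc": 0.07,
}
prediction = base_value + sum(contributions.values())        # 87.2

# The "top driver" message is just the largest |contribution|
top_driver = max(contributions, key=lambda k: abs(contributions[k]))
```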
{
"target": "Dissolution_Rate",
"base_value": 90.93,
"prediction": 87.2,
"feature_contributions": {
"Compression_Force": -2.8,
"Moisture_Content": -1.5,
"Machine_Speed": 0.6,
"Binder_Amount": -0.3,
"Drying_Temp": +1.2,
"Granulation_Time": -0.9,
"Drying_Time": -0.1,
"Lubricant_Conc": 0.07
},
"top_driver": "Compression_Force is pulling Dissolution_Rate down by 2.8%"
}

Predicted Energy_kWh (from Module 1)
│
▼
Carbon_kgCO2e = Energy_kWh × 0.716
(India grid emission factor, CEA FY 2022-23)
│
▼
Adaptive Target Algorithm:
┌────────────────────────────────────────────┐
│ 1. Look at last 20 batches' emissions │
│ 2. best_10p = percentile(emissions, 10) │
│ 3. stretch = best_10p × 0.95 │
│ 4. target = max(stretch, regulatory_floor)│
└────────────────────────────────────────────┘
│
▼
Output: {
dynamic_target, current_avg,
trend (📉/📈), on_target (✅/❌),
gap_to_target
}
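The four-step adaptive target algorithm above in code, with an assumed `regulatory_floor` value (the real floor would live in configuration):

```python
import numpy as np

# Hypothetical per-batch emissions history, kg CO2e (most recent last)
emissions = np.array([55, 53, 52, 54, 50, 49, 51, 48, 47, 52,
                      50, 49, 51, 53, 48, 47, 46, 50, 49, 51], dtype=float)

best_10p = np.percentile(emissions[-20:], 10)   # 1. best decile of last 20 batches
stretch = best_10p * 0.95                       # 2-3. 5% stretch beyond best decile
regulatory_floor = 40.0                         # assumed floor for this sketch
target = max(stretch, regulatory_floor)         # 4. never below the floor
```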
START
│
▼
[1] LOAD DATA
├── _h_batch_process_data.xlsx → df_process (211 × 11)
└── _h_batch_production_data.xlsx → df_prod (60 × 15)
│
▼
[2] VALIDATE
├── Assert zero nulls (both files confirmed clean)
├── IQR outlier detection → flag extremes
└── Value range checks per column
│
▼
[3] FEATURE ENGINEERING
├── Phase-aggregate df_process → 32 phase features (1 row)
├── Physics-based sensor simulation for T002–T060
├── Inject anomalies into 10% of simulated batches
├── Derive Energy_kWh per batch (physics formula)
├── Derive Carbon_kgCO2e = Energy_kWh × 0.716
└── Merge all → merged_dataset (60 × ~22)
│
▼
[4] TRAIN/TEST SPLIT
├── 80/20 → 48 train / 12 test
└── MinMaxScaler fit on train only, transform both
│
▼
[5] TRAIN MODULE 1 — MULTI-TARGET PREDICTION
├── 5-fold CV → out-of-fold predictions
├── XGBoost MultiOutputRegressor (Optuna tuning, 50 trials)
├── Random Forest MultiOutputRegressor
├── MLP Neural Network (shared layers + 7 output heads)
├── Ridge Stacking Meta-Learner on OOF predictions
└── Evaluate: assert R² ≥ 0.90 on all primary targets
│
▼
[6] TRAIN MODULE 2 — ANOMALY DETECTION
├── Extract phase-aggregated features from simulated sensors
├── Isolation Forest (contamination=0.05, n_estimators=200)
├── LSTM Autoencoder (train on normal batches only)
├── Compute reconstruction threshold = mean + 3×std
└── Evaluate: Precision, Recall, AUC-ROC on injected anomalies
│
▼
[7] SET UP MODULE 3 — SHAP
├── shap.TreeExplainer on trained XGBoost models
├── Compute SHAP values for all 60 batches
└── Pre-generate beeswarm + bar plots for dashboard
│
▼
[8] SET UP MODULE 4 — CARBON
├── Compute Carbon_kgCO2e for all 60 batches
└── Initialize adaptive target state from batch history
│
▼
[9] SERIALIZE ALL MODELS
├── models/xgb_multitarget.pkl
├── models/rf_multitarget.pkl
├── models/mlp_model.keras
├── models/stacking_meta.pkl
├── models/isolation_forest.pkl
├── models/lstm_autoencoder.keras
└── models/scaler.pkl
│
▼
[10] START SERVICES
├── uvicorn api.main:app --port 8000
└── cd dashboard && npm run dev (http://localhost:3000)
│
▼
END — System is live
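Step [6]'s reconstruction threshold (mean + 3×std of errors on normal training batches) reduces to a few lines; the error values below are synthetic stand-ins for real LSTM Autoencoder reconstruction errors:

```python
import numpy as np

rng = np.random.default_rng(1)
# Per-batch reconstruction errors on the normal training batches
recon_errors = rng.normal(0.02, 0.005, size=54)

# Threshold persisted to models/lstm_threshold.json
threshold = recon_errors.mean() + 3 * recon_errors.std()

new_error = 0.042                 # e.g. a suspect batch at inference time
is_anomaly = new_error > threshold
```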
manufacturing-intelligence/
│
├── README.md
├── SETUP.md ← Environment setup guide
├── PIPELINE.md ← Plain-English pipeline walkthrough
├── BENCHMARK.md ← Model benchmark report
│
├── data/
│ ├── raw/
│ │ ├── _h_batch_process_data.xlsx ← T001 sensor log (211 rows × 11 cols)
│ │ └── _h_batch_production_data.xlsx ← T001–T060 batch records
│ ├── processed/
│ │ ├── merged_dataset.csv ← 60 batches × ~22 features (final ML input)
│ │ ├── phase_features.csv ← Phase-aggregated sensor features
│ │ ├── batch_outcomes.csv ← Cleaned production data + derived targets
│ │ └── carbon_history.csv ← Per-batch CO₂e with adaptive targets
│ └── simulated/
│ └── simulated_sensors.csv ← Physics-based sensor data T001–T060
│
├── notebooks/ ← Core analysis notebooks (run in order)
│ ├── 01_EDA.ipynb
│ ├── 02_feature_engineering.ipynb
│ ├── 03_multitarget_models.ipynb
│ ├── 04_anomaly_detection.ipynb
│ └── 05_explainability.ipynb
│
├── analysis/ ← Deep-dive analysis notebooks
│ ├── 01_data_profiling.ipynb
│ ├── 02_correlation_deep_dive.ipynb
│ ├── 03_phase_energy_analysis.ipynb
│ ├── 04_model_comparison.ipynb
│ └── 05_business_impact.ipynb
│
├── src/
│ ├── __init__.py
│ ├── config.py ← All constants: paths, thresholds, emission factors
│ ├── preprocessing.py ← Load, validate, normalize
│ ├── simulate_sensors.py ← Physics-based simulation for T002–T060
│ ├── feature_engineering.py ← Phase aggregation, FFT, derived features
│ ├── multi_target_model.py ← XGBoost + RF + MLP + stacking
│ ├── anomaly_detector.py ← Isolation Forest + LSTM Autoencoder
│ ├── shap_explainer.py ← SHAP value computation + plots
│ ├── carbon_calculator.py ← CO₂e computation + adaptive targets
│ ├── run_pipeline.py ← Master script: runs all training steps
│ └── utils.py ← Metrics, plot helpers, serialization
│
├── models/
│ ├── xgb_multitarget.pkl
│ ├── rf_multitarget.pkl
│ ├── mlp_model.keras ← Native Keras (.keras) format
│ ├── stacking_meta.pkl
│ ├── isolation_forest.pkl
│ ├── lstm_autoencoder.keras ← Native Keras (.keras) format
│ ├── scaler.pkl
│ ├── shap_values.pkl
│ ├── lstm_threshold.json
│ ├── lstm_norm_params.json
│ ├── evaluation_results.json ← Per-target R², MAE, RMSE, MAPE
│ └── pipeline_summary.json
│
├── api/
│ ├── main.py ← FastAPI app + all route handlers
│ └── schemas.py ← Pydantic request/response models
│
├── dashboard/ ← Next.js web dashboard
│ ├── package.json
│ ├── next.config.ts
│ └── src/
│ ├── app/
│ │ ├── layout.tsx
│ │ ├── page.tsx
│ │ └── ClientLayout.tsx ← Tab-based navigation
│ └── components/
│ ├── MetricCard.tsx
│ ├── Slider.tsx
│ └── tabs/
│ ├── PredictionsTab.tsx
│ ├── EnergyTab.tsx
│ ├── ComparisonTab.tsx
│ ├── CarbonTab.tsx
│ ├── WhatIfTab.tsx
│ └── BenchmarkTab.tsx ← NEW: model benchmark report
│
├── tests/
│ ├── test_preprocessing.py
│ ├── test_models.py
│ └── test_api.py
│
└── requirements.txt
Base URL: http://localhost:8000
POST /api/predict → Quality + energy prediction for given parameters
POST /api/anomaly → Energy pattern anomaly check for a batch
GET /api/explain/{batch_id} → SHAP explanation for a specific prediction
GET /api/carbon/{batch_id} → Carbon footprint + adaptive targets
GET /api/batches → List all available batch IDs
GET /api/carbon_history → Full carbon history for all batches
GET /api/model_metrics → Full benchmark (R², MAE, RMSE, MAPE, anomaly metrics)
GET /api/health → Service health check (all model load statuses)
Request:
{
"granulation_time": 16.0,
"binder_amount": 9.0,
"drying_temp": 60,
"drying_time": 29,
"compression_force": 12.0,
"machine_speed": 170,
"lubricant_conc": 1.2,
"moisture_content": 2.0
}
Response:
{
"hardness": 89.4,
"friability": 0.81,
"dissolution_rate": 90.7,
"content_uniformity": 98.2,
"disintegration_time": 8.3,
"tablet_weight": 202.1,
"energy_kwh": 72.4,
"carbon_kg_co2e": 51.8,
"composite_quality_score": 82.3
}
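A sketch of what the corresponding Pydantic request model in api/schemas.py might look like — the field names follow the example payload above, but the actual file contents may differ:

```python
from pydantic import BaseModel

class PredictRequest(BaseModel):
    """Hypothetical request schema for POST /api/predict."""
    granulation_time: float
    binder_amount: float
    drying_temp: float
    drying_time: float
    compression_force: float
    machine_speed: float
    lubricant_conc: float
    moisture_content: float

# Validate the example payload from above
req = PredictRequest(granulation_time=16.0, binder_amount=9.0, drying_temp=60,
                     drying_time=29, compression_force=12.0, machine_speed=170,
                     lubricant_conc=1.2, moisture_content=2.0)
```

FastAPI uses such a model to reject malformed payloads before they reach the prediction code.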
Request: { "batch_id": "T045" }
Response:
{
"batch_id": "T045",
"is_anomaly": true,
"severity": "HIGH",
"root_causes": [
{
"phase": "Milling",
"signal": "Vibration spike (11.2 mm/s vs normal 6.5)",
"interpretation": "Bearing wear suspected in milling unit",
"action": "Schedule inspection before next batch run"
}
],
"isolation_forest_score": -0.18,
"lstm_reconstruction_error": 0.042
}
Response:
{
"target": "Dissolution_Rate",
"base_value": 90.93,
"prediction": 87.2,
"feature_contributions": {
"Compression_Force": -2.8,
"Moisture_Content": -1.5,
"Machine_Speed": 0.6,
"Binder_Amount": -0.3,
"Drying_Temp": +1.2,
"Granulation_Time": -0.9,
"Drying_Time": -0.1,
"Lubricant_Conc": 0.07
},
"top_driver": "Compression_Force reduced Dissolution_Rate by 2.8% below average"
}
Response:
{
"batch_id": "T045",
"energy_kwh": 74.2,
"carbon_kg_co2e": 53.1,
"grid": "India (0.716 kg CO2e/kWh)",
"dynamic_target": 48.5,
"on_target": false,
"trend": "📈 Worsening",
"gap_to_target_kg": 4.6
}
Tab Layout (Next.js Dashboard — http://localhost:3000)
Next.js App (port 3000)
│
├── Tab 1 — 🔮 Predictions
│ ├── 8 input sliders for process parameters
│ ├── Metric cards (Hardness, Friability, Dissolution, Energy, Carbon)
│ └── Composite Quality Score display
│
├── Tab 2 — ⚡ Energy Monitor
│ ├── Batch selector dropdown (T001–T060)
│ ├── Power + Vibration recharts line chart colored by Phase
│ └── Anomaly score + root cause alerts (🔴 HIGH / 🟡 MEDIUM / 🟢 OK)
│
├── Tab 3 — 📊 Batch Comparison
│ ├── Two batch selectors
│ ├── Normalized radar charts for each batch
│ └── Delta table showing improvements/worsened targets
│
├── Tab 4 — 🌍 Carbon Footprint
│ ├── Carbon trend chart (all 60 batches)
│ ├── Adaptive target line
│ └── Grid selector (India / EU / US / Renewable)
│
├── Tab 5 — 🎛️ What-If Optimizer
│ ├── All 8 parameter sliders (real-time update < 100ms)
│ └── Live predictions update as sliders move
│
└── Tab 6 — 📈 Benchmark (NEW)
├── Full model performance table (R², MAE, RMSE, MAPE)
├── Per-target breakdown for XGBoost / RF / MLP / Stacking
├── Anomaly detector metrics (Precision, Recall, F1, AUC-ROC)
└── Dataset metadata (batch count, feature count, CV folds)
_h_batch_process_data.xlsx (T001 only)
│
├──→ Phase energy profile extraction
│ (8 phase segments → aggregated stats)
│
├──→ Physics-based scaling with _h_batch_production_data params
│ → Simulated sensor profiles for T002–T060
│
└──→ LSTM Autoencoder training
(learns "normal" power + vibration patterns)
_h_batch_production_data.xlsx (T001–T060)
│
├──→ Input features → XGBoost / RF / MLP training
│
├──→ Target variables → model output supervision
│
├──→ Scaling params for sensor simulation
│ (Machine_Speed, Drying_Temp, Compression_Force per batch)
│
└──→ Energy_kWh derivation
→ Carbon_kgCO2e computation
→ Adaptive target history
BOTH join on Batch_ID (T001)
→ Correlation between sensor patterns and quality outcomes
→ Used to validate simulation realism
Architecture version: 2.0 | AI-Driven Manufacturing Intelligence Hackathon