Event: AI-Driven Manufacturing Intelligence Hackathon
Track: A — Predictive Modelling Specialization
Domain: Pharmaceutical Tablet Manufacturing
Status: Complete Reference Document for Team Review
- Problem Statement Deep Dive
- Why We Chose Track A
- What We Are Building
- Dataset Analysis
- How We Will Do It
- Tech Stack
- ML Models — Research Rationale
- Innovations & Extra Features
- Business Impact
- Risks & Mitigations
- References & Research Basis
- Quick Start Guide
Modern manufacturing — especially pharmaceutical tablet manufacturing — faces enormous pressure on three simultaneous fronts:
Front 1: Energy Costs & Carbon
Manufacturing accounts for roughly 30% of global energy use. Pharmaceutical tablet presses, granulators, and dryers run continuously. A single batch of tablets might consume 60–80 kWh. At scale (thousands of batches/year), this translates to millions in energy costs and thousands of tonnes of CO₂e. The Indian pharmaceutical sector alone produces over $50B in tablets annually — even a 5% energy reduction per batch translates into substantial real savings.
Front 2: Quality Consistency
A tablet that fails dissolution testing, is too brittle (high friability), or has inconsistent drug content (poor content uniformity) is a regulatory failure. Rejected batches are pure waste — energy spent, materials lost, time wasted. The challenge is that quality is determined after the batch is finished. Predicting quality during or before the process would save enormous waste.
Front 3: Equipment Health & Unplanned Downtime
Pharmaceutical manufacturers lose an estimated 5–15% of production time to unexpected equipment failures. A tablet press bearing that fails mid-batch ruins the entire run. Current maintenance is either calendar-based (wasteful) or reactive (too late). Energy consumption patterns actually reveal equipment degradation before physical failure — a motor drawing more current than usual is showing signs of mechanical stress.
The hackathon isolates the most critical pain points:
PAIN POINT 1: Batch-Level Variability
Every batch behaves differently even with the same recipe.
Machine age, ambient humidity, raw material lot variation, operator decisions
all cause energy and quality to fluctuate.
Current systems have no predictive view of this variability.
PAIN POINT 2: Static Management Systems
Energy KPIs are set once and never updated.
Post-process monitoring = you find out problems AFTER money is wasted.
No learning from good batches to define what "optimal" looks like.
PAIN POINT 3: Conflicting Objectives
Faster machine speed → more tablets, but higher energy AND worse friability.
Higher compression force → harder tablets, but more energy AND slower disintegration.
Lower drying temp → less energy, but higher residual moisture AND longer drying time.
These tradeoffs are currently managed by experience, not data.
Track A requires us to build a system that:
1. MULTI-TARGET PREDICTION
Simultaneously predict: Quality (rejection rate), Yield (output efficiency),
Performance (interruption rate), and Energy Consumption
— all from process parameters, before the batch is complete.
Target accuracy: > 90% (R² ≥ 0.90)
2. ENERGY PATTERN INTELLIGENCE
Analyze power consumption time-series to identify:
— Which phase is consuming excess energy and why?
— Is the power pattern indicating equipment wear?
— Does a change in pattern correlate to a specific process problem?
3. REAL-TIME FORECASTING
The system must be deployable for real-time use, not just offline analysis.
Predictions must come in < 100ms for production-floor use.
| Dimension | Track A (Predictive Modelling) | Track B (Optimization Engine) |
|---|---|---|
| Core strength needed | ML modelling, feature engineering | Optimization algorithms, system design |
| Data dependency | Can work well with 60 batches + simulation | Needs rich historical data for golden signatures |
| Demo-ability | Easy to show live predictions with sliders | Requires more complex real-time orchestration |
| Research backing | Extensive literature on XGBoost + SHAP in pharma | Less established for batch-level golden signatures |
| Judging clarity | R² ≥ 0.90 is a crisp, verifiable benchmark | Optimization quality is harder to quantify |
| Expandability | Can add optimization on top of predictions | Harder to add explainability retroactively |
Decision: Track A — because our dataset has extremely high feature-target correlations (0.96–0.99), making predictive modelling near-certain to hit the ≥ 0.90 benchmark. The energy pattern analysis module (using Isolation Forest + LSTM Autoencoder) directly addresses the asset reliability angle. And SHAP explainability lets us tell a clear story to judges: "Here's what drives quality, here's what drives energy — and here's proof."
By building predictive models first, we get these "for free":
- Implicit optimization: Our What-If Optimizer tab is Track A's answer to Track B's optimization — operators can explore parameter space in real time using our predictive model
- Batch fingerprinting: Our radar chart comparison is a lightweight version of Track B's golden signature concept
- Foundation for future Track B: If we continue this project, our prediction models become the surrogate model for a future NSGA-II multi-objective optimizer
We are building a 4-module AI system for pharmaceutical tablet manufacturing that:
- Predicts batch outcomes before/during production — give it the process parameters and it tells you what quality, yield, performance, and energy to expect
- Monitors energy patterns for equipment health — watches the power and vibration sensor data phase-by-phase and alerts when something looks wrong
- Explains every prediction — uses SHAP values so operators understand why the model made a prediction, not just what it predicted
- Tracks carbon footprint with adaptive targets — calculates CO₂e per batch and dynamically adjusts targets based on operational best performance + regulatory requirements
| User Action | System Response |
|---|---|
| Enter process parameters → click Predict | Get all quality targets + energy forecast + SHAP explanation + carbon footprint |
| Select a batch → click Analyze Energy | Get phase-wise power chart + anomaly score + root cause diagnosis |
| Compare two batches | Side-by-side radar chart + delta table + "which is better and why" |
| View carbon dashboard | Trend chart + adaptive target + regulatory compliance status |
| Move sliders in What-If mode | Real-time prediction updates as parameters change (<100ms) |
Two Excel files that together represent one pharmaceutical tablet manufacturing process:
Think of this as the black box recorder for one batch (T001). It captures sensor readings every minute for 211 minutes across 8 manufacturing phases.
The 8 Phases (in sequence):
Preparation (0–24 min) → Raw material weighing, equipment setup
Granulation (25–39 min) → Wet mixing of active + binder to form granules
Drying (40–64 min) → Fluidized bed dryer to remove moisture
Milling (65–79 min) → Size reduction of dried granules
Blending (80–101 min) → Mixing granules with lubricant and excipients
Compression (102–153 min) → Tablet press compresses blend into tablets ← CRITICAL
Coating (154–173 min) → Film coating for appearance/protection
Quality Testing (174–210 min) → In-process dissolution, hardness checks
Why This File Matters for Track A:
- Power_Consumption_kW and Vibration_mm_s are the energy pattern signals
- Phase structure gives us context: Milling vibration is expected to be high; Preparation vibration should be near-zero
- Deviations from expected phase profiles = anomalies
Key Finding from Data:
Compression phase uses 50.4% of total batch energy (38.69 kWh out of 76.74 kWh total)
Milling phase has the highest vibration (up to 9.79 mm/s) — equipment stress indicator
These two phases are the highest-ROI monitoring targets
Think of this as the production log for 60 batches — what settings were used and what quality came out.
The Critical Data Relationship:
INPUT (what you set): OUTPUT (what you measure after batch):
───────────────────── ───────────────────────────────────────
Granulation_Time ──┐ Hardness (structural integrity)
Binder_Amount ──┤ Friability (breakage resistance)
Drying_Temp ──┼──[AI MODEL]──▶ Dissolution_Rate (drug release)
Drying_Time ──┤ Content_Uniformity (dose consistency)
Compression_Force ──┤ Disintegration_Time (breakdown speed)
Machine_Speed ──┤ Tablet_Weight (mass consistency)
Lubricant_Conc ──┤ [derived] Energy_kWh
Moisture_Content ──┘ [derived] Carbon_kgCO2e
The Stunning Correlation Finding: The feature-target correlations in this dataset are almost all between 0.92–0.99. This is unusually strong and means:
- Our models will achieve high accuracy (R² > 0.95 for most targets)
- The physics of tablet manufacturing is highly deterministic within this dataset's range
- The main drivers are Compression_Force and Moisture_Content — these two features dominate most quality outcomes
Problem: File 1 has sensor time-series data for only T001. File 2 has 60 batches but only summary-level data — no minute-by-minute sensors.
Impact: We can't directly correlate per-minute energy patterns with quality outcomes across all 60 batches.
Solution — Physics-Based Sensor Simulation:
We take T001's sensor profile as a "template" and generate 59 more sensor profiles.
The scaling is physics-informed:
Power in Compression phase ∝ (Compression_Force)^1.2
Power in Drying phase ∝ (Drying_Temp)^0.8 × Drying_Time
Vibration ∝ Machine_Speed^1.5 × (1 + Moisture_Content × 0.3)
We add ±5% Gaussian noise to simulate measurement variability.
We inject 3 types of synthetic faults into 10% of batches to train anomaly detection.
This is explicitly permitted in the hackathon problem statement:
"Conduct comprehensive testing using simulated and real manufacturing data"
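As a concrete illustration, the Compression-phase scaling rule could be sketched as below. The template profile, reference force, and phase length are hypothetical stand-ins for T001's real values:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical template: T001's per-minute Compression-phase power (kW).
t001_compression_kw = np.full(52, 0.75)
T001_FORCE_KN = 12.0  # hypothetical Compression_Force for T001

def scale_compression_power(template_kw, batch_force_kn,
                            ref_force_kn=T001_FORCE_KN, noise_sd=0.05):
    """Scale the template to a new batch using the physics-informed rule
    Power ∝ Compression_Force^1.2, then apply ±5% Gaussian noise."""
    scale = (batch_force_kn / ref_force_kn) ** 1.2
    noise = rng.normal(1.0, noise_sd, size=template_kw.shape)
    return template_kw * scale * noise

# Simulated Compression-phase profile for a batch pressed at 14 kN
profile = scale_compression_power(t001_compression_kw, batch_force_kn=14.0)
```

The Drying and Vibration rules follow the same pattern with their own exponents.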
What it does: Takes 8 process parameters → predicts 7 outputs simultaneously
Why an ensemble, not one model:
- With only 60 data points, a single model risks overfitting to training noise
- XGBoost handles non-linearity and interactions well, but can be sensitive to hyperparameters
- Random Forest provides stable variance reduction through bagging
- MLP can capture inter-target correlations (e.g., Hardness and Friability are inversely related)
- Stacking combines all three: each model contributes its strength, meta-learner learns the optimal blend
The training process:
Step 1: 5-fold cross-validation to get out-of-fold predictions from each base model
Step 2: Use those OOF predictions as inputs to train a Ridge meta-learner
Step 3: Final prediction = meta-learner output (weighted blend of base models)
Step 4: Evaluate on held-out test set (12 batches)
Step 5: Assert R² ≥ 0.90 for all primary targets before deployment
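Steps 1–3 map naturally onto scikit-learn's StackingRegressor, which generates the out-of-fold predictions internally. A minimal single-target sketch on synthetic stand-in data (the real pipeline adds XGBoost as a third base model and wraps this for multiple targets):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the 60-batch table (8 parameters, one target shown).
X, y = make_regression(n_samples=60, n_features=8, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=12, random_state=0)

# StackingRegressor performs Steps 1-3 internally: cv=5 produces
# out-of-fold predictions from each base model, then the Ridge
# meta-learner is fit on those predictions (no leakage).
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
        ("mlp", MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000,
                             random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X_tr, y_tr)
r2 = r2_score(y_te, stack.predict(X_te))
```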
Hyperparameter optimization: We use Optuna (a modern Bayesian optimization framework) to search for the best XGBoost parameters in 50 trials, taking about 5 minutes. This is much more efficient than GridSearch.
What it does: Watches power and vibration time-series → identifies anomalies → explains the likely cause
Two-layer detection:
Layer 1 — Isolation Forest (fast, batch-level):
- Trained on phase-aggregated features (mean/max/std of power and vibration for each of the 8 phases = 48 features)
- Score reflects how "isolated" a batch is from normal batches
- Advantage: runs in milliseconds, no sequence modelling needed
- Use case: quick batch-level health check
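A minimal sketch of Layer 1 on a synthetic phase-aggregated feature matrix (the fault separation below is exaggerated for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic feature matrix: one row per batch, columns standing in for
# the phase-aggregated power/vibration statistics.
normal = rng.normal(loc=1.0, scale=0.1, size=(54, 48))  # healthy batches
faulty = rng.normal(loc=2.0, scale=0.1, size=(6, 48))   # injected-fault batches
X = np.vstack([normal, faulty])

# contamination=0.1 matches the 10% synthetic fault-injection rate
iforest = IsolationForest(contamination=0.1, random_state=0).fit(X)
labels = iforest.predict(X)  # +1 = normal, -1 = anomaly
```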
Layer 2 — LSTM Autoencoder (deep, time-series-level):
- Trained only on "normal" batches (learns what normal energy patterns look like)
- Given a new batch, it reconstructs the time-series and measures how far the original was from reconstruction
- High reconstruction error = the pattern is unusual = anomaly
- Advantage: catches subtle sequential patterns that batch-level features miss
- Use case: deeper investigation of suspected batches
Root cause attribution:
- After detecting an anomaly, we look at which phase had the largest deviation from normal
- We then apply domain-knowledge rules to translate the signal into a maintenance or process recommendation
- Examples:
- Milling vibration 2× normal → "Bearing wear suspected, check milling unit"
- Compression power 1.8× normal → "Motor overload, check tooling and die fill"
- Drying power 1.3× normal + slow compression → "Raw material moisture too high"
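The rule layer can be a simple lookup table; the thresholds below are taken from the examples above and are illustrative, not validated limits:

```python
# Rule table: (phase, signal, ratio threshold) -> recommendation.
RULES = [
    ("Milling",     "vibration", 2.0, "Bearing wear suspected, check milling unit"),
    ("Compression", "power",     1.8, "Motor overload, check tooling and die fill"),
    ("Drying",      "power",     1.3, "Raw material moisture too high"),
]

def diagnose(phase_ratios):
    """phase_ratios: {(phase, signal): observed value / normal baseline}."""
    findings = []
    for phase, signal, threshold, advice in RULES:
        ratio = phase_ratios.get((phase, signal), 1.0)  # 1.0 = exactly normal
        if ratio >= threshold:
            findings.append(f"{phase} {signal} {ratio:.1f}x normal: {advice}")
    return findings

msgs = diagnose({("Milling", "vibration"): 2.3, ("Drying", "power"): 1.1})
```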
What it does: For any prediction, it shows how much each input feature contributed to the output
Why this matters for the hackathon:
- Judges want to see that you understand your model, not just run it
- Operators in a real factory won't trust a black-box — they need to know WHY
- SHAP is the industry standard for explaining tree-based models (XGBoost, Random Forest)
- Research in pharmaceutical manufacturing specifically cites SHAP as a key tool for process understanding
What we'll show:
Global view (across all batches):
Beeswarm plot: each dot is one batch-feature combination
Compression_Force has the widest spread → most important feature
Local view (for one specific batch):
Waterfall plot: starts at average prediction, then shows feature contributions
"Hardness prediction is 95N because:
Compression_Force = 14kN pushed it up by +22N above average
Moisture_Content = 3.1% pushed it down by -7N
Binder_Amount = 8g had minimal effect (+0.5N)"
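The waterfall arithmetic can be reproduced exactly for a linear surrogate, where Shapley values have the closed form φᵢ = wᵢ(xᵢ − E[xᵢ]). The coefficients and baseline below are hypothetical, chosen to echo the example; the real pipeline uses shap.TreeExplainer on the trained trees:

```python
# Hypothetical linear surrogate for the Hardness model (units: N per unit input).
weights = {"Compression_Force": 7.0, "Moisture_Content": -6.0, "Binder_Amount": 0.5}
means   = {"Compression_Force": 11.0, "Moisture_Content": 2.0, "Binder_Amount": 8.0}
batch   = {"Compression_Force": 14.0, "Moisture_Content": 3.1, "Binder_Amount": 9.0}

base = 80.0  # hypothetical average Hardness prediction across batches

# Exact Shapley value per feature for a linear model: w_i * (x_i - mean_i)
phi = {f: weights[f] * (batch[f] - means[f]) for f in weights}
prediction = base + sum(phi.values())  # waterfall: base + contributions
```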
What it does: Converts energy predictions into CO₂ emissions and sets adaptive reduction targets
The emission factor we use: The Central Electricity Authority of India publishes official grid emission factors. For FY 2022-23, the Indian grid emits 0.716 kg CO₂e per kWh (0.716 tCO₂/MWh). This is based on the actual generation mix of coal, gas, hydro, nuclear, and renewables feeding the Indian grid.
For comparison: EU grid = 0.29, US grid = 0.39, pure renewable = 0.02 kg CO₂e/kWh.
Adaptive target-setting logic:
Rather than a fixed annual target, we compute a rolling target:
1. Look at last 20 batches' carbon emissions
2. Find the best 10th percentile (what the best 10% of batches achieved)
3. Set stretch target = that value × 0.95 (5% better than our recent best)
4. Apply regulatory floor (can't set target below legal requirement)
5. Dynamic target = max(stretch target, regulatory floor)
Result: As operations improve, the target automatically becomes more ambitious.
This is "continuous improvement" baked into the target-setting algorithm.
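The five steps above fit in a few lines; the regulatory floor and the batch energies below are synthetic placeholders:

```python
import numpy as np

GRID_EF = 0.716          # kg CO2e per kWh (CEA India, FY 2022-23)
REGULATORY_FLOOR = 40.0  # kg CO2e per batch; hypothetical legal minimum target

def adaptive_carbon_target(recent_energy_kwh, floor=REGULATORY_FLOOR):
    """Rolling carbon target from recent batches' emissions (Steps 1-5)."""
    emissions = np.asarray(recent_energy_kwh) * GRID_EF  # kWh -> kg CO2e
    best_decile = np.percentile(emissions, 10)           # best 10% of batches
    stretch = best_decile * 0.95                         # 5% beyond recent best
    return max(stretch, floor)                           # respect the floor

# Recent batches' energy use (kWh), synthetic values around T001's 76.74
target = adaptive_carbon_target([76.7, 74.2, 71.9, 70.5, 69.8, 73.3, 72.0, 75.1])
```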
| Layer | Technology | Why |
|---|---|---|
| ML / Data Science | Python 3.11 | Universal in data science; best library ecosystem |
| Data Processing | Pandas, NumPy | Industry standard for tabular data manipulation |
| ML Models | scikit-learn | MultiOutputRegressor, IsolationForest, StandardScaler, Ridge |
| Gradient Boosting | XGBoost 2.x | Best performance on small tabular datasets; supports multi-output |
| Deep Learning | TensorFlow / Keras | LSTM Autoencoder for sequential anomaly detection |
| Hyperparameter Tuning | Optuna | Modern Bayesian optimization; more efficient than GridSearch |
| Explainability | SHAP | Industry-standard for tree model explanation; TreeExplainer is fast |
| API Backend | FastAPI | Async Python API framework; auto-generates Swagger docs; <100ms latency |
| Dashboard | Next.js 16 + React 19 + TypeScript | Modern web dashboard consuming FastAPI; 6 interactive tabs |
| Charts | Recharts | Interactive data visualizations in the Next.js dashboard |
# requirements.txt
# Data
pandas>=2.0
numpy>=1.24
openpyxl>=3.1 # Excel file reading
# ML
scikit-learn>=1.3
xgboost>=2.0
tensorflow>=2.13
optuna>=3.3
# Explainability
shap>=0.43
# API
fastapi>=0.104
uvicorn>=0.24
pydantic>=2.4
# Dashboard (Next.js — installed via npm, not pip)
# cd dashboard && npm install
# npm run dev
# Visualizations (notebooks & reports)
plotly>=5.17
matplotlib>=3.7
seaborn>=0.12
# Utilities
joblib>=1.3 # Model serialization
scipy>=1.11 # FFT for vibration analysis
XGBoost over LightGBM/CatBoost:
All three are excellent gradient boosting frameworks. We chose XGBoost because:
- Most documentation and community support for the MultiOutputRegressor pattern
- XGBoost 2.x natively supports multi-output regression (experimental but functional)
- shap.TreeExplainer has first-class support for XGBoost — critical for Module 3
- Slightly better performance than LightGBM on small datasets per benchmarks
FastAPI over Flask/Django:
- Built-in async support → lower latency for real-time predictions
- Automatic Swagger UI at /docs → immediate API documentation for judges
- Pydantic models → automatic request validation with helpful error messages
- 3–5× faster than Flask for similar workloads per benchmarks
Next.js over Streamlit/Dash/Gradio:
- Production-grade web framework — more polished & professional than Streamlit for demos
- React component architecture enables 6 fully interactive tabs with Recharts visualizations
- Real-time What-If updates via REST API calls (<100ms round-trip)
- TypeScript provides type safety when consuming the FastAPI response schemas
SHAP over LIME:
- Faster for tree models (TreeExplainer uses exact Shapley values, not approximations)
- Mathematically consistent (satisfies all Shapley axioms: efficiency, symmetry, dummy, additivity)
- Better visualization library (beeswarm, waterfall, force plots all built-in)
- Active development; widely cited in pharma manufacturing literature
Option A — 7 separate models:
Simple. One XGBRegressor per target. Easy to tune individually.
Problem: Ignores correlations between targets. In our data, Hardness and Friability are almost perfectly anti-correlated (ρ = −0.99). A multi-output model can exploit this shared signal.
Option B — Multi-output regression (our choice):
MultiOutputRegressor wraps one model per target but shares the same feature engineering.
For MLP: shared hidden layers + separate output heads explicitly shares learning.
RegressorChain (alternative): each target prediction uses previous target as additional input — useful when targets have causal ordering.
Advantage: Consistent predictions (e.g., if Compression_Force is high, model predicts high Hardness AND low Friability coherently), faster to train + evaluate.
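A minimal sketch of Option B in scikit-learn, on synthetic stand-in data (RandomForest stands in for XGBRegressor here):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# 60 batches, 8 process parameters, 7 targets (synthetic stand-in data).
X, Y = make_regression(n_samples=60, n_features=8, n_targets=7,
                       noise=2.0, random_state=0)

# One regressor per target behind a single fit/predict interface;
# the real pipeline uses XGBRegressor as the base estimator.
model = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=100, random_state=0))
model.fit(X, Y)
preds = model.predict(X[:1])  # all 7 targets in one call
```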
XGBoost is particularly well-suited for small tabular datasets because:
- Regularization built-in: L1 (reg_alpha) and L2 (reg_lambda) penalties prevent overfitting
- Subsampling: subsample and colsample_bytree parameters introduce randomness like bagging
- Early stopping: stops training when validation performance plateaus
- Feature importance: Gains-based or SHAP-based; both work well on small data
Research on manufacturing quality prediction consistently shows XGBoost outperforming deep learning on small tabular datasets (n < 200). The intuition is: deep neural networks need large data to learn representations; tree ensembles learn directly from feature values and their interactions.
An LSTM Autoencoder is trained as follows:
- Feed it normal time-series data (power + vibration over 211 timesteps)
- It learns to compress the signal → reconstruct it accurately
- For new (potentially anomalous) data: if reconstruction error is high, the pattern is unfamiliar = anomaly
This approach is unsupervised — we don't need labeled anomaly examples. We only need examples of "normal" behavior.
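The same train-on-normal, score-by-reconstruction-error logic can be demonstrated with a linear autoencoder (truncated SVD) in plain NumPy, as a lightweight stand-in for the LSTM-AE. All signals below are synthetic, and the linear version lacks the LSTM's temporal modelling:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 211  # timesteps per batch (one power reading per minute)
t = np.arange(T)

def normal_batch():
    # Hypothetical "normal" profile: a smooth phase curve with slight
    # batch-to-batch amplitude variation plus sensor noise.
    amp = 1.0 + 0.1 * rng.normal()
    return amp * np.sin(t / 30.0) + 0.05 * rng.normal(size=T)

X_train = np.stack([normal_batch() for _ in range(40)])  # normal batches only

# "Encoder" = projection onto top principal components of normal data.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = Vt[:3]  # 3 latent dimensions

def reconstruction_error(x):
    code = (x - mean) @ components.T         # compress
    recon = code @ components + mean         # reconstruct
    return float(np.mean((x - recon) ** 2))  # high error = unfamiliar pattern

normal_err = reconstruction_error(normal_batch())
spiked = normal_batch()
spiked[10:20] += 3.0  # power spike during the Preparation phase
anomaly_err = reconstruction_error(spiked)
```

The spiked batch reconstructs far worse than a fresh normal batch, which is exactly the signal the LSTM-AE thresholds on.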
Key advantage over simple statistical methods: LSTM captures temporal dependencies — it knows that "power rises in Compression phase following Blending phase" is normal, but "power spikes in Preparation phase" is not. A simple threshold on max power would miss the phase-context.
Research shows LSTM-AE achieves 93.6% detection accuracy on IoT sensor anomalies, versus 88–91% for Isolation Forest alone. Our hybrid approach (IF for fast screening + LSTM-AE for deep validation) gets the best of both.
With n=60, variance is the enemy. Any single model's performance on a test set of 12 batches could fluctuate significantly depending on which 12 batches ended up in the test set. Stacking mitigates this by:
- Diversity: XGBoost, RF, and MLP all see the same data differently
- Out-of-fold stacking: Meta-learner is trained on predictions that base models never saw during their own training → no leakage
- Meta-learner: Ridge regression is simple, regularized, and interpretable — it learns the optimal blend rather than averaging blindly
Research on ensemble stacking for regression tasks consistently shows 2–8% improvement in R² over the best single base model on small datasets.
Most teams will either (a) skip energy prediction entirely or (b) just add random noise to T001 data. We use physics-informed scaling rules derived from motor power equations:
Motor power ∝ torque × angular velocity
Torque in compression ∝ applied force × die diameter
Angular velocity ∝ machine speed (rpm)
Therefore: Power_Compression ∝ Compression_Force × Machine_Speed
Dryer energy ∝ latent heat of evaporation × moisture removed
Therefore: Power_Drying ∝ Drying_Temp × (Moisture_In - Moisture_Out)
This makes our simulated sensor data physically realistic, not just statistically similar. Judges who know manufacturing physics will notice and appreciate this.
The hackathon asks for predictions of multiple targets. We go one step further: a single 0–100 score that blends all quality targets into one number. This is operationally valuable — instead of tracking 7 separate metrics, a quality manager gets one number: "Batch T045 scored 87/100."
Weights are derived from pharmaceutical quality standards where dissolution rate (drug release efficacy) is the most critical parameter.
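A sketch of the scoring logic. The weights and the normalized sub-scores below are hypothetical placeholders (six quality targets shown); the real weights come from pharmaceutical quality standards, with Dissolution_Rate highest:

```python
# Hypothetical weights; must sum to 1.0 so the blend stays on a 0-100 scale.
WEIGHTS = {
    "dissolution_rate": 0.30,
    "content_uniformity": 0.20,
    "hardness": 0.15,
    "friability": 0.15,
    "disintegration_time": 0.10,
    "tablet_weight": 0.10,
}

def composite_quality_score(normalized):
    """Blend per-target scores (each pre-normalized to 0-100, higher =
    better) into one 0-100 number."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * normalized[k] for k in WEIGHTS)

score = composite_quality_score({
    "dissolution_rate": 91, "content_uniformity": 98, "hardness": 89,
    "friability": 81, "disintegration_time": 83, "tablet_weight": 95,
})
```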
Standard anomaly detection flags individual bad batches. CUSUM (Cumulative Sum control chart) detects gradual drift across batches — a slow deterioration in quality or energy efficiency that individual batch checks would miss.
Example: Compression force gradually increasing over 20 batches (tooling wear) would not trigger single-batch anomaly alerts, but CUSUM would catch the trend.
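A one-sided upper CUSUM over the tooling-wear example might look like this. The slack k and threshold h are common textbook defaults, and the force series is synthetic:

```python
import numpy as np

def cusum_upper(values, target, k=0.5, h=4.0):
    """One-sided upper CUSUM: flags gradual upward drift.

    k (slack) and h (decision threshold) are expressed in units of the
    series' standard deviation, per common control-chart practice."""
    values = np.asarray(values, dtype=float)
    sd = values.std() or 1.0
    s, alarms = 0.0, []
    for i, x in enumerate(values):
        s = max(0.0, s + (x - target) / sd - k)  # accumulate excess over target
        if s > h:
            alarms.append(i)  # batch index where drift is flagged
    return alarms

# 20 batches of compression force with a slow upward creep (tooling wear):
# no single batch looks anomalous, but the cumulative trend does.
forces = [12.0 + 0.08 * i for i in range(20)]
alarms = cusum_upper(forces, target=12.0)
```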
Instead of a fixed annual carbon target, we use an algorithm that:
- Watches actual batch performance
- Identifies what the best 10% of batches achieved
- Sets the next target 5% better than that
- Anchors against the official India CEA grid emission factor (0.716 kg CO₂e/kWh)
This is aligned with real-world sustainability frameworks (science-based targets, internal carbon pricing programs) and uses real published data rather than made-up numbers.
Each batch gets a radar chart ("fingerprint") showing all 7 normalized quality targets. The historical best batch is shown as a golden overlay. Operators can immediately see: "Our target batch looks like this; last batch was close on most dimensions but Dissolution_Rate was weak."
This is a visual, intuitive version of the "golden signature" concept from Track B — achieved using our prediction models.
| Metric | Current (Baseline) | With Our System | Saving |
|---|---|---|---|
| Batch rejection rate | ~8–12% (industry avg) | < 5% (prevented by pre-batch optimization) | 40–60% reduction in waste |
| Energy per batch | 76.74 kWh (T001) | ~68 kWh (≈11% reduction via parameter guidance) | ~8.7 kWh/batch |
| Carbon per batch | 54.9 kg CO₂e | ~48.7 kg CO₂e | 6.2 kg CO₂e/batch |
| Maintenance decision lead time | 0 (reactive) | 1–3 batches early warning | Prevent catastrophic failure |
| Time to quality insight | 2–4 hours post-batch | < 1 second (pre-batch prediction) | Near-instant |
For a facility running 500 batches/year:
- Energy savings: 500 × 8.7 kWh = 4,350 kWh/year → ~₹30,000–50,000/year in electricity
- Carbon savings: 500 × 6.2 kg = 3,100 kg CO₂e/year (~3 tonnes)
- Rejection reduction: If each rejected batch wastes ₹50,000 in materials + energy, preventing 30 rejections/year = ₹15 lakh/year saved
Investment: Development + cloud hosting + integration ≈ ₹5–15 lakh
Annual return: ₹15–20 lakh (conservative)
ROI: > 100% in Year 1
Payback: 6–12 months
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Models don't hit R² ≥ 0.90 | Very Low | High | Data shows 0.96–0.99 correlations; even basic linear regression would likely work |
| Sensor simulation too far from reality | Medium | Medium | Use physics-based scaling, not random noise; present as explicit assumption in slides |
| LSTM Autoencoder training time | Medium | Low | Can skip LSTM-AE in time crunch; Isolation Forest alone is sufficient for submission |
| Dashboard too slow for demo | Low | High | All inference < 100ms; pre-load models at startup; cache SHAP values |
| API doesn't start cleanly | Low | High | Add /health endpoint; test all endpoints before presentation |
| Team runs out of time | Medium | Medium | Core deliverables are Module 1 + Module 2; Modules 3 and 4 are bonus |
MUST HAVE (Pass criteria):
✅ Multi-target XGBoost prediction model (R² ≥ 0.90)
✅ Basic evaluation metrics table
✅ FastAPI /predict endpoint
✅ Next.js dashboard with Predictions tab (Tab 1)
SHOULD HAVE (Good submission):
✅ SHAP explainability
✅ Isolation Forest anomaly detection
✅ Energy simulation + phase charts
NICE TO HAVE (Strong submission):
✅ LSTM Autoencoder
✅ Carbon tracker with adaptive targets
✅ All 6 dashboard tabs (including Benchmark tab)
✅ What-If Optimizer
✅ Batch fingerprinting + CUSUM drift
- Scikit-learn MultiOutputRegressor documentation — trains one regressor per target; confirmed working with XGBoost
- XGBoost 2.x introduces native multi-output regression support (multi_strategy="multi_output_tree") — experimental but functional
- Research finding: ensemble methods (RF + AdaBoost combination) outperform single models on small manufacturing datasets, achieving up to 86% prediction fitness even with n < 200
- Lundberg & Lee (2017) introduced SHAP as a game-theoretically optimal feature attribution method
- Pharmaceutical applications: SHAP has been specifically applied to predict drug dissolution, identify critical process parameters (CPPs), and support Quality by Design (QbD) frameworks
- shap.TreeExplainer provides exact Shapley values for tree models in O(TLD²) time — efficient enough for real-time use
- Hybrid LSTM-AE + Isolation Forest approach achieves 91.5–93.6% accuracy on IoT sensor anomaly detection tasks, with inference latency ≈ 40–50ms on constrained hardware
- LSTM-AE achieves higher recall (catches more true anomalies) while Isolation Forest has lower latency — combination provides best of both
- Vibration-based anomaly detection for wind turbines using LSTM-AE shows same approach is valid for rotating equipment (relevant to our Milling and Compression phases)
- Central Electricity Authority (CEA), India: official grid emission factor for FY 2022-23 = 0.716 tCO₂/MWh (0.716 kg CO₂e/kWh)
  Source: CO2 Baseline Database for the Indian Power Sector, Version 19.0
- IEA Emissions Factors 2025: country-level grid emission factors for international comparison
- GHG Protocol Scope 2 guidance: location-based emission factors should be used for grid electricity; our implementation follows this standard
- Optimization studies on batch pharmaceutical processes show potential for 70–83% energy reduction in optimized versus nominal operations (Sampat et al., 2022)
- Quality by Design (QbD) and Process Analytical Technology (PAT) regulatory frameworks require predictive models for critical quality attributes — our system aligns with these frameworks
- Current Good Manufacturing Practices (cGMP) require process understanding and control — SHAP explainability directly supports this regulatory requirement
# Clone/create project directory
mkdir manufacturing-intelligence && cd manufacturing-intelligence
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Place data files
mkdir -p data/raw
cp /path/to/_h_batch_process_data.xlsx data/raw/
cp /path/to/_h_batch_production_data.xlsx data/raw/

# Step 1: Preprocessing + Feature Engineering + Model Training
python src/run_pipeline.py
# Step 2: Start API backend
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
# Step 3: Start Next.js dashboard (in a new terminal)
cd dashboard
npm install # first time only
npm run dev

Dashboard: http://localhost:3000
API Docs: http://localhost:8000/docs
API Health: http://localhost:8000/api/health
# Test prediction endpoint
curl -X POST "http://localhost:8000/api/predict" \
-H "Content-Type: application/json" \
-d '{
"granulation_time": 16,
"binder_amount": 9.0,
"drying_temp": 60,
"drying_time": 29,
"compression_force": 12.0,
"machine_speed": 170,
"lubricant_conc": 1.2,
"moisture_content": 2.0
}'
# Expected response:
# {
# "hardness": 89.4,
# "friability": 0.81,
# "dissolution_rate": 90.7,
# "content_uniformity": 98.2,
# "disintegration_time": 8.3,
# "tablet_weight": 202.1,
# "energy_kwh": 72.4,
# "carbon_kg_co2e": 51.8,
# "composite_quality_score": 82.3
# }

notebooks/01_EDA.ipynb ← Start here for data understanding
notebooks/02_feature_engineering.ipynb
notebooks/03_multitarget_models.ipynb ← Core ML, ~15 min runtime
notebooks/04_anomaly_detection.ipynb ← LSTM training, ~20 min runtime
notebooks/05_explainability.ipynb ← SHAP plots
| Term | Definition |
|---|---|
| Multi-Output Regression | ML where one model predicts several numerical outputs simultaneously |
| SHAP | SHapley Additive exPlanations — method to explain individual ML predictions using game theory |
| Isolation Forest | Unsupervised anomaly detection algorithm that isolates outliers via random tree partitioning |
| LSTM Autoencoder | Deep learning model that compresses and reconstructs time-series; anomalies = high reconstruction error |
| Stacking Ensemble | ML technique where multiple base model predictions are combined by a meta-learner |
| Optuna | Python framework for automatic hyperparameter optimization using Bayesian search |
| Friability | Measure of tablet brittleness — % weight loss after tumbling test; lower is better |
| Dissolution Rate | % of drug released in specified time under standard conditions; key efficacy metric |
| Content Uniformity | How consistently the drug dose is distributed across tablets in a batch; target = 100% |
| GEF | Grid Emission Factor — kg CO₂e emitted per kWh of grid electricity consumed |
| CEA | Central Electricity Authority — Indian government body publishing official grid emission factors |
| QbD | Quality by Design — regulatory framework requiring understanding of process-quality relationships |
| CUSUM | Cumulative Sum control chart — statistical method for detecting gradual process drift |
| CQS | Composite Quality Score — our custom 0–100 unified quality metric |
| PAT | Process Analytical Technology — real-time monitoring framework for pharmaceutical manufacturing |
Use this before the presentation to verify all deliverables:
MODEL PERFORMANCE
[ ] XGBoost MultiOutput trained and evaluated
[ ] Random Forest trained and evaluated
[ ] MLP Neural Network trained and evaluated
[ ] Stacking Ensemble trained and evaluated
[ ] All primary targets show R² ≥ 0.90
[ ] Evaluation table ready (R², MAE, RMSE, MAPE per target)
[ ] 5-fold cross-validation scores documented
ANOMALY DETECTION
[ ] Sensor simulation complete for T002–T060
[ ] Anomaly injection complete (10% of batches)
[ ] Isolation Forest trained and scored
[ ] LSTM Autoencoder trained (or skip if time constrained)
[ ] Root cause rules implemented and tested
[ ] Precision/Recall metrics computed
EXPLAINABILITY
[ ] SHAP TreeExplainer initialized
[ ] Beeswarm plot generated for primary targets
[ ] Waterfall plot working for individual batch
[ ] SHAP values exported for dashboard use
CARBON MODULE
[ ] Energy derivation formula implemented
[ ] Carbon calculation using 0.716 kg CO₂e/kWh
[ ] Adaptive target algorithm implemented
[ ] Trend analysis working
API
[ ] POST /api/predict returns all 7 targets + carbon
[ ] POST /api/anomaly returns score + root causes
[ ] GET /api/explain returns SHAP contributions
[ ] GET /api/health returns 200 OK
[ ] All endpoints tested with curl/Swagger
DASHBOARD (Next.js — http://localhost:3000)
[ ] Tab 1: Predictions tab — all sliders + predictions
[ ] Tab 2: Energy monitor — phase chart + anomaly detection
[ ] Tab 3: Batch comparison — radar chart
[ ] Tab 4: Carbon trend chart + targets
[ ] Tab 5: What-If optimizer (real-time, < 100ms)
[ ] Tab 6: Benchmark — model metrics from /api/model_metrics
[ ] Dashboard loads in < 5 seconds
[ ] All predictions display in < 2 seconds
PRESENTATION
[ ] Architecture diagram slide
[ ] Results table slide (R² scores)
[ ] SHAP beeswarm slide
[ ] Energy pattern demo slide
[ ] Gap analysis slide (honest + thoughtful)
[ ] Future work slide (scalable, futuristic ideas)
[ ] Business impact numbers (ROI, energy savings, carbon)
[ ] Live demo prepared and rehearsed
Document version: 1.0 | AI-Driven Manufacturing Intelligence Hackathon
Keep this document updated as implementation progresses.