An AI-powered distillation intelligence platform that combines physics-informed machine learning with Google Gemini to deliver real-time purity predictions, natural-language diagnostics, SHAP-based explainability, anomaly detection, what-if simulations, and operator-facing optimization recommendations — all from a single dashboard.
Distillation Column AI is a full-stack intelligent platform for predicting, analyzing, and optimizing ethanol purity in industrial distillation columns. It goes far beyond a simple prediction tool — it acts as an AI copilot for chemical process engineers.
The platform operates on three layers:
Prediction Layer: A physics-informed Random Forest model built from a dataset of 4,000+ real distillation experiments (about 1,200 rows after cleaning and balancing). It transforms 7 raw sensor inputs into 21 engineered features and predicts ethanol mole fraction with R² ≈ 0.98.
Intelligence Layer: Google Gemini interprets every prediction in plain English, uses SHAP values to explain why the model made its decision, detects anomalies in real time, and suggests concrete optimization actions.
Interaction Layer: A conversational AI assistant that lets operators ask questions like "Why did purity drop?" or "What should I adjust to reach 92%?" and get context-aware answers.
| Aspect | Detail |
| --- | --- |
| Domain | Chemical Engineering: Distillation & Separation |
| Core AI | Google Gemini Pro: natural-language diagnostics & conversational assistant |
| ML Model | Random Forest & XGBoost (best auto-selected), 21 physics-augmented features |
| Explainability | SHAP (SHapley Additive exPlanations): per-prediction waterfall & force plots |
|  | Streamlit dashboard with integrated AI copilot & conversational chat |
✨ AI Capabilities at a Glance
| # | Capability | Powered By | Description |
| --- | --- | --- | --- |
| 1 | Real-Time Purity Prediction | Random Forest / XGBoost | Predicts ethanol mole fraction from 7 operating parameters in under 50 ms |
| 2 | Natural Language Diagnostics | Google Gemini Pro | AI writes a plain-English report explaining what the prediction means and why |
| 3 | SHAP Explainability | SHAP Library | Per-prediction waterfall chart showing each feature's contribution |
| 4 | Anomaly Detection | Isolation Forest + Rules | Flags sensor drift, mass-balance violations, and out-of-distribution inputs |
| 5 | Smart Alerts | Gemini + Statistical Engine | Context-aware warnings with specific remediation instructions |
| 6 | What-If Simulator | Model + Gemini | Change any parameter virtually and see the predicted impact before acting |
| 7 | Optimization Advisor | Gemini + Sensitivity Analysis | Actionable recommendations: "Increase reflux by 8% to gain +3.2% purity" |
| 8 | Target Purity Solver | Inverse Model + Gemini | Specify desired purity and get recommended operating conditions |
| 9 | Energy Efficiency Advisor | Gemini + Domain Rules | Balances purity targets against steam/energy consumption |
| 10 | Conversational AI Chat | Gemini Chat API | Ask anything about your column in natural language |
| 11 | Trend Analysis | Time-Series Detection | Identifies gradual degradation patterns across sequential predictions |
| 12 | Batch Analysis | Parallel Inference + Gemini | Upload multiple operating snapshots and get comparative AI analysis |
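As a rough illustration of the What-If Simulator idea (capability 6), the sketch below perturbs one operating parameter and reports the change in predicted purity. The function `what_if_sweep`, the stand-in `toy_predict`, and the baseline values are all hypothetical; the real platform would call its trained model instead.

```python
from typing import Callable, Dict, List, Tuple

Inputs = Dict[str, float]


def what_if_sweep(
    predict: Callable[[Inputs], float],
    baseline: Inputs,
    parameter: str,
    percent_changes: List[float],
) -> List[Tuple[float, float, float]]:
    """Return (percent_change, new_value, purity_delta) for each trial value."""
    base_purity = predict(baseline)
    results = []
    for pct in percent_changes:
        trial = dict(baseline)  # copy so the baseline is never mutated
        trial[parameter] = baseline[parameter] * (1.0 + pct / 100.0)
        results.append((pct, trial[parameter], predict(trial) - base_purity))
    return results


def toy_predict(x: Inputs) -> float:
    # Hypothetical stand-in, NOT the trained model: purity grows with L / V.
    return max(0.0, min(1.0, 0.5 + 0.4 * x["L"] / x["V"]))


# Illustrative operating point (values are assumptions, not plant data).
baseline = {"P": 1.0, "T1": 78.5, "L": 780.0, "V": 1000.0,
            "D": 220.0, "B": 360.0, "F": 580.0}

results = what_if_sweep(toy_predict, baseline, "L", [-10.0, 0.0, 10.0])
for pct, value, delta in results:
    print(f"L {pct:+.0f}% -> {value:.0f} kmol/hr, purity change {delta:+.4f}")
```

Passing the predictor in as a callable keeps the sweep independent of which model (Random Forest or XGBoost) was selected.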
🏗️ System Architecture
graph TB
subgraph INPUT["Operator Input Layer"]
A["Pressure P"]
B["Temperature T1"]
C["Reflux Flow L"]
D["Vapor Flow V"]
E["Distillate D"]
F["Bottoms B"]
G["Feed Flow F"]
end
subgraph ENGINE["Physics Feature Engine — 21 Features"]
direction TB
H["Derived Ratios: Reflux, Reboiler, Condenser, Feed Norm, D/F, B/F"]
I["Interaction Terms: Reflux x Temp, Reboiler x Temp, Feed x Reflux"]
J["Efficiency Metrics: Column Load, Separation Duty, Column Efficiency"]
K["Temperature Approximation: T_bottom = T1 + 20C"]
end
subgraph ML["ML Prediction Layer"]
L["StandardScaler — Normalize 21 Features"]
M["Random Forest / XGBoost — Best Auto-Selected"]
end
subgraph AI["AI Intelligence Layer — Gemini Core"]
N["Gemini NLP — Diagnostic Report Generator"]
O["SHAP Engine — Feature Attribution Analysis"]
P["Anomaly Detector — Sensor Drift and Outlier Detection"]
Q["Optimization Advisor — Actionable Recommendations"]
R["Conversational AI — Natural Language Chat Interface"]
end
subgraph OUTPUT["Intelligent Output Layer"]
S["Purity Prediction with Confidence Score"]
T["AI Diagnostic Report in Plain English"]
U["SHAP Waterfall and Force Plots"]
V["Anomaly Alerts and Smart Warnings"]
W["Optimization Recommendations"]
end
INPUT --> ENGINE
ENGINE --> ML
ML --> AI
AI --> OUTPUT
style INPUT fill:#E3F2FD,stroke:#1565C0,stroke-width:2px,color:#0D47A1
style ENGINE fill:#FFF3E0,stroke:#E65100,stroke-width:2px,color:#BF360C
style ML fill:#E8F5E9,stroke:#2E7D32,stroke-width:2px,color:#1B5E20
style AI fill:#EDE7F6,stroke:#4527A0,stroke-width:2px,color:#311B92
style OUTPUT fill:#FCE4EC,stroke:#C62828,stroke-width:2px,color:#B71C1C
style A fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
style B fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
style C fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
style D fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
style E fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
style F fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
style G fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
style H fill:#FFE0B2,stroke:#E65100,color:#BF360C
style I fill:#FFE0B2,stroke:#E65100,color:#BF360C
style J fill:#FFE0B2,stroke:#E65100,color:#BF360C
style K fill:#FFE0B2,stroke:#E65100,color:#BF360C
style L fill:#C8E6C9,stroke:#2E7D32,color:#1B5E20
style M fill:#C8E6C9,stroke:#2E7D32,color:#1B5E20
style N fill:#D1C4E9,stroke:#4527A0,color:#311B92
style O fill:#D1C4E9,stroke:#4527A0,color:#311B92
style P fill:#D1C4E9,stroke:#4527A0,color:#311B92
style Q fill:#D1C4E9,stroke:#4527A0,color:#311B92
style R fill:#D1C4E9,stroke:#4527A0,color:#311B92
style S fill:#FFCDD2,stroke:#C62828,color:#B71C1C
style T fill:#FFCDD2,stroke:#C62828,color:#B71C1C
style U fill:#FFCDD2,stroke:#C62828,color:#B71C1C
style V fill:#FFCDD2,stroke:#C62828,color:#B71C1C
style W fill:#FFCDD2,stroke:#C62828,color:#B71C1C
🔄 AI Intelligence Pipeline
flowchart TD
subgraph PHASE1["Phase 1: Data and Feature Preparation"]
A["Load Raw Dataset — 4000+ experiments"] --> B["Clean Data — Duplicates, nulls, outliers"]
B --> C["Engineer 21 Physics-Based Features"]
C --> D["Balance Dataset — Undersample dominant ranges"]
end
subgraph PHASE2["Phase 2: Model Training and Validation"]
E["Train/Val/Test Split — 60/20/20"] --> F["StandardScaler — Fit on train only"]
F --> G["Train XGBoost + Random Forest"]
G --> H["5-Fold Cross-Validation"]
H --> I["Auto-Select Best Model by R2"]
end
subgraph PHASE3["Phase 3: AI Intelligence Integration"]
J["Integrate SHAP — Feature Attribution Engine"] --> K["Build Anomaly Detection — Isolation Forest + Rules"]
K --> L["Connect Gemini Pro API — NLP Diagnostics"]
L --> M["Build What-If Simulator — Sensitivity Engine"]
M --> N["Build Conversational AI — Gemini Chat"]
end
subgraph PHASE4["Phase 4: Deployment"]
O["Export model.pkl, scaler.pkl, features_names.pkl"] --> P["Streamlit AI Dashboard — Full Platform"]
end
PHASE1 --> PHASE2
PHASE2 --> PHASE3
PHASE3 --> PHASE4
style PHASE1 fill:#E3F2FD,stroke:#1565C0,stroke-width:2px,color:#0D47A1
style PHASE2 fill:#FFF3E0,stroke:#E65100,stroke-width:2px,color:#BF360C
style PHASE3 fill:#EDE7F6,stroke:#4527A0,stroke-width:2px,color:#311B92
style PHASE4 fill:#E8F5E9,stroke:#2E7D32,stroke-width:2px,color:#1B5E20
style A fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
style B fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
style C fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
style D fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
style E fill:#FFE0B2,stroke:#E65100,color:#BF360C
style F fill:#FFE0B2,stroke:#E65100,color:#BF360C
style G fill:#FFE0B2,stroke:#E65100,color:#BF360C
style H fill:#FFE0B2,stroke:#E65100,color:#BF360C
style I fill:#FFE0B2,stroke:#E65100,color:#BF360C
style J fill:#D1C4E9,stroke:#4527A0,color:#311B92
style K fill:#D1C4E9,stroke:#4527A0,color:#311B92
style L fill:#D1C4E9,stroke:#4527A0,color:#311B92
style M fill:#D1C4E9,stroke:#4527A0,color:#311B92
style N fill:#D1C4E9,stroke:#4527A0,color:#311B92
style O fill:#C8E6C9,stroke:#2E7D32,color:#1B5E20
style P fill:#C8E6C9,stroke:#2E7D32,color:#1B5E20
The platform's AI Copilot is powered by Google Gemini Pro and serves as an intelligent assistant that understands distillation chemistry, interprets ML predictions, and communicates in natural language.
How Gemini is Used
| Function | Gemini Role | Example Output |
| --- | --- | --- |
| Diagnostic Report | Interprets prediction + SHAP in context | "Your column is operating efficiently at 87.4% purity. The high reflux ratio is the primary driver." |
| Root Cause Analysis | Analyzes why purity is low | "Purity dropped to 71% because top temperature rose to 84°C; water is condensing into the product stream." |
| Optimization Advice | Recommends specific parameter changes | "Increase reflux flow L from 780 to 840 kmol/hr (+7.7%) to reach your 90% purity target." |
| Anomaly Explanation | Explains detected anomalies in plain English | "Warning: Mass balance deviation detected. D + B exceeds F by 12%. Check flow sensors." |
| Conversational Chat | Answers operator questions | Operator asks: "Why is separation duty so high?" → Gemini responds with a contextual answer |
| What-If Narration | Describes simulation results | "If you reduce vapor flow by 10%, predicted purity drops from 87.4% to 83.1%. This may still be acceptable." |
| Shift Handover Summary | Generates end-of-shift report | "During this shift, purity averaged 86.2%. Two anomaly alerts were triggered. Reflux was stable." |
Gemini Prompt Architecture
The system constructs structured prompts that include:
| Prompt Component | Content |
| --- | --- |
| System Context | Role definition: "You are a chemical engineering AI assistant specialized in distillation column operations." |
| Prediction Data | Current purity value, confidence interval, and status classification |
| SHAP Summary | Top 5 feature contributions (positive and negative drivers) |
| Anomaly Flags | Any detected anomalies with severity levels |
| Operating Context | All 7 input parameters + 14 derived metrics |
| Historical Baseline | Comparison against typical training data distributions |
| User Question | (For chat mode) The operator's natural-language question |
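One way the prompt components listed above might be assembled into a single text prompt is sketched here. The function name `build_prompt`, the section markers, and all example values are assumptions for illustration; the sketch stops short of calling the Gemini API and only shows the string construction.

```python
from typing import Dict, List, Optional, Tuple

SYSTEM_CONTEXT = (
    "You are a chemical engineering AI assistant specialized in "
    "distillation column operations."
)


def build_prompt(
    prediction: Dict[str, object],
    shap_top: List[Tuple[str, float]],
    anomalies: List[str],
    operating: Dict[str, float],
    baseline_note: str,
    question: Optional[str] = None,
) -> str:
    """Concatenate the prompt components in a fixed, labeled order."""
    lines = [SYSTEM_CONTEXT, "", "## Prediction Data"]
    lines.append(
        f"purity={prediction['purity']:.4f}, "
        f"interval={prediction['interval']}, status={prediction['status']}"
    )
    lines.append("## SHAP Summary")
    for name, value in shap_top[:5]:  # top 5 drivers only
        lines.append(f"{name}: {value:+.4f}")
    lines.append("## Anomaly Flags")
    lines.extend(anomalies or ["none"])
    lines.append("## Operating Context")
    lines.extend(f"{key}={value}" for key, value in operating.items())
    lines.append("## Historical Baseline")
    lines.append(baseline_note)
    if question:  # chat mode only
        lines.append("## User Question")
        lines.append(question)
    return "\n".join(lines)


# Illustrative values only.
prompt = build_prompt(
    prediction={"purity": 0.8742, "interval": (0.86, 0.89), "status": "OPTIMAL"},
    shap_top=[("Reflux_Ratio", 0.031), ("Temp_Bottom", -0.012)],
    anomalies=[],
    operating={"L": 780.0, "V": 1000.0, "F": 580.0},
    baseline_note="Reflux_Ratio is near the training median.",
    question="Why is separation duty so high?",
)
print(prompt)
```

Keeping the sections in a fixed order makes prompts easy to diff and log when a response needs to be audited later.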
Conversational AI — Example Interactions
┌──────────────────────────────────────────────────────────────────────┐
│ 💬 AI COPILOT CHAT │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ 👷 Operator: "Why is purity only 74% today?" │
│ │
│ 🤖 Gemini: "Based on your current inputs, purity is low because: │
│ │
│ 1. Your reflux ratio (0.62) is well below the optimal range │
│ of 1.2–2.5. This means insufficient liquid is returning │
│ to the column for separation. │
│ │
│ 2. Top temperature (83.5°C) is elevated — at this temperature, │
│ significant water vapor enters the distillate product. │
│ │
│ Recommended actions: │
│ • Increase reflux flow (L) from 520 to at least 780 kmol/hr │
│ • Reduce reboiler steam (V) by 10% to lower top temperature │
│ • Expected improvement: +8 to +12% purity" │
│ │
│ 👷 Operator: "What if I only increase reflux to 700?" │
│ │
│ 🤖 Gemini: "Simulating... With L = 700 kmol/hr (keeping other │
│ parameters constant), predicted purity rises to 79.3%. │
│ This crosses into ACCEPTABLE range but remains below OPTIMAL. │
│ For 82%+, you would need L ≥ 760 kmol/hr." │
│ │
└──────────────────────────────────────────────────────────────────────┘
🔬 SHAP Explainability Engine
Every prediction includes a full SHAP (SHapley Additive exPlanations) breakdown so operators and engineers can understand exactly why the model made its decision.
SHAP Output Components
| Component | What It Shows | Use Case |
| --- | --- | --- |
| Waterfall Chart | Stepwise feature contributions from base value to final prediction | Understand exactly which features pushed purity up or down |
| Force Plot | Single-row compact visualization of all 21 feature effects | Quick at-a-glance summary for operators |
| Summary Plot (Beeswarm) | Global feature importance across all predictions | Identify which features matter most overall |
| Dependence Plot | How one feature's effect changes as its value changes | Explore non-linear relationships |
| Interaction Plot | Two-way feature interactions captured by the model |  |
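The waterfall chart relies on the additive property of SHAP values: the base value plus the sum of all per-feature contributions equals the model's prediction. The sketch below demonstrates that property and picks the top drivers by absolute contribution. The helper names and the numbers are made up for illustration and do not come from the trained model; no SHAP library call is shown.

```python
from typing import Dict, List, Tuple

Driver = Tuple[str, float]


def reconstruct_prediction(base_value: float, contributions: Dict[str, float]) -> float:
    """SHAP additivity: prediction = base value + sum of contributions."""
    return base_value + sum(contributions.values())


def top_drivers(contributions: Dict[str, float], k: int = 5) -> Tuple[List[Driver], List[Driver]]:
    """Split the k largest contributions (by magnitude) into positive and negative."""
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]
    positive = [(name, value) for name, value in ranked if value > 0]
    negative = [(name, value) for name, value in ranked if value < 0]
    return positive, negative


# Hypothetical contributions for a single prediction.
base_value = 0.80
contributions = {
    "Reflux_Ratio": 0.05,
    "Temp_Bottom": -0.02,
    "Column_Load": 0.03,
    "Feed_Normalized": -0.01,
    "Separation_Duty": 0.02,
    "Pressure": 0.0042,
}

prediction = reconstruct_prediction(base_value, contributions)
positive, negative = top_drivers(contributions, k=5)
print(f"prediction = {prediction:.4f}")
print("pushing purity up:", positive)
print("pushing purity down:", negative)
```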
Each recommendation includes an estimated energy cost change.
Multi-Objective Ranking: Ranks recommendations by purity gain per unit energy cost.
Gemini Interpretation: AI translates numerical recommendations into actionable operator instructions.
Example Optimization Output
┌──────────────────────────────────────────────────────────────────────┐
│ 📈 OPTIMIZATION RECOMMENDATIONS │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ Current Purity: 0.8742 (87.42%) │
│ Target Purity: 0.9000 (90.00%) │
│ Gap to Target: +2.58% │
│ │
│ RECOMMENDED ACTIONS (ranked by impact / energy cost): │
│ │
│ 1. ↑ Increase Reflux Flow (L): 780 → 842 kmol/hr (+7.9%) │
│ → Expected purity gain: +2.1% │
│ → Energy impact: +4.2% steam consumption │
│ → Confidence: High │
│ │
│ 2. ↓ Reduce Top Temperature (T1): 78.5 → 77.2°C (-1.7%) │
│ → Expected purity gain: +0.8% │
│ → Energy impact: Neutral (condenser adjustment) │
│ → Confidence: Medium │
│ │
│ 3. ↓ Reduce Feed Flow (F): 580 → 555 kmol/hr (-4.3%) │
│ → Expected purity gain: +0.5% │
│ → Energy impact: -2.1% steam consumption │
│ → Confidence: Medium │
│ │
│ 💡 AI NOTE: Combining actions 1 and 2 should reach 90.1% purity │
│ with a net energy increase of only 3.8%. │
│ │
└──────────────────────────────────────────────────────────────────────┘
🧬 Feature Engineering
The prediction model's accuracy stems from physics-informed feature engineering — transforming raw sensor values into features that capture the thermodynamics of distillation.
All 21 Features
| # | Feature Name | Category | Formula / Description |
| --- | --- | --- | --- |
| 1 | Pressure | 🔵 Core | Operating pressure (bar) |
| 2 | L | 🔵 Core | Reflux flow rate (kmol/hr) |
| 3 | V | 🔵 Core | Vapor flow rate (kmol/hr) |
| 4 | D | 🔵 Core | Distillate flow rate (kmol/hr) |
| 5 | B | 🔵 Core | Bottoms flow rate (kmol/hr) |
| 6 | F | 🔵 Core | Feed flow rate (kmol/hr) |
| 7 | Temp_Bottom | 🟠 Reference | T₁ + 20°C (reboiler approximation) |
| 8 | Reflux_Ratio | 🟢 Derived | L / V |
| 9 | Reboiler_Intensity | 🟢 Derived | V / F |
| 10 | Condenser_Load | 🟢 Derived | L / F |
| 11 | Feed_Normalized | 🟢 Derived | F / Mean(F_train) |
| 12 | Distillate_Withdrawal | 🟢 Derived | D / F |
| 13 | Bottoms_Withdrawal | 🟢 Derived | B / F |
| 14 | Column_Load | 🟣 Efficiency | (L + V) / F |
| 15 | Reflux_x_Temp_Top | 🔴 Interaction | Reflux_Ratio × T₁ |
| 16 | Reflux_x_Temp_Diff | 🔴 Interaction | Reflux_Ratio × (T_bottom − T_top) |
| 17 | Reboiler_x_Temp_Bottom | 🔴 Interaction | Reboiler_Intensity × T_bottom |
| 18 | Feed_x_Reflux | 🔴 Interaction | Feed_Normalized × Reflux_Ratio |
| 19 | Feed_x_Reboiler | 🔴 Interaction | Feed_Normalized × Reboiler_Intensity |
| 20 | Separation_Duty | 🟣 Efficiency | Reflux_Ratio × Reboiler_Intensity |
| 21 | Column_Efficiency | 🟣 Efficiency | Reflux_Ratio × Column_Load |
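The formulas in the table can be written out as a single function. This is a minimal sketch, not the project's actual code: the function name `engineer_features` and the `feed_mean` argument (standing in for Mean(F_train)) are assumptions, and T_top is taken to be the measured T₁.

```python
from typing import Dict


def engineer_features(
    P: float, T1: float, L: float, V: float,
    D: float, B: float, F: float, feed_mean: float,
) -> Dict[str, float]:
    """Map the 7 raw inputs to the 21 features, in table order."""
    temp_bottom = T1 + 20.0            # reboiler approximation
    reflux_ratio = L / V
    reboiler_intensity = V / F
    feed_normalized = F / feed_mean    # feed_mean: mean feed flow of the training set
    column_load = (L + V) / F
    return {
        "Pressure": P, "L": L, "V": V, "D": D, "B": B, "F": F,
        "Temp_Bottom": temp_bottom,
        "Reflux_Ratio": reflux_ratio,
        "Reboiler_Intensity": reboiler_intensity,
        "Condenser_Load": L / F,
        "Feed_Normalized": feed_normalized,
        "Distillate_Withdrawal": D / F,
        "Bottoms_Withdrawal": B / F,
        "Column_Load": column_load,
        "Reflux_x_Temp_Top": reflux_ratio * T1,
        "Reflux_x_Temp_Diff": reflux_ratio * (temp_bottom - T1),  # assumes T_top = T1
        "Reboiler_x_Temp_Bottom": reboiler_intensity * temp_bottom,
        "Feed_x_Reflux": feed_normalized * reflux_ratio,
        "Feed_x_Reboiler": feed_normalized * reboiler_intensity,
        "Separation_Duty": reflux_ratio * reboiler_intensity,
        "Column_Efficiency": reflux_ratio * column_load,
    }


# Illustrative operating point (values are assumptions).
features = engineer_features(P=1.0, T1=78.5, L=780.0, V=1000.0,
                             D=220.0, B=360.0, F=580.0, feed_mean=580.0)
print(len(features), "features")
print("Reflux_Ratio =", features["Reflux_Ratio"])
```

Under the T₁ + 20°C approximation, the temperature difference is always 20°C, so `Reflux_x_Temp_Diff` is simply 20 × `Reflux_Ratio` in this sketch.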
Feature Category Summary
| Category | Count | Purpose |
| --- | --- | --- |
| 🔵 Core Parameters | 6 | Direct sensor readings from DCS/SCADA |
| 🟠 Temperature Reference | 1 | Approximated reboiler temperature |
| 🟢 Derived Ratios | 6 | Normalized operating conditions |
| 🔴 Interaction Terms | 5 | Capture thermodynamic coupling effects |
| 🟣 Efficiency Metrics | 3 | Combined column performance indicators |
📈 Model Performance
Model Comparison
| Metric | Random Forest | XGBoost | Winner |
| --- | --- | --- | --- |
| R² Score | ~0.98 | ~0.97 | 🏆 Random Forest |
| RMSE | 0.0155 | 0.0178 | 🏆 Random Forest |
| MAE | 0.0122 | 0.0141 | 🏆 Random Forest |
| 5-Fold CV (Mean R²) | ~0.97 | ~0.96 | 🏆 Random Forest |
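For readers less familiar with the metrics in the comparison, the sketch below computes R², RMSE, and MAE from their standard definitions. The sample values are invented for illustration and are unrelated to the reported results.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented purity values (mole fraction).
y_true = [0.70, 0.80, 0.90]
y_pred = [0.72, 0.79, 0.88]
print(f"R2={r2(y_true, y_pred):.3f}  RMSE={rmse(y_true, y_pred):.4f}  MAE={mae(y_true, y_pred):.4f}")
```

Because the target is a mole fraction between 0 and 1, an RMSE of 0.0155 corresponds to roughly 1.6 percentage points of purity.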
Model Specifications
| Property | Value |
| --- | --- |
| Selected Model | Random Forest Regressor |
| Estimators | 200 trees |
| Max Depth | 20 |
| Min Samples Split | 5 |
| Training Samples | ~1,200 (balanced) |
| Original Dataset | 4,000+ experiments |
| Cross-Validation | 5-Fold (shuffled) |
| Scaling | StandardScaler (fit on train only) |
Validation Strategy
┌──────────────────────────────────────────────────────────────────┐
│ Full Dataset (4,000+) │
├────────────────────┬──────────────────────────────────────────────┤
│ Cleaned & Balanced │ ~1,200 rows after outlier removal & │
│ │ undersampling of dominant middle range │
├──────────┬─────────┴──────────┬───────────────────────────────────┤
│ Train │ Validation │ Test │
│ 60% │ 20% │ 20% │
│ │ │ │
│ Scaler │ Hyperparameter │ Final unbiased │
│ fitted │ tuning & model │ performance │
│ here │ selection │ evaluation │
└──────────┴────────────────────┴───────────────────────────────────┘
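The key point of the diagram is that scaling statistics come from the training split only, so no information from the validation or test rows leaks into preprocessing. The dependency-free sketch below illustrates a 60/20/20 split and train-only standardization for a single feature column. The helper names and the seed are assumptions; in practice scikit-learn's `train_test_split` and `StandardScaler` would typically do this work.

```python
import random
from typing import List, Sequence, Tuple


def split_indices(n: int, seed: int = 42) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle row indices and split them 60/20/20 into train/val/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * 60 // 100
    n_val = n * 20 // 100
    return indices[:n_train], indices[n_train:n_train + n_val], indices[n_train + n_val:]


def fit_scaler(column: Sequence[float]) -> Tuple[float, float]:
    """Mean and population standard deviation (the statistics StandardScaler uses)."""
    mean = sum(column) / len(column)
    std = (sum((x - mean) ** 2 for x in column) / len(column)) ** 0.5
    return mean, std or 1.0  # avoid dividing by zero for a constant column


def transform(column: Sequence[float], mean: float, std: float) -> List[float]:
    return [(x - mean) / std for x in column]


# Synthetic single-feature data standing in for ~1,200 balanced rows.
rng = random.Random(0)
feature = [rng.uniform(400.0, 900.0) for _ in range(1200)]

train_idx, val_idx, test_idx = split_indices(len(feature))
mean, std = fit_scaler([feature[i] for i in train_idx])       # fit on train only
train_scaled = transform([feature[i] for i in train_idx], mean, std)
val_scaled = transform([feature[i] for i in val_idx], mean, std)    # reuse train stats
test_scaled = transform([feature[i] for i in test_idx], mean, std)  # reuse train stats
print(len(train_idx), len(val_idx), len(test_idx))
```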
# 1. Clone the repository
git clone https://github.com/Mausam5055/Distillation-Column-Prediction.git
cd Distillation-Column-Prediction
# 2. Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate   # Windows
# 3. Install dependencies
pip install -r requirements.txt
# 4. Set up your Gemini API key
export GEMINI_API_KEY="your-api-key-here"  # Linux/Mac
# set GEMINI_API_KEY=your-api-key-here     # Windows
# 5. Launch the AI platform
streamlit run app.py
The app will open automatically at http://localhost:8501
Note: The ML prediction features work without an API key. The AI-powered diagnostics, conversational chat, and optimization advisor require a valid Google Gemini API key.
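A simple way to gate the Gemini-dependent features on the presence of the key is sketched below. The function name `gemini_features_enabled` is hypothetical and the app's actual check may differ; only the `GEMINI_API_KEY` variable name comes from the setup steps.

```python
import os
from typing import Mapping, Optional


def gemini_features_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when a non-empty GEMINI_API_KEY is available."""
    env = os.environ if env is None else env
    return bool(env.get("GEMINI_API_KEY", "").strip())


if gemini_features_enabled():
    print("Gemini diagnostics, chat, and optimization advisor are available.")
else:
    print("No API key found: running with ML prediction features only.")
```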
📁 Project Structure
Distillation-Column-Prediction/
│
├── 📄 README.md # Project documentation (this file)
├── 📄 LICENSE # MIT License
├── 📄 requirements.txt # Python dependencies
├── 📄 .env.example # Template for API keys
├── 📄 .gitattributes # Git LFS / attribute config
│
├── 🐍 app.py # Main Streamlit AI dashboard
├── 🐍 model_training_script.py # Full ML training pipeline (Colab-ready)
├── 🤖 ai_copilot.py # Gemini AI integration — diagnostics & chat
├── 🔬 shap_engine.py # SHAP explainability module
├── 🚨 anomaly_detector.py # Multi-layer anomaly detection system
├── 🧪 whatif_simulator.py # What-if scenario engine & optimizer
│
├── 🤖 model.pkl # Trained Random Forest model (~8 MB)
├── 📏 scaler.pkl # Fitted StandardScaler
├── 📋 features_names.pkl # Ordered list of 21 feature names
├── 📊 feature_importance.png # Feature importance bar chart
│
├── 📂 sample_data/
│ └── 📄 dataset_distill.csv # Training dataset (~579 KB)
│
└── 📂 .devcontainer/
└── 📄 devcontainer.json # GitHub Codespaces configuration
File Descriptions
| File | Purpose |
| --- | --- |
| app.py | Main Streamlit dashboard: input UI, AI controls, prediction display, chat interface |
| ai_copilot.py | Gemini Pro integration: prompt construction, diagnostic report generation, conversational AI |
| shap_engine.py | SHAP computation: waterfall charts, force plots, feature attribution analysis |
| anomaly_detector.py | Multi-layer anomaly detection: range check, mass balance, Isolation Forest, drift detection |
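The two rule-based layers named for `anomaly_detector.py`, the range check and the mass balance check, can be sketched as follows. This is an illustration under assumptions rather than the module's real code: the function names, the 5% tolerance, and the range limits are all hypothetical. The Isolation Forest and drift layers are not shown.

```python
from typing import Dict, List, Optional, Tuple


def mass_balance_deviation(F: float, D: float, B: float) -> float:
    """Relative deviation of D + B from F (0.0 means the balance closes)."""
    return (D + B - F) / F


def check_mass_balance(F: float, D: float, B: float, tolerance: float = 0.05) -> Optional[str]:
    """Return a warning string when |deviation| exceeds the (assumed) tolerance."""
    deviation = mass_balance_deviation(F, D, B)
    if abs(deviation) <= tolerance:
        return None
    direction = "exceeds" if deviation > 0 else "falls short of"
    return f"Mass balance deviation: D + B {direction} F by {abs(deviation):.0%}"


def check_ranges(inputs: Dict[str, float], limits: Dict[str, Tuple[float, float]]) -> List[str]:
    """Flag every input that lies outside its (low, high) limit."""
    alerts = []
    for name, (low, high) in limits.items():
        value = inputs[name]
        if not low <= value <= high:
            alerts.append(f"{name}={value} outside [{low}, {high}]")
    return alerts


# Hypothetical limits and readings.
limits = {"T1": (70.0, 90.0), "L": (300.0, 1200.0)}
readings = {"T1": 93.0, "L": 780.0}

print(check_mass_balance(F=500.0, D=250.0, B=310.0))
print(check_ranges(readings, limits))
```

Running the cheap rule-based checks before any statistical detector lets obviously bad sensor readings be rejected with a clear, explainable message.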
Built with ❤️ for Chemical Engineering, Machine Learning & Artificial Intelligence
Author: Krishna Narayan Singh · Last Updated: April 2026 · Version: 2.0 (AI-Powered)