⚗️ Distillation Column AI

Intelligent Prediction · Explainable Diagnostics · Real-Time Optimization

An AI-powered distillation intelligence platform that combines physics-informed machine learning with Google Gemini to deliver real-time purity predictions, natural-language diagnostics, SHAP-based explainability, anomaly detection, what-if simulations, and operator-facing optimization recommendations — all from a single dashboard.

📑 Table of Contents

Overview
AI Capabilities at a Glance
System Architecture
AI Intelligence Pipeline
Application Workflow
AI Copilot — Gemini Integration
SHAP Explainability Engine
Anomaly Detection & Smart Alerts
What-If Simulator & Optimization
Feature Engineering
Model Performance
Quick Start
Project Structure
Input Parameters
Output Interpretation
AI Diagnostic Report
Model Training
Tech Stack
Roadmap
Contributing
License
Contact

🔍 Overview

Distillation Column AI is a full-stack intelligent platform for predicting, analyzing, and optimizing ethanol purity in industrial distillation columns. It goes far beyond a simple prediction tool — it acts as an AI copilot for chemical process engineers.

The platform operates on three layers:

Prediction Layer: A physics-informed Random Forest model built from a dataset of 4,000+ real distillation experiments (about 1,200 rows remain after cleaning and balancing). It transforms 7 raw sensor inputs into 21 engineered features and predicts ethanol mole fraction with R² ≈ 0.98.
Intelligence Layer: Google Gemini interprets every prediction in plain English, uses SHAP values to explain which features drove it, flags anomalies in real time, and suggests concrete optimization actions.
Interaction Layer: A conversational AI assistant that lets operators ask questions like "Why did purity drop?" or "What should I adjust to reach 92%?" and get context-aware answers.

Aspect	Detail
Domain	Chemical Engineering — Distillation & Separation
Core AI	Google Gemini Pro — Natural-language diagnostics & conversational assistant
ML Model	Random Forest & XGBoost (best auto-selected) — 21 physics-augmented features
Explainability	SHAP (SHapley Additive exPlanations) — per-prediction waterfall & force plots
Anomaly Detection	Isolation Forest + statistical bounds + mass-balance physics rules
Optimization	What-if simulator · sensitivity analysis · target purity solver · energy advisor
Accuracy	R² ≈ 0.98 · RMSE 0.0155 · MAE 0.0122 (mole fraction), roughly ±1.55 percentage points of purity
Interface	Streamlit dashboard with integrated AI copilot & conversational chat

✨ AI Capabilities at a Glance

#	Capability	Powered By	Description
1	Real-Time Purity Prediction	Random Forest / XGBoost	Predicts ethanol mole fraction from 7 operating parameters in under 50 ms
2	Natural Language Diagnostics	Google Gemini Pro	AI writes a plain-English report explaining what the prediction means and why
3	SHAP Explainability	SHAP Library	Per-prediction waterfall chart showing each feature's contribution
4	Anomaly Detection	Isolation Forest + Rules	Flags sensor drift, mass-balance violations, and out-of-distribution inputs
5	Smart Alerts	Gemini + Statistical Engine	Context-aware warnings with specific remediation instructions
6	What-If Simulator	Model + Gemini	Change any parameter virtually and see the predicted impact before acting
7	Optimization Advisor	Gemini + Sensitivity Analysis	Actionable recommendations: "Increase reflux by 8% to gain +3.2% purity"
8	Target Purity Solver	Inverse Model + Gemini	Specify desired purity and get recommended operating conditions
9	Energy Efficiency Advisor	Gemini + Domain Rules	Balances purity targets against steam/energy consumption
10	Conversational AI Chat	Gemini Chat API	Ask anything about your column in natural language
11	Trend Analysis	Time-Series Detection	Identifies gradual degradation patterns across sequential predictions
12	Batch Analysis	Parallel Inference + Gemini	Upload multiple operating snapshots and get comparative AI analysis

🏗️ System Architecture

graph TB
    subgraph INPUT["Operator Input Layer"]
        A["Pressure P"]
        B["Temperature T1"]
        C["Reflux Flow L"]
        D["Vapor Flow V"]
        E["Distillate D"]
        F["Bottoms B"]
        G["Feed Flow F"]
    end

    subgraph ENGINE["Physics Feature Engine — 21 Features"]
        direction TB
        H["Derived Ratios: Reflux, Reboiler, Condenser, Feed Norm, D/F, B/F"]
        I["Interaction Terms: Reflux x Temp, Reboiler x Temp, Feed x Reflux"]
        J["Efficiency Metrics: Column Load, Separation Duty, Column Efficiency"]
        K["Temperature Approximation: T_bottom = T1 + 20C"]
    end

    subgraph ML["ML Prediction Layer"]
        L["StandardScaler — Normalize 21 Features"]
        M["Random Forest / XGBoost — Best Auto-Selected"]
    end

    subgraph AI["AI Intelligence Layer — Gemini Core"]
        N["Gemini NLP — Diagnostic Report Generator"]
        O["SHAP Engine — Feature Attribution Analysis"]
        P["Anomaly Detector — Sensor Drift and Outlier Detection"]
        Q["Optimization Advisor — Actionable Recommendations"]
        R["Conversational AI — Natural Language Chat Interface"]
    end

    subgraph OUTPUT["Intelligent Output Layer"]
        S["Purity Prediction with Confidence Score"]
        T["AI Diagnostic Report in Plain English"]
        U["SHAP Waterfall and Force Plots"]
        V["Anomaly Alerts and Smart Warnings"]
        W["Optimization Recommendations"]
    end

    INPUT --> ENGINE
    ENGINE --> ML
    ML --> AI
    AI --> OUTPUT

    style INPUT fill:#E3F2FD,stroke:#1565C0,stroke-width:2px,color:#0D47A1
    style ENGINE fill:#FFF3E0,stroke:#E65100,stroke-width:2px,color:#BF360C
    style ML fill:#E8F5E9,stroke:#2E7D32,stroke-width:2px,color:#1B5E20
    style AI fill:#EDE7F6,stroke:#4527A0,stroke-width:2px,color:#311B92
    style OUTPUT fill:#FCE4EC,stroke:#C62828,stroke-width:2px,color:#B71C1C
    style A fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
    style B fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
    style C fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
    style D fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
    style E fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
    style F fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
    style G fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
    style H fill:#FFE0B2,stroke:#E65100,color:#BF360C
    style I fill:#FFE0B2,stroke:#E65100,color:#BF360C
    style J fill:#FFE0B2,stroke:#E65100,color:#BF360C
    style K fill:#FFE0B2,stroke:#E65100,color:#BF360C
    style L fill:#C8E6C9,stroke:#2E7D32,color:#1B5E20
    style M fill:#C8E6C9,stroke:#2E7D32,color:#1B5E20
    style N fill:#D1C4E9,stroke:#4527A0,color:#311B92
    style O fill:#D1C4E9,stroke:#4527A0,color:#311B92
    style P fill:#D1C4E9,stroke:#4527A0,color:#311B92
    style Q fill:#D1C4E9,stroke:#4527A0,color:#311B92
    style R fill:#D1C4E9,stroke:#4527A0,color:#311B92
    style S fill:#FFCDD2,stroke:#C62828,color:#B71C1C
    style T fill:#FFCDD2,stroke:#C62828,color:#B71C1C
    style U fill:#FFCDD2,stroke:#C62828,color:#B71C1C
    style V fill:#FFCDD2,stroke:#C62828,color:#B71C1C
    style W fill:#FFCDD2,stroke:#C62828,color:#B71C1C

🔄 AI Intelligence Pipeline

flowchart TD
    subgraph PHASE1["Phase 1: Data and Feature Preparation"]
        A["Load Raw Dataset — 4000+ experiments"] --> B["Clean Data — Duplicates, nulls, outliers"]
        B --> C["Engineer 21 Physics-Based Features"]
        C --> D["Balance Dataset — Undersample dominant ranges"]
    end

    subgraph PHASE2["Phase 2: Model Training and Validation"]
        E["Train/Val/Test Split — 60/20/20"] --> F["StandardScaler — Fit on train only"]
        F --> G["Train XGBoost + Random Forest"]
        G --> H["5-Fold Cross-Validation"]
        H --> I["Auto-Select Best Model by R2"]
    end

    subgraph PHASE3["Phase 3: AI Intelligence Integration"]
        J["Integrate SHAP — Feature Attribution Engine"] --> K["Build Anomaly Detection — Isolation Forest + Rules"]
        K --> L["Connect Gemini Pro API — NLP Diagnostics"]
        L --> M["Build What-If Simulator — Sensitivity Engine"]
        M --> N["Build Conversational AI — Gemini Chat"]
    end

    subgraph PHASE4["Phase 4: Deployment"]
        O["Export model.pkl, scaler.pkl, features_names.pkl"] --> P["Streamlit AI Dashboard — Full Platform"]
    end

    PHASE1 --> PHASE2
    PHASE2 --> PHASE3
    PHASE3 --> PHASE4

    style PHASE1 fill:#E3F2FD,stroke:#1565C0,stroke-width:2px,color:#0D47A1
    style PHASE2 fill:#FFF3E0,stroke:#E65100,stroke-width:2px,color:#BF360C
    style PHASE3 fill:#EDE7F6,stroke:#4527A0,stroke-width:2px,color:#311B92
    style PHASE4 fill:#E8F5E9,stroke:#2E7D32,stroke-width:2px,color:#1B5E20
    style A fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
    style B fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
    style C fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
    style D fill:#BBDEFB,stroke:#1565C0,color:#0D47A1
    style E fill:#FFE0B2,stroke:#E65100,color:#BF360C
    style F fill:#FFE0B2,stroke:#E65100,color:#BF360C
    style G fill:#FFE0B2,stroke:#E65100,color:#BF360C
    style H fill:#FFE0B2,stroke:#E65100,color:#BF360C
    style I fill:#FFE0B2,stroke:#E65100,color:#BF360C
    style J fill:#D1C4E9,stroke:#4527A0,color:#311B92
    style K fill:#D1C4E9,stroke:#4527A0,color:#311B92
    style L fill:#D1C4E9,stroke:#4527A0,color:#311B92
    style M fill:#D1C4E9,stroke:#4527A0,color:#311B92
    style N fill:#D1C4E9,stroke:#4527A0,color:#311B92
    style O fill:#C8E6C9,stroke:#2E7D32,color:#1B5E20
    style P fill:#C8E6C9,stroke:#2E7D32,color:#1B5E20

🧑‍💻 Application Workflow

flowchart TD
    START(["Operator Opens AI Dashboard"]) --> LOAD{"Model + AI Services Ready?"}

    LOAD -- "No" --> ERR["Display Error: Missing model or API key"]
    ERR --> STOP(["App Stops"])

    LOAD -- "Yes" --> INPUT["Enter 7 Operating Parameters"]
    INPUT --> CALC["Physics Engine: Auto-Calculate 14 Derived Features"]
    CALC --> METRICS["Display Live Operating Metrics"]
    METRICS --> BTN{"Run AI Analysis?"}

    BTN -- "Not yet" --> INPUT

    BTN -- "Yes" --> PREDICT["ML Model: Predict Purity — Random Forest"]
    PREDICT --> SHAP_STEP["SHAP: Compute Feature Attributions"]
    SHAP_STEP --> ANOMALY["Anomaly Detector: Check All Parameters"]
    ANOMALY --> GEMINI["Gemini AI: Generate Diagnostic Report"]
    GEMINI --> CHECK{"Purity Level?"}

    CHECK -- "82% or above" --> GREEN["OPTIMAL: AI confirms healthy operation"]
    CHECK -- "75% to 82%" --> YELLOW["ACCEPTABLE: AI suggests improvements"]
    CHECK -- "Below 75%" --> RED["LOW PURITY: AI flags critical actions"]

    GREEN --> DISPLAY["Full AI Report: Prediction + SHAP + Alerts + Recommendations"]
    YELLOW --> DISPLAY
    RED --> DISPLAY

    DISPLAY --> WHATIF{"Explore What-If?"}
    WHATIF -- "Yes" --> SIMULATOR["What-If Simulator: Adjust parameters virtually"]
    SIMULATOR --> PREDICT
    WHATIF -- "No" --> CHAT{"Ask AI a question?"}
    CHAT -- "Yes" --> COPILOT["Gemini Chat: Conversational AI Assistant"]
    COPILOT --> INPUT
    CHAT -- "No" --> INPUT

    style START fill:#C8E6C9,stroke:#2E7D32,stroke-width:2px,color:#1B5E20
    style STOP fill:#FFCDD2,stroke:#C62828,stroke-width:2px,color:#B71C1C
    style ERR fill:#FFCDD2,stroke:#C62828,stroke-width:2px,color:#B71C1C
    style LOAD fill:#FFF9C4,stroke:#F57F17,stroke-width:2px,color:#E65100
    style INPUT fill:#BBDEFB,stroke:#1565C0,stroke-width:2px,color:#0D47A1
    style CALC fill:#FFE0B2,stroke:#E65100,stroke-width:2px,color:#BF360C
    style METRICS fill:#E1BEE7,stroke:#7B1FA2,stroke-width:2px,color:#4A148C
    style BTN fill:#FFF9C4,stroke:#F57F17,stroke-width:2px,color:#E65100
    style PREDICT fill:#C8E6C9,stroke:#2E7D32,stroke-width:2px,color:#1B5E20
    style SHAP_STEP fill:#D1C4E9,stroke:#4527A0,stroke-width:2px,color:#311B92
    style ANOMALY fill:#D1C4E9,stroke:#4527A0,stroke-width:2px,color:#311B92
    style GEMINI fill:#D1C4E9,stroke:#4527A0,stroke-width:2px,color:#311B92
    style CHECK fill:#FFF9C4,stroke:#F57F17,stroke-width:2px,color:#E65100
    style GREEN fill:#C8E6C9,stroke:#1B5E20,stroke-width:2px,color:#1B5E20
    style YELLOW fill:#FFF9C4,stroke:#F57F17,stroke-width:2px,color:#E65100
    style RED fill:#FFCDD2,stroke:#B71C1C,stroke-width:2px,color:#B71C1C
    style DISPLAY fill:#B3E5FC,stroke:#0277BD,stroke-width:2px,color:#01579B
    style WHATIF fill:#FFF9C4,stroke:#F57F17,stroke-width:2px,color:#E65100
    style SIMULATOR fill:#D1C4E9,stroke:#4527A0,stroke-width:2px,color:#311B92
    style CHAT fill:#FFF9C4,stroke:#F57F17,stroke-width:2px,color:#E65100
    style COPILOT fill:#D1C4E9,stroke:#4527A0,stroke-width:2px,color:#311B92

🤖 AI Copilot — Gemini Integration

The platform's AI Copilot is powered by Google Gemini Pro and serves as an intelligent assistant that understands distillation chemistry, interprets ML predictions, and communicates in natural language.

How Gemini is Used

Function	Gemini Role	Example Output
Diagnostic Report	Interprets prediction + SHAP in context	"Your column is operating efficiently at 87.4% purity. The high reflux ratio is the primary driver."
Root Cause Analysis	Analyzes why purity is low	"Purity dropped to 71% because top temperature rose to 84°C — water is condensing into the product stream."
Optimization Advice	Recommends specific parameter changes	"Increase reflux flow L from 780 to 840 kmol/hr (+7.7%) to reach your 90% purity target."
Anomaly Explanation	Explains detected anomalies in plain English	"Warning: Mass balance deviation detected — D + B exceeds F by 12%. Check flow sensors."
Conversational Chat	Answers operator questions	Operator asks: "Why is separation duty so high?" → Gemini responds with a contextual answer
What-If Narration	Describes simulation results	"If you reduce vapor flow by 10%, predicted purity drops from 87.4% to 83.1%. This may still be acceptable."
Shift Handover Summary	Generates end-of-shift report	"During this shift, purity averaged 86.2%. Two anomaly alerts were triggered. Reflux was stable."

Gemini Prompt Architecture

The system constructs structured prompts that include:

Prompt Component	Content
System Context	Role definition: "You are a chemical engineering AI assistant specialized in distillation column operations."
Prediction Data	Current purity value, confidence interval, and status classification
SHAP Summary	Top 5 feature contributions (positive and negative drivers)
Anomaly Flags	Any detected anomalies with severity levels
Operating Context	All 7 input parameters + 14 derived metrics
Historical Baseline	Comparison against typical training data distributions
User Question	(For chat mode) The operator's natural-language question
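
A minimal sketch of how the components in the table above could be assembled into one prompt string. The function name `build_prompt`, its parameters, and the section labels are assumptions made for illustration; only the system-context sentence comes from the table. The confidence interval and historical baseline are left out for brevity, and no Gemini API call is shown.

```python
from typing import Dict, List, Optional, Tuple

# Role definition quoted from the "System Context" row.
SYSTEM_CONTEXT = (
    "You are a chemical engineering AI assistant specialized in "
    "distillation column operations."
)


def build_prompt(
    purity: float,
    status: str,
    shap_top: List[Tuple[str, float]],
    anomalies: List[str],
    operating_context: Dict[str, float],
    question: Optional[str] = None,
) -> str:
    """Assemble prompt sections in the order of the table (hypothetical layout)."""
    lines = [
        SYSTEM_CONTEXT,
        "",
        "Prediction:",
        f"  purity = {purity:.4f} ({purity:.2%}), status = {status}",
        "",
        "SHAP summary (top drivers):",
    ]
    lines += [f"  {name}: {value:+.4f}" for name, value in shap_top]
    lines += ["", "Anomaly flags:"]
    lines += [f"  {flag}" for flag in anomalies] or ["  none"]
    lines += ["", "Operating context:"]
    lines += [f"  {name} = {value:g}" for name, value in operating_context.items()]
    if question:  # only present in chat mode
        lines += ["", f"Operator question: {question}"]
    return "\n".join(lines)


prompt = build_prompt(
    purity=0.8742,
    status="OPTIMAL",
    shap_top=[("Reflux_x_Temp_Top", 0.0834), ("Temp_Bottom", -0.0312)],
    anomalies=[],
    operating_context={"L": 780, "T1": 78.5},
    question="Why is separation duty so high?",
)
```

Keeping the prompt as plain labelled text makes it easy to log and to diff between predictions; a real integration might prefer a structured format instead.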

Conversational AI — Example Interactions

┌──────────────────────────────────────────────────────────────────────┐
│  💬 AI COPILOT CHAT                                                  │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  👷 Operator: "Why is purity only 74% today?"                        │
│                                                                      │
│  🤖 Gemini: "Based on your current inputs, purity is low because:   │
│                                                                      │
│     1. Your reflux ratio (0.62) is well below the optimal range      │
│        of 1.2–2.5. This means insufficient liquid is returning       │
│        to the column for separation.                                 │
│                                                                      │
│     2. Top temperature (83.5°C) is elevated — at this temperature,   │
│        significant water vapor enters the distillate product.        │
│                                                                      │
│     Recommended actions:                                             │
│     • Increase reflux flow (L) from 520 to at least 780 kmol/hr     │
│     • Reduce reboiler steam (V) by 10% to lower top temperature     │
│     • Expected improvement: +8 to +12% purity"                      │
│                                                                      │
│  👷 Operator: "What if I only increase reflux to 700?"               │
│                                                                      │
│  🤖 Gemini: "Simulating... With L = 700 kmol/hr (keeping other      │
│     parameters constant), predicted purity rises to 79.3%.           │
│     This crosses into ACCEPTABLE range but remains below OPTIMAL.    │
│     For 82%+, you would need L ≥ 760 kmol/hr."                      │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

🔬 SHAP Explainability Engine

Every prediction includes a full SHAP (SHapley Additive exPlanations) breakdown, so operators and engineers can see which features pushed the prediction up or down, and by how much.

SHAP Output Components

Component	What It Shows	Use Case
Waterfall Chart	Stepwise feature contributions from base value to final prediction	Understand exactly which features pushed purity up or down
Force Plot	Single-row compact visualization of all 21 feature effects	Quick at-a-glance summary for operators
Summary Plot (Beeswarm)	Global feature importance across all predictions	Identify which features matter most overall
Dependence Plot	How one feature's effect changes as its value changes	Explore non-linear relationships
Interaction Plot	Two-way feature interactions captured by the model	Understand coupling effects (e.g., Reflux × Temperature)

How SHAP Works in This System

Base Prediction (Training Mean)     →   0.8200
  + Reflux_x_Temp_Top               →  +0.0834   (strongest positive driver)
  + Feed_x_Reflux                   →  +0.0287
  + Bottoms_Withdrawal              →  +0.0156
  - Temp_Bottom                     →  -0.0312   (pushing purity lower)
  - Reboiler_x_Temp_Bottom          →  -0.0198
  - 16 other features (net)         →  -0.0225
  ─────────────────────────────────────────────
  Final Prediction                  →   0.8742   (87.42%)

SHAP Integration Points

Integration	Description
Dashboard Widget	Interactive waterfall chart embedded directly in the prediction result panel
Gemini Context	SHAP values are fed into the Gemini prompt so the AI can reference specific feature contributions
Export	Download SHAP analysis as PNG or interactive HTML for reporting
Comparison Mode	Compare SHAP breakdowns across multiple predictions side-by-side
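
Before SHAP values are shown in a widget or passed to Gemini, they can be reduced to a short list of top drivers. The sketch below ranks contributions by absolute size, lumps the rest into a single net term, and checks the SHAP additivity property (base value plus all contributions equals the prediction). The function name `summarize_shap` and the numbers in the example are illustrative assumptions, not output of the trained model.

```python
import math
from typing import Dict, List, Tuple


def summarize_shap(
    base_value: float,
    contributions: Dict[str, float],
    prediction: float,
    top_k: int = 5,
    tol: float = 1e-6,
) -> Tuple[List[Tuple[str, float]], float]:
    """Return the top_k drivers by |contribution| and the net effect of the rest."""
    total = base_value + sum(contributions.values())
    if not math.isclose(total, prediction, abs_tol=tol):
        # SHAP values are additive; a mismatch means the inputs are inconsistent.
        raise ValueError(f"additivity violated: {total:.6f} != {prediction:.6f}")
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    other_net = sum(value for _, value in ranked[top_k:])
    return ranked[:top_k], other_net


# Illustrative values only (seven features instead of 21).
example = {
    "Reflux_x_Temp_Top": 0.05,
    "Temp_Bottom": -0.03,
    "Feed_x_Reflux": 0.01,
    "Bottoms_Withdrawal": 0.002,
    "Column_Load": -0.001,
    "Pressure": 0.0005,
    "Feed_Normalized": 0.0005,
}
top, other = summarize_shap(base_value=0.80, contributions=example, prediction=0.832)
```

The additivity check is cheap and catches the most common mistake in hand-built waterfall tables: rows that do not sum to the reported prediction.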

🚨 Anomaly Detection & Smart Alerts

The platform runs a multi-layer anomaly detection system on every input, catching issues that even experienced operators might miss.

Detection Layers

Layer	Method	What It Catches	Alert Level
L1 — Range Check	Statistical bounds (μ ± 3σ from training data)	Any parameter outside 99.7% of training distribution	⚠️ Warning
L2 — Mass Balance	Physics rule: D + B ≈ F	Flow sensor errors or leaks (deviation > 5%)	🔴 Critical
L3 — Isolation Forest	Unsupervised ML anomaly scoring	Multi-dimensional outlier (unusual combination of normal-looking values)	⚠️ Warning
L4 — Correlation Monitor	Feature covariance tracking	Broken sensor reading (parameter stops correlating with related features)	🔴 Critical
L5 — Temporal Drift	Moving average + trend detection	Gradual sensor drift over consecutive predictions	ℹ️ Info
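
The first two layers are simple enough to sketch directly. The code below assumes the per-parameter training mean and standard deviation are available (the values used in the example are hypothetical) and applies the 5% mass-balance tolerance from the table. Function names are illustrative.

```python
from typing import Tuple


def outside_training_range(value: float, mean: float, std: float, k: float = 3.0) -> bool:
    """Layer 1: flag a value more than k standard deviations from the training mean."""
    return abs(value - mean) > k * std


def mass_balance_deviation(D: float, B: float, F: float) -> float:
    """Relative deviation of D + B from F (0.0 means a perfect balance)."""
    return abs(D + B - F) / F


def check_mass_balance(D: float, B: float, F: float, tol: float = 0.05) -> Tuple[str, float]:
    """Layer 2: return a status label and the relative deviation."""
    deviation = mass_balance_deviation(D, B, F)
    return ("CRITICAL" if deviation > tol else "OK", deviation)


# D + B = 720 against F = 580; the D/B split is an assumption.
status, deviation = check_mass_balance(D=300.0, B=420.0, F=580.0)

# Hypothetical training statistics for top temperature (mean 78.5 °C, std 1.5 °C).
too_hot = outside_training_range(92.0, mean=78.5, std=1.5)
```

Note that the μ ± 3σ rule covers about 99.7% of values only if the parameter is roughly normally distributed in the training data.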

Smart Alert Examples

Scenario	Alert	AI Recommendation
Top temp = 92°C	🔴 Critical: Temperature far above normal range	"Reduce reboiler steam immediately. At this temperature, product is mostly water."
D + B = 720, F = 580	🔴 Critical: Mass balance violated by 24%	"Flow sensor calibration needed. D + B should equal F. Check distillate flow meter."
Reflux ratio gradually declining over 10 predictions	⚠️ Warning: Temporal drift detected	"Reflux ratio has dropped 15% over the last 10 readings. Check condenser performance."
All values normal but Isolation Forest flags anomaly	⚠️ Warning: Unusual input combination	"Individual parameters look normal, but this combination is rarely seen in training data. Prediction confidence is reduced."
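
The temporal-drift scenario can be approximated by comparing two moving averages over a sliding window. This is a sketch under stated assumptions: the class name `DriftMonitor`, the 10-reading window, and the 5% threshold are illustrative choices, and readings are assumed to be positive.

```python
from collections import deque
from statistics import fmean
from typing import Optional


class DriftMonitor:
    """Compare the mean of the older and newer half of a sliding window."""

    def __init__(self, window: int = 10, threshold: float = 0.05) -> None:
        if window < 2 or window % 2:
            raise ValueError("window must be an even number >= 2")
        self.readings = deque(maxlen=window)
        self.threshold = threshold

    def update(self, value: float) -> Optional[float]:
        """Add a reading; return the relative change if it exceeds the threshold."""
        self.readings.append(value)
        if len(self.readings) < self.readings.maxlen:
            return None  # not enough history yet
        values = list(self.readings)
        half = len(values) // 2
        older, newer = fmean(values[:half]), fmean(values[half:])
        change = (newer - older) / older
        return change if abs(change) >= self.threshold else None


monitor = DriftMonitor()
# Reflux ratio falling linearly by 15% across 10 readings.
results = [monitor.update(1.00 - 0.15 * i / 9) for i in range(10)]
```

Comparing half-window means is less sensitive to a single noisy reading than comparing the first and last values directly.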

🧪 What-If Simulator & Optimization

The What-If Simulator lets operators explore parameter changes before making them on the actual column.

Simulator Features

Feature	Description
Single Parameter Sweep	Adjust one parameter (e.g., Reflux Flow) across a range and see how predicted purity responds
Multi-Parameter Scenario	Change multiple parameters simultaneously and compare to current prediction
Target Purity Solver	Enter a desired purity (e.g., 90%) and get the recommended operating conditions to achieve it
Sensitivity Ranking	See which parameter has the most impact on purity — helps prioritize operator actions
Energy vs Purity Trade-off	Visualize the trade-off between higher purity and increased energy (steam) consumption
AI Narration	Every scenario result is explained in natural language by Gemini
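
A single-parameter sweep and a one-variable target solver can be sketched against any prediction function. Here `toy_predict` is a hypothetical linear stand-in for the trained model, and the bisection in `solve_for_target` assumes purity increases monotonically with the chosen parameter over the search bracket, which a real Random Forest does not guarantee.

```python
from typing import Callable, Dict, List, Optional, Tuple

Predictor = Callable[[Dict[str, float]], float]


def sweep(predict: Predictor, baseline: Dict[str, float], name: str,
          values: List[float]) -> List[Tuple[float, float]]:
    """Predict purity for each candidate value of one parameter; others stay fixed."""
    return [(v, predict(dict(baseline, **{name: v}))) for v in values]


def solve_for_target(predict: Predictor, baseline: Dict[str, float], name: str,
                     target: float, lo: float, hi: float,
                     tol: float = 1e-4, max_iter: int = 60) -> Optional[float]:
    """Bisection for the smallest value in [lo, hi] that reaches the target purity."""
    def purity_at(v: float) -> float:
        return predict(dict(baseline, **{name: v}))

    if purity_at(lo) > target or purity_at(hi) < target:
        return None  # target is not bracketed by the search range
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if purity_at(mid) < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return hi


def toy_predict(inputs: Dict[str, float]) -> float:
    """Hypothetical stand-in for the trained model: purity rises linearly with L."""
    return 0.5 + 0.0005 * inputs["L"]


baseline = {"L": 780.0}
curve = sweep(toy_predict, baseline, "L", [700.0, 800.0])
needed_reflux = solve_for_target(toy_predict, baseline, "L", 0.90, lo=300.0, hi=1200.0)
```

Both helpers copy the baseline rather than mutating it, so the operator's current inputs stay untouched while scenarios are explored.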

Optimization Engine

Capability	How It Works
Gradient-Based Recommendations	Uses partial dependence to find the direction of greatest purity improvement
Constraint-Aware Optimization	Respects physical limits (e.g., can't exceed column flooding capacity)
Energy Penalty Scoring	Each recommendation includes an estimated energy cost change
Multi-Objective Ranking	Ranks recommendations by purity gain per unit energy cost
Gemini Interpretation	AI translates numerical recommendations into actionable operator instructions

Example Optimization Output

┌──────────────────────────────────────────────────────────────────────┐
│  📈 OPTIMIZATION RECOMMENDATIONS                                    │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Current Purity: 0.8742 (87.42%)                                     │
│  Target Purity:  0.9000 (90.00%)                                     │
│  Gap to Target:  +2.58%                                              │
│                                                                      │
│  RECOMMENDED ACTIONS (ranked by impact / energy cost):               │
│                                                                      │
│  1. ↑ Increase Reflux Flow (L): 780 → 842 kmol/hr (+7.9%)           │
│     → Expected purity gain: +2.1%                                    │
│     → Energy impact: +4.2% steam consumption                         │
│     → Confidence: High                                               │
│                                                                      │
│  2. ↓ Reduce Top Temperature (T1): 78.5 → 77.2°C (-1.7%)            │
│     → Expected purity gain: +0.8%                                    │
│     → Energy impact: Neutral (condenser adjustment)                  │
│     → Confidence: Medium                                             │
│                                                                      │
│  3. ↓ Reduce Feed Flow (F): 580 → 555 kmol/hr (-4.3%)               │
│     → Expected purity gain: +0.5%                                    │
│     → Energy impact: -2.1% steam consumption                         │
│     → Confidence: Medium                                             │
│                                                                      │
│  💡 AI NOTE: Combining actions 1 and 2 should reach 90.1% purity    │
│     with a net energy increase of only 3.8%.                         │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

🧬 Feature Engineering

The prediction model's accuracy stems from physics-informed feature engineering — transforming raw sensor values into features that capture the thermodynamics of distillation.

All 21 Features

#	Feature Name	Category	Formula / Description
1	`Pressure`	🔵 Core	Operating pressure (bar)
2	`L`	🔵 Core	Reflux flow rate (kmol/hr)
3	`V`	🔵 Core	Vapor flow rate (kmol/hr)
4	`D`	🔵 Core	Distillate flow rate (kmol/hr)
5	`B`	🔵 Core	Bottoms flow rate (kmol/hr)
6	`F`	🔵 Core	Feed flow rate (kmol/hr)
7	`Temp_Bottom`	🟠 Reference	T₁ + 20°C (reboiler approximation)
8	`Reflux_Ratio`	🟢 Derived	L / V
9	`Reboiler_Intensity`	🟢 Derived	V / F
10	`Condenser_Load`	🟢 Derived	L / F
11	`Feed_Normalized`	🟢 Derived	F / Mean(F_train)
12	`Distillate_Withdrawal`	🟢 Derived	D / F
13	`Bottoms_Withdrawal`	🟢 Derived	B / F
14	`Column_Load`	🟣 Efficiency	(L + V) / F
15	`Reflux_x_Temp_Top`	🔴 Interaction	Reflux_Ratio × T₁
16	`Reflux_x_Temp_Diff`	🔴 Interaction	Reflux_Ratio × (T_bottom − T₁)
17	`Reboiler_x_Temp_Bottom`	🔴 Interaction	Reboiler_Intensity × T_bottom
18	`Feed_x_Reflux`	🔴 Interaction	Feed_Normalized × Reflux_Ratio
19	`Feed_x_Reboiler`	🔴 Interaction	Feed_Normalized × Reboiler_Intensity
20	`Separation_Duty`	🟣 Efficiency	Reflux_Ratio × Reboiler_Intensity
21	`Column_Efficiency`	🟣 Efficiency	Reflux_Ratio × Column_Load
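
The table above translates directly into code. The following is a minimal sketch, not the project's actual implementation: the function name `engineer_features` is illustrative, `feed_mean_train` (the mean feed flow of the training set) must be supplied by the caller, and T₁ is assumed to be in °C as in the input table.

```python
from typing import Dict


def engineer_features(P: float, T1: float, L: float, V: float, D: float,
                      B: float, F: float, feed_mean_train: float) -> Dict[str, float]:
    """Build the 21 features in the order listed in the feature table."""
    temp_bottom = T1 + 20.0              # reboiler temperature approximation
    reflux_ratio = L / V                 # as defined in this document (L / V)
    reboiler_intensity = V / F
    feed_normalized = F / feed_mean_train
    column_load = (L + V) / F
    return {
        "Pressure": P,
        "L": L,
        "V": V,
        "D": D,
        "B": B,
        "F": F,
        "Temp_Bottom": temp_bottom,
        "Reflux_Ratio": reflux_ratio,
        "Reboiler_Intensity": reboiler_intensity,
        "Condenser_Load": L / F,
        "Feed_Normalized": feed_normalized,
        "Distillate_Withdrawal": D / F,
        "Bottoms_Withdrawal": B / F,
        "Column_Load": column_load,
        "Reflux_x_Temp_Top": reflux_ratio * T1,
        "Reflux_x_Temp_Diff": reflux_ratio * (temp_bottom - T1),
        "Reboiler_x_Temp_Bottom": reboiler_intensity * temp_bottom,
        "Feed_x_Reflux": feed_normalized * reflux_ratio,
        "Feed_x_Reboiler": feed_normalized * reboiler_intensity,
        "Separation_Duty": reflux_ratio * reboiler_intensity,
        "Column_Efficiency": reflux_ratio * column_load,
    }


# Illustrative operating point; V, D, B and feed_mean_train are assumptions.
features = engineer_features(P=1.0, T1=78.5, L=780.0, V=1038.0, D=190.0,
                             B=400.0, F=580.0, feed_mean_train=580.0)
```

Python dictionaries preserve insertion order, so the returned mapping already follows the table's numbering, which matters when the values are later passed to the scaler and model as a vector.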

Feature Category Summary

Category	Count	Purpose
🔵 Core Parameters	6	Direct sensor readings from DCS/SCADA
🟠 Temperature Reference	1	Approximated reboiler temperature
🟢 Derived Ratios	6	Normalized operating conditions
🔴 Interaction Terms	5	Capture thermodynamic coupling effects
🟣 Efficiency Metrics	3	Combined column performance indicators

📈 Model Performance

Model Comparison

Metric	Random Forest	XGBoost	Winner
R² Score	~0.98	~0.97	🏆 Random Forest
RMSE	0.0155	0.0178	🏆 Random Forest
MAE	0.0122	0.0141	🏆 Random Forest
5-Fold CV (Mean R²)	~0.97	~0.96	🏆 Random Forest
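
For reference, the three metrics in the comparison table follow the standard definitions: R² = 1 − SS_res / SS_tot, RMSE = √(SS_res / n), and MAE = mean |y − ŷ|. The sketch below computes them in plain Python on a tiny made-up sample; the function name `regression_metrics` is illustrative.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute R², RMSE and MAE from paired true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Made-up mole fractions, for illustration only.
metrics = regression_metrics([0.80, 0.85, 0.90], [0.81, 0.85, 0.89])
```

Because the target is a mole fraction between 0 and 1, an RMSE of 0.0155 corresponds to about 1.55 percentage points of purity.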

Model Specifications

Property	Value
Selected Model	Random Forest Regressor
Estimators	200 trees
Max Depth	20
Min Samples Split	5
Modeling Samples	~1,200 (balanced; split 60/20/20 into train/validation/test)
Original Dataset	4,000+ experiments
Cross-Validation	5-Fold (shuffled)
Scaling	StandardScaler (fit on train only)

Validation Strategy

┌──────────────────────────────────────────────────────────────────┐
│                        Full Dataset (4,000+)                     │
├────────────────────┬──────────────────────────────────────────────┤
│  Cleaned & Balanced │  ~1,200 rows after outlier removal &       │
│                     │  undersampling of dominant middle range     │
├──────────┬─────────┴──────────┬───────────────────────────────────┤
│  Train   │    Validation      │            Test                   │
│  60%     │       20%          │            20%                    │
│          │                    │                                   │
│ Scaler   │  Hyperparameter    │   Final unbiased                  │
│ fitted   │  tuning & model    │   performance                     │
│ here     │  selection         │   evaluation                      │
└──────────┴────────────────────┴───────────────────────────────────┘
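
The key point of the diagram above is that scaling statistics come from the training split only and are then reused unchanged on the validation and test splits. The sketch below shows that idea with the standard library for a single feature column; in practice scikit-learn's `train_test_split` and `StandardScaler` would normally do this work. Function names and the seed are illustrative assumptions.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def split_60_20_20(rows: Sequence[float], seed: int = 42) -> Tuple[List[float], List[float], List[float]]:
    """Shuffle, then cut into 60% train, 20% validation, 20% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 60 // 100
    n_val = len(shuffled) * 20 // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def fit_scaler(train_column: Sequence[float]) -> Tuple[float, float]:
    """Learn mean and population standard deviation from the training split only."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column) or 1.0  # guard against constant columns
    return mean, std


def transform(column: Sequence[float], mean: float, std: float) -> List[float]:
    """Apply previously fitted statistics to any split."""
    return [(x - mean) / std for x in column]


train, val, test = split_60_20_20([float(i) for i in range(100)])
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
val_scaled = transform(val, mean, std)    # reuses the training statistics
test_scaled = transform(test, mean, std)  # no refitting on held-out data
```

Fitting the scaler on all rows would leak information from the validation and test sets into training and make the reported scores optimistic.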

🚀 Quick Start

Prerequisites

Requirement	Version
Python	3.8 or higher
pip	Latest recommended
Git	Any recent version
Google Gemini API Key	Required for AI features (keys are issued through Google AI Studio)

Installation

# 1. Clone the repository
git clone https://github.com/Mausam5055/Distillation-Column-Prediction.git
cd Distillation-Column-Prediction

# 2. Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate        # Linux/Mac
# venv\Scripts\activate         # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Set up your Gemini API key
export GEMINI_API_KEY="your-api-key-here"     # Linux/Mac
# set GEMINI_API_KEY=your-api-key-here        # Windows

# 5. Launch the AI platform
streamlit run app.py

The app will open automatically at http://localhost:8501

Note: The ML prediction features work without an API key. The AI-powered diagnostics, conversational chat, and optimization advisor require a valid Google Gemini API key.
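
One way to implement this graceful degradation is to check the environment once at startup and enable Gemini-backed features only when a key is present. The sketch below is an assumption about how that check could look; the function names and feature labels are illustrative, and only the `GEMINI_API_KEY` variable name comes from the installation steps.

```python
import os
from typing import List, Mapping, Optional


def gemini_available(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when a non-empty GEMINI_API_KEY is set."""
    env = os.environ if env is None else env
    return bool(env.get("GEMINI_API_KEY", "").strip())


def enabled_features(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """ML prediction is always on; Gemini-backed features need a key."""
    features = ["purity_prediction"]
    if gemini_available(env):
        features += ["ai_diagnostics", "conversational_chat", "optimization_advisor"]
    return features


without_key = enabled_features({})
with_key = enabled_features({"GEMINI_API_KEY": "your-api-key-here"})
```

Passing the environment as a parameter keeps the check easy to test without touching the real process environment.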

📁 Project Structure

Distillation-Column-Prediction/
│
├── 📄 README.md                    # Project documentation (this file)
├── 📄 LICENSE                      # MIT License
├── 📄 requirements.txt             # Python dependencies
├── 📄 .env.example                 # Template for API keys
├── 📄 .gitattributes               # Git LFS / attribute config
│
├── 🐍 app.py                      # Main Streamlit AI dashboard
├── 🐍 model_training_script.py    # Full ML training pipeline (Colab-ready)
├── 🤖 ai_copilot.py               # Gemini AI integration — diagnostics & chat
├── 🔬 shap_engine.py              # SHAP explainability module
├── 🚨 anomaly_detector.py         # Multi-layer anomaly detection system
├── 🧪 whatif_simulator.py         # What-if scenario engine & optimizer
│
├── 🤖 model.pkl                   # Trained Random Forest model (~8 MB)
├── 📏 scaler.pkl                  # Fitted StandardScaler
├── 📋 features_names.pkl          # Ordered list of 21 feature names
├── 📊 feature_importance.png      # Feature importance bar chart
│
├── 📂 sample_data/
│   └── 📄 dataset_distill.csv     # Training dataset (~579 KB)
│
└── 📂 .devcontainer/
    └── 📄 devcontainer.json       # GitHub Codespaces configuration

File Descriptions

File	Purpose
`app.py`	Main Streamlit dashboard — input UI, AI controls, prediction display, chat interface
`ai_copilot.py`	Gemini Pro integration — prompt construction, diagnostic report generation, conversational AI
`shap_engine.py`	SHAP computation — waterfall charts, force plots, feature attribution analysis
`anomaly_detector.py`	Multi-layer anomaly detection — range check, mass balance, Isolation Forest, drift detection
`whatif_simulator.py`	What-if engine — parameter sweeps, sensitivity ranking, target purity solver, optimization
`model_training_script.py`	End-to-end ML pipeline: cleaning → feature engineering → training → model selection → export
`model.pkl`	Serialized best model (Random Forest with 200 estimators)
`scaler.pkl`	StandardScaler fitted on training data's 21 features
`features_names.pkl`	Ordered feature list ensuring inference matches training feature order
`dataset_distill.csv`	Semicolon-delimited training dataset with temperatures in Kelvin
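
The role of `features_names.pkl` can be illustrated with a small helper that arranges computed features in the stored order before they reach the scaler and model. This is a sketch: `to_feature_vector` is a hypothetical name, and loading the pickle files is omitted because the serialization library is not stated here.

```python
from typing import List, Mapping, Sequence


def to_feature_vector(features: Mapping[str, float], feature_names: Sequence[str]) -> List[float]:
    """Order feature values exactly as they were ordered during training."""
    missing = [name for name in feature_names if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(features[name]) for name in feature_names]


# Shortened example; the real list holds all 21 names.
names = ["Pressure", "L", "Reflux_Ratio"]
vector = to_feature_vector({"Reflux_Ratio": 0.75, "L": 780, "Pressure": 1.0}, names)
```

Failing loudly on a missing feature is safer than silently filling a default, because a shifted or incomplete vector would still produce a plausible-looking prediction.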

📥 Input Parameters

The platform accepts 7 operating parameters commonly available from process control systems:

#	Parameter	Symbol	Unit	Range	Description
1	Pressure	P	bar	0.5 – 3.0	Column operating pressure
2	Top Temperature	T₁	°C	60 – 120	Temperature at the top tray sensor
3	Reflux Flow	L	kmol/hr	300 – 1,200	Liquid returned from the condenser
4	Vapor Flow	V	kmol/hr	600 – 1,500	Vapor rising from the reboiler
5	Distillate Flow	D	kmol/hr	100 – 500	Top product withdrawal rate
6	Bottoms Flow	B	kmol/hr	100 – 500	Bottom product withdrawal rate
7	Feed Flow	F	kmol/hr	350 – 700	Raw material feed rate
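
The accepted ranges above can be enforced with a small validation step before any prediction runs. The sketch below encodes the table as a dictionary keyed by symbol (`T1` stands in for T₁); the function name `out_of_range` is illustrative.

```python
from typing import Dict, Tuple

# (min, max) per parameter, copied from the input table.
INPUT_RANGES: Dict[str, Tuple[float, float]] = {
    "P": (0.5, 3.0),       # bar
    "T1": (60.0, 120.0),   # °C
    "L": (300.0, 1200.0),  # kmol/hr
    "V": (600.0, 1500.0),  # kmol/hr
    "D": (100.0, 500.0),   # kmol/hr
    "B": (100.0, 500.0),   # kmol/hr
    "F": (350.0, 700.0),   # kmol/hr
}


def out_of_range(inputs: Dict[str, float]) -> Dict[str, Tuple[float, float, float]]:
    """Return {symbol: (value, min, max)} for every input outside its range."""
    violations = {}
    for name, (low, high) in INPUT_RANGES.items():
        value = inputs[name]
        if not low <= value <= high:
            violations[name] = (value, low, high)
    return violations


ok = out_of_range({"P": 1.0, "T1": 78.5, "L": 780, "V": 1038, "D": 190, "B": 400, "F": 580})
bad = out_of_range({"P": 1.0, "T1": 125.0, "L": 250, "V": 1038, "D": 190, "B": 400, "F": 580})
```

These are hard input limits; the statistical range check in the anomaly detector is a separate, tighter test against the training distribution.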

Typical Operating Ranges

Parameter	Normal	⚠️ Critical High	⚠️ Critical Low
Top Temp (T₁)	77 – 80 °C	> 85 °C (water in product)	< 75 °C (subcooled)
Reflux Ratio (L/V)	1.2 – 2.5	> 3.0 (flooding risk)	< 0.6 (poor separation)
Reboiler Vapor (V)	900 – 1,100	> 1,200	< 800
Mass Balance (D+B)	≈ F	D+B >> F	D+B << F

📤 Output Interpretation

Purity Status Levels

Status	Purity Range	Color	AI Behavior
🟢 OPTIMAL	≥ 82%	Green	AI confirms healthy operation and suggests energy savings
🟡 ACCEPTABLE	75% – 82%	Yellow	AI suggests specific improvements to reach optimal range
🔴 LOW PURITY	< 75%	Red	AI flags critical actions with prioritized remediation steps
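
The status thresholds reduce to a three-way comparison on the predicted mole fraction. A minimal sketch, with an illustrative function name:

```python
def classify_purity(purity: float) -> str:
    """Map a predicted ethanol mole fraction (0 to 1) to a status label."""
    if purity >= 0.82:
        return "OPTIMAL"
    if purity >= 0.75:
        return "ACCEPTABLE"
    return "LOW PURITY"


status = classify_purity(0.8742)
```

A prediction of exactly 82% falls into OPTIMAL, matching the "≥ 82%" row of the table.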

What the Operator Receives

Output Component	Description
Prediction Value	Ethanol mole fraction (e.g., 0.8742) and percentage (87.42%)
Confidence Score	How similar the input is to the training data distribution
Status Badge	Color-coded OPTIMAL / ACCEPTABLE / LOW PURITY indicator
SHAP Waterfall	Interactive chart showing each feature's contribution
AI Diagnostic Report	Natural-language explanation from Gemini
Anomaly Alerts	Any detected anomalies with severity and recommendations
Optimization Steps	Ranked list of parameter changes to improve purity

📋 AI Diagnostic Report

Every prediction generates a comprehensive AI diagnostic report:

┌──────────────────────────────────────────────────────────────────────┐
│  🤖 AI DIAGNOSTIC REPORT                                            │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  📊 PREDICTION                                                       │
│  Ethanol Purity:  0.8742 (87.42%)                                    │
│  Status:          🟢 OPTIMAL                                         │
│  Confidence:      High (input within training envelope)              │
│                                                                      │
│  🔬 SHAP ANALYSIS — Top Drivers                                      │
│  ┌─────────────────────────────┬──────────┬─────────────────────┐    │
│  │ Feature                     │ Impact   │ Direction           │    │
│  ├─────────────────────────────┼──────────┼─────────────────────┤    │
│  │ Reflux_x_Temp_Top           │ +0.0834  │ ↑ Pushing UP        │    │
│  │ Feed_x_Reflux               │ +0.0287  │ ↑ Pushing UP        │    │
│  │ Temp_Bottom                 │ -0.0312  │ ↓ Pushing DOWN      │    │
│  │ Reboiler_x_Temp_Bottom      │ -0.0198  │ ↓ Pushing DOWN      │    │
│  │ Bottoms_Withdrawal          │ +0.0156  │ ↑ Pushing UP        │    │
│  └─────────────────────────────┴──────────┴─────────────────────┘    │
│                                                                      │
│  💡 AI INTERPRETATION                                                │
│  "Your column is running well. The reflux ratio (0.751) combined     │
│  with a favorable top temperature (78.5°C) is driving strong         │
│  ethanol separation. Reboiler temperature is slightly elevated —     │
│  reducing steam flow by 3-5% could save energy without impacting     │
│  purity. Overall efficiency score: 8.4 / 10."                       │
│                                                                      │
│  🚨 ANOMALY STATUS                                                   │
│  ✅ Sensor ranges:     All within normal bounds                      │
│  ✅ Mass balance:      D + B ≈ F (deviation: 1.7%)                   │
│  ⚠️ Reboiler load:    Upper quartile (1.79) — monitor               │
│  ✅ Isolation Forest:  No anomaly detected                           │
│  ✅ Temporal drift:    No drift in recent readings                   │
│                                                                      │
│  📈 OPTIMIZATION                                                     │
│  • To reach 90%: Increase L by 8% (780 → 842 kmol/hr)               │
│  • To save energy: Reduce V by 5% (purity drops ~1%)                │
│  • Current efficiency: 8.4 / 10                                     │
│                                                                      │
│  💬 ASK ME ANYTHING                                                  │
│  Type a question below to chat with the AI about this prediction.   │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
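
One way the "Top Drivers" table could be assembled from per-feature SHAP values is sketched below. It assumes that "top" means largest absolute contribution, which is a common convention but is not confirmed by the report above (the sample rows are not in strict magnitude order). The helper name `top_drivers` is hypothetical; the input values are the ones shown in the sample report.

```python
def top_drivers(shap_values, k=5):
    """Rank features by absolute SHAP value and label the direction of each push."""
    ranked = sorted(shap_values.items(), key=lambda item: abs(item[1]), reverse=True)
    return [
        (name, value, "UP" if value > 0 else "DOWN")
        for name, value in ranked[:k]
    ]


if __name__ == "__main__":
    # Values copied from the sample diagnostic report.
    sample = {
        "Reflux_x_Temp_Top": 0.0834,
        "Feed_x_Reflux": 0.0287,
        "Temp_Bottom": -0.0312,
        "Reboiler_x_Temp_Bottom": -0.0198,
        "Bottoms_Withdrawal": 0.0156,
    }
    for name, value, direction in top_drivers(sample):
        print(f"{name:<26} {value:+.4f}  Pushing {direction}")
```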

🎓 Model Training (Colab)

The training pipeline lives in `model_training_script.py` and is designed to run on Google Colab.

Training Steps

Step	Description	Key Details
1	Load Data	Read `dataset_distill.csv` (semicolon-delimited)
2	Unit Conversion	Convert temperatures T1–T14 from Kelvin → Celsius
3	Data Cleaning	Remove duplicates, nulls, and outliers (1st–99th percentile)
4	Feature Engineering	Create 21 features from 6 core parameters + temperature
5	Dataset Balancing	Undersample dominant middle purity range (40%)
6	Train/Val/Test Split	60% train · 20% validation · 20% test
7	Scaling	StandardScaler fit on training data only
8	Model Training	Train XGBoost (500 trees) and Random Forest (200 trees)
9	Cross-Validation	5-Fold CV on training set for both models
10	Model Selection	Auto-select the model with the highest test R²
11	Residual Analysis	Predicted vs Actual, Q-Q Plot, Homoscedasticity check
12	Export Artifacts	Save `model.pkl`, `scaler.pkl`, `features_names.pkl`
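
Steps 6 and 7 are where data leakage most often creeps in, so a small illustration may help. The sketch below uses only the standard library as a stand-in for scikit-learn's `train_test_split` and `StandardScaler`; it is not the code from `model_training_script.py`, and the helper names are hypothetical. The point it demonstrates is that the mean and standard deviation come from the training rows only and are then reused for validation and test rows.

```python
import random
import statistics


def split_indices(n, seed=42):
    """Shuffle row indices and cut them into 60% train, 20% validation, 20% test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * 6 // 10
    n_val = n * 2 // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def fit_scaler(train_column):
    """Compute (mean, std) from training values only, as StandardScaler would."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column) or 1.0  # guard against zero variance
    return mean, std


def transform(column, mean, std):
    """Apply the training-set statistics to any split."""
    return [(x - mean) / std for x in column]


if __name__ == "__main__":
    values = [float(v) for v in range(100)]   # stand-in for one feature column
    train, val, test = split_indices(len(values))
    mean, std = fit_scaler([values[i] for i in train])
    scaled_test = transform([values[i] for i in test], mean, std)
    print(len(train), len(val), len(test))
```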

Model Hyperparameters

🌲 Random Forest Configuration

Parameter	Value
`n_estimators`	200
`max_depth`	20
`min_samples_split`	5
`random_state`	42
`n_jobs`	-1 (all cores)

⚡ XGBoost Configuration

Parameter	Value
`n_estimators`	500
`max_depth`	6
`learning_rate`	0.03
`subsample`	0.85
`colsample_bytree`	0.85
`reg_alpha`	0.1
`reg_lambda`	1.0
`random_state`	42
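
For reference, the two tables above can be expressed as keyword arguments for the corresponding estimators. This is a sketch under the assumption that the regressor classes (`RandomForestRegressor`, `XGBRegressor`) are used, since the target is a continuous mole fraction; the dictionary and function names are illustrative and not taken from the training script.

```python
# Hyperparameters copied from the configuration tables.
RF_PARAMS = {
    "n_estimators": 200,
    "max_depth": 20,
    "min_samples_split": 5,
    "random_state": 42,
    "n_jobs": -1,  # use all CPU cores
}

XGB_PARAMS = {
    "n_estimators": 500,
    "max_depth": 6,
    "learning_rate": 0.03,
    "subsample": 0.85,
    "colsample_bytree": 0.85,
    "reg_alpha": 0.1,
    "reg_lambda": 1.0,
    "random_state": 42,
}


def build_models():
    """Instantiate both candidate models (requires scikit-learn and xgboost)."""
    # Imported lazily so the dictionaries can be inspected without the ML libraries.
    from sklearn.ensemble import RandomForestRegressor
    from xgboost import XGBRegressor

    return {
        "random_forest": RandomForestRegressor(**RF_PARAMS),
        "xgboost": XGBRegressor(**XGB_PARAMS),
    }
```

Keeping the settings in plain dictionaries makes it easy to log them alongside the exported artifacts or to sweep a single value during experiments.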

🛠️ Tech Stack

Category	Technology	Purpose
Language	Python 3.8+	Core programming language
Web Framework	Streamlit	Interactive AI dashboard UI
AI / LLM	Google Gemini Pro API	Natural-language diagnostics, conversational AI, report generation
Explainability	SHAP	Per-prediction feature attribution — waterfall, force, and summary plots
ML Models	scikit-learn, XGBoost	Random Forest & XGBoost training, inference, and cross-validation
Anomaly Detection	scikit-learn (Isolation Forest), scipy.stats	Multi-layer real-time anomaly detection
Data Processing	pandas, NumPy	Data manipulation, feature engineering, and computation
Visualization	Matplotlib, Seaborn, Plotly	Feature importance, SHAP charts, interactive what-if plots
Image Processing	Pillow (PIL)	Display feature importance chart
Serialization	pickle	Model, scaler, and feature persistence
Environment	python-dotenv	Secure API key management
Development	Google Colab	Model training environment
Deployment	GitHub Codespaces	Cloud development via devcontainer

🗺️ Roadmap

Phase	Status	Features
v1.0 — ML Core	✅ Complete	Physics-informed feature engineering, Random Forest + XGBoost, Streamlit dashboard
v2.0 — AI Intelligence	✅ Complete	Gemini diagnostics, SHAP explainability, anomaly detection, smart alerts
v2.1 — AI Chat	✅ Complete	Conversational AI copilot, natural-language Q&A, context-aware responses
v2.2 — Optimization	✅ Complete	What-if simulator, sensitivity analysis, target purity solver, energy advisor
v3.0 — Advanced Analytics	🔄 Planned	Time-series trend analysis, shift handover reports, batch comparison
v3.1 — Multi-Column	🔄 Planned	Support for multiple distillation columns in a single dashboard
v4.0 — Edge Deployment	🔜 Future	Docker container, REST API, OPC-UA integration for live DCS data

🤝 Contributing

Contributions are welcome! Here's how you can help:

🍴 Fork the repository
🌿 Create a feature branch (`git checkout -b feature/amazing-feature`)
💾 Commit your changes (`git commit -m 'Add amazing feature'`)
📤 Push to the branch (`git push origin feature/amazing-feature`)
🔀 Open a Pull Request

Contribution Ideas

Area	Suggestion
🤖 AI	Fine-tune Gemini prompts for better diagnostics; add GPT-4 fallback
🔬 Explainability	Add LIME as an alternative to SHAP; integrate counterfactual explanations
🧪 Data	Add more distillation experiments or real plant data
🧠 Models	Try neural networks, gradient boosting variants, or ensemble stacking
📊 Analytics	Build time-series dashboards for long-term column performance tracking
🎨 UI	Dark mode, mobile-responsive layout, trend charts
🚀 Deployment	Docker support, cloud deployment (AWS/GCP/Azure), REST API
🏭 Integration	OPC-UA/Modbus connectors for live DCS/SCADA data feed

📝 License

This project is licensed under the MIT License — see the LICENSE file for details.

MIT License · Copyright (c) 2026 Krishna Narayan Singh

📞 Contact

Channel	Link
📧 Email	krishnanarayansingh65@gmail.com
💼 LinkedIn	Krishna Narayan Singh
🐙 GitHub (Author)	KrishnaNsingh
🐙 GitHub (Contributor)	Mausam5055

Built with ❤️ for Chemical Engineering, Machine Learning & Artificial Intelligence

Author: Krishna Narayan Singh · Last Updated: April 2026 · Version: 2.0 (AI-Powered)
