IPL Intelligence Engine

Transforming IPL ball-by-ball data into actionable cricket intelligence using statistical analysis and machine learning.

Project Overview

The Indian Premier League generates one of the richest ball-by-ball datasets in professional sport. With over 260,000 delivery records spanning 2008 to 2020, this project applies the full data science workflow to answer questions that teams, analysts, and fans have debated for years: Which phase of the innings matters most? Are boundaries the primary driver of T20 scoring? Do death overs genuinely score more than the powerplay?

This project was submitted as part of INT375 — Data Science Tool Box: Python Programming at Lovely Professional University. It covers ten analytical objectives — from raw data cleaning to a formal statistical hypothesis test — and presents all findings through a deployed interactive website.

Quick Preview

Property	Detail
Total Deliveries Analyzed	260,000+
Seasons Covered	2008–2020
Key Focus	Match phase analysis, player performance, statistical validation
Analytical Objectives	10
Visualizations Produced	12
Output	Interactive website + Python visual analytics

Key Highlights

Ball-by-ball analysis of 260,920 IPL deliveries across 13 seasons (2008–2020)
Full EDA pipeline: cleaning, missing value handling, feature engineering, and outlier removal
Three levels of exploratory analysis: univariate, bivariate, and multivariate
Linear Regression model quantifying the relationship between over number and run rate
Welch's independent t-test providing statistically rigorous phase comparison (p < 0.000001)
Advanced player analytics: top run-scorers, wicket-takers, boundary hitters, economy rates
Deployed interactive portfolio website built with HTML5, CSS3, and vanilla JavaScript
All findings are data-backed, reproducible, and fully documented

Dataset

Property	Value
Source	Kaggle — IPL Complete Dataset (2008–2020)
Contributor	Patrick B
File Used	`deliveries.csv`
Raw Shape	260,920 rows × 17 columns
Post-Cleaning Shape	~256,000 rows × 18 columns
Coverage	IPL Seasons 2008–2020
Granularity	One row per delivery bowled

Column Reference

Column	Type	Description
`match_id`	Integer	Unique match identifier
`inning`	Integer	Innings number (1–2; super overs = 3+)
`batting_team`	String	Team currently batting
`bowling_team`	String	Team currently bowling
`over`	Integer	Over number (0-indexed in raw; corrected to 1–20)
`ball`	Integer	Ball number within the over
`batter`	String	Batsman facing the delivery
`bowler`	String	Bowler delivering the ball
`non_striker`	String	Batsman at the non-striking end
`batsman_runs`	Integer	Runs scored by the batsman
`extra_runs`	Integer	Extra runs conceded
`total_runs`	Integer	Total runs from the delivery
`extras_type`	String	Type of extra (NaN if none)
`is_wicket`	Integer	Binary flag: 1 if wicket fell
`player_dismissed`	String	Dismissed player name (NaN if no wicket)
`dismissal_kind`	String	Type of dismissal (NaN if no wicket)
`fielder`	String	Fielder involved (NaN if not applicable)

Note on missing values: player_dismissed, dismissal_kind, fielder, and extras_type are NaN when no wicket or no extra occurred. These are semantically meaningful absences, not data corruption, so no rows were dropped because of them; the NaNs were filled with the labels "not_dismissed" and "none" instead.
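
A minimal sketch of the semantic fill described in the note, using a tiny hand-made frame in place of deliveries.csv. The note names only the two fill labels, so the column-to-label mapping below is an assumption made for illustration, and the player names are placeholders.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for deliveries.csv (hypothetical rows, not real data).
df = pd.DataFrame({
    "total_runs": [1, 0, 5],
    "extras_type": [np.nan, np.nan, "wides"],
    "player_dismissed": [np.nan, "A Batter", np.nan],
    "dismissal_kind": [np.nan, "caught", np.nan],
    "fielder": [np.nan, "A Fielder", np.nan],
})

# Assumed mapping of columns to the two fill labels named in the note.
fill_values = {
    "player_dismissed": "not_dismissed",
    "dismissal_kind": "not_dismissed",
    "fielder": "none",
    "extras_type": "none",
}

rows_before = len(df)
df = df.fillna(value=fill_values)

print(df.isna().sum().sum())      # remaining NaN cells
print(len(df) == rows_before)     # no rows were dropped
```

Filling with explicit labels keeps every delivery in the frame, so later counts (for example dismissal types) can simply exclude the "not_dismissed" label.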

Data Analysis Pipeline

deliveries.csv
      │
      ▼
┌─────────────────────────────────────────────┐
│  OBJECTIVE 1 — Data Cleaning & EDA          │
│  • Load CSV, inspect shape, head, info      │
│  • Handle missing values (semantic fills)   │
│  • Remove duplicates (0 found)              │
│  • Correct over index (0–19 → 1–20)         │
│  • Remove super overs (inning > 2)          │
│  • Engineer match_phase feature column      │
└─────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────┐
│  OBJECTIVES 2–4 — Exploratory Analysis      │
│  • Univariate: run distribution, dismissals │
│  • Bivariate: over vs run rate, top batsmen │
│  • Multivariate: correlation heatmap,       │
│    scatter of balls faced vs total runs     │
└─────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────┐
│  OBJECTIVE 5 — Outlier Detection (IQR)      │
│  • Aggregate to runs-per-over level         │
│  • Q1=12, Q3=20, IQR=8, bounds [0, 32]      │
│  • Removed 369 outlier overs (1.42%)        │
│  • Returns over_clean DataFrame             │
└─────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────┐
│  OBJECTIVE 6 — Linear Regression            │
│  • X: over number, y: runs_per_over         │
│  • 80/20 train-test split (seed=42)         │
│  • MSE=40.98, R²=0.0106, coef=+0.1198       │
└─────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────┐
│  OBJECTIVE 7 — Hypothesis Testing (T-Test)  │
│  • Welch's t-test: Powerplay vs Death overs │
│  • t = −10.707, p < 0.000001                │
│  • Reject H₀: statistically significant     │
└─────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────┐
│  OBJECTIVE 8 — Advanced Player Analytics    │
│  • Top batsmen, top wicket-takers           │
│  • Boundary leaders, economy specialists    │
│  • Phase-wise run distribution & pie chart  │
└─────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────┐
│  OBJECTIVES 9–10 — Insights & Strategy      │
│  • 7 data-backed key findings               │
│  • Phase-specific batting & bowling recs    │
└─────────────────────────────────────────────┘
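
The Objective 1 steps in the diagram can be sketched as follows. This is an illustration on hypothetical rows, not the project's script: the helper name `match_phase` and the phase labels are assumptions, while the over ranges (1–6, 7–15, 16–20) are the ones this README uses.

```python
import pandas as pd

def match_phase(over: int) -> str:
    """Map a 1-indexed over number to its phase (ranges from this README)."""
    if over <= 6:
        return "Powerplay"
    if over <= 15:
        return "Middle"
    return "Death"

# Hypothetical raw rows: 0-indexed overs, one super-over delivery (inning 3).
raw = pd.DataFrame({
    "match_id": [1, 1, 1, 1],
    "inning": [1, 1, 2, 3],
    "over": [0, 6, 19, 0],
    "total_runs": [4, 1, 6, 2],
})

clean = raw.drop_duplicates()
clean = clean[clean["inning"] <= 2].copy()   # remove super overs
clean["over"] = clean["over"] + 1            # 0–19 -> 1–20
clean["match_phase"] = clean["over"].apply(match_phase)

print(clean[["inning", "over", "match_phase"]])
```

Correcting the over index before deriving the phase matters: on the raw 0-indexed values, over 6 would be labelled Powerplay even though it is the seventh over.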

Machine Learning — Linear Regression

Objective: Quantify how much runs per over increases with each successive over, and evaluate how well a linear model captures this relationship.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X = over_clean[["over"]].values        # Independent variable: over number
y = over_clean["runs_per_over"].values  # Dependent variable: runs per over

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
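model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out 20% (assumed to be how the reported metrics were computed)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Intercept={model.intercept_:.4f}, coef={model.coef_[0]:.4f}, MSE={mse:.4f}, R2={r2:.4f}")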

Results:

Metric	Value	Interpretation
Intercept (β₀)	14.3186	Baseline runs/over at over 0 (theoretical)
Coefficient (β₁)	+0.1198	Each additional over adds ~0.12 runs on average
MSE	40.9804	Average squared prediction error
R²	0.0106	Over number explains ~1% of variance in runs/over
Train/Test Split	80% / 20%	Fixed seed (random_state=42) for reproducibility

On the low R²: The R² of 0.0106 is expected and informative — not a model failure. Over number alone cannot capture pitch conditions, batting lineup, or match situation. The coefficient +0.1198 is the meaningful output: a quantified, reproducible upward trend across ~25,700 clean overs.
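
To make the coefficient concrete, the short sketch below plugs the reported intercept and slope into the fitted line for the first and last over. The function name `predicted_runs` is illustrative; the two constants are the values from the results table.

```python
# Fitted values reported in the results table.
INTERCEPT = 14.3186
SLOPE = 0.1198

def predicted_runs(over: int) -> float:
    """Runs per over predicted by the fitted line for a given over number."""
    return INTERCEPT + SLOPE * over

first, last = predicted_runs(1), predicted_runs(20)
print(round(first, 4))         # 14.4384
print(round(last, 4))          # 16.7146
print(round(last - first, 4))  # 2.2762
```

The fitted line therefore rises by about 2.3 runs between over 1 and over 20, which is small next to the typical prediction error (the square root of the MSE is roughly 6.4 runs). That gap is what the low R² reflects.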

Hypothesis Testing — Powerplay vs Death Overs

Question: Do Death overs (16–20) score significantly more runs per over than Powerplay overs (1–6)?

Method: Welch's independent two-sample t-test (equal_var=False) — chosen because the two phases have different variance structures, making the equal-variance assumption of Student's t-test inappropriate.

from scipy import stats

powerplay = over_clean[over_clean["over"].between(1,  6)]["runs_per_over"]
death     = over_clean[over_clean["over"].between(16, 20)]["runs_per_over"]

t_stat, p_value = stats.ttest_ind(powerplay, death, equal_var=False)

Results:

Parameter	Value
H₀	Mean runs/over in Powerplay = Mean in Death overs
H₁	Mean runs/over in Powerplay ≠ Mean in Death overs (two-sided)
α (Significance Level)	0.05
Powerplay Mean	15.40 runs/over (n = 5,094)
Death Overs Mean	16.77 runs/over (n = 4,220)
T-Statistic	−10.7068
P-Value	< 0.000001
Conclusion	Reject H₀ — statistically significant difference
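
For readers who want to see what `equal_var=False` computes, here is a sketch that derives Welch's t statistic and the Welch–Satterthwaite degrees of freedom by hand and checks them against SciPy. The samples are synthetic draws chosen only to resemble the two phases; they are not the project's data, and `welch_t` is a hypothetical helper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two phases (NOT the project's data).
powerplay = rng.normal(loc=15.4, scale=5.0, size=500)
death = rng.normal(loc=16.8, scale=7.0, size=400)

def welch_t(a: np.ndarray, b: np.ndarray) -> tuple[float, float]:
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = a.var(ddof=1) / len(a)   # squared standard error of mean(a)
    vb = b.var(ddof=1) / len(b)   # squared standard error of mean(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return float(t), float(dof)

t_manual, dof = welch_t(powerplay, death)
p_manual = 2 * stats.t.sf(abs(t_manual), dof)   # two-sided p-value

t_scipy, p_scipy = stats.ttest_ind(powerplay, death, equal_var=False)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

Because the first sample passed to the test is Powerplay, a negative t statistic means the Powerplay mean is the lower of the two, which matches the sign reported in the table.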

Key Insights

#	Finding	Supporting Evidence
1	Death overs produce the highest scoring rate	16.77 runs/over vs 15.40 in Powerplay (p < 0.000001)
2	V Kohli is the all-time IPL run leader	8,004 total runs and 979 boundaries — leads both categories
3	Boundaries account for 59.9% of all runs	Pie chart confirms T20 is a boundary-dependent format
4	Caught is the dominant dismissal at 62%	8,053 caught dismissals out of ~13,000 total wickets
5	Run rate dips consistently at overs 7–8	Visible across all seasons — prime window for spin bowling
6	YS Chahal leads all wicket-takers with 213	6 of top 10 wicket-takers are spinners — spin dominates IPL
7	Over number alone has weak predictive power	Linear Regression R² ≈ 0.01 — match context dominates over position

Screenshots

Objective 2 — Univariate Analysis

Histogram of total_runs & Dismissal Type Countplot

The distribution of runs per delivery is heavily right-skewed — 0 runs and 1 run together account for over 75% of all deliveries (mean = 1.333, median = 1.0). The countplot confirms "caught" as the dominant dismissal at 8,053 occurrences (62% of all wickets), followed by bowled (2,204) and run out (1,107).

Objective 3 — Bivariate Analysis

Line Plot: Over vs Average Runs & Top 10 Batsmen by Total Runs

The line plot reveals the consistent run-rate dip at over 7 (the transition window after the powerplay) and the steady climb through the death overs, peaking near overs 17–18. The bar chart confirms V Kohli at 8,004 runs, 1,235 ahead of second-placed S Dhawan (6,769).

Objective 4 — Multivariate Analysis

Correlation Heatmap & Scatter: Balls Faced vs Total Runs (Top 30 Batsmen)

total_runs and batsman_runs show near-perfect correlation (r = 0.98). is_wicket vs total_runs is weakly negative (−0.18) — wicket deliveries tend to yield fewer runs. In the scatter, V Kohli sits alone in the extreme top-right, the only player to simultaneously maximise both balls faced and total runs.

Objective 5 — Outlier Detection & Removal

Boxplot Before/After IQR Removal + Distribution Comparison

The before boxplot shows extreme outliers reaching 52 runs per over (red dots above the whisker). After applying IQR bounds (Q1=12, Q3=20, upper limit=32), 369 overs (1.42%) were removed. The overlay histogram confirms the central distribution shape is fully preserved — only the extreme right tail is trimmed.
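
The reported fences follow from the usual 1.5 × IQR rule, which the sketch below reproduces from the stated quartiles before applying the same rule to a few hypothetical over totals. The helper `iqr_bounds` and the sample values are illustrative and not taken from the project's script.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Reproduce the reported fences from the reported quartiles.
q1, q3 = 12, 20
iqr = q3 - q1
print(q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # 0.0 32.0

# Hypothetical runs-per-over values, including one extreme over.
overs = pd.DataFrame({"runs_per_over": [8, 12, 14, 16, 20, 22, 52]})
low, high = iqr_bounds(overs["runs_per_over"])
over_clean = overs[overs["runs_per_over"].between(low, high)]
print(len(overs) - len(over_clean))     # number of overs removed
```

Since runs per over cannot be negative, the lower fence of 0 never removes anything here; in practice only the upper fence trims overs.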

Objective 6 — Linear Regression

Regression Line (Over vs Runs per Over) & Residual Plot

The positive slope (β₁ = +0.1198) indicates a slight average scoring increase across the innings. MSE = 40.98, R² = 0.0106. The residual plot shows an even spread above and below the zero line with no systematic pattern, which is consistent with an unbiased fit and roughly constant variance.

Objective 7 — Hypothesis Testing: Boxplot Comparison

Powerplay (1–6) vs Death Overs (16–20) — Welch's T-Test

Death overs (red) show a higher median and wider IQR than Powerplay overs (green), indicating both higher scoring and greater variability. t = −10.707, p < 0.000001. The chart title displays the test result directly.

Objective 7 — Hypothesis Testing: Mean Comparison

Average Runs per Over: Powerplay (15.40) vs Death (16.77)

Clean two-bar comparison with exact values labelled. The 1.37 run difference is consistent and systematic across thousands of overs — a structural feature of IPL innings, not a sample artefact.

Objective 8 — Advanced Analysis Part 1

Top 10 Batsmen by Total Runs & Top 10 Bowlers by Total Wickets

V Kohli (8,004 runs) leads run-scorers by a clear margin. YS Chahal tops wicket-takers with 213 wickets. Notably, 6 of the top 10 bowlers are spinners — confirming spin bowling as the dominant wicket-taking weapon in IPL conditions across 2008–2020.

Objective 8 — Advanced Analysis Part 2

Most Aggressive Batsmen (Boundaries) & Best Economy Bowlers

V Kohli also leads in total boundaries (979), followed by S Dhawan (921) and DA Warner (899) — confirming his status as the most complete T20 batsman in the dataset. Sohail Tanvir leads economy rates at 6.23 runs/over (min. 10 overs), followed by A Chandila (6.28).

Additional — Runs Distribution by Match Phase

Phase-wise Boxplot: Death (16–20) | Middle (7–15) | Powerplay (1–6)

Middle overs show the highest absolute spread and widest IQR, reflecting 9 overs of batting with high match-to-match variability. Powerplay and Death are more compact. Death shows the lowest median aggregate despite having the highest per-over rate — explained by fewer overs (5 vs 9).

Additional — Boundary Contribution to Total Runs

59.9% of All Batsman Runs Come from Boundaries (4s & 6s)

The most striking single visualization in the project. Only 40.1% of runs come from running between the wickets (singles, twos, and threes). This strongly suggests that T20 scoring is boundary-dependent: teams without boundary-hitting specialists face a structural scoring disadvantage.
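
One way such a boundary share could be computed is sketched below on made-up deliveries. It assumes a boundary is any ball on which the batter scored exactly 4 or 6, which would also count the rare all-run four; the project's script may define it differently.

```python
import pandas as pd

# Hypothetical deliveries (not real data): batsman_runs per ball.
df = pd.DataFrame({"batsman_runs": [0, 1, 4, 6, 2, 1, 4, 0, 1, 1]})

# Assumption: a boundary is any delivery where the batter scored exactly 4 or 6.
is_boundary = df["batsman_runs"].isin([4, 6])
boundary_runs = df.loc[is_boundary, "batsman_runs"].sum()
total_batsman_runs = df["batsman_runs"].sum()

share = 100 * boundary_runs / total_batsman_runs
print(f"{share:.1f}% of batsman runs came from boundaries")  # 70.0% on this toy data
```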

Additional — Average Runs by Match Phase

Phase-level Aggregate: Powerplay (94.4) → Middle (137.8) → Death (87.5)

Middle overs peak purely due to volume (9 overs vs 6 for Powerplay and 5 for Death). This contrasts with per-over rates where Death (16.77) leads. The line plot shows this aggregate-vs-rate distinction clearly — an important nuance for strategic planning.
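
The aggregate-versus-rate distinction can be illustrated on a single made-up match. The per-over values below are invented so that the Death phase has the highest rate but the lowest total; `pd.cut` is used here only for brevity and is not necessarily how the project assigns phases.

```python
import pandas as pd

# Hypothetical per-over totals for one match (not real data).
overs = pd.DataFrame({
    "match_id": [1] * 20,
    "over": range(1, 21),
    "runs_per_over": [15] * 6 + [14] * 9 + [17] * 5,
})

# Right-inclusive bins: (0, 6], (6, 15], (15, 20].
overs["match_phase"] = pd.cut(
    overs["over"], bins=[0, 6, 15, 20], labels=["Powerplay", "Middle", "Death"]
)

summary = overs.groupby("match_phase", observed=True)["runs_per_over"].agg(
    aggregate="sum",   # total runs in the phase (depends on phase length)
    per_over="mean",   # scoring rate (independent of phase length)
)
print(summary)
```

On this toy data the Middle phase totals 126 runs at 14 per over, while the Death phase totals 85 at 17 per over, mirroring the pattern described above.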

Tech Stack

Category	Tool / Library	Purpose
Language	Python 3.10+	Core programming language
Data Manipulation	Pandas	DataFrame operations, groupby, aggregation
Numerical Computing	NumPy	Array operations, mathematical functions
Visualization	Matplotlib	Low-level plotting and figure control
Statistical Visualization	Seaborn	Heatmaps, countplots, boxplots
Machine Learning	Scikit-learn	LinearRegression, train_test_split, metrics
Statistical Testing	SciPy	Welch's t-test via `scipy.stats.ttest_ind`
Frontend	HTML5, CSS3, JS	Interactive portfolio website
IDE	VS Code	Development environment
Version Control	Git & GitHub	Code hosting and version control

Project Structure

IPL-Intelligence-Engine/
│
├── ipl_analysis.py               # Main Python script — all 10 objectives
├── deliveries.csv                # IPL ball-by-ball dataset (2008–2020)
├── README.md                     # Project documentation
│
├── images/                       # All generated graph outputs
│   ├── univariate.png            # Obj 2 — Histogram + Dismissal Countplot
│   ├── bivariate.png             # Obj 3 — Over Run Rate + Top Batsmen
│   ├── multivariate.png          # Obj 4 — Correlation Heatmap + Scatter
│   ├── outlier_detection.png     # Obj 5 — IQR Before/After + Distribution
│   ├── linear_regression.png     # Obj 6 — Regression Line + Residual Plot
│   ├── hypothesis_boxplot.png    # Obj 7 — Powerplay vs Death Boxplot
│   ├── hypothesis_bar.png        # Obj 7 — Mean Comparison Bar Chart
│   ├── advanced_analysis_1.png   # Obj 8 — Top Batsmen + Top Bowlers
│   ├── advanced_analysis_2.png   # Obj 8 — Boundaries + Economy Rates
│   ├── run_distribution.png      # Additional — Phase Boxplot
│   ├── boundary_pie.png          # Additional — Boundary Contribution Pie
│   └── phase_line.png            # Additional — Phase Average Line Plot
│
├── index.html                    # Website entry point
├── styles.css                    # Website stylesheet (dark editorial theme)
└── script.js                     # Website interactivity (scroll, lightbox, nav)

Live Website

URL: https://immortalshubham.github.io/IPL-Intelligence-Engine/

The website presents all findings in a deployed, fully interactive static site organized into six sections: Hero, About, Dataset, Analysis (all 10 objectives with embedded graphs), Insights, and Strategy Recommendations. Features include scroll-reveal animations, fullscreen image lightbox on click, sticky quick-navigation bar, and full mobile responsiveness.

GitHub Repository

Repository: https://github.com/ImmortalShubham/IPL-Intelligence-Engine

Contains the complete Python analysis script, all 12 visualization images, the full frontend website source (HTML/CSS/JS), dataset, and this documentation.

How to Run

Prerequisites

Python 3.10+
pip

Step-by-Step

# 1. Clone the repository
git clone https://github.com/ImmortalShubham/IPL-Intelligence-Engine.git
cd IPL-Intelligence-Engine

# 2. Install required libraries
pip install pandas numpy matplotlib seaborn scikit-learn scipy

# 3. Place deliveries.csv in the project root (same level as ipl_analysis.py)

# 4. Run the full analysis
python ipl_analysis.py

What Happens When You Run It

Step	Output
Objective 1	Dataset shape, head, info, statistical summary printed to console
Objectives 2–4	5 graphs displayed sequentially (univariate, bivariate, multivariate)
Objective 5	IQR parameters printed + 3 outlier detection graphs displayed
Objective 6	Regression metrics printed + 2 graphs displayed
Objective 7	T-test results printed + 2 hypothesis graphs displayed
Objective 8	4 advanced analysis graphs + pie chart + phase line plot displayed
Objectives 9–10	All insights and strategy recommendations printed to console

Viewing the Website

Open index.html in any modern browser — no server, no build step needed.

Key Learnings

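Semantic missing values require domain understanding. Most deliveries involve neither a wicket nor an extra, so dropping NaN rows here would have removed 95%+ of the dataset. Recognizing that these NaNs mean "nothing happened" takes domain knowledge that generic imputation rules do not supply.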
Two-level aggregation is essential for fair over-level averages. Averaging raw ball-level data inflates results for overs with extra deliveries; grouping by (match, over) first, then averaging across matches, is the correct approach.
Feature engineering pays dividends. The match_phase column was created in one .apply() call and reused across five separate objectives — a small investment with large analytical return.
R² is not the only success metric. An R² of 0.0106 is the correct result here. The coefficient is the meaningful output, not the explained variance.
Welch's t-test over Student's t-test. In real-world analysis, equal variance between groups is rarely justified without testing. Welch's variant is the safer default.
Outlier context matters. Applying IQR at the over level (not the ball level) made the removal analytically meaningful — a 52-run over is an outlier; a 6-run delivery is not.
Chart type selection is analytical decision-making. Line plots for trends, heatmaps for relationships, residual plots for model validation — each choice communicates something the others cannot.
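
The two-level aggregation from the learnings above can be sketched as follows, on hypothetical ball-by-ball rows. The grouping keys follow the text, (match, over); whether the project's script also groups by inning is not stated here, so treat the keys as an assumption.

```python
import pandas as pd

# Hypothetical ball-by-ball rows (not real data). Match 2's over 1 has a
# seventh delivery, as happens after a wide or no-ball.
balls = pd.DataFrame({
    "match_id":   [1] * 6 + [2] * 7,
    "over":       [1] * 13,
    "total_runs": [1, 0, 4, 0, 1, 0] + [1, 1, 0, 2, 0, 1, 5],
})

# Level 1: total runs for each (match, over) pair.
per_match_over = (
    balls.groupby(["match_id", "over"], as_index=False)["total_runs"]
    .sum()
    .rename(columns={"total_runs": "runs_per_over"})
)

# Level 2: average those over totals across matches.
avg_by_over = per_match_over.groupby("over")["runs_per_over"].mean()
print(avg_by_over.loc[1])   # 8.0 (over totals of 6 and 10)
```

Each (match, over) pair contributes exactly one value to the final average, no matter how many deliveries the over contained.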

Future Improvements

Area	Description
Advanced ML Models	Random Forest or XGBoost for multi-feature run prediction — expected to outperform linear regression significantly
Match Outcome Prediction	Binary classification model predicting match winner from partial-innings data (logistic regression, SVM)
Player Similarity Clustering	K-Means or hierarchical clustering to group batsmen/bowlers into performance archetypes
Partnership Analysis	Track `batter`–`non_striker` pairs to identify the highest-output batting combinations
Venue Analysis	Merge with `matches.csv` for ground-specific performance breakdowns
Dataset Extension	Update to 2021–2024 IPL seasons — test whether death-over dominance has intensified
NLP Commentary	Auto-generate natural language summaries using a language model as data updates

Author

Shubham Kumar
B.Tech CSE, Section 3M031 | Registration No: 12405152
Lovely Professional University, Phagwara, Punjab

Platform	Link
GitHub	@ImmortalShubham (https://github.com/ImmortalShubham)
Live Project	IPL Intelligence Engine (https://immortalshubham.github.io/IPL-Intelligence-Engine/)

Submitted under the guidance of Dr. Mrinalini Rana (UID: 22138), Assistant Professor, School of CSE, LPU — INT375 (Data Science Tool Box: Python Programming), January–April 2026.

License

This project is licensed under the MIT License.

MIT License

Copyright (c) 2026 Shubham Kumar

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

IPL Intelligence Engine — INT375 Project Report, Lovely Professional University, April 2026
