IPL Intelligence Engine

Transforming IPL ball-by-ball data into actionable cricket intelligence using statistical analysis and machine learning.



Project Overview

The Indian Premier League generates one of the richest ball-by-ball datasets in professional sport. With over 260,000 delivery records spanning 2008 to 2020, this project applies the full data science workflow to answer questions that teams, analysts, and fans have debated for years: Which phase of the innings matters most? Are boundaries the primary driver of T20 scoring? Do death overs genuinely score more than the powerplay?

This project was submitted as part of INT375 — Data Science Tool Box: Python Programming at Lovely Professional University. It covers ten analytical objectives — from raw data cleaning to a formal statistical hypothesis test — and presents all findings through a deployed interactive website.


Quick Preview

| Property | Detail |
| --- | --- |
| Total Deliveries Analyzed | 260,000+ |
| Seasons Covered | 2008–2020 |
| Key Focus | Match phase analysis, player performance, statistical validation |
| Analytical Objectives | 10 |
| Visualizations Produced | 12 |
| Output | Interactive website + Python visual analytics |

Key Highlights

  • Ball-by-ball analysis of 260,920 IPL deliveries across 13 seasons (2008–2020)
  • Full EDA pipeline: cleaning, missing value handling, feature engineering, and outlier removal
  • Three levels of exploratory analysis: univariate, bivariate, and multivariate
  • Linear Regression model quantifying the relationship between over number and run rate
  • Welch's independent t-test providing statistically rigorous phase comparison (p < 0.000001)
  • Advanced player analytics: top run-scorers, wicket-takers, boundary hitters, economy rates
  • Deployed interactive portfolio website built with HTML5, CSS3, and vanilla JavaScript
  • All findings are data-backed, reproducible, and fully documented

Dataset

| Property | Value |
| --- | --- |
| Source | Kaggle, IPL Complete Dataset (2008–2020) |
| Contributor | Patrick B |
| File Used | deliveries.csv |
| Raw Shape | 260,920 rows × 17 columns |
| Post-Cleaning Shape | ~256,000 rows × 18 columns |
| Coverage | IPL Seasons 2008–2020 |
| Granularity | One row per delivery bowled |

Column Reference

| Column | Type | Description |
| --- | --- | --- |
| match_id | Integer | Unique match identifier |
| inning | Integer | Innings number (1–2; super overs = 3+) |
| batting_team | String | Team currently batting |
| bowling_team | String | Team currently bowling |
| over | Integer | Over number (0-indexed in raw; corrected to 1–20) |
| ball | Integer | Ball number within the over |
| batter | String | Batsman facing the delivery |
| bowler | String | Bowler delivering the ball |
| non_striker | String | Batsman at the non-striking end |
| batsman_runs | Integer | Runs scored by the batsman |
| extra_runs | Integer | Extra runs conceded |
| total_runs | Integer | Total runs from the delivery |
| extras_type | String | Type of extra (NaN if none) |
| is_wicket | Integer | Binary flag: 1 if wicket fell |
| player_dismissed | String | Dismissed player name (NaN if no wicket) |
| dismissal_kind | String | Type of dismissal (NaN if no wicket) |
| fielder | String | Fielder involved (NaN if not applicable) |

Note on missing values: player_dismissed, dismissal_kind, and fielder are NaN when no wicket fell, and extras_type is NaN when no extra was conceded. These are semantically meaningful absences, not data corruption. No rows were dropped because of them; the NaNs were filled with "not_dismissed" and "none" instead.
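A minimal sketch of this kind of semantic fill, assuming pandas and a tiny hand-made frame standing in for deliveries.csv. Which column receives which fill value is an assumption here; the note only names the two fill strings.

```python
import pandas as pd

# Tiny stand-in for deliveries.csv (illustrative rows, not real data).
df = pd.DataFrame({
    "total_runs": [1, 1, 4],
    "extras_type": [None, "wides", None],
    "player_dismissed": [None, "A Batter", None],
    "dismissal_kind": [None, "caught", None],
    "fielder": [None, "A Fielder", None],
})

# Assumed mapping of columns to fill values.
fill_values = {
    "extras_type": "none",
    "player_dismissed": "not_dismissed",
    "dismissal_kind": "not_dismissed",
    "fielder": "not_dismissed",
}

rows_before = len(df)
df = df.fillna(value=fill_values)

print("NaNs remaining:", int(df.isna().sum().sum()))
print("Rows kept:", len(df), "of", rows_before)
```

Filling instead of dropping keeps every delivery in the frame, so later groupby totals still count dot balls and non-wicket deliveries.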


Data Analysis Pipeline

deliveries.csv
      │
      ▼
┌─────────────────────────────────────────────┐
│  OBJECTIVE 1 — Data Cleaning & EDA          │
│  • Load CSV, inspect shape, head, info      │
│  • Handle missing values (semantic fills)   │
│  • Remove duplicates (0 found)              │
│  • Correct over index (0–19 → 1–20)         │
│  • Remove super overs (inning > 2)          │
│  • Engineer match_phase feature column      │
└─────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────┐
│  OBJECTIVES 2–4 — Exploratory Analysis      │
│  • Univariate: run distribution, dismissals │
│  • Bivariate: over vs run rate, top batsmen │
│  • Multivariate: correlation heatmap,       │
│    scatter of balls faced vs total runs     │
└─────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────┐
│  OBJECTIVE 5 — Outlier Detection (IQR)      │
│  • Aggregate to runs-per-over level         │
│  • Q1=12, Q3=20, IQR=8, bounds [0, 32]      │
│  • Removed 369 outlier overs (1.42%)        │
│  • Returns over_clean DataFrame             │
└─────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────┐
│  OBJECTIVE 6 — Linear Regression            │
│  • X: over number, y: runs_per_over         │
│  • 80/20 train-test split (seed=42)         │
│  • MSE=40.98, R²=0.0106, coef=+0.1198       │
└─────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────┐
│  OBJECTIVE 7 — Hypothesis Testing (T-Test)  │
│  • Welch's t-test: Powerplay vs Death overs │
│  • t = −10.707, p < 0.000001                │
│  • Reject H₀: statistically significant     │
└─────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────┐
│  OBJECTIVE 8 — Advanced Player Analytics    │
│  • Top batsmen, top wicket-takers           │
│  • Boundary leaders, economy specialists    │
│  • Phase-wise run distribution & pie chart  │
└─────────────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────────────┐
│  OBJECTIVES 9–10 — Insights & Strategy      │
│  • 7 data-backed key findings               │
│  • Phase-specific batting & bowling recs    │
└─────────────────────────────────────────────┘
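The structural cleaning steps in Objective 1 can be sketched as follows, assuming pandas. The helper name phase_of, the phase labels, and the sample rows are illustrative assumptions; only the over ranges (1–6, 7–15, 16–20) and the inning > 2 rule come from this document.

```python
import pandas as pd

def phase_of(over: int) -> str:
    """Map a 1-indexed over number to its match phase."""
    if over <= 6:
        return "Powerplay"
    if over <= 15:
        return "Middle"
    return "Death"

# Illustrative rows only; the real input is deliveries.csv.
raw = pd.DataFrame({
    "match_id": [1, 1, 1, 1],
    "inning": [1, 1, 2, 3],      # inning 3 is a super over
    "over": [0, 6, 19, 0],       # 0-indexed, as in the raw file
    "total_runs": [1, 4, 6, 2],
})

clean = raw.drop_duplicates().copy()
clean["over"] = clean["over"] + 1                  # 0–19 -> 1–20
clean = clean[clean["inning"] <= 2].copy()         # drop super overs
clean["match_phase"] = clean["over"].apply(phase_of)

print(clean[["inning", "over", "match_phase"]])
```

Correcting the over index before assigning phases matters: with the raw 0-indexed values, over 6 of the powerplay would be labelled as over 5 and the death overs would end at 19.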

Machine Learning — Linear Regression

Objective: Quantify how much runs per over increases with each successive over, and evaluate how well a linear model captures this relationship.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X = over_clean[["over"]].values        # Independent variable: over number
y = over_clean["runs_per_over"].values  # Dependent variable: runs per over

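X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

# Assumed evaluation step: score the model on the held-out 20%
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²: ", r2_score(y_test, y_pred))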

Results:

| Metric | Value | Interpretation |
| --- | --- | --- |
| Intercept (β₀) | 14.3186 | Baseline runs/over at over 0 (theoretical) |
| Coefficient (β₁) | +0.1198 | Each additional over adds ~0.12 runs on average |
| MSE | 40.9804 | Average squared prediction error |
| R² | 0.0106 | Over number explains ~1% of variance in runs/over |
| Train/Test Split | 80% / 20% | Fixed seed (random_state=42) for reproducibility |

On the low R²: The R² of 0.0106 is expected and informative — not a model failure. Over number alone cannot capture pitch conditions, batting lineup, or match situation. The coefficient +0.1198 is the meaningful output: a quantified, reproducible upward trend across ~25,700 clean overs.
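One way to read the coefficient is to plug the reported intercept and slope back into the fitted line. The short sketch below does only that arithmetic; the function name predicted_runs is hypothetical and no model is refitted.

```python
# Fitted values as reported in the results for this model.
INTERCEPT = 14.3186
SLOPE = 0.1198

def predicted_runs(over: int) -> float:
    """Runs per over predicted by the fitted line for a given over."""
    return INTERCEPT + SLOPE * over

first, last = predicted_runs(1), predicted_runs(20)
print(f"over 1:  {first:.2f}")                            # over 1:  14.44
print(f"over 20: {last:.2f}")                             # over 20: 16.71
print(f"change across the innings: {last - first:.2f}")   # 2.28
```

The fitted line rises by roughly 2.3 runs per over between over 1 and over 20, while the reported MSE of 40.98 corresponds to a typical prediction error of about 6.4 runs. That gap is why the trend is real but over number alone predicts individual overs poorly.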


Hypothesis Testing — Powerplay vs Death Overs

Question: Do Death overs (16–20) score significantly more runs per over than Powerplay overs (1–6)?

Method: Welch's independent two-sample t-test (equal_var=False) — chosen because the two phases have different variance structures, making the equal-variance assumption of Student's t-test inappropriate.

from scipy import stats

powerplay = over_clean[over_clean["over"].between(1,  6)]["runs_per_over"]
death     = over_clean[over_clean["over"].between(16, 20)]["runs_per_over"]

t_stat, p_value = stats.ttest_ind(powerplay, death, equal_var=False)

Results:

| Parameter | Value |
| --- | --- |
| H₀ | Mean runs/over in Powerplay = Mean in Death overs |
| H₁ | Mean runs/over in Powerplay ≠ Mean in Death overs |
| α (Significance Level) | 0.05 |
| Powerplay Mean | 15.40 runs/over (n = 5,094) |
| Death Overs Mean | 16.77 runs/over (n = 4,220) |
| T-Statistic | −10.7068 |
| P-Value | < 0.000001 |
| Conclusion | Reject H₀: statistically significant difference |
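To make explicit what the Welch variant computes, here is a minimal pure-Python sketch of the statistic and its Welch–Satterthwaite degrees of freedom. The helper welch_t and both samples are made up for illustration; they are not the project's data and this is not how scipy is implemented internally.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t-statistic and Welch–Satterthwaite degrees of freedom."""
    # Squared standard error of each sample mean (sample variance / n).
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Made-up runs-per-over samples, not the project's data.
powerplay_sample = [12, 15, 14, 16, 13, 17, 15, 14]
death_sample = [18, 11, 22, 16, 25, 14, 20]

t_stat, dof = welch_t(powerplay_sample, death_sample)
print(f"t = {t_stat:.3f}, dof = {dof:.1f}")
```

Because the Powerplay sample is passed first, a negative t means the Death sample has the higher mean, which is how the reported t = −10.7068 should be read.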

Key Insights

| # | Finding | Supporting Evidence |
| --- | --- | --- |
| 1 | Death overs produce the highest scoring rate | 16.77 runs/over vs 15.40 in Powerplay (p < 0.000001) |
| 2 | V Kohli is the all-time IPL run leader | 8,004 total runs and 979 boundaries; leads both categories |
| 3 | Boundaries account for 59.9% of all batsman runs | Pie chart confirms T20 is a boundary-dependent format |
| 4 | Caught is the dominant dismissal at 62% | 8,053 caught dismissals out of ~13,000 total wickets |
| 5 | Run rate dips consistently at overs 7–8 | Visible across all seasons; prime window for spin bowling |
| 6 | YS Chahal leads all wicket-takers with 213 | 6 of top 10 wicket-takers are spinners; spin dominates IPL |
| 7 | Over number alone has weak predictive power | Linear Regression R² ≈ 0.01; match context dominates over position |

Screenshots

Objective 2 — Univariate Analysis

Histogram of total_runs & Dismissal Type Countplot

The distribution of runs per delivery is heavily right-skewed — 0 runs and 1 run together account for over 75% of all deliveries (mean = 1.333, median = 1.0). The countplot confirms "caught" as the dominant dismissal at 8,053 occurrences (62% of all wickets), followed by bowled (2,204) and run out (1,107).


Objective 3 — Bivariate Analysis

Line Plot: Over vs Average Runs & Top 10 Batsmen by Total Runs

The line plot reveals the consistent run-rate dip at over 7 (the transition window after the powerplay) and the steady climb through the death overs, peaking near overs 17–18. The bar chart confirms V Kohli at 8,004 runs, 1,235 ahead of second-placed S Dhawan (6,769).


Objective 4 — Multivariate Analysis

Correlation Heatmap & Scatter: Balls Faced vs Total Runs (Top 30 Batsmen)

total_runs and batsman_runs show near-perfect correlation (r = 0.98). is_wicket vs total_runs is weakly negative (−0.18) — wicket deliveries tend to yield fewer runs. In the scatter, V Kohli sits alone in the extreme top-right, the only player to simultaneously maximise both balls faced and total runs.


Objective 5 — Outlier Detection & Removal

Boxplot Before/After IQR Removal + Distribution Comparison

The before boxplot shows extreme outliers reaching 52 runs per over (red dots above the whisker). After applying IQR bounds (Q1=12, Q3=20, upper limit=32), 369 overs (1.42%) were removed. The overlay histogram confirms the central distribution shape is fully preserved — only the extreme right tail is trimmed.
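The IQR rule described here can be sketched with the standard library alone. The helper iqr_bounds and the sample values are illustrative assumptions; the values were chosen so that Q1 = 12 and Q3 = 20, matching the quartiles reported above.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up runs-per-over values chosen so that Q1 = 12 and Q3 = 20.
runs_per_over = [8, 12, 12, 14, 16, 18, 20, 20, 52]

lower, upper = iqr_bounds(runs_per_over)
kept = [v for v in runs_per_over if lower <= v <= upper]

print(lower, upper)   # 0.0 32.0
print(kept)           # the 52-run over is dropped
```

With IQR = 8 the fences are 12 − 1.5 × 8 = 0 and 20 + 1.5 × 8 = 32, so only the upper fence can exclude anything, since an over cannot yield negative runs.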


Objective 6 — Linear Regression

Regression Line (Over vs Runs per Over) & Residual Plot

The positive slope (β₁ = +0.1198) indicates a slight average increase in scoring across the innings. MSE = 40.98, R² = 0.0106. The residual plot shows an even spread above and below the zero line with no systematic pattern, which is consistent with an unbiased fit and roughly constant variance (homoscedasticity).


Objective 7 — Hypothesis Testing: Boxplot Comparison

Powerplay (1–6) vs Death Overs (16–20) — Welch's T-Test

Death overs (red) show a higher median and wider IQR than Powerplay overs (green), indicating both higher scoring and greater variability. t = −10.707, p < 0.000001. The title annotation displays the test result directly on the chart.


Objective 7 — Hypothesis Testing: Mean Comparison

Average Runs per Over: Powerplay (15.40) vs Death (16.77)

Clean two-bar comparison with exact values labelled. The 1.37 run difference is consistent and systematic across thousands of overs — a structural feature of IPL innings, not a sample artefact.


Objective 8 — Advanced Analysis Part 1

Top 10 Batsmen by Total Runs & Top 10 Bowlers by Total Wickets

V Kohli (8,004 runs) leads run-scorers by a clear margin. YS Chahal tops wicket-takers with 213 wickets. Notably, 6 of the top 10 bowlers are spinners — confirming spin bowling as the dominant wicket-taking weapon in IPL conditions across 2008–2020.


Objective 8 — Advanced Analysis Part 2

Most Aggressive Batsmen (Boundaries) & Best Economy Bowlers

V Kohli also leads in total boundaries (979), followed by S Dhawan (921) and DA Warner (899). Leading both runs and boundaries makes Kohli the most complete T20 batsman in the dataset. Sohail Tanvir has the best economy rate at 6.23 runs/over (min. 10 overs), followed by A Chandila (6.28).


Additional — Runs Distribution by Match Phase

Phase-wise Boxplot: Death (16–20) | Middle (7–15) | Powerplay (1–6)

Middle overs show the highest absolute spread and widest IQR, reflecting 9 overs of batting with high match-to-match variability. Powerplay and Death are more compact. Death shows the lowest median aggregate despite having the highest per-over rate — explained by fewer overs (5 vs 9).


Additional — Boundary Contribution to Total Runs

59.9% of All Batsman Runs Come from Boundaries (4s & 6s)

The most striking single visualization in the project. Only 40.1% of batsman runs come from running between the wickets (singles, twos, and threes). This is strong evidence that T20 scoring is boundary-dependent: teams without boundary-hitting specialists face a structural scoring disadvantage.
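A boundary share of this kind can be computed as below. This is a sketch under an assumption: every delivery worth 4 or 6 batsman runs is counted as a boundary, since deliveries.csv as described here has no explicit boundary flag. The function name and the sample are hypothetical.

```python
def boundary_share(batsman_runs):
    """Fraction of batsman runs scored from deliveries worth 4 or 6."""
    total = sum(batsman_runs)
    from_boundaries = sum(r for r in batsman_runs if r in (4, 6))
    return from_boundaries / total if total else 0.0

# Made-up per-delivery batsman_runs values, not the project's data.
sample = [0, 1, 4, 0, 6, 1, 2, 0, 4, 1, 1, 0]

share = boundary_share(sample)
print(f"{share:.1%} of batsman runs came from boundaries")   # 70.0%
```

On the real data the same function would take the batsman_runs column, for example as a list or a pandas Series.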


Additional — Average Runs by Match Phase

Phase-level Aggregate: Powerplay (94.4) → Middle (137.8) → Death (87.5)

Middle overs peak purely due to volume (9 overs vs 6 for Powerplay and 5 for Death). This contrasts with per-over rates where Death (16.77) leads. The line plot shows this aggregate-vs-rate distinction clearly — an important nuance for strategic planning.


Tech Stack

| Category | Tool / Library | Purpose |
| --- | --- | --- |
| Language | Python 3.10+ | Core programming language |
| Data Manipulation | Pandas | DataFrame operations, groupby, aggregation |
| Numerical Computing | NumPy | Array operations, mathematical functions |
| Visualization | Matplotlib | Low-level plotting and figure control |
| Statistical Visualization | Seaborn | Heatmaps, countplots, boxplots |
| Machine Learning | Scikit-learn | LinearRegression, train_test_split, metrics |
| Statistical Testing | SciPy | Welch's t-test via scipy.stats.ttest_ind |
| Frontend | HTML5, CSS3, JS | Interactive portfolio website |
| IDE | VS Code | Development environment |
| Version Control | Git & GitHub | Code hosting and version control |

Project Structure

IPL-Intelligence-Engine/
│
├── ipl_analysis.py               # Main Python script — all 10 objectives
├── deliveries.csv                # IPL ball-by-ball dataset (2008–2020)
├── README.md                     # Project documentation
│
├── images/                       # All generated graph outputs
│   ├── univariate.png            # Obj 2 — Histogram + Dismissal Countplot
│   ├── bivariate.png             # Obj 3 — Over Run Rate + Top Batsmen
│   ├── multivariate.png          # Obj 4 — Correlation Heatmap + Scatter
│   ├── outlier_detection.png     # Obj 5 — IQR Before/After + Distribution
│   ├── linear_regression.png     # Obj 6 — Regression Line + Residual Plot
│   ├── hypothesis_boxplot.png    # Obj 7 — Powerplay vs Death Boxplot
│   ├── hypothesis_bar.png        # Obj 7 — Mean Comparison Bar Chart
│   ├── advanced_analysis_1.png   # Obj 8 — Top Batsmen + Top Bowlers
│   ├── advanced_analysis_2.png   # Obj 8 — Boundaries + Economy Rates
│   ├── run_distribution.png      # Additional — Phase Boxplot
│   ├── boundary_pie.png          # Additional — Boundary Contribution Pie
│   └── phase_line.png            # Additional — Phase Average Line Plot
│
├── index.html                    # Website entry point
├── styles.css                    # Website stylesheet (dark editorial theme)
└── script.js                     # Website interactivity (scroll, lightbox, nav)

Live Website

URL: https://immortalshubham.github.io/IPL-Intelligence-Engine/

The website presents all findings in a deployed, fully interactive static site organized into six sections: Hero, About, Dataset, Analysis (all 10 objectives with embedded graphs), Insights, and Strategy Recommendations. Features include scroll-reveal animations, fullscreen image lightbox on click, sticky quick-navigation bar, and full mobile responsiveness.


GitHub Repository

Repository: https://github.com/ImmortalShubham/IPL-Intelligence-Engine

Contains the complete Python analysis script, all 12 visualization images, the full frontend website source (HTML/CSS/JS), dataset, and this documentation.


How to Run

Prerequisites

Python 3.10+
pip

Step-by-Step

# 1. Clone the repository
git clone https://github.com/ImmortalShubham/IPL-Intelligence-Engine.git
cd IPL-Intelligence-Engine

# 2. Install required libraries
pip install pandas numpy matplotlib seaborn scikit-learn scipy

# 3. Place deliveries.csv in the project root (same level as ipl_analysis.py)

# 4. Run the full analysis
python ipl_analysis.py

What Happens When You Run It

| Step | Output |
| --- | --- |
| Objective 1 | Dataset shape, head, info, statistical summary printed to console |
| Objectives 2–4 | 5 graphs displayed sequentially (univariate, bivariate, multivariate) |
| Objective 5 | IQR parameters printed + 3 outlier detection graphs displayed |
| Objective 6 | Regression metrics printed + 2 graphs displayed |
| Objective 7 | T-test results printed + 2 hypothesis graphs displayed |
| Objective 8 | 4 advanced analysis graphs + pie chart + phase line plot displayed |
| Objectives 9–10 | All insights and strategy recommendations printed to console |

Viewing the Website

Open index.html in any modern browser — no server, no build step needed.


Key Learnings

  • Semantic missing values require domain understanding. Dropping NaN rows here would have removed 95%+ of the dataset, a critical mistake that a mechanical drop-or-impute routine would not catch.
  • Two-level aggregation is essential for fair over-level averages. Averaging raw ball-level data inflates results for overs with extra deliveries; grouping by (match, over) first, then averaging across matches, is the correct approach.
  • Feature engineering pays dividends. The match_phase column was created in one .apply() call and reused across five separate objectives — a small investment with large analytical return.
  • R² is not the only success metric. An R² of 0.0106 is the correct result here. The coefficient is the meaningful output, not the explained variance.
  • Welch's t-test over Student's t-test. In real-world analysis, equal variance between groups is rarely justified without testing. Welch's variant is the safer default.
  • Outlier context matters. Applying IQR at the over level (not the ball level) made the removal analytically meaningful — a 52-run over is an outlier; a 6-run delivery is not.
  • Chart type selection is analytical decision-making. Line plots for trends, heatmaps for relationships, residual plots for model validation — each choice communicates something the others cannot.
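The two-level aggregation mentioned in the list above can be sketched as follows, assuming pandas. The frame is made up, and grouping on match_id and over follows the "(match, over)" wording; the project's exact grouping keys may differ.

```python
import pandas as pd

# Illustrative deliveries: match 1 has a 7-ball over (one extra delivery),
# match 2 has a regular 6-ball over.
balls = pd.DataFrame({
    "match_id":   [1] * 7 + [2] * 6,
    "over":       [1] * 13,
    "total_runs": [1] * 7 + [2] * 6,
})

# Level 1: total runs in each (match, over).
per_over = (
    balls.groupby(["match_id", "over"], as_index=False)["total_runs"]
    .sum()
    .rename(columns={"total_runs": "runs_per_over"})
)

# Level 2: average those over totals across matches.
avg_by_over = per_over.groupby("over")["runs_per_over"].mean()
print(avg_by_over)
```

Each over contributes exactly one value to the second-level mean regardless of how many deliveries it contained, which is what makes the per-over averages comparable.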

Future Improvements

| Area | Description |
| --- | --- |
| Advanced ML Models | Random Forest or XGBoost for multi-feature run prediction; expected to outperform linear regression significantly |
| Match Outcome Prediction | Binary classification model predicting match winner from partial-innings data (logistic regression, SVM) |
| Player Similarity Clustering | K-Means or hierarchical clustering to group batsmen/bowlers into performance archetypes |
| Partnership Analysis | Track batter and non_striker pairs to identify the highest-output batting combinations |
| Venue Analysis | Merge with matches.csv for ground-specific performance breakdowns |
| Dataset Extension | Extend to the 2021–2024 IPL seasons to test whether death-over dominance has intensified |
| NLP Commentary | Auto-generate natural language summaries using a language model as data updates |

Author

Shubham Kumar
B.Tech CSE, Section 3M031 | Registration No: 12405152
Lovely Professional University, Phagwara, Punjab

| Platform | Link |
| --- | --- |
| GitHub | @ImmortalShubham |
| Live Project | IPL Intelligence Engine |

Submitted under the guidance of Dr. Mrinalini Rana (UID: 22138), Assistant Professor, School of CSE, LPU — INT375 (Data Science Tool Box: Python Programming), January–April 2026.


License

This project is licensed under the MIT License.

MIT License

Copyright (c) 2026 Shubham Kumar

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

IPL Intelligence Engine — INT375 Project Report, Lovely Professional University, April 2026
