Transforming IPL ball-by-ball data into actionable cricket intelligence using statistical analysis and machine learning.
The Indian Premier League generates one of the richest ball-by-ball datasets in professional sport. With over 260,000 delivery records spanning 2008 to 2020, this project applies the full data science workflow to answer questions that teams, analysts, and fans have debated for years: Which phase of the innings matters most? Are boundaries the primary driver of T20 scoring? Do death overs genuinely score more than the powerplay?
This project was submitted as part of INT375 — Data Science Tool Box: Python Programming at Lovely Professional University. It covers ten analytical objectives — from raw data cleaning to a formal statistical hypothesis test — and presents all findings through a deployed interactive website.
| Property | Detail |
|---|---|
| Total Deliveries Analyzed | 260,000+ |
| Seasons Covered | 2008–2020 |
| Key Focus | Match phase analysis, player performance, statistical validation |
| Analytical Objectives | 10 |
| Visualizations Produced | 12 |
| Output | Interactive website + Python visual analytics |
- Ball-by-ball analysis of 260,920 IPL deliveries across 13 seasons (2008–2020)
- Full EDA pipeline: cleaning, missing value handling, feature engineering, and outlier removal
- Three levels of exploratory analysis: univariate, bivariate, and multivariate
- Linear Regression model quantifying the relationship between over number and run rate
- Welch's independent t-test providing statistically rigorous phase comparison (p < 0.000001)
- Advanced player analytics: top run-scorers, wicket-takers, boundary hitters, economy rates
- Deployed interactive portfolio website built with HTML5, CSS3, and vanilla JavaScript
- All findings are data-backed, reproducible, and fully documented
| Property | Value |
|---|---|
| Source | Kaggle — IPL Complete Dataset (2008–2020) |
| Contributor | Patrick B |
| File Used | deliveries.csv |
| Raw Shape | 260,920 rows × 17 columns |
| Post-Cleaning Shape | ~256,000 rows × 18 columns |
| Coverage | IPL Seasons 2008–2020 |
| Granularity | One row per delivery bowled |
| Column | Type | Description |
|---|---|---|
| match_id | Integer | Unique match identifier |
| inning | Integer | Innings number (1–2; super overs = 3+) |
| batting_team | String | Team currently batting |
| bowling_team | String | Team currently bowling |
| over | Integer | Over number (0-indexed in raw; corrected to 1–20) |
| ball | Integer | Ball number within the over |
| batter | String | Batsman facing the delivery |
| bowler | String | Bowler delivering the ball |
| non_striker | String | Batsman at the non-striking end |
| batsman_runs | Integer | Runs scored by the batsman |
| extra_runs | Integer | Extra runs conceded |
| total_runs | Integer | Total runs from the delivery |
| extras_type | String | Type of extra (NaN if none) |
| is_wicket | Integer | Binary flag: 1 if wicket fell |
| player_dismissed | String | Dismissed player name (NaN if no wicket) |
| dismissal_kind | String | Type of dismissal (NaN if no wicket) |
| fielder | String | Fielder involved (NaN if not applicable) |
Note on missing values: player_dismissed, dismissal_kind, fielder, and extras_type are NaN when no wicket or no extra occurred. These are semantically meaningful absences, not data corruption. Rows were never dropped; NaNs were filled with "not_dismissed" and "none".
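The semantic fill described in the note can be sketched as follows. This is a minimal illustration on a three-row stand-in for deliveries.csv, not the project's actual code. Which placeholder goes to which column is an assumption, since the note only names the two fill values.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for deliveries.csv (names and values are made up).
df = pd.DataFrame({
    "total_runs": [1, 0, 4],
    "is_wicket": [0, 1, 0],
    "player_dismissed": [np.nan, "A Batter", np.nan],
    "dismissal_kind": [np.nan, "caught", np.nan],
    "fielder": [np.nan, "A Fielder", np.nan],
    "extras_type": [np.nan, np.nan, "wides"],
})

# Assumed mapping of columns to placeholders.
fill_values = {
    "player_dismissed": "not_dismissed",
    "dismissal_kind": "not_dismissed",
    "fielder": "none",
    "extras_type": "none",
}

# fillna with a dict fills each column with its own placeholder
# and keeps every row.
filled = df.fillna(value=fill_values)
print(filled.isna().sum().sum(), "NaNs left in", len(filled), "rows")
```

Filling rather than dropping keeps the row count unchanged, which matters here because most deliveries have neither a wicket nor an extra.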
deliveries.csv
│
▼
┌─────────────────────────────────────────────┐
│ OBJECTIVE 1 — Data Cleaning & EDA │
│ • Load CSV, inspect shape, head, info │
│ • Handle missing values (semantic fills) │
│ • Remove duplicates (0 found) │
│ • Correct over index (0–19 → 1–20) │
│ • Remove super overs (inning > 2) │
│ • Engineer match_phase feature column │
└─────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ OBJECTIVES 2–4 — Exploratory Analysis │
│ • Univariate: run distribution, dismissals │
│ • Bivariate: over vs run rate, top batsmen │
│ • Multivariate: correlation heatmap, │
│ scatter of balls faced vs total runs │
└─────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ OBJECTIVE 5 — Outlier Detection (IQR) │
│ • Aggregate to runs-per-over level │
│ • Q1=12, Q3=20, IQR=8, bounds [0, 32] │
│ • Removed 369 outlier overs (1.42%) │
│ • Returns over_clean DataFrame │
└─────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ OBJECTIVE 6 — Linear Regression │
│ • X: over number, y: runs_per_over │
│ • 80/20 train-test split (seed=42) │
│ • MSE=40.98, R²=0.0106, coef=+0.1198 │
└─────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ OBJECTIVE 7 — Hypothesis Testing (T-Test) │
│ • Welch's t-test: Powerplay vs Death overs │
│ • t = −10.707, p < 0.000001 │
│ • Reject H₀ — statistically significant │
└─────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ OBJECTIVE 8 — Advanced Player Analytics │
│ • Top batsmen, top wicket-takers │
│ • Boundary leaders, economy specialists │
│ • Phase-wise run distribution & pie chart │
└─────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ OBJECTIVES 9–10 — Insights & Strategy │
│ • 7 data-backed key findings │
│ • Phase-specific batting & bowling recs │
└─────────────────────────────────────────────┘
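The Objective 1 cleaning steps in the diagram (re-indexing overs, removing super overs, and adding match_phase) might look roughly like the sketch below. The phase boundaries follow the ranges used elsewhere in this README (Powerplay 1–6, Middle 7–15, Death 16–20). The helper name `phase_for` and the sample rows are hypothetical.

```python
import pandas as pd

# Made-up rows: the raw over column is 0-indexed, and inning 3 is a super over.
df = pd.DataFrame({
    "match_id": [1, 1, 1, 1],
    "inning": [1, 1, 2, 3],
    "over": [0, 6, 19, 0],
    "total_runs": [4, 1, 6, 2],
})

def phase_for(over: int) -> str:
    """Map a 1-indexed over number to its match phase."""
    if over <= 6:
        return "Powerplay"
    if over <= 15:
        return "Middle"
    return "Death"

clean = df[df["inning"] <= 2].copy()       # drop super overs
clean["over"] = clean["over"] + 1          # 0–19 becomes 1–20
clean["match_phase"] = clean["over"].apply(phase_for)
print(clean[["over", "match_phase"]])
```

Deriving the phase once and storing it as a column lets later steps group or filter by phase without repeating the boundary logic.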
Objective: Quantify how much runs per over increases with each successive over, and evaluate how well a linear model captures this relationship.
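from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# over_clean is the outlier-filtered, over-level DataFrame from Objective 5
X = over_clean[["over"]].values          # Independent variable: over number
y = over_clean["runs_per_over"].values   # Dependent variable: runs per over

# 80/20 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Score on the held-out 20% (assumed; the split used for the metrics is not stated)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

Results: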
| Metric | Value | Interpretation |
|---|---|---|
| Intercept (β₀) | 14.3186 | Baseline runs/over at over 0 (theoretical) |
| Coefficient (β₁) | +0.1198 | Each additional over adds ~0.12 runs on average |
| MSE | 40.9804 | Average squared prediction error |
| R² | 0.0106 | Over number explains ~1% of variance in runs/over |
| Train/Test Split | 80% / 20% | Fixed seed (random_state=42) for reproducibility |
On the low R²: The R² of 0.0106 is expected and informative — not a model failure. Over number alone cannot capture pitch conditions, batting lineup, or match situation. The coefficient +0.1198 is the meaningful output: a quantified, reproducible upward trend across ~25,700 clean overs.
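To see why a real trend can coexist with a low R², it may help to plug the reported coefficients back into the fitted line. The sketch below uses only the intercept, slope, and MSE from the table above. The function name `predicted_runs` is illustrative.

```python
from math import sqrt

INTERCEPT = 14.3186   # β₀ from the results table
SLOPE = 0.1198        # β₁ from the results table
MSE = 40.9804         # reported mean squared error

def predicted_runs(over: int) -> float:
    """Fitted runs per over for a given 1-indexed over number."""
    return INTERCEPT + SLOPE * over

first, last = predicted_runs(1), predicted_runs(20)
rmse = sqrt(MSE)

print(f"over 1: {first:.2f}, over 20: {last:.2f}, rise: {last - first:.2f}")
# over 1: 14.44, over 20: 16.71, rise: 2.28
print(f"typical prediction error (RMSE): {rmse:.2f}")
# typical prediction error (RMSE): 6.40
```

The fitted line rises by about 2.3 runs across the whole innings, while a typical over misses the line by about 6.4 runs. The trend is there, but over-to-over noise is far larger, which is what an R² near 0.01 expresses.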
Question: Do Death overs (16–20) score significantly more runs per over than Powerplay overs (1–6)?
Method: Welch's independent two-sample t-test (equal_var=False) — chosen because the two phases have different variance structures, making the equal-variance assumption of Student's t-test inappropriate.
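from scipy import stats

# Over-level runs for each phase, taken from the outlier-filtered data
powerplay = over_clean[over_clean["over"].between(1, 6)]["runs_per_over"]
death = over_clean[over_clean["over"].between(16, 20)]["runs_per_over"]

# equal_var=False selects Welch's test. Powerplay is passed first, so a
# negative t-statistic means the Powerplay mean is below the Death mean.
t_stat, p_value = stats.ttest_ind(powerplay, death, equal_var=False)

Results: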
| Parameter | Value |
|---|---|
| H₀ | Mean runs/over in Powerplay = Mean in Death overs |
| H₁ | They differ significantly |
| α (Significance Level) | 0.05 |
| Powerplay Mean | 15.40 runs/over (n = 5,094) |
| Death Overs Mean | 16.77 runs/over (n = 4,220) |
| T-Statistic | −10.7068 |
| P-Value | < 0.000001 |
| Conclusion | Reject H₀ — statistically significant difference |
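For readers who want to see what Welch's test computes, here is a small standard-library sketch of the statistic and its Welch–Satterthwaite degrees of freedom. The two samples are made-up numbers, not IPL data, and `welch_t` is an illustrative helper rather than part of the project script.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    va_n = variance(a) / len(a)   # sample variance over n, first group
    vb_n = variance(b) / len(b)   # sample variance over n, second group
    t = (mean(a) - mean(b)) / sqrt(va_n + vb_n)
    df = (va_n + vb_n) ** 2 / (
        va_n ** 2 / (len(a) - 1) + vb_n ** 2 / (len(b) - 1)
    )
    return t, df

# Made-up samples: the first group has the lower mean, so t is negative.
powerplay = [10.0, 12.0, 14.0]
death = [20.0, 22.0, 24.0, 26.0]

t_stat, dof = welch_t(powerplay, death)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The sign of t depends only on argument order. With the Powerplay sample passed first and the lower mean, the statistic is negative, matching the sign reported in the table.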
| # | Finding | Supporting Evidence |
|---|---|---|
| 1 | Death overs produce the highest scoring rate | 16.77 runs/over vs 15.40 in Powerplay (p < 0.000001) |
| 2 | V Kohli is the all-time IPL run leader | 8,004 total runs and 979 boundaries — leads both categories |
| 3 | Boundaries account for 59.9% of all batsman runs | Pie chart confirms T20 is a boundary-dependent format |
| 4 | Caught is the dominant dismissal at 62% | 8,053 caught dismissals out of ~13,000 total wickets |
| 5 | Run rate dips consistently at overs 7–8 | Visible across all seasons — prime window for spin bowling |
| 6 | YS Chahal leads all wicket-takers with 213 | 6 of top 10 wicket-takers are spinners — spin dominates IPL |
| 7 | Over number alone has weak predictive power | Linear Regression R² ≈ 0.01 — match context dominates over position |
Histogram of total_runs & Dismissal Type Countplot
The distribution of runs per delivery is heavily right-skewed — 0 runs and 1 run together account for over 75% of all deliveries (mean = 1.333, median = 1.0). The countplot confirms "caught" as the dominant dismissal at 8,053 occurrences (62% of all wickets), followed by bowled (2,204) and run out (1,107).
Line Plot: Over vs Average Runs & Top 10 Batsmen by Total Runs
The line plot reveals the consistent run-rate dip at over 7 (the transition window after the powerplay) and the steady climb through the death overs, peaking near overs 17–18. The bar chart shows V Kohli at 8,004 runs, 1,235 ahead of second-placed S Dhawan (6,769).
Correlation Heatmap & Scatter: Balls Faced vs Total Runs (Top 30 Batsmen)
total_runs and batsman_runs show near-perfect correlation (r = 0.98). is_wicket vs total_runs is weakly negative (−0.18) — wicket deliveries tend to yield fewer runs. In the scatter, V Kohli sits alone in the extreme top-right, the only player to simultaneously maximise both balls faced and total runs.
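A correlation matrix like the one behind the heatmap can be produced with `DataFrame.corr()`. The sketch below uses eight made-up deliveries and assumes total_runs is the sum of batsman_runs and extra_runs, which would explain why those two columns track each other so closely.

```python
import pandas as pd

# Eight made-up deliveries, not real IPL data.
df = pd.DataFrame({
    "batsman_runs": [0, 1, 4, 0, 6, 1, 0, 2],
    "extra_runs":   [0, 0, 0, 1, 0, 0, 0, 0],
    "is_wicket":    [0, 0, 0, 0, 0, 0, 1, 0],
})
# Assumption: total_runs = batsman_runs + extra_runs.
df["total_runs"] = df["batsman_runs"] + df["extra_runs"]

# Pearson correlation between every pair of numeric columns.
corr = df[["batsman_runs", "extra_runs", "total_runs", "is_wicket"]].corr()
print(corr.round(2))
```

The resulting matrix can be passed to `seaborn.heatmap(corr, annot=True)` to draw the annotated grid.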
Boxplot Before/After IQR Removal + Distribution Comparison
The before boxplot shows extreme outliers reaching 52 runs per over (red dots above the whisker). After applying IQR bounds (Q1=12, Q3=20, upper limit=32), 369 overs (1.42%) were removed. The overlay histogram confirms the central distribution shape is fully preserved — only the extreme right tail is trimmed.
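The reported bounds are consistent with the conventional 1.5 × IQR rule, which the sketch below applies to the quartiles quoted above. The six runs_per_over values are made up, and `iqr_bounds` is an illustrative helper name.

```python
import pandas as pd

def iqr_bounds(q1: float, q3: float, k: float = 1.5):
    """Lower and upper fences for the IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported in this README: Q1 = 12, Q3 = 20.
lower, upper = iqr_bounds(12, 20)
print(lower, upper)   # 0.0 32.0

# Made-up over-level totals; 52 mimics the extreme over mentioned above.
overs = pd.DataFrame({"runs_per_over": [3, 12, 16, 20, 32, 52]})

# between() is inclusive at both ends, so an over of exactly 32 is kept.
over_clean = overs[overs["runs_per_over"].between(lower, upper)]
print(len(overs) - len(over_clean), "over(s) removed")
```

On real data the quartiles would come from `overs["runs_per_over"].quantile([0.25, 0.75])` rather than being typed in.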
Regression Line (Over vs Runs per Over) & Residual Plot
The positive slope (β₁ = +0.1198) indicates a gradual scoring increase across the innings. MSE = 40.98, R² = 0.0106. The residual plot shows a roughly even spread above and below the zero line with no systematic pattern, which is consistent with an unbiased fit and roughly constant variance.
Powerplay (1–6) vs Death Overs (16–20) — Welch's T-Test
Death overs (red) show a higher median and wider IQR than Powerplay overs (green), indicating both higher scoring and greater variability. t = −10.707, p < 0.000001. The title annotation displays the test result directly on the chart.
Average Runs per Over: Powerplay (15.40) vs Death (16.77)
Clean two-bar comparison with exact values labelled. The 1.37 run difference is consistent and systematic across thousands of overs — a structural feature of IPL innings, not a sample artefact.
Top 10 Batsmen by Total Runs & Top 10 Bowlers by Total Wickets
V Kohli (8,004 runs) leads run-scorers by a clear margin. YS Chahal tops wicket-takers with 213 wickets. Notably, 6 of the top 10 bowlers are spinners — confirming spin bowling as the dominant wicket-taking weapon in IPL conditions across 2008–2020.
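Leaderboards like these reduce to a group-by and a top-N selection. The sketch below uses made-up players and assumes a bowler's wickets are counted as the sum of is_wicket. That simple count would also credit run-outs to the bowler, and this README does not say whether the project filters them out.

```python
import pandas as pd

# Made-up deliveries, not real IPL data.
df = pd.DataFrame({
    "batter":       ["A", "A", "B", "B", "B", "C"],
    "batsman_runs": [4, 6, 1, 1, 0, 3],
    "bowler":       ["X", "Y", "X", "X", "Y", "Y"],
    "is_wicket":    [0, 0, 1, 1, 0, 1],
})

# Total runs per batter, largest first.
top_batters = df.groupby("batter")["batsman_runs"].sum().nlargest(2)

# Wickets per bowler (assumption: every is_wicket row counts for the bowler).
top_bowlers = df.groupby("bowler")["is_wicket"].sum().nlargest(2)

print(top_batters)
print(top_bowlers)
```

Replacing `2` with `10` gives top-10 tables of the kind plotted in the bar charts.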
Most Aggressive Batsmen (Boundaries) & Best Economy Bowlers
V Kohli also leads in total boundaries (979), followed by S Dhawan (921) and DA Warner (899), so he tops both the run and boundary charts in this dataset. Sohail Tanvir has the best economy rate at 6.23 runs/over (min. 10 overs), followed by A Chandila (6.28).
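One way an economy table with a minimum-overs cut-off could be computed is sketched below. It rests on several assumptions that this README does not confirm: that extras_type uses the labels "wides", "noballs", "byes", and "legbyes", that wides and no-balls are not counted as legal deliveries, and that byes and leg-byes are not charged to the bowler. The data and the threshold of 2 overs are made up for the toy example.

```python
import pandas as pd

# Made-up deliveries: B1 bowls 12 legal balls plus one wide, B2 bowls 6 balls.
df = pd.DataFrame({
    "bowler": ["B1"] * 13 + ["B2"] * 6,
    "total_runs": [1] * 13 + [2] * 6,
    "extras_type": ["none"] * 11 + ["legbyes", "wides"] + ["none"] * 6,
})

# Assumed conventions (see lead-in): which balls are legal, which runs count.
legal = ~df["extras_type"].isin(["wides", "noballs"])
charged = ~df["extras_type"].isin(["byes", "legbyes"])

stats = pd.DataFrame({
    "balls": legal.groupby(df["bowler"]).sum(),
    "runs": df["total_runs"].where(charged, 0).groupby(df["bowler"]).sum(),
})
stats["overs"] = stats["balls"] / 6
stats["economy"] = stats["runs"] / stats["overs"]

min_overs = 2   # toy threshold; the README uses a minimum of 10 overs
qualified = stats[stats["overs"] >= min_overs].sort_values("economy")
print(qualified)
```

The minimum-overs filter matters because a bowler with a single tidy over would otherwise top the table on a tiny sample.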
Phase-wise Boxplot: Death (16–20) | Middle (7–15) | Powerplay (1–6)
Middle overs show the highest absolute spread and widest IQR, reflecting 9 overs of batting with high match-to-match variability. Powerplay and Death are more compact. Death shows the lowest median aggregate despite having the highest per-over rate — explained by fewer overs (5 vs 9).
59.9% of All Batsman Runs Come from Boundaries (4s & 6s)
This is the most striking single visualization in the project. Only 40.1% of batsman runs come from running between the wickets (singles, twos, and threes). T20 scoring is therefore strongly boundary-dependent: teams without boundary-hitting specialists face a structural scoring disadvantage.
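The boundary share behind the pie chart is a ratio of two sums. The sketch below uses ten made-up deliveries and assumes any delivery with 4 or 6 batsman runs is a boundary, which would also count the rare all-run four.

```python
import pandas as pd

# Ten made-up deliveries, not real IPL data.
batsman_runs = pd.Series([0, 1, 4, 6, 2, 1, 4, 0, 1, 1])

# Assumption: 4 or 6 batsman runs on a delivery means a boundary.
boundary_runs = batsman_runs[batsman_runs.isin([4, 6])].sum()
share = boundary_runs / batsman_runs.sum()

print(f"{share:.1%} of batsman runs came from boundaries")
# 70.0% of batsman runs came from boundaries
```

The complement, `1 - share`, gives the slice for runs scored between the wickets.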
Phase-level Aggregate: Powerplay (94.4) → Middle (137.8) → Death (87.5)
Middle overs peak purely due to volume (9 overs vs 6 for Powerplay and 5 for Death). This contrasts with per-over rates where Death (16.77) leads. The line plot shows this aggregate-vs-rate distinction clearly — an important nuance for strategic planning.
| Category | Tool / Library | Purpose |
|---|---|---|
| Language | Python 3.10+ | Core programming language |
| Data Manipulation | Pandas | DataFrame operations, groupby, aggregation |
| Numerical Computing | NumPy | Array operations, mathematical functions |
| Visualization | Matplotlib | Low-level plotting and figure control |
| Statistical Visualization | Seaborn | Heatmaps, countplots, boxplots |
| Machine Learning | Scikit-learn | LinearRegression, train_test_split, metrics |
| Statistical Testing | SciPy | Welch's t-test via scipy.stats.ttest_ind |
| Frontend | HTML5, CSS3, JS | Interactive portfolio website |
| IDE | VS Code | Development environment |
| Version Control | Git & GitHub | Code hosting and version control |
IPL-Intelligence-Engine/
│
├── ipl_analysis.py # Main Python script — all 10 objectives
├── deliveries.csv # IPL ball-by-ball dataset (2008–2020)
├── README.md # Project documentation
│
├── images/ # All generated graph outputs
│ ├── univariate.png # Obj 2 — Histogram + Dismissal Countplot
│ ├── bivariate.png # Obj 3 — Over Run Rate + Top Batsmen
│ ├── multivariate.png # Obj 4 — Correlation Heatmap + Scatter
│ ├── outlier_detection.png # Obj 5 — IQR Before/After + Distribution
│ ├── linear_regression.png # Obj 6 — Regression Line + Residual Plot
│ ├── hypothesis_boxplot.png # Obj 7 — Powerplay vs Death Boxplot
│ ├── hypothesis_bar.png # Obj 7 — Mean Comparison Bar Chart
│ ├── advanced_analysis_1.png # Obj 8 — Top Batsmen + Top Bowlers
│ ├── advanced_analysis_2.png # Obj 8 — Boundaries + Economy Rates
│ ├── run_distribution.png # Additional — Phase Boxplot
│ ├── boundary_pie.png # Additional — Boundary Contribution Pie
│ └── phase_line.png # Additional — Phase Average Line Plot
│
├── index.html # Website entry point
├── styles.css # Website stylesheet (dark editorial theme)
└── script.js # Website interactivity (scroll, lightbox, nav)
URL: https://immortalshubham.github.io/IPL-Intelligence-Engine/
The website presents all findings in a deployed, fully interactive static site organized into six sections: Hero, About, Dataset, Analysis (all 10 objectives with embedded graphs), Insights, and Strategy Recommendations. Features include scroll-reveal animations, fullscreen image lightbox on click, sticky quick-navigation bar, and full mobile responsiveness.
Repository: https://github.com/ImmortalShubham/IPL-Intelligence-Engine
Contains the complete Python analysis script, all 12 visualization images, the full frontend website source (HTML/CSS/JS), dataset, and this documentation.
Python 3.10+
pip
# 1. Clone the repository
git clone https://github.com/ImmortalShubham/IPL-Intelligence-Engine.git
cd IPL-Intelligence-Engine
# 2. Install required libraries
pip install pandas numpy matplotlib seaborn scikit-learn scipy
# 3. Place deliveries.csv in the project root (same level as ipl_analysis.py)
# 4. Run the full analysis
python ipl_analysis.py

| Step | Output |
|---|---|
| Objective 1 | Dataset shape, head, info, statistical summary printed to console |
| Objectives 2–4 | 5 graphs displayed sequentially (univariate, bivariate, multivariate) |
| Objective 5 | IQR parameters printed + 3 outlier detection graphs displayed |
| Objective 6 | Regression metrics printed + 2 graphs displayed |
| Objective 7 | T-test results printed + 2 hypothesis graphs displayed |
| Objective 8 | 4 advanced analysis graphs + pie chart + phase line plot displayed |
| Objectives 9–10 | All insights and strategy recommendations printed to console |
Open index.html in any modern browser — no server, no build step needed.
- Semantic missing values require domain understanding. Dropping NaN rows here would have removed 95%+ of the dataset, because most deliveries involve no wicket and no extra. A blanket dropna() or generic imputation would miss that these NaNs are meaningful.
- Two-level aggregation is essential for fair over-level averages. Averaging raw ball-level data inflates results for overs with extra deliveries; grouping by (match, over) first, then averaging across matches, is the correct approach.
- Feature engineering pays dividends. The match_phase column was created in one .apply() call and reused across five separate objectives: a small investment with large analytical return.
- R² is not the only success metric. An R² of 0.0106 is the correct result here. The coefficient is the meaningful output, not the explained variance.
- Welch's t-test over Student's t-test. In real-world analysis, equal variance between groups is rarely justified without testing. Welch's variant is the safer default.
- Outlier context matters. Applying IQR at the over level (not the ball level) made the removal analytically meaningful — a 52-run over is an outlier; a 6-run delivery is not.
- Chart type selection is analytical decision-making. Line plots for trends, heatmaps for relationships, residual plots for model validation — each choice communicates something the others cannot.
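The two-level aggregation from the second lesson can be sketched as below: sum deliveries into one total per (match, over), then average those totals across matches. The data is made up. The grouping keys follow the wording of the lesson, and adding inning to the keys would keep the two innings of a match separate.

```python
import pandas as pd

# Made-up data for over 1 in two matches: match 1 has 8 deliveries
# (two extras) for 8 runs, match 2 has 6 deliveries for 12 runs.
balls = pd.DataFrame({
    "match_id": [1] * 8 + [2] * 6,
    "over": [1] * 14,
    "total_runs": [1] * 8 + [2] * 6,
})

# Level 1: one row per (match, over) holding the runs scored in that over.
per_over = (
    balls.groupby(["match_id", "over"], as_index=False)["total_runs"]
    .sum()
    .rename(columns={"total_runs": "runs_per_over"})
)

# Level 2: average those over totals across matches.
avg_by_over = per_over.groupby("over")["runs_per_over"].mean()
print(avg_by_over)
```

Each match contributes exactly one value per over at the second level, so an over with extra deliveries does not carry more weight than a six-ball over.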
| Area | Description |
|---|---|
| Advanced ML Models | Random Forest or XGBoost for multi-feature run prediction — expected to outperform linear regression significantly |
| Match Outcome Prediction | Binary classification model predicting match winner from partial-innings data (logistic regression, SVM) |
| Player Similarity Clustering | K-Means or hierarchical clustering to group batsmen/bowlers into performance archetypes |
| Partnership Analysis | Track batter–non_striker pairs to identify the highest-output batting combinations |
| Venue Analysis | Merge with matches.csv for ground-specific performance breakdowns |
| Dataset Extension | Update to 2021–2024 IPL seasons — test whether death-over dominance has intensified |
| NLP Commentary | Auto-generate natural language summaries using a language model as data updates |
Shubham Kumar
B.Tech CSE, Section 3M031 | Registration No: 12405152
Lovely Professional University, Phagwara, Punjab
| Platform | Link |
|---|---|
| GitHub | @ImmortalShubham |
| Live Project | IPL Intelligence Engine |
Submitted under the guidance of Dr. Mrinalini Rana (UID: 22138), Assistant Professor, School of CSE, LPU — INT375 (Data Science Tool Box: Python Programming), January–April 2026.
This project is licensed under the MIT License.
MIT License
Copyright (c) 2026 Shubham Kumar
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
IPL Intelligence Engine — INT375 Project Report, Lovely Professional University, April 2026











