Halftime to Fulltime Predictor

To what extent are first-half statistics predictive of the final match result?

A football analytics project using StatsBomb open data for the Italian Serie A 2015/16 season (380 matches, 1.35M event rows). The analysis combines advanced metric computation from raw event data with machine learning classification to quantify how much the first half tells us about the final outcome.

Key Results

Metric	Value
Result stability (HT result == FT result)	62.6%
HT home leads kept at FT	81.7%
HT away leads kept at FT	76.1%
HT draws kept at FT	41.0% (most volatile state)
Cramér's V (HT → FT association)	0.494 (moderate, p < 0.001)
Best model (SVM, all features)	56.8% test accuracy, macro AUC 0.757
Majority-class baseline	46.3%

TL;DR: The half-time score is the dominant predictor. Process metrics (possession, pressing, progressive passes) have limited standalone power on a single-season dataset. Draws are structurally unpredictable from first-half data alone.

Project Structure

halftime-to-fulltime-predictor/
│
├── halftime_to_fulltime_predictor.ipynb   # Main analysis notebook
│
├── data/
│   ├── download_data.py                   # Script to download StatsBomb data
│   └── README.md                          # Data documentation
│
├── visualizations/                        # All plots saved here on execution
│
├── assets/
│   └── teams_logos/                       # Optional: team logo PNGs
│
├── fonts/                                 # Optional: Teko font family (.ttf)
│
├── requirements.txt
├── .gitignore
└── README.md

Analysis Pipeline

Phase	Description
1. Setup	Libraries, global config, reproducibility seed (`RANDOM_STATE = 42`)
2. Data loading	Shape inspection, event type distribution
3. Standings table	Basic table enriched with xG, PPDA, Field Tilt, possession, progressive passes, key passes, pressure height
4. Visualisations	Goals vs xG, shot maps, passing zone heatmaps, pressing scatter
5. Match results table	Per-match HT/FT stats built with vectorised pandas groupby operations
6. HT–FT analysis	Result stability, comeback rates, transition matrix
7. Modelling	Chi-square test, majority-class baseline, SVM / Decision Tree / Naive Bayes, stratified 5-fold CV, ROC curves, feature importance
8. No-goals analysis	Full modelling pipeline repeated excluding goal-related features
9. Conclusions	Findings, per-model comparison table, limitations, implications
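
As an illustration of phases 5 and 6, the sketch below derives half-time and full-time results per match with pandas groupby operations. It assumes a hypothetical, already-flattened `goals` table with one row per goal and the columns `match_id`, `side` and `period`; the notebook's actual column names and its handling of own goals may differ.

```python
import numpy as np
import pandas as pd


def count_goals(goals: pd.DataFrame, match_ids: list) -> pd.DataFrame:
    """Goals per match and side; matches without goals get explicit zeros."""
    counts = goals.groupby(["match_id", "side"]).size().unstack(fill_value=0)
    return counts.reindex(index=match_ids, columns=["home", "away"], fill_value=0)


def build_match_results(goals: pd.DataFrame, match_ids: list) -> pd.DataFrame:
    """One row per match with HT/FT scores and H/D/A result labels."""
    ht = count_goals(goals[goals["period"] == 1], match_ids)  # first half only
    ft = count_goals(goals, match_ids)                        # whole match
    out = pd.DataFrame({
        "ht_home_goals": ht["home"], "ht_away_goals": ht["away"],
        "ft_home_goals": ft["home"], "ft_away_goals": ft["away"],
    })
    labels = {1: "H", 0: "D", -1: "A"}
    out["ht_result"] = np.sign(out["ht_home_goals"] - out["ht_away_goals"]).map(labels)
    out["ft_result"] = np.sign(out["ft_home_goals"] - out["ft_away_goals"]).map(labels)
    return out


# Toy data (hypothetical): match 3 has no goals at all.
goals = pd.DataFrame({
    "match_id": [1, 1, 1, 2, 2],
    "side": ["home", "home", "away", "away", "home"],
    "period": [1, 2, 2, 1, 2],
})
results = build_match_results(goals, match_ids=[1, 2, 3])
stability = (results["ht_result"] == results["ft_result"]).mean()
print(results)
print(f"Result stability: {stability:.1%}")
```

Reindexing against the full list of match ids matters here: goalless matches would otherwise disappear from the table and bias the draw counts.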

Advanced Metrics Computed from Raw Events

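All metrics are aggregated from raw StatsBomb events and their coordinates; no pre-aggregated team or match statistics are used. The only modelled input is xG, which is taken per shot from StatsBomb's own field.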

xG — StatsBomb's shot.statsbomb_xg field, aggregated per team per match
PPDA — Passes per Defensive Action: opponent successful passes in their own half ÷ team defensive actions in the offensive half. Lower = more intense pressing.
Field Tilt — team's share of passes in the final third (x > 80)
Possession — pass-based ball share per match
Progressive passes — completed passes reducing distance to goal by ≥ 20%
Pressure height — median x-coordinate of Pressure events
Key passes — passes directly leading to a shot (pass.shot_assist or pass.goal_assist)
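
Three of the definitions above can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes hypothetical flat tables `passes` (columns `team`, `x`, `y`, `end_x`, `end_y`, `completed`) and `pressures` (columns `team`, `x`), StatsBomb's 120 × 80 pitch with every team attacking left to right, a goal centre at (120, 40), and Field Tilt read as a team's share of both teams' final-third passes.

```python
import numpy as np
import pandas as pd

GOAL_X, GOAL_Y = 120.0, 40.0  # assumed centre of the attacked goal


def field_tilt(passes: pd.DataFrame, team: str) -> float:
    """Share of all final-third passes (x > 80) that were played by `team`."""
    final_third = passes[passes["x"] > 80]
    return float((final_third["team"] == team).mean())


def is_progressive(passes: pd.DataFrame, min_gain: float = 0.20) -> pd.Series:
    """Completed passes that cut the distance to goal by at least `min_gain`."""
    start = np.hypot(GOAL_X - passes["x"], GOAL_Y - passes["y"])
    end = np.hypot(GOAL_X - passes["end_x"], GOAL_Y - passes["end_y"])
    return passes["completed"] & (end <= (1 - min_gain) * start)


def pressure_height(pressures: pd.DataFrame, team: str) -> float:
    """Median x-coordinate of the team's Pressure events."""
    return float(pressures.loc[pressures["team"] == team, "x"].median())


# Toy data (hypothetical values).
passes = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "x": [90.0, 85.0, 100.0, 30.0],
    "y": [40.0, 40.0, 40.0, 40.0],
    "end_x": [110.0, 84.0, 115.0, 60.0],
    "end_y": [40.0, 40.0, 40.0, 40.0],
    "completed": [True, True, False, True],
})
pressures = pd.DataFrame({"team": ["A", "A", "A"], "x": [40.0, 60.0, 80.0]})

tilt_a = field_tilt(passes, "A")
progressive = is_progressive(passes)
height_a = pressure_height(pressures, "A")
print(tilt_a, progressive.tolist(), height_a)
```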

Modelling Design Choices

First-half features only. All models are trained exclusively on ht_ prefixed features. Using full-match statistics to predict the final result would constitute data leakage — the model would see second-half information that is unavailable at half-time.
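
A minimal sketch of that guard, assuming a hypothetical per-match table whose first-half columns carry the `ht_` prefix and whose full-time columns carry an `ft_` prefix (the latter prefix and the column names are assumptions for illustration):

```python
import pandas as pd


def first_half_features(matches: pd.DataFrame) -> pd.DataFrame:
    """Keep only columns whose names start with 'ht_'."""
    features = matches.filter(regex=r"^ht_")
    # Fail loudly if anything from the full match slipped through.
    leaked = [c for c in features.columns if c.startswith("ft_")]
    assert not leaked, f"leakage: {leaked}"
    return features


matches = pd.DataFrame({
    "ht_home_xg": [0.4, 1.1],
    "ht_away_xg": [0.7, 0.2],
    "ft_home_goals": [1, 3],
    "ft_result": ["D", "H"],
})
X = first_half_features(matches)
y = matches["ft_result"]
print(list(X.columns))
```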

Calibrated SVM. CalibratedClassifierCV is used instead of SVC(probability=True) to obtain better probability estimates for ROC analysis on a small dataset.
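
The sketch below shows one way to wire this up in scikit-learn. The RBF kernel, the scaling step, sigmoid calibration with 5 folds, and the synthetic three-class data are all assumptions for illustration; they stand in for the notebook's real settings and `ht_` feature matrix.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

RANDOM_STATE = 42

# Synthetic placeholder for the first-half feature matrix (3 outcome classes).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, random_state=RANDOM_STATE,
)

# SVC has no native probabilities; the wrapper fits a calibrator on
# held-out folds of the SVC's decision scores.
model = make_pipeline(
    StandardScaler(),
    CalibratedClassifierCV(SVC(kernel="rbf"), method="sigmoid", cv=5),
)
model.fit(X, y)

proba = model.predict_proba(X[:5])
print(proba.round(3))
```

Each row of `proba` holds one probability per class, which is what a one-vs-rest ROC analysis needs.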

Majority-class baseline. A DummyClassifier establishes the floor any model must beat before being considered useful.
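
For reference, a majority-class baseline can be reproduced from the class counts alone. The sketch below uses the season-wide counts reported in this README (175 home wins, 110 away wins, 95 draws); the score it prints is therefore the full-season figure, which differs slightly from the 0.463 measured on the test split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Season-wide outcome counts taken from this README.
y = np.array(["H"] * 175 + ["A"] * 110 + ["D"] * 95)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
accuracy = baseline.score(X, y)
print(f"Always predicting '{baseline.predict(X[:1])[0]}': {accuracy:.3f}")
```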

Model Performance Summary

All first-half features

Model	CV acc	Test acc	Macro AUC	Draw F1
SVM (calibrated)	0.600 ± 0.013	0.568	0.757	0.00
Naive Bayes	0.533 ± 0.057	0.547	0.734	0.39
Decision Tree	0.502 ± 0.038	0.495	0.605	0.34
Majority-class baseline	—	0.463	—	0.00
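
The "CV acc" column reports mean ± standard deviation over stratified 5-fold cross-validation. A minimal sketch of how such figures can be produced is shown below; the Gaussian variant of Naive Bayes, the default tree settings and the synthetic data are assumptions, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 42

# Synthetic placeholder data with three outcome classes.
X, y = make_classification(
    n_samples=380, n_features=8, n_informative=4,
    n_classes=3, random_state=RANDOM_STATE,
)

# Stratification keeps the H/D/A proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=RANDOM_STATE),
}

summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```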

Excluding goal-related features (goals, shots, xG)

Model	CV acc	Test acc	Macro AUC
SVM (calibrated)	0.523 ± 0.026	0.484	0.560
Naive Bayes	0.453 ± 0.036	0.453	0.608
Decision Tree	0.407 ± 0.070	0.390	0.536

The sharp drop in the SVM's macro AUC (0.757 → 0.560) when goal features are removed indicates that the half-time score dominates, while process metrics have limited standalone power at this dataset scale.
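
The comparison between the two tables can be sketched as a one-vs-rest macro AUC computed twice, once on all columns and once with a block of columns removed. Everything in the example is an assumption for illustration: the data is synthetic, the first three columns merely play the role of the goal-related features, and Gaussian Naive Bayes is used only because it is quick to fit.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

RANDOM_STATE = 42


def macro_auc(X_train, X_test, y_train, y_test) -> float:
    """Fit a simple classifier and return its one-vs-rest macro AUC."""
    model = GaussianNB().fit(X_train, y_train)
    proba = model.predict_proba(X_test)
    return roc_auc_score(y_test, proba, multi_class="ovr", average="macro")


# shuffle=False keeps the 3 informative columns first; the rest is noise.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0,
    shuffle=False, random_state=RANDOM_STATE,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=RANDOM_STATE,
)

auc_all = macro_auc(X_train, X_test, y_train, y_test)
# Drop the informative block, mimicking the "no goal features" run.
auc_reduced = macro_auc(X_train[:, 3:], X_test[:, 3:], y_train, y_test)
print(f"all features: {auc_all:.3f}, reduced: {auc_reduced:.3f}")
```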

Notable Finding: Draws

The SVM (best overall model) achieves F1 = 0.00 on draws — it ignores them entirely due to class imbalance (95 draws vs. 175 home wins). Naive Bayes partially recovers draw prediction (F1 = 0.39) at the cost of lower overall accuracy.

The transition matrix explains why: only 41% of HT draws remain draws at FT, while 38.6% flip to a home win — the single most frequent second-half transition in the dataset.
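
A row-normalised transition matrix and Cramér's V can both be derived from the same HT × FT contingency table. The sketch below assumes two label series named `ht_result` and `ft_result` with values H/D/A and uses toy data; it omits bias correction for Cramér's V, which the notebook may or may not apply.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def transition_matrix(ht: pd.Series, ft: pd.Series) -> pd.DataFrame:
    """P(FT result | HT result): every row sums to 1."""
    return pd.crosstab(ht, ft, normalize="index")


def cramers_v(ht: pd.Series, ft: pd.Series) -> tuple:
    """Cramér's V and the chi-square p-value for the HT x FT table."""
    table = pd.crosstab(ht, ft)
    chi2, p_value, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1  # smaller table dimension minus one
    return float(np.sqrt(chi2 / (n * k))), float(p_value)


# Toy labels (hypothetical, far too few for a meaningful test).
ht = pd.Series(["D", "D", "D", "H", "H", "A"], name="ht_result")
ft = pd.Series(["D", "H", "A", "H", "H", "A"], name="ft_result")

tm = transition_matrix(ht, ft)
v, p = cramers_v(ht, ft)
print(tm.round(3))
print(f"Cramér's V = {v:.3f}, p = {p:.3f}")
```

Reading the "D" row of the matrix gives the figures quoted above: the share of half-time draws that stay draws and the shares that turn into home or away wins.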

Installation

git clone https://github.com/marinoalfonso/halftime-to-fulltime-predictor.git
cd halftime-to-fulltime-predictor
pip install -r requirements.txt

Download the data

python data/download_data.py

This downloads the StatsBomb open data for Serie A 2015/16 directly via the official statsbombpy library. The raw CSV files are not included in the repository due to their size (~1 GB).
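
For orientation, a download along these lines can be written with statsbombpy's `sb.competitions()`, `sb.matches()` and `sb.events()` calls. This is a sketch, not the contents of `data/download_data.py`: the helper names, the lookup by competition and season name, and the output path `data/events.csv` are assumptions.

```python
import pandas as pd


def find_competition(competitions: pd.DataFrame,
                     name: str = "Serie A",
                     season: str = "2015/2016") -> tuple:
    """Look up (competition_id, season_id) in the competitions table."""
    mask = (
        (competitions["competition_name"] == name)
        & (competitions["season_name"] == season)
    )
    row = competitions.loc[mask].iloc[0]
    return int(row["competition_id"]), int(row["season_id"])


def download_events(out_path: str = "data/events.csv") -> None:
    """Fetch all events of the season and write them to one CSV (needs network)."""
    from statsbombpy import sb  # imported lazily so the helper above stays testable

    competition_id, season_id = find_competition(sb.competitions())
    matches = sb.matches(competition_id=competition_id, season_id=season_id)
    frames = [sb.events(match_id=match_id) for match_id in matches["match_id"]]
    pd.concat(frames, ignore_index=True).to_csv(out_path, index=False)


if __name__ == "__main__":
    download_events()
```

Fetching one match at a time is slow for 380 matches but keeps memory use predictable; the repository's own script may batch differently.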

Optional: custom fonts

Download the Teko font family from Google Fonts and place the .ttf files in a fonts/ folder. The notebook falls back to matplotlib defaults if the folder is absent.
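
The fallback described above can be sketched with matplotlib's font manager. The function name and return value are hypothetical, and the family name "Teko" is assumed to match the name embedded in the downloaded .ttf files.

```python
from pathlib import Path

import matplotlib
from matplotlib import font_manager


def use_teko_if_available(font_dir: str = "fonts") -> bool:
    """Register any .ttf files in `font_dir`; return False if none are found."""
    folder = Path(font_dir)
    paths = sorted(folder.glob("*.ttf")) if folder.is_dir() else []
    if not paths:
        return False  # keep matplotlib's default font
    for path in paths:
        font_manager.fontManager.addfont(str(path))
    matplotlib.rcParams["font.family"] = "Teko"
    return True


if __name__ == "__main__":
    print("Custom font active:", use_teko_if_available())
```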

Run

jupyter notebook halftime_to_fulltime_predictor.ipynb

Data

Source: StatsBomb Open Data
Competition: Serie A 2015/16
Matches: 380
Events: ~1,353,739 rows × 176 columns
License: StatsBomb Open Data License — free for non-commercial use with attribution

See data/README.md for full details on the data structure and download instructions.

Known Limitations

Limitation	Impact
Single season (380 matches)	High CV variance (DT SD up to 0.070); limited generalisation
Class imbalance (175 / 110 / 95)	SVM collapses draw recall to zero; accuracy alone is misleading
No temporal features	Second-half substitutions, fatigue, red cards not modelled
Single competition	Serie A 2015/16 tactical patterns may not transfer to other leagues/eras

Tech Stack

Python, Jupyter Notebook, pandas, scikit-learn, statsbombpy, matplotlib

License

This project is released under the MIT License.
Data is provided by StatsBomb under their Open Data License — please read it before using the data in your own work.
