Exploring Defensive Team Profiles of Top 5 European Leagues — 2023/24 Season
A statistical analysis of team-level defensive data from the five major European football leagues (Serie A, Bundesliga, Ligue 1, La Liga, Premier League) in the 2023/24 season, applying Principal Component Analysis (PCA) and Multiple Linear Regression.
⚠️ Reproducibility notice

This project was developed in 2024. Since then, FBref has restricted access to several statistical tables that were previously available for free scraping. As a result, the data-loading step in report.Rmd, which relies on worldfootballR::fb_season_team_stats(..., "defense"), no longer works and will return an error or empty data.

Unfortunately, the original dataset was not exported at the time and cannot be redistributed. The repository is therefore preserved as a methodological reference: the full analysis pipeline (descriptive statistics, PCA, multiple regression, and diagnostic tests) remains valid and can be adapted to any equivalent dataset obtained through alternative means (e.g. a paid FBref subscription, or a manually exported CSV).
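For readers adapting the pipeline to a manually exported CSV, a minimal sketch of a loader that checks the file before the analysis runs might look like the following. The function name, the file path in the usage note, and the column names are all hypothetical; they are not taken from the repository and should be matched to whatever the exported table actually contains.

```r
# Hypothetical column names; adjust them to the exported table.
required_cols <- c("Squad", "Comp", "Tkl", "TklW")

# Read a manually exported CSV and fail early with a clear message
# if the file or any expected column is missing.
load_defense_csv <- function(path, required = required_cols) {
  if (!file.exists(path)) {
    stop("No file at '", path, "'. Export the defensive table to CSV first.")
  }
  dat <- read.csv(path, stringsAsFactors = FALSE, check.names = FALSE)
  missing_cols <- setdiff(required, names(dat))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  dat
}
```

A call such as `load_defense_csv("data/defense_2023_24.csv")` (an assumed path) would then stand in for the scraping step.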
This project explores how and how much top European clubs defend, going beyond simple tallies to identify:
- Tactical profiles of individual teams through descriptive statistics and radar charts.
- Latent defensive dimensions via PCA (e.g., overall defensive intensity, pressing orientation, midfield efficiency, error propensity).
- Linear drivers of tackle success via multiple regression with backward variable selection.
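The PCA step in the list above can be sketched with base R alone. The data below are simulated stand-ins with the same shape as the dataset described in this README (96 teams, 14 variables), not the project's values, and standardizing the variables before PCA is an assumption rather than something this README confirms.

```r
set.seed(1)

# Simulated stand-in for the team-by-variable matrix (NOT real data).
X <- matrix(
  rnorm(96 * 14), nrow = 96, ncol = 14,
  dimnames = list(paste0("team_", 1:96), paste0("var_", 1:14))
)

# PCA on centered, unit-variance variables (assumed preprocessing).
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Share of total variance per component, and the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches 75%.
n_pcs <- which(cum_var >= 0.75)[1]

# Team coordinates and variable loadings on the first two components,
# i.e. the ingredients of a PC1-PC2 biplot.
scores <- pca$x[, 1:2]
loadings <- pca$rotation[, 1:2]
```

With real defensive data the variables are correlated, so far fewer components are needed to reach a given threshold than with this independent noise.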
Most Champions League qualifiers cluster in the positive PC1–PC2 region, suggesting that top clubs tend to be more active defensively — pressing higher and winning more tackles across all zones of the pitch.
Each team's defensive profile expressed as percentile ranks across all 14 variables.
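As an illustration of how percentile ranks for a radar chart could be computed, here is a small base-R sketch. The helper name and the tie-handling rule (tied teams share the highest rank) are assumptions; the report may define percentiles differently.

```r
# Percentile rank on a 0-100 scale: the share of teams whose value
# is less than or equal to each team's value. NAs are kept as NA.
percentile_rank <- function(x) {
  100 * rank(x, ties.method = "max", na.last = "keep") / sum(!is.na(x))
}

# Apply the helper column by column to a data frame of numeric variables.
to_percentiles <- function(df) {
  as.data.frame(lapply(df, percentile_rank), row.names = rownames(df))
}

# Tiny made-up example: four teams, one variable.
example_pct <- percentile_rank(c(10, 20, 20, 40))
```

If the result is passed to `fmsb::radarchart()`, note that its default input layout expects the first two rows to hold the axis maximum and minimum (here 100 and 0).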
DefensivePerformance_PCA/
├── report.Rmd                   # Main analysis document (R Markdown)
├── report.pdf                   # Compiled PDF report
├── R/
│   └── helpers.R                # Reusable utility functions
├── data/
│   └── README.md                # Notes on data provenance and caching
├── output/
│   └── figures/
│       ├── pca_biplot_ucl.png   # PCA biplot: UCL vs non-UCL teams
│       └── radar_example.png    # Radar chart example
└── docs/
    └── variable_glossary.md     # Full variable descriptions
All analysis is done in R. Install required packages with:
install.packages(c(
  "tidyverse", "corrplot", "factoextra", "worldfootballR",
  "ggplot2", "GGally", "car", "lmtest", "PerformanceAnalytics",
  "gridExtra", "fmsb", "psych", "scales"
))

Note: worldfootballR scrapes data from fbref.com. An active internet connection is required to reproduce the data-loading step. To avoid repeated scraping, the cleaned dataset can be cached locally; see data/README.md.
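One possible way to cache a scraped table locally is a small wrapper that saves the result as an RDS file on first use and reads it back afterwards. This is a generic sketch, not the approach documented in data/README.md; the helper name and the cache path in the usage comment are hypothetical.

```r
# Return the cached object if the file exists; otherwise call `fetch`
# (a zero-argument function), store its result, and return it.
cached <- function(cache_path, fetch) {
  if (file.exists(cache_path)) {
    return(readRDS(cache_path))
  }
  dat <- fetch()
  dir.create(dirname(cache_path), showWarnings = FALSE, recursive = TRUE)
  saveRDS(dat, cache_path)
  dat
}

# Usage sketch (commented out: the scraping call needs network access
# and, per the reproducibility notice, may no longer return data):
# epl_defense <- cached("data/cache/epl_defense.rds", function() {
#   worldfootballR::fb_season_team_stats(
#     country = "ENG", gender = "M", season_end_year = 2024,
#     tier = "1st", stat_type = "defense"
#   )
# })
```

Deleting the RDS file forces a fresh fetch on the next call.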
- Clone the repository:

      git clone https://github.com/marinoalfonso/DefensivePerformance_PCA.git
      cd DefensivePerformance_PCA

- Open report.Rmd in RStudio (or any R environment).
- Install dependencies (see above).
- Knit the document to PDF or HTML:

      rmarkdown::render("report.Rmd", output_format = "pdf_document")
      # or
      rmarkdown::render("report.Rmd", output_format = "html_document")
| Stage | Main Result |
|---|---|
| Descriptive | Tottenham leads in attacking-third tackles; Juventus leads in dribble-countering %; Liverpool concedes most dribbles (Gegenpressing effect). |
| PCA | 4 PCs explain >75% variance. PC1 = overall defensive intensity; PC2 = pressing height; PC3 = midfield efficiency; PC4 = error propensity. |
| Regression | Best model for TklWin: tackles in all three field zones (positive) + shots blocked (negative). Adjusted R² > 0.90 after backward selection. |
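The regression row can be illustrated with a short backward-selection sketch in base R. Everything here is assumed for illustration: the data are simulated, the variable names are hypothetical, and `step()` drops terms by AIC, which may differ from the selection criterion used in the report.

```r
set.seed(42)
n <- 96

# Simulated predictors (NOT real data): tackles by pitch zone,
# blocked shots, and one pure-noise variable.
sim <- data.frame(
  tkl_def    = rnorm(n, mean = 300, sd = 40),
  tkl_mid    = rnorm(n, mean = 250, sd = 35),
  tkl_att    = rnorm(n, mean = 90,  sd = 20),
  sh_blocked = rnorm(n, mean = 110, sd = 15),
  noise_var  = rnorm(n)
)

# Simulated response: tackles won rise with zone tackles and
# fall slightly with blocked shots, plus random error.
sim$tkl_win <- with(
  sim,
  0.6 * tkl_def + 0.6 * tkl_mid + 0.6 * tkl_att -
    0.2 * sh_blocked + rnorm(n, sd = 10)
)

# Start from the full model and remove terms while AIC improves.
full <- lm(tkl_win ~ ., data = sim)
reduced <- step(full, direction = "backward", trace = 0)

# Adjusted R-squared of the selected model.
adj_r2 <- summary(reduced)$adj.r.squared
```

Diagnostics such as multicollinearity or heteroskedasticity checks would be run on `reduced`; the car and lmtest packages listed in the install step provide functions for that.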
Data were extracted via the worldfootballR R package from fbref.com, a widely used football statistics aggregator. The dataset covers 96 teams (20 each from Serie A, La Liga, and the Premier League; 18 each from the Bundesliga and Ligue 1) with 14 numeric defensive variables per team.
Alfonso Marino
GitHub · Feel free to open an issue or submit a PR.
This project is licensed under the MIT License - see the LICENSE file for details.

