This project was completed as part of the MSc Mathematical Trading and Finance programme at Bayes Business School (formerly Cass). It applies unsupervised learning methods to understand latent structure in film characteristics across 50 top-rated IMDB movies. The focus lies in dimensionality reduction through Principal Component Analysis (PCA) and clustering via both KMeans and Hierarchical methods.
Can we uncover meaningful groupings among films based on commercial and critical success metrics without using genre labels or metadata? This project explores how PCA reveals dominant axes of variation in film attributes, and how clustering algorithms can group similar films in this reduced feature space. We also benchmarked our human-led approach against an output generated by ChatGPT-4o, to evaluate how effectively LLMs perform in exploratory unsupervised settings.
What struck me about this project was how PCA can surface structure even when we know very little about the underlying 'drivers of success.' Without genre or actor labels, PCA still pulled out commercially versus critically successful films as its first principal component. It made me reflect on how powerful unsupervised methods could be in finance too; I'd be curious to run a similar analysis on hedge fund returns to see whether natural groupings emerge by strategy or performance regime. Benchmarking against ChatGPT's solution also revealed that while LLMs are strong at functional outputs, they typically favour user-friendly scripts over robust, modular, object-oriented pipelines.
- Data Cleaning: Numeric-only filtering, missing value handling, standardisation
- Exploratory Data Analysis: Correlation matrices, pairplots, boxplots
- Dimensionality Reduction: Principal Component Analysis (full and reduced)
- Clustering:
  - KMeans (with silhouette score optimisation)
  - Hierarchical clustering (Ward’s method, dendrogram thresholding)
- Interpretation: Visual overlays of PCA loadings and cluster groupings
- Model Benchmarking: Comparison with ChatGPT-4o-generated output
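The cleaning and dimensionality-reduction steps listed above could be sketched roughly as follows. This is a minimal illustration, not the code in `imdb_movie_analysis.py`: the function names, the demo column names (`rating`, `gross`, `votes`, `title`) and the randomly generated data are all assumptions made for the example, and dropping rows is only one possible way of handling missing values.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Keep numeric columns, drop rows with missing values, then standardise."""
    numeric = df.select_dtypes(include="number").dropna()
    scaled = StandardScaler().fit_transform(numeric)
    return pd.DataFrame(scaled, columns=numeric.columns, index=numeric.index)


def components_for_variance(features: pd.DataFrame, threshold: float = 0.95) -> int:
    """Smallest number of components whose cumulative explained variance reaches threshold."""
    cumulative = np.cumsum(PCA().fit(features).explained_variance_ratio_)
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(cumulative))  # guard against floating-point shortfall


# Hypothetical stand-in data: column names and values are invented for illustration.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "rating": rng.normal(8.5, 0.3, 50),
        "gross": rng.lognormal(4.0, 1.0, 50),
        "votes": rng.lognormal(13.0, 0.5, 50),
        "title": [f"film_{i}" for i in range(50)],  # non-numeric, filtered out
    }
)
demo.loc[3, "gross"] = np.nan  # one missing value, so one row is dropped

features = prepare_features(demo)
n_components = components_for_variance(features)
print(features.shape, n_components)
```

Standardising before PCA matters here because attributes such as box office gross and rating sit on very different scales; without it, the largest-scale column would dominate the first component.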
Unsupervised-Learning-Project/
├── datasets/
├── images/
├── .gitignore
├── README.md
├── Report.pdf
├── Task.pdf
├── chatgpt4o_output.py
├── imdb_movie_analysis.py
├── requirements.txt
| Method | Insight |
|---|---|
| PCA | PC1 distinguished commercial vs critical success |
| KMeans (k=3) | Revealed clusters aligned with genre-type and box office returns |
| Hierarchical | Confirmed structure and stability of KMeans clusters |
| ChatGPT Benchmark | Delivered valid outputs, but lacked interpretability and pipeline depth |
Four principal components captured 95% of the total variance. KMeans clustering (k=3) produced genre-adjacent groupings, and hierarchical clustering recovered a consistent structure, supporting the stability of the KMeans clusters.
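One way the cluster count could be chosen by silhouette score, and the result cross-checked against Ward linkage, is sketched below. It is an illustration under stated assumptions rather than the project's implementation: the data is synthetic (`make_blobs`), the helper name `best_k_by_silhouette` is hypothetical, the tree is cut by requesting a fixed number of clusters instead of a distance threshold on the dendrogram, and the adjusted Rand index is simply one reasonable agreement measure.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score


def best_k_by_silhouette(X, k_values=range(2, 7), seed=0):
    """Fit KMeans for each k and return the k with the highest silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


# Synthetic placeholder for the PCA-reduced film features (assumption).
X, _ = make_blobs(n_samples=50, centers=3, cluster_std=0.5, random_state=0)

best_k, scores = best_k_by_silhouette(X)
kmeans_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)

# Ward linkage, with the tree cut so that at most best_k clusters remain.
ward_labels = fcluster(linkage(X, method="ward"), t=best_k, criterion="maxclust")

# Adjusted Rand index: 1.0 means both methods produce identical partitions.
agreement = adjusted_rand_score(kmeans_labels, ward_labels)
print(best_k, round(agreement, 3))
```

An agreement score close to 1 would indicate that the two methods recover essentially the same groups, which is the kind of cross-method check described above.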
pip install -r requirements.txt
git clone https://github.com/RemaniSA/Unsupervised-Learning-Project.git
cd Unsupervised-Learning-Project
python imdb_movie_analysis.py

Ensure:
- `IMDB-Movies.csv` is located in the `datasets/` folder
- Output images are written to the `images/` folder
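The two conditions above can be checked before running the analysis. The helper below is a hypothetical convenience and not part of the repository; it assumes the folder layout described in this README and creates `images/` if it is missing.

```python
from pathlib import Path


def check_layout(root: Path) -> Path:
    """Verify the dataset is in place and make sure the images folder exists."""
    data_file = root / "datasets" / "IMDB-Movies.csv"
    if not data_file.is_file():
        raise FileNotFoundError(f"Expected dataset at {data_file}")
    (root / "images").mkdir(exist_ok=True)  # no error if it already exists
    return data_file
```

Calling `check_layout(Path("."))` from the repository root returns the dataset path, or raises `FileNotFoundError` if the CSV has not been placed in `datasets/`.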
- Jolliffe, I.T.: Principal Component Analysis
- Hastie, Tibshirani & Friedman: The Elements of Statistical Learning
- Lecture notes from Machine Learning for Quantitative Professionals, Bayes
- Shaan Ali Remani
- José Pedro Pessoa Dos Santos
- Chin-Lan Chen
- Poh Har Yap