Unsupervised Learning Project

This project was completed as part of the MSc Mathematical Trading and Finance programme at Bayes Business School (formerly Cass). It applies unsupervised learning methods to uncover latent structure in the characteristics of 50 top-rated IMDb films. The focus is on dimensionality reduction through Principal Component Analysis (PCA) and on clustering via both KMeans and hierarchical methods.

Problem

Can we uncover meaningful groupings among films based on commercial and critical success metrics without using genre labels or metadata? This project explores how PCA reveals the dominant axes of variation in film attributes, and how clustering algorithms group similar films in the reduced feature space. We also benchmarked our human-led approach against output generated by ChatGPT-4o to evaluate how effectively LLMs perform in exploratory unsupervised settings.
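To make the idea of "dominant axes of variation" concrete, here is a minimal sketch on synthetic data. It assumes 50 films described by four metrics that are driven by two hidden factors; the column meanings are illustrative assumptions and not the project's actual dataset columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the film data: 50 films, four metrics driven by
# two hidden factors. Column meanings are illustrative assumptions.
rng = np.random.default_rng(0)
n_films = 50
commercial = rng.normal(size=n_films)
critical = rng.normal(size=n_films)


def noise():
    """Small independent measurement noise for one metric."""
    return 0.1 * rng.normal(size=n_films)


X = np.column_stack([
    commercial + noise(),  # e.g. box office revenue (assumed)
    commercial + noise(),  # e.g. number of votes (assumed)
    critical + noise(),    # e.g. user rating (assumed)
    critical + noise(),    # e.g. critic score (assumed)
])

# Standardise so that no metric dominates purely because of its scale.
X_std = StandardScaler().fit_transform(X)

pca = PCA()
scores = pca.fit_transform(X_std)          # film coordinates on each component
explained = pca.explained_variance_ratio_  # share of variance per component
loadings = pca.components_                 # rows = components, columns = metrics

for i, (ratio, row) in enumerate(zip(explained, loadings), start=1):
    print(f"PC{i}: {ratio:.1%} of variance, loadings = {np.round(row, 2)}")
```

In this toy setup the first two components carry almost all of the variance, and their loadings show which metrics move together. Reading the loadings in this way is how a component can be interpreted as, for example, a commercial versus critical axis.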

My Reflections

What struck me about this project was how PCA can surface structure even when we know very little about the underlying 'drivers of success.' Without genre or actor labels, the first principal component still separated commercially successful films from critically successful ones. It made me reflect on how powerful unsupervised methods could be in finance too; I'd be curious to run a similar analysis on hedge fund returns to see whether natural groupings emerge by strategy or performance regime. Benchmarking against ChatGPT's solution also revealed that while LLMs are strong at producing functional output, they typically favour user-friendly scripts over robust, modular, object-oriented pipelines.

Methods

- Data Cleaning: numeric-only filtering, missing value handling, standardisation
- Exploratory Data Analysis: correlation matrices, pairplots, boxplots
- Dimensionality Reduction: Principal Component Analysis (full and reduced)
- Clustering:
  - KMeans (with silhouette score optimisation)
  - Hierarchical clustering (Ward’s method, dendrogram thresholding)
- Interpretation: visual overlays of PCA loadings and cluster groupings
- Model Benchmarking: comparison with ChatGPT-4o-generated output
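The two clustering steps listed above can be sketched roughly as follows. This is an illustration on synthetic points standing in for PCA scores, not the code in imdb_movie_analysis.py. The cluster centres, the candidate range for k, and the use of the adjusted Rand index as an agreement measure are all assumptions made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in for PCA scores: 50 points around three assumed centres.
X, _ = make_blobs(
    n_samples=50,
    centers=[[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]],
    cluster_std=0.5,
    random_state=0,
)

# KMeans: fit each candidate k and keep the one with the highest silhouette.
silhouettes = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouettes[k] = silhouette_score(X, labels)
best_k = max(silhouettes, key=silhouettes.get)
kmeans_labels = KMeans(
    n_clusters=best_k, n_init=10, random_state=0
).fit_predict(X)

# Ward linkage: build the merge tree, then cut it at a height threshold.
Z = linkage(X, method="ward")
# Column 2 of Z holds merge heights. A cut placed between the merge that
# leaves best_k clusters and the next merge yields exactly best_k clusters.
threshold = (Z[-best_k, 2] + Z[-best_k + 1, 2]) / 2
ward_labels = fcluster(Z, t=threshold, criterion="distance")

# One possible agreement check between the two partitions (1.0 = identical).
agreement = adjusted_rand_score(kmeans_labels, ward_labels)
print(f"best k = {best_k}, agreement = {agreement:.2f}")
```

In practice the dendrogram threshold is often chosen by eye from the plotted tree; deriving it from the merge heights, as here, is just a convenient way to keep the sketch self-contained.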

Repository Structure

Unsupervised-Learning-Project/
├── datasets/
├── images/
├── .gitignore
├── README.md
├── Report.pdf
├── Task.pdf
├── chatgpt4o_output.py
├── imdb_movie_analysis.py
└── requirements.txt

Summary of Results

Method	Insight
PCA	PC1 distinguished commercial vs critical success
KMeans (k=3)	Revealed clusters aligned with genre-type and box office returns
Hierarchical	Confirmed structure and stability of KMeans clusters
ChatGPT Benchmark	Delivered valid outputs, but lacked interpretability and pipeline depth

PCA captured 95% of total variance with 4 components. KMeans clustering (k=3) showed genre-adjacent groupings, while hierarchical clustering validated stability across methods.
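A figure such as "95% of variance with 4 components" is usually read off the cumulative explained-variance ratio. The helper below is a hypothetical sketch of that calculation; the function name and the example ratios are illustrative and are not the project's actual values.

```python
import numpy as np


def components_for_variance(explained_ratio, target=0.95):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `target`."""
    cumulative = np.cumsum(explained_ratio)
    # searchsorted gives the first index at which cumulative >= target.
    index = int(np.searchsorted(cumulative, target))
    # Clamp in case rounding leaves the total slightly below the target.
    return min(index + 1, len(cumulative))


# Illustrative ratios only; these are not the project's actual values.
example_ratios = np.array([0.50, 0.25, 0.12, 0.09, 0.03, 0.01])
n_components = components_for_variance(example_ratios)
print(n_components)
```

scikit-learn's PCA can perform the same selection directly when n_components is given as a float between 0 and 1 with the full SVD solver, for example PCA(n_components=0.95, svd_solver="full").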

Requirements

pip install -r requirements.txt

How to Run

git clone https://github.com/RemaniSA/Unsupervised-Learning-Project.git
cd Unsupervised-Learning-Project
python imdb_movie_analysis.py

Ensure that IMDB-Movies.csv is located in the datasets/ folder. Output images are written to the images/ folder.
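For orientation, the data cleaning steps named under Methods (numeric-only filtering, missing value handling, standardisation) might look roughly like the sketch below. The function name, the column names, and the choice of median imputation are assumptions made for illustration; the actual script may handle missing values differently.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def clean_numeric(df):
    """Keep numeric columns, fill gaps with column medians, standardise."""
    numeric = df.select_dtypes(include="number")
    # Median imputation is an assumption; dropping rows is another option.
    filled = numeric.fillna(numeric.median())
    scaled = StandardScaler().fit_transform(filled)
    return pd.DataFrame(scaled, columns=numeric.columns, index=df.index)


# Tiny in-memory frame with hypothetical column names; in the project the
# data would instead come from pd.read_csv("datasets/IMDB-Movies.csv").
raw = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating": [8.1, 7.9, np.nan, 8.5],
    "Revenue": [120.0, np.nan, 45.0, 300.0],
})
clean = clean_numeric(raw)
print(clean)
```

The result has zero mean and unit variance per column, which is the form PCA and distance-based clustering generally expect.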

Authors

Shaan Ali Remani
José Pedro Pessoa Dos Santos
Chin-Lan Chen
Poh Har Yap
