Unsupervised Learning Project

This project was completed as part of the MSc Mathematical Trading and Finance programme at Bayes Business School (formerly Cass). It applies unsupervised learning methods to uncover latent structure in the characteristics of 50 top-rated IMDb films. The focus is on dimensionality reduction through Principal Component Analysis (PCA) and on clustering via both KMeans and hierarchical methods.

Problem

Can we uncover meaningful groupings among films based on commercial and critical success metrics, without using genre labels or metadata? This project explores how PCA reveals the dominant axes of variation in film attributes, and how clustering algorithms group similar films in the reduced feature space. We also benchmarked our human-led approach against a solution generated by ChatGPT-4o, to evaluate how effectively LLMs perform in exploratory unsupervised settings.

My Reflections

What struck me about this project was how PCA can surface structure even when we know very little about the underlying 'drivers of success.' Without genre or actor labels, the first principal component still separated commercially successful films from critically successful ones. It made me reflect on how powerful unsupervised methods could be in finance too; I'd be curious to run a similar analysis on hedge fund returns to see whether natural groupings emerge by strategy or performance regime. Benchmarking against ChatGPT's solution also revealed that while LLMs are strong at producing functional output, they typically favour user-friendly scripts over robust, modular, object-oriented pipelines.

Methods

  • Data Cleaning: Numeric-only filtering, missing value handling, standardisation
  • Exploratory Data Analysis: Correlation matrices, pairplots, boxplots
  • Dimensionality Reduction: Principal Component Analysis (full and reduced)
  • Clustering:
    • KMeans (with silhouette score optimisation)
    • Hierarchical clustering (Ward’s method, dendrogram thresholding)
  • Interpretation: Visual overlays of PCA loadings and cluster groupings
  • Model Benchmarking: Comparison with ChatGPT-4o-generated output
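The cleaning and clustering steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in imdb_movie_analysis.py: the data is synthetic, the column names (`rating`, `votes`, `gross`, `runtime`, `title`) are hypothetical, and the Ward tree is cut to a fixed number of clusters (`criterion="maxclust"`) rather than at a dendrogram distance threshold as the project describes.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the film data (assumption): three tight groups of 20 rows.
rng = np.random.default_rng(0)
centres = np.array([[0, 0, 0, 0], [8, 8, 0, 4], [0, 8, 8, 8]], dtype=float)
values = np.vstack([c + rng.normal(scale=0.5, size=(20, 4)) for c in centres])
films = pd.DataFrame(values, columns=["rating", "votes", "gross", "runtime"])
films["title"] = [f"film_{i}" for i in range(len(films))]  # non-numeric column
films.loc[3, "gross"] = np.nan  # one missing value to exercise the cleaning step

# Data cleaning: keep numeric columns, drop incomplete rows, standardise.
numeric = films.select_dtypes(include="number").dropna()
X = StandardScaler().fit_transform(numeric)

# KMeans: choose k by the highest mean silhouette score over a small range.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
kmeans_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: Ward linkage, then cut the tree into best_k flat clusters.
Z = linkage(X, method="ward")
ward_labels = fcluster(Z, t=best_k, criterion="maxclust")

print(f"best k by silhouette: {best_k}")
```

Standardising before clustering matters here because both KMeans and Ward linkage rely on Euclidean distance, so a feature on a large scale (box office gross, say) would otherwise dominate the groupings.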

Repository Structure

Unsupervised-Learning-Project/
├── datasets/
├── images/
├── .gitignore
├── README.md
├── Report.pdf
├── Task.pdf
├── chatgpt4o_output.py
├── imdb_movie_analysis.py
└── requirements.txt

Summary of Results

| Method | Insight |
|---|---|
| PCA | PC1 distinguished commercial vs critical success |
| KMeans (k=3) | Revealed clusters aligned with genre-type and box office returns |
| Hierarchical | Confirmed structure and stability of KMeans clusters |
| ChatGPT Benchmark | Delivered valid outputs, but lacked interpretability and pipeline depth |

Four principal components captured 95% of the total variance. KMeans clustering (k=3) produced genre-adjacent groupings, and hierarchical clustering produced a consistent structure, supporting the stability of those clusters across methods.
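One way to arrive at a component count for a 95% variance target, and to check agreement between the two clustering methods, is sketched below. It runs on synthetic data, so it will not reproduce the figures reported here. The adjusted Rand index is an assumption on my part: the project does not state which measure, if any, was used to compare KMeans and hierarchical labels.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 3 groups of 20 rows across 6 features.
rng = np.random.default_rng(1)
centres = np.array(
    [[0, 0, 0, 0, 0, 0], [6, 6, 0, 0, 3, 3], [0, 6, 6, 6, 0, 3]], dtype=float
)
raw = np.vstack([c + rng.normal(scale=0.5, size=(20, 6)) for c in centres])
X = StandardScaler().fit_transform(raw)

# Full PCA first, to inspect the cumulative explained variance.
full_pca = PCA().fit(X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

# Reduced PCA: project onto the retained components only.
scores_reduced = PCA(n_components=n_keep).fit_transform(X)

# Cluster in the reduced space with both methods (k=3, as in the project).
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    scores_reduced
)
ward_labels = fcluster(
    linkage(scores_reduced, method="ward"), t=3, criterion="maxclust"
)

# Adjusted Rand index: 1.0 means identical partitions, about 0 means chance level.
agreement = adjusted_rand_score(kmeans_labels, ward_labels)

print(f"components kept: {n_keep}, KMeans vs Ward ARI: {agreement:.2f}")
```

The adjusted Rand index ignores how cluster labels are numbered, which is convenient here because KMeans numbers its clusters from 0 and `fcluster` numbers them from 1.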

Requirements

pip install -r requirements.txt

How to Run

git clone https://github.com/RemaniSA/Unsupervised-Learning-Project.git
cd Unsupervised-Learning-Project
python imdb_movie_analysis.py

Ensure:

  • IMDB-Movies.csv is located in the datasets/ folder
  • Output images are written to the images/ folder

Further Reading

  • Jolliffe, I.T.: Principal Component Analysis
  • Hastie, Tibshirani & Friedman: The Elements of Statistical Learning
  • Lecture notes from Machine Learning for Quantitative Professionals, Bayes

Authors

  • Shaan Ali Remani
  • José Pedro Pessoa Dos Santos
  • Chin-Lan Chen
  • Poh Har Yap
