The aim of this project is to analyze the Netflix dataset of movies and TV shows available until 2019, sourced from the third-party search engine Flixable. The goal is to group the content into relevant clusters using Natural Language Processing (NLP) techniques to enhance the user experience through a recommendation system. This will help prevent subscriber churn for Netflix, which currently has over 220 million subscribers.
Additionally, the dataset is analyzed to uncover insights and trends in the streaming entertainment industry.
The project follows a structured step-by-step process:
- Handling Missing Values
  - Cleaning and filling null values to maintain data integrity.
- Managing Nested Columns
  - Processing nested columns such as `director`, `cast`, `listed_in`, and `country` to improve visualization and analysis.
- Binning the Rating Attribute
  - Categorizing ratings into groups (`adult`, `children's`, `family-friendly`, and `not rated`) to enhance interpretability.
- Exploratory Data Analysis (EDA)
  - Gaining insights into factors that may contribute to subscriber churn.
  - Identifying trends in content popularity and distribution.
- Feature Engineering for Clustering
  - Using attributes such as `director`, `cast`, `country`, `genre`, `rating`, and `description`.
  - Tokenizing, preprocessing, and vectorizing text data using the TF-IDF vectorizer.
- Dimensionality Reduction
  - Applying Principal Component Analysis (PCA) to reduce dimensionality and improve model performance.
- Clustering Techniques
  - Implementing K-Means Clustering and Agglomerative Hierarchical Clustering.
  - Determining optimal cluster numbers:
    - K-Means: 4 clusters
    - Hierarchical Clustering: 2 clusters
  - Evaluating clusters using silhouette scores and other validation techniques.
- Developing a Content-Based Recommender System
  - Utilizing a cosine similarity matrix to provide personalized recommendations based on clustered content.
  - Aiming to reduce subscriber churn by enhancing user satisfaction and engagement.
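The preprocessing steps listed above (filling nulls, un-nesting multi-value columns, and binning ratings) can be sketched with pandas. This is a minimal illustration on a made-up three-row table, not the project's actual code: the `title` column, the `"Unknown"` placeholder, and the rating-to-group mapping are all assumptions chosen for the example.

```python
import pandas as pd

# Toy stand-in for the Netflix data (values are invented for illustration).
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Jane Doe", None, "John Roe, Ann Poe"],
    "country": ["United States, India", "India", None],
    "rating": ["TV-MA", "TV-Y", None],
})

# 1. Missing values: fill gaps in text columns with a placeholder label
#    (the label "Unknown" is an assumption).
df[["director", "country"]] = df[["director", "country"]].fillna("Unknown")

# 2. Nested columns: split the comma-separated string into a list,
#    then explode so that each country gets its own row.
countries = df.assign(country=df["country"].str.split(",")).explode("country")
countries["country"] = countries["country"].str.strip()

# 3. Rating bins: map raw ratings to the four groups named above.
#    This particular mapping is illustrative, not taken from the project.
RATING_GROUPS = {
    "TV-MA": "adult", "R": "adult", "NC-17": "adult",
    "TV-Y": "children's", "TV-Y7": "children's", "G": "children's",
    "PG": "family-friendly", "TV-PG": "family-friendly", "TV-G": "family-friendly",
}
df["rating_group"] = df["rating"].map(RATING_GROUPS).fillna("not rated")

print(countries["country"].value_counts())
print(df[["title", "rating", "rating_group"]])
```

Exploding into a separate frame (`countries` here) keeps the original one-row-per-title table intact for the later clustering steps, while the long-format copy is convenient for per-country or per-director counts during EDA.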
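The feature engineering, dimensionality reduction, and clustering steps above can be chained with scikit-learn roughly as follows. This is a sketch on eight invented descriptions, assuming only two text attributes are combined; the number of PCA components and the random seed are arbitrary. The cluster counts (4 for K-Means, 2 for hierarchical) mirror the values reported above, but on toy data they carry no meaning.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Invented catalogue rows; the real project also uses director, cast, etc.
df = pd.DataFrame({
    "listed_in": ["Dramas", "Dramas", "Comedies", "Comedies",
                  "Documentaries", "Documentaries", "Horror Movies", "Horror Movies"],
    "description": [
        "A family struggles to rebuild after a devastating loss.",
        "Two estranged brothers reunite after a family loss.",
        "A clumsy chef turns a failing diner into a comedy hit.",
        "Roommates stumble through a hilarious comedy of errors.",
        "Scientists document coral reefs and ocean wildlife.",
        "A film crew follows ocean wildlife across the seasons.",
        "A haunted house terrorizes its new owners.",
        "Campers are stalked by a creature in a haunted forest.",
    ],
})

# Combine the text attributes into one document per title.
text = (df["listed_in"] + " " + df["description"]).str.lower()

# TF-IDF turns each document into a sparse weighted term vector.
X = TfidfVectorizer(stop_words="english").fit_transform(text)

# PCA needs a dense array; 4 components is an arbitrary choice for 8 rows.
X_reduced = PCA(n_components=4, random_state=42).fit_transform(X.toarray())

# K-Means with 4 clusters and Ward-linkage agglomerative clustering with 2.
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_reduced)
agglo_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X_reduced)

# Silhouette score lies in [-1, 1]; higher means tighter, better-separated clusters.
kmeans_score = silhouette_score(X_reduced, kmeans_labels)
agglo_score = silhouette_score(X_reduced, agglo_labels)
print(f"K-Means silhouette: {kmeans_score:.3f}")
print(f"Agglomerative silhouette: {agglo_score:.3f}")
```

In practice the cluster count would be chosen by repeating the last two steps over a range of `n_clusters` values and comparing the silhouette scores, rather than fixing it up front as done here.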
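The content-based recommender step can be illustrated with a cosine similarity matrix computed over TF-IDF vectors. The sketch below uses four invented titles and a hypothetical helper named `recommend`; it assumes titles are unique and ranks purely by description similarity, without the cluster information the project also uses.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented titles and descriptions, for illustration only.
catalogue = pd.DataFrame({
    "title": ["Space Rescue", "Orbit Station", "Baking Duel", "Chef Showdown"],
    "description": [
        "astronauts stranded in space attempt a daring rescue mission",
        "astronauts aboard a space station face a mission failure",
        "amateur bakers compete in a baking contest judged by chefs",
        "chefs compete in a cooking contest with surprise ingredients",
    ],
})

# Vectorize descriptions and compute all pairwise cosine similarities.
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(catalogue["description"])
similarity = cosine_similarity(tfidf_matrix)  # shape: (n_titles, n_titles)


def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title` (hypothetical helper)."""
    idx = catalogue.index[catalogue["title"] == title][0]
    # Label the similarity row by title and drop the query title itself.
    scores = pd.Series(similarity[idx], index=catalogue["title"]).drop(title)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()


print(recommend("Space Rescue", top_n=1))
```

Because the full similarity matrix grows quadratically with the catalogue size, one option for larger datasets is to compute similarities only within the cluster of the queried title.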
The project is expected to deliver:
- A well-defined clustering model for Netflix content.
- A recommendation system that improves user experience and retention.
- Key insights into streaming trends and subscriber engagement.
- Enhanced decision-making for content curation to cater to diverse audiences.
Tools and techniques used:
- Python (Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, NLTK, SciPy)
- Natural Language Processing (NLP) (TF-IDF Vectorization)
- Machine Learning (K-Means, Hierarchical Clustering, PCA)
- Recommendation Systems (Cosine Similarity Matrix)
- Data Visualization (Seaborn, Matplotlib)
This comprehensive analysis and recommendation system are expected to enhance user satisfaction, leading to improved retention rates for Netflix. By clustering similar content and delivering personalized recommendations, Netflix can offer a better viewing experience and reduce subscriber churn.