A machine learning and data analysis project that explores the Netflix dataset to uncover content trends, perform data cleaning, and visualize meaningful patterns in streaming media.
This project explores and analyzes the Netflix dataset to extract actionable insights using Python. It includes data cleaning, exploratory data analysis (EDA), and visualizations to understand content distribution by type, country, genre, and release timeline.
- Clean and preprocess the dataset
- Analyze the distribution of Netflix content (Movies vs TV Shows)
- Identify content trends by country, release year, and ratings
- Visualize top genres, directors, and frequently appearing cast members
- Answer specific business-oriented queries using code and visual analysis
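
The first two objectives can be sketched in a few lines of pandas. This is a minimal illustration, not the notebook's actual code: the function names are hypothetical, and the column names (`type`, `director`, `cast`, `country`, `date_added`) are assumed to match the commonly distributed Kaggle file.

```python
import pandas as pd


def clean_netflix(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: no duplicate rows, filled text gaps, parsed dates."""
    out = df.drop_duplicates().copy()

    # Text columns that are often empty; a placeholder keeps the rows countable.
    for col in ("director", "cast", "country"):
        if col in out.columns:
            out[col] = out[col].fillna("Unknown")

    # date_added is assumed to be text such as "September 25, 2021".
    # Values that cannot be parsed become NaT instead of raising.
    if "date_added" in out.columns:
        out["date_added"] = pd.to_datetime(
            out["date_added"].str.strip(), errors="coerce"
        )
        out["year_added"] = out["date_added"].dt.year

    return out.reset_index(drop=True)


def type_distribution(df: pd.DataFrame) -> pd.Series:
    """Count titles per content type (Movie vs TV Show)."""
    return df["type"].value_counts()
```

Filling missing text with `"Unknown"` is one possible choice; dropping those rows is an equally valid alternative when the affected columns are central to a question.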
- Source: Kaggle – Netflix Dataset
- File: `netflix_titles.csv`
- ⚠ Dataset is not included in this repository due to redistribution restrictions.
- Download manually from the link above and place it in your project folder to run the notebook.
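
Because the CSV has to be downloaded by hand, a loader that fails with a clear hint can save some confusion. The sketch below assumes the file sits in the project folder under its original name; `load_netflix` and `DATA_FILE` are hypothetical names, not part of the notebook.

```python
from pathlib import Path

import pandas as pd

# Assumed location: the CSV was placed in the project folder by hand.
DATA_FILE = Path("netflix_titles.csv")


def load_netflix(path: Path = DATA_FILE) -> pd.DataFrame:
    """Read the manually downloaded CSV, failing with a clear hint if absent."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found. Download it from Kaggle and place it "
            "in the project folder first."
        )
    return pd.read_csv(path)


# Typical use in the first notebook cell:
# df = load_netflix()
# df.info()
```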
- Data Cleaning (handling nulls, duplicates, format issues)
- Exploratory Data Analysis (EDA)
- Grouping, filtering, and cross-analysis by multiple columns
- Visualization using:
  - Seaborn
  - Matplotlib
- Insightful Question-Answering (e.g., most active countries, popular genres)
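
Several columns in this kind of dataset hold comma-separated values, so grouping usually starts by splitting them into one row per value. The sketch below shows one way to count and plot the most frequent entries. It assumes the genre column is named `listed_in`, as in the usual Kaggle file, uses plain Matplotlib (Seaborn's `barplot` could be swapped in), and the helper names are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def top_values(df: pd.DataFrame, column: str, n: int = 10) -> pd.Series:
    """Count the most frequent entries of a comma-separated column."""
    values = (
        df[column]
        .dropna()
        .str.split(",")  # "Dramas, Comedies" -> ["Dramas", " Comedies"]
        .explode()       # one row per individual entry
        .str.strip()
    )
    values = values[values != ""]
    return values.value_counts().head(n)


def plot_top_values(counts: pd.Series, title: str):
    """Draw a horizontal bar chart with the largest count at the top."""
    fig, ax = plt.subplots(figsize=(8, 5))
    ordered = counts.sort_values()  # ascending, so the largest bar is drawn last (top)
    ax.barh(ordered.index.astype(str), ordered.to_numpy())
    ax.set_xlabel("Number of titles")
    ax.set_title(title)
    fig.tight_layout()
    return ax
```

The same two helpers work for `director`, `cast`, or `country`, for example `plot_top_values(top_values(df, "director"), "Top directors")`.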
- Language: Python
- Libraries: Pandas, NumPy, Matplotlib, Seaborn
- Environment: Google Colab
- Version Control: Git & GitHub
- TV Shows dominate recent additions compared to Movies.
- The United States and India are the top content providers.
- Content additions peaked between 2017 and 2019.
- "Documentaries" and "Dramas" are the most frequent genres.
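
Findings like these can be checked against simple summary tables. The following is a sketch under the same column-name assumptions (`type`, `country`, `date_added`), with hypothetical function names; it makes no claim about the exact numbers the notebook produces.

```python
import pandas as pd


def additions_per_year(df: pd.DataFrame) -> pd.DataFrame:
    """Titles added per year, with one column per content type."""
    added = df["date_added"]
    # Parse text dates unless the column was already converted during cleaning.
    if not pd.api.types.is_datetime64_any_dtype(added):
        added = pd.to_datetime(added.str.strip(), errors="coerce")
    return (
        df.assign(year_added=added.dt.year)
        .dropna(subset=["year_added"])  # rows without a usable date are skipped
        .astype({"year_added": int})
        .groupby(["year_added", "type"])
        .size()
        .unstack(fill_value=0)
    )


def top_countries(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Most frequent countries; a co-production counts once per country."""
    countries = df["country"].dropna().str.split(",").explode().str.strip()
    return countries[countries != ""].value_counts().head(n)
```

Comparing the `Movie` and `TV Show` columns of `additions_per_year(df)` across years shows how the mix shifts over time and where additions peak, while `top_countries(df)` ranks the content providers.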
- Clone the repository or download the notebook.
- Download the dataset from Kaggle.
- Place `netflix_titles.csv` in the same directory as the notebook.
- Open `netflix_Ml_Project.ipynb` in Jupyter Notebook or Google Colab.
- Run all cells to reproduce the results.
Raviha Khan
📍 Karachi, Pakistan
🔗 LinkedIn
🐙 GitHub
📧 ravihakhan53@gmail.com
"Learning by doing — turning Netflix data into meaningful insights."