Sentiment Analysis on Movie Reviews

Project Overview

This project performs binary sentiment analysis on the IMDB Movie Reviews dataset, classifying each review as positive or negative with several machine learning and deep learning models. It covers the full pipeline: data exploration, preprocessing, model training, and evaluation.

Dataset

Source: IMDB Dataset
Size: 50,000 movie reviews
Class Balance: Balanced (25,000 positive, 25,000 negative)
Features: Raw textual reviews
Target: Sentiment (Binary: 0 for Negative, 1 for Positive)
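
The label encoding and class-balance check described above can be sketched with the standard library alone. The column names `review` and `sentiment`, the string labels `positive`/`negative`, and the two-row sample are assumptions made for illustration; the notebook itself loads the data with Pandas.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row sample standing in for the real CSV. The column
# names "review" and "sentiment" are assumptions about the file layout.
SAMPLE_CSV = """review,sentiment
"A wonderful, moving film.",positive
"Dull plot and wooden acting.",negative
"""

# Target encoding stated in the Dataset section: 0 = negative, 1 = positive.
LABEL_MAP = {"negative": 0, "positive": 1}


def load_labeled_reviews(handle):
    """Return (texts, labels) with labels encoded as 0/1."""
    texts, labels = [], []
    for row in csv.DictReader(handle):
        texts.append(row["review"])
        labels.append(LABEL_MAP[row["sentiment"].strip().lower()])
    return texts, labels


texts, labels = load_labeled_reviews(io.StringIO(SAMPLE_CSV))
class_counts = Counter(labels)  # on the full dataset, expect 25,000 per class
print(class_counts)
```

Passing an open file handle instead of the `io.StringIO` sample would apply the same check to the full dataset.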

Technologies Used

Language: Python 3.x
Libraries:
- Data Manipulation: Pandas, NumPy
- Visualization: Matplotlib, Seaborn, WordCloud
- Natural Language Processing: NLTK (Tokenization, Stopwords, Stemming)
- Machine Learning: Scikit-learn (TF-IDF, Logistic Regression, SVM, KNN, MLP)
- Deep Learning: TensorFlow/Keras (CNN, Embedding, Sequence Padding)

Methodology

1. Data Preprocessing

Text Cleaning: Removal of HTML tags, punctuation, special characters, and digits.
Normalization: Conversion to lowercase.
NLP Techniques: Tokenization, stopword removal, and Porter Stemming.
Vectorization:
- TF-IDF: Used for traditional ML models (max 5000 features).
- Sequence Padding: Used for the CNN model (max length 200).
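
To illustrate what padding to a maximum length of 200 does, here is a minimal pure-Python sketch. It pads and truncates at the front of the sequence, which mirrors the default behavior of Keras `pad_sequences`; whether the notebook keeps those defaults is an assumption, and `pad_sequence` is a hypothetical helper name.

```python
def pad_sequence(token_ids, max_len=200, pad_value=0):
    """Left-pad or left-truncate a list of token ids to exactly max_len.

    Assumption: padding and truncation both happen at the front, as in the
    Keras pad_sequences defaults. The notebook may be configured differently.
    """
    if len(token_ids) >= max_len:
        # Keep only the last max_len tokens.
        return token_ids[len(token_ids) - max_len:]
    return [pad_value] * (max_len - len(token_ids)) + token_ids


short_review = pad_sequence([12, 7, 345], max_len=5)          # [0, 0, 12, 7, 345]
long_review = pad_sequence([1, 2, 3, 4, 5, 6], max_len=4)     # [3, 4, 5, 6]
```

Every review ends up with the same length, which is what lets the CNN consume them as one fixed-shape batch.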
Data Split: 75% Training, 25% Testing.
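
The cleaning, normalization, tokenization, and stopword steps listed above can be sketched as follows. This is a dependency-free approximation: the stopword set is a tiny illustrative list (the notebook uses NLTK's), tokenization is a plain whitespace split, Porter stemming is left out, and `clean_review` is a hypothetical name.

```python
import re

# Tiny illustrative stopword list; an assumption standing in for NLTK's
# English stopword corpus.
STOPWORDS = {"a", "an", "and", "the", "is", "was", "it", "this", "of", "to", "in"}

TAG_RE = re.compile(r"<[^>]+>")          # HTML tags such as <br />
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # punctuation, digits, special characters


def clean_review(text):
    """Strip HTML, lowercase, drop non-letters, tokenize, remove stopwords."""
    text = TAG_RE.sub(" ", text)
    text = text.lower()
    text = NON_LETTER_RE.sub(" ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]


tokens = clean_review("This movie was GREAT!<br /><br />10/10, loved it.")
print(tokens)  # ['movie', 'great', 'loved']
```

In the notebook, stemming would follow this step, and the resulting tokens would then feed either the TF-IDF vectorizer or the sequence encoder.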

2. Models Implemented

The project implements and compares the following classification models:

Logistic Regression: A baseline linear model suitable for sparse high-dimensional text data.
Linear SVC (Support Vector Classifier): A linear-kernel support vector machine that scales well to sparse, high-dimensional TF-IDF features.
K-Nearest Neighbors (KNN): A non-parametric method based on feature similarity.
Multi-Layer Perceptron (MLP): Feedforward neural networks, trained in three variants with 1, 2, and 3 hidden layers.
Convolutional Neural Network (CNN): A deep learning model utilizing 1D convolutions to capture local patterns in text sequences.
- Architecture: Embedding (128 dim) -> Conv1D (128 filters) -> GlobalMaxPooling -> Dense (64) -> Output (Sigmoid).
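
To make the architecture line concrete, the sketch below works out each layer's output shape and parameter count. The embedding size, filter count, dense width, and sequence length come from this README; the vocabulary size (10,000), the kernel size (5), and the use of 'valid' padding with stride 1 are assumptions chosen only for illustration, and `cnn_summary` is a hypothetical helper.

```python
def cnn_summary(vocab_size, kernel_size, seq_len=200, embed_dim=128,
                filters=128, dense_units=64):
    """Return {layer: (output_shape, param_count)} for the stated CNN.

    Assumes a Conv1D with 'valid' padding and stride 1, and a single
    sigmoid output unit. vocab_size and kernel_size are not given in the
    README and must be supplied by the caller.
    """
    conv_len = seq_len - kernel_size + 1
    return {
        "embedding": ((seq_len, embed_dim), vocab_size * embed_dim),
        "conv1d": ((conv_len, filters),
                   kernel_size * embed_dim * filters + filters),
        "global_max_pool": ((filters,), 0),
        "dense": ((dense_units,), filters * dense_units + dense_units),
        "output": ((1,), dense_units + 1),
    }


# Hypothetical settings: 10,000-word vocabulary, kernel size 5.
summary = cnn_summary(vocab_size=10_000, kernel_size=5)
total_params = sum(params for _, params in summary.values())
print(total_params)  # 1370369
```

Under these assumptions the embedding table holds 1,280,000 of the roughly 1.37 million parameters, so the vocabulary size is the main lever on model size.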

Results

The models were evaluated based on Accuracy, Precision, Recall, and F1-Score.
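
For reference, the four metrics can be computed from the confusion counts as in this minimal sketch. It treats label 1 (positive) as the positive class; `binary_metrics` and the toy predictions are illustrative and not taken from the notebook, which reports these metrics through Scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                         [1, 1, 0, 0, 0, 1, 1, 0])
print(metrics)  # every metric equals 0.75 for this toy example
```

Because the dataset is balanced, accuracy is a fair headline number here; precision and recall show whether errors lean toward one class.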

Best Performers: Logistic Regression, Linear SVC, and CNN achieved the highest accuracies, reaching approximately 88%.
CNN Performance: The CNN matched the linear models at roughly 88% accuracy while learning its features directly from the padded token sequences rather than from TF-IDF vectors.
KNN Performance: Accuracy was lower (~77%) than for the linear models and neural networks, likely because distance-based neighbor lookups are less discriminative in sparse, high-dimensional TF-IDF space.

How to Run

Ensure all dependencies are installed:

pip install pandas numpy matplotlib seaborn nltk scikit-learn tensorflow wordcloud

Download IMDB Dataset.csv and place it in the project root.
Run the Jupyter Notebook SentimentAnalysis_MovieReview.ipynb to execute the analysis and training pipeline.
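
Before opening the notebook, a quick pre-flight check along these lines can confirm that the packages import and the dataset file is in place. The script is a hypothetical helper, not part of the repository; the module names are the import names of the packages in the install command, and the dataset filename is copied from the step above.

```python
import importlib.util
from pathlib import Path

# Import names for the packages listed in the pip install command
# (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = ("pandas", "numpy", "matplotlib", "seaborn",
                    "nltk", "sklearn", "tensorflow", "wordcloud")


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def preflight(dataset_path="IMDB Dataset.csv", modules=REQUIRED_MODULES):
    """Return a list of human-readable problems; empty means ready to run."""
    problems = [f"missing package: {name}" for name in missing_modules(modules)]
    if not Path(dataset_path).is_file():
        problems.append(f"dataset not found: {dataset_path}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```

If the notebook relies on NLTK's stopword list and tokenizer, those resources typically also need a one-time `nltk.download` call before the first run.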

Conclusion

This project demonstrates that both traditional linear models (like SVM and Logistic Regression) and deep learning models (CNN) are highly effective for sentiment analysis on this dataset, with CNNs offering the potential for further scalability on larger, more complex corpora.
