The Email Spam Classifier is a Machine Learning project that detects whether an email/message is Spam or Ham (Not Spam) using Natural Language Processing (NLP) techniques.
This project uses:
- Python for implementation
- Pandas & NumPy for data handling
- TF-IDF Vectorization for text feature extraction
- Logistic Regression for classification
- Scikit-learn for machine learning utilities
The model is trained on a dataset containing labeled email messages and predicts whether a given message is spam or not.
- Preprocess email/message text data
- Convert text into numerical features using TF-IDF
- Train a machine learning model for spam detection
- Evaluate the model accuracy
- Predict whether new messages are spam or ham
- Python
- Jupyter Notebook
- NumPy
- Pandas
- Scikit-learn
Email-Spam-Classifier/
│
├── Email Spam Classifier.ipynb # Main project notebook
├── mail_data.csv # Dataset used for training
├── README.md # Project documentation
The dataset contains:
- Category → Label indicating whether the message is spam or ham
- Message → The actual email/text message
Example:
| Category | Message |
|---|---|
| ham | Hello, how are you? |
| spam | Congratulations! You won a prize. |
The project imports libraries for:
- Data manipulation
- Machine learning
- Text vectorization
- Model evaluation
The dataset is loaded using Pandas.
import pandas as pd
df = pd.read_csv('mail_data.csv')
- Handle missing values
- Convert labels:
- spam → 0
- ham → 1
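The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the small in-memory `df` is a hypothetical stand-in for `mail_data.csv`, and the names `X` and `Y` are assumed because the later split step refers to them.

```python
import pandas as pd

# Hypothetical stand-in for mail_data.csv, with one missing message.
df = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["Hello, how are you?", "Congratulations! You won a prize.", None],
})

# Handle missing values: replace nulls with empty strings.
data = df.where(pd.notnull(df), "")

# Convert labels using the encoding listed above: spam -> 0, ham -> 1.
data["Category"] = data["Category"].map({"spam": 0, "ham": 1})

# Separate the text (X) from the numeric labels (Y).
X = data["Message"]
Y = data["Category"].astype(int)
```

Note that this encoding treats ham as the positive class (1), so any later metric that depends on a positive label should be read with that in mind.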
data = df.where((pd.notnull(df)), '')
Text messages are converted into numerical vectors using the TF-IDF Vectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
The dataset is divided into training and testing data.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)
The project uses Logistic Regression.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
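# TF-IDF features for the training messages (vectorizer fitted on training data only)
X_train_features = feature_extraction.fit_transform(X_train)
model.fit(X_train_features, Y_train)
Accuracy is calculated using: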
from sklearn.metrics import accuracy_score
The trained model predicts whether a new email is spam or ham.
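The snippets above show each step separately, so it may help to see them connected. The following is a minimal sketch, assuming a tiny hypothetical dataset in place of `mail_data.csv`. The `stratify=Y` argument is an addition for this toy example only, so that both classes appear in the small training set; the variable names `X_test_features`, `train_accuracy`, and `test_accuracy` are also assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data with the same two columns as mail_data.csv.
data = pd.DataFrame({
    "Category": ["ham", "spam"] * 5,
    "Message": [
        "Hello, how are you?",
        "Congratulations! You won a prize.",
        "Are we still meeting for lunch tomorrow?",
        "Claim your free ticket now",
        "Please send me the report by Friday",
        "You have won a cash reward, click here",
        "Can you call me when you get home?",
        "Free prize waiting, claim your reward today",
        "Thanks for the notes from the meeting",
        "Winner! Claim your free cash prize now",
    ],
})

X = data["Message"]
Y = data["Category"].map({"spam": 0, "ham": 1}).astype(int)  # spam -> 0, ham -> 1

# stratify=Y is an assumption for this small example, not part of the README's call.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=3, stratify=Y
)

# Fit the vectorizer on training text only, then reuse it for the test text.
feature_extraction = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

model = LogisticRegression()
model.fit(X_train_features, Y_train)

# Accuracy on both splits; a large gap between them would suggest overfitting.
train_accuracy = accuracy_score(Y_train, model.predict(X_train_features))
test_accuracy = accuracy_score(Y_test, model.predict(X_test_features))
print("Training accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
```

Calling `transform` (not `fit_transform`) on the test messages keeps the vocabulary fixed to what the model saw during training. Accuracy figures from a ten-row toy set carry no meaning beyond showing the call pattern.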
git clone https://github.com/your-username/email-spam-classifier.git
cd email-spam-classifier
pip install numpy pandas scikit-learn jupyter
jupyter notebook
Open:
Email Spam Classifier.ipynb
Logistic Regression is a supervised machine learning algorithm used for binary classification problems.
Advantages:
- Simple and efficient
- Fast training
- Good performance for text classification
- Works well with TF-IDF features
The model predicts:
Spam Message
OR
Ham Message
- Use advanced NLP techniques
- Implement Deep Learning models
- Build a web application interface
- Improve accuracy with larger datasets
- Add real-time email filtering
input_mail = ["Congratulations! You have won a free ticket"]
Output:
Spam Mail
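One way the example above could be produced is sketched below. It assumes a small hypothetical training corpus in place of `mail_data.csv`, and `classify_mail` is an illustrative helper name that does not come from the notebook. The label strings follow the "Spam Mail" wording shown above, with "Ham Mail" assumed as its counterpart.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus; the real project trains on mail_data.csv.
messages = [
    "Hello, how are you?",
    "Congratulations! You won a prize.",
    "Are we still meeting for lunch tomorrow?",
    "You have won a free ticket, claim now",
    "Please send me the report by Friday",
    "Free prize waiting, click to claim your reward",
]
labels = [1, 0, 1, 0, 1, 0]  # same encoding as the README: spam -> 0, ham -> 1

feature_extraction = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
model = LogisticRegression()
model.fit(feature_extraction.fit_transform(messages), labels)


def classify_mail(text):
    """Return 'Spam Mail' or 'Ham Mail' for one message (hypothetical helper)."""
    # Reuse the fitted vectorizer so the features match the training vocabulary.
    features = feature_extraction.transform([text])
    prediction = model.predict(features)[0]
    return "Spam Mail" if prediction == 0 else "Ham Mail"


input_mail = ["Congratulations! You have won a free ticket"]
print(classify_mail(input_mail[0]))
```

The new message must pass through the same fitted vectorizer as the training data; fitting a fresh vectorizer on the input alone would produce features the model cannot interpret.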
Contributions are welcome.
Steps:
- Fork the repository
- Create a new branch
- Make changes
- Submit a pull request