SMS-Spam-Classification-with-PySpark

This project implements several machine learning models that classify SMS messages as spam or non-spam. Each model is evaluated on accuracy, sensitivity, specificity, and the area under the ROC curve (AUC). The goal is to identify the best-performing model for SMS spam detection.

Project Overview

This project involves building and evaluating the following machine learning models for SMS spam classification:

Naive Bayes Classifier
Random Forest Classifier
Support Vector Classifier (SVC)
K-Nearest Neighbors (KNN)
Logistic Regression

Each model is trained on a preprocessed dataset of SMS messages and evaluated using metrics such as accuracy, sensitivity, specificity, and AUC to determine the best model for classification.

Setup and Installation

To get started with this project, ensure that you have a working Python environment with PySpark installed.

3. Start Spark Session

To use Spark, you’ll need to start a Spark session in your Python environment. You can do this with the following code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SMS Spam Classification").getOrCreate()

Data Preprocessing

The preprocessing steps include:

Data Loading: The dataset is loaded from a CSV file into a PySpark DataFrame.
Renaming Columns: The dataset columns are renamed to input (SMS text) and output (spam label).
Dropping Irrelevant Columns: Unnecessary columns are removed to clean up the dataset.
Handling Null/Empty Values: Rows with missing or empty SMS text are filtered out.
Text Cleaning: The text data is cleaned by removing non-alphabetical characters, converting text to lowercase, and stemming words.
Tokenization: The text is split into tokens (words).
Count Vectorization: The tokenized words are converted into feature vectors using CountVectorizer.
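To make the cleaning, tokenization, and count vectorization steps concrete, here is a minimal plain-Python sketch of what they do to a few messages. It is not the project's PySpark code: the function names (clean_text, tokenize, build_vocabulary, count_vector) and the sample messages are hypothetical, stemming is left out, and ordering the vocabulary by descending frequency is an assumption made for illustration.

```python
import re
from collections import Counter


def clean_text(message):
    """Replace every run of non-letter characters with a space, then lowercase.

    Stemming is intentionally omitted from this sketch.
    """
    letters_only = re.sub(r"[^A-Za-z]+", " ", message)
    return letters_only.lower().strip()


def tokenize(cleaned):
    """Split cleaned text into word tokens on whitespace."""
    return cleaned.split()


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, most frequent token first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency, then alphabetically so ties are stable.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: index for index, tok in enumerate(ordered)}


def count_vector(tokens, vocabulary):
    """Count how often each vocabulary token occurs in one message."""
    vector = [0] * len(vocabulary)
    for tok in tokens:
        if tok in vocabulary:
            vector[vocabulary[tok]] += 1
    return vector


# Hypothetical sample messages; the last one is empty and gets filtered out.
messages = ["WIN a FREE prize!!! Call 0800 now", "Are you free for lunch?", ""]

non_empty = [m for m in messages if m and m.strip()]
token_lists = [tokenize(clean_text(m)) for m in non_empty]
vocabulary = build_vocabulary(token_lists)
vectors = [count_vector(tokens, vocabulary) for tokens in token_lists]
```

Each entry of vectors has one position per vocabulary word, which is the kind of feature vector a count-based vectorizer hands to the classifiers.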

Model Training and Evaluation

Five machine learning models were trained and evaluated:

1. Naive Bayes Classifier

Type: Multinomial Naive Bayes
Accuracy: 98%
Sensitivity: 90.5%
Specificity: 99.2%
AUC: 0.9836

2. Random Forest Classifier

Number of Trees: 100
Max Depth: 10
Accuracy: 93.6%
Sensitivity: 53.1%
Specificity: 100%
AUC: 0.9804

3. Support Vector Classifier (SVC)

Max Iterations: 100
Accuracy: 97.6%
Sensitivity: 83.7%
Specificity: 99.78%
AUC: 0.9894

4. K-Nearest Neighbors (KNN)

Neighbors: 5
Accuracy: 97.3%
Sensitivity: 80.3%
Specificity: 100%
AUC: 0.9876

5. Logistic Regression

Max Iterations: 100
Accuracy: 97.3%
Sensitivity: 80.3%
Specificity: 100%
AUC: 0.9920

Results

The models were evaluated on the following metrics:

Accuracy: The fraction of all messages (spam and non-spam) that the model classifies correctly.
Sensitivity (Recall): The fraction of actual spam messages that the model correctly flags as spam.
Specificity: The fraction of actual non-spam messages that the model correctly leaves unflagged.
AUC: The area under the ROC curve, which summarizes how well the model scores separate spam from non-spam across all decision thresholds.
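As an illustration of how these four metrics relate to raw predictions, the sketch below computes them in plain Python from a small made-up set of labels, predictions, and scores. The helper names and the toy data are hypothetical and are not taken from the project; spam is assumed to be encoded as 1, non-spam as 0, and both classes are assumed to be present.

```python
def confusion_counts(labels, predictions, positive=1):
    """Count true/false positives and negatives, treating `positive` as spam."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(labels, predictions):
        if actual == positive and predicted == positive:
            tp += 1
        elif actual == positive:
            fn += 1
        elif predicted == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # share of spam that was caught
        "specificity": tn / (tn + fp),  # share of non-spam left alone
    }


def auc(labels, scores, positive=1):
    """AUC as the probability that a random spam message outscores a random
    non-spam message, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data for illustration only (1 = spam, 0 = non-spam).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.1, 0.2, 0.3, 0.35, 0.6]

tp, fp, tn, fn = confusion_counts(labels, predictions)
metrics = classification_metrics(tp, fp, tn, fn)
metrics["auc"] = auc(labels, scores)
```

The pairwise AUC shown here is quadratic in the number of messages, which is fine for a small illustration but not for a full dataset.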

Best Model

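The Multinomial Naive Bayes model was identified as the best model, with an accuracy of 98%, sensitivity of 90.5%, specificity of 99.2%, and an AUC of 0.9836. It has the highest accuracy and the highest sensitivity of the five models. Its specificity is the lowest of the five and its AUC is below that of SVC, KNN, and Logistic Regression (the highest at 0.9920), but the other models reach their higher specificity by missing more spam. Naive Bayes therefore gives the best balance between correctly identifying spam and non-spam messages.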
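One way to put a number on that balance is balanced accuracy, the mean of sensitivity and specificity. The README does not state which criterion was used to pick the best model, so treating balanced accuracy as the criterion is an assumption. The sketch below applies it to the figures reported above.

```python
# Sensitivity and specificity (in percent) as reported in this README.
reported = {
    "Naive Bayes": {"sensitivity": 90.5, "specificity": 99.2},
    "Random Forest": {"sensitivity": 53.1, "specificity": 100.0},
    "SVC": {"sensitivity": 83.7, "specificity": 99.78},
    "KNN": {"sensitivity": 80.3, "specificity": 100.0},
    "Logistic Regression": {"sensitivity": 80.3, "specificity": 100.0},
}


def balanced_accuracy(model_metrics):
    """Mean of sensitivity and specificity (assumed selection criterion)."""
    return (model_metrics["sensitivity"] + model_metrics["specificity"]) / 2


# Rank model names from highest to lowest balanced accuracy.
ranking = sorted(
    reported,
    key=lambda name: balanced_accuracy(reported[name]),
    reverse=True,
)
best = ranking[0]

for name in ranking:
    print(f"{name}: {balanced_accuracy(reported[name]):.2f}%")
```

Under this criterion Naive Bayes ranks first and Random Forest last, which matches the conclusion above. A different criterion, such as AUC alone, would rank the models differently.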
