DeebeshS-ML/nlp-text-preprocessing-regex

SMS Spam Classification using NLP

Overview

This project builds a Natural Language Processing (NLP) model that classifies SMS messages as Spam or Ham (non-spam).

It demonstrates common text preprocessing techniques (lowercasing, tokenization, stopword removal, stemming, lemmatization, and regular-expression cleaning) and feature extraction methods (Bag of Words and TF-IDF).


Problem Statement

Spam messages are unsolicited messages that degrade the user experience and can pose security risks. The objective of this project is to build a machine learning model that automatically identifies whether an SMS message is spam or ham.


Objectives

  • Perform text preprocessing.
  • Clean and normalize text data.
  • Convert text into numerical features.
  • Train machine learning models.
  • Classify messages into Spam or Ham categories.

NLP Techniques Used

Text Preprocessing

  • Lowercase conversion
  • Tokenization
  • Stopword Removal
  • Stemming
  • Lemmatization
  • Regular Expressions
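
The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the `preprocess` function, the two regular expressions, and the tiny `STOPWORDS` set are assumptions made for the example. In the project itself, NLTK would normally supply the stopword list, and a stemmer or lemmatizer would run on the returned tokens.

```python
import re

# Small illustrative stopword set (an assumption for this sketch);
# NLTK's stopwords corpus would normally supply a fuller list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "you", "your",
             "for", "of", "and", "in", "on", "at"}

# Hypothetical patterns: one for URLs, one for anything that is not a letter or whitespace.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def preprocess(message: str) -> list[str]:
    """Lowercase, strip URLs and non-letters, tokenize on whitespace, drop stopwords."""
    text = message.lower()
    text = URL_RE.sub(" ", text)        # remove links before stripping punctuation
    text = NON_ALPHA_RE.sub(" ", text)  # keep only letters and whitespace
    tokens = text.split()               # simple whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]


print(preprocess("WINNER!! You have won a FREE prize. Claim at http://example.com now"))
# → ['winner', 'have', 'won', 'free', 'prize', 'claim', 'now']
```

Removing URLs before punctuation matters here: stripping non-letters first would break a link into stray word fragments that then survive as tokens.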

Feature Engineering

  • Bag of Words (via Scikit-Learn's CountVectorizer)
  • TF-IDF (via Scikit-Learn's TfidfVectorizer)
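
To show what these features contain, the sketch below computes Bag of Words counts and TF-IDF weights from scratch. It is an illustration under stated assumptions, not the project's code: `bag_of_words` and `tf_idf` are hypothetical helper names, and the three token lists are made up. The IDF formula is the smoothed form that Scikit-Learn uses by default, but the sketch skips the L2 row normalization that Scikit-Learn's TF-IDF vectorizer also applies by default, so the numbers will not match its output exactly.

```python
import math
from collections import Counter


def bag_of_words(docs: list[list[str]]) -> tuple[list[str], list[list[int]]]:
    """Return a sorted vocabulary and one term-count row per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def tf_idf(counts: list[list[int]]) -> list[list[float]]:
    """Weight raw counts by a smoothed inverse document frequency."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in counts]


# Made-up, already-tokenized messages for illustration only.
docs = [["free", "prize", "free"], ["call", "me", "later"], ["free", "call"]]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)

print(vocab)   # → ['call', 'free', 'later', 'me', 'prize']
print(counts)  # → [[0, 2, 0, 0, 1], [1, 0, 1, 1, 0], [1, 1, 0, 0, 0]]
```

In the second document, "later" appears in one message and "call" in two, so "later" receives the larger TF-IDF weight even though both have a raw count of 1.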

Libraries Used

  • Python
  • Pandas
  • NumPy
  • NLTK
  • Scikit-Learn
  • Matplotlib
  • Seaborn

Workflow

  1. Data Cleaning
  2. Exploratory Data Analysis
  3. Text Preprocessing
  4. Tokenization
  5. Stemming and Lemmatization
  6. Feature Extraction
  7. Model Building
  8. Model Evaluation
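
The feature extraction, model building, and evaluation steps can be sketched end to end with Scikit-Learn. This README does not name the classifier used in the notebook, so the Multinomial Naive Bayes model here is an assumption, chosen because it is a common baseline for text classification. The eight messages are made up for illustration and are not the project's dataset, so the reported accuracy says nothing about real performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy messages for illustration only; not the project's dataset.
messages = [
    "Win a free prize now, claim your reward",
    "Free entry in a weekly cash draw, text WIN",
    "Urgent! Your account has won a cash bonus",
    "Claim your free ringtone today",
    "Are we still meeting for lunch today?",
    "Can you call me when you get home",
    "I will be late, see you at the office",
    "Thanks for the notes, talk to you later",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Hold out a stratified quarter of the messages for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=42
)

# Vectorizer and classifier chained so that raw text goes in and labels come out.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy on the toy hold-out set: {accuracy:.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline means the vocabulary is learned from the training split only, which avoids leaking information from the test messages into the features.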

Project Structure

sms-spam-classification-nlp
│
├── data/
├── sms_spam_classifier.ipynb
├── requirements.txt
├── README.md
└── images/

Skills Demonstrated

  • Natural Language Processing
  • Text Cleaning
  • Tokenization
  • Regular Expressions
  • Stemming
  • Lemmatization
  • Count Vectorization
  • TF-IDF Vectorization
  • Machine Learning
  • Feature Engineering

Applications

  • Email Spam Filtering
  • SMS Spam Detection
  • Chatbots
  • Sentiment Analysis
  • Text Classification
  • Information Retrieval

Future Improvements

  • Word2Vec Embeddings
  • GloVe Embeddings
  • LSTM Models
  • BERT Transformers
  • Hyperparameter Tuning
  • Model Deployment using Streamlit

Author

Deebesh Sundar

Machine Learning & Data Science Practitioner
