HamOrSpam

AI-Powered Email Classification System


Application Overview

HamOrSpam is a production-grade spam detection system that combines TF-IDF vectorization with logistic regression to classify email messages with greater than 98% accuracy. The application ships with a Streamlit-based analytics interface, real-time inference, and model export capabilities.


Overview

Email spam detection remains a core challenge in messaging infrastructure. HamOrSpam addresses this with a lightweight yet high-performing classification pipeline trained on the Enron Spam Dataset. The system handles class imbalance via SMOTE oversampling and exposes both a web UI and a REST-compatible prediction endpoint.


Features

  • Real-time spam classification with calibrated probability scoring
  • Interactive analytics dashboard displaying model performance metrics
  • Word cloud visualizations highlighting key spam and ham indicators
  • SMOTE-based training pipeline to handle imbalanced class distributions
  • Exportable trained model artifacts for downstream integration
  • Dark-mode Streamlit UI with responsive layout

Architecture

Raw Email Text
      |
      v
TF-IDF Vectorization       (max_features configurable, default: 3000)
      |
      v
SMOTE Class Balancing      (applied during training only)
      |
      v
Logistic Regression        (sklearn, threshold configurable)
      |
      v
Spam / Ham Prediction      (label + confidence score)

The pipeline is intentionally lean. TF-IDF captures term importance relative to the corpus without requiring dense embeddings, and logistic regression provides well-calibrated probability outputs suitable for threshold tuning.


Dataset

| Property | Detail |
|---|---|
| Source | Enron Spam Dataset |
| Total Samples | 5,000+ |
| Class Balance | 70% Ham / 30% Spam (pre-SMOTE) |
| Label Format | Binary: spam / ham |

Performance

Evaluated on a held-out test split after SMOTE-balanced training.

| Metric | Score |
|---|---|
| Accuracy | 98.2% |
| Precision | 97.8% |
| Recall | 98.5% |
| F1 Score | 98.1% |
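These four metrics are standard scikit-learn functions; the snippet below shows how they are computed on a held-out split, using small illustrative arrays rather than the project's actual predictions.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative held-out labels and predictions (1 = spam, 0 = ham).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

acc = accuracy_score(y_test, y_pred)    # correct predictions / all predictions
prec = precision_score(y_test, y_pred)  # of messages labeled spam, how many were spam
rec = recall_score(y_test, y_pred)      # of actual spam, how much was caught
f1 = f1_score(y_test, y_pred)           # harmonic mean of precision and recall

print(f"Accuracy: {acc:.3f}  Precision: {prec:.3f}  Recall: {rec:.3f}  F1: {f1:.3f}")
```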

Model Performance

Accuracy-curve and evaluation-metric plots are included as figures.


Installation

Prerequisites

  • Python 3.8 or higher
  • pip

Steps

```shell
# Clone the repository
git clone https://github.com/codewithshami/HamOrSpam-Classifier.git
cd HamOrSpam-Classifier

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirement.txt
```

Usage

Launch the Application

```shell
streamlit run app.py
```

The application will be available at http://localhost:8501 by default.

Application Interface

Getting Started

Before classifying messages, refer to the onboarding screen for a walkthrough of available features and controls.

Instructions

Classifying a Message

  1. Navigate to the Prediction tab in the sidebar.
  2. Paste or type the email content into the input area.
  3. Click Analyze Message.
  4. Review the classification label, confidence score, and probability distribution.
  5. Optionally inspect the keyword analysis panel for feature-level explanations.

Analyser View

Analyser — Spam Result

Analyser — Ham Result

Analytics Dashboard

The Analytics tab surfaces model performance metrics, confusion matrix visualizations, word frequency distributions, and dataset statistics without requiring any additional configuration.

Data Statistics

Word Frequency Distribution
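A word-frequency distribution like the dashboard's can be derived with a simple token count; the corpus and tokenization below are illustrative, not the app's actual preprocessing.

```python
from collections import Counter

# Toy corpus standing in for the dataset's message bodies.
corpus = ["free prize inside", "free cash now", "meeting at noon"]

# Count lowercase whitespace-delimited tokens across all documents.
counts = Counter(word for doc in corpus for word in doc.lower().split())
top = counts.most_common(3)  # highest-frequency words first
```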


Configuration

Create a .env file in the project root to override default settings:

```
DEBUG_MODE=False
THRESHOLD=0.85
MAX_FEATURES=3000
```

| Variable | Default | Description |
|---|---|---|
| DEBUG_MODE | False | Enables verbose logging when set to True |
| THRESHOLD | 0.85 | Minimum spam probability required for a spam label |
| MAX_FEATURES | 3000 | Number of features retained by the TF-IDF vectorizer |

Raising THRESHOLD reduces false positives at the cost of recall. Lowering it increases sensitivity. Adjust based on the tolerance of the deployment environment.
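The threshold logic amounts to comparing the model's spam probability against `THRESHOLD` instead of the default 0.5 cutoff. A minimal sketch, assuming class 1 is spam and the helper name is hypothetical:

```python
THRESHOLD = 0.85  # default from the configuration table above

def label_message(spam_probability: float, threshold: float = THRESHOLD) -> str:
    """Return 'spam' only when the spam probability clears the threshold."""
    return "spam" if spam_probability >= threshold else "ham"

print(label_message(0.92))  # prints "spam": 0.92 clears 0.85
print(label_message(0.60))  # prints "ham": spam is more likely than ham, but below 0.85
```

The second call shows why raising the threshold trades recall for precision: a message the model leans toward calling spam is still released as ham.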


API Reference

POST /predict

Classifies a single email message and returns a confidence-scored result.

Parameters

| Name | Type | Required | Description |
|---|---|---|---|
| message | string | Yes | Raw email content to classify |

Response

```json
{
    "prediction": "spam",
    "confidence": 0.95,
    "spam_probability": 0.92,
    "ham_probability": 0.08
}
```

Response Fields

| Field | Type | Description |
|---|---|---|
| prediction | string | Classification result: spam or ham |
| confidence | float | Model confidence in the prediction (0.0 to 1.0) |
| spam_probability | float | Raw probability assigned to the spam class |
| ham_probability | float | Raw probability assigned to the ham class |
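A client can call the endpoint with any HTTP library and consume the documented response shape. In this sketch the host, port, and serving mechanism are assumptions (the source does not state how the REST-compatible endpoint is exposed); the parsing step uses the example response from above.

```python
import json
import urllib.request

# Hypothetical request; the URL is an assumption, not confirmed by the project.
payload = json.dumps({"message": "Congratulations! You won a free cruise."}).encode()
req = urllib.request.Request(
    "http://localhost:8501/predict",  # assumed host/port
    data=payload,
    headers={"Content-Type": "application/json"},
)
# response_body = urllib.request.urlopen(req).read()  # uncomment against a live server

# Parsing the documented response shape:
response_body = '{"prediction": "spam", "confidence": 0.95, "spam_probability": 0.92, "ham_probability": 0.08}'
result = json.loads(response_body)
if result["prediction"] == "spam" and result["confidence"] >= 0.85:
    print("Flagged as spam with high confidence")
```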

Tech Stack

| Layer | Technology |
|---|---|
| Backend | Python, Streamlit |
| Machine Learning | scikit-learn, imbalanced-learn (SMOTE) |
| Data Processing | pandas, numpy |
| Visualization | matplotlib, seaborn, wordcloud |
| Deployment | Streamlit Cloud |

Deployment

Streamlit Cloud

Push the repository to GitHub, connect it to Streamlit Cloud, and set any required environment variables in the Secrets panel. The service will handle the rest.

Self-Hosted

```shell
pip install streamlit
pip install -r requirement.txt
streamlit run app.py --server.port 8501 --server.headless true
```

For production deployments, place the application behind a reverse proxy such as Nginx and manage the process with a supervisor like systemd or supervisord.


Contributing

Contributions are welcome. Please follow the workflow below:

  1. Fork the repository.
  2. Create a feature branch: git checkout -b feature/your-feature-name
  3. Commit your changes with a descriptive message: git commit -m "Add: brief description of change"
  4. Push the branch: git push origin feature/your-feature-name
  5. Open a Pull Request against the main branch.

Please ensure new code includes appropriate tests and does not break existing functionality before submitting a PR.


License

This project is licensed under the MIT License. See LICENSE for the full text.


Contact

Mohd Shami
