AI-Powered Email Classification System
HamOrSpam is a production-grade spam detection system that combines TF-IDF vectorization with logistic regression to classify email messages with greater than 98% accuracy. The application ships with a Streamlit-based analytics interface, real-time inference, and model export capabilities.
- Overview
- Features
- Architecture
- Dataset
- Performance
- Installation
- Usage
- Configuration
- API Reference
- Deployment
- Contributing
- License
- Contact
Email spam detection remains a core challenge in messaging infrastructure. HamOrSpam addresses this with a lightweight yet high-performing classification pipeline trained on the Enron Spam Dataset. The system handles class imbalance via SMOTE oversampling and exposes both a web UI and a REST-compatible prediction endpoint.
- Real-time spam classification with calibrated probability scoring
- Interactive analytics dashboard displaying model performance metrics
- Word cloud visualizations highlighting key spam and ham indicators
- SMOTE-based training pipeline to handle imbalanced class distributions
- Exportable trained model artifacts for downstream integration
- Dark-mode Streamlit UI with responsive layout
Raw Email Text
|
v
TF-IDF Vectorization (max_features configurable, default: 3000)
|
v
SMOTE Class Balancing (applied during training only)
|
v
Logistic Regression (sklearn, threshold configurable)
|
v
Spam / Ham Prediction (label + confidence score)
The pipeline is intentionally lean. TF-IDF captures term importance relative to the corpus without requiring dense embeddings, and logistic regression provides well-calibrated probability outputs suitable for threshold tuning.
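The vectorize-then-classify core of this pipeline can be sketched as follows. This is a minimal illustration, not the project's actual training code: the four-message corpus is made up, the SMOTE step is left out to keep the example small, and `max_features=3000` simply mirrors the documented default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus for illustration only; the project trains on the Enron Spam Dataset.
texts = [
    "Win a free prize now, click here",
    "Cheap meds, limited offer, buy now",
    "Meeting moved to 3pm, see agenda attached",
    "Can you review the quarterly report today?",
]
labels = ["spam", "spam", "ham", "ham"]

# Turn raw text into sparse TF-IDF vectors (vocabulary capped at max_features).
vectorizer = TfidfVectorizer(max_features=3000)
X = vectorizer.fit_transform(texts)

# Fit a logistic regression classifier on the vectors.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

# predict_proba returns one column per class, ordered as in clf.classes_.
probs = clf.predict_proba(vectorizer.transform(["Claim your free prize now"]))[0]
scores = {str(label): float(p) for label, p in zip(clf.classes_, probs)}
print(scores)
```

The per-class probabilities in `scores` are the values a decision threshold would be applied to.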
| Property | Detail |
|---|---|
| Source | Enron Spam Dataset |
| Total Samples | 5,000+ |
| Class Balance | 70% Ham / 30% Spam (pre-SMOTE) |
| Label Format | Binary — spam / ham |
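To give a feel for how SMOTE evens out a 70/30 split like the one above, here is a simplified, pure-Python sketch of its core idea: each synthetic minority sample is an interpolation between a real minority sample and a nearby minority neighbor. It is an illustration under stated assumptions, not the project's code. The 2-D vectors are made up, and the function uses only the single nearest neighbor, whereas the SMOTE algorithm picks at random among several. A project like this one would more likely call `SMOTE().fit_resample(X, y)` from imbalanced-learn.

```python
import random

def smote_like_sample(minority, rng):
    """Create one synthetic point between a minority sample and its nearest minority neighbor."""
    base = rng.choice(minority)
    # Nearest other minority point by squared Euclidean distance.
    neighbor = min(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(base, neighbor)]

rng = random.Random(0)
spam_vectors = [[0.9, 0.1], [0.8, 0.3], [1.0, 0.2]]  # made-up minority-class features
ham_count = 7  # majority-class size for a 70/30 split of ten samples

# Add synthetic spam vectors until both classes have the same size.
balanced_spam = list(spam_vectors)
while len(balanced_spam) < ham_count:
    balanced_spam.append(smote_like_sample(spam_vectors, rng))

print(len(balanced_spam))
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.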
Evaluated on a held-out test split after SMOTE-balanced training.
| Metric | Score |
|---|---|
| Accuracy | 98.2% |
| Precision | 97.8% |
| Recall | 98.5% |
| F1 Score | 98.1% |
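As a quick consistency check, the F1 score in the table can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name below is illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the table above.
f1 = f1_from(0.978, 0.985)
print(f"{f1:.1%}")  # 98.1%
```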
- Python 3.8 or higher
- pip
# Clone the repository
git clone https://github.com/codewithshami/HamOrSpam-Classifier.git
cd HamOrSpam-Classifier
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirement.txt
streamlit run app.py

The application will be available at http://localhost:8501 by default.
Before classifying messages, refer to the onboarding screen for a walkthrough of available features and controls.
- Navigate to the Prediction tab in the sidebar.
- Paste or type the email content into the input area.
- Click Analyze Message.
- Review the classification label, confidence score, and probability distribution.
- Optionally inspect the keyword analysis panel for feature-level explanations.
The Analytics tab surfaces model performance metrics, confusion matrix visualizations, word frequency distributions, and dataset statistics without requiring any additional configuration.
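The quantities behind a confusion matrix and a word-frequency view can be computed with the standard library alone. The sketch below is illustrative only: the labels and messages are made up, and `confusion_counts` is a hypothetical helper, not a function from this repository.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives, treating `positive` as the target class."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts

# Made-up labels: one ham message is wrongly flagged as spam.
y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham"]
counts = confusion_counts(y_true, y_pred)
print(dict(counts))

# Word frequencies of the kind a word cloud is drawn from.
spam_messages = ["free prize now", "free offer"]
spam_words = Counter(word for msg in spam_messages for word in msg.lower().split())
print(spam_words.most_common(1))  # [('free', 2)]
```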
Create a .env file in the project root to override default settings:
DEBUG_MODE=False
THRESHOLD=0.85
MAX_FEATURES=3000

| Variable | Default | Description |
|---|---|---|
| DEBUG_MODE | False | Enables verbose logging when set to True |
| THRESHOLD | 0.85 | Minimum spam probability required for a spam label |
| MAX_FEATURES | 3000 | Number of features retained by the TF-IDF vectorizer |
Raising THRESHOLD reduces false positives at the cost of recall. Lowering it catches more spam but flags more legitimate mail. Adjust it to match the deployment's tolerance for each kind of error.
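The following sketch shows one way the three settings could be read and how THRESHOLD turns a spam probability into a label. It assumes the .env values have already been loaded into the process environment. `load_settings` and `label_for` are hypothetical names, not functions from this repository.

```python
import os

def load_settings(env=os.environ):
    """Read the three documented settings, falling back to the documented defaults."""
    return {
        "debug_mode": env.get("DEBUG_MODE", "False").strip().lower() == "true",
        "threshold": float(env.get("THRESHOLD", "0.85")),
        "max_features": int(env.get("MAX_FEATURES", "3000")),
    }

def label_for(spam_probability, threshold):
    """Label a message spam only when its spam probability reaches the threshold."""
    return "spam" if spam_probability >= threshold else "ham"

# A plain dict stands in for the environment so the example is reproducible.
settings = load_settings({"THRESHOLD": "0.85"})

print(label_for(0.92, settings["threshold"]))  # spam
print(label_for(0.80, settings["threshold"]))  # ham
print(label_for(0.80, 0.75))                   # spam (lower threshold, more sensitive)
```

The same message with spam probability 0.80 changes label when the threshold drops from 0.85 to 0.75, which is the trade-off described above.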
Classifies a single email message and returns a confidence-scored result.
Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| message | string | Yes | Raw email content to classify |
Response
{
  "prediction": "spam",
  "confidence": 0.92,
  "spam_probability": 0.92,
  "ham_probability": 0.08
}

Response Fields

| Field | Type | Description |
|---|---|---|
| prediction | string | Classification result: spam or ham |
| confidence | float | Model confidence in the prediction (0.0 to 1.0) |
| spam_probability | float | Raw probability assigned to the spam class |
| ham_probability | float | Raw probability assigned to the ham class |
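A client consuming this response may want to validate it before acting on it. The sketch below checks the documented fields and value ranges. The `parse_prediction` helper and the sample payload are hypothetical, and the check that the two class probabilities sum to 1 is an assumption based on the classifier being binary.

```python
import json
import math

def parse_prediction(payload):
    """Parse a prediction response and check it against the documented fields."""
    data = json.loads(payload)
    if data["prediction"] not in ("spam", "ham"):
        raise ValueError(f"unexpected prediction: {data['prediction']!r}")
    for key in ("confidence", "spam_probability", "ham_probability"):
        if not 0.0 <= float(data[key]) <= 1.0:
            raise ValueError(f"{key} out of range: {data[key]!r}")
    # Assumption: with two classes, the class probabilities sum to 1.
    total = data["spam_probability"] + data["ham_probability"]
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"class probabilities sum to {total}, expected 1.0")
    return data

# Made-up payload in the documented shape.
sample = (
    '{"prediction": "ham", "confidence": 0.97, '
    '"spam_probability": 0.03, "ham_probability": 0.97}'
)
result = parse_prediction(sample)
print(result["prediction"], result["confidence"])
```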
| Layer | Technology |
|---|---|
| Backend | Python, Streamlit |
| Machine Learning | scikit-learn, imbalanced-learn (SMOTE) |
| Data Processing | pandas, numpy |
| Visualization | matplotlib, seaborn, wordcloud |
| Deployment | Streamlit Cloud |
Push the repository to GitHub, connect it to Streamlit Cloud, and set any required environment variables in the Secrets panel. Streamlit Cloud then builds and serves the app.
pip install streamlit
pip install -r requirement.txt
streamlit run app.py --server.port 8501 --server.headless true

For production deployments, place the application behind a reverse proxy such as Nginx and manage the process with a supervisor like systemd or supervisord.
Contributions are welcome. Please follow the workflow below:
- Fork the repository.
- Create a feature branch:
  git checkout -b feature/your-feature-name
- Commit your changes with a descriptive message:
  git commit -m "Add: brief description of change"
- Push the branch:
  git push origin feature/your-feature-name
- Open a Pull Request against the main branch.
Please ensure new code includes appropriate tests and does not break existing functionality before submitting a PR.
This project is licensed under the MIT License. See LICENSE for the full text.
Mohd Shami
- Email: codexshami@gmail.com
- LinkedIn: linkedin.com/in/mohd-shami-792133276
- Repository: github.com/codewithshami/HamOrSpam-Classifier










