El Matador — AI-Powered News Credibility Analyzer


El Matador is an ML-powered news credibility analysis tool that helps users evaluate the trustworthiness of news articles. It combines a TF-IDF + scikit-learn classifier trained on 63,000+ labelled articles with a suite of rule-based linguistic pattern detectors to produce an interpretable credibility score.


✨ Features

| Feature | Description |
|---|---|
| Credibility Score (0–100) | Blends ML model confidence (70%) with pattern analysis (30%); see the sketch below |
| 4-class classification | REAL, FAKE, MISLEADING, UNVERIFIED |
| 9 pattern detectors | Sensational language, excessive caps, vague sources, conspiracy framing, emotional manipulation, one-sidedness, lack of evidence, extreme adjectives, clickbait |
| Emotional tone | 5-level tone classification, from Neutral to Highly emotional and manipulative |
| Suspicious claims | Up to 5 flagged sentences per article, with fact-check guidance |
| Streamlit UI | Interactive two-column layout with a full score breakdown |
| CLI-compatible | analyze() returns a JSON-serialisable dict for downstream integration |
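
A rough sense of how the blend could work, as a minimal sketch: the function name, and the assumption that pattern_score is inverted before blending, are ours, not the repository's actual code.

```python
def blend_credibility(model_confidence: float, pattern_score: float) -> int:
    """Illustrative 70/30 blend of the two signals described above.

    model_confidence: probability the ML model assigns to the REAL class (0.0-1.0).
    pattern_score: raw pattern suspicion score (0.0-1.0); higher means more
    suspicious, so it is inverted before blending (an assumption).
    """
    blended = 0.7 * model_confidence + 0.3 * (1.0 - pattern_score)
    return round(blended * 100)  # 0-100; higher = more credible

print(blend_credibility(0.92, 0.10))  # confident REAL, few red flags -> 91
```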

🏗️ Project Structure

```text
El_Matador/
├── src/                        # Refactored source packages
│   ├── analyzer/
│   │   └── credibility_analyzer.py  # Core orchestrator
│   ├── models/
│   │   └── model_loader.py          # Lazy singleton model loader
│   ├── patterns/
│   │   ├── pattern_detector.py      # 9-pattern linguistic detector
│   │   ├── emotional_analyzer.py    # Tone classifier
│   │   └── claim_highlighter.py     # Suspicious-claim extractor
│   └── utils/
│       └── text_utils.py            # Canonical text helpers
│
├── tests/                      # pytest test suite
│   ├── test_utils.py
│   ├── test_patterns.py
│   ├── test_claim_highlighter.py
│   └── test_analyzer.py
│
├── models/                     # Trained model artefacts (git-ignored)
│   ├── best_model.joblib
│   ├── tfidf_vectorizer.joblib
│   ├── metadata.txt
│   └── training_report.json
│
├── streamlit_app.py            # Streamlit UI entry point
├── train_model.py              # Model training script (with cross-validation)
├── pyproject.toml              # pytest config
├── requirements.txt
└── .gitignore
```

Note: The legacy flat-file modules (credibility_analyzer.py, pattern_detector.py, etc.) remain in the root for backward compatibility. New development should target src/.


🚀 Quick Start

Prerequisites

  • Python 3.9+
  • pip

1 — Clone the repository

```bash
git clone https://github.com/Mario5T/El_Matador.git
cd El_Matador
```

2 — Create and activate a virtual environment

```bash
python -m venv venv

# macOS / Linux
source venv/bin/activate

# Windows
venv\Scripts\activate
```

3 — Install dependencies

```bash
pip install -r requirements.txt
```

4 — Download the dataset (for training only)

Download the WELFake dataset from Kaggle:

https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification

Place the file at:

`dataset/WELFake_Dataset.csv`

⚠️ The CSV (~150 MB) is excluded from version control. You only need it to retrain the model.
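
To sanity-check the download before training, a quick look with pandas; the column names (title, text, label) are the WELFake layout we expect, so verify them against your copy.

```python
import pandas as pd

df = pd.read_csv("dataset/WELFake_Dataset.csv")
print(df.shape)                    # row/column counts for your copy
print(df.columns.tolist())         # expect 'title', 'text', 'label'
print(df["label"].value_counts())  # binary real/fake label balance
```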

5 — Train the model

```bash
python train_model.py
```

This will:

  • Train Logistic Regression and Passive Aggressive classifiers
  • Print full metrics (accuracy, precision, recall, F1, confusion matrix) for each
  • Run 5-fold cross-validation on the best model
  • Save artefacts to models/

Expected output (example):

```text
[4/6] Training & evaluating models …

  ▸ Passive Aggressive  (trained in 1.8s)
    Accuracy  : 0.9667
    Precision : 0.9668
    Recall    : 0.9667
    F1 Score  : 0.9667

[5/6] Cross-validating best model (Passive Aggressive, k=5) …
    accuracy  : 0.9654 ± 0.0021
    precision : 0.9655 ± 0.0022
    recall    : 0.9654 ± 0.0021
    f1        : 0.9654 ± 0.0021
```
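
For orientation, here is the pipeline behind this output, sketched minimally: the column names, hyperparameters, and model-selection logic below are illustrative assumptions, not the exact contents of train_model.py.

```python
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Combine title and body into one text field (column names assumed).
df = pd.read_csv("dataset/WELFake_Dataset.csv").dropna(subset=["text"])
texts = (df["title"].fillna("") + " " + df["text"]).tolist()
labels = df["label"].tolist()

vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
X = vectorizer.fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

# Train both candidates; keep whichever scores better on the held-out split.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Passive Aggressive": PassiveAggressiveClassifier(max_iter=1000),
}
for clf in candidates.values():
    clf.fit(X_train, y_train)
best = max(candidates.values(), key=lambda clf: clf.score(X_test, y_test))

# 5-fold cross-validation on the winner, then persist both artefacts.
scores = cross_val_score(best, X, labels, cv=5, scoring="f1_weighted")
print(f"f1: {scores.mean():.4f} ± {scores.std():.4f}")
joblib.dump(best, "models/best_model.joblib")
joblib.dump(vectorizer, "models/tfidf_vectorizer.joblib")
```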

6 — Run the Streamlit app

```bash
streamlit run streamlit_app.py
```

The app opens at http://localhost:8501.


📥 Input / Output

Input format

Paste the plain text of a news article (minimum 50 characters, maximum 50,000 characters).
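
The bound check this implies is trivial; a sketch, with the limits taken from this README and the helper name being hypothetical:

```python
MIN_CHARS, MAX_CHARS = 50, 50_000

def validate_article(text: str) -> str:
    """Reject input outside the 50-50,000 character window before analysis."""
    stripped = text.strip()
    if not MIN_CHARS <= len(stripped) <= MAX_CHARS:
        raise ValueError(
            f"Article must be {MIN_CHARS}-{MAX_CHARS} characters, got {len(stripped)}."
        )
    return stripped
```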

Example — credible article snippet:

```text
Scientists at Stanford University published a peer-reviewed study showing
a new vaccine candidate was 89% effective in phase 3 trials involving 30,000
participants. Dr. Jane Smith confirmed the results would be submitted to the FDA.
```

Example — suspicious article snippet:

```text
SHOCKING: Government scientists EXPOSED! Sources say the deep state is
covering up a massive false flag. Many believe this conspiracy, but the
mainstream media doesn't want you to know the truth. Wake up, people!
```

Output

The analyze() method returns a dictionary:

| Key | Type | Description |
|---|---|---|
| classification | str | REAL, FAKE, MISLEADING, or UNVERIFIED |
| credibility_score | int | 0–100; higher = more credible |
| risk_level | str | Low Risk (≥75), Medium Risk (40–74), High Risk (<40) |
| confidence | int | System confidence in its own assessment (0–100) |
| analysis_summary | str | 2–4 sentence summary |
| key_indicators | list[str] | Top linguistic red flags detected |
| emotional_tone | str | Dominant tone, from Neutral to Highly emotional and manipulative |
| suspicious_claims | list[str] | Up to 5 sentences flagged for fact-checking |
| recommended_action | str | Actionable user guidance |
| explanation | str | Detailed assessment explanation |
| pattern_score | float | Raw pattern suspicion score (0.0–1.0) |
| patterns | dict | All 9 pattern-detector outputs |
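
For shape only, an illustrative return value for the suspicious snippet above; every value here is invented for the example.

```python
result = {
    "classification": "FAKE",
    "credibility_score": 18,
    "risk_level": "High Risk",
    "confidence": 87,
    "analysis_summary": "Relies on sensational, emotionally charged language and cites no verifiable sources.",
    "key_indicators": ["Sensational language", "Excessive caps", "Vague sources"],
    "emotional_tone": "Highly emotional and manipulative",
    "suspicious_claims": ["Sources say the deep state is covering up a massive false flag."],
    "recommended_action": "Verify with established outlets before sharing.",
    "explanation": "Several high-weight suspicion patterns coincide with a confident FAKE prediction from the ML model.",
    "pattern_score": 0.82,
    "patterns": {"sensational_language": True, "excessive_caps": True},  # 7 more detectors omitted
}
```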

🖥️ CLI Usage

```python
from src.models import ModelLoader
from src.analyzer import CredibilityAnalyzer

article_text = "SHOCKING: Government scientists EXPOSED! ..."  # any article text

loader = ModelLoader()
model, vectorizer = loader.load()   # artefacts are cached after the first call

analyzer = CredibilityAnalyzer()
result = analyzer.analyze(article_text, model, vectorizer)

print(result["classification"])     # -> "FAKE"
print(result["credibility_score"])  # -> 18
```
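
The cached-after-first-call behaviour comes from the lazy loader; a minimal sketch of how such a loader can be built with functools.lru_cache, assuming the artefact paths under models/ (not necessarily the repository's actual implementation):

```python
from functools import lru_cache

import joblib

@lru_cache(maxsize=1)
def _load_artefacts():
    """Read both joblib artefacts from disk exactly once per process."""
    model = joblib.load("models/best_model.joblib")
    vectorizer = joblib.load("models/tfidf_vectorizer.joblib")
    return model, vectorizer

class ModelLoader:
    """Thin wrapper; repeated load() calls return the same cached pair."""

    def load(self):
        return _load_artefacts()
```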

🧪 Running Tests

```bash
pytest
```

Test suite covers:

  • test_utils.py — text preprocessing and sentence-level helpers
  • test_patterns.py — PatternDetector outputs and edge cases
  • test_claim_highlighter.py — suspicious-claim extraction
  • test_analyzer.py — full pipeline with a mocked ML model (see the sketch below)
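
As a flavour of the mocked-pipeline approach, a hedged sketch: the class names come from this repo, but the mock behaviour (e.g. the 0 == FAKE label encoding) is assumed for illustration.

```python
from unittest.mock import MagicMock

import numpy as np

from src.analyzer import CredibilityAnalyzer

def test_analyzer_scores_sensational_text():
    # Stub the ML side so the test exercises only the analyzer pipeline.
    model = MagicMock()
    model.predict.return_value = np.array([0])             # assumed: 0 == FAKE
    model.predict_proba.return_value = np.array([[0.9, 0.1]])
    vectorizer = MagicMock()

    text = "SHOCKING: sources say the truth is being hidden from you! " * 3
    result = CredibilityAnalyzer().analyze(text, model, vectorizer)

    assert result["classification"] in {"REAL", "FAKE", "MISLEADING", "UNVERIFIED"}
    assert 0 <= result["credibility_score"] <= 100
```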

📊 Model Performance

| Model | Accuracy | Precision | Recall | F1 (weighted) |
|---|---|---|---|---|
| Logistic Regression | ~94% | ~94% | ~94% | ~94% |
| Passive Aggressive ✅ | ~96.7% | ~96.7% | ~96.7% | ~96.7% |

Trained on WELFake dataset (~63,000 labelled articles, 80/20 split).
5-fold cross-validation confirms generalisation (F1 std ≈ 0.002).


🚀 Deployment

Streamlit Cloud (recommended)

  1. Push to GitHub (model artefacts excluded — see below).
  2. Connect repo to share.streamlit.io.
  3. Set entry point: streamlit_app.py.
  4. Make models/best_model.joblib and models/tfidf_vectorizer.joblib available at runtime, e.g. via a one-time download script in streamlit_app.py, with the artefact URL or credentials kept in Streamlit Secrets (see the model caching note below).

Cold-start optimisation: @st.cache_resource on load_model() ensures the model is loaded once per server process and shared across reruns and sessions, rather than reloaded on every script rerun.
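
A sketch of that pattern; @st.cache_resource is Streamlit's real caching API, while the load_model() body here is illustrative:

```python
import joblib
import streamlit as st

@st.cache_resource  # loaded once per server process, reused across reruns
def load_model():
    model = joblib.load("models/best_model.joblib")
    vectorizer = joblib.load("models/tfidf_vectorizer.joblib")
    return model, vectorizer

model, vectorizer = load_model()
```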

Model caching note

Because .joblib files are excluded from git (even though they total only ~2.5 MB), you have two options for getting them onto Streamlit Cloud:

  1. Fetch at startup: add a one-time download script to streamlit_app.py that pulls both artefacts from a hosted URL before load_model() runs (see the sketch below).
  2. Keep the location private: store the download URL or credentials in Streamlit Secrets and fetch from private storage in the same startup step.
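
A minimal sketch of the first option; the URLs are placeholders, and ensure_artefacts() is a hypothetical helper to call at the top of streamlit_app.py:

```python
from pathlib import Path
import urllib.request

# Placeholder URLs -- point these at your own release asset or storage bucket.
ARTEFACTS = {
    "models/best_model.joblib": "https://example.com/el_matador/best_model.joblib",
    "models/tfidf_vectorizer.joblib": "https://example.com/el_matador/tfidf_vectorizer.joblib",
}

def ensure_artefacts() -> None:
    """Download the model artefacts once; skip files that already exist."""
    for local, url in ARTEFACTS.items():
        path = Path(local)
        if not path.exists():
            path.parent.mkdir(parents=True, exist_ok=True)
            urllib.request.urlretrieve(url, path)
```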


📋 Commit Strategy

Suggested branch / PR breakdown:

| Branch | Purpose |
|---|---|
| feat/src-restructure | Move modules into src/ packages |
| feat/model-pipeline | Improved train_model.py with CV and JSON report |
| feat/analyzer-refactor | Singleton sub-components, constants, type hints |
| feat/model-loader | Lazy ModelLoader with lru_cache |
| feat/tests | Full pytest suite |
| chore/gitignore | Add .DS_Store, models/*.joblib, dataset/ |
| docs/readme | This README |

🔮 Limitations

  • No real-time fact-checking — analysis is purely linguistic/structural.
  • No knowledge injection — the model has no live internet access.
  • Domain drift — model trained on English news; performance may degrade on non-news text or other languages.
  • Satire blind spot — satirical articles may score as MISLEADING due to sensational language patterns.

📄 License

MIT — see LICENSE.
