This project involves curating and integrating three major fake news datasets—Fakeddit, Weibo, and FakeNewsNet—followed by extensive feature engineering and the development of a deep learning model. By combining textual, visual, and metadata features across multiple languages, the project aims to build a robust multilingual, multimodal system for fake news classification.
Report: https://docs.google.com/document/d/18P5IHFZBCZZa9uoN0dCsvX3OWVefIykAAFsvcLglQkI/edit?usp=drive_link
- Fakeddit: A dataset sourced from Reddit, consisting of both rumor and non-rumor posts. Each post contains metadata, text, and images.
- Weibo: A collection of posts from the Weibo platform, labeled as rumors or non-rumors. Each post contains metadata, text, and associated images.
- FakeNewsNet: A dataset containing news articles with labels indicating whether they are fake or real, along with relevant metadata and images.
- data/: This folder contains the raw data for the three datasets.
  - fakenewsnet/: Processed FakeNewsNet dataset.
  - weibo/: Processed Weibo dataset.
  - fakeddit/: Processed Fakeddit dataset.
  - image_dump/: A folder where all the images from different sources are stored.
- scripts/: Python scripts for processing the datasets, performing feature engineering, and saving the processed data.
  - data_processing: Notebooks for data preprocessing and joining of the datasets.
  - feature_engineering: Notebook for feature engineering.
  - model: Notebook for model training and evaluation.
- requirements.txt: Python dependencies needed for the project.
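To confirm that a checkout or a downloaded copy matches the layout described above, a small check like the following may help. It is a minimal sketch: the function name `missing_dirs` is hypothetical, and it assumes that the dataset folders and `image_dump/` sit directly under `data/`, which may differ in your copy.

```python
from pathlib import Path

# Directories named in the project layout. Nesting the dataset folders
# under data/ is an assumption; adjust the list if your checkout differs.
EXPECTED_DIRS = [
    "data/fakenewsnet",
    "data/weibo",
    "data/fakeddit",
    "data/image_dump",
    "scripts/data_processing",
    "scripts/feature_engineering",
    "scripts/model",
]


def missing_dirs(repo_root="."):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:")
        for d in missing:
            print(f"  {d}")
    else:
        print("All expected directories are present.")
```

Run it from the repository root. An empty result means every listed directory was found.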
Note: GitHub may not render Jupyter notebooks correctly if they were created using Google Colab, particularly due to compatibility issues with interactive widget metadata. However, all .ipynb files in this repository can be downloaded and opened locally in Jupyter Notebook or Colab with full outputs preserved. This includes all model outputs, visualizations, and metrics.
- Python 3.11 or higher
- Git
- Pip package manager
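The Python version requirement above can be checked programmatically before installing anything. This is only a sketch; the helper name `python_ok` is hypothetical and not part of the repository.

```python
import sys

# Minimum interpreter version listed in the prerequisites.
MIN_PYTHON = (3, 11)


def python_ok(version=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_ok():
        print(f"Python {current} meets the requirement.")
    else:
        print(f"Python {current} is older than {MIN_PYTHON[0]}.{MIN_PYTHON[1]}.")
```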
- Clone the repository:

      git clone https://github.com/yourusername/FakeNewsProject.git
      cd FakeNewsProject

- Create and activate a virtual environment (recommended):

      python -m venv venv
      # On Windows
      venv\Scripts\activate
      # On macOS/Linux
      source venv/bin/activate

- Install the required dependencies:

      pip install -r requirements.txt

- Configure environment variables:
  - Copy the .env_sample file to .env:

        cp .env_sample .env

  - Edit the .env file and add your Reddit API credentials (required for scraping Fakeddit data).
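For reference, a script or notebook could read the credentials from the `.env` file roughly as follows. This is a minimal sketch using only the standard library. The key names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` are placeholders; the actual names are whatever `.env_sample` defines, and the project's notebooks may load them differently.

```python
import os
from pathlib import Path


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(path=".env"):
    """Read a .env file and add its entries to os.environ without overwriting."""
    values = parse_env(Path(path).read_text(encoding="utf-8"))
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values


if __name__ == "__main__":
    if Path(".env").exists():
        load_env()
    # Hypothetical key names; check .env_sample for the real ones.
    for key in ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET"):
        print(key, "is set" if os.environ.get(key) else "is missing")
```

Using `setdefault` means values already exported in the shell take precedence over the file.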
The datasets used in this project are large. Two options are available:
- Download the processed datasets:
  - The processed dataset is available on Kaggle.
  - The feature-engineered dataframe is available on Google Drive.
- Process the datasets from scratch:
  - Run the data processing notebooks in the scripts/data_processing directory in the following order:
    - (FakeNewsNet) fakenewsnet_preprocessing.ipynb
    - (Weibo) weibo_preprocessing.ipynb
    - (Fakeddit) reddit_scraper.ipynb
    - (Fakeddit) article_image_scraper.ipynb
    - final_dataset_preperation.ipynb
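The notebook order above can also be written down as a small script that prints one `jupyter nbconvert` command per notebook. This is a sketch, not part of the repository: the helper `nbconvert_commands` is hypothetical, and it assumes the notebooks sit directly in `scripts/data_processing` and can run without interaction, which may not hold for notebooks written for Google Colab.

```python
from pathlib import Path

# Location and run order as listed in the steps above.
NOTEBOOK_DIR = Path("scripts/data_processing")
NOTEBOOK_ORDER = [
    "fakenewsnet_preprocessing.ipynb",
    "weibo_preprocessing.ipynb",
    "reddit_scraper.ipynb",
    "article_image_scraper.ipynb",
    "final_dataset_preperation.ipynb",
]


def nbconvert_commands(notebook_dir=NOTEBOOK_DIR):
    """Build one 'jupyter nbconvert --execute' command per notebook, in order."""
    return [
        [
            "jupyter", "nbconvert",
            "--to", "notebook",
            "--execute", "--inplace",
            (Path(notebook_dir) / name).as_posix(),
        ]
        for name in NOTEBOOK_ORDER
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for command in nbconvert_commands():
        print(" ".join(command))
```

Printing rather than executing keeps the script safe to run anywhere. To execute the notebooks, pass each command list to `subprocess.run(command, check=True)` so that a failure stops the sequence before later notebooks run on incomplete data.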
After setting up the datasets, run the feature engineering notebook:
jupyter notebook scripts/feature_engineering/feature_engineering.ipynb
Run the model training and evaluation notebook:
jupyter notebook scripts/model/model_train_eval.ipynb
Below are the evaluation metrics of our multimodal model on the test set. For more details refer to model_train_eval.ipynb.
The datasets used in this project are from the following repositories:



