Fake News - Cleaning Online Media with Machine Learning

João Pedro Eira

Content

Project Description
Questions
Dataset
Cleaning
Analysis
Conclusion
Future Work
Links

Project Description

This project aims to develop a Machine Learning algorithm to detect fake news in the English language. Several classifiers are testes together with bag of words processing of the text.

Questions

Given a news defined bu it´s title and text determine if the news is a real news article or fake news.

Fake news is one of the great plagues of the modern fully connected age. They are now back on the spotlight in the context of the COVID-19 crisis, as rumours spread even faster than the virus. To be able to effectively identify and eliminate fake news is paramount to any democracy.

To approach this issue using Machine Learning a Bag of Words approach is used, and the text is first vectorized. Then several classifiers are tested and tuned to find the optimal model.

Dataset

The original data to train/test the models is obtained from two Kaggle datasets. You can find them here and here. Both datasets include the article text and title.

Additionaly, a small more recent data set of COVID 19 news was used to test the model performance with news from a different context and timeline of the original data. This data is included in the file 'corona_fake.csv' included in this repository.

Cleaning

The data sets required some cleaning for missing entries. The process used is described in detail in the notebooks. Also, since the size was very unbalanced (150K entries versus 12K), the "true news" data set was sampled so that a balanced training set was used for the model.

Analysis

The first step in the project is process the text - cleaning punctuation, tokenizing, stemming, lemmatization and stop word removing followed by vectorization and tf-idf transforming.

This was done with a combination of nltk and sklearn library implementations.

After that, six classifiers are tested, hypertuned and cross-validated - Naive-Bayes, Logistic Regression, SVM with Linear Kernel, Decision Tree, Random Forest and Gradient Boosting. Then they are evaulted on the test set - using Accuracy, Precision, Recall and F1-score metrics.

For each classifier 3 feature configurations were tested - only title of the article, only the text (full article) or both text and title used as features.

All models were tested using sklearn implementations of the models.

Results show that:

Using the full text of the article leads to better (6-8% depending on the model) results than using only the title
Using both text and title at the same time does not show sensible improvement, and, depending on the model, may even show worse results than using only the text
The best model overall is Logistic Regression using the text as a feature and a regularization parameter C=10, which produces an accuracy of 93.11% and an F1-Score of 0.93.
Tested on a different dataset (COVID-19 news) the model still shows an high accuracy - 85%.

Conclusion

Results show a very high accuracy using a very simple model - Logistic Regression. The model is relatively computer-processing inexpensive.
The more rich text article produces significantly better results, meaning the model can infer more "meaning" from the full text than when using just the title.
The model reveals robustness to a different data set.

Future Work

Test the model with a bigger and more rich recent dataset, filled with different news cycles. Analyse results and retrain acoordingly.
Test new approaches - deep learning, LTSM,...
Extend to Fake "audio news" and "video" - Get text from audio and video and use the models to infer fake/true news.
Integrate with online platforms for automatic detection and elimination of fake contents.

Links

Repository
Slides

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
Notebooks		Notebooks
Presentation		Presentation
Scraped Datasets		Scraped Datasets
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fake News - Cleaning Online Media with Machine Learning

Content

Project Description

Questions

Dataset

Cleaning

Analysis

Conclusion

Future Work

Links

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Fake News - Cleaning Online Media with Machine Learning

Content

Project Description

Questions

Dataset

Cleaning

Analysis

Conclusion

Future Work

Links

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages