sanieldalib/Amazon-Review-Classifier

Classifying Fake Amazon Reviews

Introduction

With the rise of fake news and misinformation flooding the internet, it is often extremely difficult to determine what is true and what is not. Amazon, as one of the largest e-commerce sites in the world, is certainly not safe from this misinformation. A common threat plaguing Amazon is the issue of fake reviews. I did not realize how widespread an issue this was until I was shopping for a new laptop charger a few weeks back, only to discover that most of the reviews for the chargers I was looking at were not even related to the product page I was on. In this data project, I will explore the nature of these fake reviews, build several models to predict whether a review is fake, and finally build an interactive application where you can check the reviews of any Amazon product against my model.

Dataset

The hardest part of this project was certainly finding a good dataset to work with. Most importantly, I was looking for a dataset that had reviews already classified as real or fake, in addition to representing the breadth of products found on Amazon. I spent some time searching various data platforms to no avail, so I turned to GitHub to see if anyone had worked with a similar enough dataset. Thankfully, I found a dataset that was perfect for what I needed here: https://github.com/aayush210789/Deception-Detection-on-Amazon-reviews-dataset. This dataset provides a decent number of features to work with: the review title, review text, product title, rating, whether the purchase was verified, and of course, which class the review belongs to, either real or fake.

Understanding the data

Going into this, I had some hypotheses about what I expected to see in fake reviews. I hypothesized that fake reviews would have much higher average ratings, as their purpose is to con people into buying a product. I also expected them to be on the shorter end, as it's unlikely that the people writing these reviews would take the time to go into detail. I also expected fake reviews to have nowhere near the number of verified purchases of real reviews. Finally, I expected certain product categories, such as Electronics, to be far more prone to fake reviews, as some categories offer a far wider range of options than others. To get a feel for this, I made some basic graphs to check these hypotheses.

[Figures: Average Review Length; Distribution of Ratings; Count of Reviews by Product Category; Verified Purchases vs. Review Type]
The first plot makes it quite clear that fake reviews are far shorter on average than real reviews; on average, real reviews are 102 characters longer than their fake counterparts. The second plot shows the count of reviews grouped by rating and split by whether the review is real or not (0 is real, 1 is fake). I was surprised when I first saw this, but quickly realized the plot looks this way because the dataset is a perfect split of real and fake reviews, balanced within each rating bucket as well. The third plot shows the number of reviews in each category, again broken up by whether the reviews are real or fake. This plot is also misleading, as the dataset appears to have been assembled to ensure equal representation of all categories. The final plot shows quite clearly that real reviews have far more verified purchases than fake reviews do. This makes perfect sense: fake reviewers will likely never go through the hassle of purchasing the product, as there are clear financial disadvantages to doing so. Even if the product is their own, they would have to pay Amazon a cut of sales for each product they sell to themselves.
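The length comparison above is simple to reproduce. The project itself was done in R; below is a hedged sketch of the same computation in Python with pandas, using a tiny made-up sample rather than the real dataset.

```python
# Compare average review length between real and fake reviews.
# The data here is illustrative, not from the project's dataset.
import pandas as pd

df = pd.DataFrame({
    "label": ["real", "real", "fake", "fake"],
    "review_text": [
        "This charger has survived two years of daily travel and still works.",
        "Battery life is excellent and it charges my laptop quickly every time.",
        "Great product!",
        "Love it, five stars!",
    ],
})

# character length of each review, then the mean per class
df["length"] = df["review_text"].str.len()
avg_length = df.groupby("label")["length"].mean()
print(avg_length)
```

On the real dataset, the same groupby produces the 102-character gap described above.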

Data Preparation

Going into this project, I wanted to gain more skill with some of the tools we used this semester, specifically text mining. The majority of my analysis focuses on using the text of each review to see whether there are fundamental differences between the language used in real and fake reviews, and whether I could leverage that to build a strong classifier. I began by turning the review texts into a corpus, removing stopwords, punctuation, and numbers, stemming the words, and making everything lowercase. Once I was done, I still had a ridiculous number of words (around 30,000), so I removed the sparse terms from the matrix, leaving me with about 4,000 words to work with. (I actually opted for a much lower number at first, but revisited this and raised the number of words in my document-term matrix.) I was then curious to see whether there was anything I could observe just by looking at the wordclouds of real and fake reviews, shown below. After making these, I noticed that 'the' was still included, despite my triple-checking that I was removing stopwords correctly.
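The original pipeline used R's tm package. As a rough equivalent, here is a sketch in Python with scikit-learn's CountVectorizer, which handles lowercasing, punctuation/number removal (via the token pattern), stopword removal, and sparse-term pruning (via min_df) in one step; stemming is omitted here for brevity. The sample reviews are illustrative.

```python
# Build a document-term matrix from raw review text.
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "Great product, I really LOVE it!!!",
    "The charger broke after 2 days, would not recommend.",
    "Love love love this, great great great.",
]

vectorizer = CountVectorizer(
    lowercase=True,              # make everything lowercase first
    stop_words="english",        # drop stopwords like 'the'
    token_pattern=r"[a-zA-Z]+",  # letters only: strips punctuation and numbers
    min_df=1,                    # raise on real data to drop sparse terms
)
dtm = vectorizer.fit_transform(reviews)  # the document-term matrix
print(sorted(vectorizer.vocabulary_))
```

Because lowercasing happens before stopword removal here, capitalized tokens like "The" are caught, which is one common cause of the stray 'the' described above.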

REAL REVIEWS WORDCLOUD (TOP 100 WORDS) FAKE REVIEWS WORDCLOUD (TOP 100 WORDS)
The words in the real-reviews wordcloud are similar to those in the fake one, but the frequencies of usage are far different, which leads me to believe that real reviews are far more passionate and descriptive. The fake reviews, at first glance, tend to use generic words that speak broadly about how good a product is; notice how large recommend, great, realli, love, and like are.
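The "top 100 words" behind each wordcloud are just per-class token frequencies. A minimal sketch of that counting step, with made-up review text standing in for the real corpus:

```python
# Count token frequencies per class and keep the most common words.
from collections import Counter

fake_reviews = [
    "great product love it",
    "really great would recommend",
    "love love it great",
]
real_reviews = [
    "charger fits my laptop model and cable feels sturdy",
    "arrived quickly and packaging protected the screen well",
]

def top_words(texts, n=5):
    """Return the n most frequent whitespace tokens across texts."""
    counts = Counter(word for text in texts for word in text.split())
    return [word for word, _ in counts.most_common(n)]

print(top_words(fake_reviews))
print(top_words(real_reviews))
```

Feeding these frequency lists to a wordcloud library (the project used one in R) sizes each word by its count, which is why generic quality words dominate the fake-review cloud.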

Feature Selection and Model Building

At this point, I felt I understood the data well enough to begin thinking about which features I wanted to use and which models I wanted to run. As mentioned earlier, I wanted to focus specifically on text mining alongside a number of other features, and this remained one of my guiding principles throughout the process. As expected, I removed any features that would not be reproducible for new reviews, such as the product and review IDs. Additionally, I opted not to use the product title for each review, as I could not think of a simple way to marry the product titles with the review text and ensure that relationship would be represented in my models.

Logistic Regression

Since this is a classification problem, logistic regression seemed like a great starting point. Given the huge number of features at this point (over 4,000), I first used a regularization technique known as LASSO to select the features for my logistic regression. After running LASSO and building the logistic regression, I found that it actually performed quite well, with about 80% accuracy on test data. However, upon looking at the summary information for the model, I quickly realized that most of this model's power came from using the verified-purchase feature of each review. In keeping with my goal of learning more about text mining, I decided to explore models better suited to picking up features based on the text of the reviews.
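The original LASSO-then-logistic-regression step was done in R (likely glmnet, though the write-up does not say). As an illustrative sketch, scikit-learn combines both ideas in one model: an L1 (LASSO-style) penalty on logistic regression drives most coefficients to exactly zero, which is the feature selection described above. Synthetic data stands in for the review features.

```python
# L1-penalized logistic regression: fit and count surviving features.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# toy stand-in for the ~4,000-column review feature matrix
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# smaller C means stronger regularization, hence fewer nonzero coefficients
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_train, y_train)

n_selected = (clf.coef_ != 0).sum()
print(n_selected, "features kept; test accuracy:", clf.score(X_test, y_test))
```

Inspecting which coefficients survive is also how one would notice a single dominant feature like verified purchase.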

Decision Trees

Decision trees seemed like a good next move, as they would be much better at using the text-based features to classify the reviews. I ultimately decided to run a Random Forest model, as I believed a single decision tree would be unable to capture the underlying relationship without massive overfitting. After tuning the model through methods learned in my other course, I arrived at a classifier with accuracy similar to that of the logistic regression, around 80%, but I believed this model was capturing more of the underlying relationship in the review text.
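For reference, here is a hedged sketch of a random forest in scikit-learn (the project used an R implementation, and the actual tuned parameters are not given in the write-up). Each tree sees a bootstrap sample and a random subset of features, which is what curbs the single-tree overfitting mentioned above.

```python
# Random forest on stand-in data; parameters are illustrative, not tuned.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# max_features="sqrt" gives each split a random feature subset,
# decorrelating the trees in the ensemble
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=1)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```

The `feature_importances_` attribute of the fitted forest is a quick way to check how much weight the text features carry relative to verified purchase and rating.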

Neural Networks

I was most excited to try a neural network on this data, as it seemed like a great candidate. I came up with a few different structures, mostly through trial and error, but was quickly discouraged as they performed worse than my previous models. After covering neural networks more closely in another one of my classes, I realized a huge problem: the networks were heavily overfitting, essentially memorizing the training data. After tuning the model against this, by adding a regularization (penalty) term and tweaking the learning rate, I arrived at my best model yet: a neural network with close to 82.5% accuracy. While this was not a huge jump over the previous models, what really excited me was that the neural network relied heavily on the text-based features. I tested this by running a few neural networks without any of the obvious features like verified purchase or rating and still got around 70% accuracy. For this reason, in addition to its superior accuracy, I decided the neural network was the best model for the job.
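The project's network was built in Keras from R; as a lightweight stand-in, scikit-learn's MLPClassifier exposes the same two anti-overfitting knobs described above: `alpha` is an L2 penalty term, and `learning_rate_init` is the step size. Architecture and data here are illustrative only.

```python
# Small regularized feed-forward network on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=40,
                           n_informative=10, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# alpha penalizes large weights, discouraging memorization of training data;
# a smaller learning rate makes optimization more stable
net = MLPClassifier(hidden_layer_sizes=(32, 16), alpha=1e-2,
                    learning_rate_init=1e-3, max_iter=1000, random_state=2)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```

The "drop the obvious features" experiment above amounts to refitting the same network on X with the verified-purchase and rating columns removed and comparing test accuracy.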

Other Models

In the process, I ran a few other models, inspired by what I had seen in my other classes. I decided to try boosting, another tree-based ensemble method, which works by fitting each new tree to correct the errors of the previous ones. I was specifically excited by the idea that boosted models excel at combining many weak learners into a single strong one, which seemed perfect for this application. However, there was very little improvement gained from these models. That may be because I am not very familiar with them, and because they take a very long time to run, which dissuaded me from trying many different tweaks. Finally, I tried to strengthen my earlier models by giving them the title text from each review, but little to no improvement was observed.
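The write-up does not name the boosting library used; as a generic sketch, here is gradient boosting in scikit-learn on stand-in data. Each shallow tree is fit to the errors of the ensemble so far, which is the sequential error-correction described above.

```python
# Gradient boosting: shallow trees added sequentially, each correcting
# the ensemble's remaining errors. Parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=8, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# max_depth=3 keeps each tree a "weak learner"; learning_rate shrinks
# each tree's contribution so the ensemble improves gradually
booster = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                     max_depth=3, random_state=3)
booster.fit(X_train, y_train)
print("test accuracy:", booster.score(X_test, y_test))
```

The long runtimes mentioned above come from this sequential structure: unlike a random forest, the trees cannot be grown in parallel.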

Building a Useful Tool

Now that I had built a fairly strong classifier, I decided to put my neural network to good use by classifying live data. To accomplish this, I built a fairly basic R Shiny app that lets anyone enter an Amazon product number and get a bit of insight into how legitimate the product's reviews may be. I first figured out a systematic way to scrape Amazon reviews and the features I needed in R (this article was pivotal in figuring that out: https://justrthings.com/2019/03/03/web-scraping-amazon-reviews-march-2019/). One of the largest challenges I faced was making the app work for products whose categories were not in my original dataset. To resolve this, I scraped fairly broad product categories and matched them with the 40 or so in my training dataset using fuzzy string matching in the form of Levenshtein distances. While this is certainly a decent fix, I imagine that with more data, the app would better learn the relationships present in specific classes of products. Once I had resolved this issue, I exported my Keras-based neural network so I could use it in the Shiny app to classify the scraped reviews. The tool works by scraping the 100 most recent reviews of whichever product is entered, analyzing and classifying them, and presenting the output to the user: the proportion of scraped reviews deemed fake, along with a number of visualizations.
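The fuzzy category-matching step is the most self-contained piece of the pipeline. Here is a sketch of it in Python (the app does this in R): pick the training category with the smallest Levenshtein (edit) distance to a scraped category. Category names are made up for illustration.

```python
# Match a scraped category string to the closest known training category
# by Levenshtein (edit) distance.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# hypothetical stand-ins for the ~40 categories in the training data
training_categories = ["Electronics", "Home and Kitchen", "Toys", "Pet Supplies"]

def match_category(scraped: str) -> str:
    """Return the training category nearest to the scraped one."""
    return min(training_categories,
               key=lambda c: levenshtein(scraped.lower(), c.lower()))

print(match_category("Electronic"))  # nearest match is "Electronics"
```

Distance-based matching like this degrades gracefully: even a category never seen in training gets mapped to its most plausible neighbor rather than causing an error.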

Ideally, this tool would be live on the web for you to use, but I ran into a lot of trouble getting Keras, which is Python-based, to work on an R Shiny server. I have attached all the files needed to run the app locally in RStudio if you would like to try it out. Below are some screenshots of the app. R Shiny is an unintuitive library to work with at first, so it was quite a struggle to get it this far. However, with the knowledge I gained throughout the process, I will likely revisit it later and make further improvements.

Conclusion and Takeaways

In conclusion, fake Amazon reviews pose a risk to customers and sellers alike. I really hope Amazon begins to crack down on these reviews before they become too large of a problem. While Amazon certainly has far more domain expertise and resources to build a better system, I believe my analysis highlights some important considerations for implementing one. Amazon should definitely consider the actual content of reviews, in addition to restricting the number of reviews that can be posted without a verified purchase. Additionally, Amazon has far more granular data; ideally, it should fine-tune its models to look for relationships within specific product categories and subcategories, which I am sure play a role but did not have enough data to explore. This was a challenging yet exciting project that taught me a ton about text mining and the models that perform well in that domain, ultimately allowing me to meet my goal of learning more about this topic!

About

This is an R Shiny app that classifies Amazon reviews as real or fake simply by pasting the product URL.
