This directory contains individual R scripts for each machine learning model used to classify fake Amazon reviews.
Make sure you have installed all required R packages:
    install.packages(c("tm", "magrittr", "SnowballC", "Matrix", "glmnet", "pROC", "caret",
                       "randomForest", "ranger", "xgboost", "keras3", "ade4",
                       "dplyr", "ggplot2", "gridExtra"))

For keras3 (required for neural network models 7-8):
Step 1: Install keras3 package

    install.packages("keras3")

If CRAN installation fails, try installing from GitHub:

    install.packages("remotes")
    remotes::install_github("rstudio/keras3")

Step 2: TensorFlow backend installation
After installing keras3, TensorFlow will be automatically installed when you first run:

    library(keras3)
    use_backend("tensorflow")

This happens automatically when you run the neural network model scripts.
Note:
- You do not need to call install_keras() for this setup
- TensorFlow installation is handled automatically when you first use use_backend("tensorflow")
- If you encounter issues, see INSTALL_NEURAL_NETWORK.md for detailed troubleshooting steps
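
Before running any script, it can help to confirm that the packages listed above are actually installed. The following is a minimal sketch in base R; the helper name find_missing_packages is hypothetical and not part of this repository.

```r
# Package names copied from the install.packages() call above
required_packages <- c(
  "tm", "magrittr", "SnowballC", "Matrix", "glmnet", "pROC", "caret",
  "randomForest", "ranger", "xgboost", "keras3", "ade4",
  "dplyr", "ggplot2", "gridExtra"
)

# Hypothetical helper: returns the packages that are not installed.
# system.file() returns "" when a package cannot be found, and it does
# not load the package, so the check stays lightweight.
find_missing_packages <- function(pkgs) {
  installed <- vapply(
    pkgs,
    function(pkg) nzchar(system.file(package = pkg)),
    logical(1)
  )
  pkgs[!installed]
}

missing_packages <- find_missing_packages(required_packages)

if (length(missing_packages) > 0) {
  message("Missing packages: ", paste(missing_packages, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Any names it reports can be passed straight to install.packages().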
All scripts expect data_new.csv to be in the same directory. This file contains the preprocessed Amazon reviews data.
- Model: Logistic Regression with Lasso Regularization (lambda.min)
- Features: Selected features from Lasso regularization
- Output: Accuracy, Confusion Matrix, Sensitivity, Specificity, Precision, F1 Score, AUC
- Model: Logistic Regression with Lasso Regularization (lambda.1se)
- Features: More conservative feature selection
- Output: Accuracy, Confusion Matrix, Sensitivity, Specificity, Precision, F1 Score, AUC
- Model: Random Forest
- Parameters: ntree=100, nodesize=5, mtry=60
- Output: Accuracy, Confusion Matrix, Sensitivity, Specificity, Precision, F1 Score, AUC, Variable Importance
- Model: Ranger (Fast Random Forest)
- Parameters: num.trees=200
- Output: Accuracy, Confusion Matrix, Sensitivity, Specificity, Precision, F1 Score, AUC, Variable Importance
- Model: XGBoost (Gradient Boosting)
- Parameters: eta=0.05, nrounds=342
- Output: Accuracy, Confusion Matrix, Sensitivity, Specificity, Precision, F1 Score, AUC, Feature Importance
- Model: Simple Logistic Regression
- Features: Only VERIFIED_PURCHASE
- Output: Accuracy, Confusion Matrix, Sensitivity, Specificity, Precision, F1 Score, AUC, Model Coefficients
- Model: Neural Network (Keras3)
- Features: Review text features only
- Architecture: 2 hidden layers (32-32 units), dropout=0.3
- Output: Test Loss, Test Accuracy, Confusion Matrix, Sensitivity, Specificity, Precision, F1 Score, AUC
- Saves: Model to fakereview.keras
- Model: Neural Network (Keras3)
- Features: Review text + Review title features
- Architecture: 3 hidden layers (32-32-16 units), dropout=0.3
- Output: Test Loss, Test Accuracy, Confusion Matrix, Sensitivity, Specificity, Precision, F1 Score, AUC
- Saves: Model to fakereview_titles.keras
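
The two Lasso variants listed above differ only in which cross-validated lambda they use. The sketch below illustrates that difference with glmnet on synthetic data; it is not the code from the model scripts. The matrix x, the response y, and the helper count_selected are all made up for illustration.

```r
library(glmnet)

set.seed(245)  # the seed this README mentions, reused here for illustration

# Synthetic stand-in for a document-term matrix: 200 reviews x 50 term counts
n <- 200
p <- 50
x <- matrix(rpois(n * p, lambda = 1), nrow = n, ncol = p)
colnames(x) <- paste0("term_", seq_len(p))

# In this toy setup only the first three terms carry signal
linear_part <- 0.8 * x[, 1] - 0.8 * x[, 2] + 0.5 * x[, 3]
y <- rbinom(n, size = 1, prob = plogis(linear_part))

# alpha = 1 selects the Lasso penalty; lambda is chosen by cross-validation
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Hypothetical helper: number of non-zero coefficients at a given lambda
count_selected <- function(fit, s) {
  coefs <- as.matrix(coef(fit, s = s))
  sum(coefs[-1, 1] != 0)  # drop the intercept row
}

n_min <- count_selected(cv_fit, "lambda.min")
n_1se <- count_selected(cv_fit, "lambda.1se")

cat("lambda.min:", cv_fit$lambda.min, "selects", n_min, "features\n")
cat("lambda.1se:", cv_fit$lambda.1se, "selects", n_1se, "features\n")
```

lambda.1se is the largest lambda whose cross-validated error is within one standard error of the minimum, so it is never smaller than lambda.min and typically keeps fewer features. That is why the second model is described as more conservative.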
To run any model, simply execute the R script:

    source('model_01_glm_lasso_min.R')

Or from the command line:

    Rscript model_01_glm_lasso_min.R

All models use data_preprocessing.R, which:
- Loads the data from data_new.csv
- Preprocesses text data (removes stopwords, punctuation, and numbers; stems words)
- Creates document-term matrix
- Splits data into training (75%) and test (25%) sets
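
As a rough illustration of the steps above, here is a minimal sketch using tm and SnowballC on a few made-up reviews. It is not the contents of data_preprocessing.R. The column names REVIEW_TEXT and LABEL are assumptions, the toy data frame stands in for data_new.csv, and the lowercasing step is added here so that stopword matching works.

```r
library(tm)
library(SnowballC)  # provides the stemmer used by stemDocument()

# Made-up stand-in for data_new.csv; column names are assumptions
reviews <- data.frame(
  REVIEW_TEXT = c(
    "The batteries lasted 3 weeks, great value!",
    "Terrible product. The screen broke after 2 days.",
    "Best purchase ever, I love it so much!!!",
    "Works as described and arrived on time.",
    "Amazing amazing amazing, buy it now!",
    "The charger stopped working within a month.",
    "Good quality for the price, would recommend.",
    "Not worth the money, returned it."
  ),
  LABEL = c(0, 0, 1, 0, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Build a corpus and apply the cleaning steps
corpus <- VCorpus(VectorSource(reviews$REVIEW_TEXT))
corpus <- tm_map(corpus, content_transformer(tolower))  # assumption: not listed above
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)

# Document-term matrix: one row per review, one column per stemmed term
dtm <- DocumentTermMatrix(corpus)
x <- as.matrix(dtm)
y <- reviews$LABEL

# 75/25 split with the seed this README mentions
set.seed(245)
n <- nrow(x)
train_idx <- sample(n, size = floor(0.75 * n))

x_train <- x[train_idx, , drop = FALSE]
y_train <- y[train_idx]
x_test <- x[-train_idx, , drop = FALSE]
y_test <- y[-train_idx]

cat("Training rows:", nrow(x_train), " Test rows:", nrow(x_test), "\n")
```

Fixing the seed before sample() is what makes the split identical across model scripts, so every model is evaluated on the same test rows.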
Each model outputs:
- Accuracy: Overall classification accuracy
- Confusion Matrix: True/False Positives and Negatives
- Sensitivity (Recall): True Positive Rate
- Specificity: True Negative Rate
- Precision: Positive Predictive Value
- F1 Score: Harmonic mean of Precision and Recall
- AUC: Area Under the ROC Curve
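
To make these definitions concrete, the sketch below computes each metric in base R from a vector of true labels and predicted probabilities. The model scripts may well use caret and pROC instead. The function name classification_metrics, the 0.5 threshold, and the convention that 1 is the positive class are assumptions for illustration.

```r
# Hypothetical helper: metrics from true labels (0/1) and predicted probabilities.
# Assumes 1 is the positive class and a 0.5 decision threshold.
classification_metrics <- function(actual, prob, threshold = 0.5) {
  predicted <- as.integer(prob >= threshold)

  tp <- sum(predicted == 1 & actual == 1)  # true positives
  tn <- sum(predicted == 0 & actual == 0)  # true negatives
  fp <- sum(predicted == 1 & actual == 0)  # false positives
  fn <- sum(predicted == 0 & actual == 1)  # false negatives

  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  precision <- tp / (tp + fp)
  f1 <- 2 * precision * sensitivity / (precision + sensitivity)

  # AUC via the rank-sum (Mann-Whitney) formulation: the probability that a
  # randomly chosen positive is scored above a randomly chosen negative
  ranks <- rank(prob)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  auc <- (sum(ranks[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

  list(
    accuracy = (tp + tn) / length(actual),
    sensitivity = sensitivity,
    specificity = specificity,
    precision = precision,
    f1 = f1,
    auc = auc
  )
}

# Toy example with made-up labels and scores
actual <- c(1, 1, 1, 0, 0, 0, 1, 0)
prob <- c(0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2)

metrics <- classification_metrics(actual, prob)
print(metrics)
```

In this toy example there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so accuracy, sensitivity, specificity, precision, and F1 all equal 0.75, while the AUC is 0.9375.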
All models automatically generate and save visualizations to the figures/ directory:
- ROC Curves: Receiver Operating Characteristic curves for all models
- Confusion Matrix Heatmaps: Visual representation of classification results
- Model-Specific Visualizations:
  - Lasso models (models 1-2): Cross-validation plots showing lambda selection
  - Random Forest (model 3): Error rate plot and variable importance
  - Ranger (model 4): Variable importance plot
  - XGBoost (model 5): Feature importance plot
  - Simple GLM (model 6): Prediction probability distribution
  - Neural Networks (models 7-8): Training history plots (loss and accuracy) and prediction probability distributions
All figures are saved as high-resolution PNG files (300 DPI) with descriptive filenames following the pattern model_XX_description.png.
- All models use the same random seed (245) for reproducibility
- Training/test split is 75/25
- Neural network models may take several minutes to train
- The neural network models (7-8) save their trained models to disk for later use (fakereview.keras and fakereview_titles.keras)
- All visualizations are automatically saved to the figures/ directory when you run each model script
- The figures/ directory will be created automatically if it doesn't exist
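
The directory handling and 300 DPI export described above could look roughly like the following base R sketch. The model scripts may use ggplot2 instead. The filename model_00_example_roc.png and the plotted points are made up for illustration.

```r
# Create the output directory only if it is missing
fig_dir <- "figures"
if (!dir.exists(fig_dir)) {
  dir.create(fig_dir, recursive = TRUE)
}

# Hypothetical filename following the model_XX_description.png pattern
fig_path <- file.path(fig_dir, "model_00_example_roc.png")

# Open a PNG device at 300 DPI; width and height are given in inches
png(fig_path, width = 6, height = 5, units = "in", res = 300)

# Made-up ROC-style curve, used only to have something to draw
plot(
  c(0, 0.25, 1), c(0, 0.75, 1),
  type = "l",
  xlab = "False Positive Rate",
  ylab = "True Positive Rate",
  main = "Example ROC curve (illustrative data)"
)
abline(0, 1, lty = 2)  # chance line for reference

# Close the device so the file is written to disk
invisible(dev.off())

cat("Saved figure to", fig_path, "\n")
```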