Machine Learning project for RUG-ml-2018
The complete cresci-2017 data set can be downloaded from the Bot Repository. A filtered version used in the NLP approach has been uploaded to the repository.
The Stanford GloVe embeddings can be downloaded from here.
The scripts will look for data files in specific directories.
- The English-filtered tweets, which are used for the NLP approaches (both the models and the embeddings' training), should be placed in the path "data/preprocessedTweets/".
- The GloVe embeddings should be placed in the path "data/gloveEmbeds/".
- The complete spambot and genuine tweets.csv files, used in the decision trees, should be placed in the path "data/datasets_full.csv/traditional_spambots_1.csv/" and "data/datasets_full.csv/genuine_accounts.csv/".
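Since the scripts look for data in these exact paths, it can help to verify the layout up front. A minimal sketch (the helper function and the idea of checking before running are illustrative, not part of the repository):

```python
from pathlib import Path

# Data directories the scripts expect, per the paths listed above.
EXPECTED_PATHS = [
    "data/preprocessedTweets/",
    "data/gloveEmbeds/",
    "data/datasets_full.csv/traditional_spambots_1.csv/",
    "data/datasets_full.csv/genuine_accounts.csv/",
]

def missing_data_dirs(root="."):
    """Return the expected data directories that do not exist under root."""
    return [p for p in EXPECTED_PATHS if not (Path(root) / p).exists()]

if __name__ == "__main__":
    for p in missing_data_dirs():
        print(f"Missing: {p}")
```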
conda env create -f gpu-environment.yml

Prerequisites for using the GPU environment, along with a list of compatible GPUs, can be found here.
Note: this is an environment for Windows! Using it on another OS can lead to compatibility issues.
conda env create -f cpu-environment.yml

It is advised to run the code from the proper directories, as indicated in the parentheses.
- tf-idf:
- To test the Naive Bayes (Models) model using tf-idf run:
python sklearnNB.py
- To test the Support Vector Machine (Models) model using tf-idf run:
python svm.py
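The tf-idf approach used by these scripts can be sketched with scikit-learn on toy data (the tweets and labels below are made up for illustration; the real scripts read the filtered tweets from data/preprocessedTweets/):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy tweets; 1 = bot, 0 = genuine (illustrative labels only).
tweets = [
    "free followers click here",
    "win money now click",
    "lovely weather in groningen today",
    "enjoying coffee with friends",
]
labels = [1, 1, 0, 0]

# Vectorize with tf-idf, then classify with multinomial Naive Bayes.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(tweets, labels)
print(model.predict(["click here to win"]))
```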
- word embeddings:
- Custom embeddings
- first run the training (Preprocess):
python word2vec.py
- then run the model/-s (Models):
python sklearnNBEmbeded.py
or
python svmEmbeded.py
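After training, each tweet is typically mapped to a single vector by averaging the embeddings of its tokens before classification. A minimal sketch of that averaging step, using a toy embedding table in place of the vectors word2vec.py produces (the table and dimension are invented for illustration):

```python
import numpy as np

# Toy 2-dimensional embedding table standing in for trained word vectors.
embeddings = {
    "free": np.array([0.9, 0.1]),
    "followers": np.array([0.8, 0.2]),
    "coffee": np.array([0.1, 0.9]),
}

def tweet_vector(tokens, embeddings, dim=2):
    """Average the embeddings of known tokens; zero vector if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(tweet_vector(["free", "followers"], embeddings))  # averages to [0.85, 0.15]
```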
- GloVe embeddings:
- run the model/-s (Models):
python sklearnNBEmbededGlove.py
or
python svmEmbededGlove.py
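The GloVe files in data/gloveEmbeds/ are plain text, one token per line followed by its vector components. A minimal loader sketch (the function name is illustrative; the inline sample stands in for a real GloVe file):

```python
import io
import numpy as np

def load_glove(file_obj):
    """Parse GloVe text format: each line is a token followed by its vector."""
    vectors = {}
    for line in file_obj:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Tiny inline sample in GloVe's format (real files are much larger).
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
glove = load_glove(sample)
```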
- Decision trees:
- run the logistic regression model (FeatureSelection) to show the feature importance and its performance under two encodings:
python Logistic_Regression_for_weights.py
- run the scikit-learn decision tree model (FeatureSelection) to show the accuracy with OneHot and Binary encoding:
python Decision_tree_under_two_encoding.py
- run the decision tree model (FeatureSelection), which is implemented step by step:
python decision_tree.py
or the model (FeatureSelection) using the scikit-learn library:
python skDecisionTree.py
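The decision-tree-with-encoding idea can be sketched with scikit-learn on toy categorical features (the features and labels below are invented; the real scripts derive features from the spambot and genuine .csv files):

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# Toy categorical account features; 1 = bot, 0 = genuine (illustrative).
X = [
    ["verified", "high"],
    ["unverified", "low"],
    ["unverified", "low"],
    ["verified", "high"],
]
y = [0, 1, 1, 0]

# One-hot encode the categorical columns, then fit a decision tree.
clf = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=0),
)
clf.fit(X, y)
```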