Information Retrieval Classification

PyTorch implementation of sentiment analysis on product comments.

Requirements

Please use Python 3, Ubuntu 16.04, Git, and an NVIDIA GPU with CUDA Toolkit 8 and cuDNN 6. To install the Python libraries:

pip3 install -r requirements.txt
python3 -m spacy download en  # downloads the English model for the spaCy tokenizer

If the installation fails, open requirements.txt and install the packages one by one.

Dataset

The dataset should be separated into a training file and a test file, both in CSV format. The CSV files should not contain an index column (if saving with pandas, use df.to_csv(file_name, index=False), where df is the DataFrame). There must be 2 columns: text and label.
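
As a quick sanity check before training, the column requirement above can be verified with the standard library alone. This is a minimal sketch, not part of the repository: the helper name check_dataset_csv and the sample rows (including the label values) are hypothetical.

```python
import csv
import io


def check_dataset_csv(csv_text):
    """Return the number of data rows if the header is exactly text,label."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    if header != ["text", "label"]:
        raise ValueError(f"expected header ['text', 'label'], got {header}")
    rows = 0
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 fields, got {len(row)}")
        rows += 1
    return rows


# Hypothetical sample content; the label values are placeholders.
good = 'text,label\n"Great phone, works well",positive\nBattery died in a week,negative\n'
# What pandas writes when index=False is forgotten: an unnamed index column.
bad = ",text,label\n0,Great phone,positive\n"

print(check_dataset_csv(good))
try:
    check_dataset_csv(bad)
except ValueError as err:
    print("rejected:", err)
```

To check a real file, read it with open(path, newline="", encoding="utf-8").read() and pass the text to the helper.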

CNN and LSTM Model

The custom modules for the CNN and LSTM models are stored in the model_module folder.

JSON file for configuring model training

The JSON files are already prepared with appropriate settings. Most of the settings are model hyperparameters. Important parameters that might need to be changed:

"train_dataset_path": Path of the training data.
"dev_dataset_path": Path of the test data.
"result_folder_path": Where to save results such as the model, the confusion matrix image, etc.
"use_git": Whether to use the current commit information for better result versioning. If true, results are saved in result_folder_path/[branch_name][commit_date_GMT_0][time_duration_after_commit].
"pretrained_word_embedding_name": Name of the pretrained word vectors used. For now, word2vec is recommended. word2vec can be downloaded at https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
"pretrained_word_embedding_path": Path of the word vectors .bin file.
"embedding_dim": Size of the word vectors (300 for word2vec).
"train_embedding_layer": Set to false to make the embedding layer static.
"epoch": Number of training epochs.
"kernel_sizes" (CNN only): List of region sizes used for the CNN. Currently, ensemble learning is created by modifying this parameter.

Please read the JSON file and try running training first to understand the settings better.
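
To try different settings without editing the prepared files, a copy of a configuration with a few keys overridden can be written like this. It is only a sketch: override_config and my_cnn_parameter.json are hypothetical names, and the values in the stand-in configuration are invented for the demonstration rather than taken from the repository.

```python
import json
import os
import tempfile


def override_config(src_path, dst_path, **overrides):
    """Copy a JSON config to dst_path, replacing only keys that already exist."""
    with open(src_path, encoding="utf-8") as f:
        config = json.load(f)
    unknown = [key for key in overrides if key not in config]
    if unknown:
        # Refuse misspelled keys instead of silently adding new ones.
        raise KeyError(f"keys not present in the config: {unknown}")
    config.update(overrides)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return config


# Stand-in config with invented values, written to a temporary folder.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "train_cnn_parameter.json")
with open(src, "w", encoding="utf-8") as f:
    json.dump(
        {"epoch": 10, "use_git": True, "kernel_sizes": [3, 4, 5],
         "train_embedding_layer": True},
        f,
    )

dst = os.path.join(workdir, "my_cnn_parameter.json")
updated = override_config(src, dst, epoch=5, use_git=False, kernel_sizes=[2, 3, 4])
print(updated["epoch"], updated["kernel_sizes"])
```

The new file can then be passed to the training script through its --path argument.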

Running training to save model, plot evaluation accuracy, plot evaluation loss, confusion matrix, precision, recall, and F1 Score

CNN

python train_cnn.py --path train_cnn_parameter.json

LSTM

python train_lstm.py --path train_lstm_parameter.json

The confusion matrix, precision, recall, and F1 score are printed to the console (0 = positive, 1 = neutral, 2 = negative). The PNG image of the confusion matrix is saved in result_folder_path/[branch_name][commit_date_GMT_0][time_duration_after_commit]/confusion_matrix_folder_path. The vocabularies built from the dataset are saved as text_vocab.pkl and label_vocab.pkl in the project root folder.
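
For readers unfamiliar with these metrics, the sketch below shows how a confusion matrix and per-class precision, recall, and F1 score relate to each other, using the label indices listed above. It is a plain-Python illustration with invented predictions, not the code the training scripts use.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts samples with true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_scores(matrix):
    """Return a (precision, recall, f1) tuple for every class index."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum
        actual = sum(matrix[k])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores.append((precision, recall, f1))
    return scores


LABELS = {0: "positive", 1: "neutral", 2: "negative"}

# Invented example data.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

matrix = confusion_matrix(y_true, y_pred, num_classes=3)
scores = per_class_scores(matrix)
for k, (precision, recall, f1) in enumerate(scores):
    print(f"{LABELS[k]}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```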

View evaluation accuracy and loss with tensorboard

To view the graphs of test loss and accuracy, run:

tensorboard --logdir=[result_folder_path]

Run prediction on single model CNN and ensemble learning CNN

A Jupyter notebook, trained models, and vocabulary data are prepared for these tasks. The trained models and vocabulary data are saved in the CNN_single_and_ensemble_learning_related folder. The notebook is saved as CNN_single_model_and_ensemble_model.ipynb. Start the Jupyter server before opening the notebook:

jupyter notebook
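
This README does not describe how the ensemble combines its member models, so the following is only one common possibility: soft voting, where the class probabilities of several models (for example, CNNs trained with different kernel_sizes) are averaged and the most probable class wins. The function name and the probabilities are invented for illustration; the notebook may combine predictions differently.

```python
def soft_vote(per_model_probs):
    """Average class probabilities across models and pick the best class.

    per_model_probs holds one probability list per model, all the same length.
    Returns (predicted_class_index, averaged_probabilities).
    """
    num_models = len(per_model_probs)
    num_classes = len(per_model_probs[0])
    averaged = [
        sum(probs[k] for probs in per_model_probs) / num_models
        for k in range(num_classes)
    ]
    best = max(range(num_classes), key=averaged.__getitem__)
    return best, averaged


# Invented outputs of three models for one comment
# (index 0 = positive, 1 = neutral, 2 = negative).
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
label, averaged = soft_vote(probs)
print(label, [round(p, 3) for p in averaged])
```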

Run prediction on single model LSTM

A Jupyter notebook, a trained model, and vocabulary data are prepared for this task. The trained model and vocabulary data are saved in the LSTM_related folder. The notebook is saved as LSTM experiment.ipynb. Start the Jupyter server before opening the notebook:

jupyter notebook

Cohen's kappa on test dataset

Cohen's kappa coefficient is calculated in the cohen_kappa.ipynb Jupyter notebook. The dataset containing 2 versions of labeling is saved in ir_test_manual_label.csv. Start the Jupyter server before opening the notebook:

jupyter notebook
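
For reference, Cohen's kappa compares the observed agreement p_o between two labelings with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it in plain Python on invented labels; it illustrates the formula and is not the code in the notebook.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each labeling's class frequencies, summed.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelings use a single identical class
    return (observed - expected) / (1 - expected)


# Invented labels from two annotators (0 = positive, 1 = neutral, 2 = negative).
annotator_1 = [0, 0, 1, 1, 2, 2, 0, 1]
annotator_2 = [0, 0, 1, 2, 2, 2, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 4))
```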
