Hi!
I’m experiencing some issues training a custom SquiggleNet model. Would it be possible to check whether I’m doing something wrong?
Classification using your pretrained models, on human and E. coli R9 reads
First I tried running SquiggleNet inference using your pretrained models, on human and E. coli nanopore R9 data. This gave very nice results: an accuracy of 83-86% (depending on the model).
I did assume that human = 1 and bacterial = 0 in these models. Is this correct?
Classification using a custom model, on human and SARS-CoV2 R9 reads
Then I trained a model on human and SARS-CoV2 data. Classifying the test data with this custom model resulted in an accuracy of only 3%.
I am not sure whether I executed your scripts correctly. I would very much appreciate it if you could check whether I ran them as intended.
Splitting up the data into training, validation and test datasets
- I used an equal number of target (= SARS-CoV2) and non-target (= human) reads (269507 reads)
- 80% of the reads were randomly selected and allocated to the training dataset, another 10% to the validation dataset, and the remaining 10% to the test dataset
- all other reads (those not included in the 269507 reads I started with) were also allocated to the test dataset
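The split described above can be sketched roughly as follows. This is a minimal illustration, not the exact script that was used: the function name split_read_ids, the fixed seed and the example read IDs are all hypothetical.

```python
import random


def split_read_ids(read_ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle read IDs and split them into train/validation/test lists.

    Hypothetical helper: whatever is left after the training and
    validation fractions goes to the test set.
    """
    ids = list(read_ids)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# Made-up read IDs, only to show the call.
example_ids = [f"read_{i}" for i in range(1000)]
train, val, test = split_read_ids(example_ids)
```

Each list would then be written to a text file (one read ID per line) and passed to preprocess.py via -gp or -gn, once for the target reads and once for the non-target reads.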
Preprocessing the training data
- python ./SquiggleNet/preprocess.py -gp sarscov2_train_readids.txt -gn human_train_readids.txt -i fast5_human_and_sarscov2 -o outfolder_train
- this resulted in 21 neg_*.pt and 15 pos_*.pt data batches
Preprocessing the validation data
- python ./SquiggleNet/preprocess.py -gp sarscov2_val_readids.txt -gn human_val_readids.txt -i fast5_human_and_sarscov2 -o outfolder_val
- this resulted in 2 neg_*.pt and 1 pos_*.pt data batches
Training a custom model
- I was unsure how the trainer.py script is meant to be run. The preprocessing script produced multiple PyTorch tensor files, but I think only one file per argument can be passed to the trainer script at a time?
- I used 15 batches to train the model, with one target and one non-target batch per iteration. After the first iteration, I used the --intermediate option to fine-tune the model from the previous iteration. Is this how the trainer script is intended to be used?
- python ./SquiggleNet/trainer.py -tt outfolder_train/pos_10000.pt -nt outfolder_train/neg_10000.pt -tv outfolder_val/pos_10000.pt -nv outfolder_val/neg_10000.pt -o trainedModel_b1.ckpt -e 3
- python ./SquiggleNet/trainer.py -tt outfolder_train/pos_20000.pt -nt outfolder_train/neg_20000.pt -tv outfolder_val/pos_10000.pt -nv outfolder_val/neg_10000.pt -i ./trainedModel_b1.ckpt -e 3 -o trainedModel_b2.ckpt
- […]
- python ./SquiggleNet/trainer.py -tt outfolder_train/pos_150000.pt -nt outfolder_train/neg_150000.pt -tv outfolder_val/pos_10000.pt -nv outfolder_val/neg_10000.pt -i ./trainedModel_b14.ckpt -e 3 -o trainedModel_b15.ckpt
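The chained runs above follow a regular pattern, which can be written out as a short sketch. It only builds the command lines and does not run them. It assumes the batch files are named pos_<k*10000>.pt and neg_<k*10000>.pt, as in the commands shown; the helper name build_trainer_commands is hypothetical, and the flags are the ones used above.

```python
def build_trainer_commands(n_batches, epochs=3, batch_step=10000):
    """Build one trainer.py command per batch pair.

    Hypothetical helper: every run after the first resumes from the
    checkpoint written by the previous run via -i.
    """
    commands = []
    for k in range(1, n_batches + 1):
        tag = k * batch_step
        cmd = [
            "python", "./SquiggleNet/trainer.py",
            "-tt", f"outfolder_train/pos_{tag}.pt",
            "-nt", f"outfolder_train/neg_{tag}.pt",
            # The same validation batch is reused for every run.
            "-tv", "outfolder_val/pos_10000.pt",
            "-nv", "outfolder_val/neg_10000.pt",
            "-e", str(epochs),
            "-o", f"trainedModel_b{k}.ckpt",
        ]
        if k > 1:
            cmd += ["-i", f"./trainedModel_b{k - 1}.ckpt"]
        commands.append(cmd)
    return commands


commands = build_trainer_commands(15)
for cmd in commands:
    print(" ".join(cmd))
```

With 15 batches, the last command in this chain writes trainedModel_b15.ckpt. Each list could be passed to subprocess.run(cmd, check=True) to execute the runs in order.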
Classifying the test data using the custom model
- python ./SquiggleNet/inference.py -m trainedModel_b16.ckpt -i fast5_human_and_sarscov2_testdata/ -o classification_results_trainedModel_b16/
- only 3% of the reads were allocated to the correct class
Any help would be greatly appreciated!