Hi!
I am trying to train DeepVariant for tumor-only somatic variant calling on RNA-seq data.
I was already in contact with @danielecook regarding some guidelines on this, which is greatly appreciated. I am now posting on GitHub to make it easier to troubleshoot. :)
I have prepared my input data:
VCFs in the expected format (filtered to coverage > 10 based on the RNA BAM file)
RNA tumor BAM (chromosomes split into train/eval/test)
exons.bed file
GRCh38.d1.vd1.fa reference
I am using the DeepVariant 1.9.0 SIF image.
Is the header of my VCF file too extensive?
I then ran make_examples for the train/eval/test sets and created the TFRecord files.
Below are the specific parameters I used:
--mode training
--customized_classes_labeler_classes_list=ref,germline,somatic
--customized_classes_labeler_info_field_name=type
--split_skip_reads=True
--channel_list='BASE_CHANNELS,insert_size'
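Since the labeler flags above point at an INFO key named `type` with the values `germline` and `somatic`, one quick sanity check is to confirm that the truth VCF declares that key in its header and to tally the values it actually carries. The following is a minimal sketch, assuming an uncompressed or gzip-compressed VCF; `open_text` and `summarize_info_key` are hypothetical helper names, not part of DeepVariant, and the demo records are made up for illustration.

```python
import gzip
from collections import Counter


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "r")


def summarize_info_key(lines, key="type"):
    """Return (declared_in_header, Counter of values) for one INFO key."""
    declared = False
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Header declaration of the INFO key, e.g. ##INFO=<ID=type,...>
        if line.startswith("##INFO=<ID=" + key + ","):
            declared = True
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        value = "<missing>"
        for entry in fields[7].split(";"):
            name, _, val = entry.partition("=")
            if name == key:
                value = val
                break
        counts[value] += 1
    return declared, counts


# Made-up demo records (illustration only, not real data).
demo = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=type,Number=1,Type=String,Description="Variant class">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\tPASS\ttype=germline",
    "chr1\t200\t.\tC\tT\t.\tPASS\tDP=12;type=somatic",
    "chr1\t300\t.\tG\tA\t.\tPASS\tDP=30",
]
declared, counts = summarize_info_key(demo)
print(declared, dict(counts))
```

On a real file, the same function could be called as `summarize_info_key(open_text(path))`. Values that differ from the class list in spelling or case (for example `Somatic`), or records counted under `<missing>`, would be worth a closer look.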
I then shuffled these records and obtained the pbtxt files.
But when I looked more closely at my TFRecord files, I saw that every example is labeled 0 (ref).
For example, for one training sample and its first 10,000 examples:
Variants in truth vcf: 5609 germline, 73 somatic
Class label distribution:
ref (class0): 10,000 (100.0%)
TOTAL: 10,000
I also checked the overlap between the truth VCF and the exons.bed file: it was 5636 variants, so I don't think this is the problem.
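A minimal sketch of one way such an overlap could be counted, assuming plain-text VCF and BED lines; the function names are hypothetical and not part of DeepVariant. It also reports contigs that appear in the VCF but not in the BED (for example `1` versus `chr1`), since a naming mismatch would place every variant outside the regions. BED intervals are treated as 0-based half-open and VCF positions as 1-based.

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into sorted, merged (start, end) intervals per contig."""
    raw = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = []
        for start, end in intervals:
            if out and start <= out[-1][1]:
                # Overlapping or touching: extend the previous interval.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def position_in_bed(merged, chrom, pos):
    """Return True if a 1-based VCF position lies in a 0-based half-open interval."""
    intervals = merged.get(chrom)
    if not intervals:
        return False
    zero_based = pos - 1
    # Index of the last interval whose start is <= zero_based.
    i = bisect.bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]


def count_overlap(vcf_lines, merged):
    """Return (records inside BED, total records, contigs absent from BED)."""
    inside = total = 0
    missing = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        total += 1
        if chrom not in merged:
            missing.add(chrom)
        elif position_in_bed(merged, chrom, int(pos)):
            inside += 1
    return inside, total, sorted(missing)


# Made-up demo input (illustration only, not real data).
bed_demo = ["chr1\t99\t150", "chr1\t140\t250"]
vcf_demo = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t250\t.\tC\tT",
    "chr1\t251\t.\tG\tA",
    "2\t100\t.\tT\tC",
]
print(count_overlap(vcf_demo, load_bed(bed_demo)))
```

The same comparison of contig names could be repeated against the BAM header and the reference, since all inputs need to agree on naming.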
I am working with TCGA data, so I suppose the quality of the data should be sufficient.
Does anyone have any insight into this? Some help would be greatly appreciated!
With kind regards,
Charlotte