Training deepvariant on RNA TUMOR ONLY SOMATIC variant calling #1072

@chstubbe

Hi!

I am trying to train DeepVariant for tumor-only somatic variant calling on RNA-seq data.
I was already in contact with @danielecook about some guidelines for this, which I greatly appreciate. I am now posting on GitHub to make troubleshooting easier. :)

I have prepared my input data:

- VCFs in the right output format (these VCFs are filtered on coverage > 10 based on the RNA BAM file)
- RNA tumor BAM (split by chromosome for eval/train/test)
- exons.bed file
- GRCh38.d1.vd1.fa reference

I use the DeepVariant version 1.9.0 sif image.

[Image: screenshot of the VCF header]

Is the header of my vcf file too extensive?

Then I ran make_examples for train/test/eval and created the TFRecord files.
These are the specific parameters I used:
--mode training
--customized_classes_labeler_classes_list=ref,germline,somatic
--customized_classes_labeler_info_field_name=type
--split_skip_reads=True
--channel_list='BASE_CHANNELS,insert_size'
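A minimal sketch of how the flags above could be combined into a full make_examples argument list. This is not the exact command that was run: the BAM, VCF, and output file names are hypothetical placeholders, and passing exons.bed via `--confident_regions` is an assumption. Only the reference name, the BED name, and the five flags listed above come from the post.

```python
import shlex

# Placeholder inputs and outputs. The BAM, VCF, and TFRecord names are
# hypothetical; using exons.bed as --confident_regions is an assumption.
paths = {
    "ref": "GRCh38.d1.vd1.fa",
    "reads": "tumor_rna.train.bam",
    "truth_variants": "truth.train.vcf.gz",
    "confident_regions": "exons.bed",
    "examples": "train.examples.tfrecord.gz",
}


def build_make_examples_args(paths):
    """Return an argument list: placeholder I/O paths plus the flags from the post."""
    args = ["make_examples", "--mode=training"]
    args += [f"--{name}={value}" for name, value in paths.items()]
    args += [
        "--customized_classes_labeler_classes_list=ref,germline,somatic",
        "--customized_classes_labeler_info_field_name=type",
        "--split_skip_reads=True",
        "--channel_list=BASE_CHANNELS,insert_size",
    ]
    return args


command = build_make_examples_args(paths)
# Print a shell-quoted version for inspection; nothing is executed here.
print(shlex.join(command))
```

Printing the command rather than running it makes it easy to compare the train, eval, and test invocations side by side before launching them inside the container.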

I then shuffled these records and obtained the pbtxt files.

But when I looked more closely at my record files, I saw that all the variants get label 0 (ref).
For example, for one training sample and its first 10,000 examples:
Variants in truth vcf: 5609 germline, 73 somatic
Class label distribution:
ref (class0): 10,000 (100.0%)
TOTAL: 10,000
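For reference, a minimal sketch of how the germline/somatic counts in a truth VCF could be tallied from the INFO field named in `--customized_classes_labeler_info_field_name` (here `type`). The function name and the three example records are made up for illustration; it only assumes a standard tab-separated VCF with INFO in the eighth column.

```python
from collections import Counter


def count_info_values(vcf_lines, info_key="type"):
    """Count the values of one INFO key across all VCF records."""
    counts = Counter()
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
        info = dict(
            item.split("=", 1) if "=" in item else (item, "")
            for item in fields[7].split(";")
        )
        counts[info.get(info_key, "<missing>")] += 1
    return counts


# Made-up records, only to show the expected input shape.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\ttype=germline\n",
    "chr1\t200\t.\tC\tT\t.\tPASS\ttype=somatic;DP=30\n",
    "chr1\t300\t.\tG\tA\t.\tPASS\tDP=12\n",
]
type_counts = count_info_values(example_vcf)
print(type_counts)
```

For a real file, the lines could come from `open(path)` or `gzip.open(path, "rt")`. Records without the key are counted under `<missing>`, which makes it visible when the field is absent or spelled differently from the labeler flag.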
I also checked the overlap with the exons.bed file, which was 5636, so I don't think this is the problem.
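A minimal sketch of one way such an overlap count could be computed, not necessarily how the number above was obtained. The function names and example records are hypothetical. It assumes BED intervals are 0-based and half-open while VCF positions are 1-based, and it only counts a hit when the contig names match exactly.

```python
from collections import defaultdict


def load_bed(bed_lines):
    """Group BED intervals by contig as (start, end) tuples (0-based, half-open)."""
    intervals = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals[chrom].append((int(start), int(end)))
    return intervals


def count_overlapping(vcf_lines, intervals):
    """Count VCF records whose POS falls inside any interval on the same contig."""
    hits = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t")[:2]
        pos0 = int(pos) - 1  # convert 1-based VCF POS to a 0-based coordinate
        # Simple linear scan per variant; fine for a sanity check.
        if any(start <= pos0 < end for start, end in intervals.get(chrom, ())):
            hits += 1
    return hits


# Made-up records, only to show the coordinate handling.
example_bed = ["chr1\t99\t150\n", "chr2\t0\t50\n"]
example_variants = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\ttype=germline\n",  # inside chr1:99-150
    "chr1\t151\t.\tC\tT\t.\tPASS\ttype=somatic\n",   # just past the interval end
    "1\t100\t.\tG\tA\t.\tPASS\ttype=germline\n",     # contig name differs from BED
]
overlap = count_overlapping(example_variants, load_bed(example_bed))
print(overlap)
```

The linear scan keeps the sketch short. For a genome-wide exon BED, sorting the intervals and using a binary search or an interval tree would be a faster choice.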

I am working with TCGA data, so I suppose the quality of the data should be sufficient.
Does anyone have any insight on this? Some help would be greatly appreciated!

With kind regards
Charlotte
