The function isotools.domains.add_domains_to_table should find domains that overlap with alternative splicing features when the options overlap=True and source="hmmer" are chosen. However, when the gene feature is on the reverse strand, not all overlapping domains are found.
This results in incorrect results without any warning messages being raised.
I traced the issue to two functions: isotools._utils.genomic_position and isotools._utils.has_overlap.
isotools._utils.has_overlap is called within isotools.domains.add_domains_to_table to check whether a given domain overlaps with the alternative splicing event coordinates:
|
if overlap_only and not has_overlap(dom[4], (row.start, row.end)): |
has_overlap does not check if coordinates are flipped
has_overlap assumes that start <= end for coordinates (start,end). However, I found that when a gene is on the reverse strand, the order of the domain coordinates in the transcript object are flipped, (end,start).
For partially overlapping features, has_overlap will wrongly return False if the coordinates are flipped:
# Partial overlap left
# start, end order
has_overlap((90, 118), (100, 200)) # correctly returns True
has_overlap((118, 90), (100, 200)) # wrongly returns False
Coordinates are flipped for reverse strand features when converting from transcript coordinates to genomic coordinates
pyhmmer itself reports domain coordinates such that start <= end. The flipping of coordinates happens when genomic_position is called within isotools.domains.add_hmmer_domains and isotools.domains.add_iterpro_domains:
|
pos_map = genomic_position( |
|
[p + orf[0] for dom in domL for p in dom[3]], |
|
transcript["exons"], |
|
gene.strand == "-", |
|
) |
For example,
tr_pos = (350, 450)
exons = [[100, 500],[600, 800],]
genomic_position(tr_pos, exons, True)
will return {450: 250, 350: 350}
The transcript coordinates are the dict keys, and are referenced when adding the genomic coordinates to the domain tuple:
|
(pos_map[dom[3][0] + orf[0]], pos_map[dom[3][1] + orf[0]]), |
For the example above, the transcript coordinates 350, 450 will give genomic coordinates 350, 250 because the feature is on the reverse strand. This is the point where the genomic coordinates become flipped.
The function
isotools.domains.add_domains_to_tableshould find domains that overlap with alternative splicing features when the optionsoverlap=Trueandsource="hmmer"are chosen. However, when the gene feature is on the reverse strand, not all overlapping domains are found.This results in incorrect results without any warning messages being raised.
I traced the issue to two functions:
isotools._utils.genomic_positionandisotools._utils.has_overlap.isotools._utils.has_overlapis called withinisotools.domains.add_domains_to_tableto check whether a given domain overlaps with the alternative splicing event coordinates:IsoTools2/src/isotools/domains.py
Line 119 in 423764b
has_overlapdoes not check if coordinates are flippedhas_overlapassumes thatstart <= endfor coordinates(start,end). However, I found that when a gene is on the reverse strand, the order of the domain coordinates in the transcript object are flipped,(end,start).For partially overlapping features,
has_overlapwill wrongly returnFalseif the coordinates are flipped:Coordinates are flipped for reverse strand features when converting from transcript coordinates to genomic coordinates
pyhmmer itself reports domain coordinates such that start <= end. The flipping of coordinates happens when
genomic_positionis called withinisotools.domains.add_hmmer_domainsandisotools.domains.add_iterpro_domains:IsoTools2/src/isotools/domains.py
Lines 315 to 319 in 423764b
For example,
will return
{450: 250, 350: 350}The transcript coordinates are the dict keys, and are referenced when adding the genomic coordinates to the domain tuple:
IsoTools2/src/isotools/domains.py
Line 323 in 423764b
For the example above, the transcript coordinates
350, 450will give genomic coordinates350, 250because the feature is on the reverse strand. This is the point where the genomic coordinates become flipped.