Skip to content

add_domains_to_table silently fails to add some overlapping domains with options overlap=True and source="hmmer" #23

@kbseah

Description

@kbseah

The function isotools.domains.add_domains_to_table should find domains that overlap with alternative splicing features when the options overlap=True and source="hmmer" are chosen. However, when the gene feature is on the reverse strand, not all overlapping domains are found.

This results in incorrect results without any warning messages being raised.

I traced the issue to two functions: isotools._utils.genomic_position and isotools._utils.has_overlap.

isotools._utils.has_overlap is called within isotools.domains.add_domains_to_table to check whether a given domain overlaps with the alternative splicing event coordinates:

if overlap_only and not has_overlap(dom[4], (row.start, row.end)):

has_overlap does not check if coordinates are flipped

has_overlap assumes that start <= end for coordinates (start,end). However, I found that when a gene is on the reverse strand, the order of the domain coordinates in the transcript object are flipped, (end,start).

For partially overlapping features, has_overlap will wrongly return False if the coordinates are flipped:

# Partial overlap left
# start, end order
has_overlap((90, 118), (100, 200)) # correctly returns True
has_overlap((118, 90), (100, 200)) # wrongly returns False

Coordinates are flipped for reverse strand features when converting from transcript coordinates to genomic coordinates

pyhmmer itself reports domain coordinates such that start <= end. The flipping of coordinates happens when genomic_position is called within isotools.domains.add_hmmer_domains and isotools.domains.add_iterpro_domains:

pos_map = genomic_position(
[p + orf[0] for dom in domL for p in dom[3]],
transcript["exons"],
gene.strand == "-",
)

For example,

tr_pos = (350, 450)
exons = [[100, 500],[600, 800],]
genomic_position(tr_pos, exons, True) 

will return {450: 250, 350: 350}

The transcript coordinates are the dict keys, and are referenced when adding the genomic coordinates to the domain tuple:

(pos_map[dom[3][0] + orf[0]], pos_map[dom[3][1] + orf[0]]),

For the example above, the transcript coordinates 350, 450 will give genomic coordinates 350, 250 because the feature is on the reverse strand. This is the point where the genomic coordinates become flipped.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions