
Expected number of INDELs using genomic_distribution #85

@cutleraging

Hello,

Thanks for such a great package!

In the genomic_distribution function, I understand that the expected number of mutations for a region of interest is calculated as

n_muts / surveyed_length * surveyed_region_length

However, does this provide an accurate estimate when dealing with INDELs? I would not think so, since for INDELs n_muts is not equal to the total number of mutated bases (as it is for SNVs).
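To make the concern concrete, here is a minimal Python sketch of the formula as I understand it. It is not the package's code; the helper name `expected_mutations` and all numbers are made up for illustration. It also shows the gap between counting INDEL events and counting mutated bases.

```python
def expected_mutations(n_muts, surveyed_length, surveyed_region_length):
    """Expected count in a region, assuming a uniform per-base mutation rate."""
    rate = n_muts / surveyed_length  # mutation events per surveyed base
    return rate * surveyed_region_length


# Hypothetical INDEL lengths in bp (made-up values for illustration).
indel_lengths = [1, 2, 10]
n_events = len(indel_lengths)         # each INDEL counted once
n_mutated_bases = sum(indel_lengths)  # total bases affected
print(f"events: {n_events}, mutated bases: {n_mutated_bases}")

# Hypothetical totals: 500 INDELs over 1 Mb surveyed, 50 kb of which
# falls in the region of interest.
n_indels = 500
surveyed_length = 1_000_000
surveyed_region_length = 50_000

expected = expected_mutations(n_indels, surveyed_length, surveyed_region_length)
print(f"expected INDELs in region: {expected:.1f}")
```

With SNVs the two counts coincide, because each event changes exactly one base. With INDELs they can differ, which is why I am unsure the per-base rate is the right quantity.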

Any thoughts on a better way to calculate the expected number of INDELs?

One solution I have tried is to randomly shuffle the INDELs (accounting for sequence context) and then count how many land in the region of interest. When I do this, I get an observed/expected ratio of ~1, which is what I would expect. However, I am unsure how I would then test whether this is significant using the binomial_test function. Would it make sense to do something along these lines?

p = n_INDELs / surveyed_length
n = surveyed_region_length
x = observed_INDELs # number of INDELs observed to land in region of interest from the randomly shuffled files
binomial_test(p, n, x)
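For reference, here is a rough Python sketch of what a one-sided binomial test on these inputs would compute. It is a hypothetical stand-in written with the standard library, not the package's binomial_test. The function names, the choice of tail, and all numbers are my assumptions.

```python
import math


def binom_log_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), assuming 0 < p < 1."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def one_sided_binomial_p(p, n, x):
    """One-sided p-value in the direction of the observed deviation.

    Assumption: test for depletion when x is below the expected count
    p * n, and for enrichment otherwise.
    """
    expected = p * n
    if x < expected:
        ks = range(0, x + 1)  # depletion: P(X <= x)
    else:
        ks = range(x, n + 1)  # enrichment: P(X >= x)
    total = sum(math.exp(binom_log_pmf(k, n, p)) for k in ks)
    return min(1.0, total)  # guard against rounding slightly above 1


# Hypothetical inputs (made-up values for illustration).
n_indels = 500
surveyed_length = 1_000_000
surveyed_region_length = 50_000
observed_indels = 40

p = n_indels / surveyed_length  # per-base probability of an INDEL event
n = surveyed_region_length      # number of "trials" = bases in the region
x = observed_indels             # count that landed in the region
p_value = one_sided_binomial_p(p, n, x)
print(f"one-sided p-value: {p_value:.4g}")
```

Note that this treats every base in the region as an independent trial with the same probability p, which is exactly the assumption I am unsure about for INDELs.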

Any input would be wonderful, thanks!
Ronnie
