Expected number of INDELs using genomic_distribution

Hello,

Thanks for such a great package!

In the `genomic_distribution` function, I understand that the expected number of mutations for a region of interest is calculated as

`n_muts / surveyed_length * surveyed_region_length`

However, does this provide an accurate estimate when dealing with INDELs? I would not think so, since `n_muts` is not equal to the total number of mutated bases (as it is for SNVs).
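To make the formula above concrete, here is a minimal sketch of the calculation as I understand it. It is written in plain Python purely for illustration; the function name `expected_mutations` and the numbers are hypothetical and are not taken from the package.

```python
def expected_mutations(n_muts, surveyed_length, surveyed_region_length):
    """Expected mutation count in a region, assuming a uniform per-base rate.

    Hypothetical helper mirroring the formula quoted above; it is not the
    package's own code.
    """
    if surveyed_length <= 0:
        raise ValueError("surveyed_length must be positive")
    # Genome-wide rate (mutations per surveyed base) ...
    rate = n_muts / surveyed_length
    # ... scaled by the number of surveyed bases inside the region of interest.
    return rate * surveyed_region_length


# Made-up numbers: 1,000 mutations over 1 Mb surveyed, 50 kb of which
# falls inside the region of interest.
expected = expected_mutations(1_000, 1_000_000, 50_000)
print(expected)
```

Each mutation counts once here regardless of how many bases it spans, which is the assumption I am unsure about for INDELs.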

Any thoughts on a better way to calculate the expected number of INDELs? 

One solution I have tried is to randomly shuffle the INDELs (accounting for sequence context) and then count how many land in the region of interest. When I do this, I get an observed/expected ratio of ~1, which is what I would expect. However, I am not sure how I would then test whether this is significant using the `binomial_test` function. Would it make sense to do something along these lines?

```
p = n_INDELs / surveyed_length
n = surveyed_region_length
x = observed_INDELs # number of INDELs observed to land in region of interest from the randomly shuffled files
binomial_test(p, n, x)
```
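To spell out what the two options would compute, here is a rough sketch in plain Python (standard library only). Both helpers, `binomial_tail_pvalue` and `empirical_pvalue`, are hypothetical names of my own and are not the package's `binomial_test`. The first assumes a one-sided exact binomial test with `p`, `n` and `x` as in the snippet above; the second assumes the shuffled counts are used directly as a null distribution.

```python
import math


def binom_log_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), stable for large n."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binomial_tail_pvalue(p, n, x):
    """One-sided exact binomial p-value (hypothetical helper).

    Uses the lower tail P(X <= x) when x is below the expectation n * p,
    and the upper tail P(X >= x) otherwise.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0 <= x <= n:
        raise ValueError("x must lie between 0 and n")
    step = -1 if x < n * p else 1
    total, k = 0.0, x
    # Walk away from the expectation; terms shrink, so stop once negligible.
    while 0 <= k <= n:
        term = math.exp(binom_log_pmf(k, n, p))
        total += term
        if term <= total * 1e-17:
            break
        k += step
    return min(total, 1.0)


def empirical_pvalue(observed, shuffled_counts):
    """Upper-tail permutation p-value with the usual +1 correction."""
    at_least = sum(1 for c in shuffled_counts if c >= observed)
    return (at_least + 1) / (len(shuffled_counts) + 1)


# Toy numbers only, to show the call shape.
print(binomial_tail_pvalue(0.5, 10, 8))
print(empirical_pvalue(12, [8, 9, 10, 11, 13]))
```

The binomial version still treats every surveyed base as an independent trial, so it carries the same assumption as the expected-count formula; the empirical version avoids that but needs many shuffles to resolve small p-values.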

Any input would be wonderful, thanks!
Ronnie
