question about interpretation of present and missing in meryl statistics under canonical k-mers #56

@justdx

Hi, when running meryl count (version 1.4) with default canonical k-mers, followed by meryl statistics, I see behavior that is confusing for small k (e.g. k=4) and reproducible across multiple chromosomes. For sequences containing only A/C/G/T, the reported present value does not equal L - k + 1 (where L is the sequence length), even though the sum of all k-mer counts from meryl print matches present. For larger k (e.g. k=23), present = L - k + 1 as expected. Could you clarify what present is intended to represent under canonical counting?
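
For reference, the L - k + 1 expectation can be reproduced independently of meryl with a brute-force count. This is only a minimal sketch, not meryl's implementation: `canonical` and `count_canonical` are hypothetical helper names, the toy sequence is made up, and the canonical form is assumed to be the lexicographically smaller of a k-mer and its reverse complement.

```python
from collections import Counter

# Translation table for complementing A/C/G/T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement (assumed definition)."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)

def count_canonical(seq, k):
    """Count canonical k-mers over every window of an A/C/G/T-only sequence."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        counts[canonical(seq[i:i + k])] += 1
    return counts

seq = "ACGTACGTTGCA"  # toy sequence, L = 12
k = 4
counts = count_canonical(seq, k)
total = sum(counts.values())
print(total, len(seq) - k + 1)  # 9 9
```

In this sketch every window contributes exactly one count, so the total is L - k + 1 regardless of k. Comparing such a reference count against the meryl print output for a small test sequence may help pinpoint which k-mers account for the difference.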

In addition, for k=4, meryl statistics reports distinct = 136 and missing = 120, which sum to 4^k = 256. This suggests that missing is computed in strand-specific k-mer space, even though the database itself is built using canonical k-mers (canonical space size = 136 for k=4). This mixing of canonical and strand-specific spaces makes the statistics difficult to interpret. Would it be possible to report missing consistently in canonical k-mer space when canonical counting is used?
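
The space sizes quoted above can be checked by enumeration. The sketch below is an independent illustration, not taken from meryl; the function names are hypothetical, and the canonical form is again assumed to be the lexicographically smaller of a k-mer and its reverse complement. The closed form uses the fact that for even k there are 4^(k/2) k-mers equal to their own reverse complement, while for odd k there are none.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical_space_size(k):
    """Count distinct canonical k-mers by enumerating all 4^k k-mers."""
    seen = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        rc = kmer.translate(_COMPLEMENT)[::-1]
        seen.add(min(kmer, rc))
    return len(seen)

def canonical_space_size_closed_form(k):
    """Closed form: palindromic k-mers (even k only) are their own canonical form."""
    if k % 2 == 0:
        return (4**k + 4**(k // 2)) // 2
    return 4**k // 2

k = 4
n_canonical = canonical_space_size(k)
print(n_canonical, 4**k - n_canonical)  # 136 120
```

For k=4 this gives 136 canonical k-mers and 120 non-canonical ones. That is consistent with distinct = 136 meaning every canonical 4-mer was observed and missing = 120 being measured against the full 4^k space, although it does not by itself confirm how meryl statistics computes missing.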
