Question about the interpretation of present and missing in meryl statistics under canonical k-mers #56

Hi, when running meryl count (version 1.4) with the default canonical k-mers, followed by meryl statistics, I see behavior at small k (e.g. k=4) that I find confusing, and it is reproducible across multiple chromosomes. Specifically, for a sequence of length L containing only A/C/G/T, the reported present value does not equal L - k + 1, even though the sum of all k-mer counts from meryl print does match present. For larger k (e.g. k=23), present = L - k + 1 as expected. Could you clarify what present is intended to represent under canonical counting?
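To make the expectation concrete, here is a minimal sketch in plain Python (it does not call meryl). It assumes that "canonical" means the lexicographically smaller of a k-mer and its reverse complement, and the names `canonical` and `count_canonical` are hypothetical helpers for illustration only. Under that assumption, every window of an A/C/G/T-only sequence contributes exactly one count, so the counts should sum to L - k + 1 regardless of k.

```python
from collections import Counter

# Translation table for complementing A/C/G/T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement.

    Assumption: this is the definition of 'canonical' used for the comparison.
    """
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical(seq: str, k: int) -> Counter:
    """Count canonical k-mers over every window of an A/C/G/T-only sequence."""
    return Counter(canonical(seq[i:i + k]) for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    seq = "ACGTTGCAACGGTCA"  # toy sequence, not real data
    k = 4
    counts = count_canonical(seq, k)
    total = sum(counts.values())
    # One count per window, so the total equals L - k + 1.
    print(total, len(seq) - k + 1)
```

Comparing this total against the present value reported for the same input is one way to check whether present follows the same definition.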

In addition, for k=4, meryl statistics reports distinct = 136 and missing = 120, which sum to 4^k = 256. This suggests that missing is computed in strand-specific k-mer space, even though the database itself is built using canonical k-mers (canonical space size = 136 for k=4). Since distinct already equals the canonical space size, I would expect missing = 0 in canonical space. This mixing of canonical and strand-specific spaces makes the statistics difficult to interpret. Would it be possible to report missing consistently in canonical k-mer space when canonical counting is used?
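For reference, the canonical space size can be checked independently of meryl. The sketch below assumes the same definition of canonical (the smaller of a k-mer and its reverse complement); `canonical_space_size` and `enumerate_canonical` are hypothetical names. For even k there are 4^(k/2) k-mers that equal their own reverse complement, so the size is (4^k + 4^(k/2)) / 2. For odd k no such k-mers exist, so the size is 4^k / 2.

```python
from itertools import product

# Translation table for complementing A/C/G/T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_space_size(k: int) -> int:
    """Closed-form number of canonical k-mers over the alphabet A/C/G/T."""
    if k % 2 == 0:
        # Even k: 4^(k/2) k-mers are their own reverse complement.
        return (4 ** k + 4 ** (k // 2)) // 2
    # Odd k: every k-mer pairs with a distinct reverse complement.
    return 4 ** k // 2


def enumerate_canonical(k: int) -> set:
    """Brute-force set of canonical k-mers; only practical for small k."""
    result = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
        result.add(min(kmer, reverse_complement))
    return result


if __name__ == "__main__":
    k = 4
    size = canonical_space_size(k)
    print(size)                        # 136
    print(len(enumerate_canonical(k))) # 136
    print(4 ** k - size)               # 120
```

For k=4 this gives 136 canonical k-mers, and 256 - 136 = 120 matches the reported missing value, which is consistent with missing being computed against the full 4^k space.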
