Hi, when using meryl count (version 1.4) with default canonical k-mers, followed by meryl statistics, I observed behavior that is confusing for small k (e.g. k=4) and reproducible across multiple chromosomes. Specifically, for sequences of length L containing only A/C/G/T, the reported present value does not equal L - k + 1, even though the sum of all k-mer counts from meryl print matches present. For larger k (e.g. k=23), present does equal L - k + 1. It would be helpful to clarify what present is intended to represent under canonical counting.
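For reference, this is the expectation I am comparing against, written as a minimal Python sketch. It is not meryl's implementation: the toy sequence is made up, and "canonical" is assumed to mean the lexicographically smaller of a k-mer and its reverse complement, which may not match meryl's internal ordering. Under that assumption every window contributes exactly one count, so the total is always L - k + 1, regardless of k.

```python
from collections import Counter

# Complement table for unambiguous DNA bases only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement (assumed definition)."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)

def count_canonical(seq, k):
    """Count canonical k-mers over every window of an A/C/G/T-only sequence."""
    return Counter(canonical(seq[i:i + k]) for i in range(len(seq) - k + 1))

seq = "ACGTTGCAACGTAGCT"  # hypothetical toy sequence, L = 16
k = 4
counts = count_canonical(seq, k)

# Total occurrences versus the expected number of windows.
print(sum(counts.values()), len(seq) - k + 1)  # prints: 13 13
```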
In addition, for k=4, meryl statistics reports distinct = 136 and missing = 120, which sum to 4^k = 256. This suggests that missing is computed in strand-specific k-mer space, even though the database itself is built using canonical k-mers (canonical space size = 136 for k=4). This mixing of canonical and strand-specific spaces makes the statistics difficult to interpret. Would it be possible to report missing consistently in canonical k-mer space when canonical counting is used?
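The numbers above can be checked by brute force. The sketch below enumerates all 4^k strand-specific k-mers and collapses each to a canonical form. It is only an illustration, not meryl code, and again assumes canonical means the lexicographically smaller of a k-mer and its reverse complement. The size of the canonical space does not depend on which of the two is chosen as the representative.

```python
from itertools import product

# Complement table for unambiguous DNA bases only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    return kmer.translate(COMPLEMENT)[::-1]

def canonical(kmer):
    """Assumed canonical form: the smaller of the k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

def canonical_space_size(k):
    """Number of distinct canonical k-mers, found by enumerating all 4^k k-mers."""
    return len({canonical("".join(p)) for p in product("ACGT", repeat=k)})

k = 4
total = 4 ** k                    # strand-specific space
canon = canonical_space_size(k)   # canonical space
print(total, canon, total - canon)  # prints: 256 136 120
```

So distinct = 136 already covers the entire canonical space for k=4, and missing = 120 is exactly 4^k minus the canonical space size, which is why I would expect missing = 0 if it were reported in canonical space.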