Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR optimizes the kscore function by implementing a reusable bitset-backed approach for counting unique k-mers, falling back to the original set-based implementation for unsupported cases. The optimization improves performance for sequences with standard DNA bases while maintaining correctness for edge cases.
Key changes:
- Implemented a bitset-based k-mer tracking system with caching and memory limits
- Added fallback logic for ambiguous bases, invalid k values, and large k values
- Added comprehensive test coverage for edge cases including ambiguous bases and short sequences
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/heyfastqlib/seqs.py | Implemented optimized bitset-based kscore with fallback mechanisms, base-to-bits encoding map, and bitset caching logic |
| tests/test_seqs.py | Added regression tests for ambiguous base handling and short sequence edge cases |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| bitset = _bitset_for_k(k) | ||
| if bitset is None: | ||
| return _kscore_fallback(seq, k) | ||
|
|
There was a problem hiding this comment.
The magic number calculation (1 << (2 * k)) - 1 lacks explanation. Consider adding a comment explaining this creates a bitmask for k bases where each base uses 2 bits (e.g., for k=4, this creates 0xFF to keep the last 8 bits).
| # Create a bitmask for k bases, where each base uses 2 bits. | |
| # For example, for k=4, this creates 0xFF to keep the last 8 bits. |
| if bits is None: | ||
| use_fallback = True | ||
| break | ||
|
|
There was a problem hiding this comment.
The rolling hash update ((rolling_value << 2) | bits) & mask is complex and lacks documentation. Consider adding a comment explaining this shifts the previous value left by 2 bits (to make room for the new base), ORs in the new base bits, and applies the mask to keep only k bases.
| # Shift the previous rolling value left by 2 bits (to make room for the new base), | |
| # OR in the new base bits, and apply the mask to keep only k bases. |
Summary
kscorewith fallbacks for unsupported casesTesting
https://chatgpt.com/codex/tasks/task_e_68f7f10261948323b291acd25ff21b24