Skip to content

Optimize kscore with reusable bitset#21

Open
Ulthran wants to merge 1 commit intomainfrom
codex/find-efficient-alternatives-for-complexity-filtering
Open

Optimize kscore with reusable bitset#21
Ulthran wants to merge 1 commit intomainfrom
codex/find-efficient-alternatives-for-complexity-filtering

Conversation

@Ulthran
Copy link
Contributor

@Ulthran Ulthran commented Oct 21, 2025

Summary

  • implement a reusable bitset-backed implementation of kscore with fallbacks for unsupported cases
  • add regression tests covering ambiguous bases and short sequences

Testing

  • pytest -q

https://chatgpt.com/codex/tasks/task_e_68f7f10261948323b291acd25ff21b24

Copilot AI review requested due to automatic review settings October 21, 2025 21:54
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR optimizes the kscore function by implementing a reusable bitset-backed approach for counting unique k-mers, falling back to the original set-based implementation for unsupported cases. The optimization improves performance for sequences with standard DNA bases while maintaining correctness for edge cases.

Key changes:

  • Implemented a bitset-based k-mer tracking system with caching and memory limits
  • Added fallback logic for ambiguous bases, invalid k values, and large k values
  • Added comprehensive test coverage for edge cases including ambiguous bases and short sequences

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/heyfastqlib/seqs.py Implemented optimized bitset-based kscore with fallback mechanisms, base-to-bits encoding map, and bitset caching logic
tests/test_seqs.py Added regression tests for ambiguous base handling and short sequence edge cases

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

bitset = _bitset_for_k(k)
if bitset is None:
return _kscore_fallback(seq, k)

Copy link

Copilot AI Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The magic number calculation (1 << (2 * k)) - 1 lacks explanation. Consider adding a comment explaining this creates a bitmask for k bases where each base uses 2 bits (e.g., for k=4, this creates 0xFF to keep the last 8 bits).

Suggested change
# Create a bitmask for k bases, where each base uses 2 bits.
# For example, for k=4, this creates 0xFF to keep the last 8 bits.

Copilot uses AI. Check for mistakes.
if bits is None:
use_fallback = True
break

Copy link

Copilot AI Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rolling hash update ((rolling_value << 2) | bits) & mask is complex and lacks documentation. Consider adding a comment explaining this shifts the previous value left by 2 bits (to make room for the new base), ORs in the new base bits, and applies the mask to keep only k bases.

Suggested change
# Shift the previous rolling value left by 2 bits (to make room for the new base),
# OR in the new base bits, and apply the mask to keep only k bases.

Copilot uses AI. Check for mistakes.
Base automatically changed from parallelize to main October 22, 2025 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants