Skip to content

Conversation

@noamteyssier
Copy link
Collaborator

  • introduces CBQ variant
  • improve API on writing
  • introduce deprecations on older API for reading

@gemini-code-assist
Copy link

Summary of Changes

Hello @noamteyssier, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This release, version 0.9.0, marks a significant evolution for the Binseq library. The primary focus has been on introducing the highly efficient columnar-block-based CBQ format, which offers superior compression and processing for sequencing data. Alongside this, the writing interface has been completely overhauled to provide a unified and more intuitive experience for users, abstracting away format-specific complexities. These changes aim to enhance performance, simplify development, and ensure the library remains at the forefront of bioinformatics data handling.

Highlights

  • New CBQ Variant Introduced: A new columnar-block-based format, CBQ, has been added. This variant offers improved compression, faster processing speeds, and native support for 'N' nucleotides, addressing several pain points of the VBQ format. It is recommended for most new use cases.
  • Unified Writing API: The API for writing records has been significantly improved with the introduction of a unified BinseqWriter enum and a SequencingRecord struct. This provides a consistent and format-agnostic interface for writing data across BQ, VBQ, and CBQ formats, simplifying integration and usage.
  • Deprecation of Older Writing APIs: Older write_record and write_paired_record methods in the BQ and VBQ modules have been deprecated in favor of the new push method which accepts a SequencingRecord. This change promotes the use of the unified API.
  • Enhanced Parallel Processing Capabilities: The ParallelReader trait now includes a validate_range method for more robust range checking. Additionally, the BinseqReader enum now supports the new CBQ format for parallel processing, and MmapReader implementations for BQ, VBQ, and CBQ now manage default quality scores more effectively.
  • Dependency Updates and Rust Edition Upgrade: The project has updated its Rust edition to '2024' and refreshed several dependencies, including bitnuc, itoa, paraseq, parking_lot, sucds, and clap, ensuring compatibility and leveraging the latest features and performance improvements.
  • Refactored Examples and Documentation: The examples directory has been reorganized and updated to reflect the new writing API and the CBQ format. Old examples like parallel_processing.rs and read_write.rs were removed, while new ones like auto-write.rs and write.rs were added to showcase the new BinseqWriterBuilder and FASTX encoding capabilities. The README.md has also been updated to highlight the CBQ variant and the latest preprint.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/ci.yml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new CBQ format, a unified BinseqWriter enum and BinseqWriterBuilder for all BINSEQ formats, and a SequencingRecord with a builder for a more ergonomic writing API. It also includes a new fastx.rs module for efficient FASTX to BINSEQ encoding using parallel processing. The review comments highlight the architectural improvements, the new push method for handling different record types, the robust ColumnarBlockWriter implementation, the specific error reporting with WriteError::ConfigurationMismatch and CbqError, and the zero-copy access provided by RefRecord and RefRecordIter. The reviewer also notes the removal of the context module, the addition of comprehensive unit tests, and the use of clap::Parser for command-line argument parsing. The reviewer also notes that anyhow and paraseq are optional dependencies.

I am having trouble creating individual review comments. Click here to see my feedback.

src/write.rs (1-1510)

critical

The new src/write.rs module is a major architectural improvement. It provides a unified BinseqWriter enum and BinseqWriterBuilder that abstract over all BINSEQ formats (BQ, VBQ, CBQ). This greatly simplifies the API for users, allowing them to write data format-agnostically. The encode_fastx integration with paraseq and the robust ingest mechanism for parallel writing are excellent features. The comprehensive unit tests ensure the correctness and reliability of this new unified API.

src/vbq/writer.rs (471-594)

high

The new push method, which accepts a SequencingRecord, is a significant improvement. It centralizes the logic for handling different record types (single/paired) and ensures consistent validation against the writer's configuration (paired, qualities, headers, flags). This greatly simplifies the writing API for users and reduces potential configuration mismatches.

src/cbq/read.rs (1-636)

high

The MmapReader for CBQ files is well-implemented, leveraging memory mapping and the new indexing structures for efficient parallel processing. The process_parallel_range method correctly distributes blocks and handles record ranges across threads, which is a key feature for high-performance bioinformatics tools. The comprehensive unit tests further validate its correctness.

src/cbq/write.rs (1-181)

high

The ColumnarBlockWriter for CBQ is a robust implementation of the new unified writing API. Its block management, ZSTD compression, and ingest method for parallel writing are well-designed. The push method's validation logic ensures records conform to the writer's configuration, preventing common data integrity issues.

src/error.rs (187-195)

high

The new WriteError::ConfigurationMismatch is a critical addition. It provides clear and specific feedback when a SequencingRecord does not match the writer's configuration, preventing silent data loss or unexpected behavior.

src/cbq/core/block.rs (1-950)

high

The ColumnarBlock implementation is a well-designed core component of the new CBQ format. Its columnar storage, ZSTD compression, and Elias-Fano encoding for N-positions are efficient choices for bioinformatics data. The RefRecord and RefRecordIter provide zero-copy access, which is crucial for performance.

src/bq/writer.rs (436-505)

high

The new push method, which accepts a SequencingRecord, is a significant improvement. It centralizes the logic for handling different record types (single/paired) and ensures consistent validation against the writer's configuration. This greatly simplifies the writing API for users.

src/vbq/writer.rs (448)

high

Marking write_paired_record as #[deprecated] and refactoring it to use self.push(record) is a good step for API evolution. It encourages users to transition to the new, more flexible push method with SequencingRecord, centralizing validation and logic.

src/vbq/writer.rs (436-446)

high

Marking write_record as #[deprecated] and refactoring it to use self.push(record) is a good step for API evolution. It encourages users to transition to the new, more flexible push method with SequencingRecord, centralizing validation and logic.

src/utils/fastx.rs (1-462)

high

The new fastx.rs module, with its FastxEncoderBuilder and Encoder implementing paraseq traits, provides a powerful and efficient way to encode FASTX files into BINSEQ formats. The parallel processing capabilities and automatic sequence length detection are significant improvements for handling large datasets.

src/record/sequencing_record.rs (1-403)

high

The new SequencingRecord and SequencingRecordBuilder are central to the unified writing API. The SequencingRecord's zero-copy design and configured_size_cbq/configured_size_vbq methods are crucial for efficient block management. The builder pattern provides a clean and type-safe way to construct records, significantly improving the ergonomics of the writing API.

src/parallel.rs (160-197)

high

The new validate_range method in the ParallelReader trait is a significant improvement. It centralizes and standardizes range validation logic, preventing common errors like out-of-bounds or inverted ranges, and improving the robustness of parallel processing.

src/error.rs (319-360)

high

The new CbqError enum provides a comprehensive set of error types specific to the CBQ format. This granular error reporting is excellent for debugging and understanding issues related to CBQ block management, headers, and indexing.

src/context/mod.rs (1-8)

medium

The removal of the context module (mod.rs, structs.rs, traits.rs) is a positive refactoring. It simplifies the API by eliminating an unnecessary layer of abstraction and encourages direct interaction with BinseqRecord methods, leading to cleaner and more explicit code.

src/error.rs (363-371)

medium

The FastxEncodingError enum (conditional on paraseq) provides specific error types for issues encountered during FASTX encoding, such as empty files or missing input. This improves error handling for the new encode_fastx functionality.

src/bq/mod.rs (96-102)

medium

Updating the example to use SequencingRecordBuilder and writer.push(record) demonstrates the new, unified writing API. This is a significant improvement over the older write_paired_record methods, offering better type safety and a more ergonomic builder pattern.

src/bq/mod.rs (127-131)

medium

Updating the example to use SequencingRecordBuilder and writer.push(record) demonstrates the new, unified writing API. This is a significant improvement over the older write_record methods, offering better type safety and a more ergonomic builder pattern.

src/bq/reader.rs (42-43)

medium

Adding qbuf: &'a [u8] to RefRecord is a good change. It allows RefRecord to provide a default quality score buffer, ensuring a consistent API even for BQ format which doesn't store explicit quality scores.

src/bq/reader.rs (133-134)

medium

Adding squal() and xqual() methods to the BinseqRecord implementation for RefRecord ensures that even BQ records, which lack explicit quality scores, can return a default quality buffer. This provides a consistent interface across all BINSEQ variants.

src/bq/reader.rs (151-152)

medium

Adding qbuf: &'a [u8] to BatchRecord is a good change. It allows BatchRecord to provide a default quality score buffer, ensuring a consistent API even for BQ format which doesn't store explicit quality scores.

src/bq/reader.rs (233-234)

medium

Adding squal() and xqual() methods to the BinseqRecord implementation for BatchRecord ensures that even BQ records, which lack explicit quality scores, can return a default quality buffer. This provides a consistent interface across all BINSEQ variants.

src/bq/reader.rs (419-423)

medium

Adding qbuf: Vec<u8> and default_quality_score: u8 to MmapReader allows the reader to manage its own reusable quality buffer and configurable default score. This is a good design for consistency and flexibility.

src/bq/reader.rs (504-507)

medium

The set_default_quality_score method is a good addition, allowing users to customize the default quality score for BQ records. Updating qbuf when the score changes ensures consistency.

src/bq/reader.rs (532-536)

medium

Changing ReadError::OutOfRange to a struct variant ({ requested_index: idx, max_index: self.num_records() }) is a significant improvement for error clarity. It makes error messages more descriptive and easier to parse programmatically.

src/bq/reader.rs (552-556)

medium

Changing ReadError::OutOfRange to a struct variant ({ requested_index: range.end, max_index: self.num_records() }) is a significant improvement for error clarity. It makes error messages more descriptive and easier to parse programmatically.

src/bq/reader.rs (591-595)

medium

Adding qbuf: Vec<u8> and default_quality_score: u8 to StreamReader allows the streaming reader to manage its own reusable quality buffer and configurable default score. This is a good design for consistency and flexibility.

src/bq/reader.rs (647-653)

medium

The set_default_quality_score method is a good addition, allowing users to customize the default quality score for BQ records in streaming mode. Updating qbuf when the score changes ensures consistency.

src/bq/reader.rs (894)

medium

Replacing the manual range validation with self.validate_range(num_records, &range)?; is a good refactoring. It centralizes validation logic in the ParallelReader trait, promoting consistency and reducing duplication.

src/bq/reader.rs (925)

medium

Initializing qbuf within the parallel processing thread ensures that each thread has its own quality buffer, preventing data races and improving performance by avoiding shared mutable state.

src/bq/reader.rs (1005-1312)

medium

The addition of comprehensive unit tests for MmapReader, RecordConfig, RefRecord, Parallel Processing, and error handling is excellent. This significantly improves the reliability and maintainability of the BQ reader module.

src/bq/writer.rs (389)

medium

Marking write_record as #[deprecated] is appropriate, as it encourages users to transition to the new, more flexible push method with SequencingRecord. This is a good step for API evolution.

src/bq/writer.rs (416)

medium

Marking write_paired_record as #[deprecated] is appropriate, as it encourages users to transition to the new, more flexible push method with SequencingRecord. This is a good step for API evolution.

src/bq/writer.rs (668)

medium

Marking StreamWriter::write_record as #[deprecated] is appropriate, as it encourages users to transition to the new, more flexible push method with SequencingRecord. This is a good step for API evolution.

src/bq/writer.rs (674)

medium

Marking StreamWriter::write_paired_record as #[deprecated] is appropriate, as it encourages users to transition to the new, more flexible push method with SequencingRecord. This is a good step for API evolution.

src/cbq/core/block_header.rs (1-92)

medium

The BlockHeader struct for CBQ is well-defined, using Pod and Zeroable for efficient serialization. It captures all necessary metadata for a columnar block, including compressed sizes and record counts, which is essential for reading and indexing.

src/cbq/core/header.rs (1-224)

medium

The FileHeader and FileHeaderBuilder for CBQ are robust. The use of presence flags for optional features (paired, qualities, headers, flags) makes the format flexible, and the builder pattern provides a clean API for configuration.

src/cbq/core/index.rs (1-216)

medium

The IndexHeader, IndexFooter, Index, and BlockRange structures provide a solid indexing mechanism for CBQ files. The use of Pod and Zeroable for headers and ranges ensures efficient storage and retrieval, which is critical for random access and parallel processing.

src/cbq/core/utils.rs (1-104)

medium

The utility functions in cbq::core::utils are well-implemented, particularly sized_compress and resize_uninit, which are crucial for efficient ZSTD compression and memory management. The Span struct is a useful abstraction for managing data ranges.

examples/read.rs (115-123)

medium

Replacing manual argument parsing with clap::Parser is a great enhancement. It provides robust, self-documenting command-line argument handling, which is much more maintainable and user-friendly.

src/context/structs.rs (1-221)

medium

The removal of the context module (mod.rs, structs.rs, traits.rs) is a positive refactoring. It simplifies the API by eliminating an unnecessary layer of abstraction and encourages direct interaction with BinseqRecord methods, leading to cleaner and more explicit code.

src/context/traits.rs (1-109)

medium

The removal of the context module (mod.rs, structs.rs, traits.rs) is a positive refactoring. It simplifies the API by eliminating an unnecessary layer of abstraction and encourages direct interaction with BinseqRecord methods, leading to cleaner and more explicit code.

src/error.rs (138-141)

medium

Changing ReadError::OutOfRange to a struct variant ({ requested_index: usize, max_index: usize }) is a significant improvement. It makes error messages more descriptive and easier to parse programmatically, enhancing debuggability.

src/error.rs (197-198)

medium

Adding WriteError::FormatMismatch is important for the unified BinseqWriter::ingest method. It ensures that attempts to merge writers of different formats are correctly identified and reported.

src/error.rs (200-208)

medium

The new WriteError::MissingSequenceLength provides specific error details when required sequence lengths are not provided, which is crucial for formats like BQ that demand fixed lengths.

src/error.rs (270-272)

medium

Adding WriteError::MissingSequence for SequencingRecordBuilder ensures that records are always built with a primary sequence, preventing logical errors down the line.

examples/read.rs (58-60)

medium

Directly using record.xheader(), record.xseq(), and record.xqual() instead of an intermediate self.ctx is a significant improvement. It simplifies the code, reduces boilerplate, and makes the data flow more explicit.

src/bq/mod.rs (62-67)

medium

Updating the example to use SequencingRecordBuilder and writer.push(record) demonstrates the new, unified writing API. This is a significant improvement over the older write_record methods, offering better type safety and a more ergonomic builder pattern.

src/error.rs (414-707)

medium

The addition of extensive unit tests for various error types, including is_index_mismatch, IndexError, HeaderError, ReadError, WriteError, CbqError, BuilderError, ExtensionError, and error conversions, is excellent. This significantly improves the robustness and reliability of error handling throughout the library.

src/parallel.rs (19)

medium

Adding Cbq(cbq::MmapReader) to the BinseqReader enum is essential for integrating the new CBQ variant into the generic reader abstraction, allowing users to process CBQ files uniformly.

src/parallel.rs (47-52)

medium

Adding set_default_quality_score to BinseqReader and delegating it to the specific reader implementations provides a consistent way to configure default quality scores across all BINSEQ formats.

examples/read.rs (49-51)

medium

Directly using record.sheader(), record.sseq(), and record.squal() instead of an intermediate self.ctx is a significant improvement. It simplifies the code, reduces boilerplate, and makes the data flow more explicit.

src/parallel.rs (229-340)

medium

The addition of comprehensive unit tests for BinseqReader and ParallelProcessor across all BINSEQ formats (BQ, VBQ, CBQ) is excellent. It covers parallel processing, range validation, and default quality scores, ensuring the robustness of the parallel API.

src/policy.rs (174-450)

medium

The addition of comprehensive unit tests for the Policy enum, covering all variants (IgnoreSequence, BreakOnInvalid, RandomDraw, SetToA/C/G/T), deterministic behavior with seeding, buffer clearing, and various edge cases, is excellent. This significantly improves the reliability of nucleotide handling.

examples/grep.rs (54-66)

medium

Replacing manual argument parsing with clap::Parser is a great enhancement. It provides robust, self-documenting command-line argument handling, which is much more maintainable and user-friendly.

examples/grep.rs (40)

medium

Directly using record.sseq() and record.xseq() instead of an intermediate self.ctx is a significant improvement. It simplifies the code, reduces boilerplate, and makes the data flow more explicit.

src/vbq/mod.rs (116-122)

medium

Updating the example to use SequencingRecordBuilder and writer.push(record) demonstrates the new, unified writing API. This is a significant improvement over the older write_record methods, offering better type safety and a more ergonomic builder pattern.

src/vbq/reader.rs (131-132)

medium

Adding has_quality: bool to RecordMetadata is a good change. It allows RecordMetadata to explicitly track whether a record has quality scores, which is important for correctly handling quality data during iteration.

src/vbq/reader.rs (188-191)

medium

Adding qbuf: Vec<u8> and default_quality_score: u8 to RecordBlock allows the block to manage its own reusable quality buffer and configurable default score. This is a good design for consistency and flexibility.

src/vbq/reader.rs (227-231)

medium

The set_default_quality_score method is a good addition, allowing users to customize the default quality score for VBQ records. Updating qbuf when the score changes ensures consistency.

src/vbq/reader.rs (546)

medium

Adding qbuf: &'a [u8] to RecordBlockIter is a good change. It allows RecordBlockIter to provide a default quality score buffer, ensuring a consistent API even for VBQ format when quality scores are not explicitly stored.

src/vbq/reader.rs (579-590)

medium

The logic for determining squal and xqual based on meta.has_quality and using the qbuf for default scores is correct. This ensures that the BinseqRecord trait methods always return a valid slice for quality scores, even if they are default values.

src/vbq/reader.rs (813-814)

medium

Adding default_quality_score: u8 to MmapReader allows the reader to manage its own configurable default quality score. This is a good design for consistency and flexibility.

src/vbq/reader.rs (878-880)

medium

The set_default_quality_score method is a good addition, allowing users to customize the default quality score for VBQ records. This improves flexibility for handling quality data.

src/vbq/reader.rs (1295)

medium

Replacing the manual range validation with self.validate_range(total_records, &range)?; is a good refactoring. It centralizes validation logic in the ParallelReader trait, promoting consistency and reducing duplication.

src/vbq/reader.rs (1408-1947)

medium

The addition of comprehensive unit tests for MmapReader, RecordBlock, Parallel Processing, Span, error handling, and index loading is excellent. This significantly improves the reliability and maintainability of the VBQ reader module.

src/vbq/writer.rs (52-58)

medium

Updating the example to use SequencingRecordBuilder and writer.push(record) demonstrates the new, unified writing API. This is a significant improvement over the older write_record methods, offering better type safety and a more ergonomic builder pattern.

src/vbq/writer.rs (61-79)

medium

Removing the record_byte_size and record_byte_size_quality_header helper functions is a good refactoring. Their functionality is now encapsulated within SequencingRecord::configured_size_vbq, centralizing size calculation logic and simplifying the VBinseqWriter.

src/vbq/writer.rs (284-288)

medium

Updating the example to use SequencingRecordBuilder and writer.push(record) demonstrates the new, unified writing API. This is a significant improvement over the older write_record methods, offering better type safety and a more ergonomic builder pattern.

examples/grep.rs (11)

medium

Removing ctx: SeqCtx, from the GrepCounter struct simplifies its state and reduces memory overhead, as the record data is now accessed directly.

Cargo.toml (36-39)

medium

The new [features] section for default, anyhow, and paraseq is well-structured. It clearly defines optional dependencies and allows for flexible build configurations, which is a significant improvement for library users.

Cargo.toml (23-24)

medium

Introducing paraseq and parking_lot as optional dependencies for parallel processing is a good design choice. It allows users to opt-in to parallel features without incurring the overhead if they don't need them.

src/vbq/writer.rs (621-625)

medium

Updating the example to use SequencingRecordBuilder and writer.push(record) demonstrates the new, unified writing API. This is a significant improvement over the older write_record methods, offering better type safety and a more ergonomic builder pattern.

src/vbq/writer.rs (701-705)

medium

Updating the example to use SequencingRecordBuilder and writer.push(record) demonstrates the new, unified writing API. This is a significant improvement over the older write_record methods, offering better type safety and a more ergonomic builder pattern.

src/vbq/writer.rs (878-882)

medium

Refactoring BlockWriter::write_record to accept a &SequencingRecord and encoded buffers (sbuf, xbuf) is a significant improvement. It simplifies the method signature, reduces argument count, and aligns with the new unified API, making the code cleaner and more maintainable.

src/vbq/writer.rs (900-904)

medium

Conditionally writing primary quality scores based on self.has_qualities and record.s_qual ensures that quality data is only written when both the writer is configured for it and the record provides it. This prevents writing unnecessary data or encountering configuration mismatches.

src/vbq/writer.rs (907-911)

medium

Conditionally writing primary header based on self.has_headers and record.s_header ensures that header data is only written when both the writer is configured for it and the record provides it. This prevents writing unnecessary data or encountering configuration mismatches.

src/vbq/writer.rs (920-924)

medium

Conditionally writing extended quality scores based on self.has_qualities and record.x_qual ensures that quality data is only written when both the writer is configured for it and the record provides it. This prevents writing unnecessary data or encountering configuration mismatches.

src/vbq/writer.rs (927-931)

medium

Conditionally writing extended header based on self.has_headers and record.x_header ensures that header data is only written when both the writer is configured for it and the record provides it. This prevents writing unnecessary data or encountering configuration mismatches.

Cargo.toml (14)

medium

Making anyhow an optional dependency is a good practice. It allows users who don't need its generic error handling capabilities to avoid including it in their build, potentially reducing compile times and binary size.

@noamteyssier noamteyssier merged commit da143f1 into main Jan 24, 2026
28 checks passed
@noamteyssier noamteyssier deleted the binseq-0.9.0 branch January 24, 2026 00:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants