Wasatch-Biolabs-Bfx/CH3

CH3 File Format Specification (Version 1.0)

Purpose and Overview

The CH3 file format is a compact, efficient storage format for native base modification calls generated from third-generation sequencing platforms (e.g., Oxford Nanopore). It was developed to drastically reduce storage requirements, streamline downstream analysis, and facilitate integration with the MethylSeqR R package. CH3 is optimized for speed, compression, and extensibility while maintaining compatibility with a broad set of programming tools and environments.

File Container and Encoding

  • Underlying format: Apache Parquet (columnar, binary format)
  • Compression: Zstandard (zstd) is applied during file creation
  • Encoding: Dictionary encoding for categorical fields (e.g., chromosome, modification code)
  • Query support: Parquet's columnar layout and row-group indexing let readers load only the columns and row groups they need, enabling rapid region-based queries

Core Schema

| Column Name | Type | Description | Required |
|---|---|---|---|
| read_id | uuid | Unique identifier for the read | Yes |
| chrom | string | Chromosome or contig name | Yes |
| read_position | uint32 | 0-based read coordinate of modified base | Yes |
| start | int64 | 0-based genomic coordinate of the start of the k-mer | Yes |
| end | int64 | 0-based genomic coordinate of the end of the k-mer | Yes |
| read_length | uint32 | Total length of the read | Yes |
| query_kmer | string | Nucleotide k-mer at the modification site (e.g., CG, A, or the DRACH site) | Yes |
| call_prob | float32 | Call probability at the site (range: 0.0–1.0) | Yes |
| call_code | string | Modification identifier (e.g., m, h, -); extensible for other codes | Yes |
| base_qual | uint8 | Phred-scaled quality score of the base | Yes |
| flag | uint16 | SAMtools-style bit flags. Encodes strand (+/-), alignment status (primary, secondary, supplementary) | Yes |

Schema Notes:

  • The start and end fields give 0-based genomic coordinates for the whole motif rather than for the single modified base. Most modifications occur on both strands, so motif-level coordinates let calls from the two strands collapse onto one locus. For example, if only the C positions of a CpG motif were labeled, the reverse-strand methylation would sit one base downstream of the forward-strand methylation, even though both represent methylation of the same genomic locus.

  • The flag field uses SAMtools-style bit flags packed into a single integer. It indicates:

    • Alignment classification: primary, secondary, supplementary, or unaligned
    • Strand: positive or negative

    Example:

    • 0 = positive strand, primary alignment
    • 16 = negative strand, primary alignment
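The flag values above can be decoded with simple bit tests. The sketch below assumes CH3 uses the standard SAM bit values (0x4 unmapped, 0x10 reverse strand, 0x100 secondary, 0x800 supplementary); the specification only shows 0 and 16 explicitly, so the other bits, along with the names `FlagInfo` and `decode_flag`, are assumptions for illustration.

```python
from typing import NamedTuple

# Bit values taken from the SAM format's FLAG field (assumed to apply to CH3).
FLAG_UNMAPPED = 0x4
FLAG_REVERSE = 0x10
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


class FlagInfo(NamedTuple):
    strand: str      # "+" or "-"
    alignment: str   # "primary", "secondary", "supplementary", or "unaligned"


def decode_flag(flag: int) -> FlagInfo:
    """Split a SAM-style flag into strand and alignment classification."""
    strand = "-" if flag & FLAG_REVERSE else "+"
    if flag & FLAG_UNMAPPED:
        alignment = "unaligned"
    elif flag & FLAG_SUPPLEMENTARY:
        alignment = "supplementary"
    elif flag & FLAG_SECONDARY:
        alignment = "secondary"
    else:
        alignment = "primary"
    return FlagInfo(strand, alignment)
```

With these assumptions, `decode_flag(0)` gives a positive-strand primary alignment and `decode_flag(16)` a negative-strand primary alignment, matching the two examples in the notes.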

Example Entry

| read_id | chrom | read_position | start | end | read_length | query_kmer | call_prob | call_code | base_qual | flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 550e8400-e29b-41d4-a716-446655440000 | chr3 | 452 | 1057320 | 1057322 | 9842 | CG | 0.93 | m | 18 | 16 |
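To make the schema constraints concrete, here is a minimal sketch that holds the example entry in a dataclass and checks it against the documented types and ranges. The names `Ch3Record` and `validate` are hypothetical, and two of the checks (that read_position is smaller than read_length, and that start does not exceed end) are assumptions inferred from the column descriptions rather than rules stated in the specification.

```python
import uuid
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Ch3Record:
    read_id: str
    chrom: str
    read_position: int
    start: int
    end: int
    read_length: int
    query_kmer: str
    call_prob: float
    call_code: str
    base_qual: int
    flag: int


def validate(rec: Ch3Record) -> List[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    try:
        uuid.UUID(rec.read_id)
    except ValueError:
        problems.append("read_id is not a valid UUID")
    if not 0.0 <= rec.call_prob <= 1.0:
        problems.append("call_prob outside 0.0-1.0")
    if not 0 <= rec.base_qual <= 0xFF:
        problems.append("base_qual does not fit uint8")
    if not 0 <= rec.flag <= 0xFFFF:
        problems.append("flag does not fit uint16")
    if not 0 <= rec.read_length <= 0xFFFFFFFF:
        problems.append("read_length does not fit uint32")
    # Assumption: a 0-based read coordinate must fall inside the read.
    if not 0 <= rec.read_position < rec.read_length:
        problems.append("read_position outside the read")
    # Assumption: the motif start never lies after its end.
    if not 0 <= rec.start <= rec.end:
        problems.append("start/end are not an ordered, non-negative pair")
    return problems


# The example entry from the table above.
EXAMPLE = Ch3Record(
    "550e8400-e29b-41d4-a716-446655440000", "chr3", 452,
    1057320, 1057322, 9842, "CG", 0.93, "m", 18, 16,
)
```

Under these checks the example entry passes with no problems reported.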

Reading Columns

  • Programs should only read columns that are needed.
  • Explicitly stating which columns should be read will make the program more robust when optional columns are added.

Extensibility and Custom Columns

  • Users may include additional columns as needed for specific applications (e.g., read group, flow cell ID). Sample names are not included, as the file should only contain one sample.
  • Reserved names must not be overwritten (e.g., read_id, call_code)
  • Columns must use supported Parquet data types and be self-describing
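A writer could guard against accidental overwrites with a check like the one below. It assumes the reserved names are exactly the eleven core schema columns, which the specification implies but does not state outright; the function name `reserved_name_collisions` is hypothetical.

```python
from typing import Iterable, List

# Assumption: the reserved names are the core schema columns.
CORE_COLUMNS = frozenset({
    "read_id", "chrom", "read_position", "start", "end", "read_length",
    "query_kmer", "call_prob", "call_code", "base_qual", "flag",
})


def reserved_name_collisions(custom_columns: Iterable[str]) -> List[str]:
    """Return the custom column names that clash with core columns, sorted."""
    return sorted(name for name in custom_columns if name in CORE_COLUMNS)
```

An empty result means the proposed custom columns are safe to add under this assumption.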

File Naming Convention (Optional)

  • Recommended: <sample_name>.ch3
  • When multiple modifications are tracked, a suffix can be used (e.g., sample_5mC.ch3)

Software Compatibility

  • Write support: MethylSeqR function make_CH3_archive()
  • Read support: MethylSeqR, DuckDB (R/Python/SQL), PyArrow, Pandas, or any other tool that reads Parquet files
  • Platform compatibility: Any system supporting Parquet (Linux, macOS, Windows)

Specification Versioning

  • Current version: CH3 v1.0
  • Versioning uses a two-part MAJOR.MINOR scheme modeled on Semantic Versioning:
    • Major = incompatible changes
    • Minor = backward-compatible updates (e.g., new optional fields)
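One way a reader might apply these rules is to accept any file that shares its major version. This is a sketch under assumptions: the specification does not say where a version string is stored in a file, so both versions are passed in by the caller, and the names `parse_version` and `can_read` are hypothetical.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Parse a 'MAJOR.MINOR' string such as '1.0' into two integers."""
    major, minor = text.split(".")
    return int(major), int(minor)


def can_read(reader_version: str, file_version: str) -> bool:
    """Treat files with the same major version as readable.

    Minor updates only add optional fields, so a reader that selects
    its columns explicitly should be unaffected by them.
    """
    reader_major, _ = parse_version(reader_version)
    file_major, _ = parse_version(file_version)
    return reader_major == file_major
```

Under this policy a v1.0 reader accepts a v1.3 file but rejects a v2.0 file.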

Availability and Licensing

Contact

For questions or contributions, please contact Jonathon T. Hill (jhill@byu.edu)
