The CH3 file format is a compact, efficient storage format for native base modification calls generated from third-generation sequencing platforms (e.g., Oxford Nanopore). It was developed to drastically reduce storage requirements, streamline downstream analysis, and facilitate integration with the MethylSeqR R package. CH3 is optimized for speed, compression, and extensibility while maintaining compatibility with a broad set of programming tools and environments.
- Underlying format: Apache Parquet (columnar, binary format)
- Compression: Zstandard (zstd) is applied during file creation
- Encoding: Dictionary encoding for categorical fields (e.g., chromosome, modification code)
- Query support: Internal column-based access and row group indexing for rapid region-based queries
| Column Name | Type | Description | Required |
|---|---|---|---|
| read_id | uuid | Unique identifier for the read | Yes |
| chrom | string | Chromosome or contig name | Yes |
| read_position | uint32 | 0-based read coordinate of modified base | Yes |
| start | int64 | 0-based genomic coordinate of the start of the k-mer | Yes |
| end | int64 | 0-based genomic coordinate of the end of the k-mer | Yes |
| read_length | uint32 | Total length of the read | Yes |
| query_kmer | string | Nucleotide k-mer at the modification site (e.g., CG, A, or the DRACH site) | Yes |
| call_prob | float32 | Call probability at the site (range: 0.0–1.0) | Yes |
| call_code | string | Modification identifier (e.g., m, h, -); extensible for other codes | Yes |
| base_qual | uint8 | Phred-scaled quality score of the base | Yes |
| flag | uint16 | SAMtools-style bit flags. Encodes strand (+/-), alignment status (primary, secondary, supplementary) | Yes |
Schema Notes:
- The `start` and `end` fields give 0-based genomic coordinates for the whole motif rather than for the modified base alone. Most modifications occur on both strands, so motif-level coordinates let calls from the two strands collapse to a single position. For example, if only the C positions of a CpG motif were recorded, the reverse-strand methylation would fall one base downstream of the forward-strand methylation, even though both represent methylation of the same genomic locus.
- The `flag` field follows SAMtools-style bit flags, compacted into a single integer field. It indicates:
  - Alignment classification: primary, secondary, supplementary, or unaligned
  - Strand: positive or negative

Example:
- `0` = positive strand, primary alignment
- `16` = negative strand, primary alignment
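
The strand-collapsing idea behind `start` and `end` can be sketched as follows. The helper `cpg_motif_interval` is hypothetical, and the sketch assumes a half-open interval `[start, end)`, which matches a two-base CG motif spanning two coordinates but is not stated explicitly in this specification.

```python
def cpg_motif_interval(c_pos: int, strand: str) -> tuple[int, int]:
    """Return (start, end) for the CpG motif containing a modified C.

    c_pos is the 0-based reference coordinate of the modified cytosine.
    The half-open [start, end) convention is an assumption of this sketch.
    """
    if strand == "+":
        start = c_pos        # the C is the first base of the motif
    elif strand == "-":
        start = c_pos - 1    # the C pairs with the G, the second base of the motif
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return start, start + 2


# A forward-strand C and the reverse-strand C one base downstream
# map to the same motif interval.
forward = cpg_motif_interval(1057320, "+")
reverse = cpg_motif_interval(1057321, "-")
print(forward, reverse)
```

Grouping calls by `(chrom, start, end)` then combines evidence from both strands for the same locus.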
| read_id | chrom | read_position | start | end | read_length | query_kmer | call_prob | call_code | base_qual | flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 550e8400-e29b-41d4-a716-446655440000 | chr3 | 452 | 1057320 | 1057322 | 9842 | CG | 0.93 | m | 18 | 16 |
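
The example row above carries `flag` 16, which decodes to a primary alignment on the negative strand. A minimal decoding sketch is shown below. The bit values are those of the SAM specification; that CH3 uses each of them in exactly this way is an assumption, since only the values 0 and 16 are documented here, and `decode_flag` is a hypothetical helper.

```python
# Bit values taken from the SAM specification; their exact use in CH3
# beyond the documented values 0 and 16 is an assumption of this sketch.
UNMAPPED = 0x4
REVERSE = 0x10
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def decode_flag(flag: int) -> dict:
    """Split a SAM-style flag into strand and alignment class."""
    if flag & UNMAPPED:
        alignment = "unaligned"
    elif flag & SECONDARY:
        alignment = "secondary"
    elif flag & SUPPLEMENTARY:
        alignment = "supplementary"
    else:
        alignment = "primary"
    strand = "-" if flag & REVERSE else "+"
    return {"strand": strand, "alignment": alignment}


print(decode_flag(0))
print(decode_flag(16))
```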
- Programs should only read columns that are needed.
- Explicitly stating which columns should be read will make the program more robust when optional columns are added.
- Users may include additional columns as needed for specific applications (e.g., read group, flow cell ID). Sample names are not included, as the file should only contain one sample.
- Reserved names must not be overwritten (e.g., `read_id`, `mod_code`)
- Columns must use supported Parquet data types and be self-describing
- Recommended: `<sample_name>.ch3`
- When multiple modifications are tracked, a suffix can be used (e.g., `sample_5mC.ch3`)
- Write support: MethylSeqR function `make_CH3_archive()`
- Read support: MethylSeqR, DuckDB (R/Python/SQL), PyArrow, Pandas, or any other tool that reads Parquet files
- Platform compatibility: Any system supporting Parquet (Linux, macOS, Windows)
- Current version: CH3 v1.0
- Versioning follows Semantic Versioning: `MAJOR.MINOR`
  - Major = incompatible changes
  - Minor = backward-compatible updates (e.g., new optional fields)
- Specification maintained at: https://github.com/Wasatch-Biolabs-Bfx/MethylSeqR
- License: See the GitHub repository
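
The `MAJOR.MINOR` rule above suggests a simple compatibility check, sketched here. Both helpers are hypothetical, and the sketch takes version strings as plain arguments because this specification does not say where a file records its version. Treating any file with the same major version as readable is an assumption that follows from minor updates being backward-compatible.

```python
def parse_version(text: str) -> tuple[int, int]:
    """Parse a 'MAJOR.MINOR' string, with or without a leading 'v'."""
    major, minor = text.lstrip("v").split(".")
    return int(major), int(minor)


def can_read(reader_version: str, file_version: str) -> bool:
    """Assume a reader handles any file that shares its major version."""
    return parse_version(reader_version)[0] == parse_version(file_version)[0]


print(parse_version("v1.0"))
print(can_read("1.0", "1.3"), can_read("1.0", "2.0"))
```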
For questions or contributions, please contact Jonathon T. Hill (jhill@byu.edu)