Haplotype-resolved expression quantification for allopolyploid genomes
LineageQuant is a bioinformatics tool designed to resolve haplotype-level expression dosage in allopolyploid genomes. The software employs a three-stage computational framework to deconvolve highly mixed RNA-seq signals using subgenome-specific features:
- K-mer Theta engine – Haplotype-specific k-mer probes are constructed from allele sequences to estimate the initial transcriptional abundance prior (Theta) for each allele.
- BAM lineage P-value assignment – Sequencing reads are scanned in multiple dimensions against ancestral subgenome-specific k-mer dictionaries (forward and reverse-complement) to generate a conditional probability matrix (P) describing each read's assignment to a specific haplotype.
- Expectation-Maximization (EM) deconvolution – The EM algorithm integrates the Theta prior and the alignment-based probability P to iteratively reallocate multi-mapping reads until the allele-level expression estimates converge.
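As a rough illustration of the first stage, a haplotype-specific probe can be thought of as a k-mer that occurs in exactly one allele of a cluster. The sketch below is an assumption about how such probes might be built, not LineageQuant's actual implementation. The function names `kmers` and `allele_specific_kmers` and the toy sequences are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all k-mers in a sequence (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def allele_specific_kmers(alleles, k):
    """Map each allele ID to the k-mers found in that allele only.

    `alleles` is a dict of allele ID -> sequence for one allele cluster.
    """
    per_allele = {name: kmers(seq, k) for name, seq in alleles.items()}
    # Count in how many alleles of the cluster each k-mer appears.
    occurrence = Counter(km for kms in per_allele.values() for km in kms)
    return {
        name: {km for km in kms if occurrence[km] == 1}
        for name, kms in per_allele.items()
    }


# Toy cluster: two alleles differing by a single base (hypothetical data).
cluster = {
    "geneA_h1": "ACGTACGTTAGC",
    "geneA_h2": "ACGTACCTTAGC",
}
probes = allele_specific_kmers(cluster, k=5)
```

With a single-base difference and k = 5, each allele keeps only the five k-mers that overlap the variant position. Counting reads that contain those k-mers would give one plausible starting estimate for Theta.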
Additionally, LineageQuant adopts a hybrid quantification strategy that merges parallel count results for single-copy genes into the final output, producing a genome-wide unified TPM expression matrix.
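The merge step can be pictured as follows: allele-level counts and single-copy gene counts are combined into one table before TPM normalisation, so that the scaling factor covers the whole genome. This is a minimal sketch using the standard TPM definition. The gene IDs, counts, and lengths are made up, and the function name `tpm` is hypothetical rather than part of LineageQuant.

```python
def tpm(counts, lengths):
    """Convert read counts to TPM given feature lengths in bases."""
    # Reads per kilobase for each feature.
    rpk = {g: counts[g] / (lengths[g] / 1000.0) for g in counts}
    scale = sum(rpk.values())
    if scale == 0:
        return {g: 0.0 for g in counts}
    return {g: value / scale * 1e6 for g, value in rpk.items()}


# Hypothetical inputs: EM-allocated counts for alleles and
# featureCounts-style counts for single-copy genes.
allele_counts = {"geneA_h1": 120.5, "geneA_h2": 79.5}
single_copy_counts = {"geneB": 300.0}
lengths = {"geneA_h1": 1000, "geneA_h2": 1000, "geneB": 2000}

# Merge both count sources first so TPM is normalised over all features.
merged = {**allele_counts, **single_copy_counts}
tpm_values = tpm(merged, lengths)
```

Normalising after the merge keeps allele-level and single-copy values on the same scale, so each sample's column sums to one million.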
LineageQuant natively supports dynamic multi-dimensional (≥ 2) joint quantification across ancestral lineages, making it particularly well-suited for transcriptomic studies in complex allopolyploid crops such as sugarcane, wheat, and cotton.
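To make the multi-lineage idea concrete, the sketch below scores one read against any number of lineage k-mer sets, on both strands, and normalises the hit counts into a probability vector. It is an illustrative assumption, not the tool's actual scoring rule. The names `reverse_complement` and `lineage_probabilities`, the `pseudo` smoothing parameter, and the toy k-mer sets are all hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(comp)[::-1]


def lineage_probabilities(read, dictionaries, k, pseudo=0.01):
    """Return {lineage: probability} for one read.

    `dictionaries` maps lineage name -> set of lineage-specific k-mers.
    Works for any number of lineages (two or more).
    """
    hits = {name: 0 for name in dictionaries}
    # Scan the read and its reverse complement.
    for strand in (read.upper(), reverse_complement(read)):
        for i in range(len(strand) - k + 1):
            km = strand[i:i + k]
            for name, kmer_set in dictionaries.items():
                if km in kmer_set:
                    hits[name] += 1
    # The pseudo-count (an assumption) keeps every probability non-zero.
    total = sum(hits.values()) + pseudo * len(hits)
    return {name: (n + pseudo) / total for name, n in hits.items()}


# Three toy lineages with k = 4 (hypothetical k-mer sets).
dictionaries = {
    "SubA": {"AACC", "ACCG"},
    "SubB": {"TTGA"},
    "SubC": {"GGGG"},
}
forward_read = "AACCG"
reverse_read = reverse_complement(forward_read)

p_forward = lineage_probabilities(forward_read, dictionaries, k=4)
p_reverse = lineage_probabilities(reverse_read, dictionaries, k=4)
```

Both reads yield the same vector because the scan covers both strands. Adding a fourth lineage only requires another entry in `dictionaries`.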
FASTQ (R1 + R2)
│
▼ Stage 0 – HISAT2 + featureCounts
Sorted BAM + raw count table
│
▼ Stage 1 – K-mer Theta engine
Theta_Matrix.tsv (per-allele initial abundance)
│
▼ Stage 2 – BAM lineage P-value assignment
P_values.tsv (per-read lineage probability vectors)
│
▼ Stage 3 – EM deconvolution
EM_Final_Expression.tsv (allele-resolved allocated read counts + Theta)
│
▼ Stage 4 – TPM matrix aggregation
LineageQuant_Final_Merged_TPM.tsv
All stages support checkpoint/resume: if a stage's output file already exists, that stage is skipped automatically.
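The skip-if-present behaviour can be sketched in a few lines. This is a generic pattern, not code taken from LineageQuant. `run_stage` and `write_dummy_theta` are hypothetical names, and writing to a temporary file before renaming is an added design choice here, not something the tool is known to do.

```python
from pathlib import Path
import tempfile


def run_stage(output_path, stage_fn):
    """Run `stage_fn(path)` unless the output file already exists.

    Returns True if the stage ran, False if it was skipped.
    """
    output_path = Path(output_path)
    if output_path.exists():
        return False
    # Write to a temporary name first so an interrupted run does not leave
    # a partial file that a later run would mistake for a finished stage.
    tmp_path = output_path.with_name(output_path.name + ".tmp")
    stage_fn(tmp_path)
    tmp_path.replace(output_path)
    return True


def write_dummy_theta(path):
    # Stand-in for a real stage; the file content is illustrative only.
    Path(path).write_text("allele\ttheta\ngeneA_h1\t0.6\n")


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "Theta_Matrix.tsv"
    first = run_stage(target, write_dummy_theta)   # runs the stage
    second = run_stage(target, write_dummy_theta)  # skipped: file exists
```

Under this pattern, deleting a stage's output file is enough to force that stage to run again.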
git clone https://github.com/<your-username>/LineageQuant.git
cd LineageQuant
conda env create -f environment.yml
conda activate lineagequant
pip install -e .

pip install biopython pysam pandas
# also install: hisat2, samtools, subread (featureCounts), gffread
pip install -e .

lineagequant \
--gff /path/to/genome.gff3 \
--fasta /path/to/genome.fasta \
--clusters /path/to/allele_clusters.tsv \
--gene-ids /path/to/target_gene_ids.txt \
--ancestry /path/to/gene_ancestry.tsv \
--orphan-genes /path/to/single_copy_genes.tsv \
--ancestors SubA:subA_k15.fa SubB:subB_k15.fa SubC:subC_k15.fa \
--data-dir /path/to/fastq_directory \
--out-dir /path/to/results \
--threads 16

| Argument | Format description |
|---|---|
| --gff | Standard GFF3; genes annotated with type=gene and ID= attribute |
| --fasta | Indexed whole-genome FASTA (.fai index optional, gffread will use it) |
| --clusters | Tab-separated: ClusterID<TAB>Allele1,Allele2,... (one cluster per line) |
| --gene-ids | Plain text, one gene ID per line |
| --ancestry | Tab-separated: GeneID<TAB>LineageName |
| --orphan-genes | Same format as --clusters; lists single-copy genes to quantify with featureCounts |
| --ancestors | NAME:PATH pairs, one per ancestral lineage (≥2 required); PATH points to a FASTA file where each record is a single k-mer sequence |
| FASTQ | Paired-end, gzipped, named <sample>.R1.fq.gz / <sample>.R2.fq.gz |
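For readers preparing input files, the following sketch shows one way the cluster table and the NAME:PATH lineage specifications described above could be parsed and sanity-checked before a run. The helper names `parse_clusters` and `parse_ancestors` are hypothetical and are not part of the LineageQuant package. The cluster IDs and allele names are invented for the example.

```python
import io


def parse_clusters(handle):
    """Parse 'ClusterID<TAB>Allele1,Allele2,...' lines into a dict."""
    clusters = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        cluster_id, alleles = line.split("\t", 1)
        clusters[cluster_id] = [a.strip() for a in alleles.split(",") if a.strip()]
    return clusters


def parse_ancestors(specs):
    """Parse NAME:PATH strings into {name: path}; require two or more lineages."""
    ancestors = {}
    for spec in specs:
        # Split on the first colon only, so the path may contain colons.
        name, sep, path = spec.partition(":")
        if not sep or not name or not path:
            raise ValueError(f"expected NAME:PATH, got {spec!r}")
        ancestors[name] = path
    if len(ancestors) < 2:
        raise ValueError("at least two ancestral lineages are required")
    return ancestors


# Hypothetical cluster file content.
clusters_text = (
    "Cluster1\tgeneA_h1,geneA_h2\n"
    "Cluster2\tgeneC_h1,geneC_h2,geneC_h3\n"
)
clusters = parse_clusters(io.StringIO(clusters_text))
ancestors = parse_ancestors(
    ["SubA:subA_k15.fa", "SubB:subB_k15.fa", "SubC:subC_k15.fa"]
)
```

Running a check like this before the pipeline starts can catch a malformed cluster line or a missing lineage early, before alignment time is spent.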
Per-sample (inside <out_dir>/<sample>/):
| File | Description |
|---|---|
| <sample>.sorted.bam | HISAT2-aligned, sorted BAM |
| <sample>_featureCounts.txt | Raw featureCounts table |
| master_index.pkl | K-mer probe index checkpoint (Stage 1) |
| Theta_Matrix.tsv | Per-allele initial Theta values |
| P_values.tsv | Per-read lineage probability vectors |
| EM_Final_Expression.tsv | EM-allocated read counts and final Theta |
Final merged output (inside <out_dir>/):
| File | Description |
|---|---|
| LineageQuant_Final_Merged_TPM.tsv | Whole-genome TPM matrix across all samples |
Reference files (required):
--gff Whole-genome GFF3 annotation
--fasta Whole-genome reference FASTA
--clusters Allele cluster TSV
--gene-ids Target gene ID list
--ancestry Gene ancestry table
--orphan-genes Orphan gene list
--ancestors Lineage k-mer FASTA files: NAME:PATH [NAME:PATH ...]
Input / Output:
-d, --data-dir FASTQ directory (default: ./data)
-o, --out-dir Results directory (default: ./results)
--index-prefix HISAT2 index prefix (default: <data-dir>/genome_index)
Runtime:
-t, --threads CPU threads (default: 16)
--kmer K-mer size for Theta engine (default: 31)
--lineage-kmer K-mer size for lineage files (default: 15)
--epsilon EM Laplace pseudo-count (default: 0.01)
--max-iter Maximum EM iterations (default: 200)
--tol EM convergence tolerance (default: 1e-5)
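The three EM options above map naturally onto a textbook EM loop for read reallocation. The sketch below is one common formulation, offered under the assumption that `--epsilon` acts as a Laplace pseudo-count in the abundance update and `--tol` bounds the largest change in Theta between iterations. LineageQuant's actual update rule may differ, and `em_deconvolve` and the toy reads are hypothetical.

```python
def em_deconvolve(read_probs, theta0, epsilon=0.01, max_iter=200, tol=1e-5):
    """Reallocate reads among alleles with a simple EM loop.

    read_probs: list of {allele: P(read | allele)} dicts, one per read.
    theta0:     {allele: initial abundance}, the prior.
    Returns (allocated_counts, final_theta, iterations_used).
    """
    alleles = list(theta0)
    total = sum(theta0.values())
    if total > 0:
        theta = {a: theta0[a] / total for a in alleles}
    else:
        theta = {a: 1.0 / len(alleles) for a in alleles}
    counts = {a: 0.0 for a in alleles}
    iteration = 0
    for iteration in range(1, max_iter + 1):
        counts = {a: 0.0 for a in alleles}
        # E-step: split each read in proportion to theta * P.
        for probs in read_probs:
            weights = {a: theta[a] * p for a, p in probs.items()}
            norm = sum(weights.values())
            if norm == 0:
                continue
            for a, w in weights.items():
                counts[a] += w / norm
        # M-step: re-estimate theta with a Laplace pseudo-count.
        denom = sum(counts.values()) + epsilon * len(alleles)
        new_theta = {a: (counts[a] + epsilon) / denom for a in alleles}
        delta = max(abs(new_theta[a] - theta[a]) for a in alleles)
        theta = new_theta
        if delta < tol:
            break
    return counts, theta, iteration


# Toy data: 6 reads unique to h1, 2 unique to h2, 4 ambiguous.
reads = (
    [{"geneA_h1": 1.0}] * 6
    + [{"geneA_h2": 1.0}] * 2
    + [{"geneA_h1": 0.5, "geneA_h2": 0.5}] * 4
)
counts, theta, n_iter = em_deconvolve(reads, {"geneA_h1": 0.5, "geneA_h2": 0.5})
```

In this toy case the ambiguous reads drift toward the allele with more unique support, and the loop stops well before the 200-iteration cap. A larger epsilon pulls Theta toward uniform, while a smaller tol trades extra iterations for a tighter fixed point.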
Yi Chen, Gengrui Zhu
This project is licensed under a Non-Commercial Research License.
Free to use for academic and research purposes only. Commercial use is strictly prohibited.
See the LICENSE file for details.