LineageQuant

Haplotype-resolved expression quantification for allopolyploid genomes

Description

LineageQuant is a bioinformatics tool designed to resolve haplotype-level expression dosage in allopolyploid genomes. The software employs a three-stage computational framework to deconvolve highly mixed RNA-seq signals using subgenome-specific features:

K-mer Theta engine – Haplotype-specific k-mer probes are constructed from allele sequences to estimate the initial transcriptional abundance prior (Theta) for each allele.
BAM lineage P-value assignment – Sequencing reads are scanned in multiple dimensions against ancestral subgenome-specific k-mer dictionaries (forward and reverse-complement) to generate a conditional probability matrix (P) describing each read's assignment to a specific haplotype.
Expectation-Maximization (EM) deconvolution – The EM algorithm integrates the Theta prior and the alignment-based probability P to iteratively reallocate multi-mapping reads, ultimately converging on precise allele-level expression values.

Additionally, LineageQuant adopts a hybrid quantification strategy that merges parallel count results for single-copy genes into the final output, producing a genome-wide unified TPM expression matrix.

LineageQuant natively supports dynamic multi-dimensional (≥ 2) joint quantification across ancestral lineages, making it particularly well-suited for transcriptomic studies in complex allopolyploid crops such as sugarcane, wheat, and cotton.

Pipeline overview

FASTQ (R1 + R2)
     │
     ▼  Stage 0 – HISAT2 + featureCounts
  Sorted BAM  +  raw count table
     │
     ▼  Stage 1 – K-mer Theta engine
  Theta_Matrix.tsv          (per-allele initial abundance)
     │
     ▼  Stage 2 – BAM lineage P-value assignment
  P_values.tsv              (per-read lineage probability vectors)
     │
     ▼  Stage 3 – EM deconvolution
  EM_Final_Expression.tsv   (allele-resolved allocated read counts + Theta)
     │
     ▼  Stage 4 – TPM matrix aggregation
  LineageQuant_Final_Merged_TPM.tsv

All stages support checkpoint / resume: if an output file already exists it is skipped automatically.

Installation

Option A – Conda (recommended)

git clone https://github.com/<your-username>/LineageQuant.git
cd LineageQuant
conda env create -f environment.yml
conda activate lineagequant
pip install -e .

Option B – pip (tools must be installed separately)

pip install biopython pysam pandas
# also install: hisat2, samtools, subread (featureCounts), gffread
pip install -e .

Quick start

lineagequant \
  --gff        /path/to/genome.gff3 \
  --fasta      /path/to/genome.fasta \
  --clusters   /path/to/allele_clusters.tsv \
  --gene-ids   /path/to/target_gene_ids.txt \
  --ancestry   /path/to/gene_ancestry.tsv \
  --orphan-genes /path/to/single_copy_genes.tsv \
  --ancestors  SubA:subA_k15.fa  SubB:subB_k15.fa  SubC:subC_k15.fa \
  --data-dir   /path/to/fastq_directory \
  --out-dir    /path/to/results \
  --threads    16

Input file formats

Argument	Format description
`--gff`	Standard GFF3; genes annotated with `type=gene` and `ID=` attribute
`--fasta`	Indexed whole-genome FASTA (`.fai` index optional, `gffread` will use it)
`--clusters`	Tab-separated: `ClusterID<TAB>Allele1,Allele2,...` (one cluster per line)
`--gene-ids`	Plain text, one gene ID per line
`--ancestry`	Tab-separated: `GeneID<TAB>LineageName`
`--orphan-genes`	Same format as `--clusters`; lists single-copy genes to quantify with featureCounts
`--ancestors`	`NAME:PATH` pairs, one per ancestral lineage (≥2 required); PATH points to a FASTA file where each record is a single k-mer sequence
FASTQ	Paired-end, gzipped, named `<sample>.R1.fq.gz` / `<sample>.R2.fq.gz`

Output files

Per-sample (inside <out_dir>/<sample>/):

File	Description
`<sample>.sorted.bam`	HISAT2-aligned, sorted BAM
`<sample>_featureCounts.txt`	Raw featureCounts table
`master_index.pkl`	K-mer probe index checkpoint (Stage 1)
`Theta_Matrix.tsv`	Per-allele initial Theta values
`P_values.tsv`	Per-read lineage probability vectors
`EM_Final_Expression.tsv`	EM-allocated read counts and final Theta

Final merged output (inside <out_dir>/):

File	Description
`LineageQuant_Final_Merged_TPM.tsv`	Whole-genome TPM matrix across all samples

All parameters

Reference files (required):
  --gff            Whole-genome GFF3 annotation
  --fasta          Whole-genome reference FASTA
  --clusters       Allele cluster TSV
  --gene-ids       Target gene ID list
  --ancestry       Gene ancestry table
  --orphan-genes   Orphan gene list
  --ancestors      Lineage k-mer FASTA files: NAME:PATH [NAME:PATH ...]

Input / Output:
  -d, --data-dir   FASTQ directory  (default: ./data)
  -o, --out-dir    Results directory (default: ./results)
  --index-prefix   HISAT2 index prefix (default: <data-dir>/genome_index)

Runtime:
  -t, --threads    CPU threads                (default: 16)
  --kmer           K-mer size for Theta engine (default: 31)
  --lineage-kmer   K-mer size for lineage files (default: 15)
  --epsilon        EM Laplace pseudo-count     (default: 0.01)
  --max-iter       Maximum EM iterations       (default: 200)
  --tol            EM convergence tolerance    (default: 1e-5)

Authors

Yi Chen, Gengrui Zhu

License

This project is licensed under a Non-Commercial Research License.
Free to use for academic and research purposes only. Commercial use is strictly prohibited.
See the LICENSE file for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LineageQuant

Description

Pipeline overview

Installation

Option A – Conda (recommended)

Option B – pip (tools must be installed separately)

Quick start

Input file formats

Output files

All parameters

Authors

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
lineagequant		lineagequant
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

LineageQuant

Description

Pipeline overview

Installation

Option A – Conda (recommended)

Option B – pip (tools must be installed separately)

Quick start

Input file formats

Output files

All parameters

Authors

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages