This repository contains the code for the experiments and the CGR-Diff software proposed in our paper.
The dataset used in our paper includes genomes from various kingdoms, organized into three subsets based on the type of analysis performed: Reference (Subset 1), Intergenomic Analysis (Subset 2), and Intragenomic Analysis (Subset 3). Below is a summary of the datasets used in our experiments, along with links to their corresponding assemblies in the NCBI database.
| Dataset | Kingdom | Species (Common Name) | Assembly Link | Length (Mbp) | % N |
|---|---|---|---|---|---|
| Reference (Subset 1) | Animalia | Homo sapiens (human) | GCA_009914755.4 | 3117 | 0 |
| Intergenomic Analysis (Subset 2) | Animalia | Pan troglodytes (chimpanzee) | GCA_028858775.2 | 3178 | 0.16 |
| Intergenomic Analysis (Subset 2) | Animalia | Mus musculus (house mouse) | GCA_000001635.9 | 2723 | 2.7 |
| Intergenomic Analysis (Subset 2) | Animalia | Drosophila melanogaster (fruit fly) | GCA_000001215.4 | 80 | 0.57 |
| Intergenomic Analysis (Subset 2) | Fungi | Saccharomyces cerevisiae (yeast) | GCA_000146045.2 | 12 | 0 |
| Intergenomic Analysis (Subset 2) | Plantae | Arabidopsis thaliana (thale cress) | GCA_000001735.2 | 119 | 0.16 |
| Intergenomic Analysis (Subset 2) | Protista | Paramecium caudatum📌 | GCA_000715435.1 | 30 | 2.16 |
| Intergenomic Analysis (Subset 2) | Archaea | Pyrococcus furiosus | GCA_008245085.1 | 2 | 0 |
| Intergenomic Analysis (Subset 2) | Bacteria | Escherichia coli | GCA_000005845.2 | 5 | 0 |
| Intragenomic Analysis (Subset 3) | Fungi | Aspergillus nidulans | GCA_000011425.1 | 30 | 0.04 |
| Intragenomic Analysis (Subset 3) | Plantae | Zea mays (maize) | GCA_022117705.1 | 2179 | 0 |
| Intragenomic Analysis (Subset 3) | Protista | Dictyostelium discoideum | GCA_000004695.1 | 34 | 0.07 |
📌 Among all the species in this table, the assembly for Paramecium caudatum is at the scaffold level.
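
The "Length" and "% N" columns can be sanity-checked on any downloaded FASTA file. The sketch below is not part of the repository: it assumes that "% N" means the share of `N` characters among all sequence characters, and the function name `fasta_stats` is made up for illustration.

```python
def fasta_stats(lines):
    """Return (total_length, percent_n) for an iterable of FASTA lines.

    Assumption: "% N" is the percentage of N characters among all
    sequence characters; header lines (starting with ">") are ignored.
    """
    total = 0
    n_count = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        total += len(line)
        n_count += line.upper().count("N")
    percent_n = 100.0 * n_count / total if total else 0.0
    return total, percent_n


# Tiny in-memory example; for a real file, pass an open file handle instead.
example = [">chr_demo", "ACGTNNAC", "GGTTNNNN", ""]
length, pct = fasta_stats(example)
print(f"length = {length} bp, N content = {pct:.2f}%")
```

For a real chromosome, `with open("Data/Human/chromosomes/chr1.fna") as fh: print(fasta_stats(fh))` gives the values for that single file. Summing over all files of a species should come close to the table, provided the assumption above holds.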
- Clone this repository and install the required libraries by running:
# Clone the repository
git clone https://github.com/Niousha12/Intragenomic_analysis.git
cd Intragenomic_analysis
# Create a virtual environment and install requirements
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

- Download the assemblies for each species from the NCBI website, and organize the data in the `Data` folder to match the following structure:
📂 Data/
├── 📂 Human/
│   ├── 📂 chromosomes/
│   │   ├── chr1.fna
│   │   └── ...
│   └── 📂 bedfiles/
│       ├── cytobands.bed
│       ├── telomere.bed
│       └── centromere.bed
├── 📂 Chimp/
│   └── 📂 chromosomes/
│       ├── chr1.fna
│       └── ...
├── 📂 Mouse/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Drosophila melanogaster/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Saccharomyces cerevisiae/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Arabidopsis thaliana/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Paramecium caudatum/
│   └── 📂 chromosomes/
│       └── ....fna (These are scaffolds)
├── 📂 Pyrococcus furiosus/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Escherichia coli/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Aspergillus nidulans/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Maize/
│   └── 📂 chromosomes/
│       └── ....fna
└── 📂 Dictyostelium discoideum/
    └── 📂 chromosomes/
        └── ....fna
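
Before running the experiments, it can help to confirm that the folder layout matches the tree above. The following is a minimal sketch, not a script shipped with the repository. The helper name `summarize_data_folder` is hypothetical, and it only assumes what the tree shows: one folder per species, each with a `chromosomes/` subfolder holding `.fna` files.

```python
from pathlib import Path


def summarize_data_folder(root):
    """Map each species folder under `root` to the number of .fna files
    found in its chromosomes/ subfolder (0 if the subfolder is missing)."""
    summary = {}
    for species_dir in sorted(Path(root).iterdir()):
        if not species_dir.is_dir():
            continue  # ignore stray files next to the species folders
        chrom_dir = species_dir / "chromosomes"
        if chrom_dir.is_dir():
            summary[species_dir.name] = len(list(chrom_dir.glob("*.fna")))
        else:
            summary[species_dir.name] = 0
    return summary


if __name__ == "__main__":
    root = Path("Data")
    if root.is_dir():
        for species, count in summarize_data_folder(root).items():
            status = "ok" if count else "no .fna files found"
            print(f"{species}: {count} .fna file(s) [{status}]")
    else:
        print("Data/ folder not found in the current directory")
```

Run it from the repository root. A species reported with zero files usually means the archives were not decompressed or were placed one directory level off.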
Instead of downloading manually from the NCBI website, for Human, Chimp, Mouse, and Maize you can use the following commands to download the chromosome assemblies directly from the NCBI FTP server.
# Homo sapiens (human)
wget -P Data/Human/chromosomes/ -r -nH --cut-dirs=12 --no-parent \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/GCA_009914755.4_T2T-CHM13v2.0/GCA_009914755.4_T2T-CHM13v2.0_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
gunzip Data/Human/chromosomes/*.gz
# Pan troglodytes (chimpanzee)
wget -P Data/Chimp/chromosomes/ -r -nH --cut-dirs=12 --no-parent \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/028/858/775/GCA_028858775.2_NHGRI_mPanTro3-v2.0_pri/GCA_028858775.2_NHGRI_mPanTro3-v2.0_pri_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
gunzip Data/Chimp/chromosomes/*.gz
# Mus musculus (house mouse)
wget -P Data/Mouse/chromosomes/ -r -nH --cut-dirs=12 --no-parent \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.9_GRCm39/GCA_000001635.9_GRCm39_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
gunzip Data/Mouse/chromosomes/*.gz
# Zea mays (maize)
wget -P Data/Maize/chromosomes/ -r -nH --cut-dirs=12 --no-parent \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/022/117/705/GCA_022117705.1_Zm-Mo17-REFERENCE-CAU-T2T-assembly/GCA_022117705.1_Zm-Mo17-REFERENCE-CAU-T2T-assembly_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
gunzip Data/Maize/chromosomes/*.gz

You can also download the complete datasets and bedfiles used in the paper from Google Drive or Zenodo.
The bedfiles in this dataset were processed from the original CHM13 dataset provided by the CHM13 GitHub repository. In addition, the color of each cytoband region in the `cytobands.bed` file was added based on the NCBI Genome Data Viewer.
Please cite both the original dataset and this repository when using this processed dataset.
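
The four NCBI FTP paths used in the download commands above follow a regular pattern: the nine digits of the accession are split into groups of three, followed by a folder named `<accession>_<assembly name>`. The sketch below rebuilds such a path. The function name `ncbi_fasta_dir` is hypothetical, and the pattern has only been checked against the four URLs in this README. Assemblies that are not at chromosome level may not have an `assembled_chromosomes` folder at all.

```python
FTP_ROOT = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all"


def ncbi_fasta_dir(accession, assembly_name):
    """Build the FTP directory holding per-chromosome FASTA files.

    Assumption: the layout matches the URLs listed in this README
    (chromosome-level GenBank assemblies with a Primary_Assembly unit).
    """
    prefix, rest = accession.split("_", 1)      # "GCA", "009914755.4"
    digits = rest.split(".", 1)[0]              # "009914755"
    groups = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    folder = f"{accession}_{assembly_name}"
    return (
        f"{FTP_ROOT}/{prefix}/{'/'.join(groups)}/{folder}/"
        f"{folder}_assembly_structure/Primary_Assembly/"
        "assembled_chromosomes/FASTA/"
    )


# Accessions and assembly names taken from the commands in this README.
for acc, name in [
    ("GCA_009914755.4", "T2T-CHM13v2.0"),
    ("GCA_000001635.9", "GRCm39"),
]:
    print(ncbi_fasta_dir(acc, name))
```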
- Run the following commands to replicate each experiment:
# Experiment 1: Pervasive Nature of Genomic Signatures
python -m scripts.Experiment_1 --species <species_name>
# Example: python -m scripts.Experiment_1 --species Human
# Experiment 2: Distance Selection
python -m scripts.Experiment_2 --Experiment_type <intragenomic(Exp 2.1) | intergenomic(Exp 2.2)>
# Example: python -m scripts.Experiment_2 --Experiment_type intragenomic
# Experiment 3: Intragenomic Variation
python -m scripts.Experiment_3 --species <species_name> --plot_approximate True --plot_random_outliers True --plot_MDS True
# Example: python -m scripts.Experiment_3 --species Human --plot_approximate True --plot_random_outliers True --plot_MDS True
# Example: python -m scripts.Experiment_3 --species Maize --plot_approximate True --plot_random_outliers False --plot_MDS True
# Experiment 4: Taxonomic Classification
python -m scripts.Experiment_4 --Experiment_type <all | no_chimp>
# Example: python -m scripts.Experiment_4 --Experiment_type allThis minimal example runs the intragenomic variation analysis (Experiment 3) on human chromosome 1 and includes three sub-experiments (Exp 3.1, Exp 3.2, and Exp 3.3). In each sub-experiment, a representative segment is selected using a different strategy — Exp 3.1 uses RepSeg, Exp 3.2 uses aRepSeg, and Exp 3.3 selects a random segment from tandem repeat regions — after which the distances of non-overlapping consecutive segments to that representative are computed and visualized.
You can run the example by executing the following commands:
### Clone the repository ---------------------------------------------------
git clone https://github.com/Niousha12/Intragenomic_analysis.git
cd Intragenomic_analysis
### (Optional) Create and activate a virtual environment -------------------
python -m venv venv
source venv/bin/activate
### Install dependencies ---------------------------------------------------
pip install -r requirements.txt
### Download chromosome 1 (Human, T2T-CHM13v2.0) ---------------------------
wget -P Data/Human/chromosomes/ \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/GCA_009914755.4_T2T-CHM13v2.0/GCA_009914755.4_T2T-CHM13v2.0_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/chr1.fna.gz
gunzip Data/Human/chromosomes/chr1.fna.gz
### Run the intragenomic analysis (Experiment 3) ---------------------------
python -m scripts.Experiment_3 \
--species "Human" \
--root_path "Data" \
--k_mer 6 \
--segment_length 500000 \
--distance_metric "DSSIM" \
--chromosome_name "1" \
--plot_approximate True \
--plot_random_outliers True \
--n_value 30 \
--get_MAE_aRepSeg False \
--get_MAE_random_outliers False \
--get_threshold False \
--threshold_value 0.24 \
--plot_MDS False

💡 Note:
- This is a lightweight demonstration meant to verify installation and functionality.
- You can replace `--chromosome_name "1"` with any other chromosome (e.g., "2", "9", "X") to analyze different chromosomes of the human genome.
- Runtime depends on chromosome length, but generally ranges from ~1 to 5 minutes per chromosome on a standard laptop equipped with an 8-core CPU and 16 GB RAM.
- Chromosome 1 (248,387,328 bp, the longest chromosome) takes approximately 4 min 24 s for the complete experiment. The timing breaks down as follows:

  | Sub-experiment | Time (min:s) |
  |---|---|
  | Exp 3.1 | 2:29.87 |
  | Exp 3.2 | 1:00.64 |
  | Exp 3.3 | 0:53.47 |
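
To make the idea behind Experiment 3 concrete, here is a simplified sketch of "distance of each non-overlapping segment to a representative segment". It is not the repository's implementation. It swaps in a plain Manhattan distance between normalized k-mer frequency vectors in place of DSSIM, and it takes the representative as a caller-supplied index (`rep_index`, a hypothetical parameter) instead of computing RepSeg or aRepSeg. All function names are illustrative.

```python
from collections import Counter
from itertools import product


def split_segments(seq, segment_length):
    """Cut seq into non-overlapping consecutive segments.
    A trailing partial segment is dropped."""
    n = len(seq) // segment_length
    return [seq[i * segment_length:(i + 1) * segment_length] for i in range(n)]


def kmer_frequencies(segment, k):
    """Normalized k-mer frequency vector over the ACGT alphabet.
    k-mers containing other symbols (e.g. N) are ignored."""
    counts = Counter(segment[i:i + k] for i in range(len(segment) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def manhattan(u, v):
    """Stand-in distance; the README's example command uses DSSIM instead."""
    return sum(abs(a - b) for a, b in zip(u, v))


def distances_to_representative(seq, segment_length, k, rep_index=0):
    """Distance of every segment to the segment at rep_index."""
    segments = split_segments(seq.upper(), segment_length)
    vectors = [kmer_frequencies(s, k) for s in segments]
    rep = vectors[rep_index]
    return [manhattan(v, rep) for v in vectors]


# Toy sequence: two identical ACGT-repeat segments around a poly-A segment.
toy = "ACGT" * 50 + "AAAA" * 50 + "ACGT" * 50
print(distances_to_representative(toy, segment_length=200, k=2))
```

In the toy run, the first and third segments have zero distance to the representative, while the poly-A segment in the middle stands out. Segments that deviate from the rest of a chromosome show up the same way in the real experiment, which uses `--segment_length 500000` and `--k_mer 6` in the example command.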
To run the CGR-Diff software, you can use the following command:
python GUI.py

Alternatively, you can download and run the executable file from the following links.
You can also download the video tutorial from the following link:
