Intragenomic_analysis

This repository contains the code for the experiments and the CGR-Diff software proposed in our paper.

Dataset

The dataset used in our paper includes genomes from various kingdoms, organized into three subsets based on the type of analysis performed: Reference (Subset 1), Intergenomic Analysis (Subset 2), and Intragenomic Analysis (Subset 3). Below is a summary of the datasets used in our experiments, along with links to their corresponding assemblies in the NCBI database.

Dataset	Kingdom	Species (Common Name)	Assembly Link	Length (Mbp)	% N
Reference (Subset 1)	Animalia	Homo sapiens (human)	GCA_009914755.4	3117	0
Intergenomic Analysis (Subset 2)	Animalia	Pan troglodytes (chimpanzee)	GCA_028858775.2	3178	0.16
		Mus musculus (house mouse)	GCA_000001635.9	2723	2.7
		Drosophila melanogaster (fruit fly)	GCA_000001215.4	80	0.57
	Fungi	Saccharomyces cerevisiae (yeast)	GCA_000146045.2	12	0
	Plantae	Arabidopsis thaliana (thale cress)	GCA_000001735.2	119	0.16
	Protista	Paramecium caudatum^📌	GCA_000715435.1	30	2.16
	Archaea	Pyrococcus furiosus	GCA_008245085.1	2	0
	Bacteria	Escherichia coli	GCA_000005845.2	5	0
Intragenomic Analysis (Subset 3)	Fungi	Aspergillus nidulans	GCA_000011425.1	30	0.04
	Plantae	Zea mays (maize)	GCA_022117705.1	2179	0
	Protista	Dictyostelium discoideum	GCA_000004695.1	34	0.07

^📌 Of the assemblies in this table, only the one for Paramecium caudatum is at the scaffold level.

Replicate the experiments of our paper

Clone this repository and install the required libraries by running:

# Clone the repository
git clone https://github.com/Niousha12/Intragenomic_analysis.git
cd Intragenomic_analysis

# Create a virtual environment and install requirements
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Download the assemblies for each species from the NCBI website, and organize the data in the Data folder to match the following structure:

📂 Data/
├── 📂 Human/
│   ├── 📂 chromosomes/
│   │   ├── chr1.fna
│   │   └── ...
│   └── 📂 bedfiles/
│       ├── cytobands.bed
│       ├── telomere.bed
│       └── centromere.bed
├── 📂 Chimp/
│   └── 📂 chromosomes/
│       ├── chr1.fna
│       └── ...
├── 📂 Mouse/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Drosophila melanogaster/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Saccharomyces cerevisiae/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Arabidopsis thaliana/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Paramecium caudatum/
│   └── 📂 chromosomes/
│       └── ....fna (These are scaffolds)
├── 📂 Pyrococcus furiosus/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Escherichia coli/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Aspergillus nidulans/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Maize/
│   └── 📂 chromosomes/
│       └── ....fna
└── 📂 Dictyostelium discoideum/
    └── 📂 chromosomes/
        └── ....fna

Alternatively, for Human, Chimp, Mouse, and Maize, you can use the following commands to download the chromosome assemblies directly from the NCBI FTP server.

# Homo sapiens (human)
wget -P Data/Human/chromosomes/ -r -nH --cut-dirs=12 --no-parent \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/GCA_009914755.4_T2T-CHM13v2.0/GCA_009914755.4_T2T-CHM13v2.0_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
gunzip Data/Human/chromosomes/*.gz

# Pan troglodytes (chimpanzee)
wget -P Data/Chimp/chromosomes/ -r -nH --cut-dirs=12 --no-parent \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/028/858/775/GCA_028858775.2_NHGRI_mPanTro3-v2.0_pri/GCA_028858775.2_NHGRI_mPanTro3-v2.0_pri_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
gunzip Data/Chimp/chromosomes/*.gz

# Mus musculus (house mouse)
wget -P Data/Mouse/chromosomes/ -r -nH --cut-dirs=12 --no-parent \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.9_GRCm39/GCA_000001635.9_GRCm39_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
gunzip Data/Mouse/chromosomes/*.gz

# Zea mays (maize)
wget -P Data/Maize/chromosomes/ -r -nH --cut-dirs=12 --no-parent \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/022/117/705/GCA_022117705.1_Zm-Mo17-REFERENCE-CAU-T2T-assembly/GCA_022117705.1_Zm-Mo17-REFERENCE-CAU-T2T-assembly_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
gunzip Data/Maize/chromosomes/*.gz
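
The four FTP URLs above share one pattern: the nine digits of the accession are split into three-digit folders, followed by a folder named after the accession and the assembly name. The sketch below rebuilds that path so the pattern is easier to see. The function names are hypothetical and not part of this repository, and the pattern is inferred only from the URLs shown above; other assemblies (for example, scaffold-level ones) may not provide an assembled_chromosomes folder.

```python
import re

NCBI_FTP_ROOT = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all"


def assembly_ftp_dir(accession: str, assembly_name: str) -> str:
    """Build the assembly folder URL, e.g. for GCA_009914755.4 and T2T-CHM13v2.0."""
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.\d+", accession)
    if match is None:
        raise ValueError(f"Unexpected accession format: {accession!r}")
    prefix, digits = match.groups()
    # Split the nine digits into three folders: 009914755 -> 009/914/755
    triplets = [digits[i:i + 3] for i in range(0, 9, 3)]
    folder = f"{accession}_{assembly_name}"
    return "/".join([NCBI_FTP_ROOT, prefix, *triplets, folder])


def chromosomes_fasta_dir(accession: str, assembly_name: str) -> str:
    """Append the sub-path used by the wget commands above (assumed layout)."""
    base = assembly_ftp_dir(accession, assembly_name)
    folder = f"{accession}_{assembly_name}"
    return (
        f"{base}/{folder}_assembly_structure/"
        "Primary_Assembly/assembled_chromosomes/FASTA/"
    )


# Reproduces the human URL used in the wget command above.
print(chromosomes_fasta_dir("GCA_009914755.4", "T2T-CHM13v2.0"))
```
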

You can also download the complete datasets and bedfiles used in the paper from Google Drive or Zenodo.

The bedfiles in this dataset were processed from the original CHM13 dataset provided by the CHM13 GitHub repository. In addition, the cytobands.bed file includes a color for each cytoband region, taken from the NCBI Genome Data Viewer. Please cite both the original dataset and this repository when using this processed dataset.
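
Before running the experiments, it can help to confirm that the Data folder matches the layout described above. The following is a minimal sketch of such a check; check_data_layout is a hypothetical helper that is not part of this repository, and it only tests for the folder names, the .fna extension, and the three human bedfiles listed in the tree.

```python
from pathlib import Path

# Folder names copied from the directory tree in this README.
EXPECTED_SPECIES = [
    "Human", "Chimp", "Mouse", "Drosophila melanogaster",
    "Saccharomyces cerevisiae", "Arabidopsis thaliana",
    "Paramecium caudatum", "Pyrococcus furiosus", "Escherichia coli",
    "Aspergillus nidulans", "Maize", "Dictyostelium discoideum",
]
HUMAN_BEDFILES = ["cytobands.bed", "telomere.bed", "centromere.bed"]


def check_data_layout(root="Data"):
    """Return a list of problems found; an empty list means the layout looks right."""
    root = Path(root)
    problems = []
    for species in EXPECTED_SPECIES:
        chrom_dir = root / species / "chromosomes"
        if not chrom_dir.is_dir():
            problems.append(f"missing folder: {chrom_dir}")
        elif not any(chrom_dir.glob("*.fna")):
            # Still-compressed .fna.gz files are reported here as well.
            problems.append(f"no .fna files in: {chrom_dir}")
    for name in HUMAN_BEDFILES:
        bed_file = root / "Human" / "bedfiles" / name
        if not bed_file.is_file():
            problems.append(f"missing file: {bed_file}")
    return problems
```

Calling check_data_layout("Data") from the repository root and printing the returned list shows what is still missing. If you only plan to run a single species, the other entries in the list can be ignored.
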

Run the following commands to replicate the experiments:

# Experiment 1: Pervasive Nature of Genomic Signatures
python -m scripts.Experiment_1 --species <species_name>
# Example: python -m scripts.Experiment_1 --species Human

# Experiment 2: Distance Selection
python -m scripts.Experiment_2 --Experiment_type <intragenomic(Exp 2.1) | intergenomic(Exp 2.2)>
# Example: python -m scripts.Experiment_2 --Experiment_type intragenomic

# Experiment 3: Intragenomic Variation
python -m scripts.Experiment_3 --species <species_name> --plot_approximate True --plot_random_outliers True --plot_MDS True
# Example: python -m scripts.Experiment_3 --species Human --plot_approximate True --plot_random_outliers True --plot_MDS True
# Example: python -m scripts.Experiment_3 --species Maize --plot_approximate True --plot_random_outliers False --plot_MDS True

# Experiment 4: Taxonomic Classification
python -m scripts.Experiment_4 --Experiment_type <all | no_chimp>
# Example: python -m scripts.Experiment_4 --Experiment_type all

Quick Example

This minimal example runs the intragenomic variation analysis (Experiment 3) on human chromosome 1. It comprises three sub-experiments (Exp 3.1, Exp 3.2, and Exp 3.3), each of which selects a representative segment using a different strategy: Exp 3.1 uses RepSeg, Exp 3.2 uses aRepSeg, and Exp 3.3 selects a random segment from tandem repeat regions. In each sub-experiment, the distances from the non-overlapping consecutive segments to that representative are then computed and visualized.
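
To make the "segments versus representative" idea concrete, here is a small self-contained sketch. It is not the repository's implementation: the representative is simply picked by index rather than by RepSeg or aRepSeg, the Manhattan distance stands in for DSSIM, and dropping a shorter trailing segment is an assumption. Each segment is summarized by its normalized k-mer frequency vector, which holds the same counts that a frequency CGR arranges as a 2^k × 2^k image. All function names are hypothetical.

```python
from itertools import product


def split_segments(sequence, segment_length):
    """Non-overlapping consecutive segments; a shorter remainder is dropped (assumption)."""
    n_full = len(sequence) // segment_length
    return [
        sequence[i * segment_length:(i + 1) * segment_length]
        for i in range(n_full)
    ]


def kmer_frequencies(segment, k):
    """Normalized k-mer frequencies over ACGT; k-mers with other symbols (e.g. N) are skipped."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    for start in range(len(segment) - k + 1):
        position = index.get(segment[start:start + k])
        if position is not None:
            counts[position] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def manhattan(u, v):
    """Stand-in distance; the Quick Example itself uses DSSIM."""
    return sum(abs(a - b) for a, b in zip(u, v))


def distances_to_representative(sequence, segment_length, k, representative_index):
    """Distance of every segment to the segment chosen as representative."""
    segments = split_segments(sequence.upper(), segment_length)
    vectors = [kmer_frequencies(segment, k) for segment in segments]
    representative = vectors[representative_index]
    return [manhattan(vector, representative) for vector in vectors]


# Toy sequence: one mixed segment followed by one homopolymer segment.
toy = "ACGT" * 5 + "A" * 20
print(distances_to_representative(toy, segment_length=20, k=2, representative_index=0))
```

Under the same drop-the-remainder assumption, chromosome 1 (248,387,328 bp) with a segment length of 500,000 would yield 496 full segments.
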

You can run the example by executing the following commands:

### Clone the repository ---------------------------------------------------
git clone https://github.com/Niousha12/Intragenomic_analysis.git
cd Intragenomic_analysis

### (Optional) Create and activate a virtual environment -------------------
python -m venv venv
source venv/bin/activate

### Install dependencies ---------------------------------------------------
pip install -r requirements.txt

### Download chromosome 1 (Human, T2T-CHM13v2.0) ---------------------------
wget -P Data/Human/chromosomes/ \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/GCA_009914755.4_T2T-CHM13v2.0/GCA_009914755.4_T2T-CHM13v2.0_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/chr1.fna.gz
gunzip Data/Human/chromosomes/chr1.fna.gz

### Run the intragenomic analysis (Experiment 3) ---------------------------
python -m scripts.Experiment_3 \
  --species "Human" \
  --root_path "Data" \
  --k_mer 6 \
  --segment_length 500000 \
  --distance_metric "DSSIM" \
  --chromosome_name "1" \
  --plot_approximate True \
  --plot_random_outliers True \
  --n_value 30 \
  --get_MAE_aRepSeg False \
  --get_MAE_random_outliers False \
  --get_threshold False \
  --threshold_value 0.24 \
  --plot_MDS False

💡 Note:

This is a lightweight demonstration meant to verify installation and functionality.
You can replace --chromosome_name "1" with any other chromosome (e.g., "2", "9", "X") to analyze different chromosomes of the human genome.
Runtime depends on chromosome length, but generally ranges from ~1 to 5 minutes per chromosome on a standard laptop equipped with an 8-core CPU and 16 GB RAM.
- Chromosome 1 (248,387,328 bp, the longest chromosome) takes approximately 4 min 24 s for the complete experiment. The breakdown of the timing is as follows:

Sub-experiment	Time (min:s)
Exp 3.1	2:29.87
Exp 3.2	1:00.64
Exp 3.3	0:53.47
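
If you want to repeat the Quick Example for several chromosomes, the command can be wrapped in a small loop. The sketch below only reuses the flags and values shown in the Quick Example; the two function names are hypothetical helpers, not part of this repository, and the script must be run from the repository root with the virtual environment active.

```python
import subprocess
import sys


def experiment_3_command(chromosome_name, species="Human", root_path="Data"):
    """Build the Quick Example command for one chromosome."""
    return [
        sys.executable, "-m", "scripts.Experiment_3",
        "--species", species,
        "--root_path", root_path,
        "--k_mer", "6",
        "--segment_length", "500000",
        "--distance_metric", "DSSIM",
        "--chromosome_name", chromosome_name,
        "--plot_approximate", "True",
        "--plot_random_outliers", "True",
        "--n_value", "30",
        "--get_MAE_aRepSeg", "False",
        "--get_MAE_random_outliers", "False",
        "--get_threshold", "False",
        "--threshold_value", "0.24",
        "--plot_MDS", "False",
    ]


def run_chromosomes(chromosome_names):
    """Run the analysis one chromosome at a time, stopping at the first failure."""
    for name in chromosome_names:
        subprocess.run(experiment_3_command(name), check=True)


# Example call (requires the matching chrN.fna files under Data/Human/chromosomes/):
# run_chromosomes(["2", "9", "X"])
```

Running the chromosomes sequentially keeps memory use close to that of a single run, which suits the laptop configuration mentioned above.
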

CGR-Diff

To run the CGR-Diff software, you can use the following command:

python GUI.py

Alternatively, you can download and run the executable file from the following links.

You can also download the video tutorial from the following link:

CGR-Diff Video Tutorial
