Intragenomic_analysis

This repository contains the code for the experiments and the CGR-Diff software proposed in our paper.

Experiments

Dataset

The dataset used in our paper includes genomes from various kingdoms, organized into three subsets based on the type of analysis performed: Reference (Subset 1), Intergenomic Analysis (Subset 2), and Intragenomic Analysis (Subset 3). Below is a summary of the datasets used in our experiments, along with links to their corresponding assemblies in the NCBI database.

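| Dataset | Kingdom | Species (Common Name) | Assembly Link | Length (Mbp) | % N |
|---|---|---|---|---|---|
| Reference (Subset 1) | Animalia | Homo sapiens (human) | GCA_009914755.4 | 3117 | 0 |
| Intergenomic Analysis (Subset 2) | Animalia | Pan troglodytes (chimpanzee) | GCA_028858775.2 | 3178 | 0.16 |
| Intergenomic Analysis (Subset 2) | Animalia | Mus musculus (house mouse) | GCA_000001635.9 | 2723 | 2.7 |
| Intergenomic Analysis (Subset 2) | Animalia | Drosophila melanogaster (fruit fly) | GCA_000001215.4 | 80 | 0.57 |
| Intergenomic Analysis (Subset 2) | Fungi | Saccharomyces cerevisiae (yeast) | GCA_000146045.2 | 12 | 0 |
| Intergenomic Analysis (Subset 2) | Plantae | Arabidopsis thaliana (thale cress) | GCA_000001735.2 | 119 | 0.16 |
| Intergenomic Analysis (Subset 2) | Protista | Paramecium caudatum📌 | GCA_000715435.1 | 30 | 2.16 |
| Intergenomic Analysis (Subset 2) | Archaea | Pyrococcus furiosus | GCA_008245085.1 | 2 | 0 |
| Intergenomic Analysis (Subset 2) | Bacteria | Escherichia coli | GCA_000005845.2 | 5 | 0 |
| Intragenomic Analysis (Subset 3) | Fungi | Aspergillus nidulans | GCA_000011425.1 | 30 | 0.04 |
| Intragenomic Analysis (Subset 3) | Plantae | Zea mays (maize) | GCA_022117705.1 | 2179 | 0 |
| Intragenomic Analysis (Subset 3) | Protista | Dictyostelium discoideum | GCA_000004695.1 | 34 | 0.07 |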

📌 Among all the species in this table, the assembly for Paramecium caudatum is at the scaffold level.
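
As a sanity check after downloading an assembly, the Length and % N columns can be approximated directly from a FASTA file. The sketch below is illustrative only: the function name `fasta_length_and_n_percent` is hypothetical and not part of this repository, and it assumes that % N means the share of `N` characters among all sequence characters, which may differ from how the table values were derived.

```python
import io


def fasta_length_and_n_percent(handle):
    """Return (total bases, percent of bases that are N) for a FASTA stream.

    Hypothetical helper: assumes % N = 100 * (#N characters) / (#all bases).
    """
    total = 0
    n_count = 0
    for line in handle:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        total += len(line)
        n_count += line.upper().count("N")
    percent_n = 100.0 * n_count / total if total else 0.0
    return total, percent_n


# Toy in-memory record; for real data, pass open("Data/Human/chromosomes/chr1.fna").
example = io.StringIO(">chrDemo\nACGTNNAC\nGGNT\n")
length, percent_n = fasta_length_and_n_percent(example)
print(f"{length} bp, {percent_n:.2f}% N")
```

Summing the lengths over all chromosome files of a species and dividing by 1,000,000 gives a value comparable to the Length (Mbp) column.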

Replicate the experiments of our paper

  1. Clone this repository and install the required libraries by running:
# Clone the repository
git clone https://github.com/Niousha12/Intragenomic_analysis.git
cd Intragenomic_analysis

# Create a virtual environment and install requirements
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
  2. Download the assemblies for each species from the NCBI website, and organize the data in the Data folder to match the following structure:
📂 Data/
├── 📂 Human/
│   ├── 📂 chromosomes/
│   │   ├── chr1.fna
│   │   └── ...
│   └── 📂 bedfiles/
│       ├── cytobands.bed
│       ├── telomere.bed
│       └── centromere.bed
├── 📂 Chimp/
│   └── 📂 chromosomes/
│       ├── chr1.fna
│       └── ...
├── 📂 Mouse/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Drosophila melanogaster/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Saccharomyces cerevisiae/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Arabidopsis thaliana/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Paramecium caudatum/
│   └── 📂 chromosomes/
│       └── ....fna (These are scaffolds)
├── 📂 Pyrococcus furiosus/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Escherichia coli/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Aspergillus nidulans/
│   └── 📂 chromosomes/
│       └── ....fna
├── 📂 Maize/
│   └── 📂 chromosomes/
│       └── ....fna
└── 📂 Dictyostelium discoideum/
    └── 📂 chromosomes/
        └── ....fna
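
Before running the experiments, it can help to confirm that the Data folder matches the layout above. The following is a minimal sketch, not part of the repository: `missing_chromosome_dirs` and `EXPECTED_SPECIES` are hypothetical names, and the check only looks for at least one `.fna` file in each `chromosomes/` folder.

```python
import tempfile
from pathlib import Path

# Folder names copied from the directory tree above.
EXPECTED_SPECIES = [
    "Human", "Chimp", "Mouse", "Drosophila melanogaster",
    "Saccharomyces cerevisiae", "Arabidopsis thaliana",
    "Paramecium caudatum", "Pyrococcus furiosus", "Escherichia coli",
    "Aspergillus nidulans", "Maize", "Dictyostelium discoideum",
]


def missing_chromosome_dirs(root, species=EXPECTED_SPECIES):
    """Return species whose chromosomes/ folder is absent or holds no .fna file."""
    root = Path(root)
    missing = []
    for name in species:
        chrom_dir = root / name / "chromosomes"
        if not chrom_dir.is_dir() or not any(chrom_dir.glob("*.fna")):
            missing.append(name)
    return missing


# Demo on a throwaway directory so the snippet runs without real data.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "Data" / "Human" / "chromosomes"
    demo_dir.mkdir(parents=True)
    (demo_dir / "chr1.fna").write_text(">chr1\nACGT\n")
    missing = missing_chromosome_dirs(Path(tmp) / "Data")

print("Missing:", missing)
```

On a real checkout, calling `missing_chromosome_dirs("Data")` from the repository root lists the species that still need to be downloaded.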

Alternatively, for Human, Chimp, Mouse, and Maize, you can use the following commands to download the chromosome assemblies directly from the NCBI FTP server.

# Homo sapiens (human)
wget -P Data/Human/chromosomes/ -r -nH --cut-dirs=12 --no-parent \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/GCA_009914755.4_T2T-CHM13v2.0/GCA_009914755.4_T2T-CHM13v2.0_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
gunzip Data/Human/chromosomes/*.gz

# Pan troglodytes (chimpanzee)
wget -P Data/Chimp/chromosomes/ -r -nH --cut-dirs=12 --no-parent \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/028/858/775/GCA_028858775.2_NHGRI_mPanTro3-v2.0_pri/GCA_028858775.2_NHGRI_mPanTro3-v2.0_pri_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
gunzip Data/Chimp/chromosomes/*.gz

# Mus musculus (house mouse)
wget -P Data/Mouse/chromosomes/ -r -nH --cut-dirs=12 --no-parent \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.9_GRCm39/GCA_000001635.9_GRCm39_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
gunzip Data/Mouse/chromosomes/*.gz

# Zea mays (maize)
wget -P Data/Maize/chromosomes/ -r -nH --cut-dirs=12 --no-parent \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/022/117/705/GCA_022117705.1_Zm-Mo17-REFERENCE-CAU-T2T-assembly/GCA_022117705.1_Zm-Mo17-REFERENCE-CAU-T2T-assembly_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
gunzip Data/Maize/chromosomes/*.gz

You can also download the complete datasets and bedfiles used in the paper from Google Drive or Zenodo.

The bedfiles in this dataset were processed from the original CHM13 dataset provided by the CHM13 GitHub repository. In addition, the color of each cytoband region in the cytobands.bed file was added based on the NCBI Genome Data Viewer. Please cite both the original dataset and this repository when using this processed dataset.

  3. Run the following commands to replicate each experiment:
# Experiment 1: Pervasive Nature of Genomic Signatures
python -m scripts.Experiment_1 --species <species_name>
# Example: python -m scripts.Experiment_1 --species Human

# Experiment 2: Distance Selection
python -m scripts.Experiment_2 --Experiment_type <intragenomic(Exp 2.1) | intergenomic(Exp 2.2)>
# Example: python -m scripts.Experiment_2 --Experiment_type intragenomic

# Experiment 3: Intragenomic Variation
python -m scripts.Experiment_3 --species <species_name> --plot_approximate True --plot_random_outliers True --plot_MDS True
# Example: python -m scripts.Experiment_3 --species Human --plot_approximate True --plot_random_outliers True --plot_MDS True
# Example: python -m scripts.Experiment_3 --species Maize --plot_approximate True --plot_random_outliers False --plot_MDS True

# Experiment 4: Taxonomic Classification
python -m scripts.Experiment_4 --Experiment_type <all | no_chimp>
# Example: python -m scripts.Experiment_4 --Experiment_type all

Quick Example

This minimal example runs the intragenomic variation analysis (Experiment 3) on human chromosome 1. It includes three sub-experiments (Exp 3.1, Exp 3.2, and Exp 3.3), each of which selects a representative segment using a different strategy: Exp 3.1 uses RepSeg, Exp 3.2 uses aRepSeg, and Exp 3.3 selects a random segment from tandem repeat regions. The distances of all non-overlapping consecutive segments to that representative are then computed and visualized.
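
To make the idea of "distances of non-overlapping consecutive segments to a representative" concrete, here is a small self-contained sketch. It is not the repository's implementation. All function names are hypothetical, the representative is simply passed in by index (the RepSeg and aRepSeg selection strategies are not reproduced), and a halved Manhattan distance between normalized k-mer frequency vectors stands in for the DSSIM metric used in the actual experiment. Dropping a trailing partial segment is also an assumption.

```python
from collections import Counter
from itertools import product


def kmer_profile(segment, k):
    """Normalized k-mer frequencies over ACGT; k-mers with other letters are ignored."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(segment[i:i + k] for i in range(len(segment) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def split_segments(sequence, segment_length):
    """Non-overlapping consecutive segments; a trailing partial segment is dropped (assumption)."""
    last_start = len(sequence) - segment_length
    return [sequence[i:i + segment_length]
            for i in range(0, last_start + 1, segment_length)]


def distances_to_representative(sequence, segment_length, k, rep_index):
    """Distance of every segment to the segment at rep_index.

    Uses half the Manhattan distance between k-mer profiles (range 0 to 1),
    a stand-in for DSSIM chosen only to keep the sketch dependency-free.
    """
    segments = split_segments(sequence.upper(), segment_length)
    profiles = [kmer_profile(s, k) for s in segments]
    rep = profiles[rep_index]
    return [0.5 * sum(abs(a - b) for a, b in zip(p, rep)) for p in profiles]


# Toy sequence: two similar segments followed by a compositionally different one.
toy = "ACGT" * 5 + "ACGT" * 5 + "AAAA" * 5
dists = distances_to_representative(toy, segment_length=20, k=2, rep_index=0)
print(dists)
```

In the toy sequence, the first two segments share the same composition as the representative, while the third segment shares no 2-mers with it and therefore lies at the maximum distance. The real experiment applies the same pattern with `--segment_length 500000` and `--k_mer 6`.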

You can run the example by executing the following commands:

### Clone the repository ---------------------------------------------------
git clone https://github.com/Niousha12/Intragenomic_analysis.git
cd Intragenomic_analysis

### (Optional) Create and activate a virtual environment -------------------
python -m venv venv
source venv/bin/activate

### Install dependencies ---------------------------------------------------
pip install -r requirements.txt

### Download chromosome 1 (Human, T2T-CHM13v2.0) ---------------------------
wget -P Data/Human/chromosomes/ \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/GCA_009914755.4_T2T-CHM13v2.0/GCA_009914755.4_T2T-CHM13v2.0_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/chr1.fna.gz
gunzip Data/Human/chromosomes/chr1.fna.gz

### Run the intragenomic analysis (Experiment 3) ---------------------------
python -m scripts.Experiment_3 \
  --species "Human" \
  --root_path "Data" \
  --k_mer 6 \
  --segment_length 500000 \
  --distance_metric "DSSIM" \
  --chromosome_name "1" \
  --plot_approximate True \
  --plot_random_outliers True \
  --n_value 30 \
  --get_MAE_aRepSeg False \
  --get_MAE_random_outliers False \
  --get_threshold False \
  --threshold_value 0.24 \
  --plot_MDS False

💡 Note:

  • This is a lightweight demonstration meant to verify installation and functionality.
  • You can replace --chromosome_name "1" with any other chromosome (e.g., "2", "9", "X") to analyze different chromosomes of the human genome.
  • Runtime depends on chromosome length, but generally ranges from ~1 to 5 minutes per chromosome on a standard laptop equipped with an 8-core CPU and 16 GB RAM.
    • Chromosome 1 (248,387,328 bp, the longest chromosome) takes approximately 4 min 24 s for the complete experiment. The breakdown of the timing is as follows:

      | Sub-experiment | Time (min:s) |
      |---|---|
      | Exp 3.1 | 2:29.87 |
      | Exp 3.2 | 1:00.64 |
      | Exp 3.3 | 0:53.47 |

CGR-Diff

To run the CGR-Diff software, you can use the following command:

python GUI.py

Alternatively, you can download and run the executable file from the following links.

You can also download the video tutorial from the following link:
