Skip to content

Latest commit

 

History

History
110 lines (79 loc) · 3.52 KB

File metadata and controls

110 lines (79 loc) · 3.52 KB

XR-seq Analysis Pipeline (C. elegans)

This repository contains a modular pipeline for processing and analyzing XR-seq (excision repair sequencing) data from raw .fastq.gz files through to simulation-normalized metaprofiles and quality control visualizations.

Repository Structure

xrseq_pipeline/
├── XR_Seq.sh                   # Main XR-seq processing pipeline (adapter trim → BAM/BED → filtering)
├── fordinucleotide.sh          # Script to generate fasta and count dinucleotide composition
├── bw.sh                       # Script to generate strand-separated simulation-normalized bigWig files
├── fa2kmerAbundanceTable.py    # k-mer abundance calculator
├── fasta.py                    # FASTA parsing utilities
├── sequence.py                 # Sequence filtering for dipyrimidine reads
├── exact_dam_site.py           # Locating the exact damage site in 2bp resolution 
├── plot_dinucleotide.R         # Dinucleotide positional enrichment barplot script
├── plot_monomer_rld.R          # Monomer frequency and read-length distribution plot

Requirements

  • Modules or tools:
    • cutadapt, fastx_toolkit, bowtie2, samtools, bedtools, ucsctools
    • Python ≥ 3.6 with BioPython
    • R with ggplot2, wesanderson, ggthemes, dplyr, tidyr

Pipeline Overview

1. Core Processing (XR_Seq.sh)

  • Input: .fastq.gz files
  • Steps:
    • Adapter trimming (cutadapt)
    • Read collapsing (fastx_collapser)
    • Genome alignment (bowtie2)
    • Conversion to BED format (samtools → bedtools)
    • Length filtering to retain reads [21–28 nt]
    • FASTA extraction from BED (bedtools)
    • Read count and length distribution generation

2. Dinucleotide Enrichment Analysis

Run:

bash fordinucleotide.sh

This prepares filtered FASTA files and performs dinucleotide counting for TT/TC/CT/CC at each position across reads.

Plot using:

Rscript plot_dinucleotide.R

3. Monomer Profile & Length Distribution

Run monomer.py to compute position-specific nucleotide ratios across filtered reads. Then visualize with:

Rscript plot_monomer_rld.R sample_name

This generates:

  • [sample]_monomer_graph.png
  • [sample]_rld_graph.png

4. Strand-Specific bigWig Files (bw.sh)

Generates plus- and minus-strand bigWig files normalized to RPM:

  • Removes mitochondrial DNA
  • Uses bedtools genomecov and bedGraphToBigWig

5. Locating the exact damage site in 2bp resolution

Run:

python3 exact_dam_site.py \
  --input merged_bed \
  --output merged_bed_exact_dam_site
  • Used only for nucleosome occupancy and dyad-centered analyses (Figure 5g,h and 6a,b)
  • Both input and output should be directories.
  • Inside merged_bed directory, stranded bed files and their simulations should be stored (see Notes below).

Output Files

  • *_filtered.bed: final XR-seq reads, 21–28 nt
  • *_readCount.txt: total filtered reads
  • *_plus.bw, *_minus.bw: RPM-normalized bigWigs
  • *_monomer_graph.png, *_rld_graph.png: QC plots
  • *_get2ker.csv: dinucleotide abundance

Notes

  • Genome index used: WBcel235 (ce11)
  • Adapted for use on HPC environments with SLURM
  • filtered.bed for two biological repeats can be merged and deposited in a new folder "merged_bed"
  • for xrseq around histone markers: stranded xrseq merged bed files generated by strandedmergedbedfiles.sh
  • simulations applied to merged.bed

Author

Cansu Kose, 2025 – UNC Chapel Hill | Sancar Lab Cem Azgari, 2025 – Sabancı University | Adebali Lab