This repository contains a modular pipeline for processing and analyzing XR-seq (excision repair sequencing) data from raw .fastq.gz files through to simulation-normalized metaprofiles and quality control visualizations.
xrseq_pipeline/
├── XR_Seq.sh # Main XR-seq processing pipeline (adapter trim → BAM/BED → filtering)
├── fordinucleotide.sh # Script to generate fasta and count dinucleotide composition
├── bw.sh # Script to generate strand-separated simulation-normalized bigWig files
├── fa2kmerAbundanceTable.py # k-mer abundance calculator
├── fasta.py # FASTA parsing utilities
├── sequence.py # Sequence filtering for dipyrimidine reads
├── exact_dam_site.py # Locating the exact damage site in 2bp resolution
├── plot_dinucleotide.R # Dinucleotide positional enrichment barplot script
├── plot_monomer_rld.R # Monomer frequency and read-length distribution plot
- Modules or tools:
cutadapt,fastx_toolkit,bowtie2,samtools,bedtools,ucsctools- Python ≥ 3.6 with
BioPython - R with
ggplot2,wesanderson,ggthemes,dplyr,tidyr
- Input:
.fastq.gzfiles - Steps:
- Adapter trimming (cutadapt)
- Read collapsing (fastx_collapser)
- Genome alignment (bowtie2)
- Conversion to BED format (samtools → bedtools)
- Length filtering to retain reads [21–28 nt]
- FASTA extraction from BED (bedtools)
- Read count and length distribution generation
Run:
bash fordinucleotide.shThis prepares filtered FASTA files and performs dinucleotide counting for TT/TC/CT/CC at each position across reads.
Plot using:
Rscript plot_dinucleotide.RRun monomer.py to compute position-specific nucleotide ratios across filtered reads. Then visualize with:
Rscript plot_monomer_rld.R sample_nameThis generates:
[sample]_monomer_graph.png[sample]_rld_graph.png
Generates plus- and minus-strand bigWig files normalized to RPM:
- Removes mitochondrial DNA
- Uses
bedtools genomecovandbedGraphToBigWig
Run:
python3 exact_dam_site.py \
--input merged_bed \
--output merged_bed_exact_dam_site- Used only for nucleosome occupancy and dyad-centered analyses (Figure 5g,h and 6a,b)
- Both input and output should be directories.
- Inside
merged_beddirectory, stranded bed files and their simulations should be stored (see Notes below).
*_filtered.bed: final XR-seq reads, 21–28 nt*_readCount.txt: total filtered reads*_plus.bw,*_minus.bw: RPM-normalized bigWigs*_monomer_graph.png,*_rld_graph.png: QC plots*_get2ker.csv: dinucleotide abundance
- Genome index used:
WBcel235(ce11) - Adapted for use on HPC environments with SLURM
- filtered.bed for two biological repeats can be merged and deposited in a new folder "merged_bed"
- for xrseq around histone markers: stranded xrseq merged bed files generated by strandedmergedbedfiles.sh
- simulations applied to merged.bed
Cansu Kose, 2025 – UNC Chapel Hill | Sancar Lab Cem Azgari, 2025 – Sabancı University | Adebali Lab