This repository contains data, scripts, and outputs for the hnRNPH1 intron 4 retention decoy analysis workflow.
The database of predicted decoy loci was generated as a BED file using the bedtools intersect workflow described in decoy_splice_site_flowchart.pdf
Briefly, all human intron coordinates were collected from Vast-DB PSI-TABLE-hg38.tab.gz as any EVENT with an ID beginning with 'HsaIN'. These coordinates were fed in a strandwise fashion to bedtools getfasta, then SpliceAI to predict splice donor scores for every intronic nucleotide, using 400 nt exonic flanks for internal normalization by the canonical 5' splice site. Scores below 0.01 were filtered out, then all coordinates were intersected against PRPF8-binding datasets of RBPnet predictions generated with the same scaled workflow of all introns, and eCLIP and iCLIP signal from PRPF8 experimental data. Predictions within 100nt of a canonical splice site were removed from the dataset. Finally this dataset DecoySpliceSites_f50.bed was filtered to remove exonic loci and keep only 'protein-coding' introns by overlapping with Gencode.v49.annotation.gtf using GenomicRanges in the decoy_exon_overlaps.Rmd script to output DecoySpliceSites_proteincoding.bed.
$ bedtools intersect -a Splice_All.filtered.01min.bed -b rbpnet_clippy_f50_rollmean10_minHeightAdjust1.0_minPromAdjust1.0_minGeneCount5_Peaks.bed PRPF8_iCLIP_HepG2_xlinks.bed PRPF8_eCLIP_HepG2_rollmean10_minHeightAdjust1. 0_minPromAdjust1.0_minGeneCount5_Peaks.bed -c -s | awk '$NF > 0' > SupportedSpliceSites_f50.bed
$ bedtools intersect -a SupportedSpliceSites_f50.bed -b Wide_Canonical_splice_sites.bed -v -s > DecoySpliceSites_f50.bed
The final DecoySpliceSites_proteincoding.bed dataset was used as input for SpliceAI_Inference.py which slops the coordinates 24 nt wider on either side of the proposed decoy site, then uses strand-aware extraction of the fasta sequence with bedtools getfasta. Nucleotide sequences of the 49 nt region surrounding the decoy site is passed to SpliceAI. Max splice donor score is recorded for the locus and an updated BED file is output.
$ python3 ../scripts/SpliceAI_Inference.py --bed Decoys/DecoySpliceSites_proteincoding.bed --fasta ../../../reference/genomes/Gencode49/GRCh38.primary_assembly.genome.fa --genome ../../../reference/genomes/Gencode49/genome.sizes --out Decoys_proteincoding_splicescores.bed
The database of decoys in 'protein-coding' introns is uploaded as Decoys_proteincoding_splicescores.bed in the compile_decoy_intron_data.Rmd to integrate intron retention quantification data from PSI-TABLE-hg38.tab.gz. The Rmd file uses Genomic Ranges to integrate intron coordinates, unique identifiers EVENT and intron retention PSI values in 145 cell and tissue types with SpliceAI inference. Decoy distance from canonical splice site is calculated with strandwise logic. After overlapping the decoy database with introns, MaxEntScan is used to calculate the strength of decoy predicted splice sites and the canonical 5' splice site for the intron harboring the decoy with the scripts run_maxentscan_decoy.sh run_maxentscan_canonical.sh.Average phastCons 100-way and 470-way scoring across the intron harboring the decoy is calculated with extract_phastcons_scores.sh. Part 2 of the R markdown file reloads the results from MaxEntScan and phastCons and merges into the final feature table.
"TNNI2_1839212" "SLC7A6_68264187" "OAZ1_2270281" "ITPA_3221726" "ARHGAP40_38626901" "DHX35_38962112"
These loci are largely within protein-coding introns, with the exception of SLC7A6_68264187 and DHX35_38962112 which appear in the 5' UTR/intergenic space. 6 out of the 10 missing loci are in chromosome 20.
This R Markdown document generates Figure 1: a comparative multiple-sequence alignment view of the hnRNPH1 intron 4 decoy region (chr5:179620560-179620600, hg38). It starts from an extracted UCSC multiz MAF alignment (converted to FASTA), plots the raw alignment, then creates a manuscript-ready alignment by renaming taxa, converting DNA bases from T to U, removing selected outlier species, and dropping columns that are gaps/missing across all taxa.
The final plot highlights the proposed decoy site (positions 27-33 in the processed alignment; labeled as genomic interval 179,620,576-179,620,582) and is intended for direct use in manuscript figure generation.
data/hnRNPH1_intron4decoyMSA.fadata/tree_to_clade_mapping.tsv
data/hnRNPH1_intron4decoyMSA.processed.fa