4 changes: 2 additions & 2 deletions Processed Sequencing Files/README.md
@@ -2,5 +2,5 @@

### What you will find here:

- wig files for each experiment. After aligning to the correct genome of choice we generate wig files for each experiment. These are tab seperated files that consist of a header indicating what genome was used for bowtie alignment of the reads, and then two columns of data, the first indicating the genome position and the second indicating the number of reads whose 5' end aligned to that position. Each experiment will have 2 wig files for each of the genomes it was aligned to, one for each strand.
- dataframe files. These files take the data in each wig files and compare it to a CDS file to generate the number of reads found in each coding body sequence for each gene. They are written as simple tab deliminated text files with the gene name, read count, and several pieces of information about each gene. We like to read in this file and work with it using pandas.
- wig files for each experiment. After aligning to the correct genome of choice we generate wig files for each experiment. These are tab separated files that consist of a header indicating what genome was used for bowtie alignment of the reads, and then two columns of data, the first indicating the genome position and the second indicating the number of reads whose 5' end aligned to that position. Each experiment will have 2 wig files for each of the genomes it was aligned to, one for each strand.
- dataframe files. These files take the data in each wig file and compare it to a CDS file to generate the number of reads found in each coding body sequence for each gene. They are written as simple tab delimited text files with the gene name, read count, and several pieces of information about each gene. We like to read in this file and work with it using pandas.
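
As a rough illustration of the wig layout described above, the sketch below reads a two-column, tab separated wig into pandas. The header text, the assumption that the header occupies exactly one line, and the names `example_wig` and `read_wig` are hypothetical and are not taken from the repository.

```python
import io

import pandas as pd

# Hypothetical wig contents: one header line, then position<TAB>count rows.
# The exact header text written by the pipeline is an assumption here.
example_wig = (
    "track type=wiggle_0 name=example genome=NC_000913.2\n"
    "101\t4\n"
    "102\t0\n"
    "250\t7\n"
)


def read_wig(handle):
    """Read a two-column wig into a DataFrame, skipping one header line."""
    return pd.read_csv(
        handle,
        sep="\t",
        skiprows=1,  # assumes a single header line
        header=None,
        names=["position", "reads"],
    )


wig_df = read_wig(io.StringIO(example_wig))
total_reads = int(wig_df["reads"].sum())
print(wig_df)
print(total_reads)
```

The same function should accept a file path in place of the `io.StringIO` object, provided the real files follow the single-header layout assumed here.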
23 changes: 12 additions & 11 deletions README.md
@@ -1,30 +1,31 @@
# Analysis Scripts for Molecular Time Capsule Project

### What you will find here:
### What you will find here

- Copies of the scripts/exact parameters I used to convert my raw sequencing reads to processed read/gene dataframe files used in downstream analysis.
- The bowtie indices I aligned my raw data with and their corresponding CDS files.
- The dataframe read/gene files for each of the experiments referenced in my paper.
- The wig files for each of the experiments referenced in the paper.
- Jupyter notebook files for each figure in my paper which transform the data in the read/gene files into the exact figures seen in the paper.

### More Details:
### More Details

<details>
<summary>Raw data Processing:</summary>
<summary>Raw data Processing</summary>
<br>

The raw data processing consists of the following steps:
The raw data processing consists of the following steps.

- Trimming of each read as needed to remove any adapter sequence or nucleotides that may have been added as part of the library preparation. This is done via a custom python script `trim_reads.py`.
- Alignment to the bowtie index of choice. This is done using [bowtie](https://bowtie-bio.sourceforge.net/index.shtml) (not bowtie2) and provided indices.
- Sorting of aligned output (sam file), and compression to BAM file using samtools.
- Generate a "depth file" using the 5' end of each read as a read count at that location. (here we also seperate reads from the + and - strands). This is down using the `bedtools genomecov` command.
- Generate a "depth file" using the 5' end of each read as a read count at that location (here we also separate reads from the + and - strands). This is done using the `bedtools genomecov` command.
- Conversion of density files to wigs (the default format used in our lab to view sequencing results). This is done through a custom python script `density_to_wig.py`.
- Conversion of wigs to read/gene dataframe files. This is done through a custom python file that requires a CDS file for the genome annotation that the reads were aligned to: `wig_to_df.py`.
- Conversion of wigs to read/gene dataframe files. This is done through a custom python file that requires a CDS file for the genome annotation that the reads were aligned to: `wig_to_df.py`.
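
The last step above, turning per-position counts into reads per gene, can be sketched as follows. This is a minimal illustration and not the logic of `wig_to_df.py`: it assumes inclusive `Start`/`Stop` coordinates with `Start <= Stop`, `+`/`-` strand labels, and one wig table per strand. The CDS column names follow those used in `make_CDS_df`; the data, the `Reads` column, and the function name `count_reads_per_gene` are hypothetical.

```python
import pandas as pd

# Hypothetical per-strand wig data (position, 5'-end read count).
plus_wig = pd.DataFrame({"position": [10, 15, 40], "reads": [3, 2, 5]})
minus_wig = pd.DataFrame({"position": [12, 55], "reads": [4, 1]})

# Hypothetical CDS table using the column names from make_CDS_df.
cds_df = pd.DataFrame(
    {
        "Name": ["geneA", "geneB"],
        "Strand": ["+", "-"],
        "Start": [5, 50],
        "Stop": [20, 60],
        "Note": ["example", "example"],
    }
)


def count_reads_per_gene(cds, plus, minus):
    """Sum reads whose position lies in [Start, Stop] on the gene's strand."""
    counts = []
    for gene in cds.itertuples(index=False):
        wig = plus if gene.Strand == "+" else minus
        in_gene = wig["position"].between(gene.Start, gene.Stop)
        counts.append(int(wig.loc[in_gene, "reads"].sum()))
    return cds.assign(Reads=counts)


gene_counts = count_reads_per_gene(cds_df, plus_wig, minus_wig)
print(gene_counts[["Name", "Reads"]])
```

A row-by-row loop is used for readability; for full genomes an interval-based or vectorized approach would likely be faster.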

All of these steps are collected in a single bash shell program called `process_seq.sh`. This program takes a single argument from the command line: - another shell file (denoted as a `config.sh` file) which contains all the experiment specific parameters. For each experiment in this paper I have a seperate `config.sh` file available with the exact parameters used for the pulished analysis. If you wish to redo this analysis yourself you need only modify the relevant `config.sh` file to update the parameters for the relevant location of the raw reads and (optionally) where you want the processed reads and intermediates to be stored.
All of these steps are collected in a single bash shell program called `process_seq.sh`. This program takes a single argument from the command line: another shell file (denoted as a `config.sh` file) which contains all the experiment-specific parameters. For each experiment in this paper I have a separate `config.sh` file available with the exact parameters used for the published analysis. If you wish to redo this analysis yourself, you need only modify the relevant `config.sh` file to point to the location of the raw reads and (optionally) where you want the processed reads and intermediates to be stored.

In order to run this analysis you will require the following:
In order to run this analysis the following tools are required.

- python
- pandas
@@ -35,11 +36,11 @@ In order to run this analysis you will require the following:
</details>

<details>
<summary> Figure Analysis </summary>
<summary>Figure Analysis</summary>
<br>

Each figure or subfigure has its own folder which contains:
- The final version of each figure that was included in the paper. Where possible there will be both the svg of the image that was included in the paper.
- A jupyter notebook can recreate the images presented. (They will also generate an embedded interactive image when run).
- The final version of each figure that was included in the paper. Where possible there will be both the svg of the image that was included in the paper.
- A jupyter notebook that can recreate the images presented. (It will also generate an embedded interactive image when run.)

</details>
@@ -3,7 +3,7 @@

#Change these variables to match what is needed in your project:

Genome="/home/mirae/data/Publication_Raw_Data_Processing/bowtie_indices/ecoli_bsub_and_mtcs" #"," Seperated list of bowtie indexes to align to.
Genome="/home/mirae/data/Publication_Raw_Data_Processing/bowtie_indices/ecoli_bsub_and_mtcs" #"," Separated list of bowtie indexes to align to.
Raw_Data_Location="/home/mirae/data/MTC_Publication_Data/230712LiA_Mixing/raw_fastq_files/" #Folder where the raw data lives.
Save_Folder_Location="/home/mirae/data/MTC_Publication_Data/230712LiA_Mixing/" #Folder for where to store all intermediary files.
Bowtie_Args="-m 1 -v 2"
@@ -3,7 +3,7 @@

#Change these variables to match what is needed in your project:

Genome="/home/mirae/data/Publication_Raw_Data_Processing/bowtie_indices/ecoli_with_mtcs" #"," Seperated list of bowtie indexes to align to.
Genome="/home/mirae/data/Publication_Raw_Data_Processing/bowtie_indices/ecoli_with_mtcs" #"," Separated list of bowtie indexes to align to.
Raw_Data_Location="/home/mirae/data/MTC_Publication_Data/230712LiA_Snapshot/raw_fastq_files/" #Folder where the raw data lives.
Save_Folder_Location="/home/mirae/data/MTC_Publication_Data/230712LiA_Snapshot/" #Folder for where to store all intermediary files.
Bowtie_Args="-k 1 -v 2"
2 changes: 1 addition & 1 deletion Raw Data Analysis Scripts/config_files/230724Li_config.sh
@@ -3,7 +3,7 @@

#Change these variables to match what is needed in your project:

Genome="/home/mirae/data/Publication_Raw_Data_Processing/bowtie_indices/ecoli_with_mtcs" #"," Seperated list of bowtie indexes to align to.
Genome="/home/mirae/data/Publication_Raw_Data_Processing/bowtie_indices/ecoli_with_mtcs" #"," Separated list of bowtie indexes to align to.
Raw_Data_Location="/home/mirae/data/MTC_Publication_Data/230724Li/raw_fastq_files/" #Folder where the raw data lives.
Save_Folder_Location="/home/mirae/data/MTC_Publication_Data/230724Li/" #Folder for where to store all intermediary files.
Bowtie_Args="-m 1 -v 2"
2 changes: 1 addition & 1 deletion Raw Data Analysis Scripts/config_files/README.md
@@ -1,3 +1,3 @@
# Config files:

Each of the experiments in the paper has a coresponding config file which contains all of the experiment-specific parameters for processing the raw data and converting it into wig files (tab seperated files containing genome position and read count) and gene-count files - data frame like files that collate the number of reads per gene.
Each of the experiments in the paper has a corresponding config file which contains all of the experiment-specific parameters for processing the raw data and converting it into wig files (tab separated files containing genome position and read count) and gene-count files - data frame like files that collate the number of reads per gene.
2 changes: 1 addition & 1 deletion Raw Data Analysis Scripts/fastq_to_dataframe.sh
@@ -64,7 +64,7 @@ fi

#If a valid CDS_Dir is provided then make a dataframe file with counts for each gene.
if [ -d "$CDS_Dir" ]; then
echo -e "\n--------------------------------------------\nUsing CDS Files to Generate Gene Specifc Counts."
echo -e "\n--------------------------------------------\nUsing CDS Files to Generate Gene Specific Counts."
echo "Making ${DataFrame_Save}${f_no_start/_minus.wig/} dataframe.."
if [ ! -f "$dataframe_file" ]; then
python /home/mirae/data/Publication_Raw_Data_Processing/wig_to_df.py ${minus_wig_name/_minus.wig/} $CDS_Dir $CDS_Files $CDS_Genomes $CDS_Names $dataframe_file
8 changes: 4 additions & 4 deletions Raw Data Analysis Scripts/wig_to_df.py
@@ -79,7 +79,7 @@ def make_CDS_df(cds_file, name):
Makes a dataframe for each CDS file for easy access downstream.
-cds_file = str with the path of the cds_file of interest.

Returns a pandas dataframe with the relevant columsn of the original CDS.
Returns a pandas dataframe with the relevant columns of the original CDS.
"""
cds_df = pd.DataFrame(columns=["Name", "Strand", "Start", "Stop", "Note"])

@@ -131,9 +131,9 @@ def make_CDS_df(cds_file, name):
Expected arguments (in order):
-wig_file = str with the name of the current wig file
-cds_folder = string path where CDS files are kept
-cds_files = `,` seperated str with names of CDS files
-genomes = `,` seperated str with the corresponding genomes of the CDS files (eg NC_000913.2)
-names = `,` seperated str with names for each genome/CDS pair (eg ecoli)
-cds_files = `,` separated str with names of CDS files
-genomes = `,` separated str with the corresponding genomes of the CDS files (eg NC_000913.2)
-names = `,` separated str with names for each genome/CDS pair (eg ecoli)
-prefix = str with prefix for save location of the dataframe file.
"""
