Author: Michael Wang (fw262@cornell.edu)
We developed meta-scRNA-seq, a pipeline for unbiased detection of non-host transcriptomic information from scRNA-seq data. To achieve this, meta-scRNA-seq aligns scRNA-seq data against the host genome reference using standard approaches, collects the single-cell-tagged unmapped reads, labels them based on sequence similarity against a large metagenomic database, and demultiplexes the reads to generate a cell-by-metagenome count matrix in parallel with the standard cell-by-gene (host) matrix.
This workflow requires the packages listed below. Please ensure that each tool can be called from the command line (i.e. the path to each tool is in your PATH variable).
1. Snakemake
2. STAR Aligner
conda install -c bioconda star
Please also ensure that you have downloaded the following R packages. They will be used throughout the pipeline.
4. Samtools
conda install -c bioconda samtools
5. Kraken2
Please make sure this tool is available in your working environment. Please also download the reference database.
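A quick way to confirm that the tools above are callable is to look each one up on your PATH. The snippet below is a minimal sketch, not part of the pipeline: `check_tools` is a hypothetical helper, and the executable names (`snakemake`, `STAR`, `samtools`, `kraken2`) are assumed to be the default names each tool installs under.

```shell
# Hypothetical helper: report which of the given executables are on PATH.
# Returns 0 if all were found, 1 if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumptions; adjust them if yours differ.
check_tools snakemake STAR samtools kraken2 || echo "Install the missing tools before running the pipeline."
```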
Run the following command in your command line.
git clone https://github.com/fw262/Meta-scRNA-seq.git
Please ensure all required software is installed before starting.
Please move the raw fastq files for each experiment into one data directory. Please ensure the sequence files end in "{sample}_R1_001.fastq.gz" and "{sample}_R2_001.fastq.gz" in your data directory.
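Before running, it can help to verify that every read 1 file in the data directory has a mate. The sketch below assumes paired files named "{sample}_R1_001.fastq.gz" and "{sample}_R2_001.fastq.gz"; `check_fastq_pairs` is a hypothetical helper and is not shipped with the pipeline.

```shell
# Hypothetical helper: check that each *_R1_001.fastq.gz file in a directory
# has a matching *_R2_001.fastq.gz file. Returns 0 only if all pairs are complete.
check_fastq_pairs() {
    data_dir="$1"
    status=0
    for r1 in "$data_dir"/*_R1_001.fastq.gz; do
        # If the glob matched nothing, the pattern is left as a literal string.
        if [ ! -e "$r1" ]; then
            echo "no *_R1_001.fastq.gz files in $data_dir" >&2
            return 1
        fi
        # Derive the expected read 2 name from the read 1 name.
        r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "missing mate for $(basename "$r1")" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage (the path is a placeholder):
#   check_fastq_pairs /path/to/data
```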
Please change the variables in config.yaml as required for your analysis. This includes the following:
- Samples: Samples prefix (before the _R1_001.fastq.gz)
- STAR_IND: Path to your STAR generated index folder.
- DATADIR: Path to where the sequencing samples ({sample}_R1_001.fastq.gz) are stored.
- PIPELINE_MAJOR: Directory where the outputs (expression matrices, plots) are stored.
- GLOBAL: Define global variables for pipeline including number of mismatches allowed in STAR, cell barcode base pair range in read 1, and UMI base pair range in read 1.
- STAREXEC: Path to STAR.
- KRAKEN: Path to Kraken2.
- KRAKEN_DB: Path to Kraken2 database.
- CORES: Number of cores used in each step of the pipeline. To run multiple samples in parallel, please specify total number of cores in the snakemake command (i.e. "snakemake -j {total cores}").
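After editing, a simple sanity check is to confirm that none of the variables listed above was accidentally removed from config.yaml. The sketch below assumes each variable appears as a top-level key at the start of a line; `check_config_keys` is a hypothetical helper, and it checks only that the keys are present, not that their values are valid.

```shell
# Hypothetical helper: verify that a config file contains every variable
# named in the list above as a top-level "KEY:" entry.
check_config_keys() {
    config_file="$1"
    missing=0
    for key in Samples STAR_IND DATADIR PIPELINE_MAJOR GLOBAL STAREXEC KRAKEN KRAKEN_DB CORES; do
        # Anchor on the line start and the trailing colon so that
        # KRAKEN does not match KRAKEN_DB.
        if ! grep -Eq "^${key}:" "$config_file"; then
            echo "missing key: $key" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_config_keys config.yaml
```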
Please ensure the Snakefile and config.yaml files as well as the scripts folder are in the directory where you intend to run the pipeline.
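The layout described above can be checked before launching Snakemake. This is a minimal sketch under the assumption that the scripts folder is named `scripts`; `check_run_dir` is a hypothetical helper, and the core count in the usage comment is an arbitrary example.

```shell
# Hypothetical helper: confirm the run directory holds the Snakefile,
# config.yaml, and the scripts folder. Defaults to the current directory.
check_run_dir() {
    run_dir="${1:-.}"
    status=0
    [ -f "$run_dir/Snakefile" ]   || { echo "missing Snakefile" >&2; status=1; }
    [ -f "$run_dir/config.yaml" ] || { echo "missing config.yaml" >&2; status=1; }
    [ -d "$run_dir/scripts" ]     || { echo "missing scripts folder" >&2; status=1; }
    return "$status"
}

# Usage: do a dry run first (-n), then run with 8 total cores (-j).
#   check_run_dir . && snakemake -n && snakemake -j 8
```

The dry run lists the jobs Snakemake would execute without running them, which is a cheap way to catch path mistakes in config.yaml.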
- Merged transcriptome + metagenomic expression matrices are stored in the "[PIPELINE_MAJOR]/[Samples]_solo/Solo.out/merged" folder.
- The "[PIPELINE_MAJOR]/[Samples]_solo/plots" folder contains several useful plots including UMAP projection of the data, level of unmapped reads for each cell cluster, as well as cell-cluster specific expression of all metagenomic features, differentially expressed genes, and differentially expressed metagenomic features.
