diff --git a/README.md b/README.md index 708ece6..615fe58 100644 --- a/README.md +++ b/README.md @@ -18,7 +18,7 @@ By the end of this course, students will: A working knowledge and experience with R, GitHub, and Unix is required. Students are required to complete GSND 5345Q (Fundamentals of Data Science) prior to taking this course, or have equivalent experience. An introductory course in statistics or biostatistics is also recommended (e.g., GSND 5135Q or MSBS N5030), or comparable experience in statistical analysis. Students without sufficient background in coding or statistics are expected to acquire these skills before the course begins. Please contact Dr. Johnson for a list of required proficiencies and available asynchronous resources to help build the necessary programming and statistical foundations. ## COURSE FORMAT: -This class will be taught virtually using a synchronous remote modality, although students will be provided a classroom to gather for each lecture. A co-instructor will be present in the classroom for each lecture. Class will occur Mondays and Wednesdays from 12:00pm-1:50pm.. Courses may also be recorded and made available for students who need to miss classes due to personal reasons, illness, or research related needs. +This class will be taught virtually using a synchronous remote modality, although students will be provided a classroom to gather for each lecture. A co-instructor will be present in the classroom for each lecture. Class will occur Mondays and Wednesdays from 12:00pm-1:50pm. Courses may also be recorded and made available for students who need to miss classes due to personal reasons, illness, or research related needs. ## ZOOM LINK AND CLASSROOM: Zoom Meeting ID for all sessions is Meeting ID: 929 7046 3934, with the password: 850292, or use the following direct link (the link is also available though the course GitHub page): https://rutgers.zoom.us/j/92970463934?pwd=wVF7nGblCNaAMon3fFlRwboSgEiDUg.1 @@ -69,12 +69,12 @@ This is an 8-week, 2.0 credit class near the end of Spring 2024. In general, you I **strongly** encourage you to contact early me if you have difficulty with the material. This course builds on material from prior lectures, so do not fall behind! My job is to help you understand the material as well as possible, and I am flexible with meeting times. ### ACADEMIC INTEGRITY: -You are expected to have read and follow the guidelines at the university’s academic integrity website (http://academicintegrity.rutgers.edu ). These principles forbid plagiarism and require that every Rutgers University student to: +You are expected to have read and follow the guidelines at the university’s academic integrity website (http://academicintegrity.rutgers.edu). These principles forbid plagiarism and require that every Rutgers University student to: * Properly acknowledge and cite all use of the ideas, results, or words of others * Properly acknowledge all contributors to a given piece of work * Make sure that all work submitted as his or her own in a course or other academic activity isproduced without the aid of unsanctioned materials or unsanctioned collaboration -* Treat all other students in an ethical manner, respecting their integrity and right to pursue their educational goals without interference. This requires that a student neither facilitate academic dishonesty by others nor obstruct their academic progress (reproduced from: ttp://academicintegrity.rutgers.edu/academic-integrity-at-rutgers/ ). +* Treat all other students in an ethical manner, respecting their integrity and right to pursue their educational goals without interference. This requires that a student neither facilitate academic dishonesty by others nor obstruct their academic progress (reproduced from: http://academicintegrity.rutgers.edu/academic-integrity-at-rutgers/). Violations of academic integrity will be treated in accordance with university policy, and sanctions for violations may range from no credit for the assignment, to a failing course grade to (for the most severe violations) dismissal from the university. diff --git a/homework/HW3BMDA.Rmd b/homework/HW3BMDA.Rmd new file mode 100644 index 0000000..7a73621 --- /dev/null +++ b/homework/HW3BMDA.Rmd @@ -0,0 +1,203 @@ +--- +title: "HW 3 BMDA" +author: "Nat Lim" +date: "2026-06-13" +output: html_document +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +In Homework 1 you downloaded and QC'd the 33 FASTQ files for the HIV/TB dataset from SRA. In this assignment you will carry those reads through a complete RNA-seq differential expression analysis. + +Please save all your code for each part of the assignment. Your instructor may request it, and it may also be used for awarding partial credit if needed. We strongly recommend completing the assignment in an RMarkdown document to ensure your work is well-documented and reproducible. + +Submit your completed assignment via the designated homework (Quiz) submission site on Canvas. + +## RNA-sequencing analysis + +##### 1. Align the HIV/TB reads (downloaded in Homework 1) to the human genome reference using your choice of the `Rsubread` or the `STAR` aligners. Confirm whether or not you were able to successfully complete this step (on your honor). + + 1. Yes + 2. No + +```{r, eval=FALSE} +# Install the Bioconductor installer +install.packages("BiocManager") + +# Install Rsubread +BiocManager::install("Rsubread") + +# Index the genome: +library(Rsubread) +buildindex("genomefile", "genomefile.fa") + +# Align the reads: +align("genomefile", "myfastqfile.fq", + output_file="myfastq.bam", nthreads = 4) + +``` + + +##### 2. Use the `featureCounts` function in the `Rsubread` package to generate a counts file for this dataset. Upload your feature counts .txt file here. + + + +For the rest of the assignment, please use the feature counts provided by Dr. Johnson in the lecture materials: {https://github.com/wevanjohnson/2026_Spring_BMDA/raw/refs/heads/main/lectures/week6/example_data/features_combined.txt}{features_combined.txt (link)} + + + +##### 3. Generate a `SummarizedExperiment` object for your counts. The `colData` for these data are provided in the `homework2\_metadata.txt` file. What function is used to create a SummarizedExperiment object? + + 1. SummarizedExperiment(assays = list(counts = as.matrix(counts)), colData = metadata) + 2. makeSummarizedExp(as.matrix(counts), metadata) + 3. seObject(counts, metadata) + 4. summarizeData(data = counts, metadata = metadata) + 5. buildExperiment(assay = as.matrix(counts), metadata = metadata) + +```{r} +library(SummarizedExperiment) + +# Read in counts +counts <- read.delim("features_combined.txt", row.names = 1, check.names = FALSE) + +# Read in metadata +metadata <- read.delim("hivtb_SRR_metadata.txt") + +# Make sure SampleID matches the count matrix column order +rownames(metadata) <- metadata$SampleID +metadata <- metadata[colnames(counts), ] + +# Build the SummarizedExperiment +se <- SummarizedExperiment( + assays = list(counts = as.matrix(counts)), + colData = metadata +) + +se +``` + + +##### 4. Preprocess these data by removing `TB-HIV-ART` samples (should be two of them), removing any genes with zero expression for all samples, and by generating a log counts per million assay. How may genes were removed with the zero expression filter? + +1. 48 +2. 2,204 +3. 9,700 +4. 17,385 + +```{r} +# Remove TB-HIV-ART samples +se <- se[, se$disease_status != "TB-HIV-ART"] +dim(se) # should now show 31 samples + +# Filter genes with zero expression across all samples +keep <- rowSums(assay(se, "counts")) > 0 +n_removed <- sum(!keep) +n_removed # this answers the question + +se <- se[keep, ] +dim(se) # should show fewer genes + +# Generate log CPM assay +library(edgeR) +logcpm <- cpm(assay(se, "counts"), log = TRUE) +assay(se, "logCPM") <- logcpm + +se +``` + + +##### 5. Create a batch corrected assay in your `SummarizedExperiment` using ComBat-Seq. You can use the `ComBat\_Seq` function in the `sva` package, or simply do it in `BatchQC` and then extract the `SummarizedExperiment`. For this example, pretend that `disease\_status` is the batch variable. Note that this will remove the disease status variability, so **don't** use this assay in the following analyses! This was merely a practice for cases where you have an actual batch variable. What command/syntax would work to apply ComBat-Seq? + + 1. ComBat_seq(as.matrix(assay(se, "counts")), batch = se\$disease_status) + 2. ComBat_seq(counts = assay(se), batch = disease_status) + 3. BatchFix(se, "counts", batch = "disease_status") + 4. combatCorrect(se\$counts, batch = se\$disease_status) + 5. ComBat(se, "counts", batch = "disease_status") + +```{r} +library(sva) + +# ComBat-Seq practice (do NOT use this assay for downstream DE analysis) +combat_counts <- ComBat_seq( + as.matrix(assay(se, "counts")), + batch = se$disease_status +) + +# Store it as a separate assay so it's clearly distinguished +assay(se, "combat_counts") <- combat_counts + +se +``` + + + +##### 6. Apply PCA and UMAP to your data and generate dimension reduction plots for the results. Color the TB-HIV to the HIV only samples in different colors. Note that you should be using the log CPM values for this analysis. Please attach your UMAP plot here. + +```{r} +library(umap) + +logcpm_mat <- t(assay(se, "logCPM")) # samples as rows for PCA/UMAP + +# PCA +pca <- prcomp(logcpm_mat, scale. = TRUE) +pca_df <- data.frame( + PC1 = pca$x[,1], + PC2 = pca$x[,2], + disease_status = se$disease_status +) + +library(ggplot2) +ggplot(pca_df, aes(PC1, PC2, color = disease_status)) + + geom_point(size = 3) + + theme_minimal() + + labs(title = "PCA of log CPM values") + +# UMAP +set.seed(42) +umap_res <- umap(logcpm_mat) +umap_df <- data.frame( + UMAP1 = umap_res$layout[,1], + UMAP2 = umap_res$layout[,2], + disease_status = se$disease_status +) + +ggplot(umap_df, aes(UMAP1, UMAP2, color = disease_status)) + + geom_point(size = 3) + + theme_minimal() + + labs(title = "UMAP of log CPM values") +``` + + +##### 7. Use `DESeq2` to do a differential expression analysis (on the counts) comparing the TB-HIV to the HIV only samples. What is the twenty second gene on the differentially expressed gene list? + + 1. IGF2BP3 + 2. AIM2 + 3. IL1R2 + 4. RAB20 + 5. FCGR1C + + +##### 8. Now conduct the same analysis using `limma` on the log CPM values. How do the `DESeq2` results compare to the `limma` results? What is the seventh gene on the differentially expressed gene list? + + 1. IGF2BP3 + 2. AIM2 + 3. IL1R2 + 4. RAB20 + 5. FCGR1C + + +##### 9. Give a heatmap plot of either the `DESeq2` or the `limma` results (top 50). Add a colorbar for disease status. Upload your plot here. + +##### 10. Conduct a pathway analysis of the top 50 Limma genes using enrichR (through R or online). Use the "Reactome\_Pathways\_2024" database. What is the top scoring pathway? + + 1. "Interferon Gamma Signaling" + 2. "Fatty Acid Metabolism" + 3. "Cytokine Signaling in Immune System" + 4. "RAB Geranylgeranylation" + + +##### 11. Conduct a TBSignatureProfiler analysis on these data -- including signature heatmaps, individual boxplots, and AUC boxplots. Interpret your findings. Upload one of your TBSignatureProfiler plots here. + + diff --git a/homework/HW3BMDA.html b/homework/HW3BMDA.html new file mode 100644 index 0000000..7078c60 --- /dev/null +++ b/homework/HW3BMDA.html @@ -0,0 +1,724 @@ + + + + +
+ + + + + + + + + + +In Homework 1 you downloaded and QC’d the 33 FASTQ files for the +HIV/TB dataset from SRA. In this assignment you will carry those reads +through a complete RNA-seq differential expression analysis.
+Please save all your code for each part of the assignment. Your +instructor may request it, and it may also be used for awarding partial +credit if needed. We strongly recommend completing the assignment in an +RMarkdown document to ensure your work is well-documented and +reproducible.
+Submit your completed assignment via the designated homework (Quiz) +submission site on Canvas.
+Rsubread or the
+STAR aligners. Confirm whether or not you were able to
+successfully complete this step (on your honor).# Install the Bioconductor installer
+install.packages("BiocManager")
+
+# Install Rsubread
+BiocManager::install("Rsubread")
+
+# Index the genome:
+library(Rsubread)
+buildindex("genomefile", "genomefile.fa")
+
+# Align the reads:
+align("genomefile", "myfastqfile.fq",
+ output_file="myfastq.bam", nthreads = 4)
+featureCounts function in the
+Rsubread package to generate a counts file for this
+dataset. Upload your feature counts .txt file here.For the rest of the assignment, please use the feature counts +provided by Dr. Johnson in the lecture materials: {https://github.com/wevanjohnson/2026_Spring_BMDA/raw/refs/heads/main/lectures/week6/example_data/features_combined.txt}{features_combined.txt +(link)}
+SummarizedExperiment object for your
+counts. The colData for these data are provided in the
+homework2\_metadata.txt file. What function is used to
+create a SummarizedExperiment object?library(SummarizedExperiment)
+## Loading required package: MatrixGenerics
+## Loading required package: matrixStats
+##
+## Attaching package: 'MatrixGenerics'
+## The following objects are masked from 'package:matrixStats':
+##
+## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
+## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
+## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
+## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
+## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
+## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
+## colWeightedMeans, colWeightedMedians, colWeightedSds,
+## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
+## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
+## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
+## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
+## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
+## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
+## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
+## rowWeightedSds, rowWeightedVars
+## Loading required package: GenomicRanges
+## Loading required package: stats4
+## Loading required package: BiocGenerics
+## Loading required package: generics
+##
+## Attaching package: 'generics'
+## The following objects are masked from 'package:base':
+##
+## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
+## setequal, union
+##
+## Attaching package: 'BiocGenerics'
+## The following objects are masked from 'package:stats':
+##
+## IQR, mad, sd, var, xtabs
+## The following objects are masked from 'package:base':
+##
+## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
+## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
+## get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
+## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
+## rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
+## unsplit, which.max, which.min
+## Loading required package: S4Vectors
+##
+## Attaching package: 'S4Vectors'
+## The following object is masked from 'package:utils':
+##
+## findMatches
+## The following objects are masked from 'package:base':
+##
+## expand.grid, I, unname
+## Loading required package: IRanges
+##
+## Attaching package: 'IRanges'
+## The following object is masked from 'package:grDevices':
+##
+## windows
+## Loading required package: Seqinfo
+## Loading required package: Biobase
+## Welcome to Bioconductor
+##
+## Vignettes contain introductory material; view with
+## 'browseVignettes()'. To cite Bioconductor, see
+## 'citation("Biobase")', and for packages 'citation("pkgname")'.
+##
+## Attaching package: 'Biobase'
+## The following object is masked from 'package:MatrixGenerics':
+##
+## rowMedians
+## The following objects are masked from 'package:matrixStats':
+##
+## anyMissing, rowMedians
+# Read in counts
+counts <- read.delim("features_combined.txt", row.names = 1, check.names = FALSE)
+
+# Read in metadata
+metadata <- read.delim("hivtb_SRR_metadata.txt")
+
+# Make sure SampleID matches the count matrix column order
+rownames(metadata) <- metadata$SampleID
+metadata <- metadata[colnames(counts), ]
+
+# Build the SummarizedExperiment
+se <- SummarizedExperiment(
+ assays = list(counts = as.matrix(counts)),
+ colData = metadata
+)
+
+se
+## class: SummarizedExperiment
+## dim: 25369 33
+## metadata(0):
+## assays(1): counts
+## rownames(25369): DDX11L1 WASH7P ... GOLGA2P3Y GOLGA2P2Y
+## rowData names(0):
+## colnames(33): R01_1 R01_10 ... R03_32 R03_33
+## colData names(3): sample_name SampleID disease_status
+TB-HIV-ART samples
+(should be two of them), removing any genes with zero expression for all
+samples, and by generating a log counts per million assay. How may genes
+were removed with the zero expression filter?# Remove TB-HIV-ART samples
+se <- se[, se$disease_status != "TB-HIV-ART"]
+dim(se) # should now show 31 samples
+## [1] 25369 31
+# Filter genes with zero expression across all samples
+keep <- rowSums(assay(se, "counts")) > 0
+n_removed <- sum(!keep)
+n_removed # this answers the question
+## [1] 2204
+se <- se[keep, ]
+dim(se) # should show fewer genes
+## [1] 23165 31
+# Generate log CPM assay
+library(edgeR)
+## Loading required package: limma
+##
+## Attaching package: 'limma'
+## The following object is masked from 'package:BiocGenerics':
+##
+## plotMA
+logcpm <- cpm(assay(se, "counts"), log = TRUE)
+assay(se, "logCPM") <- logcpm
+
+se
+## class: SummarizedExperiment
+## dim: 23165 31
+## metadata(0):
+## assays(2): counts logCPM
+## rownames(23165): DDX11L1 WASH7P ... RBMY2FP LOC100652931
+## rowData names(0):
+## colnames(31): R01_1 R01_10 ... R02_30 R02_31
+## colData names(3): sample_name SampleID disease_status
+SummarizedExperiment using ComBat-Seq. You can use the
+ComBat\_Seq function in the sva package, or
+simply do it in BatchQC and then extract the
+SummarizedExperiment. For this example, pretend that
+disease\_status is the batch variable. Note that this will
+remove the disease status variability, so don’t use
+this assay in the following analyses! This was merely a practice for
+cases where you have an actual batch variable. What command/syntax would
+work to apply ComBat-Seq?library(sva)
+## Loading required package: mgcv
+## Loading required package: nlme
+##
+## Attaching package: 'nlme'
+## The following object is masked from 'package:IRanges':
+##
+## collapse
+## This is mgcv 1.9-4. For overview type '?mgcv'.
+## Loading required package: genefilter
+##
+## Attaching package: 'genefilter'
+## The following objects are masked from 'package:MatrixGenerics':
+##
+## rowSds, rowVars
+## The following objects are masked from 'package:matrixStats':
+##
+## rowSds, rowVars
+## Loading required package: BiocParallel
+# ComBat-Seq practice (do NOT use this assay for downstream DE analysis)
+combat_counts <- ComBat_seq(
+ as.matrix(assay(se, "counts")),
+ batch = se$disease_status
+)
+## Found 2 batches
+## Using null model in ComBat-seq.
+## Adjusting for 0 covariate(s) or covariate level(s)
+## Estimating dispersions
+## Fitting the GLM model
+## Shrinkage off - using GLM estimates for parameters
+## Adjusting the data
+# Store it as a separate assay so it's clearly distinguished
+assay(se, "combat_counts") <- combat_counts
+
+se
+## class: SummarizedExperiment
+## dim: 23165 31
+## metadata(0):
+## assays(3): counts logCPM combat_counts
+## rownames(23165): DDX11L1 WASH7P ... RBMY2FP LOC100652931
+## rowData names(0):
+## colnames(31): R01_1 R01_10 ... R02_30 R02_31
+## colData names(3): sample_name SampleID disease_status
+DESeq2 to do a differential expression analysis
+(on the counts) comparing the TB-HIV to the HIV only samples. What is
+the twenty second gene on the differentially expressed gene list?limma on the log
+CPM values. How do the DESeq2 results compare to the
+limma results? What is the seventh gene on the
+differentially expressed gene list?DESeq2 or the
+limma results (top 50). Add a colorbar for disease status.
+Upload your plot here.Take a real biomedical dataset from raw data through interpretation, +using one (or more) of the methods we covered this semester. This is a +small but complete data analysis project. You will +present your analysis in a recorded video and provide constructive +feedback on your classmates’ projects.
+Choose one of the following analysis families that we covered in +class (or another one you clear with the instructor):
+Where to find data:
+Submit a .Rmd (or .qmd) that, when knit,
+walks the reader through your analysis end-to-end. At minimum:
sessionInfo(), links
+to data, all code included in the Rmd.Record a 5-7 minute video walking through your +analysis. Cover:
+Post the video in the Canvas Chat section by 11:59 PM, Friday +June 19, 2026.
+Watch at least five of your classmates’ videos. For +each:
+Document the videos you reviewed and your comments in the same
+.Rmd from part 2. Submit the updated .Rmd on
+Canvas by 11:59 PM, Friday June 26, 2026.
| Category | +Points | +
|---|---|
| Reproducible R Markdown report with end-to-end analysis | +60 | +
| Video presentation with clear walkthrough of question, methods, and +result | +40 | +
| Peer review: meaningful comments on five classmates’ videos | +50 | +
| Total | +150 | +
Happy analyzing!
+