Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ By the end of this course, students will:
A working knowledge and experience with R, GitHub, and Unix is required. Students are required to complete GSND 5345Q (Fundamentals of Data Science) prior to taking this course, or have equivalent experience. An introductory course in statistics or biostatistics is also recommended (e.g., GSND 5135Q or MSBS N5030), or comparable experience in statistical analysis. Students without sufficient background in coding or statistics are expected to acquire these skills before the course begins. Please contact Dr. Johnson for a list of required proficiencies and available asynchronous resources to help build the necessary programming and statistical foundations.

## COURSE FORMAT:
This class will be taught virtually using a synchronous remote modality, although students will be provided a classroom to gather for each lecture. A co-instructor will be present in the classroom for each lecture. Class will occur Mondays and Wednesdays from 12:00pm-1:50pm.. Courses may also be recorded and made available for students who need to miss classes due to personal reasons, illness, or research related needs.
This class will be taught virtually using a synchronous remote modality, although students will be provided a classroom to gather for each lecture. A co-instructor will be present in the classroom for each lecture. Class will occur Mondays and Wednesdays from 12:00pm-1:50pm. Courses may also be recorded and made available for students who need to miss classes due to personal reasons, illness, or research related needs.

## ZOOM LINK AND CLASSROOM:
Zoom Meeting ID for all sessions is Meeting ID: 929 7046 3934, with the password: 850292, or use the following direct link (the link is also available though the course GitHub page): https://rutgers.zoom.us/j/92970463934?pwd=wVF7nGblCNaAMon3fFlRwboSgEiDUg.1
Expand Down Expand Up @@ -69,12 +69,12 @@ This is an 8-week, 2.0 credit class near the end of Spring 2024. In general, you
I **strongly** encourage you to contact early me if you have difficulty with the material. This course builds on material from prior lectures, so do not fall behind! My job is to help you understand the material as well as possible, and I am flexible with meeting times.

### ACADEMIC INTEGRITY:
You are expected to have read and follow the guidelines at the university’s academic integrity website (http://academicintegrity.rutgers.edu ). These principles forbid plagiarism and require that every Rutgers University student to:
You are expected to have read and follow the guidelines at the university’s academic integrity website (http://academicintegrity.rutgers.edu). These principles forbid plagiarism and require that every Rutgers University student to:

* Properly acknowledge and cite all use of the ideas, results, or words of others
* Properly acknowledge all contributors to a given piece of work
* Make sure that all work submitted as his or her own in a course or other academic activity isproduced without the aid of unsanctioned materials or unsanctioned collaboration
* Treat all other students in an ethical manner, respecting their integrity and right to pursue their educational goals without interference. This requires that a student neither facilitate academic dishonesty by others nor obstruct their academic progress (reproduced from: ttp://academicintegrity.rutgers.edu/academic-integrity-at-rutgers/ ).
* Treat all other students in an ethical manner, respecting their integrity and right to pursue their educational goals without interference. This requires that a student neither facilitate academic dishonesty by others nor obstruct their academic progress (reproduced from: http://academicintegrity.rutgers.edu/academic-integrity-at-rutgers/).

Violations of academic integrity will be treated in accordance with university policy, and sanctions for violations may range from no credit for the assignment, to a failing course grade to (for the most severe violations) dismissal from the university.

Expand Down
203 changes: 203 additions & 0 deletions homework/HW3BMDA.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,203 @@
---
title: "HW 3 BMDA"
author: "Nat Lim"
date: "2026-06-13"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

In Homework 1 you downloaded and QC'd the 33 FASTQ files for the HIV/TB dataset from SRA. In this assignment you will carry those reads through a complete RNA-seq differential expression analysis.

Please save all your code for each part of the assignment. Your instructor may request it, and it may also be used for awarding partial credit if needed. We strongly recommend completing the assignment in an RMarkdown document to ensure your work is well-documented and reproducible.

Submit your completed assignment via the designated homework (Quiz) submission site on Canvas.

## RNA-sequencing analysis

##### 1. Align the HIV/TB reads (downloaded in Homework 1) to the human genome reference using your choice of the `Rsubread` or the `STAR` aligners. Confirm whether or not you were able to successfully complete this step (on your honor).

1. Yes
2. No

```{r, eval=FALSE}
# Install the Bioconductor installer
install.packages("BiocManager")

# Install Rsubread
BiocManager::install("Rsubread")

# Index the genome:
library(Rsubread)
buildindex("genomefile", "genomefile.fa")

# Align the reads:
align("genomefile", "myfastqfile.fq",
output_file="myfastq.bam", nthreads = 4)

```


##### 2. Use the `featureCounts` function in the `Rsubread` package to generate a counts file for this dataset. Upload your feature counts .txt file here.



For the rest of the assignment, please use the feature counts provided by Dr. Johnson in the lecture materials: {https://github.com/wevanjohnson/2026_Spring_BMDA/raw/refs/heads/main/lectures/week6/example_data/features_combined.txt}{features_combined.txt (link)}



##### 3. Generate a `SummarizedExperiment` object for your counts. The `colData` for these data are provided in the `homework2\_metadata.txt` file. What function is used to create a SummarizedExperiment object?

1. SummarizedExperiment(assays = list(counts = as.matrix(counts)), colData = metadata)
2. makeSummarizedExp(as.matrix(counts), metadata)
3. seObject(counts, metadata)
4. summarizeData(data = counts, metadata = metadata)
5. buildExperiment(assay = as.matrix(counts), metadata = metadata)

```{r}
library(SummarizedExperiment)

# Read in counts
counts <- read.delim("features_combined.txt", row.names = 1, check.names = FALSE)

# Read in metadata
metadata <- read.delim("hivtb_SRR_metadata.txt")

# Make sure SampleID matches the count matrix column order
rownames(metadata) <- metadata$SampleID
metadata <- metadata[colnames(counts), ]

# Build the SummarizedExperiment
se <- SummarizedExperiment(
assays = list(counts = as.matrix(counts)),
colData = metadata
)

se
```


##### 4. Preprocess these data by removing `TB-HIV-ART` samples (should be two of them), removing any genes with zero expression for all samples, and by generating a log counts per million assay. How may genes were removed with the zero expression filter?

1. 48
2. 2,204
3. 9,700
4. 17,385

```{r}
# Remove TB-HIV-ART samples
se <- se[, se$disease_status != "TB-HIV-ART"]
dim(se) # should now show 31 samples

# Filter genes with zero expression across all samples
keep <- rowSums(assay(se, "counts")) > 0
n_removed <- sum(!keep)
n_removed # this answers the question

se <- se[keep, ]
dim(se) # should show fewer genes

# Generate log CPM assay
library(edgeR)
logcpm <- cpm(assay(se, "counts"), log = TRUE)
assay(se, "logCPM") <- logcpm

se
```


##### 5. Create a batch corrected assay in your `SummarizedExperiment` using ComBat-Seq. You can use the `ComBat\_Seq` function in the `sva` package, or simply do it in `BatchQC` and then extract the `SummarizedExperiment`. For this example, pretend that `disease\_status` is the batch variable. Note that this will remove the disease status variability, so **don't** use this assay in the following analyses! This was merely a practice for cases where you have an actual batch variable. What command/syntax would work to apply ComBat-Seq?

1. ComBat_seq(as.matrix(assay(se, "counts")), batch = se\$disease_status)
2. ComBat_seq(counts = assay(se), batch = disease_status)
3. BatchFix(se, "counts", batch = "disease_status")
4. combatCorrect(se\$counts, batch = se\$disease_status)
5. ComBat(se, "counts", batch = "disease_status")

```{r}
library(sva)

# ComBat-Seq practice (do NOT use this assay for downstream DE analysis)
combat_counts <- ComBat_seq(
as.matrix(assay(se, "counts")),
batch = se$disease_status
)

# Store it as a separate assay so it's clearly distinguished
assay(se, "combat_counts") <- combat_counts

se
```



##### 6. Apply PCA and UMAP to your data and generate dimension reduction plots for the results. Color the TB-HIV to the HIV only samples in different colors. Note that you should be using the log CPM values for this analysis. Please attach your UMAP plot here.

```{r}
library(umap)

logcpm_mat <- t(assay(se, "logCPM")) # samples as rows for PCA/UMAP

# PCA
pca <- prcomp(logcpm_mat, scale. = TRUE)
pca_df <- data.frame(
PC1 = pca$x[,1],
PC2 = pca$x[,2],
disease_status = se$disease_status
)

library(ggplot2)
ggplot(pca_df, aes(PC1, PC2, color = disease_status)) +
geom_point(size = 3) +
theme_minimal() +
labs(title = "PCA of log CPM values")

# UMAP
set.seed(42)
umap_res <- umap(logcpm_mat)
umap_df <- data.frame(
UMAP1 = umap_res$layout[,1],
UMAP2 = umap_res$layout[,2],
disease_status = se$disease_status
)

ggplot(umap_df, aes(UMAP1, UMAP2, color = disease_status)) +
geom_point(size = 3) +
theme_minimal() +
labs(title = "UMAP of log CPM values")
```


##### 7. Use `DESeq2` to do a differential expression analysis (on the counts) comparing the TB-HIV to the HIV only samples. What is the twenty second gene on the differentially expressed gene list?

1. IGF2BP3
2. AIM2
3. IL1R2
4. RAB20
5. FCGR1C


##### 8. Now conduct the same analysis using `limma` on the log CPM values. How do the `DESeq2` results compare to the `limma` results? What is the seventh gene on the differentially expressed gene list?

1. IGF2BP3
2. AIM2
3. IL1R2
4. RAB20
5. FCGR1C


##### 9. Give a heatmap plot of either the `DESeq2` or the `limma` results (top 50). Add a colorbar for disease status. Upload your plot here.

##### 10. Conduct a pathway analysis of the top 50 Limma genes using enrichR (through R or online). Use the "Reactome\_Pathways\_2024" database. What is the top scoring pathway?

1. "Interferon Gamma Signaling"
2. "Fatty Acid Metabolism"
3. "Cytokine Signaling in Immune System"
4. "RAB Geranylgeranylation"


##### 11. Conduct a TBSignatureProfiler analysis on these data -- including signature heatmaps, individual boxplots, and AUC boxplots. Interpret your findings. Upload one of your TBSignatureProfiler plots here.


724 changes: 724 additions & 0 deletions homework/HW3BMDA.html

Large diffs are not rendered by default.

80 changes: 80 additions & 0 deletions homework/NL BMDA Project.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
---
title: "NL BMDA Final Project"
author: "Nat Lim"
date: "2026-06-13"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Nat Lim BMDA Project

## **Objective**
Take a real biomedical dataset from raw data through interpretation, using one (or more) of the methods we covered this semester. This is a small but **complete data analysis project.** You will present your analysis in a recorded video and provide constructive feedback on your classmates' projects.

## **Instructions**

### **1. Pick a Dataset and an Analysis Type**
Choose one of the following analysis families that we covered in class (or another one you clear with the instructor):

- **GWAS / variant calling** -- alignment, variant calling, association or family/segregation analysis, interpretation of disease-relevant variants.
- **Bulk RNA-seq differential expression** -- counts, normalization, DESeq2 / edgeR / limma, pathway analysis.
- **Microbiome / metagenomics** -- 16S or shotgun reads, taxonomic profiling, alpha / beta diversity, differential abundance.
- **Single-cell RNA-seq** -- QC, normalization, clustering, marker/DE.
- **Another -omics method** (proteomics, methylation, ATAC-seq, etc.) -- get instructor approval first.

Where to find data:

- A dataset provided in lecture (asthma microbiome, HIV/TB RNA-seq, Ogden's Syndrome variants, TBnanostring, ...). You may **not** repeat the exact analysis we did in class -- pick a new biological question, a new comparison, or a new method.
- A public repository: SRA, GEO, dbGaP, TCGA, GTEx, ENCODE, Qiita, ImmPort, etc.
- A dataset from your own research, with PI permission.

### **2. R Markdown Report (60 points)**
Submit a `.Rmd` (or `.qmd`) that, when knit, walks the reader through your analysis end-to-end. At minimum:

- **Background and question** -- one short paragraph describing the dataset, the biological/clinical question, and why it matters.
- **Data acquisition** -- how you got the data (download script, SRA accessions, GEO ID, dataset DOI). Include the code or SLURM script.
- **Preprocessing / QC** -- normalization, filtering, batch correction, etc. as appropriate.
- **Statistical analysis** -- the core method (DESeq2, limma, association test, alpha/beta diversity, clustering, ...) with a brief justification of *why* this method fits your question.
- **Visualization** -- at least one publication-quality figure that supports your main finding.
- **Interpretation** -- what does your top result mean biologically? Cite at least one supporting paper.
- **Reproducibility** -- `sessionInfo()`, links to data, all code included in the Rmd.

### **3. Video Presentation (40 points)**
Record a **5-7 minute video** walking through your analysis. Cover:

- The dataset and your question.
- The pipeline (data $\to$ QC $\to$ analysis $\to$ result).
- Your main finding and one figure that shows it.
- Limitations and reasonable next steps.

Post the video in the Canvas Chat section by **11:59 PM, Friday June 19, 2026**.

### **4. Peer Review (50 points)**
Watch at least **five** of your classmates' videos. For each:

- Name one specific strength of the analysis.
- Suggest one improvement -- an alternative method, an additional QC step, a confounder to check, a follow-up comparison.

Document the videos you reviewed and your comments in the same `.Rmd` from part 2. Submit the updated `.Rmd` on Canvas by **11:59 PM, Friday June 26, 2026**.

## **Grading Breakdown**

| **Category** | **Points** |
|----|----|
| Reproducible R Markdown report with end-to-end analysis | 60 |
| Video presentation with clear walkthrough of question, methods, and result | 40 |
| Peer review: meaningful comments on five classmates' videos | 50 |
| **Total** | **150** |

## **Additional Notes**

- This is a **data analysis** project, not just a plotting project. Your report must include a statistical or model-based step -- not only descriptive plots.
- You may reuse code from lectures or earlier homework, but the question and dataset must be your own, or used in a different way than we did in class.
- Be respectful and constructive in peer reviews -- the goal is to learn from each other.

---

*Happy analyzing!*
549 changes: 549 additions & 0 deletions homework/NL-BMDA-Project.html

Large diffs are not rendered by default.

Loading