A complete end-to-end bioinformatics pipeline for analyzing differential gene expression, functional enrichment, and gene interaction networks in SARS-CoV-2 infected cells.
This project presents a comprehensive exploratory analysis of human transcriptomic responses to SARS-CoV-2 infection using publicly available RNA-seq data (GSE147507). The pipeline integrates:
- Differential Expression Analysis (DESeq2-style normalization, statistical testing)
- Functional Annotation (Gene Ontology, KEGG pathway enrichment)
- Network Analysis (Protein-protein interactions, hub gene identification)
- LLM-Powered Interpretation (Plain-language biological summaries)
Dataset: GSE147507 from NCBI GEO
Samples: 20 (9 Mock controls, 8 SARS-CoV-2 infected, 3 drug-treated)
Platform: RNA-seq (Illumina NextSeq 500)
Cell Types: A549-ACE2, Calu-3 (human lung epithelial cells)
- 365 significantly altered genes (|log2FC| ≥ 1.5, FDR < 0.05)
- 331 upregulated (antiviral & inflammatory response)
- 34 downregulated (metabolic suppression)
Top upregulated genes:
- IFNB1 (6.47 log2FC) - Type I Interferon.
- TNF (5.56 log2FC) - Pro-inflammatory cytokine.
- IL6 (4.46 log2FC) - Cytokine storm mediator.
- CXCL2/3 (~5.2 log2FC) - Neutrophil chemotaxis.
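The FDR < 0.05 cutoff depends on a multiple-testing correction. This README does not name the exact procedure, so the following is a minimal sketch assuming the Benjamini-Hochberg method, a common choice for DEG analysis. The function name `benjamini_hochberg` and the example p-values are illustrative and are not taken from the pipeline's scripts.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative p-values only (not results from GSE147507).
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
qvals = benjamini_hochberg(pvals)
significant = [q < 0.05 for q in qvals]
```

With real data, `statsmodels.stats.multitest.multipletests(pvals, method="fdr_bh")` performs the same adjustment.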
- 205 enriched biological processes (defense response to virus, transcriptional regulation)
- 67 enriched KEGG pathways (TNF signaling, NF-κB, interferon response)
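Enrichment counts like these come from an over-representation test; the Methods table lists a hypergeometric test run through the Enrichr API. As a rough local illustration (not the Enrichr implementation), the upper-tail p-value can be computed directly. The background size (13,803 tested genes) and DEG count (365) come from this README; the term size and overlap are made-up values for demonstration.

```python
from math import comb


def hypergeom_enrichment_p(background, in_term, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn from `background`
    genes, of which `in_term` are annotated to the term."""
    upper = min(in_term, selected)
    total = comb(background, selected)
    tail = sum(
        comb(in_term, k) * comb(background - in_term, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# 13,803 tested genes and 365 DEGs are from this README;
# term size 200 and overlap 25 are hypothetical.
p_example = hypergeom_enrichment_p(
    background=13803, in_term=200, selected=365, overlap=25
)
```

About 5 of the 365 DEGs would land in a 200-gene term by chance, so an overlap of 25 yields a very small p-value. That p-value would still need FDR correction across all tested terms.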
- 80 hub genes in highly connected network (density: 0.597)
- Top hubs: IRF1, FOSB, IER3, CXCL2, NFKBIZ (master regulators)
- 1,886 gene-gene interactions (co-expression network)
- IRF1 - Interferon regulatory factor (central hub)
- NFKBIZ - NF-κB pathway regulator
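The network figures can be cross-checked by hand. For an undirected graph without self-loops, density is 2E / (N(N - 1)), and 80 nodes with 1,886 edges gives 0.597, matching the reported value. The sketch below shows that calculation and one plausible way to derive co-expression edges from a Pearson threshold. The gene names and expression values are toy data, and `coexpression_edges` is a hypothetical helper, not the function used in `05_network_analysis.py`.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_edges(expr, threshold=0.7):
    """Return gene pairs whose Pearson r is at or above the threshold.

    `expr` maps gene name -> expression values across samples.
    """
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(expr), 2)
        if pearson(expr[g1], expr[g2]) >= threshold
    ]


def density(n_nodes, n_edges):
    """Density of an undirected simple graph."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# Toy expression profiles (hypothetical genes, four samples each).
expr = {
    "GENE_A": [1, 2, 3, 4],
    "GENE_B": [2, 4, 6, 8],
    "GENE_C": [4, 3, 2, 1],
    "GENE_D": [1, 3, 2, 4],
}
edges = coexpression_edges(expr)
toy_density = density(len(expr), len(edges))

# Values reported in this README: 80 nodes, 1,886 edges.
reported_density = density(80, 1886)
```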
- TNF pathway - Anti-cytokine therapies (infliximab, adalimumab)
- IL-6 - Tocilizumab (already FDA-approved for COVID-19)
- Reproducible pipeline (phase-gate workflow)
- Publication-quality figures (12 high-resolution plots)
- Statistical rigor (FDR correction, multiple testing)
- Multi-tier LLM integration (Gemini, Groq, local fallback)
- Evidence-grounded interpretations (no hallucinations)
- Educational summaries (plain-language explanations)
- Version controlled (Git with descriptive commits)
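The multi-tier LLM integration can be pictured as an ordered fallback chain: try each provider in turn and use the first one that returns text. The sketch below shows only that control flow. The provider functions are stubs standing in for real Gemini and Groq clients and a local template, and `interpret_with_fallback` is a hypothetical name rather than the one used in `06_llm_interpretation.py`.

```python
def interpret_with_fallback(prompt, providers):
    """Try each (name, callable) provider in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            text = call(prompt)
        except Exception as exc:  # network errors, quota limits, missing keys
            errors.append(f"{name}: {exc}")
            continue
        if text:
            return name, text
        errors.append(f"{name}: empty response")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stubs standing in for real API clients (assumptions for illustration).
def gemini_stub(prompt):
    raise TimeoutError("quota exceeded")


def groq_stub(prompt):
    return ""


def local_template(prompt):
    return f"[local summary] {prompt[:60]}"


used, summary = interpret_with_fallback(
    "Summarize the top DEGs: IFNB1, TNF, IL6",
    [("gemini", gemini_stub), ("groq", groq_stub), ("local", local_template)],
)
```

Putting the local fallback last means the interpretation step still produces output when no API keys are configured.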
- OS: Windows 10/11 (optimized for PowerShell)
- Python: 3.13.2
- RAM: 8GB+ recommended
- Storage: 2GB for data and results
# Clone repository
git clone https://github.com/YOUR_USERNAME/human-transcriptomics-analysis.git
cd human-transcriptomics-analysis
# Create virtual environment
python -m venv transcriptomics_env
.\transcriptomics_env\Scripts\Activate.ps1
# Install dependencies
pip install -r requirements.txt
# Configure API keys (optional for LLM interpretation)
# Create .env file:
GEMINI_API_KEY=your_key_here
GROQ_API_KEY=your_key_here
# Activate environment
.\transcriptomics_env\Scripts\Activate.ps1
# Run complete pipeline (sequential execution)
python scripts/01_inspect_data.py
python scripts/02_preprocess_normalize.py
python scripts/03_differential_expression.py
python scripts/04_functional_annotation.py
python scripts/05_network_analysis.py
python scripts/06_llm_interpretation.py
# Launch Jupyter
jupyter notebook
# Open: notebooks/Complete_Analysis_Pipeline.ipynb
Transcriptomics_Project/
├── data/
│   ├── raw/                              # Original count matrix
│   │   └── covid19_raw_counts.tsv
│   ├── processed/                        # Normalized, filtered data
│   │   ├── counts_filtered_raw.csv
│   │   ├── counts_normalized.csv
│   │   └── counts_log2_transformed.csv
│   └── metadata/
│       ├── covid19_sample_metadata.txt
│       └── metadata_covid_vs_mock.csv
├── scripts/
│   ├── 01_inspect_data.py                # QC & data validation
│   ├── 02_preprocess_normalize.py        # DESeq2 normalization
│   ├── 03_differential_expression.py     # DEG analysis
│   ├── 04_functional_annotation.py       # GO/KEGG enrichment
│   ├── 05_network_analysis.py            # PPI networks
│   └── 06_llm_interpretation.py          # LLM summaries
├── results/
│   ├── figures/                          # 12 publication plots
│   │   ├── 01_library_sizes.png
│   │   ├── 06_volcano_plot.png
│   │   └── 08_network_visualization.png
│   ├── tables/                           # CSV result files
│   │   ├── deg_significant_only.csv
│   │   ├── go_enrichment_*.csv
│   │   └── network_hub_genes.csv
│   ├── FINAL_REPORT.md                   # Executive summary
│   ├── LLM_BIOLOGICAL_INTERPRETATION.md
│   └── EDUCATIONAL_SUMMARY.md
├── notebooks/
│   └── Complete_Analysis_Pipeline.ipynb
├── requirements.txt
├── .gitignore
├── .env                                  # API keys (not committed)
└── README.md
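One way to read "phase-gate workflow" is that each numbered script must succeed before the next one starts. The sketch below is a hypothetical runner built on that reading; the repository may not ship such a file. It launches the six scripts under `scripts/` in order and stops at the first nonzero exit code. The `run` parameter exists only so the gate logic can be tested without starting real processes.

```python
import subprocess
import sys
from pathlib import Path

# Script names as listed in the project tree.
SCRIPTS = [
    "01_inspect_data.py",
    "02_preprocess_normalize.py",
    "03_differential_expression.py",
    "04_functional_annotation.py",
    "05_network_analysis.py",
    "06_llm_interpretation.py",
]


def run_pipeline(scripts=SCRIPTS, scripts_dir=Path("scripts"), run=subprocess.run):
    """Run each script in order and stop at the first failure.

    Returns the names of the scripts that completed successfully.
    """
    completed = []
    for name in scripts:
        # Use the current interpreter so the active virtual environment applies.
        result = run([sys.executable, str(scripts_dir / name)])
        if result.returncode != 0:
            print(f"Stopped: {name} exited with code {result.returncode}")
            break
        completed.append(name)
    return completed


# From the project root, with the environment activated:
# completed = run_pipeline()
```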
Volcano Plot (Differential Expression)

Network Visualization (Hub Genes)

PCA Analysis (Sample Clustering)

| File | Description | Records |
|---|---|---|
| `deg_full_results.csv` | All tested genes | 13,803 |
| `deg_significant_only.csv` | Significant DEGs | 365 |
| `network_hub_genes.csv` | Hub genes with centrality | 80 |
| `go_enrichment_*.csv` | Enriched GO terms/pathways | 295 |
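The DEG tables can be split into up- and downregulated sets using the thresholds reported above (|log2FC| ≥ 1.5, FDR < 0.05). The column names are not documented in this README, so `gene`, `log2FoldChange`, and `padj` below are assumptions, and the rows are made-up toy values. Adjust the names to match the actual CSV headers.

```python
import csv
import io


def classify_degs(rows, lfc_cutoff=1.5, fdr_cutoff=0.05,
                  gene_col="gene", lfc_col="log2FoldChange", fdr_col="padj"):
    """Split rows into (upregulated, downregulated) gene lists."""
    up, down = [], []
    for row in rows:
        lfc = float(row[lfc_col])
        fdr = float(row[fdr_col])
        # Keep only genes passing both the effect-size and FDR thresholds.
        if fdr >= fdr_cutoff or abs(lfc) < lfc_cutoff:
            continue
        (up if lfc > 0 else down).append(row[gene_col])
    return up, down


# Toy table with assumed column names (not real pipeline output).
toy_csv = """gene,log2FoldChange,padj
GENE_A,6.2,0.0001
GENE_B,-2.1,0.01
GENE_C,1.2,0.001
GENE_D,3.0,0.20
GENE_E,1.5,0.049
"""
up, down = classify_degs(csv.DictReader(io.StringIO(toy_csv)))

# For the real file:
# with open("results/tables/deg_full_results.csv", newline="") as fh:
#     up, down = classify_degs(csv.DictReader(fh))
```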
- FINAL_REPORT.md - Executive summary with key findings
- METHODS_DOCUMENTATION.md - Detailed computational methods
- LLM_BIOLOGICAL_INTERPRETATION.md - Plain-language analysis
- EDUCATIONAL_SUMMARY.md - Student-friendly guide
SARS-CoV-2 infection triggers a coordinated transcriptional program:
- Type I/III Interferon Response → Antiviral defense (IFNB1, IFNL1-3)
- Pro-inflammatory Cytokines → Immune recruitment (TNF, IL6, IL1A)
- Chemokine Secretion → Neutrophil attraction (CXCL2, CCL20)
- Transcriptional Activation → NF-κB/IRF1 pathways
- Metabolic Reprogramming → Resource allocation to immunity
Clinical Relevance:
- Cytokine storm pathways identified (TNF, IL-6)
- Therapeutic targets validated (tocilizumab, JAK inhibitors)
- Biomarker candidates for disease severity
| Step | Method | Tool/Library |
|---|---|---|
| Quality Control | Library size filtering, gene filtering | pandas, matplotlib |
| Normalization | DESeq2 median-of-ratios | scipy, numpy |
| DEG Analysis | Welch's t-test + FDR correction | scipy.stats, statsmodels |
| Enrichment | Hypergeometric test | Enrichr API |
| Network | Gene co-expression (Pearson r ≥ 0.7) | networkx |
| Interpretation | LLM-powered summarization | Google Gemini / Groq |
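The median-of-ratios normalization listed in the table can be sketched in plain Python. This follows the general DESeq2 idea: build a per-gene geometric-mean reference, take each sample's median ratio to that reference as its size factor, then divide counts by the size factor. It is an illustrative reimplementation on toy counts, not the code in `02_preprocess_normalize.py`, and the function names are assumptions.

```python
from math import exp, log
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    `counts` is a list of per-gene rows, each a list of raw counts per sample.
    """
    n_samples = len(counts[0])
    # Only genes with nonzero counts in every sample define the reference.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the per-gene geometric mean across samples.
    log_geo_means = [sum(log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide each sample's counts by its size factor."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Toy counts: sample 2 is sequenced twice as deep as sample 1, sample 3 half as deep.
toy_counts = [
    [10, 20, 5],
    [100, 200, 50],
    [0, 4, 1],      # excluded from the reference because of the zero
    [30, 60, 15],
]
factors = size_factors(toy_counts)
normalized = normalize(toy_counts)
```

After normalization, the three samples have matching counts for every gene that differs only by sequencing depth.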
If you use this pipeline or find these results useful, please cite:
@software{transcriptomics_pipeline_2026,
author = {Your Name},
title = {Exploratory Analysis of Human Transcriptomics Data: SARS-CoV-2 Response},
year = {2026},
url = {https://github.com/YOUR_USERNAME/human-transcriptomics-analysis}
}
Original Dataset:
- Blanco-Melo D, et al. (2020). Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19. Cell. GSE147507.
This project is licensed under the MIT License - see LICENSE file for details.
Note: The GSE147507 dataset is publicly available from NCBI GEO under their terms of use.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit changes (`git commit -m "Add AmazingFeature"`)
- Push to branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Project Maintainer: P Sumanth
Email: sumanthp141005@gmail.com
GitHub: @Sumanth1410-git
- NCBI GEO for providing public transcriptomics data
- Enrichr API (Ma'ayan Lab) for functional enrichment
- Google Gemini & Groq for LLM interpretation
- Open-source Python bioinformatics community
If you found this project useful, please consider giving it a star!
Last Updated: February 24, 2026
Status: Complete & Production-Ready