End-to-end bioinformatics pipeline for SARS-CoV-2 transcriptomics analysis with LLM-powered interpretation

Sumanth1410-git/human-transcriptomics-analysis

Exploratory Analysis of Human Transcriptomics Data

SARS-CoV-2 Infection Response Study

Python 3.13 License: MIT Status: Complete

A complete end-to-end bioinformatics pipeline for analyzing differential gene expression, functional enrichment, and gene interaction networks in SARS-CoV-2 infected cells.


🔬 Overview

This project presents a comprehensive exploratory analysis of human transcriptomic responses to SARS-CoV-2 infection using publicly available RNA-seq data (GSE147507). The pipeline integrates:

  • Differential Expression Analysis (DESeq2-style normalization, statistical testing)
  • Functional Annotation (Gene Ontology, KEGG pathway enrichment)
  • Network Analysis (Gene co-expression network, hub gene identification)
  • LLM-Powered Interpretation (Plain-language biological summaries)

Dataset: GSE147507 from NCBI GEO
Samples: 20 (9 Mock controls, 8 SARS-CoV-2 infected, 3 drug-treated)
Platform: RNA-seq (Illumina NextSeq 500)
Cell Types: A549-ACE2, Calu-3 (human lung epithelial cells)


🎯 Key Findings

Differential Expression

  • 365 significantly altered genes (|log2FC| ≥ 1.5, FDR < 0.05)
  • 331 upregulated (antiviral & inflammatory response)
  • 34 downregulated (metabolic suppression)
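
The two thresholds above can be applied as in the following minimal sketch. It is not the repository's code: the function names, the row layout, and the toy genes and p-values are all assumptions made for illustration, and it uses only the standard library rather than the statsmodels routines the project lists.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def classify_genes(rows, lfc_cutoff=1.5, fdr_cutoff=0.05):
    """Label each (gene, log2fc, pvalue) row as 'up', 'down', or 'ns'."""
    fdr = benjamini_hochberg([p for _, _, p in rows])
    labels = {}
    for (gene, lfc, _), q in zip(rows, fdr):
        if q < fdr_cutoff and abs(lfc) >= lfc_cutoff:
            labels[gene] = "up" if lfc > 0 else "down"
        else:
            labels[gene] = "ns"  # not significant under both cutoffs
    return labels


# Toy input: gene names, fold changes, and p-values are invented.
toy_rows = [
    ("GENE_A", 6.0, 0.001),
    ("GENE_B", -2.0, 0.004),
    ("GENE_C", 0.4, 0.003),
    ("GENE_D", 3.0, 0.60),
]
labels = classify_genes(toy_rows)
print(labels)
```

A gene must pass both cutoffs to count as a DEG: GENE_C has a small adjusted p-value but too small a fold change, and GENE_D has a large fold change but no statistical support.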

Top upregulated genes:

  • IFNB1 (6.47 log2FC) - Type I Interferon.
  • TNF (5.56 log2FC) - Pro-inflammatory cytokine.
  • IL6 (4.46 log2FC) - Cytokine storm mediator.
  • CXCL2/3 (~5.2 log2FC) - Neutrophil chemotaxis.

Functional Enrichment

  • 205 enriched biological processes (defense response to virus, transcriptional regulation)
  • 67 enriched KEGG pathways (TNF signaling, NF-κB, interferon response)
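
For readers unfamiliar with over-representation testing, here is a rough standard-library sketch of the hypergeometric tail probability behind a single term's p-value. The function name and the toy numbers are assumptions. The pipeline itself obtains its enrichment results from the Enrichr API, so this only illustrates the idea and does not reproduce that service's exact computation.

```python
from math import comb


def hypergeom_enrichment_p(overlap, list_size, set_size, background):
    """P(X >= overlap) when list_size genes are drawn from a background
    of which set_size belong to the pathway or GO term."""
    upper = min(list_size, set_size)
    total = comb(background, list_size)
    tail = sum(
        comb(set_size, k) * comb(background - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers (invented): 20 background genes, 5 in the pathway,
# 4 genes in the DEG list, 3 of which fall in the pathway.
p = hypergeom_enrichment_p(overlap=3, list_size=4, set_size=5, background=20)
print(round(p, 4))  # 0.032
```

In a real run the background would be the set of tested genes and the list would be the significant DEGs. The resulting p-values across all terms would then need multiple-testing correction.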

Network Analysis

  • 80 hub genes in highly connected network (density: 0.597)
  • Top hubs: IRF1, FOSB, IER3, CXCL2, NFKBIZ (master regulators)
  • 1,886 gene-gene interactions (co-expression network)
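
As a sanity check on these numbers, the sketch below builds a tiny co-expression graph by thresholding Pearson correlations and recomputes the density reported above from the node and edge counts. It is a standard-library illustration under assumed names and toy data. The project itself uses networkx, and its exact hub-selection criteria are not shown here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_edges(expr, r_cutoff=0.7):
    """Gene pairs whose correlation across samples is >= r_cutoff."""
    return [
        (a, b)
        for a, b in combinations(sorted(expr), 2)
        if pearson(expr[a], expr[b]) >= r_cutoff
    ]


def density(n_nodes, n_edges):
    """Undirected graph density: edges present / edges possible."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# Toy expression profiles (invented values, four samples per gene).
toy_expr = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.0],  # rises with G1
    "G3": [1.0, 2.1, 2.9, 4.2],  # rises with G1
    "G4": [4.0, 3.0, 2.0, 1.0],  # falls as G1 rises
}
edges = coexpression_edges(toy_expr)
print(edges)                          # [('G1', 'G2'), ('G1', 'G3'), ('G2', 'G3')]
print(density(len(toy_expr), len(edges)))  # 0.5
print(round(density(80, 1886), 3))    # 0.597
```

The last line confirms that 80 nodes and 1,886 edges give the stated density of 0.597, since 80 genes allow at most 3,160 undirected pairs.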

Therapeutic Targets Identified

  1. IRF1 - Interferon regulatory factor (central hub)
  2. NFKBIZ - NF-κB pathway regulator
  3. TNF pathway - Anti-cytokine therapies (infliximab, adalimumab)
  4. IL-6 - Tocilizumab, an IL-6 receptor antagonist (already FDA-approved for COVID-19)

✨ Features

  • ✅ Reproducible pipeline (phase-gate workflow)
  • ✅ Publication-quality figures (12 high-resolution plots)
  • ✅ Statistical rigor (FDR correction, multiple testing)
  • ✅ Multi-tier LLM integration (Gemini, Groq, local fallback)
  • ✅ Evidence-grounded interpretations (no hallucinations)
  • ✅ Educational summaries (plain-language explanations)
  • ✅ Version controlled (Git with descriptive commits)
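
One plausible shape for the multi-tier LLM fallback listed above is sketched here. Everything in it is hypothetical except the two environment variable names, which come from the Setup section: the `interpret` function, the tier tuples, and the stub callables are illustrative stand-ins, and no real Gemini or Groq client call is shown.

```python
import os


def local_fallback(prompt):
    """Deterministic last-resort summary that needs no network or API key."""
    return f"[local summary] {prompt[:60]}"


def failing_stub(prompt):
    """Stand-in for a remote provider call; always simulates an outage."""
    raise ConnectionError("simulated outage")


def interpret(prompt, tiers):
    """Try each (name, env_var, call) tier in order; return (name, text)."""
    for name, env_var, call in tiers:
        if env_var and not os.environ.get(env_var):
            continue  # No key configured: skip this tier.
        try:
            return name, call(prompt)
        except Exception:
            continue  # Provider failed: fall through to the next tier.
    raise RuntimeError("all tiers failed")


tiers = [
    ("gemini", "GEMINI_API_KEY", failing_stub),  # hypothetical wiring
    ("groq", "GROQ_API_KEY", failing_stub),      # hypothetical wiring
    ("local", None, local_fallback),
]
name, text = interpret("IRF1 is a hub gene", tiers)
print(name, "->", text)  # local -> [local summary] IRF1 is a hub gene
```

Because the local tier needs no key and no network, the pipeline can still produce a summary when both remote tiers are unavailable. A production version would likely catch narrower exception types and log which tier answered.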

🛠️ Installation

Prerequisites

  • OS: Windows 10/11 (optimized for PowerShell)
  • Python: 3.13.2
  • RAM: 8GB+ recommended
  • Storage: 2GB for data and results

Setup

# Clone repository
git clone https://github.com/Sumanth1410-git/human-transcriptomics-analysis.git
cd human-transcriptomics-analysis

# Create virtual environment
python -m venv transcriptomics_env
.\transcriptomics_env\Scripts\Activate.ps1

# Install dependencies
pip install -r requirements.txt

# Configure API keys (optional for LLM interpretation)
# Create .env file:
GEMINI_API_KEY=your_key_here
GROQ_API_KEY=your_key_here

🚀 Usage

Quick Start

# Activate environment
.\transcriptomics_env\Scripts\Activate.ps1

# Run complete pipeline (sequential execution)
python scripts/01_inspect_data.py
python scripts/02_preprocess_normalize.py
python scripts/03_differential_expression.py
python scripts/04_functional_annotation.py
python scripts/05_network_analysis.py
python scripts/06_llm_interpretation.py
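
If typing six commands is inconvenient, a small wrapper along these lines could run the phases in order and stop at the first failure, which matches the phase-gate idea. This is a sketch, not a file in the repository: the `run_pipeline` helper and its injectable `run` parameter are assumptions, while the script paths are the ones listed above.

```python
import subprocess
import sys

PIPELINE = [
    "scripts/01_inspect_data.py",
    "scripts/02_preprocess_normalize.py",
    "scripts/03_differential_expression.py",
    "scripts/04_functional_annotation.py",
    "scripts/05_network_analysis.py",
    "scripts/06_llm_interpretation.py",
]


def run_pipeline(scripts=PIPELINE, run=subprocess.run):
    """Run each script with the current interpreter, in order.

    Stops at the first non-zero exit code and returns the list of
    scripts that completed successfully.
    """
    completed = []
    for script in scripts:
        result = run([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} failed with exit code {result.returncode}")
            break
        completed.append(script)
    return completed


# Example call from the repository root (left commented out so that
# importing or pasting this snippet does not start a run):
# finished = run_pipeline()
```

Using `sys.executable` keeps every phase inside the activated virtual environment. The `run` parameter exists only so the function can be exercised with a fake runner in tests.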

Alternative: Jupyter Notebook

# Launch Jupyter
jupyter notebook

# Open: notebooks/Complete_Analysis_Pipeline.ipynb

📁 Project Structure

Transcriptomics_Project/
├── data/
│   ├── raw/                          # Original count matrix
│   │   └── covid19_raw_counts.tsv
│   ├── processed/                    # Normalized, filtered data
│   │   ├── counts_filtered_raw.csv
│   │   ├── counts_normalized.csv
│   │   └── counts_log2_transformed.csv
│   └── metadata/
│       ├── covid19_sample_metadata.txt
│       └── metadata_covid_vs_mock.csv
├── scripts/
│   ├── 01_inspect_data.py           # QC & data validation
│   ├── 02_preprocess_normalize.py   # DESeq2-style normalization
│   ├── 03_differential_expression.py # DEG analysis
│   ├── 04_functional_annotation.py  # GO/KEGG enrichment
│   ├── 05_network_analysis.py       # Co-expression network
│   └── 06_llm_interpretation.py     # LLM summaries
├── results/
│   ├── figures/                      # 12 publication plots
│   │   ├── 01_library_sizes.png
│   │   ├── 06_volcano_plot.png
│   │   └── 08_network_visualization.png
│   ├── tables/                       # CSV result files
│   │   ├── deg_significant_only.csv
│   │   ├── go_enrichment_*.csv
│   │   └── network_hub_genes.csv
│   ├── FINAL_REPORT.md              # Executive summary
│   ├── LLM_BIOLOGICAL_INTERPRETATION.md
│   └── EDUCATIONAL_SUMMARY.md
├── notebooks/
│   └── Complete_Analysis_Pipeline.ipynb
├── requirements.txt
├── .gitignore
├── .env                              # API keys (not committed)
└── README.md

📊 Results

Key Visualizations

Volcano Plot (Differential Expression)

Network Visualization (Hub Genes)

PCA Analysis (Sample Clustering)

Output Files

| File | Description | Records |
|---|---|---|
| deg_full_results.csv | All tested genes | 13,803 |
| deg_significant_only.csv | Significant DEGs | 365 |
| network_hub_genes.csv | Hub genes with centrality | 80 |
| go_enrichment_*.csv | Enriched GO terms/pathways | 295 |

📖 Documentation


🧬 Biological Interpretation

SARS-CoV-2 infection triggers a coordinated transcriptional program:

  1. Type I/III Interferon Response → Antiviral defense (IFNB1, IFNL1-3)
  2. Pro-inflammatory Cytokines → Immune recruitment (TNF, IL6, IL1A)
  3. Chemokine Secretion → Neutrophil attraction (CXCL2, CCL20)
  4. Transcriptional Activation → NF-κB/IRF1 pathways
  5. Metabolic Reprogramming → Resource allocation to immunity

Clinical Relevance:

  • Cytokine storm pathways identified (TNF, IL-6)
  • Therapeutic targets consistent with existing treatments (tocilizumab, JAK inhibitors)
  • Biomarker candidates for disease severity

🔬 Methods Summary

| Step | Method | Tool/Library |
|---|---|---|
| Quality Control | Library size filtering, gene filtering | pandas, matplotlib |
| Normalization | DESeq2 median-of-ratios | scipy, numpy |
| DEG Analysis | Welch's t-test + FDR correction | scipy.stats, statsmodels |
| Enrichment | Hypergeometric test | Enrichr API |
| Network | Gene co-expression (Pearson r ≥ 0.7) | networkx |
| Interpretation | LLM-powered summarization | Google Gemini / Groq |
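
To make the normalization step concrete, here is a minimal standard-library sketch of median-of-ratios size factors. The function names, data layout, and toy counts are assumptions. The repository's own implementation uses numpy and scipy and may handle zeros and filtering differently.

```python
from math import exp, log
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factor per sample.

    counts: {sample: [count per gene]}, same gene order in every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Keep genes with non-zero counts everywhere so the log is defined.
    usable = [g for g in range(n_genes)
              if all(counts[s][g] > 0 for s in samples)]
    # Log of the geometric mean of each usable gene across samples.
    log_geo = {g: sum(log(counts[s][g]) for s in samples) / len(samples)
               for g in usable}
    return {
        s: exp(median(log(counts[s][g]) - log_geo[g] for g in usable))
        for s in samples
    }


def normalize(counts):
    """Divide every count by its sample's size factor."""
    sf = size_factors(counts)
    return {s: [c / sf[s] for c in counts[s]] for s in counts}


# Toy counts (invented): sample B was sequenced twice as deeply as A.
toy_counts = {"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]}
sf = size_factors(toy_counts)
norm = normalize(toy_counts)
print({s: round(v, 4) for s, v in sf.items()})  # {'A': 0.7071, 'B': 1.4142}
```

After normalization the first three genes have matching values in both samples, which is the intended effect: differences caused purely by sequencing depth are removed before the statistical test.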

🎓 Citation

If you use this pipeline or find these results useful, please cite:

@software{transcriptomics_pipeline_2026,
  author = {P Sumanth},
  title = {Exploratory Analysis of Human Transcriptomics Data: SARS-CoV-2 Response},
  year = {2026},
  url = {https://github.com/Sumanth1410-git/human-transcriptomics-analysis}
}

Original Dataset:

  • Blanco-Melo D, et al. (2020). Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19. Cell. GSE147507.

📜 License

This project is licensed under the MIT License - see LICENSE file for details.

Note: The GSE147507 dataset is publicly available from NCBI GEO under their terms of use.


🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit changes (git commit -m "Add AmazingFeature")
  4. Push to branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📧 Contact

Project Maintainer: P Sumanth

Email: sumanthp141005@gmail.com

GitHub: @Sumanth1410-git


🙏 Acknowledgments

  • NCBI GEO for providing public transcriptomics data
  • Enrichr API (Ma'ayan Lab) for functional enrichment
  • Google Gemini & Groq for LLM interpretation
  • Open-source Python bioinformatics community

⭐ If you found this project useful, please consider giving it a star!


Last Updated: February 24, 2026
Status: ✅ Complete & Production-Ready
