sarehsoltani/scGPT-based-Single-Cell-RNA-seq-Immune-Cell-Annotation

Immune Cell Type Annotation from scRNA-seq using scGPT

We applied the scGPT transformer-based model to single-cell RNA-seq data to classify and annotate immune cell types from blood samples. This end-to-end pipeline processed ~20,000 cells and leveraged scGPT embeddings for dimensionality reduction, clustering, and immune subtype classification.

We identified ~10 immune subpopulations and validated them against canonical marker genes, working from a highly variable gene (HVG) subset. The results illustrate how embeddings from a large pre-trained transformer can support high-resolution immune profiling of complex, high-dimensional single-cell data.


📊 1. Data Preprocessing

  • Loaded raw single-cell RNA-seq count matrix.
  • Performed quality control: filtered low-quality cells and genes.
  • Normalized and log-transformed gene expression values.
  • Selected highly variable genes (HVGs) for downstream analysis.
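
The steps above can be illustrated with a small dependency-free sketch. It is not the repository's code: the function names, thresholds, and toy count matrix are hypothetical, and HVG selection is simplified to a plain variance ranking. In practice, a library such as Scanpy provides equivalents (`sc.pp.filter_cells`, `sc.pp.normalize_total`, `sc.pp.log1p`, `sc.pp.highly_variable_genes`).

```python
import math

def filter_cells(counts, min_genes=2):
    """Keep cells (rows) in which at least `min_genes` genes are detected."""
    return [row for row in counts if sum(1 for c in row if c > 0) >= min_genes]

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell to `target_sum` total counts, then apply log(1 + x)."""
    out = []
    for row in counts:
        total = sum(row)
        scale = target_sum / total if total > 0 else 0.0
        out.append([math.log1p(c * scale) for c in row])
    return out

def top_variable_genes(matrix, n_top=2):
    """Return column indices of the `n_top` genes with the highest variance."""
    n_cells = len(matrix)
    variances = []
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix]
        mean = sum(col) / n_cells
        variances.append(sum((x - mean) ** 2 for x in col) / n_cells)
    order = sorted(range(len(variances)), key=lambda j: variances[j], reverse=True)
    return order[:n_top]

# Toy matrix: rows are cells, columns are genes (hypothetical values).
raw = [
    [10, 0, 5, 5],
    [0, 0, 1, 0],   # only one gene detected, so this cell is filtered out
    [2, 8, 5, 5],
    [9, 1, 5, 5],
]
kept = filter_cells(raw, min_genes=2)
logged = normalize_log1p(kept, target_sum=20)
hvg_idx = top_variable_genes(logged, n_top=2)
print(len(kept), hvg_idx)
```

Genes 2 and 3 have identical values in every retained cell, so the variance ranking drops them and keeps the two genes that actually differ between cells.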

🤖 2. Embedding Extraction with scGPT

  • Used the pre-trained whole_human checkpoint from scGPT.
  • Extracted cell embeddings using the embed_data function for all ~20K cells.
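
A minimal sketch of the embedding step follows. It assumes a scGPT release in which `scgpt.tasks.embed_data` accepts an AnnData object, a checkpoint directory, and the `gene_col` and `batch_size` keyword arguments, and in which a checkpoint folder holds `best_model.pt`, `args.json`, and `vocab.json`. The helper names and the default values shown are hypothetical, not taken from this repository.

```python
from pathlib import Path

# Files a scGPT checkpoint folder is assumed to contain:
# model weights, model hyperparameters, and the gene vocabulary.
EXPECTED_FILES = ("best_model.pt", "args.json", "vocab.json")

def missing_checkpoint_files(model_dir):
    """Return the expected checkpoint files that are absent from `model_dir`."""
    model_dir = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]

def embed_cells(adata, model_dir, gene_col="feature_name", batch_size=64):
    """Embed every cell in `adata` with a pre-trained scGPT checkpoint."""
    missing = missing_checkpoint_files(model_dir)
    if missing:
        raise FileNotFoundError(f"checkpoint is missing: {missing}")
    # Imported lazily so the check above works without scGPT installed.
    from scgpt.tasks import embed_data
    return embed_data(
        adata,
        Path(model_dir),
        gene_col=gene_col,      # column of adata.var holding gene symbols (assumed name)
        batch_size=batch_size,
    )
```

With the default settings assumed here, the returned AnnData carries the embeddings in `adata.obsm["X_scGPT"]`, one row per cell.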

🧬 3. Clustering and Visualization

  • Applied UMAP for 2D visualization of the scGPT-generated cell embeddings.
  • Performed Leiden clustering to detect distinct cell populations.
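
One way to run these two steps is sketched below, assuming Scanpy (with its Leiden dependency) is installed and the embeddings are stored under `adata.obsm["X_scGPT"]`. The function name, the `resolution` value, and the seed are illustrative choices, not the settings used in this project.

```python
def cluster_and_project(adata, rep_key="X_scGPT", resolution=1.0, seed=0):
    """Build a kNN graph on the embeddings, then run Leiden and UMAP."""
    if rep_key not in adata.obsm:
        raise KeyError(f"{rep_key!r} not found in adata.obsm; run the embedding step first")
    # Imported lazily so the input check above works without Scanpy installed.
    import scanpy as sc
    sc.pp.neighbors(adata, use_rep=rep_key)  # neighbors in embedding space, not gene space
    sc.tl.leiden(adata, resolution=resolution, key_added="leiden", random_state=seed)
    sc.tl.umap(adata, random_state=seed)     # 2D coordinates stored in adata.obsm["X_umap"]
    return adata
```

After this call, `sc.pl.umap(adata, color="leiden")` would draw the clusters on the 2D projection. A higher `resolution` yields more, smaller clusters.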

Figure: UMAP of scGPT embeddings with Leiden clusters

🧾 4. Cell Type Annotation

  • Used canonical marker genes to assign immune cell identities.
  • Grouped cell types into three major immune categories:
    • Lymphocyte
    • Myeloid
    • Platelet
  • Visualized both detailed annotations and grouped categories on UMAP plots.
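
Marker-based assignment can be sketched as a simple scoring rule: each cluster receives the cell type whose markers have the highest mean expression in that cluster. The marker panel below uses commonly cited genes but is only illustrative; it is not the list used in this project, and the function name and toy expression values are hypothetical.

```python
# Illustrative marker panel (an assumption, not the author's list).
MARKERS = {
    "CD4+ T cells": ["CD3D", "CD4"],
    "CD8+ T cells": ["CD3D", "CD8A"],
    "B cells": ["MS4A1", "CD79A"],
    "Natural Killer (NK) cells": ["NKG7", "GNLY"],
    "Monocytes": ["CD14", "LYZ"],
    "Platelets": ["PPBP", "PF4"],
}

def annotate_clusters(cluster_means, markers=MARKERS):
    """Label each cluster with the cell type whose markers score highest.

    cluster_means: {cluster_id: {gene: mean log-expression in that cluster}}
    Genes absent from a cluster's dictionary count as 0.
    """
    labels = {}
    for cluster, expr in cluster_means.items():
        scores = {
            cell_type: sum(expr.get(g, 0.0) for g in genes) / len(genes)
            for cell_type, genes in markers.items()
        }
        labels[cluster] = max(scores, key=scores.get)
    return labels

# Toy per-cluster mean expression (hypothetical values).
cluster_means = {
    "0": {"CD3D": 2.1, "CD4": 1.4, "CD8A": 0.1},
    "1": {"MS4A1": 2.5, "CD79A": 2.0, "CD3D": 0.1},
    "2": {"CD14": 1.9, "LYZ": 3.0},
}
labels = annotate_clusters(cluster_means)
print(labels)
```

A rule this simple always returns some label, even for a cluster that expresses none of the markers, so ambiguous clusters still need manual review.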

Figure: Annotated UMAP with immune subtypes and categories

✅ Results

  • UMAP visualization revealed clear separation of major immune cell types.
  • Leiden clustering detected distinct subpopulations with high biological relevance.
  • Marker gene-based annotation identified the following cell types:
    • CD4+ T cells
    • CD8+ T cells
    • Regulatory T cells (Tregs)
    • B cells
    • Plasma cells
    • Natural Killer (NK) cells
    • Monocytes
    • Dendritic cells
    • Macrophages
    • Neutrophils
    • Platelets
    • Erythrocytes
  • Main immune categories (Lymphocyte, Myeloid, Platelet) were visualized and quantified.
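
Quantifying the major categories amounts to mapping each annotated cell type to its category and counting. The mapping below is an assumption based on standard lineage assignments; the README does not state which category erythrocytes fall under, so this sketch leaves them as "Unassigned". The function name and toy labels are hypothetical.

```python
from collections import Counter

# Assumed cell type -> category mapping (not taken from the repository).
CATEGORY = {
    "CD4+ T cells": "Lymphocyte",
    "CD8+ T cells": "Lymphocyte",
    "Regulatory T cells (Tregs)": "Lymphocyte",
    "B cells": "Lymphocyte",
    "Plasma cells": "Lymphocyte",
    "Natural Killer (NK) cells": "Lymphocyte",
    "Monocytes": "Myeloid",
    "Dendritic cells": "Myeloid",
    "Macrophages": "Myeloid",
    "Neutrophils": "Myeloid",
    "Platelets": "Platelet",
}

def category_fractions(cell_types):
    """Return the fraction of cells in each major category."""
    counts = Counter(CATEGORY.get(t, "Unassigned") for t in cell_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}

# Toy per-cell annotations (hypothetical).
per_cell = ["CD4+ T cells"] * 5 + ["B cells"] * 2 + ["Monocytes"] * 2 + ["Platelets"]
fractions = category_fractions(per_cell)
print(fractions)
```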

✅ Strengths

  • Transformer-powered: scGPT models complex, nonlinear, and sparse single-cell data.
  • Pre-trained on large-scale data: The whole_human checkpoint enables strong generalization and transfer learning.
  • Gene-gene and cell-cell interactions: Captures long-range biological dependencies.
  • Informative embeddings: Cell types and subtypes separate clearly in the UMAP projection.
  • Highly flexible: Adaptable to multi-modal data, metadata integration, and custom tasks.
  • Scalable: Inference with the pre-trained checkpoint covered all ~20K cells in this dataset.
  • Potentially useful for rare cell types: Pre-trained embeddings may help annotate underrepresented populations.
  • Minimal fine-tuning required: Transfer learning enables use on new datasets with limited labeled data.

⚠️ Limitations

  • Requires careful preprocessing and thoughtful marker selection.
  • Performance depends on pre-trained model quality and comprehensive marker gene lists.
  • Computationally intensive, particularly when training from scratch.
  • Transformer models are less interpretable than simpler alternatives (e.g., PCA, scVI).

🔗 Resources
