LLM-driven nuanced cell type annotation #1176
Hi @parashardhapola. I'm Jen, the Scientific Community Manager at the Data Lab. Thank you for your interest in the OpenScPCA project! Our team is currently reviewing your proposed ideas, and we look forward to discussing them with you soon. We will get back to you here with any questions and/or next steps within three business days. Please let me know if you have any questions about OpenScPCA. We look forward to hearing more about your plans!
@parashardhapola In the meantime, here is the contributor form. Filling out this form ensures you have agreed to the OpenScPCA terms and conditions and other policies. Please submit this at your earliest convenience. Thank you!
Hi @parashardhapola, thanks for submitting this and your interest in OpenScPCA! This is really interesting! First, I wanted to note that only employees of or researchers with an appointment at a non-profit are eligible for OpenScPCA-related grants. You are welcome to contribute either way, but I wanted to say that upfront just in case that is a barrier (complete eligibility information is available here: https://openscpca.readthedocs.io/en/latest/grant-opportunities/). Regarding the proposal itself, we had some initial, general questions for you and about the CyteType method:
Thanks again for your proposal! We’re looking forward to continuing the discussion.
[EDITS: 2025-07-30: Added new screenshots from updated UI and added links for updated version]
Samples to be annotated
All the samples on scPCA. 😊
As an example, I have annotated all 23 samples in the SCPCP000001 project (high-grade gliomas).
Here are the annotation reports:
CyteType Report: SCPCS000001
CyteType Report: SCPCS000002
CyteType Report: SCPCS000003
CyteType Report: SCPCS000004
CyteType Report: SCPCS000005
CyteType Report: SCPCS000006
CyteType Report: SCPCS000007
CyteType Report: SCPCS000008
CyteType Report: SCPCS000009
CyteType Report: SCPCS000010
CyteType Report: SCPCS000011
CyteType Report: SCPCS000012
CyteType Report: SCPCS000013
CyteType Report: SCPCS000014
CyteType Report: SCPCS000015
CyteType Report: SCPCS000016
CyteType Report: SCPCS000017
CyteType Report: SCPCS000018
CyteType Report: SCPCS000019
CyteType Report: SCPCS000020
CyteType Report: SCPCS000021
CyteType Report: SCPCS000022
CyteType Report: SCPCS000023
Example Annotation card:
Input data
h5ad files (AnnData objects). They should have cluster information (in obs) and the gene symbols (in var).
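The input requirement above can be checked before submitting anything for annotation. Below is a minimal sketch that works on the `obs` and `var` tables as pandas DataFrames (in an AnnData object these would be `adata.obs` and `adata.var`). The column names `cluster` and `gene_symbol` and the function name are assumptions made for illustration; the actual column names depend on how each file was processed.

```python
import pandas as pd


def check_annotation_inputs(obs, var, cluster_key="cluster", symbol_key="gene_symbol"):
    """Return a list of problems; an empty list means the inputs look usable.

    cluster_key and symbol_key are hypothetical column names, not fixed by any tool.
    """
    problems = []
    if cluster_key not in obs.columns:
        problems.append(f"obs is missing cluster column '{cluster_key}'")
    elif obs[cluster_key].isna().any():
        problems.append(f"obs['{cluster_key}'] contains missing values")
    if symbol_key not in var.columns:
        problems.append(f"var is missing gene symbol column '{symbol_key}'")
    elif var[symbol_key].isna().any():
        problems.append(f"var['{symbol_key}'] contains missing values")
    return problems


# Toy tables standing in for adata.obs and adata.var.
obs = pd.DataFrame({"cluster": ["0", "0", "1"]}, index=["cell1", "cell2", "cell3"])
var = pd.DataFrame({"gene_symbol": ["GFAP", "OLIG2"]}, index=["id1", "id2"])

problems_ok = check_annotation_inputs(obs, var)
problems_bad = check_annotation_inputs(obs.drop(columns="cluster"), var)
print(problems_ok)
print(problems_bad)
```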
Detailed protocol for cell type annotation
For annotation, we will use the CyteType package, which uses a multi-agent LLM framework. I have chosen DeepSeek-R1 0528 as the underlying LLM because it offers a good trade-off between cost, speed, and detailed justification. However, other LLMs can also be tested.
The LLMs are provided with context for each cluster, built by summarizing the metadata associated with that cluster.
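To illustrate what a per-cluster metadata summary could look like, here is a rough sketch that reports, for each cluster, the percentage of cells in each metadata category. This is not CyteType's internal implementation, and the function name and the column names (`cluster`, `diagnosis`) are hypothetical.

```python
import pandas as pd


def summarize_cluster_metadata(obs, cluster_key, metadata_keys):
    """For each cluster, compute the percentage of cells per metadata category."""
    summary = {}
    for cluster, cells in obs.groupby(cluster_key):
        summary[str(cluster)] = {
            key: (cells[key].value_counts(normalize=True) * 100).round(1).to_dict()
            for key in metadata_keys
        }
    return summary


# Toy obs table; the column names are illustrative only.
obs = pd.DataFrame(
    {
        "cluster": ["0", "0", "0", "1"],
        "diagnosis": ["glioblastoma", "glioblastoma", "diffuse midline glioma", "glioblastoma"],
    }
)

summary = summarize_cluster_metadata(obs, "cluster", ["diagnosis"])
print(summary)
```

A summary of this kind can be turned into a short text description per cluster and passed along with the marker genes.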
In terms of preprocessing, a marker gene search needs to be performed for the clusters present in the data. The top 75 marker genes are provided to CyteType.
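The marker gene step can be sketched as follows. This is a deliberately simplified stand-in: it ranks genes by the difference in mean expression between a cluster and all other cells, whereas in practice a proper differential expression test (for example, scanpy's `rank_genes_groups`) would normally be used. The function name and toy data are assumptions for illustration, and the sketch assumes at least two clusters.

```python
import numpy as np


def top_marker_genes(expr, clusters, gene_symbols, n_top=75):
    """Rank genes per cluster by (mean in cluster) - (mean in all other cells).

    expr: cells x genes matrix of (log-normalized) expression values.
    Assumes at least two clusters, so every cluster has a non-empty "rest" group.
    """
    expr = np.asarray(expr, dtype=float)
    clusters = np.asarray(clusters)
    gene_symbols = np.asarray(gene_symbols)
    markers = {}
    for cluster in np.unique(clusters):
        in_cluster = clusters == cluster
        score = expr[in_cluster].mean(axis=0) - expr[~in_cluster].mean(axis=0)
        order = np.argsort(score)[::-1][:n_top]  # highest score first
        markers[str(cluster)] = gene_symbols[order].tolist()
    return markers


# Toy data: 4 cells x 3 genes.
expr = [[5, 0, 1], [4, 0, 1], [0, 6, 1], [0, 5, 1]]
clusters = ["0", "0", "1", "1"]
genes = ["GFAP", "OLIG2", "ACTB"]

markers = top_marker_genes(expr, clusters, genes, n_top=1)
print(markers)
```

With real data, `n_top=75` would match the number of marker genes mentioned above.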
A fully self-contained notebook is available here:
https://colab.research.google.com/drive/1BDUQwH1mIoX1cJQEtTt_4gY9Es90X-Yg?usp=sharing
Citations of scientific literature or publicly available analyses
https://github.com/NygenAnalytics/CyteType
To my knowledge, CyteType is the only tool that provides detailed contextual information about the clusters and can adapt to disease contexts.
Potential pitfalls
LLMs can hallucinate, so annotations can be cross-checked across multiple LLMs. CyteType mitigates this to a large extent through a reviewer agent that double-checks each annotation and assigns it a confidence score.
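One simple way to cross-check LLMs is to compare the labels they assign to each cluster and flag clusters where they disagree. The sketch below assumes the annotations have already been collected into a dictionary per model; the model names, the function name, and the exact-string comparison are all illustrative assumptions. Since different LLMs may phrase the same cell type differently, labels would likely need normalizing first.

```python
from collections import Counter


def flag_disagreements(annotations_by_model, min_agreement=1.0):
    """annotations_by_model maps model name -> {cluster: label}.

    Returns, per cluster, the most common label, the fraction of models that
    agree with it, and whether the cluster should be reviewed manually.
    """
    clusters = set().union(*(a.keys() for a in annotations_by_model.values()))
    report = {}
    for cluster in sorted(clusters):
        labels = [a[cluster] for a in annotations_by_model.values() if cluster in a]
        top_label, count = Counter(labels).most_common(1)[0]
        agreement = count / len(labels)
        report[cluster] = {
            "consensus": top_label,
            "agreement": agreement,
            "needs_review": agreement < min_agreement,
        }
    return report


# Hypothetical outputs from two different LLMs.
annotations = {
    "model_a": {"0": "astrocyte", "1": "microglia"},
    "model_b": {"0": "astrocyte", "1": "macrophage"},
}

report = flag_disagreements(annotations)
print(report)
```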
Required files
All information is pulled from the h5ad files.
Other details
No computational resources are needed beyond a regular laptop. For large AnnData objects (more than 100K cells), subsampling can be performed as shown in the CyteType documentation.
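As a rough illustration of subsampling, the sketch below draws a random subset of cells from each cluster in proportion to its size, so that small clusters are not lost. This is not the procedure from the CyteType documentation, just one possible approach; the function name and parameters are assumptions.

```python
import numpy as np


def subsample_per_cluster(clusters, max_cells, seed=0):
    """Return sorted cell indices, keeping roughly max_cells cells in total.

    Each cluster is downsampled by the same fraction and keeps at least one
    cell, so the total can slightly exceed max_cells when there are many
    very small clusters.
    """
    clusters = np.asarray(clusters)
    if len(clusters) <= max_cells:
        return np.arange(len(clusters))
    rng = np.random.default_rng(seed)
    fraction = max_cells / len(clusters)
    keep = []
    for cluster in np.unique(clusters):
        idx = np.flatnonzero(clusters == cluster)
        n_keep = max(1, int(len(idx) * fraction))
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))


# Toy example: 1,000 cells in two clusters, reduced to about 100.
clusters = np.array(["0"] * 900 + ["1"] * 100)
kept = subsample_per_cluster(clusters, max_cells=100)
print(len(kept), np.unique(clusters[kept]))
```

With an AnnData object, the returned indices could then be used to subset the data, for example `adata[kept].copy()`.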