Some datasets have a large number of features (> 30,000), which makes some processing steps difficult. Reducing the feature count during dataset processing would help with scalability. Some options include (see the sketch after this list):
- Selecting the top X highly variable genes (HVGs), where X is large (say 20,000)
- Selecting the top X genes with highest mean expression
- Selecting the top X genes with lowest percentage zeros
- Computing a high-dimensional (say 500 components) PCA/SVD and using the result as input
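
A minimal sketch of what these options could look like, assuming an AnnData object and the scanpy API; the function name `reduce_features`, the method labels, and the default of 20,000 features are illustrative assumptions, not a settled interface:

```python
import numpy as np
import scanpy as sc

def reduce_features(adata, n_features=20_000, method="hvg"):
    """Sketch of the candidate feature-reduction strategies (names are assumptions)."""
    if method == "hvg":
        # Top n_features highly variable genes
        sc.pp.highly_variable_genes(adata, n_top_genes=n_features)
        keep = adata.var["highly_variable"].to_numpy()
    elif method == "mean":
        # Top n_features genes by mean expression
        means = np.asarray(adata.X.mean(axis=0)).ravel()
        keep = np.zeros(adata.n_vars, dtype=bool)
        keep[np.argsort(means)[-n_features:]] = True
    elif method == "zeros":
        # Top n_features genes with the lowest percentage of zeros
        n_nonzero = np.asarray((adata.X != 0).sum(axis=0)).ravel()
        keep = np.zeros(adata.n_vars, dtype=bool)
        keep[np.argsort(-n_nonzero)[:n_features]] = True
    elif method == "pca":
        # High-dimensional PCA; stored in .obsm rather than subsetting genes
        sc.pp.pca(adata, n_comps=500)
        return adata
    else:
        raise ValueError(f"Unknown method: {method}")
    return adata[:, keep].copy()
```

The first three options return a gene-subset copy of the dataset, while the PCA option keeps all genes and stores the embedding in `.obsm`, so downstream code would need to handle those two cases differently.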
The goal would be to limit dimensionality enough that processing, methods, and metrics can run, while removing as little information from the dataset as possible. Selection could be used only during preprocessing, or applied to the stored dataset itself so that all methods and metrics see a consistent set of features.
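
The two usage modes could look roughly like the following, reusing the hypothetical `reduce_features` helper from the sketch above; this is an illustration of the trade-off, not a proposed implementation:

```python
# Reduce once, then choose where the reduction applies.
reduced = reduce_features(adata, n_features=20_000, method="hvg")

# Option A: preprocessing only — run expensive steps on the reduced view,
# then copy results back onto the full object, keeping all genes available.
sc.pp.pca(reduced, n_comps=50)
adata.obsm["X_pca"] = reduced.obsm["X_pca"]

# Option B: apply the selection to the dataset itself for consistency,
# so every downstream method and metric operates on the same gene set.
adata = reduced
```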