This document provides a gap analysis comparing the SpatialTranscriptFormer project against the industry-standard recommendations from sc-best-practices.org. It identifies current strengths and prioritised recommendations for future development.
These are areas where the project already follows industry best practices:
- Global Gene Vocabulary: `build_vocab.py` ensures a consistent feature space across all samples, preventing feature mismatch during training and inference.
- Spatial Context via Neighbourhoods: The use of KD-trees in `HEST_FeatureDataset` to incorporate spatial neighbours aligns with best practices for spatially-aware deep learning.
- Histology-Gene Integration: The architecture (extracting features from histology to predict/interact with gene expression) follows the recommended multi-modal integration patterns.
- Coordinate Standardisation: `normalize_coordinates()` prevents spatial scale bias between slides from different technologies (e.g., standard Visium vs. Visium HD).
- Pathway-Aware Feature Selection: Prioritising MSigDB genes in the vocabulary builder ensures that biologically relevant signal is captured even when using limited gene sets.
- Statistical Loss Modelling: The implementation of `ZINBLoss` (Zero-Inflated Negative Binomial) accounts for the overdispersion and sparsity inherent in transcriptomic count data.
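
To make the last point concrete, the following is a minimal NumPy sketch of a zero-inflated negative binomial negative log-likelihood. It is not the project's `ZINBLoss`; the function name `zinb_nll`, the mean/dispersion parameterisation (`mu`, `theta`) and the dropout probability `pi` are assumptions chosen for illustration.

```python
import math
import numpy as np

# Element-wise log-gamma built from the standard library (hypothetical helper).
_lgamma = np.vectorize(math.lgamma)

def zinb_nll(x, mu, theta, pi, eps=1e-8):
    """Mean negative log-likelihood of counts x under a ZINB model.

    mu: NB mean (> 0), theta: NB inverse dispersion (> 0),
    pi: probability of an excess ("structural") zero, in [0, 1).
    """
    x, mu, theta, pi = (np.asarray(a, dtype=float) for a in (x, mu, theta, pi))
    log_theta_mu = np.log(theta + mu + eps)
    # log NB(0) = theta * log(theta / (theta + mu))
    log_nb_zero = theta * (np.log(theta + eps) - log_theta_mu)
    # log NB(x) for general x
    log_nb = (
        _lgamma(x + theta) - _lgamma(theta) - _lgamma(x + 1.0)
        + log_nb_zero
        + x * (np.log(mu + eps) - log_theta_mu)
    )
    # Zeros can come from the dropout component or from the NB itself.
    ll_zero = np.log(pi + (1.0 - pi) * np.exp(log_nb_zero) + eps)
    # Positive counts can only come from the NB component.
    ll_pos = np.log(1.0 - pi + eps) + log_nb
    return float(-np.where(x < 0.5, ll_zero, ll_pos).mean())
```

With `pi = 0` the expression reduces to a plain negative binomial likelihood; raising `pi` makes observed zeros cheaper, which is how the model absorbs sparsity without distorting the mean.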
The following items are recommended for future sprints to improve model robustness and biological accuracy.
Priority: High
Rationale: Currently, genes are selected based on total expression or pathway membership. However, the model's primary task is to learn spatial patterns. Selecting genes based on Spatially Variable Gene (SVG) metrics like Moran's I (available in Squidpy) would prioritise genes that exhibit spatial coherence over those that are merely highly expressed (like housekeeping genes).
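
As an illustration of the metric, here is a small NumPy-only sketch of Moran's I with binary k-nearest-neighbour weights, used to rank genes. In practice Squidpy's `sq.gr.spatial_neighbors` followed by `sq.gr.spatial_autocorr(adata, mode="moran")` would likely be the route taken; the helper names `knn_weights`, `morans_i` and `rank_genes_by_morans_i` below are hypothetical, and the brute-force distance matrix is only suitable for small slides.

```python
import numpy as np

def knn_weights(coords, k=4):
    """Binary weight matrix: w[i, j] = 1 if j is among the k nearest spots to i."""
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a spot is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    w = np.zeros_like(dist)
    w[np.arange(len(coords))[:, None], nearest] = 1.0
    return w

def morans_i(expr, w):
    """Moran's I per gene for expr of shape (n_spots, n_genes)."""
    expr = np.asarray(expr, dtype=float)
    z = expr - expr.mean(axis=0)
    num = (z * (w @ z)).sum(axis=0)  # sum_ij w_ij * z_i * z_j
    den = np.maximum((z ** 2).sum(axis=0), 1e-12)
    return (expr.shape[0] / w.sum()) * num / den

def rank_genes_by_morans_i(expr, coords, k=4):
    """Gene indices sorted from most to least spatially autocorrelated."""
    scores = morans_i(expr, knn_weights(coords, k))
    return np.argsort(scores)[::-1], scores
```

A gene that varies smoothly across the tissue scores close to 1, while a gene whose values alternate between neighbouring spots scores below 0, regardless of how highly either is expressed.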
Priority: Medium-High
Rationale: The current pipeline lacks a standardised library-size normalisation (e.g., CPM/CP10k) before the log1p transform. Consistent normalisation ensures that sequencing depth variation between spots does not bias the model's predictions.
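
A minimal sketch of what such a step could look like, assuming a dense spots-by-genes count matrix. The function name `normalize_log1p` and the default `target_sum` of 10,000 (CP10k) are illustrative choices, not existing project code; with AnnData objects the equivalent Scanpy calls would be `sc.pp.normalize_total(adata, target_sum=1e4)` followed by `sc.pp.log1p(adata)`.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each spot to target_sum total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)
    # Avoid dividing by zero for empty spots; they stay all-zero.
    safe_lib = np.where(lib_size > 0, lib_size, 1.0)
    return np.log1p(counts / safe_lib * target_sum)
```

Two spots with the same expression proportions but a tenfold difference in sequencing depth map to identical vectors after this transform, which is the property the recommendation is after.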
Priority: Medium
Rationale: In addition to total counts, filtering for Highly Variable Genes (HVG) using dispersion metrics (as in `sc.pp.highly_variable_genes`) would focus the model on genes that carry biological variation between tissue states rather than static structural signal.
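
The idea can be sketched with a plain variance-to-mean dispersion score. This is a deliberate simplification: Scanpy's `sc.pp.highly_variable_genes` additionally normalises dispersion within mean-expression bins, which this sketch omits. The function name `top_dispersion_genes` and the `n_top` default are assumptions.

```python
import numpy as np

def top_dispersion_genes(expr, n_top=2000):
    """Return indices of the n_top genes with the highest variance/mean ratio.

    expr: array of shape (n_spots, n_genes), ideally library-size normalised.
    """
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=0)
    var = expr.var(axis=0, ddof=1)
    # Genes with zero mean get a dispersion of 0 rather than NaN.
    dispersion = np.where(mean > 0, var / np.where(mean > 0, mean, 1.0), 0.0)
    order = np.argsort(dispersion)[::-1]
    return order[:n_top], dispersion
```

A gene expressed at a constant level in every spot receives a dispersion of 0 and falls to the bottom of the ranking, however high its total count.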
Priority: Medium
Rationale: Adding explicit QC thresholds (e.g., minimum UMI count, minimum detected genes, maximum mitochondrial fraction) to the dataset loading scripts would protect the model from training on low-quality "noise" spots.
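
A sketch of such a filter is shown below. The function name `qc_mask`, the threshold defaults and the `MT-` prefix convention for mitochondrial genes are all placeholders for illustration; suitable values depend on the tissue and platform and would need to be chosen per dataset.

```python
import numpy as np

def qc_mask(counts, gene_names, min_umis=500, min_genes=200,
            max_mito_frac=0.2, mito_prefix="MT-"):
    """Boolean mask of spots that pass all three QC thresholds.

    counts: raw counts of shape (n_spots, n_genes).
    gene_names: sequence of gene symbols, one per column of counts.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum(axis=1)
    n_detected = (counts > 0).sum(axis=1)
    # Case-insensitive match so both "MT-CO1" and "mt-Co1" are caught.
    is_mito = np.array([g.upper().startswith(mito_prefix) for g in gene_names])
    mito_frac = counts[:, is_mito].sum(axis=1) / np.maximum(total, 1.0)
    return (total >= min_umis) & (n_detected >= min_genes) & (mito_frac <= max_mito_frac)
```

Applying the mask before normalisation (for example `counts[mask]`, with the same mask applied to coordinates and image features) keeps the modalities aligned.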
Priority: Medium
Rationale: Aggregate metrics like MSE or PCC don't capture whether the spatial distribution of predictions is realistic. Adding a validation step that compares the Moran's I of predicted vs. ground-truth expression would provide a much stronger biological validation signal.
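
One possible shape for this validation step, assuming a precomputed spatial weight matrix `w` (for example from k-nearest neighbours). The names `morans_i` and `spatial_autocorr_gap` are hypothetical, and the mean absolute gap is just one way to summarise the comparison.

```python
import numpy as np

def morans_i(values, w):
    """Moran's I per gene for values of shape (n_spots, n_genes)."""
    values = np.asarray(values, dtype=float)
    z = values - values.mean(axis=0)
    num = (z * (w @ z)).sum(axis=0)
    den = np.maximum((z ** 2).sum(axis=0), 1e-12)
    return (values.shape[0] / w.sum()) * num / den

def spatial_autocorr_gap(pred, truth, w):
    """Compare per-gene Moran's I of predictions against ground truth."""
    i_pred = morans_i(pred, w)
    i_true = morans_i(truth, w)
    return {
        "i_pred": i_pred,
        "i_true": i_true,
        "mean_abs_gap": float(np.abs(i_pred - i_true).mean()),
    }
```

A prediction that matches the ground truth yields a gap of 0, whereas a prediction with a similar value range but a spatially alternating layout yields a large gap that the value distribution alone would not reveal.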
Priority: Low/QoL
Rationale: The "data contract" (which normalisation is applied when, and how genes were selected) should be documented explicitly, either in a dedicated `PREPROCESSING.md` or as a standard header in output folders.