Mapping predicted gene expression into biological pathways is key to making the SpatialTranscriptFormer clinically interpretable. Instead of looking at 1,000 individual genes, clinicians can look at the activity of specific processes (e.g., "Wnt Signaling" or "EMT").
We recommend using the following curated databases for mapping:
- MSigDB Hallmark: 50 gene sets that summarize specific biological states or processes. This is the "gold standard" for general cancer research because it's non-redundant and well-defined.
  - License: MSigDB Hallmark sets (v6.0–v7.5.1, v2022.1+) are subject to the CC BY 4.0 license.
  - Copyright: © 2004–2025 Broad Institute, Inc., MIT, and Regents of the University of California.
- KEGG & Reactome: More detailed, hierarchical pathways that describe specific biochemical reactions.
- Gene Ontology (GO): Useful for finding genes associated with specific molecular functions or cellular components.
There are three ways to implement this in the current architecture:
After the model makes predictions (N spots x G genes), we run a statistical test (e.g., Gene Set Enrichment Analysis or a simple hypergeometric test) to see which pathways are "upregulated" in specific spatial regions.
- Tool: `gseapy` or a custom mapping script.
- Use Case: Generating a "Pathway Activation Map" from a trained model's output.
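
As a rough illustration of the simple hypergeometric variant mentioned above, the sketch below tests whether a pathway is over-represented among the top predicted genes of one region. The function name `hypergeom_enrichment`, the toy gene universe, and the idea of selecting "top" genes per region are assumptions made for illustration; they are not taken from the SpatialTranscriptFormer codebase, and a library such as `gseapy` would normally replace this hand-rolled test.

```python
from math import comb


def hypergeom_enrichment(selected, pathway, universe):
    """One-sided hypergeometric test for pathway over-representation.

    selected: genes called "high" in a region (hypothetical selection step)
    pathway:  genes belonging to the pathway
    universe: all genes the model predicts
    Returns (overlap, p_value) where p_value = P(X >= overlap).
    """
    universe = set(universe)
    selected = set(selected) & universe
    pathway = set(pathway) & universe
    N = len(universe)            # population size
    K = len(pathway)             # pathway genes in the population
    n = len(selected)            # number of genes drawn
    k = len(selected & pathway)  # observed overlap
    # Upper tail: probability of drawing k or more pathway genes by chance.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return k, tail / comb(N, n)


# Toy example (illustrative gene universe, not real model output).
universe = ["VEGFA", "FLT1", "HIF1A", "CD8A", "GZMB",
            "IFNG", "VIM", "SNAI1", "ZEB1", "MKI67"]
hypoxia = ["VEGFA", "FLT1", "HIF1A"]
top_genes = ["HIF1A", "VEGFA", "FLT1"]  # hypothetical top genes in one region

overlap, p_value = hypergeom_enrichment(top_genes, hypoxia, universe)
print(overlap, p_value)
```

When this is run for many pathways and many regions, the resulting p-values would need a multiple-testing correction before being drawn on a map.
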
The SpatialTranscriptFormer interaction model inherently represents pathway activations as part of its attention mechanism and output process. Rather than a simple linear bottleneck, it utilizes learnable pathway tokens and Multi-Task Learning (MTL).
In this mode, the network receives direct supervision on its pathway tokens, guided by established biological databases (e.g., MSigDB):
- Architecture Flow:
  - Interaction: Learnable pathway tokens $P$ interact with Histology patch features $H$ via self-attention (e.g., $p2h$, $h2p$).
  - Activation: Pathway scores $S \in \mathbb{R}^P$ are computed using a learnable temperature-scaled cosine similarity between the pathway tokens and image patch tokens.
  - Gene Reconstruction: $\hat{y} = S \cdot \mathbf{W}_{recon} + b$, where $\mathbf{W}_{recon}$ is initialized using the binary pathway membership matrix $M$.
- MTL Auxiliary Loss: To prevent standard bottleneck collapse, an explicit auxiliary loss bridges the spatial representations directly to biological data. The pathway scores $S$ are supervised against a pathway ground truth using a Pearson Correlation Coefficient (PCC) loss.
  - To prevent highly expressed housekeeping genes dominating the signal, the raw spatial gene counts ($Y_{genes}$) are first spatially Z-score normalized ($Z_{genes}$).
  - These are then projected onto the pathway matrix and mean-aggregated by member count ($C$):

    $$L_{total} = L_{gene} + \lambda_{pathway} \left(1 - \mathrm{PCC}\left(S, \frac{Z_{genes} \cdot M^T}{C}\right)\right)$$
- Benefit: The model is forced to explicitly align its internal interaction tokens with concrete biological pathways, granting direct interpretability where every gene gets an equal vote.
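
The auxiliary target and loss described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: "spatially Z-score normalized" is read here as z-scoring each gene across spots, the PCC is computed per pathway across spots and then averaged, and the names `zscore_columns`, `pathway_targets`, `pcc_loss`, and the value of `lambda_pathway` are all hypothetical.

```python
from math import sqrt


def zscore_columns(Y):
    """Z-score each gene (column) across spots (rows). Assumed normalization."""
    n = len(Y)
    out_cols = []
    for col in zip(*Y):
        mean = sum(col) / n
        std = sqrt(sum((v - mean) ** 2 for v in col) / n)
        out_cols.append([(v - mean) / std if std > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*out_cols)]


def pathway_targets(Z, M):
    """Compute (Z . M^T) / C: mean z-score over each pathway's member genes."""
    targets = []
    for z in Z:
        row = []
        for members in M:
            count = sum(members)  # C: number of member genes in this pathway
            total = sum(zg * m for zg, m in zip(z, members))
            row.append(total / count if count else 0.0)
        targets.append(row)
    return targets


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    denom = sqrt(vx * vy)
    return cov / denom if denom > 0 else 0.0


def pcc_loss(S, T):
    """1 - mean PCC, with one correlation per pathway across spots (assumed)."""
    per_pathway = [pearson(s, t) for s, t in zip(zip(*S), zip(*T))]
    return 1.0 - sum(per_pathway) / len(per_pathway)


# Toy data: 4 spots x 3 genes; the third gene is very highly expressed.
Y = [[1, 2, 100], [2, 4, 300], [3, 6, 200], [4, 8, 400]]
# Binary membership matrix M: 2 pathways x 3 genes.
M = [[1, 1, 0], [0, 0, 1]]

Z = zscore_columns(Y)
T = pathway_targets(Z, M)
S = T                 # stand-in for model scores that match the target exactly
lambda_pathway = 0.5  # hypothetical weighting
l_gene = 0.0          # placeholder for the gene-level loss
l_total = l_gene + lambda_pathway * pcc_loss(S, T)
print(l_total)
```

Because each gene is standardized before aggregation, the highly expressed third gene contributes on the same scale as the others, which is the "equal vote" property noted in the Benefit item.
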
The model supports any dataset within the HEST1k collection (e.g., Breast, Kidney, Lung, Colon). Instead of being bound to a single disease context, users can leverage the `--custom-gmt` flag to map genes to pathways relevant to their specific investigation.
Regardless of the tissue of origin (e.g., Kidney versus Breast), researchers often track core functional states within the tumor microenvironment. A user might define a `.gmt` file to explicitly monitor:
| Pathway Concept | Hallmarks / Relevant Genes | Interpretive Value across Tissues |
|---|---|---|
| Hypoxia & Angiogenesis | VEGFA, FLT1, HIF1A | Identifies oxygen-deprived or highly vascularized tumor cores. |
| Immune Infiltration | CD8A, GZMB, IFNG | Maps regions of active anti-tumor immune response. |
| Stromal / EMT | VIM, SNAI1, ZEB1 | Highlights desmoplastic stroma and invasion fronts. |
| Proliferation | MKI67, PCNA, MYC | Pinpoints highly active, dividing cell populations. |
By supplying these functional groupings via `--custom-gmt`, the model's MTL process explicitly aligns its spatial interaction tokens to monitor these exact states across any whole-slide image in the HEST1k dataset.
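
For readers unfamiliar with the format, a GMT file is tab-separated with one gene set per line: set name, a free-text description, then the member genes. The sketch below writes the four functional groupings from the table into that layout and reads them back. The set names, the `"custom"` description, the file name `tme_states.gmt`, and the helpers `write_gmt` / `read_gmt` are illustrative assumptions rather than names used by the project.

```python
import tempfile
from pathlib import Path

# Gene sets taken from the table above; the set names are hypothetical.
GENE_SETS = {
    "HYPOXIA_ANGIOGENESIS": ["VEGFA", "FLT1", "HIF1A"],
    "IMMUNE_INFILTRATION": ["CD8A", "GZMB", "IFNG"],
    "STROMAL_EMT": ["VIM", "SNAI1", "ZEB1"],
    "PROLIFERATION": ["MKI67", "PCNA", "MYC"],
}


def write_gmt(gene_sets, path, description="custom"):
    """Write gene sets as GMT lines: name<TAB>description<TAB>gene1<TAB>gene2..."""
    lines = ["\t".join([name, description, *genes])
             for name, genes in gene_sets.items()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_gmt(path):
    """Parse a GMT file into {set_name: [genes]}, ignoring the description."""
    gene_sets = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _description, *genes = line.split("\t")
        gene_sets[name] = genes
    return gene_sets


# Round-trip through a temporary file to check the layout.
with tempfile.TemporaryDirectory() as tmp:
    gmt_path = Path(tmp) / "tme_states.gmt"
    write_gmt(GENE_SETS, gmt_path)
    first_line = gmt_path.read_text(encoding="utf-8").splitlines()[0]
    round_trip = read_gmt(gmt_path)

print(first_line)
```

Writing the file to a persistent location instead of a temporary directory would give a local path that can be supplied through `--custom-gmt`.
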
- MSigDB Hallmarks Initialization (`--pathway-init` flag): Downloads the GMT file, matches genes against `global_genes.json`, and initializes `gene_reconstructor.weight` with the binary membership matrix. See `pathways.py`.
  - 50 Hallmark pathways (default fixed fallback when using `--pathway-init`).
  - GMT file cached in `.cache/` after first download.
- Custom Pathway Definitions (`--custom-gmt` flag): Users can override the default Hallmarks by providing a URL or local path to a `.gmt` file, enabling custom database integrations (e.g., KEGG, Reactome, or highly specific tissue masks).
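
The gene-matching step can be pictured as building a binary pathways-by-genes matrix $M$ from the parsed gene sets and the model's gene list. The sketch below is only an illustration of that idea: the actual logic lives in `pathways.py` and is not reproduced here, and the function name `membership_matrix`, the toy gene list, and the decision to keep pathways with no matched genes are assumptions.

```python
def membership_matrix(gene_sets, gene_list):
    """Build a binary (pathways x genes) matrix M.

    M[p][g] is 1.0 when gene_list[g] belongs to pathway p, else 0.0.
    Genes that are not in gene_list are silently skipped.
    """
    index = {gene: i for i, gene in enumerate(gene_list)}
    names, M = [], []
    for name, genes in gene_sets.items():
        row = [0.0] * len(gene_list)
        for gene in genes:
            if gene in index:
                row[index[gene]] = 1.0
        names.append(name)
        M.append(row)
    return names, M


# Toy stand-ins for the parsed GMT content and the model's gene vocabulary.
gene_sets = {
    "HYPOXIA": ["VEGFA", "FLT1", "HIF1A"],
    "PROLIFERATION": ["MKI67", "PCNA", "MYC"],
}
gene_list = ["VEGFA", "CD8A", "HIF1A", "MKI67"]

names, M = membership_matrix(gene_sets, gene_list)
for name, row in zip(names, M):
    print(name, row)
```

The orientation here (pathways × genes) matches $\hat{y} = S \cdot \mathbf{W}_{recon}$; if the reconstruction layer stores its weight as outputs × inputs, the matrix would need to be transposed before being copied in.
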

```bash
# With biological initialization (50 MSigDB Hallmarks)
python -m spatial_transcript_former.train \
    --model interaction --pathway-init ...
```

- Spatial Pathway Maps: Visualize pathway activations as spatial heatmaps overlaid on histology using `stf-predict`. See the README for inference instructions.
- Post-Hoc Enrichment: `gseapy` integration for pathway activation maps from model outputs without architectural bottlenecks.
- End-to-End Risk Assessment Module: Developing a downstream prediction system that takes the spatially-resolved pathway activations and gene expressions derived from the model and maps them directly to clinical risk and survival outcomes.