Commit c6ba5e0

feat: integrate auxiliary pathway loss and sparsity regularization (#4)
Architecture & Refactoring:

- Remove the deprecated Bowel Cancer download script.
- Updated Preset Model Configurations.

Documentation & Compliance:

- Formally document auxiliary loss and training objectives in the IP Statement.
- Expand model documentation and training guides to reflect bottleneck transitions and latent discovery capabilities.
- Add Future Directions and Clinical Collaborations to the project README.
- Resolve markdownlint error in LICENSE by adding a top-level H1 heading.
- Updates to the contribution guidelines.
1 parent 2c145b0 commit c6ba5e0

17 files changed

Lines changed: 166 additions & 249 deletions

CONTRIBUTING.md

Lines changed: 13 additions & 8 deletions
@@ -35,18 +35,23 @@ We use `black` for formatting and `flake8` for linting. Please ensure your code
3535

3636
```bash
3737
black .
38-
flake8 src/
3938
```
4039

41-
### 3. Testing
40+
### 4. AI-Assisted Development
4241

43-
All new features must include unit tests in the `tests/` directory. We use `pytest` for our test suite.
42+
We welcome contributions developed with the assistance of AI tools (e.g., Copilot, ChatGPT, Claude, or agentic frameworks). However, to ensure the long-term maintainability and integrity of the project:
4443

45-
```bash
46-
# Run all tests
47-
.\test.ps1 # Windows
48-
bash test.sh # Linux
49-
```
44+
- **Ownership**: You are ultimately responsible for the code you submit. Do not commit code you do not fully understand.
45+
- **Explainability**: During the review process, you must be able to explain the logic, design decisions, and any subtle side effects of the AI-suggested changes.
46+
- **Verification**: AI-generated code must strictly follow our coding standards, naming conventions, and architectural patterns. It must be accompanied by robust tests (see our [Testing Guide](docs/TESTING.md)).
47+
48+
### 3. Testing & Quality Assurance
49+
50+
All new features must be accompanied by relevant tests in the `tests/` directory, written with `pytest`.
51+
52+
We strongly encourage rigorous testing approaches such as **Mutation Testing** (via `cosmic-ray`) for critical model components, so that the test suite leaves no surviving mutants.
53+
54+
For full details on our testing requirements, how to run the test suites locally, and our guidelines on mutation testing, please read the [Testing Guide](docs/TESTING.md).
5055

5156
## Pull Request Process
5257

Download-BowelCancer.ps1

Lines changed: 0 additions & 105 deletions
This file was deleted.

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
PROPRIETARY SOURCE CODE LICENSE (NON-COMMERCIAL + NEGOTIATED COMMERCIAL)
1+
# PROPRIETARY SOURCE CODE LICENSE (NON-COMMERCIAL + NEGOTIATED COMMERCIAL)
22

33
Copyright (c) 2026 Benjamin Isaac Wilson. All rights reserved.
44

README.md

Lines changed: 17 additions & 27 deletions
@@ -36,40 +36,23 @@ This project requires [Conda](https://docs.conda.io/en/latest/).
3636

3737
## Usage
3838

39-
**Before running any commands**, you must activate the conda environment:
39+
### Dataset Access
4040

41-
```bash
42-
conda activate SpatialTranscriptFormer
43-
```
44-
45-
### Download HEST Data
46-
47-
> [!CAUTION]
48-
> **Authentication Required**: The HEST dataset is gated. You must accept the terms of use at [MahmoodLab/hest](https://huggingface.co/datasets/MahmoodLab/hest) and authenticate with your Hugging Face account to download the data.
49-
50-
Please provide your token using ONE of the following methods before running the download tool:
51-
52-
1. **Persistent Login**: Run `huggingface-cli login` and paste your access token when prompted.
53-
2. **Environment Variable**: Set the `HF_TOKEN` environment variable in your active terminal session.
54-
55-
Once authenticated, download specific subsets using filters or the entire dataset:
41+
The model uses the **HEST1k** dataset. You can download specific subsets (by organ, technology, etc.) or the entire dataset using the `stf-download` utility:
5642

5743
```bash
58-
# Option 1: Download the ENTIRE HEST dataset (requires confirmation)
59-
stf-download --local_dir hest_data
44+
# List available filtering options
45+
stf-download --list-options
6046

61-
# Option 2: Download a specific subset (e.g., Bowel Cancer)
62-
stf-download --organ Bowel --disease Cancer --local_dir hest_data
47+
# Download a specific subset (e.g., Breast Cancer samples from Visium)
48+
stf-download --organ Breast --disease Cancer --tech Visium --local_dir hest_data
6349

64-
# Option 3: Filter by technology (e.g., Visium)
65-
stf-download --tech Visium --local_dir hest_data
50+
# Download all human samples
51+
stf-download --species "Homo sapiens" --local_dir hest_data
6652
```
6753

68-
To see all available organs in the metadata:
69-
70-
```bash
71-
stf-download --list_organs
72-
```
54+
> [!NOTE]
55+
> The HEST dataset is gated on Hugging Face. Ensure you have accepted the terms at [MahmoodLab/hest](https://huggingface.co/datasets/MahmoodLab/hest) and are logged in via `huggingface-cli login`.
7356
7457
### Train Models
7558

@@ -122,6 +105,13 @@ Visualization plots will be saved to the `./results` directory.
122105
.\test.ps1
123106
```
124107

108+
## Future Directions & Clinical Collaborations
109+
110+
A major future direction for **SpatialTranscriptFormer** is to integrate this architecture into an **end-to-end pipeline for patient risk assessment** and prognosis tracking. By leveraging the model's predicted expression and pathway activations, we aim to build a downstream risk prediction module that allows users to directly evaluate how spatially-resolved expression relates to patient survival.
111+
112+
> [!NOTE]
113+
> **Call for Collaborators:** Rigorous risk assessment models require vast datasets of clinical metadata and survival outcomes, which we currently lack access to. We are open to investigating *any* disease of interest! If you have access to large clinical cohorts and are interested in exploring how spatial pathway activation correlates with patient prognosis, we would love to partner with you.
114+
125115
## Contributing
126116

127117
We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for details on our coding standards and the process for submitting pull requests. Note that this project is under a proprietary license; contributions involve an assignment of rights for non-academic use.

docs/IP_STATEMENT.md

Lines changed: 12 additions & 0 deletions
@@ -14,6 +14,18 @@ The primary innovation is the **multimodal bottleneck transformer** designed for
1414
- **Quadrant-Based Interaction Masking**: The logic used to zero out specific attention quadrants (e.g., $A_{H \to H}$) to optimize memory while maintaining multimodal context.
1515
- **Biologically-Informed Reconstruction Bottleneck**: The specific matrix decomposition approach where gene expression is reconstructed from a linear combination of pathway activations.
1616

17+
### Proposed Auxiliary Pathway Loss
18+
19+
To prevent bottleneck collapse and provide a direct gradient signal to the pathway tokens, we use the `AuxiliaryPathwayLoss`. This loss compares the model's internal pathway scores against "ground truth" pathway activations computed from the gene expression targets via MSigDB membership.
20+
21+
The total objective becomes:
22+
$$\mathcal{L} = \mathcal{L}_{gene} + \lambda_{aux} (1 - \text{PCC}(\text{pathway\_scores}, \text{target\_pathways}))$$
23+
24+
The `--log-transform` flag applies `log1p` to targets, mitigating the heavy-tailed gene expression distribution where housekeeping genes dominate MSE.
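As a rough illustration of why the transform helps, the sketch below uses made-up counts (not project data) and a prediction that is uniformly 10% low. It compares how much of the squared error comes from a single high-count gene before and after `log1p`. The variable names are illustrative only.

```python
import numpy as np

# Made-up counts for four genes at one spot; the last mimics a housekeeping gene.
target = np.array([0.0, 1.0, 10.0, 1000.0])
pred = target * 0.9  # a prediction that is 10% low for every gene

# Per-gene squared error on raw counts and on log1p-transformed counts.
raw_sq_err = (pred - target) ** 2
log_sq_err = (np.log1p(pred) - np.log1p(target)) ** 2

# Share of the total squared error contributed by the high-count gene.
raw_share = raw_sq_err[-1] / raw_sq_err.sum()
log_share = log_sq_err[-1] / log_sq_err.sum()
print(f"raw share: {raw_share:.4f}, log1p share: {log_share:.4f}")
```

On raw counts the high-count gene accounts for almost all of the error; after `log1p` the error is spread much more evenly across genes.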
25+
26+
With pathway sparsity regularisation, the full training objective is:
27+
$$\mathcal{L} = \mathcal{L}_{task} + \lambda \|W_{recon}\|_1$$
28+
1729
## 2. Spatial Context Methodologies
1830

1931
- **Euclidean-Gated Attention**: The implementation of spatial distance-based masking ($M_{spatial}$) to constrain model focus to local morphological regions.

docs/MODELS.md

Lines changed: 2 additions & 2 deletions
@@ -122,7 +122,7 @@ Together, these ensure the model learns *spatially-varying* pathway activation m
122122

123123
#### Frozen Backbone (Feature Extraction)
124124

125-
Pre-computed features from a pathology foundation model. The backbone is never fine-tuned.
125+
Pre-computed features from a pathology foundation model. The backbone is currently not fine-tuned, though this may change.
126126

127127
| Backbone | Feature Dim | Source |
128128
| :--- | :--- | :--- |
@@ -214,7 +214,7 @@ The Zero-Inflated Negative Binomial (ZINB) loss is designed for raw, highly disp
214214

215215
The model outputs these parameters, and the loss computes the negative log-likelihood of the ground truth counts given this distribution.
216216

217-
### Auxiliary Pathway Loss
217+
### Proposed Auxiliary Pathway Loss
218218

219219
To prevent bottleneck collapse and provide a direct gradient signal to the pathway tokens, we use the `AuxiliaryPathwayLoss`. This loss compares the model's internal pathway scores against "ground truth" pathway activations computed from the gene expression targets via MSigDB membership.
220220

docs/PATHWAY_MAPPING.md

Lines changed: 34 additions & 27 deletions
@@ -21,49 +21,55 @@ After the model makes predictions (N spots x G genes), we run a statistical test
2121
- **Tool**: `gseapy` or a custom mapping script.
2222
- **Use Case**: Generating a "Pathway Activation Map" from a trained model's output.
2323

24-
### B. Pathway Bottleneck (Model Architecture)
24+
### B. Interaction Model via Multi-Task Learning (MTL)
2525

26-
The **SpatialTranscriptFormer** replaces the standard linear output head with a two-step projection that can be configured in two modes:
26+
The **SpatialTranscriptFormer** interaction model inherently represents pathway activations as part of its attention mechanism and output process. Rather than a simple linear bottleneck, it utilizes learnable pathway tokens and Multi-Task Learning (MTL).
2727

28-
#### 1. Informed Projection (Prior Knowledge)
28+
#### 1. Informed Supervision via Auxiliary Loss
2929

30-
In this mode, the **Gene Reconstruction Matrix** $\mathbf{W}_{recon}$ is guided by established biological databases (MSigDB, KEGG).
30+
In this mode, the network receives direct supervision on its pathway tokens, guided by established biological databases (e.g., MSigDB):
3131

32-
- **Implementation**: $\mathbf{W}_{recon}$ is initialized as a binary mask $M \in \{0, 1\}^{G \times P}$ where $M_{gk} = 1$ if gene $g$ belongs to pathway $k$.
33-
- **Benefit**: Predictions are guaranteed to be linear combinations of known biological processes, making them instantly interpretable by clinicians.
32+
- **Architecture Flow**:
33+
1. **Interaction**: Learnable pathway tokens $P$ interact with Histology patch features $H$ via self-attention (e.g., $p2h$, $h2p$).
34+
2. **Activation**: Pathway scores $S \in \mathbb{R}^P$ are computed using a learnable temperature-scaled cosine similarity between the pathway tokens and image patch tokens.
35+
3. **Gene Reconstruction**: $\hat{y} = S \cdot \mathbf{W}_{recon} + b$, where $\mathbf{W}_{recon}$ is initialized using the binary pathway membership matrix $M$.
36+
- **MTL Auxiliary Loss**: To prevent standard bottleneck collapse, an explicit auxiliary loss bridges the spatial representations directly to biological data. The pathway scores $S$ are supervised against a pathway ground truth ($Y_{genes} \cdot M^T$) using a Pearson Correlation Coefficient (PCC) loss.
37+
$$L_{total} = L_{gene} + \lambda_{pathway} (1 - PCC(S, Y_{genes} \cdot M^T))$$
38+
- **Benefit**: The model is forced to explicitly align its internal interaction tokens with concrete biological pathways, granting direct interpretability.
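
The reconstruction step and the combined objective above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the project's `AuxiliaryPathwayLoss`: the PCC is assumed to be computed per pathway across spots and then averaged, $L_{gene}$ is stood in for by plain MSE, and the membership matrix, random scores, and `lambda_pathway = 0.1` are all hypothetical.

```python
import numpy as np

def pearson_loss(scores, targets, eps=1e-8):
    """Return 1 - PCC, with the PCC computed per pathway (column) and averaged."""
    s = scores - scores.mean(axis=0, keepdims=True)
    t = targets - targets.mean(axis=0, keepdims=True)
    cov = (s * t).sum(axis=0)
    norm = np.sqrt((s ** 2).sum(axis=0) * (t ** 2).sum(axis=0)) + eps
    return 1.0 - float(np.mean(cov / norm))

rng = np.random.default_rng(0)
n_spots, n_genes, n_pathways = 32, 100, 5

# Hypothetical membership matrix M (P x G): pathway k owns genes 20k..20k+19.
M = np.zeros((n_pathways, n_genes))
for k in range(n_pathways):
    M[k, 20 * k : 20 * (k + 1)] = 1.0

y_genes = rng.random((n_spots, n_genes))     # stand-in expression targets (N x G)
target_pathways = y_genes @ M.T              # pathway ground truth (N x P)

S = rng.normal(size=(n_spots, n_pathways))   # stand-in for model pathway scores
W_recon = M.copy()                           # reconstruction weights initialized from M
b = np.zeros(n_genes)
y_hat = S @ W_recon + b                      # gene reconstruction (N x G)

loss_gene = float(np.mean((y_hat - y_genes) ** 2))  # MSE as a stand-in for L_gene
lambda_pathway = 0.1                                # hypothetical weight
loss_total = loss_gene + lambda_pathway * pearson_loss(S, target_pathways)
```

Because the auxiliary term depends only on correlation, it is insensitive to the scale and offset of $S$; it rewards scores that rise and fall with the target pathway activity across spots.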
3439

35-
#### 2. Data-Driven Projection (Latent Discovery)
40+
#### 2. Data-Driven Discovery (Latent Projection)
3641

37-
In this mode, the model learns its own "latent pathways" based on morphological patterns.
42+
In the absence of a biological prior, the model can learn its own "latent pathways".
3843

39-
- **Implementation**: $\mathbf{W}_{recon}$ is randomly initialized and learned via backpropagation.
40-
- **Sparsity Constraint**: We apply an L1 penalty to force the model to identify "canonical" gene sets: $L_{total} = L_{MSE} + \lambda \|\mathbf{W}_{recon}\|_1$.
44+
- **Implementation**: $\mathbf{W}_{recon}$ is randomly initialized and the auxiliary pathway loss is disabled.
45+
- **Sparsity Constraint**: We apply an L1 penalty to force the model to identify "canonical" sparse gene sets: $L_{total} = L_{gene} + \lambda_{sparsity} \|\mathbf{W}_{recon}\|_1$.
4146
- **Benefit**: Can discover novel spatial-transcriptomic relationships that aren't yet captured in curated databases.
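
The sparsity-regularized objective above amounts to adding a scaled sum of absolute weights to the gene loss. The sketch below shows the arithmetic with a tiny hand-written weight matrix; the function names are hypothetical, and the value `0.01` simply mirrors the `--sparsity-lambda 0.01` example used elsewhere in this document.

```python
import numpy as np

def l1_penalty(w_recon):
    """Sum of absolute values of the reconstruction weights."""
    return float(np.abs(w_recon).sum())

def total_loss(gene_loss, w_recon, sparsity_lambda=0.01):
    """Gene loss plus the scaled L1 penalty on the reconstruction weights."""
    return gene_loss + sparsity_lambda * l1_penalty(w_recon)

# Tiny hand-written weight matrix (2 pathways x 2 genes) for illustration.
w = np.array([[0.5, -0.25],
              [0.0, 1.0]])

penalty = l1_penalty(w)                    # 0.5 + 0.25 + 0.0 + 1.0
loss = total_loss(gene_loss=0.2, w_recon=w)

# The (sub)gradient of the penalty pushes every non-zero weight toward zero
# with the same magnitude, which is what drives small weights to exactly zero.
penalty_grad = 0.01 * np.sign(w)
```

Unlike an L2 penalty, the pull toward zero does not shrink as a weight gets small, so each latent pathway tends to keep only a limited set of genes.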
4247

43-
- **Architecture Flow**:
44-
1. **Interaction**: Pathway tokens $P$ query the Histology $H$.
45-
2. **Activation**: A linear layer reduces $P_{tokens}$ to activation scores $S \in \mathbb{R}^P$.
46-
3. **Reconstruction**: $\hat{y} = S \cdot \mathbf{W}_{recon} + b$.
48+
## 3. Generalizing to HEST1k Tissues
49+
50+
The model supports any dataset within the HEST1k collection (e.g., Breast, Kidney, Lung, Colon). Instead of being bound to a single disease context, users can leverage the `--custom-gmt` flag to map genes to pathways relevant to their specific investigation.
4751

48-
## 3. Clinical Application in Bowel Cancer
52+
### Example: Profiling the Tumor Microenvironment
4953

50-
For colorectal cancer, we should prioritize monitoring these specific pathways:
54+
Regardless of the tissue of origin (e.g., Kidney versus Breast), researchers often track core functional states within the tumor microenvironment. A user might define a `.gmt` file to explicitly monitor:
5155

52-
| Pathway | Clinically Relevant Genes | Clinical Significance |
56+
| Pathway Concept | Hallmarks / Relevant Genes | Interpretive Value across Tissues |
5357
| :--- | :--- | :--- |
54-
| **Wnt Signaling** | `CTNNB1`, `MYC`, `AXIN2` | Common driver in CRC (APC mutations) |
55-
| **MMR / DNA Repair** | `MLH1`, `MSH2`, `MSH6` | MSI vs MSS status (Immunotherapy response) |
56-
| **EMT** | `SNAI1`, `VIM`, `ZEB1` | Tumor invasion and metastasis risk |
57-
| **Angiogenesis** | `VEGFA`, `FLT1` | Potential for anti-angiogenic therapy |
58+
| **Hypoxia & Angiogenesis** | `VEGFA`, `FLT1`, `HIF1A` | Identifies oxygen-deprived or highly vascularized tumor cores. |
59+
| **Immune Infiltration** | `CD8A`, `GZMB`, `IFNG` | Maps regions of active anti-tumor immune response. |
60+
| **Stromal / EMT** | `VIM`, `SNAI1`, `ZEB1` | Highlights desmoplastic stroma and invasion fronts. |
61+
| **Proliferation** | `MKI67`, `PCNA`, `MYC` | Pinpoints highly active, dividing cell populations. |
62+
63+
By supplying these functional groupings via `--custom-gmt`, the model's MTL process explicitly aligns its spatial interaction tokens to monitor these exact states across any whole-slide image in the HEST1k dataset.
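
For orientation, a GMT file is tab-separated with one gene set per line: set name, a description field, then the member genes. The sketch below parses such text and builds a binary pathway-by-gene membership matrix using only the standard library. It is an illustration, not the code in `pathways.py`; the function names, set names, and the short gene list are hypothetical.

```python
def parse_gmt(text):
    """Parse GMT text into {set_name: [genes]} (name, description, genes...)."""
    pathways = {}
    for line in text.splitlines():
        fields = line.strip().split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        pathways[name] = [g for g in genes if g]
    return pathways

def membership_matrix(pathways, gene_order):
    """Build a binary (pathways x genes) matrix; genes not in gene_order are ignored."""
    index = {gene: i for i, gene in enumerate(gene_order)}
    names = list(pathways)
    matrix = [[0] * len(gene_order) for _ in names]
    for row, name in enumerate(names):
        for gene in pathways[name]:
            col = index.get(gene)
            if col is not None:
                matrix[row][col] = 1
    return names, matrix

# Hypothetical custom gene sets using genes from the table above.
rows = [
    ["HYPOXIA_ANGIOGENESIS", "custom", "VEGFA", "FLT1", "HIF1A"],
    ["IMMUNE_INFILTRATION", "custom", "CD8A", "GZMB", "IFNG"],
    ["PROLIFERATION", "custom", "MKI67", "PCNA", "MYC"],
]
gmt_text = "\n".join("\t".join(row) for row in rows)

# Hypothetical model gene panel; only genes present here receive a 1.
gene_order = ["VEGFA", "HIF1A", "CD8A", "MKI67", "ACTB"]
names, matrix = membership_matrix(parse_gmt(gmt_text), gene_order)
```

Genes that appear in a set but not in the model's gene panel are silently dropped here, so gene coverage is worth checking before training with a custom file.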
5864

5965
## 4. Implementation Status
6066

6167
### Implemented
6268

6369
- **MSigDB Hallmarks Initialization** (`--pathway-init` flag): Downloads the GMT file, matches genes against `global_genes.json`, and initializes `gene_reconstructor.weight` with the binary membership matrix. See [`pathways.py`](../src/spatial_transcript_former/data/pathways.py).
64-
- 50 Hallmark pathways (fixed when using `--pathway-init`)
65-
- ~54% gene coverage (542/1000 genes mapped to at least one pathway)
66-
- GMT file cached in `.cache/` after first download
70+
- 50 Hallmark pathways (default fixed fallback when using `--pathway-init`).
71+
- GMT file cached in `.cache/` after first download.
72+
- **Custom Pathway Definitions** (`--custom-gmt` flag): Users can override the default Hallmarks by providing a URL or local path to a `.gmt` file, enabling custom database integrations (e.g., KEGG, Reactome, or highly specific tissue masks).
6773

6874
- **Sparsity Regularization** (`--sparsity-lambda` flag): L1 penalty on `gene_reconstructor` weights to encourage pathway-like groupings when using data-driven (random) initialization.
6975

@@ -79,8 +85,9 @@ python -m spatial_transcript_former.train \
7985
--model interaction --num-pathways 50 --sparsity-lambda 0.01 ...
8086
```
8187

88+
- **Spatial Pathway Maps**: Visualize pathway activations as spatial heatmaps overlaid on histology using `stf-predict`. See the [README](../README.md) for inference instructions.
89+
8290
### Future Work
8391

84-
- **KEGG/Reactome**: More granular pathway databases for finer-grained analysis.
85-
- **Post-Hoc Enrichment**: `gseapy` integration for pathway activation maps from model outputs.
86-
- **Spatial Pathway Maps**: Visualize pathway activations as spatial heatmaps overlaid on histology.
92+
- **Post-Hoc Enrichment**: `gseapy` integration for pathway activation maps from model outputs without architectural bottlenecks.
93+
- **End-to-End Risk Assessment Module**: Developing a downstream prediction system that takes the spatially-resolved pathway activations and gene expressions derived from the model and maps them directly to clinical risk and survival outcomes.
