feat: add SVG-aware gene selection and spatial coherence validation

BenjaminIsaac0111 · BenjaminIsaac0111 · commit f34e8c316752 · 2026-03-12T22:09:13.000Z
Introduce Moran's I-based spatial variability scoring to improve gene
vocabulary selection and add a spatial coherence metric for validation.

Spatial Statistics Module (NEW):
- data/spatial_stats.py: lightweight Moran's I via KNN weights
  (numpy + scipy only, no new dependencies)
- morans_i(), morans_i_batch(), spatial_coherence_score()
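
A minimal sketch of the statistic, not the module itself: it assumes a
toy 10x10 grid of spots, k=4, and no duplicate coordinates. The names
knn_weights_sketch and morans_i_sketch are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial import KDTree


def knn_weights_sketch(coords, k):
    """Row-normalised KNN weights: W[i, j] = 1/k for the k nearest neighbours of i."""
    n = coords.shape[0]
    # Ask for k+1 hits; with unique coordinates the first hit is the point itself.
    _, idx = KDTree(coords).query(coords, k=k + 1)
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].ravel()
    return csr_matrix((np.full(n * k, 1.0 / k), (rows, cols)), shape=(n, n))


def morans_i_sketch(x, W):
    """I = (N / sum(W)) * (z^T W z) / (z^T z), with z the mean-centred values."""
    z = x - x.mean()
    denom = np.sum(z ** 2)
    if denom < 1e-12:
        return 0.0  # constant signal: no spatial pattern to measure
    return float((len(x) / W.sum()) * np.sum(z * (W @ z)) / denom)


# Toy grid: one spot per integer (row, col) position.
ii, jj = np.meshgrid(np.arange(10), np.arange(10), indexing="ij")
coords = np.column_stack([ii.ravel(), jj.ravel()]).astype(float)
W = knn_weights_sketch(coords, k=4)

checkerboard = ((ii + jj) % 2).ravel().astype(float)  # neighbours disagree
clustered = (jj < 5).ravel().astype(float)            # two solid halves

i_checker = morans_i_sketch(checkerboard, W)   # negative (dispersed)
i_cluster = morans_i_sketch(clustered, W)      # strongly positive (clustered)
i_constant = morans_i_sketch(np.ones(100), W)  # 0.0 by the zero-variance guard
```

Because the statistic divides by the sum of all weights, scaling every
weight by the same constant leaves I unchanged.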

SVG-aware Gene Selection:
- build_vocab.py: --svg-weight (0-1) blends expression rank with
  Moran's I rank; --svg-k sets the number of neighbours in the KNN graph
- Default svg_weight=0.0 preserves original behaviour
- Stats CSV now includes morans_i column
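
The rank blending can be sketched as below. This is an illustration
under assumptions: hybrid_order is a hypothetical helper, and the gene
names and numbers are made-up toy values, not results from any dataset.

```python
def hybrid_order(expr_totals, morans_avg, svg_weight):
    """Order genes by a weighted sum of two ranks (lower score = better).

    Rank 0 is the most expressed gene (expression rank) or the gene with
    the highest Moran's I (SVG rank). Genes without a Moran's I value
    are treated as 0.0.
    """
    genes = list(expr_totals)
    expr_sorted = sorted(genes, key=lambda g: expr_totals[g], reverse=True)
    mi_sorted = sorted(genes, key=lambda g: morans_avg.get(g, 0.0), reverse=True)
    expr_rank = {g: r for r, g in enumerate(expr_sorted)}
    mi_rank = {g: r for r, g in enumerate(mi_sorted)}
    score = {
        g: (1 - svg_weight) * expr_rank[g] + svg_weight * mi_rank[g]
        for g in genes
    }
    return sorted(genes, key=lambda g: score[g])


# Hypothetical toy inputs: two abundant but spatially flat genes and two
# less abundant, spatially patterned genes.
expr_totals = {"flat_1": 9000.0, "flat_2": 8000.0, "patterned_1": 500.0, "patterned_2": 400.0}
morans_avg = {"flat_1": 0.02, "flat_2": 0.01, "patterned_1": 0.60, "patterned_2": 0.45}

expression_only = hybrid_order(expr_totals, morans_avg, svg_weight=0.0)
svg_only = hybrid_order(expr_totals, morans_avg, svg_weight=1.0)
mostly_svg = hybrid_order(expr_totals, morans_avg, svg_weight=0.8)
```

Blending ranks rather than raw values keeps the two signals on the same
scale, since total counts and Moran's I differ by orders of magnitude.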

Spatial Coherence Validation:
- engine.py: computes the Pearson correlation between the Moran's I of
  predicted and ground-truth expression over the top-50 SVGs during
  validation
- train.py: logs spatial_coherence to SQLite
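
A compact sketch of the coherence metric, assuming a toy 10x10 grid
with unique coordinates and four synthetic genes. The helper names
(knn_weights_sketch, morans_i_columns, coherence_sketch) are
illustrative and the per-gene loop is vectorised here for brevity.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial import KDTree


def knn_weights_sketch(coords, k=6):
    """Row-normalised KNN weights (assumes no duplicate coordinates)."""
    n = coords.shape[0]
    _, idx = KDTree(coords).query(coords, k=k + 1)  # first hit is the spot itself
    rows = np.repeat(np.arange(n), k)
    return csr_matrix((np.full(n * k, 1.0 / k), (rows, idx[:, 1:].ravel())), shape=(n, n))


def morans_i_columns(X, W):
    """Moran's I for every column of an (N, G) matrix; constant columns give 0."""
    Z = X - X.mean(axis=0)
    num = np.sum(Z * (W @ Z), axis=0)
    den = np.sum(Z ** 2, axis=0)
    out = np.zeros(X.shape[1])
    ok = den > 1e-12
    out[ok] = (X.shape[0] / W.sum()) * num[ok] / den[ok]
    return out


def coherence_sketch(pred, truth, coords, k=6, top_k_genes=50):
    """Pearson correlation of Moran's I (pred vs truth) over the top SVGs."""
    W = knn_weights_sketch(coords, k)
    mi_truth = morans_i_columns(truth, W)
    top = np.argsort(mi_truth)[-top_k_genes:]  # highest ground-truth Moran's I
    mi_pred = morans_i_columns(pred[:, top], W)
    if np.std(mi_truth[top]) < 1e-12 or np.std(mi_pred) < 1e-12:
        return 0.0  # correlation is undefined for a constant vector
    return float(np.corrcoef(mi_truth[top], mi_pred)[0, 1])


# Synthetic ground truth: checkerboard, two halves, gradient, noise.
ii, jj = np.meshgrid(np.arange(10), np.arange(10), indexing="ij")
coords = np.column_stack([ii.ravel(), jj.ravel()]).astype(float)
rng = np.random.default_rng(0)
truth = np.column_stack(
    [((ii + jj) % 2).ravel(), (jj < 5).ravel(), ii.ravel(), rng.random(100)]
).astype(float)

perfect = coherence_sketch(truth, truth, coords, k=4, top_k_genes=4)
flat = coherence_sketch(np.zeros_like(truth), truth, coords, k=4, top_k_genes=4)
```

A perfect prediction scores 1 (up to rounding), while a spatially flat
prediction falls through the zero-variance guard and scores 0.0.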

Tests:
- test_spatial_stats.py: 14 tests covering Moran's I (uniform,
  clustered, checkerboard, gradient) and coherence scoring

Docs:
- SC_BEST_PRACTICES.md: marked SVG selection and spatial coherence
  as implemented
diff --git a/README.md b/README.md
@@ -113,7 +113,7 @@ Visualization plots and spatial expression maps will be saved to the `./results`
 - **[Pathway Mapping](docs/PATHWAY_MAPPING.md)**: Clinical interpretability, pathway bottleneck design, and MSigDB integration.
 - **[Gene Analysis](docs/GENE_ANALYSIS.md)**: Modeling strategies for mapping morphology to high-dimensional gene spaces.
 - **[Data Structure](docs/DATA_STRUCTURE.md)**: Detailed breakdown of the HEST data structure on disk, metadata conventions, and preprocessing invariants.
-- **[Single-cell Best Practices](docs/SC_BEST_PRACTICES.md)**: Gap analysis and roadmap for alignment with industry standard recommendations.
+- **[Single-cell Best Practices](docs/SC_BEST_PRACTICES.md)**: Gap analysis and roadmap for alignment with standard recommendations.
 
 ## Development
 
diff --git a/docs/SC_BEST_PRACTICES.md b/docs/SC_BEST_PRACTICES.md
@@ -21,11 +21,13 @@ These are areas where the project already follows industry best practices:
 
 The following items are recommended for future sprints to improve model robustness and biological accuracy.
 
-### 1. SVG-aware Gene Selection (Moran's I)
+### 1. SVG-aware Gene Selection (Moran's I) ✅
 
-**Priority: High**  
+**Priority: High** — **Implemented**  
 **Rationale**: Currently, genes are selected based on total expression or pathway membership. However, the model's primary task is to learn spatial patterns. Selecting genes based on **Spatially Variable Gene (SVG)** metrics like Moran's I (available in Squidpy) would prioritise genes that have learned spatial coherence over those that are just highly expressed (like housekeeping genes).
 
+**Usage**: `stf-build-vocab --svg-weight 0.5 --svg-k 6` enables a hybrid ranking that blends total expression with Moran's I spatial variability. See `data/spatial_stats.py` for the implementation.
+
 ### 2. Standardised Preprocessing Pipeline
 
 **Priority: Medium-High**  
@@ -41,10 +43,12 @@ The following items are recommended for future sprints to improve model robustne
 **Priority: Medium**  
 **Rationale**: Adding explicit QC thresholds (e.g., minimum UMI count, minimum detected genes, maximum mitochondrial fraction) to the dataset loading scripts would protect the model from training on low-quality "noise" spots.
 
-### 5. Spatial Coherence Validation Metrics
+### 5. Spatial Coherence Validation Metrics ✅
 
-**Priority: Medium**  
-**Rationale**: Aggregate metrics like MSE or PCC don't capture whether the *spatial distribution* of predictions is realistic. Adding a validation step that compares the Moran's I of predicted vs. ground-truth expression would provide a much stronger biological validation signal.
+**Priority: Medium** — **Implemented**  
+**Rationale**: Aggregate metrics like MSE or PCC don't capture whether the *spatial distribution* of predictions is realistic. A validation step now compares the Moran's I of predicted vs. ground-truth expression for the top-50 spatially variable genes, reporting a Pearson correlation as the **Spatial Coherence Score**.
+
+**Integration**: Computed automatically during validation in `training/engine.py` and logged to SQLite as `spatial_coherence`. See `data/spatial_stats.py:spatial_coherence_score()`.
 
 ### 6. Preprocessing Documentation
 
diff --git a/src/spatial_transcript_former/data/spatial_stats.py b/src/spatial_transcript_former/data/spatial_stats.py
@@ -0,0 +1,173 @@
+"""
+Spatial statistics utilities for gene selection and validation.
+
+Provides a lightweight Moran's I computation (NumPy and SciPy only)
+for identifying spatially variable genes (SVGs) from spatial
+transcriptomics data.
+
+Moran's I measures spatial autocorrelation: whether nearby spots tend
+to have similar (positive I) or dissimilar (negative I) expression
+for a given gene. Genes with high Moran's I show distinct spatial
+patterns and are the strongest learning targets for
+SpatialTranscriptFormer.
+"""
+
+import numpy as np
+from scipy.spatial import KDTree
+from scipy.sparse import csr_matrix
+
+
+def _build_knn_weights(coords: np.ndarray, k: int = 6) -> csr_matrix:
+    """Build a row-normalised KNN spatial weight matrix.
+
+    Args:
+        coords: (N, 2) array of spatial coordinates.
+        k: Number of nearest neighbours per spot.
+
+    Returns:
+        (N, N) sparse CSR matrix where ``W[i, j] = 1/k`` if j is one
+        of the k nearest neighbours of i, else 0. Row-normalisation
+        ensures that the weight contribution is independent of local
+        spot density.
+    """
+    n = coords.shape[0]
+    tree = KDTree(coords)
+    # k+1 because the first neighbour returned is the point itself
+    _, indices = tree.query(coords, k=min(k + 1, n))
+
+    rows = []
+    cols = []
+    for i in range(n):
+        neighbours = indices[i]
+        neighbours = neighbours[neighbours != i][:k]
+        for j in neighbours:
+            rows.append(i)
+            cols.append(j)
+
+    data = np.ones(len(rows), dtype=np.float64) / k
+    W = csr_matrix((data, (rows, cols)), shape=(n, n))
+    return W
+
+
+def morans_i(x: np.ndarray, W: csr_matrix) -> float:
+    """Compute Moran's I for a single variable.
+
+    .. math::
+
+        I = \\frac{N}{W_{sum}} \\cdot
+            \\frac{\\sum_i \\sum_j w_{ij} (x_i - \\bar{x})(x_j - \\bar{x})}
+                  {\\sum_i (x_i - \\bar{x})^2}
+
+    Args:
+        x: (N,) array of values (e.g. gene expression per spot).
+        W: (N, N) sparse spatial weight matrix.
+
+    Returns:
+        Moran's I statistic. Ranges roughly from -1 (perfect
+        dispersion) through 0 (random) to +1 (perfect clustering).
+        Returns 0.0 if variance is zero (constant gene).
+    """
+    n = len(x)
+    x_mean = x.mean()
+    z = x - x_mean
+
+    denominator = np.sum(z ** 2)
+    if denominator < 1e-12:
+        return 0.0  # Constant expression → no spatial pattern
+
+    # W @ z gives the spatially-lagged deviation for each spot
+    lag = W.dot(z)
+    numerator = np.sum(z * lag)
+
+    W_sum = W.sum()
+    if W_sum < 1e-12:
+        return 0.0
+
+    I = (n / W_sum) * (numerator / denominator)
+    return float(I)
+
+
+def morans_i_batch(
+    expression: np.ndarray,
+    coords: np.ndarray,
+    k: int = 6,
+) -> np.ndarray:
+    """Compute Moran's I for all genes in an expression matrix.
+
+    Args:
+        expression: (N, G) dense expression matrix (spots × genes).
+        coords: (N, 2) spatial coordinates for each spot.
+        k: Number of nearest neighbours for the spatial weight graph.
+
+    Returns:
+        (G,) array of Moran's I scores, one per gene.
+    """
+    if expression.shape[0] < k + 1:
+        # Too few spots to build a meaningful KNN graph
+        return np.zeros(expression.shape[1], dtype=np.float64)
+
+    W = _build_knn_weights(coords, k=k)
+    n_genes = expression.shape[1]
+    scores = np.empty(n_genes, dtype=np.float64)
+
+    for g in range(n_genes):
+        scores[g] = morans_i(expression[:, g], W)
+
+    return scores
+
+
+def spatial_coherence_score(
+    predicted: np.ndarray,
+    ground_truth: np.ndarray,
+    coords: np.ndarray,
+    k: int = 6,
+    top_k_genes: int = 50,
+) -> float:
+    """Compare spatial structure of predictions vs ground truth.
+
+    Computes Moran's I for both the predicted and ground-truth
+    expression matrices, then returns the Pearson correlation between
+    the two Moran's I vectors. A score near 1.0 means predicted and true
+    per-gene spatial autocorrelation agree; near 0 means they are unrelated.
+
+    To keep computation fast (this runs every validation epoch), only
+    the ``top_k_genes`` with highest ground-truth spatial variability
+    are evaluated.
+
+    Args:
+        predicted: (N, G) predicted expression matrix.
+        ground_truth: (N, G) ground-truth expression matrix.
+        coords: (N, 2) spatial coordinates.
+        k: KNN neighbours for the spatial weight graph.
+        top_k_genes: Number of top-Moran's-I genes to evaluate.
+
+    Returns:
+        Pearson correlation between predicted and ground-truth Moran's I
+        vectors. Returns 0.0 for too few spots/genes or zero variance.
+    """
+    n_spots, n_genes = ground_truth.shape
+    if n_spots < k + 1 or n_genes < 2:
+        return 0.0
+
+    W = _build_knn_weights(coords, k=k)
+
+    # Compute Moran's I for ground truth
+    mi_gt = np.empty(n_genes, dtype=np.float64)
+    for g in range(n_genes):
+        mi_gt[g] = morans_i(ground_truth[:, g], W)
+
+    # Select top-K genes by ground-truth Moran's I (most spatially variable)
+    top_indices = np.argsort(mi_gt)[-top_k_genes:]
+
+    # Compute Moran's I for predictions on those genes only
+    mi_pred = np.empty(len(top_indices), dtype=np.float64)
+    mi_gt_top = mi_gt[top_indices]
+    for i, g in enumerate(top_indices):
+        mi_pred[i] = morans_i(predicted[:, g], W)
+
+    # Pearson correlation between the two Moran's I vectors
+    if np.std(mi_gt_top) < 1e-12 or np.std(mi_pred) < 1e-12:
+        return 0.0
+
+    corr = np.corrcoef(mi_gt_top, mi_pred)[0, 1]
+    return float(corr) if np.isfinite(corr) else 0.0
diff --git a/src/spatial_transcript_former/recipes/hest/build_vocab.py b/src/spatial_transcript_former/recipes/hest/build_vocab.py
@@ -8,6 +8,7 @@
 import sys
 from collections import defaultdict
 from scipy.sparse import csr_matrix
+from spatial_transcript_former.data.spatial_stats import morans_i_batch
 
 # Add src to path
 sys.path.append(os.path.abspath("src"))
@@ -42,15 +43,23 @@ def scan_h5ad_files(data_dir):
     return sample_ids
 
 
-def calculate_global_genes(data_dir, ids, num_genes=1000, target_pathways=None):
+def calculate_global_genes(
+    data_dir, ids, num_genes=1000, target_pathways=None,
+    svg_weight=0.0, svg_k=6,
+):
     st_dir = os.path.join(data_dir, "st")
     if not ids:
         print("No samples provided for calculation.")
         return [], []
 
     print(f"Scanning {len(ids)} samples in {st_dir}...")
+    if svg_weight > 0:
+        print(f"SVG mode: weight={svg_weight}, k={svg_k}")
 
     gene_totals = defaultdict(float)
+    # Moran's I accumulators (sum and count for averaging across samples)
+    gene_morans_sum = defaultdict(float)
+    gene_morans_count = defaultdict(int)
 
     for sample_id in tqdm(ids):
         h5ad_path = os.path.join(st_dir, f"{sample_id}.h5ad")
@@ -78,10 +87,27 @@ def calculate_global_genes(data_dir, ids, num_genes=1000, target_pathways=None):
                 for i, gene in enumerate(gene_names):
                     gene_totals[gene] += float(sums[i])
 
+                # --- SVG: compute Moran's I per gene for this sample ---
+                if svg_weight > 0 and "obsm" in f and "spatial" in f["obsm"]:
+                    coords = f["obsm"]["spatial"][:]
+                    # Densify the expression matrix for Moran's I
+                    if isinstance(mat, csr_matrix):
+                        dense_mat = mat.toarray()
+                    else:
+                        dense_mat = np.asarray(mat)
+
+                    mi_scores = morans_i_batch(dense_mat, coords, k=svg_k)
+
+                    for i, gene in enumerate(gene_names):
+                        gene_morans_sum[gene] += mi_scores[i]
+                        gene_morans_count[gene] += 1
+
         except Exception as e:
             print(f"Error processing {sample_id}: {e}")
 
     print(f"Aggregated counts for {len(gene_totals)} unique genes.")
+    if svg_weight > 0:
+        print(f"Computed Moran's I for {len(gene_morans_sum)} genes.")
 
     prioritized_genes = set()
     if target_pathways:
@@ -108,18 +134,61 @@ def calculate_global_genes(data_dir, ids, num_genes=1000, target_pathways=None):
 
         print(f"Found {len(prioritized_genes)} valid target pathway genes.")
 
-    # Sort all by total expression
-    sorted_all = sorted(gene_totals.items(), key=lambda x: x[1], reverse=True)
+    # --- Ranking: expression-only or hybrid ---
+    all_genes = list(gene_totals.keys())
+
+    if svg_weight > 0 and gene_morans_sum:
+        # Compute average Moran's I per gene
+        gene_morans_avg = {
+            g: gene_morans_sum[g] / gene_morans_count[g]
+            for g in all_genes
+            if gene_morans_count.get(g, 0) > 0
+        }
+
+        # Rank by expression (lower rank = higher expression)
+        expr_sorted = sorted(all_genes, key=lambda g: gene_totals[g], reverse=True)
+        expr_rank = {g: r for r, g in enumerate(expr_sorted)}
+
+        # Rank by Moran's I (lower rank = higher spatial variability)
+        mi_sorted = sorted(
+            all_genes, key=lambda g: gene_morans_avg.get(g, 0.0), reverse=True
+        )
+        mi_rank = {g: r for r, g in enumerate(mi_sorted)}
+
+        # Hybrid score: weighted sum of ranks (lower = better)
+        alpha = svg_weight
+        hybrid_score = {
+            g: (1 - alpha) * expr_rank[g] + alpha * mi_rank[g]
+            for g in all_genes
+        }
+        sorted_all_genes = sorted(all_genes, key=lambda g: hybrid_score[g])
+
+        # Build stats list with Moran's I column
+        sorted_all = [
+            (g, gene_totals[g], gene_morans_avg.get(g, 0.0))
+            for g in sorted_all_genes
+        ]
+        print(
+            f"Hybrid ranking: expression weight={(1 - alpha):.2f}, "
+            f"SVG weight={alpha:.2f}"
+        )
+    else:
+        # Expression-only ranking (original behaviour)
+        sorted_all = sorted(gene_totals.items(), key=lambda x: x[1], reverse=True)
+        sorted_all_genes = [g for g, _ in sorted_all]
+        # Pad stats tuples with 0.0 Moran's I for consistent CSV format
+        sorted_all = [(g, c, 0.0) for g, c in sorted_all]
 
     top_genes = list(prioritized_genes)
-    for g, _ in sorted_all:
+    for g in sorted_all_genes:
         if len(top_genes) >= num_genes:
             break
         if g not in prioritized_genes:
             top_genes.append(g)
 
     print(
-        f"Final set: {len(prioritized_genes)} pathway genes + {len(top_genes) - len(prioritized_genes)} global genes"
+        f"Final set: {len(prioritized_genes)} pathway genes + "
+        f"{len(top_genes) - len(prioritized_genes)} global genes"
     )
 
     return top_genes, sorted_all
@@ -147,6 +216,19 @@ def main():
         default=None,
         help="List of MSigDB pathway names to explicitly prioritize (e.g., HALLMARK_P53_PATHWAY)",
     )
+    parser.add_argument(
+        "--svg-weight",
+        type=float,
+        default=0.0,
+        help="Weight for spatial variability (Moran's I) in gene ranking. "
+             "0.0=expression-only (default), 1.0=SVG-only, 0.5=balanced.",
+    )
+    parser.add_argument(
+        "--svg-k",
+        type=int,
+        default=6,
+        help="Number of KNN neighbours for spatial weight matrix (default: 6).",
+    )
 
     args = parser.parse_args()
 
@@ -160,14 +242,15 @@ def main():
         sys.exit(1)
 
     top_genes, all_stats = calculate_global_genes(
-        args.data_dir, ids, args.num_genes, target_pathways=args.pathways
+        args.data_dir, ids, args.num_genes, target_pathways=args.pathways,
+        svg_weight=args.svg_weight, svg_k=args.svg_k,
     )
 
     print(f"Saving top {len(top_genes)} genes to {output_path}")
     with open(output_path, "w") as f:
         json.dump(top_genes, f, indent=4)
 
-    stats_df = pd.DataFrame(all_stats, columns=["gene", "total_counts"])
+    stats_df = pd.DataFrame(all_stats, columns=["gene", "total_counts", "morans_i"])
     stats_df.to_csv(output_path.replace(".json", "_stats.csv"), index=False)
     print("Saved stats to CSV.")
 
diff --git a/src/spatial_transcript_former/train.py b/src/spatial_transcript_former/train.py
@@ -184,6 +184,8 @@ def main():
             epoch_row["val_pcc"] = round(val_metrics["val_pcc"], 4)
         if val_metrics.get("pred_variance") is not None:
             epoch_row["pred_variance"] = round(val_metrics["pred_variance"], 6)
+        if val_metrics.get("spatial_coherence") is not None:
+            epoch_row["spatial_coherence"] = round(val_metrics["spatial_coherence"], 4)
         if val_metrics.get("attn_correlation") is not None:
             epoch_row["attn_correlation"] = round(val_metrics["attn_correlation"], 4)
 
diff --git a/src/spatial_transcript_former/training/engine.py b/src/spatial_transcript_former/training/engine.py
diff --git a/tests/test_spatial_stats.py b/tests/test_spatial_stats.py