Skip to content

DaviBonetto/SpectralBio

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

30 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SpectralBio

Spectral Covariance Analysis of Protein Language Model Hidden States for Zero-Shot Variant Pathogenicity Prediction

Paper | Quickstart | Experiments | Public API

CLAW4S 2026 Stanford and Princeton clawRxiv Paper Archive HuggingFace Demo

Read Full Paper PDF Launch Demo HuggingFace Open Dataset HF Datasets

Hugging Face Official Logo

clawRxiv Logo

Claw (AI Co-author) | Davi Bonetto


Abstract

SpectralBio is a reproducible, zero-shot variant-effect pipeline for missense pathogenicity prediction from sequence alone. The method leverages facebook/esm2_t30_150M_UR50D and measures mutation-induced geometric perturbations in hidden-state covariance inside a local ±40 residue window. It combines spectral features with proper masked-LM evidence to improve ranking quality without supervised training.

On TP53 ClinVar (N=255, 115 pathogenic, 140 benign), the best pair (0.55 * FrobDist + 0.45 * LL Proper) reaches AUC=0.7498 (~0.750). On BRCA1 transfer (N=100 subset), LL Proper reaches AUC=0.9174 (~0.917). Reproducibility is exact with fixed seeds (Δrep=0.0).


Key Results

Benchmark Best Method AUC-ROC Interpretation
TP53 (ClinVar, N=255) 0.55*frob_dist + 0.45*ll_proper 0.7498 Best pair in ablation search
BRCA1 transfer (N=100) ll_proper 0.9174 Strong cross-protein generalization

Main empirical takeaways

  • Matrix-level covariance features (FrobDist, TraceRatio) are consistently stronger than eigenvalue-only SPS variants on TP53.
  • The best TP53 score is a hybrid geometric + probabilistic signal, not a single raw feature.
  • LL Proper is moderate on TP53 but exceptionally strong on BRCA1 transfer.
  • Reported reproducibility check remains exact (Δrep=0.0) with deterministic seeds.

Results


Method (Brief)

For each variant, SpectralBio compares WT and MUT hidden-state covariance across ESM2 layers.

$$ \mathrm{FrobDist} = \frac{1}{L}\sum_{l=1}^{L}\left|C_{\text{MUT}}^{(l)} - C_{\text{WT}}^{(l)}\right|_F $$

$$ \mathrm{TraceRatio} = \frac{1}{L}\sum_{l=1}^{L}\left|\frac{\mathrm{tr}\left(C_{\text{MUT}}^{(l)}\right)}{\mathrm{tr}\left(C_{\text{WT}}^{(l)}\right)} - 1\right| $$

$$ \mathrm{SPS\text{-}log} = \frac{1}{L}\sum_{l=1}^{L}\left|\log\left|\lambda_{\text{MUT}}^{(l)}\right| - \log\left|\lambda_{\text{WT}}^{(l)}\right|\right|_2^2 $$

Where C^(l) is the residue-level covariance matrix at layer l, and λ^(l) is its eigenvalue spectrum.

Best TP53 combination in this release:

$$ \mathrm{Score}_{\text{best}} = 0.55\cdot\mathrm{FrobDist} + 0.45\cdot\mathrm{LL\ Proper} $$


Method Diagram (Pipeline Discussion)

Method Diagram

The diagram summarizes the end-to-end computational flow used in the paper:

  1. Variant ingestion: TP53/BRCA1 ClinVar missense variants are filtered and normalized.
  2. Sequence-local encoding: WT and MUT windows (±40) are encoded by ESM2-150M.
  3. Layerwise covariance: per-layer residue covariance matrices are computed.
  4. Spectral descriptors: FrobDist, TraceRatio, SPS-log quantify geometric perturbation.
  5. Likelihood integration: proper masked-LM likelihood (LL Proper) is blended with spectral signals.
  6. Evaluation: AUC ablations on TP53 and transfer analysis on BRCA1.

This is intentionally designed as an executable science pipeline: each block has a direct file-level artifact in this repository.


Paper Discussion (Deep Dive)

Why covariance geometry matters

Likelihood scores answer “how probable is this mutation under token prediction,” while covariance perturbation answers “how strongly did internal representation geometry shift.” In practice, those are complementary biological signals.

Why the TP53 best score is a mixture

On TP53, no single feature dominates all decision boundaries. The top-performing rule is the weighted pair (FrobDist + LL Proper), indicating that geometric and probabilistic cues contribute non-redundant information.

Why BRCA1 transfer is important

The BRCA1 result (AUC=0.9174 for LL Proper) suggests cross-protein portability in the likelihood component. This is important for zero-shot settings where no supervised retraining is available.

Current limitations

  • Two-gene scope for this release (TP53 primary + BRCA1 transfer subset).
  • Window size and feature mixing are fixed in this benchmark snapshot.
  • Intended for research reproducibility, not clinical deployment.

Feature-level interpretation from the TP53 ablation

Feature AUC-ROC (TP53) Interpretation
ll_crude 0.7026 Strong baseline proxy from hidden-state norm differences
trace_ratio 0.6242 Useful matrix-scale perturbation signal
frob_dist 0.6209 Stable global covariance displacement measure
sps_log 0.5988 Eigenvalue-only signal, weaker than matrix-level features
ll_proper 0.5956 Modest on TP53 but strongest transfer behavior on BRCA1

This ranking is exactly why SpectralBio keeps both geometric and probabilistic channels. The top TP53 score comes from feature complementarity, while the top BRCA1 score comes from transfer stability in ll_proper.


Quick Start

Option A (agents): SKILL.md

  • Open SKILL.md at repository root.
  • Execute each step sequentially.
  • Runtime target: ~24 min on T4 GPU.

Option B (humans): notebook


🤖 For AI Agents (Reproduction-First)

1) Where to start

  • Primary entrypoint: SKILL.md
  • Notebook mirror: colab/spectralbio.ipynb
  • Paper text: paper/spectralbio.tex and paper/spectralbio_clawrxiv.md

2) Minimal execution contract

  • Model: facebook/esm2_t30_150M_UR50D
  • Seeds: 42 (torch, numpy, random)
  • Core deps: torch, transformers, scipy, scikit-learn, numpy
  • Expected key metrics: TP53 ~0.750, BRCA1 LL ~0.917, Δrep=0.0

3) Agent navigation map

If you need... Open this path What you get
Reproducible workflow SKILL.md Ordered execution steps and validation criteria
Raw TP53 variants colab/results/tp53_variants.json Curated benchmark variant list
Raw per-variant scores colab/results/scores.json 11-feature table for each TP53 variant
Final metric snapshot colab/results/summary.json AUCs, counts, timing, reproducibility delta
Visual diagnostics colab/results/figures.png ROC + ablation + distributions
Interactive demo code huggingface/app.py Space logic and scoring API
Dataset metadata huggingface/dataset_card.md Schema, splits, usage notes
Submission flow submit/submit.py clawRxiv publish script

4) Output checklist after running

  • colab/results/summary.json exists and contains TP53 + BRCA1 metrics.
  • colab/results/scores.json exists and has TP53 rows with spectral/LL features.
  • colab/results/figures.png exists and renders correctly.
  • reproducibility_delta in summary remains 0.0.

5) Submission notes for agents

  • submit/submit.py expects local submit/api_key.txt.
  • Never commit API keys.
  • Publish target is clawRxiv (http://18.118.210.52).

6) 60-second bootstrap commands (agents)

git clone https://github.com/DaviBonetto/SpectralBio
cd SpectralBio
# Primary reproducibility path
# Follow SKILL.md step-by-step and validate outputs in colab/results/

7) Fast troubleshooting map

Symptom Likely cause Fix
Equations or metrics not matching expected values Seed drift or partial pipeline execution Re-run with seeds fixed to 42 and execute full SKILL.md sequence
Missing figures.png or scores.json Early stop before evaluation steps Continue notebook/SKILL flow through final validation block
Submission script fails Missing API key file Create local submit/api_key.txt (never commit)
Slow runtime CPU execution Prefer T4/A100 runtime for expected wall-clock behavior

Project Structure

SpectralBio/
├── README.md
├── LICENSE
├── .gitignore
├── SKILL.md
├── Claw4S_conference.md
├── colab/
│   ├── spectralbio.ipynb
│   └── results/
│       ├── tp53_variants.json
│       ├── tp53_sequence.txt
│       ├── summary.json
│       ├── scores.json
│       └── figures.png
├── paper/
│   ├── spectralbio.tex
│   ├── spectralbio.pdf
│   ├── spectralbio_clawrxiv.md
│   ├── references.bib
│   └── assets/
├── submit/
│   └── submit.py
└── huggingface/
    ├── app.py
    ├── dataset_card.md
    ├── requirements.txt
    ├── README.md
    ├── assets/
    └── data/

Reproducibility Snapshot

  • Model: facebook/esm2_t30_150M_UR50D
  • Window radius: ±40
  • Deterministic seeds: 42
  • TP53: N=255, AUC_best_pair=0.7498
  • BRCA1 transfer subset: N=100, AUC_ll_proper=0.9174
  • Reproducibility delta: 0.0
  • Scoring time: ~1447s (~24 minutes)

Links


Citation

@inproceedings{claw2026spectralbio,
  title={SpectralBio: Spectral Covariance Analysis of Protein Language Model Hidden States for Zero-Shot Variant Pathogenicity Prediction},
  author={Claw and Bonetto, Davi},
  booktitle={Claw4S Conference 2026, Stanford--Princeton},
  year={2026}
}

License

MIT License.


Acknowledgments

  • Claw4S 2026 organizers
  • Stanford University
  • Princeton University
  • Meta AI (ESM2)
  • ClinVar (NCBI/FDA)