Skip to content

Clint-Holt/AbLangPDB1

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

AbLangPDB1: Epitope-Aware Antibody Embeddings

State-of-the-art antibody embeddings that predict epitope overlap for targeted therapeutic discovery

Paper Model License

AbLangPDB1 generates 1536-dimensional embeddings where antibodies targeting similar epitopes cluster together - enabling rapid epitope classification, antibody search, and therapeutic discovery.

Model Description

AbLangPDB1 is designed to predict epitope overlap between antibodies by generating high-quality embeddings that capture epitope-specificity information. The model uses contrastive learning on paired heavy and light chain sequences to learn representations where antibodies targeting similar epitopes cluster together in embedding space.

Architecture

Heavy Chain Seq → [AbLang Heavy] → 768-dim → |
                                              | → [Concatenate] → [Mixer Network] → 1536-dim Paired Embedding
Light Chain Seq → [AbLang Light] → 768-dim → |

The model processes heavy and light chains independently using pre-trained AbLang models, then fuses their embeddings through a custom Mixer network (6 fully connected layers) to produce a unified 1536-dimensional embedding.

Training Data

  • Source: 1,909 non-redundant human antibodies from Structural Antibody Database (SAbDab)
  • Cutoff Date: February 19, 2024
  • Antigen Assignment: Pfam domain-based categorization using pfam_scan
  • Data Splits: 80% training, 10% validation, 10% test (clone-group aware splitting)

Quick Start

Model Download

AbLangPDB1 is hosted on HuggingFace for optimal download experience:

🤗 Download from HuggingFace Hub

# Option 1: Direct download (recommended)
curl -L "https://huggingface.co/clint-holt/AbLangPDB1/resolve/main/ablangpdb_model.safetensors?download=true" -o ablangpdb_model.safetensors

# Option 2: Using huggingface_hub
pip install huggingface_hub
python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='clint-holt/AbLangPDB1', filename='ablangpdb_model.safetensors', local_dir='.')"

Get Started in 3 Steps

# 1. Clone repository and install dependencies
git clone https://github.com/your-username/AbLangPDB1.git
cd AbLangPDB1
pip install torch pandas transformers safetensors

# 2. Download model weights (738MB)
curl -L "https://huggingface.co/clint-holt/AbLangPDB1/resolve/main/ablangpdb_model.safetensors?download=true" -o ablangpdb_model.safetensors

# 3. Run inference
python quick_start_example.py

Explore examples: pdb_inference_examples.ipynb | 🔬 Run benchmarks: benchmarking/

30-Second Example

import torch
from transformers import AutoTokenizer
from ablangpaired_model import AbLangPaired, AbLangPairedConfig

# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = AbLangPairedConfig(checkpoint_filename="ablangpdb_model.safetensors")
model = AbLangPaired(config, device).eval()

# Your antibody sequences
heavy_chain = "EVQLVESGGGLVQPGGSLRLSCAASGFNLYYYSIHWVRQAPGKGLEWVASISPYSSSTSYADSVKGRFTISADTSKNTAYLQMNSLRAEDTAVYYCARGRWYRRALDYWGQGTLVTVSS"
light_chain = "DIQMTQSPSSLSASVGDRVTITCRASQSVSSAVAWYQQKPGKAPKLLIYSASSLYSGVPSRFSGSRSGTDFTLTISSLQPEDFATYYCQQYPYYSSLITFGQGTKVEIK"

# Tokenize (add spaces between amino acids)
heavy_tokenizer = AutoTokenizer.from_pretrained("clint-holt/AbLangPDB1", subfolder="heavy_tokenizer")
light_tokenizer = AutoTokenizer.from_pretrained("clint-holt/AbLangPDB1", subfolder="light_tokenizer")

h_tokens = heavy_tokenizer(" ".join(heavy_chain), return_tensors="pt")
l_tokens = light_tokenizer(" ".join(light_chain), return_tensors="pt")

# Generate embedding
with torch.no_grad():
    embedding = model(
        h_input_ids=h_tokens['input_ids'].to(device),
        h_attention_mask=h_tokens['attention_mask'].to(device),
        l_input_ids=l_tokens['input_ids'].to(device),
        l_attention_mask=l_tokens['attention_mask'].to(device)
    )

print(f"Generated embedding shape: {embedding.shape}")  # (1, 1536)

Performance Highlights

Dataset Metric AbLangPDB1 Best Baseline
SAbDab ROC-AUC 0.89 0.82
SAbDab F1 Score 0.76 0.69
DMS ROC-AUC 0.85 0.79
DMS F1 Score 0.72 0.65

Comparison against ESM-2, AbLang, AntiBERTy, and other state-of-the-art models

Use Cases

  1. Epitope Classification: Compare antibodies with unknown epitopes against reference databases
  2. Antibody Search: Find antibodies with similar epitope specificity in large sequence databases
  3. Therapeutic Discovery: Identify candidate antibodies targeting the same epitope as reference therapeutics
  4. Antibody Clustering: Group antibodies by epitope similarity for analysis

📄 Publication

Contrastive Learning Enables Epitope Overlap Predictions for Targeted Antibody Discovery
Clinton M. Holt, Alexis K. Janke, Parastoo Amlashi, Parker J. Jamieson, Toma M. Marinov, Ivelin S. Georgiev
bioRxiv, 2025. https://doi.org/10.1101/2025.02.25.640114

Vanderbilt Center for Antibody Therapeutics, Vanderbilt University Medical Center

📈 Benchmarking

This repository includes comprehensive benchmarking code to evaluate AbLangPDB1 against other methods:

Available Scripts

  • benchmarking/calculate_metrics.py: Evaluation on SAbDab dataset with continuous epitope/antigen labels
  • benchmarking/calculate_metrics_dms.py: Evaluation on DMS dataset with binary epitope labels
  • benchmarking/get_metrics.ipynb: Interactive notebook for running benchmarks

Benchmark Datasets

  • SAbDab Dataset: Structural antibody database with epitope/antigen overlap labels
  • DMS Dataset: Deep mutational scanning data with binary epitope matching

Metrics

  • ROC-AUC: Area under the receiver operating characteristic curve
  • Average Precision: Area under the precision-recall curve
  • F1 Score: Harmonic mean of precision and recall

Training

Code for training is in the folder train To perform training:

cd train
python train_ablangpdb.py

Interpretability

Code for doing Captum Integrated Gradient Analysis is in the folder interpretability Look at the README as well as interpretability/DataAnalysis.ipynb to get started.

Paper Figures

See Paper_Figures.ipynb for the code that generated the paper figures. Not all data files are included to generate these figures. Email clinton.m.holt@gmail.com if there are specific data files you would like.

⚠️ Limitations

  • Species: Optimized for human antibodies; reduced accuracy expected for mouse BCRs
  • Domain Coverage: Lower performance for antibodies targeting antigen families not included in training. Also lower performance when not referencing new antibodies against antibodies in the PDB.
  • Custom Architecture: Requires the provided ablangpaired_model.py implementation (not compatible with standard HuggingFace model loading)

Model Weights & Resources

  • HuggingFace Model: clint-holt/AbLangPDB1
  • Direct Download:
    curl -L https://huggingface.co/clint-holt/AbLangPDB1/resolve/main/ablangpdb_model.safetensors?download=true -o ablangpdb_model.safetensors
  • Paper: bioRxiv link

📚 Citation

If you use this model or code in your research, please cite our paper:

@article{Holt2025.02.25.640114,
    author = {Holt, Clinton M. and Janke, Alexis K. and Amlashi, Parastoo and Jamieson, Parker J. and Marinov, Toma M. and Georgiev, Ivelin S.},
    title = {Contrastive Learning Enables Epitope Overlap Predictions for Targeted Antibody Discovery},
    elocation-id = {2025.02.25.640114},
    year = {2025},
    doi = {10.1101/2025.02.25.640114},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/04/01/2025.02.25.640114},
    eprint = {https://www.biorxiv.org/content/early/2025/04/01/2025.02.25.640114.full.pdf},
    journal = {bioRxiv}
}

📝 License

This project is licensed under the MIT License.

About

No description, website, or topics provided.

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Jupyter Notebook 97.1%
  • Python 2.9%