burstein-lab/B-PPI

B-PPI: A Cross-Attention Model for Large-Scale Bacterial Protein-Protein Interaction Prediction

B-PPI is a specialized framework for rapid prediction of bacterial protein-protein interactions (PPIs). It was trained on B-PPI-DB, a database of positive and negative bacterial protein-protein interactions derived from STRING, and uses a cross-attention mechanism to capture residue-level relationships between the two proteins in a pair.

Installation

pip install -r requirements.txt

Usage

B-PPI offers modules for prediction (inference) and fine-tuning. Running on a GPU is recommended. Each command is described below.

1. Prediction (Inference)

Option A: Target Specific Pairs (predict command)

Use this command to predict interactions for a specific list of protein pairs.

Input:

  • fasta_path: (Required) Path to a standard .fasta file containing the sequences for all proteins involved.
  • input_csv: (Required) Path to a .csv file defining the pairs to test. Must contain a header row with columns protein1 and protein2. The names must match the headers in the FASTA file.
  • model_path: (Required) Path to the pre-trained model file (.pt).
  • output_csv: (Required) Path where the results will be saved.
  • score_cutoff: (Required) Score threshold above which a protein pair is considered binding.

Command:

    python main.py predict \
      --fasta_path sample.fasta \
      --input_csv sample.csv \
      --model_path model.pt \
      --output_csv bppi_output.csv \
      --score_cutoff 0.6
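
Because the names in input_csv must match the FASTA headers, it can help to check for mismatches before starting a run. The following is a minimal sketch, not part of B-PPI: the helper names `fasta_ids` and `missing_proteins` are hypothetical, and it assumes that a protein's name is the first whitespace-delimited token after `>` in the header line. If B-PPI matches on the full header instead, adjust the parsing accordingly.

```python
import csv
import io


def fasta_ids(fasta_text):
    """Return the set of sequence IDs found in FASTA-formatted text.

    Assumption (not confirmed by B-PPI): the ID is the first
    whitespace-delimited token after '>' on each header line.
    """
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def missing_proteins(csv_text, known_ids):
    """Return the sorted names in protein1/protein2 that are absent from known_ids."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = set()
    for row in reader:
        for column in ("protein1", "protein2"):
            if row[column] not in known_ids:
                missing.add(row[column])
    return sorted(missing)


# Toy data for illustration only.
fasta_example = ">protA some description\nMKT\n>protB\nMAA\n"
csv_example = "protein1,protein2\nprotA,protB\nprotA,protC\n"

print(missing_proteins(csv_example, fasta_ids(fasta_example)))  # ['protC']
```

For real files, read their contents with `open(path).read()` and pass the text to the same two functions. An empty result means every name in the pair list has a sequence.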

Option B: All-vs-All Screening (predict_all command)

Use this command to predict interactions between all possible pairs of proteins drawn from two FASTA files (or within a single file, if the same path is provided for both).

Input:

  • fasta_A_path: (Required) Path to the first .fasta file.
  • fasta_B_path: (Required) Path to the second .fasta file.
  • model_path: (Required) Path to the pre-trained model file (.pt).
  • output_csv: (Required) Path where the results will be saved.
  • score_cutoff: (Required) Score threshold above which a protein pair is considered binding.

Command:

    python main.py predict_all \
      --fasta_A_path sample1.fasta \
      --fasta_B_path sample2.fasta \
      --model_path model.pt \
      --output_csv bppi_output.csv \
      --score_cutoff 0.6
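
The number of pairs in an all-vs-all screen grows with the product of the two file sizes. To screen only a subset, one option is to enumerate the pairs yourself and pass them to the predict command instead. The sketch below is an illustration under stated assumptions, not a description of how predict_all works: `all_pairs` and `write_pairs_csv` are hypothetical helpers, and the choice to skip self-pairs and mirrored duplicates when both ID lists are identical is this sketch's own.

```python
import csv
import io
from itertools import combinations, product


def all_pairs(ids_a, ids_b):
    """Enumerate candidate pairs between two lists of protein IDs.

    Assumption made by this sketch only: when both lists contain the same
    IDs, each unordered pair is listed once and self-pairs are skipped.
    """
    if set(ids_a) == set(ids_b):
        return list(combinations(sorted(set(ids_a)), 2))
    return list(product(ids_a, ids_b))


def write_pairs_csv(pairs, handle):
    """Write pairs using the protein1/protein2 header that predict expects."""
    writer = csv.writer(handle)
    writer.writerow(["protein1", "protein2"])
    writer.writerows(pairs)


# Toy IDs for illustration only.
pairs = all_pairs(["protA", "protB"], ["protX", "protY", "protZ"])
print(len(pairs))  # 6

buffer = io.StringIO()
write_pairs_csv(pairs[:2], buffer)
print(buffer.getvalue().splitlines())
```

To write a real file, open it with `open(path, "w", newline="")` and pass the handle to `write_pairs_csv`.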

2. Fine-Tuning

If you have labeled data for your bacteria of interest (protein pairs known to bind or not to bind), you can fine-tune the model on your dataset to improve accuracy. This is a two-step process.

Step 1: Extract Embeddings (prostT5_embeddings command)

Before fine-tuning, extract embeddings for your protein sequences using the ProstT5 model.

Input:

  • fasta_path: (Required) Path to a .fasta file containing all protein sequences used in your training/testing data.
  • embeddings_h5_path: (Required) Output path for the generated .h5 embeddings file.

Command:

    python main.py prostT5_embeddings \
      --fasta_path sample.fasta \
      --embeddings_h5_path sample_emb.h5

Step 2: Run Fine-Tuning (finetune command)

Train the model using your labeled data and the embeddings generated in Step 1.

Input:

  • train_csv, val_csv, test_csv: (Required) Paths to your training, validation, and testing datasets. Each .csv file must contain a header row with the columns protein1, protein2, and label (1 for a binding, i.e. positive, pair; 0 for a non-binding, i.e. negative, pair).
  • embeddings_h5: (Required) Path to the .h5 file generated in Step 1.
  • model_to_finetune: (Required) Path to the base model (.pt) you wish to fine-tune.
  • model_save_path: (Required) Path where the new fine-tuned model will be saved.

Command:

    python main.py finetune \
      --train_csv train_sample_to_finetune.csv \
      --val_csv val_sample_to_finetune.csv \
      --test_csv test_sample_to_finetune.csv \
      --embeddings_h5 sample_emb.h5 \
      --model_to_finetune model.pt \
      --model_save_path model_finetuned.pt
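
If your labeled pairs start out in a single table, they need to be divided into the three .csv files first. The following is a minimal sketch of one way to do that, assuming a simple random split by pair. The helper names `split_pairs` and `write_labeled_csv`, the 80/10/10 proportions, and the fixed seed are illustrative choices, not something B-PPI prescribes.

```python
import csv
import io
import random


def split_pairs(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle (protein1, protein2, label) rows and split them three ways.

    The fractions and seed are illustrative defaults, not B-PPI settings.
    """
    for _, _, label in rows:
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label!r}")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def write_labeled_csv(rows, handle):
    """Write rows with the protein1/protein2/label header described above."""
    writer = csv.writer(handle)
    writer.writerow(["protein1", "protein2", "label"])
    writer.writerows(rows)


# Toy data for illustration only: ten pairs with alternating labels.
rows = [(f"p{i}", f"q{i}", i % 2) for i in range(10)]
train, val, test = split_pairs(rows)
print(len(train), len(val), len(test))  # 8 1 1

buffer = io.StringIO()
write_labeled_csv(train, buffer)
print(buffer.getvalue().splitlines()[0])  # protein1,protein2,label
```

A random split by pair allows the same protein to appear in more than one split. Whether that is acceptable depends on how you intend to evaluate the fine-tuned model.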

Development

The dev/ folder contains additional modules for reproduction and benchmarking:

  • create_dataset.py: Script used to create B-PPI-DB from the STRING database.
  • train.py: The original script used to train the B-PPI model.
  • evaluate_bppi_db.ipynb: Notebook containing evaluation results of a 5-fold cross-validation on B-PPI-DB.
