B-PPI provides a specialized framework for rapid prediction of bacterial protein-protein interactions (PPIs). B-PPI was trained on B-PPI-DB, a database of positive and negative bacterial PPIs derived from STRING, and uses a cross-attention mechanism to capture residue-level relationships between protein pairs.
pip install -r requirements.txt
B-PPI offers modules for prediction (inference) and fine-tuning. Running it on a GPU is recommended. Below are the details for each command.
Option A: Target Specific Pairs (predict command)
Use this command to predict interactions for a specific list of protein pairs.
Input:
- fasta_path: (Required) Path to a standard .fasta file containing the sequences for all proteins involved.
- input_csv: (Required) Path to a .csv file defining the pairs to test. Must contain a header row with columns protein1 and protein2. The names must match the headers in the FASTA file.
- model_path: (Required) Path to the pre-trained model file (.pt).
- output_csv: (Required) Path where the results will be saved.
- score_cutoff: (Required) Score threshold; a protein pair scoring above it is considered binding.
Command:
python main.py predict \
--fasta_path sample.fasta \
--input_csv sample.csv \
--model_path model.pt \
--output_csv bppi_output.csv \
--score_cutoff 0.6
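Because the names in input_csv must match the FASTA headers, it can help to check the two files against each other before running predict. The following is a minimal sketch, not part of B-PPI: the helper names read_fasta_ids and find_unmatched_pairs are hypothetical, and it assumes a protein's name is the first whitespace-delimited token of its FASTA header line. If B-PPI matches on the full header, adjust the parsing accordingly.

```python
import csv


def read_fasta_ids(fasta_path):
    """Collect record names from a FASTA file.

    Assumption: the name is the text after '>' up to the first whitespace.
    """
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    ids.add(header.split()[0])
    return ids


def find_unmatched_pairs(fasta_path, input_csv):
    """Return (protein1, protein2) rows that name a protein absent from the FASTA file."""
    ids = read_fasta_ids(fasta_path)
    unmatched = []
    with open(input_csv, newline="") as handle:
        reader = csv.DictReader(handle)
        # The pairs file must carry a header row with these two columns.
        missing_cols = {"protein1", "protein2"} - set(reader.fieldnames or [])
        if missing_cols:
            raise ValueError(f"missing columns: {sorted(missing_cols)}")
        for row in reader:
            if row["protein1"] not in ids or row["protein2"] not in ids:
                unmatched.append((row["protein1"], row["protein2"]))
    return unmatched
```

Calling find_unmatched_pairs("sample.fasta", "sample.csv") returns an empty list when every pair can be resolved against the FASTA file.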
Option B: All-vs-All Screening (predict_all command)
Use this command to predict interactions between all potential pairs of proteins contained in two FASTA files (or within a single file if the same path is provided twice).
Input:
- fasta_A_path: (Required) Path to the first .fasta file.
- fasta_B_path: (Required) Path to the second .fasta file.
- model_path: (Required) Path to the pre-trained model file (.pt).
- output_csv: (Required) Path where the results will be saved.
- score_cutoff: (Required) Score threshold; a protein pair scoring above it is considered binding.
Command:
python main.py predict_all \
--fasta_A_path sample1.fasta \
--fasta_B_path sample2.fasta \
--model_path model.pt \
--output_csv bppi_output.csv \
--score_cutoff 0.6
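All-vs-all screening grows with the product of the two file sizes, so it can be useful to estimate the number of pairs before launching a run. The sketch below is not part of B-PPI; the function names are hypothetical. It counts header lines in each file and multiplies them. Whether predict_all skips self-pairs or symmetric duplicates when the same file is passed twice is not stated here, so treat the result as an upper bound.

```python
def count_fasta_records(fasta_path):
    """Count records in a FASTA file by counting lines that start with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def estimate_pair_count(fasta_a_path, fasta_b_path):
    """Upper bound on the number of pairs an all-vs-all screen would score."""
    return count_fasta_records(fasta_a_path) * count_fasta_records(fasta_b_path)
```

For example, two files with 4,000 proteins each give up to 16 million pairs, which is worth knowing when choosing hardware and an output location.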
If you have labeled bacterial data (pairs known to bind or not to bind), you can fine-tune the model on your dataset to improve accuracy. This is a two-step process.
Step 1: Extract Embeddings (prostT5_embeddings command)
Before fine-tuning, extract embeddings for your protein sequences using the ProstT5 model.
Input:
- fasta_path: (Required) Path to a .fasta file containing all protein sequences used in your training/testing data.
- embeddings_h5_path: (Required) Output path for the generated .h5 embeddings file.
Command:
python main.py prostT5_embeddings \
--fasta_path sample.fasta \
--embeddings_h5_path sample_emb.h5
Step 2: Run Fine-Tuning (finetune command)
Train the model using your labeled data and the embeddings generated in Step 1.
Input:
- train_csv, val_csv, test_csv: (Required) Paths to your training, validation, and testing datasets. Each .csv file must contain a header row with the columns protein1, protein2, and label (1 for a binding/positive pair, 0 for a non-binding/negative pair).
- embeddings_h5: (Required) Path to the .h5 file generated in Step 1.
- model_to_finetune: (Required) Path to the base model (.pt) you wish to fine-tune.
- model_save_path: (Required) Path where the new fine-tuned model will be saved.
Command:
python main.py finetune \
--train_csv train_sample_to_finetune.csv \
--val_csv val_sample_to_finetune.csv \
--test_csv test_sample_to_finetune.csv \
--embeddings_h5 sample_emb.h5 \
--model_to_finetune model.pt \
--model_save_path model_finetuned.pt
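The three CSV files passed to finetune can be produced from a single labeled file. The following is a minimal sketch under stated assumptions, not the procedure used to build B-PPI-DB: split_pairs, its parameters, and the 80/10/10 default are hypothetical, and the split is a plain random shuffle over pairs. A random pair-level split lets the same protein appear in several splits, which can inflate test scores; a protein-level or similarity-aware split may suit your data better.

```python
import csv
import random


def split_pairs(labeled_csv, train_csv, val_csv, test_csv,
                val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle labeled pairs and write train/val/test CSVs with the required header."""
    with open(labeled_csv, newline="") as handle:
        rows = list(csv.DictReader(handle))

    # Labels must be 1 (binding) or 0 (non-binding).
    for row in rows:
        if row["label"] not in {"0", "1"}:
            raise ValueError(f"unexpected label: {row['label']!r}")

    random.Random(seed).shuffle(rows)  # seeded for a reproducible split
    n_val = int(len(rows) * val_frac)
    n_test = int(len(rows) * test_frac)
    splits = {
        val_csv: rows[:n_val],
        test_csv: rows[n_val:n_val + n_test],
        train_csv: rows[n_val + n_test:],
    }

    for path, subset in splits.items():
        with open(path, "w", newline="") as handle:
            writer = csv.DictWriter(
                handle,
                fieldnames=["protein1", "protein2", "label"],
                extrasaction="ignore",  # drop any extra columns in the input
            )
            writer.writeheader()
            writer.writerows(subset)

    # Report how many pairs landed in each output file.
    return {path: len(subset) for path, subset in splits.items()}
```

Every protein named in the resulting files still needs an entry in the FASTA file used for Step 1, so that its embedding is present in the .h5 file.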
The dev/ folder contains additional modules for reproduction and benchmarking:
- create_dataset.py: Script used to create B-PPI-DB from the STRING database.
- train.py: The original script used to train the B-PPI model.
- evaluate_bppi_db.ipynb: Notebook containing evaluation results of a 5-fold cross-validation on B-PPI-DB.