arminshzd/gskernel
Generic String Kernel implementation in GPyTorch

Overview

This repository implements a Gaussian process model for biological sequences based on a Generic String Kernel (GSK). The core pieces are:

  • gskernel: a custom kernel that scores similarity between sequences using amino acid properties.
  • gskgpr: a Gaussian process regression model that wraps the kernel for training and prediction.
  • seq2ascii: a helper that encodes sequences into tensors the kernel can consume.

The main goal of developing this version is to integrate GSK into BoTorch active learning pipelines.

Submodules

gskernel

The GenericStringKernel class extends gpytorch.kernels.Kernel to compute a normalized string kernel over amino acid sequences. Given a translator from seq2ascii, it:

  • Pre-computes a BLOSUM-derived similarity matrix and pairwise energy terms.
  • Computes subsequence alignment scores up to a maximum length L with learnable length scale parameters sigma1 and sigma2.
  • Builds full Gram matrices (including diagonal and cross-covariances) that plug directly into GPyTorch models.
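To make the computation concrete, here is a deliberately simplified, stdlib-only sketch of the GS kernel's structure, following the general form in Giguère et al. (2013): a position-distance penalty multiplied by a running property-similarity penalty, summed over substring starts and lengths. Everything here (the function names, the toy property map, the unit length scales) is illustrative; the actual GenericStringKernel vectorizes this in torch and exposes sigma1 and sigma2 as learnable hyperparameters.

```python
import math

def gs_kernel(x, y, psi, L=2, sigma_p=1.0, sigma_c=1.0):
    """Toy GS kernel: psi maps each residue to a property vector."""
    total = 0.0
    for i in range(len(x)):
        for j in range(len(y)):
            # Penalty for how far apart the two substrings start.
            pos = math.exp(-((i - j) ** 2) / (2 * sigma_p ** 2))
            sim = 1.0
            # Extend the aligned substrings one residue at a time, up to L.
            for l in range(min(L, len(x) - i, len(y) - j)):
                d2 = sum((a - b) ** 2
                         for a, b in zip(psi[x[i + l]], psi[y[j + l]]))
                sim *= math.exp(-d2 / (2 * sigma_c ** 2))
                total += pos * sim
    return total

def normalized_gs(x, y, psi, **kw):
    # Normalize so that K(x, x) == 1, as the kernel class does for
    # its Gram matrices.
    return gs_kernel(x, y, psi, **kw) / math.sqrt(
        gs_kernel(x, x, psi, **kw) * gs_kernel(y, y, psi, **kw))
```

With any property map, the normalized kernel scores a sequence against itself as exactly 1 and similar-but-different sequences as values in (0, 1).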

gskgpr

GaussianStringKernelGP is a minimal GPyTorch exact GP model that pairs the string kernel with a constant mean and an observation likelihood. It is designed to operate on encoded sequences and can be trained with standard GPyTorch routines (e.g., using an ExactMarginalLogLikelihood).
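For intuition about what the model computes at prediction time, the dependency-free sketch below carries out the exact-GP posterior-mean arithmetic by hand: a constant mean, a precomputed kernel Gram matrix, and fixed per-point observation noise. This mirrors the math GPyTorch performs internally; the function names and the naive linear solver are illustrative, not part of the actual class.

```python
def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting; fine for the
    # tiny matrices in this sketch.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def gp_posterior_mean(K_train, K_cross, y, noise):
    # Exact-GP predictive mean with a constant mean m:
    #   mean_* = K_*^T (K + diag(noise))^{-1} (y - m) + m
    m = sum(y) / len(y)
    A = [[K_train[i][j] + (noise[i] if i == j else 0.0)
          for j in range(len(y))] for i in range(len(y))]
    alpha = solve(A, [yi - m for yi in y])
    return [sum(k * a for k, a in zip(row, alpha)) + m for row in K_cross]
```

With near-zero noise, a test point whose kernel row matches a training point reproduces that point's target, which is the interpolating behavior an exact GP should show.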

seq2ascii

Seq2Ascii loads a pickled amino acid property dictionary (e.g., BLOSUM62) and provides:

  • Encoding of raw sequence strings to integer tensors (encode, encode_list).
  • Translation of encoded indices into property vectors (translate_to_ord), which the kernel uses internally.
  • Optional utilities for decoding indices back to strings and generating property matrices (get_psi).
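As a rough illustration of the translator's job, the toy class below performs a stdlib-only version of the fit/encode/translate round trip. The method names loosely mirror those listed above, but the real Seq2Ascii loads its property dictionary from a pickle and works with torch tensors, so treat every detail here as an assumption.

```python
class ToyTranslator:
    """Illustrative stand-in for a Seq2Ascii-style translator."""

    def __init__(self, properties):
        self.properties = properties  # residue -> property vector
        self.vocab = {}               # residue -> integer index

    def fit(self, sequences):
        # Build the residue vocabulary from every sequence in the input space.
        for seq in sequences:
            for aa in seq:
                self.vocab.setdefault(aa, len(self.vocab))
        return self

    def encode(self, seq):
        # Raw string -> list of integer indices (the kernel's input format).
        return [self.vocab[aa] for aa in seq]

    def translate_to_ord(self, encoded):
        # Integer indices -> property vectors, as used inside the kernel.
        inv = {i: aa for aa, i in self.vocab.items()}
        return [self.properties[inv[i]] for i in encoded]
```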

How the pieces fit together

  1. Prepare a translator with Seq2Ascii, pointing it to a pickled amino acid property dictionary and fitting it on your training sequences.
  2. Encode sequences to integer tensors and translate them to property representations on the fly inside the kernel.
  3. Initialize a GaussianStringKernelGP with the translator so the kernel can transform inputs and compute the sequence-aware covariance.
  4. Train the GP with your preferred optimizer and use the model to make posterior predictions on new sequences.

Minimal usage example

There is an end-to-end example under examples/example_gpt.ipynb. In short, the setup looks like this:

import torch
import gpytorch
from gpytorch.likelihoods import FixedNoiseGaussianLikelihood
from gpytorch.mlls import ExactMarginalLogLikelihood
from gskernel.seq2ascii import Seq2Ascii
from gskernel.gskgpr import GaussianStringKernelGP

# 1) Build a translator from a pickled amino acid property dictionary
translator = Seq2Ascii("AA_property_mat.pkl")
input_space = [...]  # ALL of your input sequences, i.e. everything in your chemical space
translator.fit(input_space)

# 2) Encode sequences to tensors
train_x = translator.encode_to_int(train_sequences)
train_y = torch.tensor(train_values)
err_y = torch.tensor(error_values) # these are sigma^2 values

# 3) Set up GP model and likelihood
likelihood = FixedNoiseGaussianLikelihood(noise=err_y)
model = GaussianStringKernelGP(train_x=train_x, train_y=train_y,
                               likelihood=likelihood,
                               translator=translator, L=L)  # L is the maximum sequence length
model.num_outputs = 1
mll = ExactMarginalLogLikelihood(likelihood, model)

# 4) Train (very small illustrative loop)
model.train(); model.likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(50):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    optimizer.step()

# 5) Evaluate on new sequences
model.eval(); model.likelihood.eval()
test_x = translator.encode_to_int(["ACDG"])
with torch.no_grad():
    posterior = model.likelihood(model(test_x))
    mean = posterior.mean  # predictive mean
    lower, upper = posterior.confidence_region()

Installation

A Python package definition is not yet provided; once packaging is added, the project will be installable with pip, e.g. from a clone in editable mode:

pip install -e .

Ensure the required dependencies (e.g., torch, gpytorch, botorch, tqdm, matplotlib) are available in your environment.

Additional info and resources

The Generic String Kernel was developed by Giguère et al. in the following work:

@article{giguere2013learning,
  title={Learning a peptide-protein binding affinity predictor with kernel ridge regression},
  author={Giguere, S{\'e}bastien and Marchand, Mario and Laviolette, Fran{\c{c}}ois and Drouin, Alexandre and Corbeil, Jacques},
  journal={BMC Bioinformatics},
  volume={14},
  number={1},
  pages={82},
  year={2013},
  publisher={Springer}
}

The backbone of this specific version was developed as part of an active learning campaign for designing protein-catalyzed capture agents, as described in the following reference:

@article{zadeh2025high,
  title={High-throughput virtual screening of protein-catalyzed capture agents for novel hydrogel-nanoparticle fentanyl sensors},
  author={Zadeh, Armin Shayesteh and Winton, Alexander J and Palomba, Joseph M and Ferguson, Andrew L},
  journal={The Journal of Physical Chemistry B},
  volume={129},
  number={40},
  pages={10568--10583},
  year={2025},
  publisher={ACS Publications}
}
