This repository implements a Gaussian process model for biological sequences based on a Generic String Kernel (GSK). The core pieces are:
- gskernel: a custom kernel that scores similarity between sequences using amino acid properties.
- gskgpr: a Gaussian process regression model that wraps the kernel for training and prediction.
- seq2ascii: a helper that encodes sequences into tensors the kernel can consume.
The main goal of developing this version is to integrate GSK into BoTorch active learning pipelines.
The GenericStringKernel class extends gpytorch.kernels.Kernel to compute a normalized string kernel over amino acid sequences. Given a translator from seq2ascii, it:
- Pre-computes a BLOSUM-derived similarity matrix and pairwise energy terms.
- Computes subsequence alignment scores up to a maximum length L, with learnable length scale parameters sigma1 and sigma2.
- Builds full Gram matrices (including diagonal and cross-covariances) that plug directly into GPyTorch models.
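Conceptually, the score the kernel accumulates follows the Generic String Kernel of Giguere et al.: for every pair of substring start positions it multiplies a position-penalty term by a property-similarity term, summed over substring lengths up to L, and the result is cosine-normalized so that K(x, x) = 1. A plain-Python sketch under stated assumptions (PSI is a made-up one-dimensional property map standing in for the BLOSUM-derived vectors, and gs_kernel / gs_normalized are illustrative names, not the package API):

```python
import math

# Toy 1-D "property" per amino acid; the real kernel uses BLOSUM-derived vectors.
PSI = {"A": 0.0, "C": 1.0, "D": 2.0, "G": 0.5}

def gs_kernel(x, y, L=2, sigma1=1.0, sigma2=1.0):
    """Unnormalized Generic String Kernel between two sequences.

    For each pair of start positions (i, j), a Gaussian position penalty
    (length scale sigma1) multiplies a Gaussian property-distance term
    (length scale sigma2), accumulated over substring lengths up to L.
    """
    total = 0.0
    for i in range(len(x)):
        for j in range(len(y)):
            pos = math.exp(-((i - j) ** 2) / (2 * sigma1 ** 2))
            dist2 = 0.0
            for l in range(min(L, len(x) - i, len(y) - j)):
                # Cumulative squared property distance along the aligned substring.
                dist2 += (PSI[x[i + l]] - PSI[y[j + l]]) ** 2
                total += pos * math.exp(-dist2 / (2 * sigma2 ** 2))
    return total

def gs_normalized(x, y, **kw):
    # Cosine-style normalization so that K(x, x) == 1.
    return gs_kernel(x, y, **kw) / math.sqrt(gs_kernel(x, x, **kw) * gs_kernel(y, y, **kw))
```

The kernel is symmetric by construction, and normalization keeps self-similarity at exactly one, which is what lets the Gram matrices plug into GPyTorch without further scaling.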
GaussianStringKernelGP is a minimal GPyTorch exact GP model that pairs the string kernel with a constant mean and an observation likelihood. It is designed to operate on encoded sequences and can be trained with standard GPyTorch routines (e.g., using an ExactMarginalLogLikelihood).
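For intuition, the prediction an exact GP performs reduces to standard Gaussian process algebra: the predictive mean at a test point is k_*^T (K + diag(noise))^{-1} y. A dependency-free sketch of that computation (posterior_mean is a hypothetical helper shown on a hand-invertible 2x2 system, not the package's tensor code):

```python
def posterior_mean(K, k_star, y, noise):
    """Exact GP predictive mean k_*^T (K + diag(noise))^{-1} y for 2 training points."""
    # Add per-observation noise variances to the diagonal of the 2x2 Gram matrix.
    a, b = K[0][0] + noise[0], K[0][1]
    c, d = K[1][0], K[1][1] + noise[1]
    det = a * d - b * c
    # Solve (K + diag(noise)) alpha = y via the closed-form 2x2 inverse.
    alpha = [(d * y[0] - b * y[1]) / det,
             (-c * y[0] + a * y[1]) / det]
    # Project the test-train covariances onto the solved weights.
    return k_star[0] * alpha[0] + k_star[1] * alpha[1]
```

With an identity Gram matrix and zero noise, a test point that covaries only with the first training point simply recovers that point's label, which is the behavior GPyTorch's exact inference generalizes to full kernels.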
Seq2Ascii loads a pickled amino acid property dictionary (e.g., BLOSUM62) and provides:
- Encoding of raw sequence strings to integer tensors (encode, encode_list).
- Translation of encoded indices into property vectors (translate_to_ord), which the kernel uses internally.
- Optional utilities for decoding indices back to strings and generating property matrices (get_psi).
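As an illustration of the fit/encode contract (ToySeq2Int is a made-up miniature, not the real class, and it omits the property-dictionary handling):

```python
class ToySeq2Int:
    """Minimal sketch of a sequence encoder: fit builds a character-to-index
    vocabulary from training sequences; encode maps a sequence to indices."""

    def fit(self, sequences):
        # Collect every amino acid seen in the training data, in sorted order.
        alphabet = sorted({aa for seq in sequences for aa in seq})
        self.vocab = {aa: i for i, aa in enumerate(alphabet)}
        return self

    def encode(self, seq):
        # Map each residue to its integer index; unseen residues raise KeyError,
        # which is why the translator must be fit on the full input space first.
        return [self.vocab[aa] for aa in seq]
```

This is why the workflow below fits the translator on all sequences in the chemical space before encoding anything.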
- Prepare a translator with Seq2Ascii, pointing it to a pickled amino acid property dictionary and fitting it on your training sequences.
- Encode sequences to integer tensors and translate them to property representations on the fly inside the kernel.
- Initialize a GaussianStringKernelGP with the translator so the kernel can transform inputs and compute the sequence-aware covariance.
- Train the GP with your preferred optimizer and use the model to make posterior predictions on new sequences.
An end-to-end example is available under examples/example_gpt.ipynb. In short, the setup looks like this:
import torch
import gpytorch
from gpytorch.likelihoods import FixedNoiseGaussianLikelihood
from gpytorch.mlls import ExactMarginalLogLikelihood
from gskernel.seq2ascii import Seq2Ascii
from gskernel.gskgpr import GaussianStringKernelGP
# 1) Build a translator from a pickled amino acid property dictionary
translator = Seq2Ascii("AA_property_mat.pkl")
input_space = ["Include all your input sequences. EVERYTHING from your chemical space."]
translator.fit(input_space)
# 2) Encode sequences to tensors
train_x = translator.encode_to_int(train_sequences)
train_y = torch.tensor(train_values)
err_y = torch.tensor(error_values)  # these are sigma^2 values
# 3) Set up GP model and likelihood
likelihood = FixedNoiseGaussianLikelihood(noise=err_y)
model = GaussianStringKernelGP(train_x=train_x, train_y=train_y,
                               likelihood=likelihood,
                               translator=translator, L=L)  # L is the maximum seq length
model.num_outputs = 1
mll = ExactMarginalLogLikelihood(likelihood, model)
# 4) Train (very small illustrative loop)
model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(50):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    optimizer.step()
# 5) Evaluate on new sequences
model.eval(); likelihood.eval()
test_x = translator.encode_to_int(["ACDG"])
with torch.no_grad():
    posterior = likelihood(model(test_x))
    mean = posterior.mean  # predictive mean
    lower, upper = posterior.confidence_region()

A Python package definition is not yet provided. Once packaging is added, the project can be installed with pip. In the meantime, you can work from a clone in editable mode:
pip install -e .

Ensure the required dependencies (e.g., torch, gpytorch, botorch, tqdm, matplotlib) are available in your environment.
The Generic String Kernel was developed by Giguere et al. in the following work:
@article{giguere2013learning,
title={Learning a peptide-protein binding affinity predictor with kernel ridge regression},
author={Giguere, S{\'e}bastien and Marchand, Mario and Laviolette, Fran{\c{c}}ois and Drouin, Alexandre and Corbeil, Jacques},
journal={BMC bioinformatics},
volume={14},
number={1},
pages={82},
year={2013},
publisher={Springer}
}

The backbone of this specific version was developed as part of an active learning campaign for designing Protein Catalyzed Capture agents, as described in the following reference:
@article{zadeh2025high,
title={High-throughput virtual screening of protein-catalyzed capture agents for novel hydrogel-nanoparticle fentanyl sensors},
author={Zadeh, Armin Shayesteh and Winton, Alexander J and Palomba, Joseph M and Ferguson, Andrew L},
journal={The Journal of Physical Chemistry B},
volume={129},
number={40},
pages={10568--10583},
year={2025},
publisher={ACS Publications}
}