This repository implements a Gaussian process model for biological sequences based on a Generic String Kernel (GSK). The core pieces are:
- gskernel: a custom kernel that scores similarity between sequences using amino acid properties.
- gskgpr: a Gaussian process regression model that wraps the kernel for training and prediction.
- seq2ascii: a helper that encodes sequences into tensors the kernel can consume.
The main goal of developing this version is to integrate GSK into BoTorch active learning pipelines.
The GenericStringKernel class extends gpytorch.kernels.Kernel to compute a normalized string kernel over amino acid sequences. Given a translator from seq2ascii, it:
- Pre-computes a BLOSUM-derived similarity matrix and pairwise energy terms.
- Computes subsequence alignment scores up to a maximum length L, with learnable length scale parameters sigma1 and sigma2.
- Builds full Gram matrices (including diagonal and cross-covariances) that plug directly into GPyTorch models.
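Conceptually, the score the kernel accumulates follows the Generic String Kernel of Giguere et al.: for every pair of substring start positions it multiplies a position-penalty term by a property-similarity term, summed over substring lengths up to L, and the result is cosine-normalized so that K(x, x) = 1. A plain-Python sketch under stated assumptions (PSI is a made-up one-dimensional property map standing in for the BLOSUM-derived vectors, and gs_kernel / gs_normalized are illustrative names, not the package API):

```python
import math

# Toy 1-D "property" per amino acid; the real kernel uses BLOSUM-derived vectors.
PSI = {"A": 0.0, "C": 1.0, "D": 2.0, "G": 0.5}

def gs_kernel(x, y, L=2, sigma1=1.0, sigma2=1.0):
    """Unnormalized Generic String Kernel between two sequences.

    For each pair of start positions (i, j), a Gaussian position penalty
    (length scale sigma1) multiplies a Gaussian property-distance term
    (length scale sigma2), accumulated over substring lengths up to L.
    """
    total = 0.0
    for i in range(len(x)):
        for j in range(len(y)):
            pos = math.exp(-((i - j) ** 2) / (2 * sigma1 ** 2))
            dist2 = 0.0
            for l in range(min(L, len(x) - i, len(y) - j)):
                # Cumulative squared property distance along the aligned substring.
                dist2 += (PSI[x[i + l]] - PSI[y[j + l]]) ** 2
                total += pos * math.exp(-dist2 / (2 * sigma2 ** 2))
    return total

def gs_normalized(x, y, **kw):
    # Cosine-style normalization so that K(x, x) == 1.
    return gs_kernel(x, y, **kw) / math.sqrt(gs_kernel(x, x, **kw) * gs_kernel(y, y, **kw))
```

The kernel is symmetric by construction, and normalization keeps self-similarity at exactly one, which is what lets the Gram matrices plug into GPyTorch without further scaling.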
GaussianStringKernelGP is a minimal GPyTorch exact GP model that pairs the string kernel with a constant mean and an observation likelihood. It is designed to operate on encoded sequences and can be trained with standard GPyTorch routines (e.g., using an ExactMarginalLogLikelihood).
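For intuition, the prediction an exact GP performs reduces to standard Gaussian process algebra: the predictive mean at a test point is k_*^T (K + diag(noise))^{-1} y. A dependency-free sketch of that computation (posterior_mean is a hypothetical helper shown on a hand-invertible 2x2 system, not the package's tensor code):

```python
def posterior_mean(K, k_star, y, noise):
    """Exact GP predictive mean k_*^T (K + diag(noise))^{-1} y for 2 training points."""
    # Add per-observation noise variances to the diagonal of the 2x2 Gram matrix.
    a, b = K[0][0] + noise[0], K[0][1]
    c, d = K[1][0], K[1][1] + noise[1]
    det = a * d - b * c
    # Solve (K + diag(noise)) alpha = y via the closed-form 2x2 inverse.
    alpha = [(d * y[0] - b * y[1]) / det,
             (-c * y[0] + a * y[1]) / det]
    # Project the test-train covariances onto the solved weights.
    return k_star[0] * alpha[0] + k_star[1] * alpha[1]
```

With an identity Gram matrix and zero noise, a test point that covaries only with the first training point simply recovers that point's label, which is the behavior GPyTorch's exact inference generalizes to full kernels.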
Seq2Ascii loads a pickled amino acid property dictionary (e.g., BLOSUM62) and provides:
- Encoding of raw sequence strings to integer tensors (encode, encode_list).
- Translation of encoded indices into property vectors (translate_to_ord), which the kernel uses internally.
- Optional utilities for decoding indices back to strings and generating property matrices (get_psi).
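As an illustration of the fit/encode contract (ToySeq2Int is a made-up miniature, not the real class, and it omits the property-dictionary handling):

```python
class ToySeq2Int:
    """Minimal sketch of a sequence encoder: fit builds a character-to-index
    vocabulary from training sequences; encode maps a sequence to indices."""

    def fit(self, sequences):
        # Collect every amino acid seen in the training data, in sorted order.
        alphabet = sorted({aa for seq in sequences for aa in seq})
        self.vocab = {aa: i for i, aa in enumerate(alphabet)}
        return self

    def encode(self, seq):
        # Map each residue to its integer index; unseen residues raise KeyError,
        # which is why the translator must be fit on the full input space first.
        return [self.vocab[aa] for aa in seq]
```

This is why the workflow below fits the translator on all sequences in the chemical space before encoding anything.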
- Prepare a translator with Seq2Ascii, pointing it to a pickled amino acid property dictionary and fitting it on your training sequences.
- Encode sequences to integer tensors and translate them to property representations on the fly inside the kernel.
- Initialize a GaussianStringKernelGP with the translator so the kernel can transform inputs and compute the sequence-aware covariance.
- Train the GP with your preferred optimizer and use the model to make posterior predictions on new sequences.
An end-to-end example is available under examples/example_gpt.ipynb. In short, the setup looks like this:
import torch
import gpytorch
from gpytorch.likelihoods import FixedNoiseGaussianLikelihood
from gpytorch.mlls import ExactMarginalLogLikelihood
from gskernel.seq2ascii import Seq2Ascii
from gskernel.gskgpr import GaussianStringKernelGP
# 1) Build a translator from a pickled amino acid property dictionary
translator = Seq2Ascii("AA_property_mat.pkl")
input_space = ["Include all your input sequences. EVERYTHING from your chemical space."]
translator.fit(input_space)
# 2) Encode sequences to tensors
train_x = translator.encode_to_int(train_sequences)
train_y = torch.tensor(train_values)
err_y = torch.tensor(error_values)  # these are sigma^2 values
# 3) Set up GP model and likelihood
likelihood = FixedNoiseGaussianLikelihood(noise=err_y)
model = GaussianStringKernelGP(train_x=train_x, train_y=train_y,
                               likelihood=likelihood,
                               translator=translator, L=L)  # L is the maximum seq length
model.num_outputs = 1
mll = ExactMarginalLogLikelihood(likelihood, model)
# 4) Train (very small illustrative loop)
model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(50):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    optimizer.step()
# 5) Evaluate on new sequences
model.eval(); likelihood.eval()
test_x = translator.encode_to_int(["ACDG"])
with torch.no_grad():
    posterior = likelihood(model(test_x))
    mean = posterior.mean  # predictive mean
    lower, upper = posterior.confidence_region()

A Python package definition is not yet provided. Once packaging is added, the project can be installed with pip. In the meantime, you can work from a clone in editable mode:
pip install -e .

Ensure the required dependencies (e.g., torch, gpytorch, botorch, tqdm, matplotlib) are available in your environment.
The Generic String Kernel was developed by Giguere et al. in the following work:
@article{giguere2013learning,
title={Learning a peptide-protein binding affinity predictor with kernel ridge regression},
author={Giguere, S{\'e}bastien and Marchand, Mario and Laviolette, Fran{\c{c}}ois and Drouin, Alexandre and Corbeil, Jacques},
journal={BMC bioinformatics},
volume={14},
number={1},
pages={82},
year={2013},
publisher={Springer}
}

The backbone of this specific version was developed as part of an active learning campaign for designing Protein Catalyzed Capture agents, as described in the following reference:
@article{zadeh2025high,
title={High-throughput virtual screening of protein-catalyzed capture agents for novel hydrogel-nanoparticle fentanyl sensors},
author={Zadeh, Armin Shayesteh and Winton, Alexander J and Palomba, Joseph M and Ferguson, Andrew L},
journal={The Journal of Physical Chemistry B},
volume={129},
number={40},
pages={10568--10583},
year={2025},
publisher={ACS Publications}
}