Skip to content

Kernels

Mattia Cervellini edited this page Nov 14, 2025 · 6 revisions

Kernel methods provide sophisticated similarity and distance measures for comparing metabolic hypergraphs. These methods capture different aspects of structural and functional similarity between organisms.

Overview

Kernel methods define similarity functions that can be used directly for analysis or converted to distance matrices for downstream applications. Each kernel captures different properties of metabolic networks, from simple overlap measures to complex structural similarities [1].

Method Description

  1. Data Preprocessing: Load and preprocess metabolic pathway data for all organisms
  2. Kernel-Specific Processing: Apply method-specific transformations (histograms, stratification, etc...)
  3. Similarity Computation: Calculate pairwise similarities using the kernel function
  4. Distance Conversion: Convert similarities to distances (typically: distance = 1 - similarity)
  5. Matrix Storage: Save symmetric distance matrices and organism labels

Usage

Available Scripts

The following kernel implementations are available:

  • HistogramKernel_refactor.py
    • Implements the histogram cosine kernel (HCK)
  • JaccardKernel_refactor.py
    • Implements the Jaccard similarity kernel (JK)
  • EditKernel_refactor.py
    • Implements the edit distance kernel (EK)
  • StratifiedEditKernel_refactor.py
    • Implements the stratified edit distance kernel (SEK)

How to Run

cd EmbeddingsAndKernels/Kernels/

# execute the Jaccard kernel
python JaccardKernel_refactor.py

# execute the Histogram kernel  
python HistogramKernel_refactor.py

# execute the Edit kernel
python EditKernel_refactor.py

# execute the Stratified edit kernel
python StratifiedEditKernel_refactor.py

Input

  • Dataset: ../../data/MetabolicPathways_DATASET_Python.pkl
    • Dataset containing metabolic pathway information for multiple organisms

Output

  • Kernel Matrices:
    • ../../data/distances/JACCARD_DISTANCE.pkl
    • ../../data/distances/HIST_DISTANCE.pkl
    • ../../data/distances/EDIT_DISTANCE.pkl
    • ../../data/distances/STRATEDIT_DISTANCE.pkl
  • Organism Labels: ../../data/distances/ORG_*_DISTANCE.pkl

Note: the output subfolders (distances), if not already present, will be created by the script.

Details

Description of the output files:

  • *DISTANCE.pkl files: Symmetric NumPy arrays containing pairwise distances between organism metabolic networks;
  • ORG_*_DISTANCE.pkl files: Files containing organism labels corresponding to the rows/columns in the distance matrices.

Kernels

Jaccard Kernel

Measures similarity based on intersection over union of hyperedge sets:

$$J(G_i, G_j) = \frac{\lvert H_i \cap H_j \rvert}{\lvert H_i \cup H_j \rvert}$$

where $H_i$ and $H_j$ are the sets of hyperedges for organisms $i$ and $j$.

Histogram Kernel

Compares organisms based on normalized hyperedge frequency vectors:

$$K_{hist}(G_i, G_j) = \frac{\mathbf{h_i} \cdot \mathbf{h_j}}{\lVert \mathbf{h_i} \rVert \cdot \lVert \mathbf{h_j} \rVert}$$

where $\mathbf{h_i}$ and $\mathbf{h_j}$ are the normalized frequency histograms of hyperedges.

Edit Kernel

Based on normalized Levenshtein distance between hyperedge sequences:

$$K_{edit}(G_i, G_j) = 1 - \frac{2 \cdot d_{Lev}(S_i, S_j)}{\lvert S_i \rvert + \lvert S_j \rvert + d_{Lev}(S_i, S_j)}$$

where $d_{Lev}(S_i, S_j)$ is the Levenshtein distance between sequences $S_i$ and $S_j$, and $|S_i|$ denotes sequence length.

Stratified Edit Kernel

Computes edit distance separately for each hyperedge order (complexity level):

$$K_{strat}(G_i, G_j) = \frac{1}{\lvert O \rvert} \sum_{o \in O} \left(1 - \frac{2 \cdot d_{Lev}(S_i^o, S_j^o)}{\lvert S_i^o \rvert + \lvert S_j^o \rvert + d_{Lev}(S_i^o, S_j^o)}\right)$$

where $O$ is the set of hyperedge orders, and $S_i^o$ represents the sequence of hyperedges of order $o$ for organism $i$.

On Parallel Processing

The following kernels:

  • Jaccard
  • Edit
  • Stratified Edit

rely on parallel processing to speed up computations on large datasets. Each process is designed to read an hypergraph, and evaluate the kernel function with respect to all other hypergraphs independently. Since all kernel functions are also symmetric, only the upper triangular portion of the distance matrix is computed, reducing computational load.

The end user can specify the number of processes to use when running the scripts by changing the num_cores variable in each script. By default, it is set to utilize all available CPU cores. The scripts exploit plain multiprocessing to overcome Python's Global Interpreter Lock (GIL) limitations. However, this approach may lead to increased memory consumption as each process maintains its own copy of the data. An improvement might be to use shared memory to store the dataset, allowing all processes to access the same data without duplication.

The Histogram kernel is fully vectorised and does not benefit from parallel processing.

On the Levenshtein Distance Computation

For the Edit and Stratified Edit kernels, it is strongly recommended to compile the core script levenshteinDistance.pyx using Cython. This will significantly speed up the computation of the Levenshtein distances. To compile the script, create a file named setup.py with the following content:

from distutils.core import setup
from Cython.Build import cythonize
setup(
    ext_modules = cythonize("levenshteinDistance.pyx", annotate=True, compiler_directives={'language_level' : "3"})
)

Then, run the following command in the terminal:

python setup.py build_ext --inplace

This will compile the .pyx script into a C/C++ file and then compiles the C/C++ file into an extension module (a .so or .pyd file on macOS/Linux and Windows, respectively) that can be imported in Python.

If you do not compile the Cython code, just rename the file to levenshteinDistance.py and the scripts will still run using a pure Python implementation, but the computation will be significantly slower, especially for large datasets.

References

[1] Martino, A., & Rizzi, A. (2020). (Hyper)graph kernels over simplicial complexes. Entropy, 22(10), 1155. DOI: 10.3390/e22101155

Jump to

Navigation

Home

Embedding Methods

Kernel Methods

Neural-based Embeddings


Quick Links

Getting Started


External Resources

Clone this wiki locally