Kernels

Kernel methods provide sophisticated similarity and distance measures for comparing metabolic hypergraphs. These methods capture different aspects of structural and functional similarity between organisms.

Overview

Kernel methods define similarity functions that can be used directly for analysis or converted to distance matrices for downstream applications. Each kernel captures different properties of metabolic networks, from simple overlap measures to complex structural similarities [1].

Method Description

Data Preprocessing: Load and preprocess metabolic pathway data for all organisms
Kernel-Specific Processing: Apply method-specific transformations (histograms, stratification, etc...)
Similarity Computation: Calculate pairwise similarities using the kernel function
Distance Conversion: Convert similarities to distances (typically: distance = 1 - similarity)
Matrix Storage: Save symmetric distance matrices and organism labels

Usage

Available Scripts

The following kernel implementations are available:

HistogramKernel_refactor.py
- Implements the histogram cosine kernel (HCK)
JaccardKernel_refactor.py
- Implements the Jaccard similarity kernel (JK)
EditKernel_refactor.py
- Implements the edit distance kernel (EK)
StratifiedEditKernel_refactor.py
- Implements the stratified edit distance kernel (SEK)

How to Run

cd EmbeddingsAndKernels/Kernels/

# execute the Jaccard kernel
python JaccardKernel_refactor.py

# execute the Histogram kernel  
python HistogramKernel_refactor.py

# execute the Edit kernel
python EditKernel_refactor.py

# execute the Stratified edit kernel
python StratifiedEditKernel_refactor.py

Input

Dataset: ../../data/MetabolicPathways_DATASET_Python.pkl
- Dataset containing metabolic pathway information for multiple organisms

Output

Kernel Matrices:
- ../../data/distances/JACCARD_DISTANCE.pkl
- ../../data/distances/HIST_DISTANCE.pkl
- ../../data/distances/EDIT_DISTANCE.pkl
- ../../data/distances/STRATEDIT_DISTANCE.pkl
Organism Labels: ../../data/distances/ORG_*_DISTANCE.pkl

Note: the output subfolders (distances), if not already present, will be created by the script.

Details

Description of the output files:

*DISTANCE.pkl files: Symmetric NumPy arrays containing pairwise distances between organism metabolic networks;
ORG_*_DISTANCE.pkl files: Files containing organism labels corresponding to the rows/columns in the distance matrices.

Kernels

Jaccard Kernel

Measures similarity based on intersection over union of hyperedge sets:

$$J(G_i, G_j) = \frac{\lvert H_i \cap H_j \rvert}{\lvert H_i \cup H_j \rvert}$$

where $H_i$ and $H_j$ are the sets of hyperedges for organisms $i$ and $j$.

Histogram Kernel

Compares organisms based on normalized hyperedge frequency vectors:

$$K_{hist}(G_i, G_j) = \frac{\mathbf{h_i} \cdot \mathbf{h_j}}{\lVert \mathbf{h_i} \rVert \cdot \lVert \mathbf{h_j} \rVert}$$

where $\mathbf{h_i}$ and $\mathbf{h_j}$ are the normalized frequency histograms of hyperedges.

Edit Kernel

Based on normalized Levenshtein distance between hyperedge sequences:

$$K_{edit}(G_i, G_j) = 1 - \frac{2 \cdot d_{Lev}(S_i, S_j)}{\lvert S_i \rvert + \lvert S_j \rvert + d_{Lev}(S_i, S_j)}$$

where $d_{Lev}(S_i, S_j)$ is the Levenshtein distance between sequences $S_i$ and $S_j$, and $|S_i|$ denotes sequence length.

Stratified Edit Kernel

Computes edit distance separately for each hyperedge order (complexity level):

$$K_{strat}(G_i, G_j) = \frac{1}{\lvert O \rvert} \sum_{o \in O} \left(1 - \frac{2 \cdot d_{Lev}(S_i^o, S_j^o)}{\lvert S_i^o \rvert + \lvert S_j^o \rvert + d_{Lev}(S_i^o, S_j^o)}\right)$$

where $O$ is the set of hyperedge orders, and $S_i^o$ represents the sequence of hyperedges of order $o$ for organism $i$.

On Parallel Processing

The following kernels:

Jaccard
Edit
Stratified Edit

rely on parallel processing to speed up computations on large datasets. Each process is designed to read an hypergraph, and evaluate the kernel function with respect to all other hypergraphs independently. Since all kernel functions are also symmetric, only the upper triangular portion of the distance matrix is computed, reducing computational load.

The end user can specify the number of processes to use when running the scripts by changing the num_cores variable in each script. By default, it is set to utilize all available CPU cores. The scripts exploit plain multiprocessing to overcome Python's Global Interpreter Lock (GIL) limitations. However, this approach may lead to increased memory consumption as each process maintains its own copy of the data. An improvement might be to use shared memory to store the dataset, allowing all processes to access the same data without duplication.

The Histogram kernel is fully vectorised and does not benefit from parallel processing.

On the Levenshtein Distance Computation

For the Edit and Stratified Edit kernels, it is strongly recommended to compile the core script levenshteinDistance.pyx using Cython. This will significantly speed up the computation of the Levenshtein distances. To compile the script, create a file named setup.py with the following content:

from distutils.core import setup
from Cython.Build import cythonize
setup(
    ext_modules = cythonize("levenshteinDistance.pyx", annotate=True, compiler_directives={'language_level' : "3"})
)

Then, run the following command in the terminal:

python setup.py build_ext --inplace

This will compile the .pyx script into a C/C++ file and then compiles the C/C++ file into an extension module (a .so or .pyd file on macOS/Linux and Windows, respectively) that can be imported in Python.

If you do not compile the Cython code, just rename the file to levenshteinDistance.py and the scripts will still run using a pure Python implementation, but the computation will be significantly slower, especially for large datasets.

References

[1] Martino, A., & Rizzi, A. (2020). (Hyper)graph kernels over simplicial complexes. Entropy, 22(10), 1155. DOI: 10.3390/e22101155

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kernels

Overview

Method Description

Usage

Available Scripts

How to Run

Input

Output

Details

Description of the output files:

Kernels

Jaccard Kernel

Histogram Kernel

Edit Kernel

Stratified Edit Kernel

On Parallel Processing

On the Levenshtein Distance Computation

References

Jump to

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Navigation

Quick Links

External Resources

Clone this wiki locally