Kernels
Kernel methods provide sophisticated similarity and distance measures for comparing metabolic hypergraphs. These methods capture different aspects of structural and functional similarity between organisms.
Kernel methods define similarity functions that can be used directly for analysis or converted to distance matrices for downstream applications. Each kernel captures different properties of metabolic networks, from simple overlap measures to complex structural similarities [1].
- Data Preprocessing: Load and preprocess metabolic pathway data for all organisms
- Kernel-Specific Processing: Apply method-specific transformations (histograms, stratification, etc.)
- Similarity Computation: Calculate pairwise similarities using the kernel function
- Distance Conversion: Convert similarities to distances (typically: distance = 1 - similarity)
- Matrix Storage: Save symmetric distance matrices and organism labels
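The last two steps of this workflow can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes similarities lie in [0, 1] and are held in a square NumPy array, and the helper names (`similarity_to_distance`, `save_distance_matrix`), the `TOY` file stem, and the organism labels are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def similarity_to_distance(similarity):
    """Convert a similarity matrix with values in [0, 1] to distances."""
    similarity = np.asarray(similarity, dtype=float)
    distance = 1.0 - similarity
    # Enforce exact symmetry and a zero diagonal against rounding noise.
    distance = (distance + distance.T) / 2.0
    np.fill_diagonal(distance, 0.0)
    return distance


def save_distance_matrix(distance, labels, out_dir, name):
    """Pickle the matrix and its labels (file naming here is hypothetical)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    matrix_path = out_dir / f"{name}_DISTANCE.pkl"
    labels_path = out_dir / f"ORG_{name}_DISTANCE.pkl"
    with open(matrix_path, "wb") as handle:
        pickle.dump(distance, handle)
    with open(labels_path, "wb") as handle:
        pickle.dump(list(labels), handle)
    return matrix_path, labels_path


# Toy similarity matrix for three made-up organisms.
toy_similarity = np.array([[1.0, 0.8, 0.1],
                           [0.8, 1.0, 0.3],
                           [0.1, 0.3, 1.0]])
toy_distance = similarity_to_distance(toy_similarity)

with tempfile.TemporaryDirectory() as tmp:
    matrix_path, labels_path = save_distance_matrix(
        toy_distance, ["org_a", "org_b", "org_c"], tmp, "TOY"
    )
    with open(matrix_path, "rb") as handle:
        reloaded = pickle.load(handle)
```

Keeping the labels in a separate file, in the same order as the matrix rows, lets downstream code map any row or column index back to an organism.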
The following kernel implementations are available:
- HistogramKernel_refactor.py: Implements the histogram cosine kernel (HCK)
- JaccardKernel_refactor.py: Implements the Jaccard similarity kernel (JK)
- EditKernel_refactor.py: Implements the edit distance kernel (EK)
- StratifiedEditKernel_refactor.py: Implements the stratified edit distance kernel (SEK)
```bash
cd EmbeddingsAndKernels/Kernels/

# execute the Jaccard kernel
python JaccardKernel_refactor.py

# execute the Histogram kernel
python HistogramKernel_refactor.py

# execute the Edit kernel
python EditKernel_refactor.py

# execute the Stratified edit kernel
python StratifiedEditKernel_refactor.py
```
- Dataset: ../../data/MetabolicPathways_DATASET_Python.pkl (dataset containing metabolic pathway information for multiple organisms)
- Kernel Matrices:
  - ../../data/distances/JACCARD_DISTANCE.pkl
  - ../../data/distances/HIST_DISTANCE.pkl
  - ../../data/distances/EDIT_DISTANCE.pkl
  - ../../data/distances/STRATEDIT_DISTANCE.pkl
- Organism Labels: ../../data/distances/ORG_*_DISTANCE.pkl

Note: the output subfolder (distances) is created by the script if it does not already exist.

- *DISTANCE.pkl files: Symmetric NumPy arrays containing pairwise distances between organism metabolic networks
- ORG_*_DISTANCE.pkl files: Organism labels corresponding to the rows/columns of the distance matrices
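Once a matrix and its labels are loaded, a distance can be looked up by organism name. The sketch below assumes the labels unpickle to a sequence ordered like the matrix rows; the helper names and the stand-in files written to a temporary folder are hypothetical and only keep the example self-contained.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def load_pickle(path):
    """Read one pickled object from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def distance_between(matrix, labels, org_a, org_b):
    """Look up the distance between two organisms by label."""
    index = {label: position for position, label in enumerate(labels)}
    return float(matrix[index[org_a], index[org_b]])


# Stand-in files; with the real outputs, point the two paths at a
# distance matrix file and its matching organism labels file.
with tempfile.TemporaryDirectory() as tmp:
    matrix_path = Path(tmp) / "matrix.pkl"
    labels_path = Path(tmp) / "labels.pkl"
    with open(matrix_path, "wb") as handle:
        pickle.dump(np.array([[0.0, 0.4], [0.4, 0.0]]), handle)
    with open(labels_path, "wb") as handle:
        pickle.dump(["org_a", "org_b"], handle)

    matrix = load_pickle(matrix_path)
    labels = load_pickle(labels_path)

# The matrix should be square, symmetric, and aligned with the labels.
is_consistent = (
    matrix.shape == (len(labels), len(labels)) and np.allclose(matrix, matrix.T)
)
example = distance_between(matrix, labels, "org_a", "org_b")
```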
Jaccard kernel (JK): Measures similarity as the intersection over union of the hyperedge sets of two hypergraphs.
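The intersection-over-union measure can be illustrated with a short sketch. It assumes each hypergraph is given as a collection of hyperedges, and each hyperedge as a set of node names; the function name, the toy data, and the convention for two empty hypergraphs are choices made for this example, not taken from the repository.

```python
def jaccard_similarity(hyperedges_a, hyperedges_b):
    """Intersection over union of two hyperedge sets."""
    set_a = {frozenset(edge) for edge in hyperedges_a}
    set_b = {frozenset(edge) for edge in hyperedges_b}
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty hypergraphs are identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Toy hypergraphs: each hyperedge is a set of made-up node names.
h1 = [{"a", "b"}, {"b", "c", "d"}, {"e"}]
h2 = [{"a", "b"}, {"c", "d"}, {"e"}]

similarity = jaccard_similarity(h1, h2)  # 2 shared hyperedges out of 4 distinct
distance = 1.0 - similarity
```

Hyperedges are converted to `frozenset` so that they are hashable and node order inside a hyperedge does not matter.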
Histogram cosine kernel (HCK): Compares organisms through the cosine similarity of their normalized hyperedge frequency vectors.
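A vectorised version of this comparison might look like the sketch below. It assumes the frequency vector of an organism counts how often each distinct hyperedge occurs in it, over a vocabulary shared by the whole dataset, and that normalization means scaling each vector to unit Euclidean length; the function name and toy data are hypothetical.

```python
from collections import Counter

import numpy as np


def histogram_cosine_kernel(hypergraphs):
    """Cosine similarity between hyperedge count vectors, all pairs at once."""
    counts = [Counter(frozenset(edge) for edge in graph) for graph in hypergraphs]
    # Shared vocabulary: one column per distinct hyperedge in the dataset.
    vocabulary = {edge: col for col, edge in enumerate(set().union(*counts))}
    histograms = np.zeros((len(hypergraphs), len(vocabulary)))
    for row, counter in enumerate(counts):
        for edge, count in counter.items():
            histograms[row, vocabulary[edge]] = count
    norms = np.linalg.norm(histograms, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # keep all-zero rows (empty hypergraphs) at zero
    unit = histograms / norms
    # One matrix product yields every pairwise cosine similarity.
    return unit @ unit.T


# Toy dataset: the first organism contains the hyperedge {a, b} twice.
g1 = [{"a", "b"}, {"a", "b"}, {"c"}]
g2 = [{"a", "b"}, {"d"}]
g3 = [{"e"}]
kernel = histogram_cosine_kernel([g1, g2, g3])
```

Because the whole kernel matrix comes from a single matrix product, there is no per-pair loop to distribute across processes.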
Edit kernel (EK): Based on the normalized Levenshtein distance between the hyperedge sequences of two hypergraphs.
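The following pure-Python sketch shows one way to compute such a distance. Two points are assumptions made for the example rather than details confirmed by the source: hyperedges are put in a canonical sorted order to form the sequence, and the Levenshtein distance is normalized by the length of the longer sequence. All function names are hypothetical.

```python
def levenshtein(seq_a, seq_b):
    """Classic dynamic-programming edit distance between two sequences."""
    previous = list(range(len(seq_b) + 1))
    for i, item_a in enumerate(seq_a, start=1):
        current = [i]
        for j, item_b in enumerate(seq_b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def as_sequence(hyperedges):
    """Assumed canonical order: sort nodes in each hyperedge, then sort hyperedges."""
    return sorted(tuple(sorted(edge)) for edge in hyperedges)


def normalized_edit_distance(hyperedges_a, hyperedges_b):
    """Levenshtein distance divided by the longer sequence length (assumption)."""
    seq_a, seq_b = as_sequence(hyperedges_a), as_sequence(hyperedges_b)
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 0.0
    return levenshtein(seq_a, seq_b) / longest


# Toy hypergraphs: h2 lacks one of the three hyperedges of h1.
h1 = [{"a", "b"}, {"b", "c", "d"}, {"e"}]
h2 = [{"a", "b"}, {"e"}]
edit_distance = normalized_edit_distance(h1, h2)
```

Each hyperedge acts as a single symbol here, so one missing hyperedge costs one edit operation regardless of how many nodes it contains.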
Stratified edit kernel (SEK): Computes the edit distance separately for each hyperedge order (complexity level).
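A per-order computation could be sketched as below. The source does not state how the per-order distances are combined, so the unweighted mean used here is purely an assumption, as are the normalization by the longer stratum and the function names.

```python
from collections import defaultdict


def levenshtein(seq_a, seq_b):
    """Classic dynamic-programming edit distance between two sequences."""
    previous = list(range(len(seq_b) + 1))
    for i, item_a in enumerate(seq_a, start=1):
        current = [i]
        for j, item_b in enumerate(seq_b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def stratify(hyperedges):
    """Group hyperedges by order (number of nodes); each stratum is sorted."""
    strata = defaultdict(list)
    for edge in hyperedges:
        strata[len(edge)].append(tuple(sorted(edge)))
    return {order: sorted(edges) for order, edges in strata.items()}


def stratified_edit_distance(hyperedges_a, hyperedges_b):
    """Normalized edit distance per order, averaged over orders (assumption)."""
    strata_a, strata_b = stratify(hyperedges_a), stratify(hyperedges_b)
    orders = sorted(set(strata_a) | set(strata_b))
    if not orders:
        return 0.0
    per_order = []
    for order in orders:
        seq_a, seq_b = strata_a.get(order, []), strata_b.get(order, [])
        # Non-zero: the order occurs in at least one of the two hypergraphs.
        longest = max(len(seq_a), len(seq_b))
        per_order.append(levenshtein(seq_a, seq_b) / longest)
    return sum(per_order) / len(per_order)


# Toy hypergraphs with hyperedges of order 1, 2, and 3.
h1 = [{"a", "b"}, {"c", "d"}, {"a", "b", "c"}]
h2 = [{"a", "b"}, {"a", "b", "c"}, {"d", "e", "f"}, {"g"}]
stratified_distance = stratified_edit_distance(h1, h2)
```

In the toy data the per-order distances are 1.0 (order 1), 0.5 (order 2), and 0.5 (order 3), so differences are attributed to the complexity level at which they occur instead of being pooled into a single sequence.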
The Jaccard, Edit, and Stratified Edit kernels rely on parallel processing to speed up computations on large datasets. Each process reads one hypergraph and evaluates the kernel function against all other hypergraphs independently. Because all kernel functions are symmetric, only the upper triangular portion of the distance matrix is computed, which roughly halves the computational load.
The end user can specify the number of processes to use when running the scripts by changing the num_cores variable in each script. By default, it is set to utilize all available CPU cores. The scripts exploit plain multiprocessing to overcome Python's Global Interpreter Lock (GIL) limitations. However, this approach may lead to increased memory consumption as each process maintains its own copy of the data. An improvement might be to use shared memory to store the dataset, allowing all processes to access the same data without duplication.
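The row-wise, upper-triangular scheme can be sketched with the standard `multiprocessing.Pool`. This is an illustration under stated assumptions, not the scripts' actual code: the function names are hypothetical, a Jaccard distance stands in for any symmetric kernel, and `num_cores` mirrors the variable mentioned above but defaults to 1 here so the example also runs serially.

```python
from functools import partial
from multiprocessing import Pool

import numpy as np


def jaccard_distance(edges_a, edges_b):
    """Stand-in symmetric distance between two sets of hyperedges."""
    union = edges_a | edges_b
    return 1.0 - len(edges_a & edges_b) / len(union) if union else 0.0


def upper_row(hypergraphs, i):
    """Distances from hypergraph i to every hypergraph j > i."""
    return [
        jaccard_distance(hypergraphs[i], hypergraphs[j])
        for j in range(i + 1, len(hypergraphs))
    ]


def pairwise_distances(hypergraphs, num_cores=1):
    """Fill the upper triangle row by row, then mirror it."""
    n = len(hypergraphs)
    row_fn = partial(upper_row, hypergraphs)
    if num_cores == 1:
        rows = [row_fn(i) for i in range(n)]
    else:
        # Each worker process receives its own pickled copy of the data.
        with Pool(processes=num_cores) as pool:
            rows = pool.map(row_fn, range(n))
    distances = np.zeros((n, n))
    for i, row in enumerate(rows):
        distances[i, i + 1:] = row
    return distances + distances.T


# Toy dataset: each hypergraph is a frozenset of frozenset hyperedges.
toy = [
    frozenset({frozenset("ab"), frozenset("c")}),
    frozenset({frozenset("ab"), frozenset("d")}),
    frozenset({frozenset("e")}),
]
serial = pairwise_distances(toy, num_cores=1)
```

When calling `pairwise_distances` with `num_cores` greater than 1 from a script, place the call under an `if __name__ == "__main__":` guard, which `multiprocessing` requires on platforms that spawn new interpreter processes. Rows near the top of the matrix are longer than rows near the bottom, so the workload per task is uneven.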
The Histogram kernel is fully vectorised and does not benefit from parallel processing.
For the Edit and Stratified Edit kernels, it is strongly recommended to compile the core script levenshteinDistance.pyx using Cython. This will significantly speed up the computation of the Levenshtein distances. To compile the script, create a file named setup.py with the following content:
```python
# setuptools replaces distutils, which was removed in Python 3.12.
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("levenshteinDistance.pyx", annotate=True, compiler_directives={"language_level": "3"})
)
```

Then, run the following command in the terminal:

```bash
python setup.py build_ext --inplace
```

This translates the .pyx script into a C/C++ file and then compiles that file into an extension module (a .so file on macOS/Linux, a .pyd file on Windows) that can be imported in Python.
If you do not compile the Cython code, just rename the file to levenshteinDistance.py and the scripts will still run using a pure Python implementation, but the computation will be significantly slower, especially for large datasets.
[1] Martino, A., & Rizzi, A. (2020). (Hyper)graph kernels over simplicial complexes. Entropy, 22(10), 1155. DOI: 10.3390/e22101155
- Autoencoders: Neural autoencoder approaches for hypergraph representation learning
- Bag of Nodes: Node-based embedding methods
- Bag of Hyperedges: Frequency-based embeddings
- Graph2Vec: Graph neural network embedding techniques
HypergraphEmbedding4MetabolicNetworks • Comparing the ability of embedding methods on metabolic hypergraphs for capturing taxonomy-based features
📖 bioRxiv Preprint • 💻 Source Code • Under review at Algorithms for Molecular Biology
© 2025 M. Cervellini, B. Sinaimeri, C. Matias, A. Martino • Licensed under GPL-3