This preprocessing pipeline collects Biolink schema predicate descriptions for the predicate mapping service. It generates cleaned predicate mappings, negations, and embeddings for similarity-based predicate matching.
1. Collect Predicate Text: Gathers Biolink schema predicate descriptions and qualified predicate mappings.

   collect_predicate_text.py [-m mappings_file -q qualified_mappings]

2. Generate Negations: Uses an LLM to produce natural negated versions of each descriptor and saves results to negations_file.

   get_negations.py [-m mappings_file -n negations_file]

3. Merge and Clean Mappings: Removes invalid responses and empty strings, merges the mapping and negation files, then outputs all_mappings_file.

   clean_mappings.py [-m mappings_file -n negations_file -a all_mappings_file]

4. Embed Predicates for Similarity Search: Generates embeddings for all Biolink predicates with the default embedding dimension of 768.

   embed_biolink_mappings.py [-m mappings_file -e embeddings_file --lowercase]

Before running the pipeline, configure the environment:
export LLM_API_URL=http://localhost:11434/api/generate
export CHAT_MODEL=alibayram/medgemma:latest
export MODEL_TEMPERATURE=0.5
export EMBEDDING_URL=http://localhost:11434/api/embeddings
export EMBEDDING_MODEL=nomic-embed-text
export USE_LOCAL=true
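A minimal sketch of how the scripts might read this configuration; the `load_llm_config` helper is hypothetical (not part of the pipeline), but the variable names and defaults mirror the exports above:

```python
import os

def load_llm_config():
    """Read the pipeline's LLM settings from the environment,
    falling back to the defaults shown in the exports above."""
    return {
        "llm_api_url": os.environ.get("LLM_API_URL", "http://localhost:11434/api/generate"),
        "chat_model": os.environ.get("CHAT_MODEL", "alibayram/medgemma:latest"),
        "temperature": float(os.environ.get("MODEL_TEMPERATURE", "0.5")),
        "embedding_url": os.environ.get("EMBEDDING_URL", "http://localhost:11434/api/embeddings"),
        "embedding_model": os.environ.get("EMBEDDING_MODEL", "nomic-embed-text"),
        "use_local": os.environ.get("USE_LOCAL", "true").lower() == "true",
    }
```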
After preprocessing, the directory structure should look like this:
data/
├── short_description.json # Some predicate descriptions
├── all_biolink_mapped_vectors.json # Embeddings for predicates
└── qualified_predicate_mappings.json # Qualifier mappings
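The embeddings file enables the similarity matching mentioned above. A sketch of a nearest-predicate lookup, assuming `all_biolink_mapped_vectors.json` maps predicate names to vectors (the exact file layout may differ):

```python
import json
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_predicate(query_vec, vectors):
    """Return the predicate whose stored embedding is most similar
    to query_vec. `vectors` is assumed to be a dict of
    predicate name -> embedding list."""
    return max(vectors, key=lambda p: cosine(query_vec, vectors[p]))

# Example (assumed file layout):
# vectors = json.load(open("data/all_biolink_mapped_vectors.json"))
# best_predicate(embed("inhibits"), vectors)
```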
- Embedding Dimension: Default is 768 (nomic-embed-text)
- Batch Size: Embeddings are processed in batches of 25 relationships
- Version Compatibility: Ensure mapping and negation files are generated with the same ontology version
- Local LLM: Uses Ollama by default for local processing
- Cost Optimization: Negations are generated once and reused to reduce LLM API calls
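The batching and embedding behavior described in the notes can be sketched as follows. This is not the pipeline's actual implementation; `chunked` and `embed_batch` are illustrative helpers, and the request payload follows Ollama's `/api/embeddings` endpoint, matching the environment configuration above:

```python
import os
import json
from urllib import request

BATCH_SIZE = 25  # matches the pipeline's batch size of 25 relationships

def chunked(items, size=BATCH_SIZE):
    """Yield successive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_batch(texts):
    """Request embeddings for one batch of texts via a local Ollama
    server (one request per text; Ollama's /api/embeddings takes a
    single prompt and returns an 'embedding' list)."""
    url = os.environ.get("EMBEDDING_URL", "http://localhost:11434/api/embeddings")
    model = os.environ.get("EMBEDDING_MODEL", "nomic-embed-text")
    vectors = []
    for text in texts:
        payload = json.dumps({"model": model, "prompt": text}).encode()
        req = request.Request(url, data=payload,
                              headers={"Content-Type": "application/json"})
        with request.urlopen(req) as resp:
            vectors.append(json.load(resp)["embedding"])
    return vectors
```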