A repository for "RECoDe - Relation Extraction for Diet, Non-Communicable Disease and Biomarker Associations: A CoDiet study" https://www.biorxiv.org/content/10.64898/2026.03.03.709244v1
```shell
# Create and activate conda environment
conda create -n recode python=3.11 -y
conda activate recode

# Install dependencies and the project
pip install .
```

We support OpenAI-compatible clients to run our pipeline. You can use the OpenAI API or serve your own local models behind an OpenAI-compatible endpoint (e.g., gpt-oss-20b). For example, see: vllm article.
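Concretely, any OpenAI-compatible server speaks the same `/v1/chat/completions` contract, so switching between the OpenAI API and a local server is only a matter of `--base_url` and `--model_name`. As a rough illustration (the prompt below is invented for this sketch; the real prompt lives in the scripts):

```python
def build_chat_request(model: str, sentence: str, e1: str, e2: str) -> dict:
    """Minimal /v1/chat/completions payload; the prompt here is
    illustrative, not the one scripts/exp/run.py actually sends."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Classify the relation between two entities in a sentence."},
            {"role": "user",
             "content": f"Sentence: {sentence}\nEntities: '{e1}' and '{e2}'."},
        ],
        "temperature": 0.0,
    }

# POST this JSON to <base_url>/chat/completions,
# e.g. http://localhost:8010/v1/chat/completions for a local server.
payload = build_chat_request(
    "openai/gpt-oss-20b",
    "Fiber intake was inversely associated with type 2 diabetes risk.",
    "fiber", "type 2 diabetes",
)
```

Any OpenAI SDK client constructed with a custom `base_url` sends this same request shape.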
Run relation extraction on a dataset split:
```shell
python scripts/exp/run.py \
--data_path ./data/annotation \
--split test \
--base_url http://localhost:8010/v1 \
--model_name openai/gpt-oss-20b \
--output_dir ./results
```

This produces two TSV files in `--output_dir`:

- `{split}_{model}_result.tsv` – predictions only (`type` column), used for evaluation
- `{split}_{model}_full.tsv` – gold + predictions, for reference
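For quick sanity checks, the result file is plain TSV and can be inspected with the standard library. A small sketch, assuming only the `type` column mentioned above (the example labels are illustrative):

```python
import csv
from collections import Counter

def label_counts(path: str) -> Counter:
    """Tally predicted relation labels from a *_result.tsv file.
    Assumes a header row containing a 'type' column, as noted above."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return Counter(row["type"] for row in reader)
```

Calling `label_counts("./results/test_model_result.tsv")` returns a `Counter` mapping each predicted label to its frequency.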
Evaluate predictions against a gold JSONL file:
```shell
python scripts/exp/evaluate.py \
--gold ./data/annotation/test.jsonl \
--pred ./results/test_model_result.tsv

# Or evaluate all TSV files in a directory:
python scripts/exp/evaluate.py \
--gold ./data/annotation/test.jsonl \
--pred_dir ./results/
```

This computes:

- Multiclass: accuracy; micro/macro/weighted precision, recall, and F1; confusion matrix
- Binary (association vs NoAssociation): accuracy and binary/micro/macro/weighted metrics
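The binary evaluation collapses every label other than NoAssociation into a single association class before scoring. A minimal sketch of that collapse (the multiclass label names here are illustrative):

```python
def to_binary(label: str) -> str:
    """Collapse multiclass relation labels into association vs NoAssociation."""
    return "NoAssociation" if label == "NoAssociation" else "association"

def accuracy(gold: list[str], pred: list[str]) -> float:
    assert len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

gold = ["direct", "inverse", "NoAssociation", "direct"]
pred = ["direct", "NoAssociation", "NoAssociation", "inverse"]

multi = accuracy(gold, pred)                       # exact-label accuracy
binary = accuracy([to_binary(g) for g in gold],
                  [to_binary(p) for p in pred])    # association-vs-not accuracy
```

Here the binary score (0.75) exceeds the multiclass score (0.5) because confusing `direct` with `inverse` still counts as detecting an association.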
The unified pipeline script `scripts/pipeline.py` runs the full workflow:
Extract relation candidate pairs from BioC JSON files (with NER annotations).
```shell
python scripts/pipeline.py candidate \
--input_dir ./data/extraction/input \
--output ./output/candidates.csv
```

Filter candidate pairs by entity type combinations.
```shell
python scripts/pipeline.py filter \
--input ./output/candidates.csv \
--output ./output/filtered.csv \
--entity_type_filters default
```

Available filters (comma-separated):
| Filter | Entity pairs |
|---|---|
| `food_disease` | foodRelated → diseasePhenotype |
| `food_bio` | foodRelated → geneSNP/proteinEnzyme/metabolite/microbiome |
| `disease_bio` | diseasePhenotype → geneSNP/proteinEnzyme/metabolite/microbiome |
| `food_food` | foodRelated → foodRelated |
| `bio_cross` | bio type → different bio type |
| `bio_self` | bio type → same bio type |
| `default` | all of the above |
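In other words, a candidate pair survives filtering when its (head, tail) entity types match one of the selected combinations. A sketch of that predicate derived from the table above (this is our reading of the table, not the pipeline's actual code):

```python
BIO_TYPES = {"geneSNP", "proteinEnzyme", "metabolite", "microbiome"}

def pair_matches(filter_name: str, t1: str, t2: str) -> bool:
    """Return True if entity types (t1, t2) satisfy the named filter."""
    if filter_name == "food_disease":
        return t1 == "foodRelated" and t2 == "diseasePhenotype"
    if filter_name == "food_bio":
        return t1 == "foodRelated" and t2 in BIO_TYPES
    if filter_name == "disease_bio":
        return t1 == "diseasePhenotype" and t2 in BIO_TYPES
    if filter_name == "food_food":
        return t1 == "foodRelated" and t2 == "foodRelated"
    if filter_name == "bio_cross":
        return t1 in BIO_TYPES and t2 in BIO_TYPES and t1 != t2
    if filter_name == "bio_self":
        return t1 in BIO_TYPES and t1 == t2
    if filter_name == "default":  # union of all of the above
        return any(pair_matches(f, t1, t2) for f in
                   ("food_disease", "food_bio", "disease_bio",
                    "food_food", "bio_cross", "bio_self"))
    raise ValueError(f"unknown filter: {filter_name}")
```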
Predict relation types for each candidate pair using an OpenAI-compatible LLM API.
```shell
python scripts/pipeline.py inference \
--input ./output/filtered.csv \
--output ./output/inference.csv \
--base_url http://localhost:8010/v1 \
--model_name openai/gpt-oss-20b

# For testing without an LLM (random predictions):
python scripts/pipeline.py inference \
--input ./output/filtered.csv \
--output ./output/inference.csv \
--dummy
```

Build the Corpus-level Concept Summary (CoCoS) knowledge graph from inference results.
This step includes:

Within each document (intra-doc):

- Abbreviation expansion
- UK/US English normalization
- Entity normalization (cluster mentions sharing annotation IDs, pick representative text by frequency)
- Self-relation removal (token overlap)
- Hierarchical voting per entity pair → one relation label per (document, e1, e2)

Across documents (inter-doc):

- Aggregate document-level relations per entity pair
- Compute two scores per pair:
  - Association Support (AS): `AS = N_assoc / (N_assoc + N_no)` – evidence for an association in general
  - Effect Estimate (EE): `EE = (N_pos - N_neg) / N_assoc` – direction of the association (+1 = direct, -1 = inverse)
- Build the knowledge graph (nodes with entity type/color/doc_cnt, edges with as_score/ee_score/doc_count)
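The two inter-doc scores follow directly from per-pair vote counts. A minimal sketch of the AS and EE formulas, assuming every association vote is either direct (N_pos) or inverse (N_neg), so that N_assoc = N_pos + N_neg (the function name and signature are ours, not the pipeline's):

```python
def pair_scores(n_pos: int, n_neg: int, n_no: int) -> tuple[float, float]:
    """Association Support and Effect Estimate for one entity pair.

    n_pos / n_neg: documents voting a direct / inverse association;
    n_no: documents voting NoAssociation.
    """
    n_assoc = n_pos + n_neg
    as_score = n_assoc / (n_assoc + n_no) if (n_assoc + n_no) else 0.0
    ee_score = (n_pos - n_neg) / n_assoc if n_assoc else 0.0
    return as_score, ee_score

# 6 documents report a direct association, 2 inverse, 2 no association:
as_s, ee_s = pair_scores(6, 2, 2)   # AS = 8/10 = 0.8, EE = (6-2)/8 = 0.5
```

A pair that all documents agree on (`pair_scores(5, 0, 0)`) yields AS = 1.0 and EE = +1 (fully direct); a pure inverse association yields EE = -1.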
```shell
python scripts/pipeline.py cocos \
--input ./output/inference.csv \
--input_dir ./data/extraction/input \
--output_dir ./output/cocos \
--eng_us_path ./data/extraction/resources/eng_us_uk.txt
```

Output files:

- `recode_cocos.graphml` – NetworkX graph (nodes: entity type, color, doc_cnt; edges: as_score, ee_score, counts)
- `recode_cocos.csv` – aggregated pair scores (as_score, ee_score, doc_count, relation type counts)
- `processed_relations.csv` – all relations after normalization
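The GraphML output can be explored with NetworkX (`nx.read_graphml`). A sketch that ranks edges by `as_score`, shown here on a toy graph carrying the edge attributes listed above (the entity names and scores are invented):

```python
import networkx as nx

def top_edges(g: nx.Graph, k: int = 3):
    """Edges ranked by as_score (strongest association support first)."""
    ranked = sorted(g.edges(data=True),
                    key=lambda e: e[2].get("as_score", 0.0), reverse=True)
    return [(u, v, d["as_score"], d.get("ee_score")) for u, v, d in ranked[:k]]

# Toy graph; for the real output use:
# g = nx.read_graphml("./output/cocos/recode_cocos.graphml")
g = nx.Graph()
g.add_edge("dietary fiber", "type 2 diabetes",
           as_score=0.9, ee_score=-0.8, doc_count=12)
g.add_edge("red meat", "colorectal cancer",
           as_score=0.7, ee_score=0.6, doc_count=8)

strongest = top_edges(g, 1)[0]  # ('dietary fiber', 'type 2 diabetes', 0.9, -0.8)
```

A high `as_score` with a negative `ee_score`, as in the toy edge above, reads as "well-supported inverse association".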
Run the full pipeline end-to-end:

```shell
python scripts/pipeline.py all \
--input_dir ./data/extraction/input \
--output_dir ./output \
--base_url http://localhost:8010/v1 \
--model_name openai/gpt-oss-20b \
--eng_us_path ./data/extraction/resources/eng_us_uk.txt \
--entity_type_filters default

# Full pipeline with dummy inference (for testing):
python scripts/pipeline.py all \
--input_dir ./data/extraction/input \
--output_dir ./output \
--entity_type_filters food_disease \
--dummy
```

To run on your own data:

- Prepare your data in BioC JSON format. See this input example.
  - NER annotations must be included. See the CoDiet Corpus paper.
  - Example: the CoDiet Electrum Corpus
- Generate candidate relation pairs:

  ```shell
  python scripts/pipeline.py candidate \
  --input_dir ./data/extraction/input \
  --output ./output/candidates.csv
  ```

  - Example output: example_PMC7271748.tsv
- Filter, run inference, and build CoCoS (see above).