A repository for "RECoDe - Relation Extraction for Diet, Non-Communicable Disease and Biomarker Associations: A CoDiet study" https://www.biorxiv.org/content/10.64898/2026.03.03.709244v1
```shell
# Create and activate conda environment
conda create -n recode python=3.11 -y
conda activate recode

# Install dependencies and the project
pip install .
```

We support OpenAI-compatible clients to run our pipeline. You can use the OpenAI API or serve your own local models behind an OpenAI-compatible endpoint (e.g., gpt-oss-20b). For example, see: vllm article.
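Concretely, any OpenAI-compatible server speaks the same `/v1/chat/completions` contract, so switching between the OpenAI API and a local server is only a matter of `--base_url` and `--model_name`. As a rough illustration (the prompt below is invented for this sketch; the real prompt lives in the scripts):

```python
def build_chat_request(model: str, sentence: str, e1: str, e2: str) -> dict:
    """Minimal /v1/chat/completions payload; the prompt here is
    illustrative, not the one scripts/exp/run.py actually sends."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Classify the relation between two entities in a sentence."},
            {"role": "user",
             "content": f"Sentence: {sentence}\nEntities: '{e1}' and '{e2}'."},
        ],
        "temperature": 0.0,
    }

# POST this JSON to <base_url>/chat/completions,
# e.g. http://localhost:8010/v1/chat/completions for a local server.
payload = build_chat_request(
    "openai/gpt-oss-20b",
    "Fiber intake was inversely associated with type 2 diabetes risk.",
    "fiber", "type 2 diabetes",
)
```

Any OpenAI SDK client constructed with a custom `base_url` sends this same request shape.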
Run relation extraction on a dataset split:
```shell
python scripts/exp/run.py \
--data_path ./data/annotation \
--split test \
--base_url http://localhost:8010/v1 \
--model_name openai/gpt-oss-20b \
--output_dir ./results
```

This produces two TSV files in `--output_dir`:

- `{split}_{model}_result.tsv` – predictions only (`type` column), used for evaluation
- `{split}_{model}_full.tsv` – gold + predictions, for reference
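For quick sanity checks, the result file is plain TSV and can be inspected with the standard library. A small sketch, assuming only the `type` column mentioned above (the example labels are illustrative):

```python
import csv
from collections import Counter

def label_counts(path: str) -> Counter:
    """Tally predicted relation labels from a *_result.tsv file.
    Assumes a header row containing a 'type' column, as noted above."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return Counter(row["type"] for row in reader)
```

Calling `label_counts("./results/test_model_result.tsv")` returns a `Counter` mapping each predicted label to its frequency.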
Evaluate predictions against a gold JSONL file:
```shell
python scripts/exp/evaluate.py \
--gold ./data/annotation/test.jsonl \
--pred ./results/test_model_result.tsv

# Or evaluate all TSV files in a directory:
python scripts/exp/evaluate.py \
--gold ./data/annotation/test.jsonl \
--pred_dir ./results/
```

This computes:

- Multiclass: accuracy; micro/macro/weighted precision, recall, and F1; confusion matrix
- Binary (association vs NoAssociation): accuracy and binary/micro/macro/weighted metrics
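The binary evaluation collapses every label other than NoAssociation into a single association class before scoring. A minimal sketch of that collapse (the multiclass label names here are illustrative):

```python
def to_binary(label: str) -> str:
    """Collapse multiclass relation labels into association vs NoAssociation."""
    return "NoAssociation" if label == "NoAssociation" else "association"

def accuracy(gold: list[str], pred: list[str]) -> float:
    assert len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

gold = ["direct", "inverse", "NoAssociation", "direct"]
pred = ["direct", "NoAssociation", "NoAssociation", "inverse"]

multi = accuracy(gold, pred)                       # exact-label accuracy
binary = accuracy([to_binary(g) for g in gold],
                  [to_binary(p) for p in pred])    # association-vs-not accuracy
```

Here the binary score (0.75) exceeds the multiclass score (0.5) because confusing `direct` with `inverse` still counts as detecting an association.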
The unified pipeline script `scripts/pipeline.py` runs the full workflow:
Extract relation candidate pairs from BioC JSON files (with NER annotations).
```shell
python scripts/pipeline.py candidate \
--input_dir ./data/extraction/input \
--output ./output/candidates.csv
```

Filter candidate pairs by entity type combinations.
```shell
python scripts/pipeline.py filter \
--input ./output/candidates.csv \
--output ./output/filtered.csv \
--entity_type_filters default
```

Available filters (comma-separated):
| Filter | Entity pairs |
|---|---|
| `food_disease` | foodRelated → diseasePhenotype |
| `food_bio` | foodRelated → geneSNP/proteinEnzyme/metabolite/microbiome |
| `disease_bio` | diseasePhenotype → geneSNP/proteinEnzyme/metabolite/microbiome |
| `food_food` | foodRelated → foodRelated |
| `bio_cross` | bio type → different bio type |
| `bio_self` | bio type → same bio type |
| `default` | all of the above |
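In other words, a candidate pair survives filtering when its (head, tail) entity types match one of the selected combinations. A sketch of that predicate derived from the table above (this is our reading of the table, not the pipeline's actual code):

```python
BIO_TYPES = {"geneSNP", "proteinEnzyme", "metabolite", "microbiome"}

def pair_matches(filter_name: str, t1: str, t2: str) -> bool:
    """Return True if entity types (t1, t2) satisfy the named filter."""
    if filter_name == "food_disease":
        return t1 == "foodRelated" and t2 == "diseasePhenotype"
    if filter_name == "food_bio":
        return t1 == "foodRelated" and t2 in BIO_TYPES
    if filter_name == "disease_bio":
        return t1 == "diseasePhenotype" and t2 in BIO_TYPES
    if filter_name == "food_food":
        return t1 == "foodRelated" and t2 == "foodRelated"
    if filter_name == "bio_cross":
        return t1 in BIO_TYPES and t2 in BIO_TYPES and t1 != t2
    if filter_name == "bio_self":
        return t1 in BIO_TYPES and t1 == t2
    if filter_name == "default":  # union of all of the above
        return any(pair_matches(f, t1, t2) for f in
                   ("food_disease", "food_bio", "disease_bio",
                    "food_food", "bio_cross", "bio_self"))
    raise ValueError(f"unknown filter: {filter_name}")
```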
Predict relation types for each candidate pair using an OpenAI-compatible LLM API.
```shell
python scripts/pipeline.py inference \
--input ./output/filtered.csv \
--output ./output/inference.csv \
--base_url http://localhost:8010/v1 \
--model_name openai/gpt-oss-20b

# For testing without an LLM (random predictions):
python scripts/pipeline.py inference \
--input ./output/filtered.csv \
--output ./output/inference.csv \
--dummy
```

Build the Corpus-level Concept Summary (CoCoS) knowledge graph from inference results.
This step includes:

Within each document (intra-doc):

- Abbreviation expansion
- UK/US English normalization
- Entity normalization (cluster mentions sharing annotation IDs, pick representative text by frequency)
- Self-relation removal (token overlap)
- Hierarchical voting per entity pair → one relation label per (document, e1, e2)

Across documents (inter-doc):

- Aggregate document-level relations per entity pair
- Compute two scores per pair:
  - Association Support (AS): `AS = N_assoc / (N_assoc + N_no)` – evidence for an association in general
  - Effect Estimate (EE): `EE = (N_pos - N_neg) / N_assoc` – direction of the association (+1 = direct, -1 = inverse)
- Build the knowledge graph (nodes with entity type/color/doc_cnt, edges with as_score/ee_score/doc_count)
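The two inter-doc scores follow directly from per-pair vote counts. A minimal sketch of the AS and EE formulas, assuming every association vote is either direct (N_pos) or inverse (N_neg), so that N_assoc = N_pos + N_neg (the function name and signature are ours, not the pipeline's):

```python
def pair_scores(n_pos: int, n_neg: int, n_no: int) -> tuple[float, float]:
    """Association Support and Effect Estimate for one entity pair.

    n_pos / n_neg: documents voting a direct / inverse association;
    n_no: documents voting NoAssociation.
    """
    n_assoc = n_pos + n_neg
    as_score = n_assoc / (n_assoc + n_no) if (n_assoc + n_no) else 0.0
    ee_score = (n_pos - n_neg) / n_assoc if n_assoc else 0.0
    return as_score, ee_score

# 6 documents report a direct association, 2 inverse, 2 no association:
as_s, ee_s = pair_scores(6, 2, 2)   # AS = 8/10 = 0.8, EE = (6-2)/8 = 0.5
```

A pair that all documents agree on (`pair_scores(5, 0, 0)`) yields AS = 1.0 and EE = +1 (fully direct); a pure inverse association yields EE = -1.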
```shell
python scripts/pipeline.py cocos \
--input ./output/inference.csv \
--input_dir ./data/extraction/input \
--output_dir ./output/cocos \
--eng_us_path ./data/extraction/resources/eng_us_uk.txt
```

Output files:

- `recode_cocos.graphml` – NetworkX graph (nodes: entity type, color, doc_cnt; edges: as_score, ee_score, counts)
- `recode_cocos.csv` – aggregated pair scores (as_score, ee_score, doc_count, relation type counts)
- `processed_relations.csv` – all relations after normalization
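The GraphML output can be explored with NetworkX (`nx.read_graphml`). A sketch that ranks edges by `as_score`, shown here on a toy graph carrying the edge attributes listed above (the entity names and scores are invented):

```python
import networkx as nx

def top_edges(g: nx.Graph, k: int = 3):
    """Edges ranked by as_score (strongest association support first)."""
    ranked = sorted(g.edges(data=True),
                    key=lambda e: e[2].get("as_score", 0.0), reverse=True)
    return [(u, v, d["as_score"], d.get("ee_score")) for u, v, d in ranked[:k]]

# Toy graph; for the real output use:
# g = nx.read_graphml("./output/cocos/recode_cocos.graphml")
g = nx.Graph()
g.add_edge("dietary fiber", "type 2 diabetes",
           as_score=0.9, ee_score=-0.8, doc_count=12)
g.add_edge("red meat", "colorectal cancer",
           as_score=0.7, ee_score=0.6, doc_count=8)

strongest = top_edges(g, 1)[0]  # ('dietary fiber', 'type 2 diabetes', 0.9, -0.8)
```

A high `as_score` with a negative `ee_score`, as in the toy edge above, reads as "well-supported inverse association".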
Run the full pipeline end-to-end:

```shell
python scripts/pipeline.py all \
--input_dir ./data/extraction/input \
--output_dir ./output \
--base_url http://localhost:8010/v1 \
--model_name openai/gpt-oss-20b \
--eng_us_path ./data/extraction/resources/eng_us_uk.txt \
--entity_type_filters default

# Full pipeline with dummy inference (for testing):
python scripts/pipeline.py all \
--input_dir ./data/extraction/input \
--output_dir ./output \
--entity_type_filters food_disease \
--dummy
```

To run on your own data:

- Prepare your data in BioC JSON format. See this input example.
  - NER annotations must be included. See the CoDiet Corpus paper.
  - Example: the CoDiet Electrum Corpus
- Generate candidate relation pairs:

  ```shell
  python scripts/pipeline.py candidate \
  --input_dir ./data/extraction/input \
  --output ./output/candidates.csv
  ```

  - Example output: example_PMC7271748.tsv
- Filter, run inference, and build CoCoS (see above).