RECoDe

A repository for "RECoDe - Relation Extraction for Diet, Non-Communicable Disease and Biomarker Associations: A CoDiet study", part of the CoDiet project.

Preprint: https://www.biorxiv.org/content/10.64898/2026.03.03.709244v1 (DOI: 10.64898/2026.03.03.709244)
Archive: DOI 10.5281/zenodo.19050553 (Zenodo) · Codabench

Prerequisites

# Create and activate conda environment
conda create -n recode python=3.11 -y
conda activate recode

# Install dependencies and the project
pip install .

Running the RE Evaluation with LLMs

We support any OpenAI-compatible client for running our pipeline. You can use the OpenAI API or serve your own local model (e.g., gpt-oss-20b) behind an OpenAI-compatible server such as vLLM; see the vLLM documentation for an example.
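
A minimal sketch of pointing an OpenAI-compatible client at such a server (the base URL, port, and model name mirror the examples below; adjust to your setup):

from openai import OpenAI

# Any OpenAI-compatible endpoint works here, e.g. a local vLLM server.
client = OpenAI(base_url="http://localhost:8010/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)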

Inference

Run relation extraction on a dataset split:

python scripts/exp/run.py \
    --data_path ./data/annotation \
    --split test \
    --base_url http://localhost:8010/v1 \
    --model_name openai/gpt-oss-20b \
    --output_dir ./results

This produces two TSV files in --output_dir:

  • {split}_{model}_result.tsv — predictions only (type column), used for evaluation
  • {split}_{model}_full.tsv — gold + predictions for reference
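
As a quick sanity check, the predictions file can be inspected with pandas (a sketch; only the type column is described above, other columns may vary):

import pandas as pd

# Load the predictions-only TSV and look at the label distribution.
preds = pd.read_csv("./results/test_model_result.tsv", sep="\t")
print(preds["type"].value_counts())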

Evaluation

Evaluate predictions against a gold JSONL file:

python scripts/exp/evaluate.py \
    --gold ./data/annotation/test.jsonl \
    --pred ./results/test_model_result.tsv

# Or evaluate all TSV files in a directory:
python scripts/exp/evaluate.py \
    --gold ./data/annotation/test.jsonl \
    --pred_dir ./results/

This computes:

  • Multiclass: accuracy, micro/macro/weighted precision/recall/F1, confusion matrix
  • Binary (association vs NoAssociation): accuracy, binary/micro/macro/weighted metrics
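
For reference, these metrics can be reproduced with scikit-learn along the following lines (a sketch; NoAssociation is the label named above, the other labels are illustrative):

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

gold = ["Positive", "NoAssociation", "Negative", "Positive"]  # illustrative
pred = ["Positive", "NoAssociation", "Positive", "Positive"]

# Multiclass metrics
print(accuracy_score(gold, pred))
print(f1_score(gold, pred, average="macro"))
print(confusion_matrix(gold, pred))

# Binary collapse: anything other than NoAssociation counts as an association
binarize = lambda ys: ["association" if y != "NoAssociation" else y for y in ys]
print(f1_score(binarize(gold), binarize(pred), pos_label="association"))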

Full Pipeline (Candidate Generation → Filter → Inference → CoCoS)

The unified pipeline script scripts/pipeline.py runs the full workflow:

1. Generate Candidates

Extract relation candidate pairs from BioC JSON files (with NER annotations).

python scripts/pipeline.py candidate \
    --input_dir ./data/extraction/input \
    --output ./output/candidates.csv
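
Under the hood, candidates come from pairing NER annotations found in the BioC JSON; a rough stand-in for the idea (not the project's actual code) is:

import itertools
import json

# Sketch: enumerate entity pairs within each passage of one BioC JSON file.
with open("./data/extraction/input/example.json") as fh:  # hypothetical file
    collection = json.load(fh)

for doc in collection["documents"]:
    for passage in doc["passages"]:
        annotations = passage.get("annotations", [])
        for a, b in itertools.combinations(annotations, 2):
            print(doc["id"], a["text"], a["infons"]["type"],
                  b["text"], b["infons"]["type"])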

2. Filter Candidates

Filter candidate pairs by entity type combinations.

python scripts/pipeline.py filter \
    --input ./output/candidates.csv \
    --output ./output/filtered.csv \
    --entity_type_filters default

Available filters (comma-separated):

Filter         Entity pairs
------------   --------------------------------------------------------------
food_disease   foodRelated → diseasePhenotype
food_bio       foodRelated → geneSNP/proteinEnzyme/metabolite/microbiome
disease_bio    diseasePhenotype → geneSNP/proteinEnzyme/metabolite/microbiome
food_food      foodRelated → foodRelated
bio_cross      bio type → a different bio type
bio_self       bio type → the same bio type
default        all of the above
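
As a sketch, the food_disease filter amounts to something like the following, assuming the candidates CSV carries two entity-type columns (column names here are hypothetical):

import pandas as pd

cands = pd.read_csv("./output/candidates.csv")

# Keep only foodRelated → diseasePhenotype pairs (e1_type/e2_type are assumed names).
mask = (cands["e1_type"] == "foodRelated") & (cands["e2_type"] == "diseasePhenotype")
cands[mask].to_csv("./output/filtered.csv", index=False)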

3. Run Inference

Predict relation types for each candidate pair using an OpenAI-compatible LLM API.

python scripts/pipeline.py inference \
    --input ./output/filtered.csv \
    --output ./output/inference.csv \
    --base_url http://localhost:8010/v1 \
    --model_name openai/gpt-oss-20b

# For testing without an LLM (random predictions):
python scripts/pipeline.py inference \
    --input ./output/filtered.csv \
    --output ./output/inference.csv \
    --dummy

4. Build CoCoS

Build the Corpus-level Concept Summary (CoCoS) knowledge graph from inference results.

This step includes:

Within each document (intra-doc):

  • Abbreviation expansion
  • UK/US English normalization
  • Entity normalization (cluster mentions sharing annotation IDs, pick representative text by frequency)
  • Self-relation removal (token overlap)
  • Hierarchical voting per entity pair → one relation label per (document, e1, e2)

Across documents (inter-doc):

  • Aggregate document-level relations per entity pair
  • Compute two scores per pair (a worked sketch follows this list):
    • Association Support (AS): AS = N_assoc / (N_assoc + N_no) — evidence for an association in general
    • Effect Estimate (EE): EE = (N_pos - N_neg) / N_assoc — direction of the association (+1 = direct, -1 = inverse)
  • Build knowledge graph (nodes with entity type/color/doc_cnt, edges with as_score/ee_score/doc_count)
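
A worked sketch of the two scores (label names are illustrative; the real label set and voting hierarchy are defined in the pipeline code):

from collections import Counter

# One document-level label per document for a single (e1, e2) pair.
doc_labels = ["Positive", "Positive", "Negative", "NoAssociation", "Positive"]
counts = Counter(doc_labels)

n_assoc = counts["Positive"] + counts["Negative"]  # N_assoc = 4
n_no = counts["NoAssociation"]                     # N_no    = 1

as_score = n_assoc / (n_assoc + n_no)                           # AS = 0.8
ee_score = (counts["Positive"] - counts["Negative"]) / n_assoc  # EE = 0.5
print(as_score, ee_score)
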
python scripts/pipeline.py cocos \
    --input ./output/inference.csv \
    --input_dir ./data/extraction/input \
    --output_dir ./output/cocos \
    --eng_us_path ./data/extraction/resources/eng_us_uk.txt

Output files:

  • recode_cocos.graphml — NetworkX graph (nodes: entity type, color, doc_cnt; edges: as_score, ee_score, counts); see the loading sketch below
  • recode_cocos.csv — aggregated pair scores (as_score, ee_score, doc_count, relation type counts)
  • processed_relations.csv — all relations after normalization
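
The GraphML output can be loaded back into NetworkX, e.g. (a sketch):

import networkx as nx

g = nx.read_graphml("./output/cocos/recode_cocos.graphml")
for u, v, data in g.edges(data=True):
    print(u, v, data.get("as_score"), data.get("ee_score"))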

Run All Steps at Once

python scripts/pipeline.py all \
    --input_dir ./data/extraction/input \
    --output_dir ./output \
    --base_url http://localhost:8010/v1 \
    --model_name openai/gpt-oss-20b \
    --eng_us_path ./data/extraction/resources/eng_us_uk.txt \
    --entity_type_filters default

# Full pipeline with dummy inference (for testing):
python scripts/pipeline.py all \
    --input_dir ./data/extraction/input \
    --output_dir ./output \
    --entity_type_filters food_disease \
    --dummy

Creating Your Own Dataset

  1. Prepare your data in BioC JSON format (see the input example in the repository); a minimal sketch of the expected structure follows this list.

  2. Generate candidate relation pairs:

python scripts/pipeline.py candidate \
    --input_dir ./data/extraction/input \
    --output ./output/candidates.csv

  3. Filter, run inference, and build CoCoS (see above).
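
A minimal sketch of the expected input structure (field names follow the standard BioC JSON schema; which additional fields are required is best checked against the repository's input example):

import json

# Hypothetical single-document BioC-style collection with two NER annotations.
collection = {
    "documents": [{
        "id": "doc1",
        "passages": [{
            "offset": 0,
            "text": "Coffee intake was associated with type 2 diabetes.",
            "annotations": [
                {"id": "T1", "text": "Coffee",
                 "infons": {"type": "foodRelated"},
                 "locations": [{"offset": 0, "length": 6}]},
                {"id": "T2", "text": "type 2 diabetes",
                 "infons": {"type": "diseasePhenotype"},
                 "locations": [{"offset": 34, "length": 15}]},
            ],
        }],
    }],
}

with open("./data/extraction/input/doc1.json", "w") as fh:
    json.dump(collection, fh, indent=2)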

👥 Contributors


👉 Donghee
👉 Yajie
👉 Joram