# MPL: Multiple Programming Languages with Large Language Models for Information Extraction [ACL'25 Findings]

Bo Li, Gexiang Fang, Wei Ye, Zhenghua Xu, Jinglei Zhang, Hao Cheng and Shikun Zhang
We present MPL: Multiple Programming Languages with Large Language Models for Information Extraction. Recent advances in information extraction (IE) have explored the use of code-style prompts to improve structured output generation. This approach leverages the inherent structure of programming languages (PLs), which are often more precise and organized than natural language. While most existing work focuses on Python as the primary PL for simulation and fine-tuning, our framework MPL extends this paradigm by incorporating multiple widely-used programming languages, such as C++, Java, and Python, into the supervised fine-tuning (SFT) process. This allows the model to learn cross-language structural patterns that enhance IE performance. To further improve the code-style simulation, we introduce a novel function-prompt with virtual execution, enabling more effective and efficient generation of structured outputs. This repository contains the implementation, training scripts, and evaluation tools for MPL. Please refer to the supplementary materials for more details and trained models.
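To make the idea concrete, here is our own illustrative sketch of a function-style prompt for NER (the exact template used by MPL lives in `dataPreparation/prompt.py` and may differ): the input sentence is wrapped in a function definition whose docstring encodes the label schema, and the model "virtually executes" the function by emitting its return value.

```python
# Hypothetical sketch of a code-style function-prompt for NER.
# Not the exact MPL template; see dataPreparation/prompt.py for the real one.

def build_ner_prompt(sentence: str, labels: dict) -> str:
    """Render an NER instance as a Python function the model 'virtually executes'."""
    label_lines = "\n".join(f'        "{name}": {desc}' for name, desc in labels.items())
    return (
        "def named_entity_recognition(sentence: str) -> list:\n"
        '    """Extract (entity, label) pairs from the sentence.\n'
        "    Labels:\n"
        f"{label_lines}\n"
        '    """\n'
        f'    sentence = "{sentence}"\n'
        "    return "
    )

prompt = build_ner_prompt(
    "Barack Obama visited Paris.",
    {"PER": "a person", "LOC": "a location"},
)
print(prompt)
```

The model is then expected to complete the `return` statement with a list of extracted tuples, which is easy to parse back into structured output.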
```
project/
├── dataPreparation/            # Data preprocessing module
│   ├── Formatted/              # Output of formatted data
│   ├── Raw/                    # Original data and related processing
│   │   ├── EAE/
│   │   ├── EE/
│   │   ├── NER/
│   │   └── RE/
│   ├── build.py                # Converts raw data into formatted data
│   ├── prompt.py               # Transforms IE data into code-style format prompts
│   └── generate_datasets.ipynb # Generates complete datasets for training
├── train/                      # Training-related code
│   ├── scripts/                # Training scripts
│   ├── open_instruct/          # Core training code
│   └── run_mpl.sh              # Entry script
├── evaluation/                 # Evaluation module
└── dataTrain/                  # Storage for training data
```
- Clone the repository:

```shell
git clone https://github.com/PKU-Fgx/MPL.git
cd MPL
```

- Install dependencies:
```shell
# Environment setup
conda create -n MPL python=3.12 -y
conda activate MPL

# Install dependencies
pip install -r requirements.txt
```

| Task | Dataset | Link | Label Explanations | Domain |
|---|---|---|---|---|
| NER | ACE05 | ACE 2005 | ✅ | News |
| NER | BC5CDR | tner/bc5cdr | ❌ | Biomedical |
| NER | CoNLL03 | conll2003 | ❌ | News |
| NER | DIANN | diann-sentences-english | ❌ | Biomedical |
| NER | NCBID | ncbi-disease | ❌ | Biomedical |
| NER | OntoNotes5* | OntoNotes 5.0 | ❌ | News |
| NER | WNUT2017 | tner/wnut2017 | ❌ | News |
| RE | ACE05 | ACE 2005 | ✅ | News |
| RE | CoNLL04 | DFKI-SLT/conll04 | ❌ | News |
| EAE | ACE05 | ACE 2005 | ✅ | News |
| EAE | RAMS | RAMS | ❌ | News |
| EE | ACE05 | ACE 2005 | ✅ | News |
- The raw data should be placed under `dataPreparation/Raw/<TASK>`, and each task directory must contain:
  - `label_exp.json`: Label definitions and explanations
  - `label_map.json`: Label mappings
  - `train.json`: Training data
  - `dev.json`: Validation data
  - `test.json`: Test data
Note: Some datasets, such as ACE05, include official label explanations, while others do not. For those without, we provide AI-generated versions in `dataPreparation/Raw/<TASK>`.
Additionally, to maintain comparability and increase training efficiency, we sampled 30k entries from the OntoNotes5 dataset for training.
- Since raw data formats vary, we provide reference scripts (`trans_<TASK>.py`) in `dataPreparation/Raw/<TASK>/` to convert them into an intermediate format. To standardize labels across datasets, we also provide `reformat.py` in the same directory.
- Before proceeding, ensure you prepare a `label_exp.json` file for each dataset, representing label explanations as `{"label": "Explanation"}`.
- For detailed intermediate format specifications per dataset, refer to `dataPreparation/README.md`.
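For instance, a minimal `label_exp.json` following the `{"label": "Explanation"}` convention could be produced like this (the labels and wording below are illustrative, not taken from any of the datasets above):

```python
import json

# Illustrative label_exp.json content in the expected {"label": "Explanation"} shape.
label_exp = {
    "PER": "Names of people, including aliases.",
    "LOC": "Geographic locations such as cities and countries.",
    "ORG": "Companies, institutions, and other organizations.",
}

with open("label_exp.json", "w") as f:
    json.dump(label_exp, f, indent=2)
```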
- After the above steps, place the intermediate-format files at `dataPreparation/Raw/<TASK>/<DATASET>/<SPLIT>.json`. Then use `dataPreparation/build.py` to generate CodeIE-formatted datasets, which will be saved under `dataPreparation/Formatted`.
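As a rough sketch of what this conversion step does (the field names `sentence` and `entities` are assumptions for illustration; the actual schema and logic live in `dataPreparation/build.py`):

```python
# Hypothetical intermediate-format -> code-style conversion sketch.
# Field names ("sentence", "entities", "span", "label") are assumptions,
# not the repo's actual schema.

def to_code_style(record: dict) -> dict:
    """Turn one intermediate-format record into a (prompt, completion) pair."""
    prompt = (
        "def named_entity_recognition(sentence: str) -> list:\n"
        f'    sentence = "{record["sentence"]}"\n'
        "    return "
    )
    completion = repr([(e["span"], e["label"]) for e in record["entities"]])
    return {"prompt": prompt, "completion": completion}

pair = to_code_style({
    "sentence": "Obama visited Paris.",
    "entities": [{"span": "Obama", "label": "PER"}, {"span": "Paris", "label": "LOC"}],
})
print(pair["completion"])  # [('Obama', 'PER'), ('Paris', 'LOC')]
```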
- Finally, use the `dataPreparation/generate_datasets.ipynb` notebook to consolidate the data into the Open-Instruct format, placing it under the `dataTrain` folder.
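For orientation, one consolidated training example in the Open-Instruct chat-style JSONL format might look like the sketch below (the `dataset` and `id` values are made up here; check allenai/open-instruct's documentation for the exact fields it expects):

```python
import json

# One training example in an Open-Instruct-style chat format (a sketch; the
# "dataset" and "id" values are hypothetical placeholders).
example = {
    "dataset": "mpl_ner",   # assumed identifier
    "id": "ace05_ner_0",    # assumed identifier
    "messages": [
        {"role": "user", "content": "def named_entity_recognition(...): ..."},
        {"role": "assistant", "content": "[('Obama', 'PER')]"},
    ],
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```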
We utilize allenai/open-instruct for fine-tuning. Refer to `train/run_mpl.sh`, `train/scripts/MPL_qlora.sh`, and `train/scripts/run.sh` for more details.
```shell
# Run MPL training script
bash train/run_mpl.sh
```

We use vllm-project/vllm for inference and evaluation. Refer to `evaluation/README.md` for more details.
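IE evaluation conventionally reports tuple-level micro-F1 over the extracted structures; a minimal reference version can be sketched as follows (this is our own illustration, not the logic in `evaluation/get_scores.py`):

```python
# Tuple-level micro-F1 sketch (our own illustration, not evaluation/get_scores.py).
from collections import Counter

def micro_f1(gold: list, pred: list) -> float:
    """gold/pred: per-sentence lists of tuples, e.g. [("Obama", "PER"), ...]."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_count, p_count = Counter(g), Counter(p)
        overlap = sum((g_count & p_count).values())  # multiset intersection
        tp += overlap
        fp += len(p) - overlap
        fn += len(g) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [[("Obama", "PER"), ("Paris", "LOC")]]
pred = [[("Obama", "PER")]]
print(micro_f1(gold, pred))  # precision 1.0, recall 0.5 -> F1 ~ 0.667
```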
```shell
# Evaluate using vLLM
python evaluation/eval_vllm.py --model_id <model_id> --base_model_path <base_model_path> --lan <language>

# Calculate evaluation scores
python evaluation/get_scores.py --model_id <model_id> --method <method>
```

If you find this work helpful, please cite:

```bibtex
@article{li2025mpl,
  title={MPL: Multiple Programming Languages with Large Language Models for Information Extraction},
  author={Li, Bo and Fang, Gexiang and Ye, Wei and Xu, Zhenghua and Zhang, Jinglei and Cheng, Hao and Zhang, Shikun},
  journal={arXiv preprint arXiv:2505.16107},
  year={2025}
}
```
