MPL: Multiple Programming Languages with Large Language Models for Information Extraction [ACL'25 Findings]
Bo Li, Gexiang Fang, Wei Ye, Zhenghua Xu, Jinglei Zhang, Hao Cheng and Shikun Zhang


We present MPL: Multiple Programming Languages with Large Language Models for Information Extraction. Recent advances in information extraction (IE) have explored code-style prompts to improve structured output generation, leveraging the inherent structure of programming languages (PLs), which is often more precise and organized than natural language. While most existing work focuses on Python as the primary PL for simulation and fine-tuning, MPL extends this paradigm by incorporating multiple widely used PLs, such as C++, Java, and Python, into the supervised fine-tuning (SFT) process, allowing the model to learn cross-language structural patterns that enhance IE performance. To further improve the code-style simulation, we introduce a novel function-prompt with virtual execution, enabling more effective and efficient generation of structured outputs.

This repository contains the implementation, training scripts, and evaluation tools for MPL. Please refer to the supplementary materials for more details and trained models.
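To make the idea concrete, here is a hypothetical sketch of how an NER instance could be rendered as a code-style function-prompt. The function names and template below are our own illustration, not MPL's actual prompt format:

```python
# Hypothetical sketch of a code-style "function-prompt" for NER.
# The names and template are illustrative, not MPL's actual prompt.

def build_ner_prompt(sentence: str, labels: dict[str, str]) -> str:
    """Render an IE instance as a Python-style function prompt.

    `labels` maps each entity type to its explanation, e.g.
    {"PER": "person names", "ORG": "organizations"}.
    """
    label_doc = "\n".join(f"        {name}: {exp}" for name, exp in labels.items())
    return (
        "def extract_entities(text: str) -> list[tuple[str, str]]:\n"
        '    """Extract (span, label) pairs from text.\n\n'
        "    Labels:\n"
        f"{label_doc}\n"
        '    """\n'
        f"    text = {sentence!r}\n"
        "    # The model continues from here, 'virtually executing' the\n"
        "    # function by emitting its return value as structured output.\n"
    )

prompt = build_ner_prompt(
    "Steve Jobs founded Apple.",
    {"PER": "person names", "ORG": "organizations"},
)
print(prompt)
```

The same template can be re-rendered in C++ or Java syntax to produce the multi-language SFT data the paper describes.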

🏗️ Repo Structure

```
project/
├── dataPreparation/          # Data preprocessing module
│   ├── Formatted/            # Output of formatted data
│   ├── Raw/                  # Original data and related processing
│   │   ├── EAE/
│   │   ├── EE/
│   │   ├── NER/
│   │   └── RE/
│   ├── build.py              # Converts raw data into formatted data
│   ├── prompt.py             # Transforms IE data into Code-style Format Prompt
│   └── generate_datasets.ipynb  # Generates complete datasets for training
├── train/                    # Training-related code
│   ├── scripts/              # Training scripts
│   ├── open_instruct/        # Core training code
│   └── run_mpl.sh            # Entry script
├── evaluation/               # Evaluation module
└── dataTrain/                # Storage for training data
```

📥 Installation

1. Clone the repository:

```shell
git clone https://github.com/PKU-Fgx/MPL.git
cd MPL
```

2. Install dependencies:

```shell
# Environment setup
conda create -n MPL python=3.12 -y
conda activate MPL

# Install dependencies
pip install -r requirements.txt
```

🏋️ Training Steps

1. Data Preprocessing

1. Obtain the raw datasets

| Task | Dataset | Link | Label Explanations | Domain |
| ---- | ------- | ---- | ------------------ | ------ |
| NER | ACE05 | ACE 2005 | | News |
| NER | BC5CDR | tner/bc5cdr | | Biomedical |
| NER | CoNLL03 | conll2003 | | News |
| NER | DIANN | diann-sentences-english | | Biomedical |
| NER | NCBID | ncbi-disease | | Biomedical |
| NER | OntoNotes5* | OntoNotes 5.0 | | News |
| NER | WNUT2017 | tner/wnut2017 | | News |
| RE | ACE05 | ACE 2005 | | News |
| RE | CoNLL04 | DFKI-SLT/conll04 | | News |
| EAE | ACE05 | ACE 2005 | | News |
| EAE | RAMS | RAMS | | News |
| EE | ACE05 | ACE 2005 | | News |
  • The raw data should be placed under dataPreparation/Raw/<TASK>, and each task directory must contain:
    • label_exp.json: Label definitions and explanations
    • label_map.json: Label mappings
    • train.json: Training data
    • dev.json: Validation data
    • test.json: Test data

Note: Some datasets like ACE05 include label explanations, but others do not. For those without explanations, we provide AI-generated versions in dataPreparation/Raw/<TASK>.

Additionally, to maintain comparability and increase training efficiency, we sampled 30k entries from the OntoNotes5 dataset for training.
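A subsample of that kind can be produced with a short script along these lines (the file paths, seed, and function name are our own illustration, not taken from the repository):

```python
# Hypothetical sketch of subsampling a large dataset (e.g. OntoNotes5)
# down to k entries; paths and seed are assumptions, not the repo's.
import json
import random

def sample_entries(in_path: str, out_path: str, k: int = 30_000, seed: int = 42) -> int:
    """Write a random k-entry subset of a JSON-list dataset to out_path."""
    with open(in_path, encoding="utf-8") as f:
        data = json.load(f)
    # Fixed seed keeps the subsample reproducible across runs.
    random.Random(seed).shuffle(data)
    subset = data[:k]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(subset, f, ensure_ascii=False, indent=2)
    return len(subset)
```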

2. Convert raw datasets to intermediate format

  • Since raw data formats vary, we provide reference scripts (trans_<TASK>.py) in dataPreparation/Raw/<TASK>/ to convert them into an intermediate format. To standardize labels across datasets, we also provide reformat.py in the same directory.

  • Before proceeding, ensure you prepare a label_exp.json file for each dataset, representing label explanations as {"label": "Explanation"}.

  • For detailed intermediate format specifications per dataset, refer to dataPreparation/README.md.
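For illustration, a minimal label_exp.json for a NER dataset could be generated like this (the label names and wording are invented examples, not the repository's shipped explanations):

```python
# Illustrative label_exp.json content in the {"label": "Explanation"} shape;
# the labels and wording below are examples, not the repo's actual files.
import json

label_exp = {
    "person": "Names of people, including fictional characters.",
    "organization": "Companies, agencies, institutions, and other groups.",
    "location": "Geographic places such as countries, cities, and regions.",
}

with open("label_exp.json", "w", encoding="utf-8") as f:
    json.dump(label_exp, f, ensure_ascii=False, indent=2)
```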

3. Transform intermediate data into CodeIE format

  • After the above steps, place intermediate format files at dataPreparation/Raw/<TASK>/<DATASET>/<SPLIT>.json. Use dataPreparation/build.py to generate CodeIE-formatted datasets, which will be saved under dataPreparation/Formatted.

4. Generate training-ready datasets

  • Finally, use the dataPreparation/generate_datasets.ipynb notebook to consolidate the data into the Open-Instruct format and place it under the dataTrain folder.

2. Model Training

We utilize allenai/open-instruct for fine-tuning. Refer to train/run_mpl.sh, train/scripts/MPL_qlora.sh, and train/scripts/run.sh for more details.

```shell
# Run MPL training script
bash train/run_mpl.sh
```

3. Model Evaluation

We use vllm-project/vllm for inference and evaluation. Refer to evaluation/README.md for more details.

```shell
# Evaluate using vLLM
python evaluation/eval_vllm.py --model_id <model_id> --base_model_path <base_model_path> --lan <language>

# Calculate evaluation scores
python evaluation/get_scores.py --model_id <model_id> --method <method>
```
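For reference, IE systems are usually scored with span-level micro-F1 over predicted (span, label) pairs; a minimal sketch of that metric is below (get_scores.py's exact matching rules may differ):

```python
# Hedged sketch of span-level micro-F1 for IE evaluation; the repo's
# scoring script may apply different matching rules per task.
def micro_f1(preds, golds):
    """preds/golds: lists of sets of (span, label) tuples, one set per sentence."""
    tp = fp = fn = 0
    for p, g in zip(preds, golds):
        tp += len(p & g)   # predicted and gold agree
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```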

📝 Citation

```bibtex
@article{li2025mpl,
  title={MPL: Multiple Programming Languages with Large Language Models for Information Extraction},
  author={Li, Bo and Fang, Gexiang and Ye, Wei and Xu, Zhenghua and Zhang, Jinglei and Cheng, Hao and Zhang, Shikun},
  journal={arXiv preprint arXiv:2505.16107},
  year={2025}
}
```
