🇻🇳 Vietnamese AMR Parser for VLSP 2025 Competition
This project implements a Vietnamese Abstract Meaning Representation (AMR) parser developed for the VLSP 2025 competition. The system converts Vietnamese sentences into AMR graphs using large language models trained with supervised fine-tuning (SFT) and reinforcement learning via Group Relative Policy Optimization (GRPO).
- Vietnamese AMR Parsing: Convert Vietnamese sentences to PENMAN-format AMR graphs
- Multiple Training Approaches:
  - Supervised Fine-Tuning (SFT)
  - Reinforcement learning with Group Relative Policy Optimization (GRPO)
- Advanced Post-processing: Comprehensive AMR validation and correction
- Evaluation Metrics: Automated scoring and evaluation system
- DeepSpeed Integration: Efficient training with ZeRO optimization
VLSP2025/amr/src/
├── main.py # Main inference pipeline
├── infer.py # Model inference utilities
├── data_loader.py # Data loading and preprocessing
├── data_processing.py # Advanced data processing
├── train_sft.py # Supervised fine-tuning
├── train_grpo.py # GRPO reinforcement learning training
├── postprocessing.py # AMR validation and correction
├── prompt.py # System prompts and templates
├── reward.py # Reward functions for RL training
├── get_score.py # Evaluation and scoring
├── config/ # Training configurations
│ └── ds_zero2.json # DeepSpeed ZeRO stage 2 config
└── scripts/ # Training and inference scripts
    ├── train_sft.sh # SFT training script
    ├── train_grpo.sh # GRPO training script
    ├── infer.sh # Inference script
    ├── get_score.sh # Evaluation script
    └── main.sh # Main pipeline script
# Navigate to the AMR source directory
cd VLSP2025/amr/src
# Install dependencies
pip install -r requirements.txt
# Process and split training data
python data_processing.py
python split_train_test.py
# Train with supervised fine-tuning
bash scripts/train_sft.sh
# Train with Group Relative Policy Optimization
bash scripts/train_grpo.sh
# Run AMR parsing inference
bash scripts/infer.sh
# Or run the main pipeline
bash scripts/main.sh
# Evaluate model performance
bash scripts/get_score.sh
AMR Parser (infer.py)
The main parsing component is the QwenReasoner class:
class QwenReasoner:
    def inference(self, prompt: str, max_new_tokens: int = 2048, is_extract_amr: bool = False) -> str: ...
Post-processing (postprocessing.py)
Advanced AMR validation and correction functions:
- remove_single_prop_nodes - Remove single-property nodes
- has_duplicate_nodes - Check for duplicate variable names
- dedup_and_tidy - Remove duplicate roles and clean formatting
- balance_parens - Fix parentheses balance
- fix_amr_vars - Correct variable declarations
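To make two of these checks concrete, here is a minimal sketch of what parenthesis balancing and duplicate-variable detection could look like. The function names `balance_parens_sketch` and `has_duplicate_vars_sketch` are hypothetical and the logic is an assumption for illustration; the project's own `balance_parens` and `has_duplicate_nodes` in postprocessing.py may behave differently.

```python
import re


def balance_parens_sketch(amr: str) -> str:
    """Drop unmatched ')' characters and append any missing ')' at the end.

    Illustrative only: not the project's balance_parens implementation.
    """
    out = []
    depth = 0
    in_quotes = False
    for ch in amr:
        if ch == '"':
            in_quotes = not in_quotes  # parentheses inside string literals are ignored
        if not in_quotes:
            if ch == "(":
                depth += 1
            elif ch == ")":
                if depth == 0:
                    continue  # unmatched closing parenthesis: skip it
                depth -= 1
        out.append(ch)
    return "".join(out) + ")" * depth


def has_duplicate_vars_sketch(amr: str) -> bool:
    """Return True if the same variable is declared twice, e.g. two '(t / ...' nodes."""
    declared = re.findall(r"\(\s*([^\s/()]+)\s*/", amr)
    return len(declared) != len(set(declared))


print(balance_parens_sketch("(h / học :ARG0 (t / tôi"))
print(has_duplicate_vars_sketch("(h / học :ARG0 (t / tôi) :ARG1 (t / tiếng))"))
```

A repair like this only restores well-formedness; it cannot tell where a missing parenthesis semantically belongs, so appending at the end is a fallback choice.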
Prompting System (prompt.py)
Structured prompts with Vietnamese-specific instructions:
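# English gloss of the prompt below: "You are a large language model specializing in
# semantic parsing for Vietnamese. Your task is to convert an input Vietnamese sentence
# into a complete AMR representation."
SYSTEM_PROMPT = '''
Bạn là một mô hình ngôn ngữ lớn chuyên về phân tích cú pháp ngữ nghĩa cho tiếng Việt.
Nhiệm vụ của bạn là chuyển đổi một câu tiếng Việt đầu vào thành biểu diễn AMR hoàn chỉnh.
'''
- DeepSpeed: config/ds_zero2.json - ZeRO stage 2 optimization
- Model Support: Qwen2.5, LLaMA3, and other transformer models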
- RL Training: GRPO algorithm with custom reward functions
- Max Sequence Length: 2048 tokens
- Training Approaches: SFT + GRPO reinforcement learning
- Output Format: PENMAN notation AMR graphs
- Language: Vietnamese with underthesea tokenization
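For orientation, a ZeRO stage 2 configuration of the kind listed above can be sketched as follows. Every value here is an assumption chosen for illustration, and the actual contents of config/ds_zero2.json are not shown in this document. Stage 2 partitions optimizer states and gradients across data-parallel workers while each worker keeps a full copy of the model parameters.

```python
import json

# Hypothetical ZeRO stage 2 settings; the project's config/ds_zero2.json may differ.
ds_zero2 = {
    "bf16": {"enabled": True},            # assumed mixed-precision mode
    "zero_optimization": {
        "stage": 2,                       # partition optimizer states and gradients
        "overlap_comm": True,             # overlap gradient reduction with backward pass
        "contiguous_gradients": True,     # reduce memory fragmentation
        "reduce_scatter": True,
    },
    "gradient_accumulation_steps": 4,     # assumed value
    "train_micro_batch_size_per_gpu": 2,  # assumed value
    "gradient_clipping": 1.0,             # assumed value
}

print(json.dumps(ds_zero2, indent=2))
```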
Supervised fine-tuning uses train_sft.py to train the model on Vietnamese sentence-AMR pairs with standard cross-entropy loss.
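As a rough illustration of the loss, the sketch below computes token-level cross-entropy in plain Python. It assumes that prompt positions are masked with the label -100 (the default ignore index in PyTorch's cross-entropy loss) so that only AMR tokens contribute; whether train_sft.py masks the prompt this way is an assumption, and `token_cross_entropy` is a hypothetical helper.

```python
import math

IGNORE_INDEX = -100  # label value conventionally used to skip a position


def token_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    logits: list of per-position score lists (one score per vocabulary entry)
    labels: list of target token ids, or IGNORE_INDEX for masked positions
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # e.g. prompt tokens, which are not trained on in this sketch
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # stable log-sum-exp
        total += log_z - row[label]  # -log softmax(row)[label]
        count += 1
    return total / count if count else 0.0


# Toy example: the first position is masked, the second targets token id 0.
loss = token_cross_entropy([[0.0, 0.0], [2.0, 0.0]], [IGNORE_INDEX, 0])
print(loss)
```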
GRPO training uses train_grpo.py with:
- Custom reward functions from reward.py
- Group Relative Policy Optimization
- AMR quality-based rewards
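The core idea of GRPO is to score several sampled outputs for the same input and normalize each reward against its own group, which can be sketched as below. The reward `paren_balance_reward` is a hypothetical stand-in, not one of the functions in reward.py, and the use of the population standard deviation is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev


def paren_balance_reward(amr: str) -> float:
    """Hypothetical reward: 1.0 if parentheses are balanced, else 0.0."""
    depth = 0
    for ch in amr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return 0.0
    return 1.0 if depth == 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled completions for one sentence; the second is malformed.
completions = [
    "(h / học :ARG0 (t / tôi))",
    "(h / học :ARG0 (t / tôi",
    "(h / học)",
]
rewards = [paren_balance_reward(c) for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Completions that score above the group mean receive a positive advantage and are reinforced; those below the mean are penalized, with no separate value model required.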
The evaluation system (get_score.py) provides:
- AMR graph accuracy metrics
- Semantic similarity scoring
- Structure validation checks
- Performance benchmarking
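As a simplified picture of a graph accuracy metric, the sketch below computes precision, recall, and F1 over AMR triples. It assumes the predicted and gold graphs already share variable names; the standard AMR metric, Smatch, additionally searches for the best variable alignment. Whether get_score.py uses Smatch or another measure is not stated here, and `triple_f1` is a hypothetical helper.

```python
def triple_f1(predicted, gold):
    """F1 over (source, role, target) triples, assuming variables are already aligned."""
    pred_set, gold_set = set(predicted), set(gold)
    matched = len(pred_set & gold_set)
    if matched == 0:
        return 0.0
    precision = matched / len(pred_set)
    recall = matched / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Gold graph for "Tôi đang học": (h / học :ARG0 (t / tôi))
gold = [("h", ":instance", "học"), ("t", ":instance", "tôi"), ("h", ":ARG0", "t")]
# The prediction misses the instance triple for "tôi".
predicted = [("h", ":instance", "học"), ("h", ":ARG0", "t")]

score = triple_f1(predicted, gold)
print(score)
```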
from infer import QwenReasoner
from postprocessing import process_amr_general
# Initialize the AMR parser
reasoner = QwenReasoner(model_path="path/to/model")
# Parse Vietnamese sentence to AMR
sentence = "Tôi đang học tiếng Việt."
amr_result = reasoner.inference(sentence)
# Post-process the result
cleaned_amr = process_amr_general(amr_result)
print(cleaned_amr)
This project is developed for the VLSP 2025 competition. The system focuses on Vietnamese language processing and AMR semantic representation.
- Vietnamese Language Processing
- Abstract Meaning Representation (AMR)
- PENMAN Notation
- Group Relative Policy Optimization (GRPO)