dex0shubham/MAGRPO

MAGRPO

Code and data for "Agreement as Certificate: Multi-Agent Group Relative Policy Optimization for Reliable Mathematical Reasoning" (Anonymous EMNLP 2026 Submission).

Repository Structure

src/magrpo/           # Core library
  normalization.py    # Answer extraction and normalization
  executor.py         # Restricted code execution environment
  rewards.py          # Additive joint reward function
  aggregate.py        # Result aggregation utilities

scripts/
  train_magrpo.py     # MAGRPO training script
  evaluate.py         # Multi-agent evaluation script
  reproduce_tables.py # Reproduce paper tables from saved results
  complementarity_analysis.py  # SC@8 vs code verifier analysis
  notebook_exports/   # Converted experiment notebooks (for transparency)

results/
  raw_summaries/      # JSON summaries for all experiments
  analysis/           # Complementarity and frontier analysis outputs
  tables/             # Reproduced table outputs (including LaTeX)

figures/              # Paper figures (architecture, training effect, etc.)
tests/                # Unit tests
paper_assets/plots/   # Additional experiment plots

Quick Start

Install

pip install -e ".[train,test]"

Reproduce Paper Tables

python scripts/reproduce_tables.py

Expected output:

  • 3B GSM8K MAGRPO-best multi-agent accuracy: 76.32%
  • Agreement-only selective accuracy: 95.70% at 38.08% coverage
  • Complementarity: code verifier solves 39.0 problems/seed that SC@8 does not
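The "selective accuracy at coverage" figures can be read as follows: coverage is the fraction of problems on which the agents agree, and selective accuracy is the accuracy on that agreed subset only. The sketch below illustrates that reading under stated assumptions: agreement is taken to mean that every agent returns the same normalized answer, and the function name and input layout are hypothetical, not taken from `aggregate.py`.

```python
from typing import List, Optional, Tuple


def agreement_metrics(
    agent_answers: List[List[Optional[str]]],
    gold: List[str],
) -> Tuple[float, float]:
    """Return (coverage, selective_accuracy) for agreement-based selection.

    agent_answers[i] holds each agent's normalized answer for problem i
    (None if the agent produced no answer). Hypothetical layout.
    """
    if len(agent_answers) != len(gold):
        raise ValueError("agent_answers and gold must have the same length")

    agreed = 0
    agreed_correct = 0
    for answers, target in zip(agent_answers, gold):
        first = answers[0] if answers else None
        # Assumption: "agreement" means all agents gave the same non-empty answer.
        if first is not None and all(a == first for a in answers):
            agreed += 1
            if first == target:
                agreed_correct += 1

    coverage = agreed / len(gold) if gold else 0.0
    selective_accuracy = agreed_correct / agreed if agreed else 0.0
    return coverage, selective_accuracy


answers = [["42", "42"], ["7", "8"], ["3", "3"], ["5", "5"]]
gold = ["42", "7", "4", "5"]
coverage, selective_accuracy = agreement_metrics(answers, gold)
print(f"coverage={coverage:.2%}, selective accuracy={selective_accuracy:.2%}")
```

In this toy input the agents agree on three of four problems and are correct on two of those three, so coverage and selective accuracy are reported separately rather than folded into one number.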

Run Tests

pytest -v

Training (requires GPU)

python scripts/train_magrpo.py \
  --model_name Qwen/Qwen2.5-3B-Instruct \
  --train_size 800 --dev_size 100 \
  --group_size 8 --epochs 2 --seed 42 \
  --output_dir outputs/seed42
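
The `--group_size 8` flag suggests that several completions are sampled per prompt and scored relative to one another. As a rough illustration only, the sketch below shows the group-relative advantage used in standard GRPO: each reward is centered on the group mean and scaled by the group standard deviation. Whether `train_magrpo.py` normalizes exactly this way is an assumption; the function name, the use of the population standard deviation, and the `eps` value are all hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center rewards on the group mean and scale by the group std.

    rewards: one scalar reward per sampled completion in the group.
    eps: small constant that avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of 8 completions, half of them rewarded.
group = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]
for reward, adv in zip(group, group_relative_advantages(group)):
    print(f"reward={reward:.1f} advantage={adv:+.3f}")
```

When every completion in a group receives the same reward, all advantages are zero and the group contributes no learning signal under this normalization.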

Evaluation

python scripts/evaluate.py \
  --model_name Qwen/Qwen2.5-3B-Instruct \
  --adapter_dir outputs/seed42/best_adapters \
  --dataset gsm8k --seed 42

Key Results (3-seed mean, GSM8K full test, n=1319)

Metric                    Base     Post-MAGRPO   Delta
Multi-agent accuracy      75.7%    76.3%         +0.6pp
Code verifier accuracy    24.9%    42.5%         +17.6pp
Code execution rate       37.9%    59.9%         +22.1pp
Agreement coverage        22.8%    38.1%         +15.2pp
Agreement accuracy        95.2%    95.7%         +0.5pp
Oracle accuracy           78.9%    82.4%         +3.4pp

Complementarity: On average, the code verifier solves 39.0 GSM8K problems per seed that SC@8 majority voting (8 samples, 4x compute) does not solve in our evaluation, which indicates that heterogeneous agents capture a complementary solution set.
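
The complementarity count can be pictured as a set difference per seed, averaged over seeds. The sketch below assumes that reading and also shows a plain majority vote of the kind SC@8 implies. All names and the input layout are hypothetical and are not taken from `scripts/complementarity_analysis.py`.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the samples (e.g. 8 for SC@8).

    Ties go to the answer that appeared first, which is how
    Counter.most_common orders equal counts.
    """
    return Counter(answers).most_common(1)[0][0]


def complementary_count(verifier_solved: Iterable[int], sc_solved: Iterable[int]) -> int:
    """Number of problem ids solved by the code verifier but not by SC voting."""
    return len(set(verifier_solved) - set(sc_solved))


def mean_complementarity(per_seed: List[Dict[str, Set[int]]]) -> float:
    """Average the per-seed complementary counts."""
    counts = [complementary_count(s["verifier"], s["sc"]) for s in per_seed]
    return sum(counts) / len(counts)


# Toy data: problem ids solved by each method in two seeds.
seeds = [
    {"verifier": {1, 2, 3}, "sc": {2, 3, 4}},
    {"verifier": {5, 6, 7, 8}, "sc": {5}},
]
print(majority_vote(["3", "4", "4", "5", "4", "3", "4", "4"]))
print(mean_complementarity(seeds))
```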

Notes

  • The code executor is a restricted local execution environment for research evaluation, not a production security sandbox.
  • Training requires ~16GB of VRAM for the 3B model and ~32GB for the 7B model (with 4-bit quantization).
  • The 7B experiments use resource-adjusted settings (see paper Appendix).
  • Adapter weights are not included; use the training script to reproduce.
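
To make the first note concrete, the sketch below shows one common way to run model-generated code with basic restrictions: a separate interpreter process, isolated mode, and a wall-clock timeout. It is not the implementation in `executor.py`; the function name, return shape, and default timeout are hypothetical, and like the repository's executor it offers no real security isolation.

```python
import subprocess
import sys
from typing import Dict, Union


def run_snippet(code: str, timeout_s: float = 5.0) -> Dict[str, Union[bool, str]]:
    """Run a Python snippet in a child interpreter with a timeout.

    This limits runtime and keeps crashes out of the parent process,
    but it does NOT restrict file or network access.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site and env vars
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "error": "timeout"}

    ok = proc.returncode == 0
    return {
        "ok": ok,
        "stdout": proc.stdout.strip(),
        "error": "" if ok else proc.stderr.strip(),
    }


result = run_snippet("print(6 * 7)")
print(result["ok"], result["stdout"])
```

Under this sketch, a metric such as the code execution rate would correspond to the fraction of snippets for which `ok` is true, though how the repository defines that metric is not stated here.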

Notes on Notebook Exports

The files in scripts/notebook_exports/ are archival Python exports of the experiment notebooks. Jupyter shell and magic commands have been commented out so that the files are syntactically valid Python, but the maintained entry points are:

  • scripts/train_magrpo.py
  • scripts/evaluate.py
  • scripts/reproduce_tables.py
