MAGRPO

Code and data for "Agreement as Certificate: Multi-Agent Group Relative Policy Optimization for Reliable Mathematical Reasoning" (Anonymous EMNLP 2026 Submission).

Repository Structure

src/magrpo/           # Core library
  normalization.py    # Answer extraction and normalization
  executor.py         # Restricted code execution environment
  rewards.py          # Additive joint reward function
  aggregate.py        # Result aggregation utilities

scripts/
  train_magrpo.py     # MAGRPO training script
  evaluate.py         # Multi-agent evaluation script
  reproduce_tables.py # Reproduce paper tables from saved results
  complementarity_analysis.py  # SC@8 vs code verifier analysis
  notebook_exports/   # Converted experiment notebooks (for transparency)

results/
  raw_summaries/      # JSON summaries for all experiments
  analysis/           # Complementarity and frontier analysis outputs
  tables/             # Reproduced table outputs (including LaTeX)

figures/              # Paper figures (architecture, training effect, etc.)
tests/                # Unit tests
paper_assets/plots/   # Additional experiment plots
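
For readers unfamiliar with what answer extraction and normalization usually involve on GSM8K-style outputs, here is a minimal sketch. It is not the code in normalization.py: the function name `normalize_answer`, the regular expression, and the "take the last number in the text" rule are all assumptions made for illustration.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Hypothetical pattern: an optional sign, digits with optional thousands
# separators, and an optional decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalize_answer(text: str) -> Optional[str]:
    """Return the last number found in `text` in canonical form, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    raw = matches[-1].replace(",", "")  # drop thousands separators
    try:
        value = Decimal(raw)
    except InvalidOperation:
        return None
    # Integral values are printed without a decimal point so that
    # "42", "42.0" and "42.00" all compare equal as strings.
    if value == value.to_integral_value():
        return str(int(value))
    return str(value.normalize())
```

Canonicalizing to a string makes it cheap to compare answers from different agents by equality, which is what an agreement check needs.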

Quick Start

Install

pip install -e ".[train,test]"

Reproduce Paper Tables

python scripts/reproduce_tables.py

Expected output:

3B GSM8K MAGRPO-best multi-agent accuracy: 76.32%
Agreement-only selective accuracy: 95.70% at 38.08% coverage
Complementarity: code verifier solves 39.0 problems/seed that SC@8 does not

Run Tests

pytest -v

Training (requires GPU)

python scripts/train_magrpo.py \
  --model_name Qwen/Qwen2.5-3B-Instruct \
  --train_size 800 --dev_size 100 \
  --group_size 8 --epochs 2 --seed 42 \
  --output_dir outputs/seed42
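
As background for the `--group_size` flag, the sketch below shows the group-relative advantage used in standard GRPO: each sampled completion's reward is centered on its group mean and scaled by the group standard deviation. It illustrates the general technique only. The function name, the `eps` constant, and the use of the population standard deviation are assumptions, and the reward actually used by train_magrpo.py (the additive joint reward in rewards.py) is not reproduced here.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize the rewards of one sampled group to zero mean, unit scale.

    `eps` (an assumed value) keeps the division finite when every
    completion in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: a group of 8 completions, 3 of which earned a reward of 1.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
```

Because advantages are computed within a group, a group whose completions all receive the same reward contributes (near) zero gradient signal.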

Evaluation

python scripts/evaluate.py \
  --model_name Qwen/Qwen2.5-3B-Instruct \
  --adapter_dir outputs/seed42/best_adapters \
  --dataset gsm8k --seed 42

Key Results (3-seed mean, GSM8K full test, n=1319)

Metric	Base	Post-MAGRPO	Delta
Multi-agent accuracy	75.7%	76.3%	+0.6pp
Code verifier accuracy	24.9%	42.5%	+17.6pp
Code execution rate	37.9%	59.9%	+22.1pp
Agreement coverage	22.8%	38.1%	+15.2pp
Agreement accuracy	95.2%	95.7%	+0.5pp
Oracle accuracy	78.9%	82.4%	+3.4pp
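
The agreement and oracle rows can be read with the following sketch in mind. It assumes a two-agent setup in which "agreement" means both agents return the same non-empty normalized answer, "coverage" is the fraction of problems with agreement, and "oracle" counts a problem as solved if either agent is correct. These definitions and the function name `agreement_metrics` are assumptions for illustration; the paper and evaluate.py are the authoritative source.

```python
from typing import Dict, Optional, Sequence


def agreement_metrics(
    answers_a: Sequence[Optional[str]],
    answers_b: Sequence[Optional[str]],
    gold: Sequence[str],
) -> Dict[str, float]:
    """Compute coverage, agreement accuracy and oracle accuracy."""
    n = len(gold)
    if n == 0 or not (len(answers_a) == len(answers_b) == n):
        raise ValueError("inputs must be non-empty and of equal length")

    # Problems on which both agents produced the same non-empty answer.
    agreed = [
        i for i in range(n)
        if answers_a[i] is not None and answers_a[i] == answers_b[i]
    ]
    agreed_correct = sum(1 for i in agreed if answers_a[i] == gold[i])
    # Oracle: at least one of the two agents is correct.
    oracle_correct = sum(
        1 for i in range(n)
        if answers_a[i] == gold[i] or answers_b[i] == gold[i]
    )
    return {
        "coverage": len(agreed) / n,
        "agreement_accuracy": agreed_correct / len(agreed) if agreed else 0.0,
        "oracle_accuracy": oracle_correct / n,
    }
```

Under these definitions, agreement accuracy is a selective metric: it is measured only on the covered subset, which is why it can sit far above the overall multi-agent accuracy.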

Complementarity: On average, the code verifier solves 39.0 GSM8K problems per seed that SC@8 majority voting (8 samples, 4x compute) does not solve in our evaluation, which indicates that heterogeneous agents capture a complementary solution set.
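
A complementarity count of this kind can be sketched as a set difference for a single seed: the problems the code verifier answers correctly, minus those that majority voting over the self-consistency samples answers correctly. The helper names, the dictionary layout keyed by problem id, and the tie-breaking rule are assumptions for illustration and may differ from complementarity_analysis.py.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def majority_vote(samples: List[Optional[str]]) -> Optional[str]:
    """Most frequent non-empty answer; ties go to the answer seen first."""
    valid = [s for s in samples if s is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def complementary_ids(
    verifier_answers: Dict[str, Optional[str]],
    sc_samples: Dict[str, List[Optional[str]]],
    gold: Dict[str, str],
) -> Set[str]:
    """Ids solved by the code verifier but not by the majority vote."""
    return {
        pid for pid, target in gold.items()
        if verifier_answers.get(pid) == target
        and majority_vote(sc_samples.get(pid, [])) != target
    }
```

Averaging `len(complementary_ids(...))` over seeds gives a per-seed mean comparable in form to the figure reported above.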

Notes

The code executor is a restricted local execution environment for research evaluation, not a production security sandbox.
Training requires ~16GB VRAM for 3B, ~32GB for 7B (with 4-bit quantization).
The 7B experiments use resource-adjusted settings (see paper Appendix).
Adapter weights are not included; use the training script to reproduce.
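
To make the first note concrete, the sketch below shows one common way to run model-generated code with light restrictions: a separate interpreter process in isolated mode, a scratch working directory, and a wall-clock timeout. It is not the implementation in executor.py, and the function name `run_snippet` and the default timeout are assumptions. Like the repository's executor, it offers no real isolation from the file system or network.

```python
import subprocess
import sys
import tempfile
from typing import Optional


def run_snippet(code: str, timeout_s: float = 5.0) -> Optional[str]:
    """Run `code` in a child Python process; return stripped stdout or None.

    None is returned on a timeout or a non-zero exit status.
    """
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                # -I: isolated mode (ignores PYTHON* env vars and user site dir)
                [sys.executable, "-I", "-c", code],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=workdir,  # scratch directory removed afterwards
            )
        except subprocess.TimeoutExpired:
            return None
    if proc.returncode != 0:
        return None
    return proc.stdout.strip()
```

For example, `run_snippet("print(6 * 7)")` returns the string "42", while a snippet that raises or loops past the timeout returns None.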

Notes on Notebook Exports

The files in scripts/notebook_exports/ are archival Python exports of the experiment notebooks. Jupyter shell and magic commands have been commented out so that the files are syntactically valid Python, but they are not maintained. The maintained entry points are:

scripts/train_magrpo.py
scripts/evaluate.py
scripts/reproduce_tables.py
