Code and data for "Agreement as Certificate: Multi-Agent Group Relative Policy Optimization for Reliable Mathematical Reasoning" (Anonymous EMNLP 2026 Submission).

    src/magrpo/                        # Core library
        normalization.py               # Answer extraction and normalization
        executor.py                    # Restricted code execution environment
        rewards.py                     # Additive joint reward function
        aggregate.py                   # Result aggregation utilities
    scripts/
        train_magrpo.py                # MAGRPO training script
        evaluate.py                    # Multi-agent evaluation script
        reproduce_tables.py            # Reproduce paper tables from saved results
        complementarity_analysis.py    # SC@8 vs code verifier analysis
        notebook_exports/              # Converted experiment notebooks (for transparency)
    results/
        raw_summaries/                 # JSON summaries for all experiments
        analysis/                      # Complementarity and frontier analysis outputs
        tables/                        # Reproduced table outputs (including LaTeX)
    figures/                           # Paper figures (architecture, training effect, etc.)
    tests/                             # Unit tests
    paper_assets/plots/                # Additional experiment plots

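Install the package in editable mode with the training and test extras:

    pip install -e ".[train,test]"

Reproduce the paper tables from the saved results:

    python scripts/reproduce_tables.py

Expected output: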
- 3B GSM8K MAGRPO-best multi-agent accuracy: 76.32%
- Agreement-only selective accuracy: 95.70% at 38.08% coverage
- Complementarity: code verifier solves 39.0 problems/seed that SC@8 does not
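
The agreement-only figure pairs an accuracy with a coverage, which is how selective-prediction metrics are usually reported. The sketch below shows one way such a pair can be computed, assuming two agents per problem and the standard definitions (coverage is the fraction of problems on which the agents agree; selective accuracy is accuracy on that subset). The function name `agreement_selective_metrics` and the input layout are hypothetical and are not taken from `src/magrpo/`.

```python
from typing import Optional, Sequence, Tuple


def agreement_selective_metrics(
    answers_a: Sequence[Optional[str]],
    answers_b: Sequence[Optional[str]],
    gold: Sequence[str],
) -> Tuple[float, float]:
    """Return (coverage, selective_accuracy) for agreement-based abstention.

    Hypothetical helper: answers are assumed to be already normalized
    strings, with None meaning the agent produced no usable answer.
    """
    if not (len(answers_a) == len(answers_b) == len(gold)):
        raise ValueError("all inputs must have the same length")
    if len(gold) == 0:
        raise ValueError("at least one problem is required")

    covered = 0  # problems where both agents give the same non-empty answer
    correct = 0  # covered problems where the agreed answer matches gold
    for a, b, g in zip(answers_a, answers_b, gold):
        if a is not None and a == b:
            covered += 1
            if a == g:
                correct += 1

    coverage = covered / len(gold)
    # With no agreement at all there is nothing to score; report 0.0.
    selective_accuracy = correct / covered if covered else 0.0
    return coverage, selective_accuracy
```

Under these definitions, a high selective accuracy at moderate coverage means the system answers fewer problems but is rarely wrong when it does answer.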
Run the unit tests:

    pytest -v

Train MAGRPO on the 3B model:

    python scripts/train_magrpo.py \
        --model_name Qwen/Qwen2.5-3B-Instruct \
        --train_size 800 --dev_size 100 \
        --group_size 8 --epochs 2 --seed 42 \
        --output_dir outputs/seed42

Evaluate the trained adapters on GSM8K:

    python scripts/evaluate.py \
        --model_name Qwen/Qwen2.5-3B-Instruct \
        --adapter_dir outputs/seed42/best_adapters \
        --dataset gsm8k --seed 42

3B GSM8K results before and after MAGRPO training:

| Metric | Base | Post-MAGRPO | Delta |
|---|---|---|---|
| Multi-agent accuracy | 75.7% | 76.3% | +0.6pp |
| Code verifier accuracy | 24.9% | 42.5% | +17.6pp |
| Code execution rate | 37.9% | 59.9% | +22.1pp |
| Agreement coverage | 22.8% | 38.1% | +15.2pp |
| Agreement accuracy | 95.2% | 95.7% | +0.5pp |
| Oracle accuracy | 78.9% | 82.4% | +3.4pp |
Complementarity: The code verifier solves 39.0 GSM8K problems per seed on average that SC@8 majority voting (8 samples, 4x compute) does not solve in our evaluation, demonstrating that heterogeneous agents capture a complementary solution set.
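
A count of this kind can be sketched as a per-problem comparison between the code verifier's answer and the SC@8 majority answer. The code below is an illustration under stated assumptions, not the logic of `scripts/complementarity_analysis.py`: the names `majority_vote` and `complementary_count` are hypothetical, answers are assumed to be normalized strings, and ties in the vote are broken by first occurrence.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(samples: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none.

    Ties go to the answer that appears first, which is how
    Counter.most_common orders equal counts.
    """
    votes = Counter(s for s in samples if s is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def complementary_count(
    verifier_answers: Sequence[Optional[str]],
    sc_samples: Sequence[Sequence[Optional[str]]],
    gold: Sequence[str],
) -> int:
    """Count problems the verifier solves but the majority vote does not."""
    if not (len(verifier_answers) == len(sc_samples) == len(gold)):
        raise ValueError("all inputs must have the same length")
    count = 0
    for v, samples, g in zip(verifier_answers, sc_samples, gold):
        if v == g and majority_vote(samples) != g:
            count += 1
    return count
```

Running this once per seed and averaging the counts would give a per-seed figure of the kind reported above.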
- The code executor is a restricted local execution environment for research evaluation, not a production security sandbox.
- Training requires ~16GB VRAM for 3B, ~32GB for 7B (with 4-bit quantization).
- The 7B experiments use resource-adjusted settings (see paper Appendix).
- Adapter weights are not included; use the training script to reproduce.
The files in scripts/notebook_exports/ are archival Python exports of the experiment notebooks. Jupyter shell and magic commands have been commented out so that the files are syntactically valid Python, but the maintained entry points are:
- scripts/train_magrpo.py
- scripts/evaluate.py
- scripts/reproduce_tables.py