Multi-Agent Adversarial Game for Robust LLM Safety
✨ Key Contributions:
• Asymmetric attacker–defender game
· Sequential (attacker → defender), not symmetric self-play
· Avoids gradient conflict in shared-parameter setups
• Co-evolving online training
· Attacker continuously discovers long-tail vulnerabilities
· Defender learns pointwise-safe responses
• Attack Pool Benchmark with CoT
· 20 rewriting strategies
· Solves cold-start for automated red-teaming
• Strong empirical results
· Lower attack success rate (ASR) across HarmBench, WildGuardTest, DAN, and X-Teaming
· Minimal degradation on general capabilities
🧠 Co-evolving attacker & defender
🎯 Subgame Perfect Nash Equilibrium (SPNE) formulation (sketched below)
🔄 Online MARL with GRPO
🧪 Strong single-turn & multi-turn safety improvements
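A hedged reading of the SPNE formulation, in our own notation rather than necessarily the paper's: the attacker moves first with a rewritten prompt $x$ and the defender replies with $y$, so subgame perfection asks the defender to best-respond at every attacker prompt (the "pointwise-safe" property above), with the attacker optimal against that response policy:

$$
\pi_D^{*}(\cdot \mid x) \in \arg\max_{\pi_D} \; \mathbb{E}_{y \sim \pi_D(\cdot \mid x)}\big[R_D(x, y)\big] \;\; \text{for every } x,
\qquad
\pi_A^{*} \in \arg\max_{\pi_A} \; \mathbb{E}_{x \sim \pi_A,\; y \sim \pi_D^{*}(\cdot \mid x)}\big[R_A(x, y)\big],
$$

where $R_D$ scores response safety and $R_A$ scores attack success.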
We build our training pipeline on: https://github.com/ziyuwan/ReMA-public
MAGIC operates in two phases:
Phase 1: Offensive Capability Initialization
• SFT attacker on CoT-enriched Attack Pool
• Enables reasoning-based adversarial rewriting
Phase 2: Iterative Co-evolution
• Alternate optimization:
· Fix attacker → train defender (GRPO)
· Fix defender → train attacker (GRPO)
• Approximates bilevel best-response dynamics (sketched below)
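Read as best-response dynamics (again a sketch in our notation, not the paper's exact objective), iteration $t$ of Phase 2 alternates

$$
\pi_D^{(t+1)} \approx \operatorname{BR}\big(\pi_A^{(t)}\big), \qquad \pi_A^{(t+1)} \approx \operatorname{BR}\big(\pi_D^{(t+1)}\big),
$$

where each best response $\operatorname{BR}(\cdot)$ is approximated by a fixed budget of GRPO updates against the frozen opponent rather than solved exactly.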
We recommend CUDA 12.4 + torch 2.6 + Python 3.10.
conda create -n magic python=3.10.0
conda activate magic
# Option A (faster): install flash-attn from a prebuilt wheel
# wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl --no-build-isolation
# Option B: install flash-attn via pip (builds from source)
pip install flash-attn==2.7.4.post1 --no-build-isolation
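# (optional sanity check -- our suggestion, not part of the original instructions)
# both torch and flash_attn expose __version__
python -c "import torch, flash_attn; print(torch.__version__, flash_attn.__version__)"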
# for SFT
cd src/360-LLaMA-Factory
pip install -e .
# for RL
cd src/verl
pip install -e .
# update libs
pip install -r requirements.txt

To run the training, see scripts/rl/separated/grpo_public.sh as an example.
bash scripts/rl/separated/grpo_public.sh
# NOTE!!! (add at least the following lines to the bash file)
export WANDB_API_KEY=<API_KEY>
export WILDGUARD_API_ENDPOINT="http://<API URL>/v1"
export WORKSPACE=<ROOT_PATH>
export CHECKPOINT_DIR=<ROOT_PATH>/MAGIC/checkpoints
export MODEL_DEFENDER_BASE=$WORKSPACE/models/Qwen2.5-7B-Instruct
export MODEL_ATTACKER_SFT=$WORKSPACE/models/Qwen2.5-7B-Instruct
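# (optional, our suggestion) verify the judge endpoint responds before training;
# assumes an OpenAI-compatible server (e.g. vLLM) that exposes the /models route
curl "$WILDGUARD_API_ENDPOINT/models"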
Key arguments (a combined example appears after the Reward Judge list):
- Rollout:
· data.max_prompt_length: max prompt length for each turn's input
· data.max_response_length: max response length per turn
· actor_rollout_ref.rollout.stop_when_truncated=True: whether to stop when truncated, i.e. the last agent's response is cut off by the length limit
· actor_rollout_ref.rollout.max_num_turns: controls single-turn vs. multi-turn conversations; 1 for single-turn, 3 for multi-turn
- Training:
· algorithm.switch_agent.model_paths=[${Attacker_model_path},${Defender_model_path}]: model paths of the two agents (the order follows algorithm.switch_agent.agent_roles)
· algorithm.switch_agent.train_roles=[attacker,defender]: controls whether to run defender-only, attacker-only, or full MAGIC training
· algorithm.switch_agent.freq=15: how often training switches between agents
· SWITCH_ARGS: algorithm.switch_agent.mode=ratio for fixed-step switching; algorithm.switch_agent.mode=metric for ASR-aware switching
- Reward Judge:
· +reward_model.format_reward_roles=[attacker]: whether to enable the format reward and which agent uses it
· REWARD_HARM / REWARD_REFUSAL / REWARD_LABEL: control the reward values
· USE_Q_SAFE_LABEL_FOR_REFUSAL: 1 to judge the refusal reward with the attacker_prompt label; 0 to judge with the raw_prompt label
· REWARD_MODEL_USE_ANSWER_TAGS: 1 to extract the content within the defender's answer tags for safety evaluation; 0 to use the defender's entire response directly
· SAFETY_SCORE_MODE: classifier for Qwen3Guard or WildGuard; rule_api for GPT-4o (not yet integrated for testing)
· Other rewards: DEFENDER_QUALITY_API_ENDPOINT and REVISION_SCORE_ENDPOINT (not yet integrated for testing)
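A minimal sketch of how these overrides could be combined in a launch script. The entry point below is a placeholder and the numeric values are illustrative assumptions, not recommendations; copy the real invocation from scripts/rl/separated/grpo_public.sh. Only the key=value flags themselves come from the list above.

# NOTE: hypothetical entry point -- use the actual launcher from grpo_public.sh
python -m verl.separated_trainer.main_ppo \
    data.max_prompt_length=2048 \
    data.max_response_length=1024 \
    actor_rollout_ref.rollout.stop_when_truncated=True \
    actor_rollout_ref.rollout.max_num_turns=3 \
    algorithm.switch_agent.model_paths=[$MODEL_ATTACKER_SFT,$MODEL_DEFENDER_BASE] \
    algorithm.switch_agent.train_roles=[attacker,defender] \
    algorithm.switch_agent.freq=15 \
    algorithm.switch_agent.mode=ratio \
    +reward_model.format_reward_roles=[attacker]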
See src/verl/verl/separated_trainer/config/ppo_trainer.yaml for more details.
For evaluation, see ./eval/README.md.

If you find this work useful, please cite:

@article{wen2026magic,
title={MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety},
author={Wen, Xiaoyu and He, Zhida and Qi, Han and Wan, Ziyu and Wen, Ying and Zheng, Tianhang and Xu, Xingcheng and Lu, Chaochao and Zhang, Qiaosheng},
journal={arXiv preprint arXiv:2602.01539},
year={2026}
}
@article{wan2025rema,
title={ReMA: Learning to Meta-Think for LLMs with Multi-Agent Reinforcement Learning},
author={Wan, Ziyu and Li, Yunxiang and Wen, Xiaoyu and Song, Yan and Wang, Hanjing and Yang, Linyi and Schmidt, Mark and Wang, Jun and Zhang, Weinan and Hu, Shuyue and others},
journal={arXiv preprint arXiv:2503.09501},
year={2025}
}

