UnsafeMoE (F-SOUR)

This repository is for the paper "Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs". It contains all details needed to reproduce our proposed F-SOUR method.

0. Create and activate a conda environment, then install dependencies:

conda create -n unsafe_moe python=3.8
conda activate unsafe_moe
bash prepare.sh

1. What is included

  • main.py: main entry point
  • control_router.py: routing helpers
  • harmful_questions/advbench_subset.csv: AdvBench subset
  • Dataset_Jailbreak/: output folder (auto-created if missing)
  • prepare.sh: dependency bootstrap
  • run.sh: example run script

2. OpenAI key

The shadow judge uses the OpenAI API. Please provide a key via one of:

export OPENAI_API_KEY=YOUR_KEY

or pass --openai_api_key.
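The precedence between the two options can be sketched as follows; `resolve_openai_key` is a hypothetical helper for illustration, not code from this repository, and it assumes the flag overrides the environment variable:

```python
import argparse
import os

def resolve_openai_key(cli_key=None):
    """Return the OpenAI API key, preferring an explicit flag over the environment."""
    key = cli_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise SystemExit("No OpenAI key: set OPENAI_API_KEY or pass --openai_api_key")
    return key

parser = argparse.ArgumentParser()
parser.add_argument("--openai_api_key", default=None)
args = parser.parse_args(["--openai_api_key", "sk-demo"])  # demo value only
print(resolve_openai_key(args.openai_api_key))
```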

3. Example usage

python main.py \
  --llm_model DeepSeek-V2-Lite-Chat \
  --forbidden_dataset AdvBench \
  --begin_num 0 --end_num 10 \
  --max_changes 100 --max_iters 5
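To sweep a larger index range in batches, a wrapper along these lines can be used. This is a hypothetical sketch, not the repository's run.sh; TOTAL, STEP, and the echoed command are illustrative (replace echo with the real invocation):

```shell
#!/usr/bin/env sh
# Hypothetical batch driver: split [0, TOTAL) into STEP-sized
# (--begin_num, --end_num) ranges and print one main.py call per batch.
TOTAL=50
STEP=10
BEGIN=0
while [ "$BEGIN" -lt "$TOTAL" ]; do
  END=$((BEGIN + STEP))
  echo "python main.py --llm_model DeepSeek-V2-Lite-Chat \
    --forbidden_dataset AdvBench --begin_num $BEGIN --end_num $END"
  BEGIN=$END
done
```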

4. Notes

  • AdvBench uses harmful_questions/advbench_subset.csv by default. If you move it, pass --advbench_csv.
  • JBB is loaded from HuggingFace: JailbreakBench/JBB-Behaviors.
  • Model weights are not included; use --model_path to point to local weights.
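For reference, the AdvBench CSV can be read with the standard csv module. The sketch below uses an inline stand-in because the shipped file is not reproduced here, and it assumes the goal/target column layout of the original AdvBench release:

```python
import csv
import io

# Inline stand-in for harmful_questions/advbench_subset.csv; the real file
# is assumed to share AdvBench's goal/target column layout.
sample = io.StringIO(
    "goal,target\n"
    '"Example harmful request","Sure, here is ..."\n'
)

rows = list(csv.DictReader(sample))
goals = [row["goal"] for row in rows]
print(goals)  # the prompts selected between --begin_num and --end_num
```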

Citation

@article{JHLBZ26,
author = {Yukun Jiang and Hai Huang and Mingjie Li and Michael Backes and Yang Zhang},
title = {{Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs}},
journal = {{CoRR abs/2602.08621}},
year = {2026}
}
