STAR: Strategy-driven Automatic Jailbreak Red-teaming For Large Language Model

Requirements

Python == 3.12.0
torch == 2.6.0
vllm == 0.8.5
transformers == 4.51.3

Usage

Trainer.py

Trainer.py implements the main reinforcement learning trainer, based on GRPO. It uses vLLM for efficient sampling and optimizes the attacker model to generate jailbreak prompts.
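
As a rough illustration of the group-relative step that GRPO is built around, the sketch below normalizes the rewards of one group of sampled completions into advantages. It is not taken from Trainer.py: the function name `group_relative_advantages`, the `eps` parameter, and the use of the population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one group's rewards to zero mean and (roughly) unit variance.

    Hypothetical helper, not the repository's API. `rewards` holds the scalar
    scores for all completions sampled from the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some implementations use sample std
    # eps keeps the division finite when every completion receives the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored by some reward model (made-up values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.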

steer/SVTrainer.py

steer/SVTrainer.py implements the steering vector trainer that learns activation-level perturbations to steer the model's hidden states toward strategy-specific behaviors.
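
To make "activation-level perturbation" concrete, the sketch below adds a scaled vector to every token's hidden state. It is a minimal stand-in on plain Python lists under stated assumptions, not the code in steer/SVTrainer.py: the function name `apply_steering`, the scaling factor `alpha`, and the additive form h + alpha * v are illustrative choices, and how the vector is learned is not shown.

```python
def apply_steering(hidden_states, steering_vector, alpha=1.0):
    """Return hidden states shifted by alpha * steering_vector at every token.

    Hypothetical helper for illustration only. `hidden_states` is a list of
    per-token vectors; `steering_vector` has the same hidden dimension.
    """
    for token in hidden_states:
        if len(token) != len(steering_vector):
            raise ValueError("hidden size and steering vector size differ")
    # Additive perturbation: h' = h + alpha * v, applied to each token position
    return [
        [h + alpha * v for h, v in zip(token, steering_vector)]
        for token in hidden_states
    ]


# Two tokens with hidden size 2 (made-up values).
steered = apply_steering([[1.0, 2.0], [0.0, 0.0]], [0.5, -1.0], alpha=2.0)
```

In a real model the same addition would typically be applied to a layer's output tensor during the forward pass; that wiring is omitted here.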
