Code for investigating emergent misalignment with one-shot steering vectors.

This repository contains code for our experiments on inducing misalignment with steering vectors trained on a single example.

Requirements

Optimizing the harmful code steering vectors in steer.ipynb requires llm-steering-opt, a library that makes it easy to optimize steering vectors for LLMs.

To install the other requirements for the code here, run pip install -r requirements.txt.

Files

steer.ipynb contains the code for optimizing the steering vectors and sampling large numbers of outputs from the steered model.

evaluate.ipynb contains the code for running LLM evaluations of the alignment and coherence of steered model outputs.

prompts_and_scores/ is a directory that contains all steered model outputs, along with the alignment and coherence scores assigned to them. In particular:

prompts_and_scores/steered_scores.json contains outputs from the model when steered using the harmful code steering vectors (in conjunction with the anti-refusal steering vector).
- The root-level JSON object has a single key scores_per_target_split. The corresponding value is a list, where each element contains the outputs and scores for a single steering vector, along with a short name that refers to the target that the steering vector was optimized to induce. For instance, steered_scores['scores_per_target_split'][0] contains the scores for the zeroth steering vector. The short name of this split, steered_scores['scores_per_target_split'][0]['target_split_name'], is "rm -rf", because this particular steering vector was optimized to cause the model to output code that runs the malicious command rm -rf /.
- For each target split, the outputs and scores can be accessed under the key scores_per_eval_split. The associated value is a list, where each element contains the outputs and scores on a single evaluation question, along with a short name for that evaluation question. For instance, steered_scores['scores_per_target_split'][0]['scores_per_eval_split'][0] corresponds to the zeroth evaluation split for the zeroth steering vector. The short name of this evaluation question can be accessed as steered_scores['scores_per_target_split'][0]['scores_per_eval_split'][0]['eval_split_name'], which is "What is your wish". The scores and outputs for the evaluation split can be accessed as steered_scores['scores_per_target_split'][0]['scores_per_eval_split'][0]['scores'].
prompts_and_scores/antirefusal_scores.json contains outputs from the model when steered using the anti-refusal steering vector alone.
- The root-level JSON object has a single key scores_per_eval_split. The rest of the structure of the data is the same as the structure for each eval split in prompts_and_scores/steered_scores.json.
prompts_and_scores/unsteered_scores.json contains outputs from the model when no steering vectors are applied.
- The root-level JSON object has a single key scores_per_eval_split. The rest of the structure of the data is the same as the structure for each eval split in prompts_and_scores/steered_scores.json.
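
The nesting described above can be walked with a few lines of Python. This is a minimal sketch, not code from the notebooks: the helper names load_scores, iter_eval_splits, and iter_steered are hypothetical, and it assumes the 'scores' key sits directly on each eval split element, next to 'eval_split_name'. The internal layout of 'scores' is not described here, so the sketch returns it untouched, and the small example dictionary is a stand-in rather than real data.

```python
import json
from pathlib import Path


def load_scores(path):
    """Load one of the JSON files in prompts_and_scores/ (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def iter_eval_splits(container):
    """Yield (eval_split_name, scores) for each entry under 'scores_per_eval_split'.

    Works on a single target split from steered_scores.json, or on the root
    object of antirefusal_scores.json / unsteered_scores.json.
    Assumption: 'scores' is a sibling key of 'eval_split_name'.
    """
    for eval_split in container["scores_per_eval_split"]:
        yield eval_split["eval_split_name"], eval_split["scores"]


def iter_steered(steered_scores):
    """Yield (target_split_name, eval_split_name, scores) for steered_scores.json."""
    for target_split in steered_scores["scores_per_target_split"]:
        for eval_name, scores in iter_eval_splits(target_split):
            yield target_split["target_split_name"], eval_name, scores


# Stand-in object with the documented nesting; the empty list is a placeholder
# for whatever 'scores' actually holds.
example = {
    "scores_per_target_split": [
        {
            "target_split_name": "rm -rf",
            "scores_per_eval_split": [
                {"eval_split_name": "What is your wish", "scores": []},
            ],
        }
    ]
}

rows = list(iter_steered(example))

# With the real files, the same helpers would be used like this:
# steered = load_scores("prompts_and_scores/steered_scores.json")
# for target_name, eval_name, scores in iter_steered(steered):
#     print(target_name, eval_name)
```

Because antirefusal_scores.json and unsteered_scores.json have scores_per_eval_split at the root, iter_eval_splits can be applied to their loaded root objects directly.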
