
jacobdunefsky/one-shot-steering-misalignment


Code for investigating emergent misalignment with one-shot steering vectors.

This repository contains code for our experiments on inducing misalignment with steering vectors trained on a single example.

Requirements

Optimizing the harmful code steering vectors in steer.ipynb requires llm-steering-opt. (llm-steering-opt is a library that makes it easy to optimize steering vectors for LLMs.)

To install the other requirements for working with the code here, simply run pip install -r requirements.txt.

Files

steer.ipynb contains the code for optimizing the steering vectors and sampling large numbers of outputs from the steered model.

evaluate.ipynb contains the code for running LLM evaluations of the alignment and coherence of steered model outputs.

prompts_and_scores/ is a directory that contains all of the sampled model outputs (steered and unsteered), along with the alignment and coherence scores assigned to them. In particular:

  • prompts_and_scores/steered_scores.json contains outputs from the model when steered using the harmful code steering vectors (in conjunction with the anti-refusal steering vector).
    • The root-level JSON object has a single key scores_per_target_split. The corresponding value is a list, where each element contains the outputs and scores for a single steering vector, along with a short name that refers to the target that the steering vector was optimized to induce. For instance, steered_scores['scores_per_target_split'][0] contains the scores for the zeroth steering vector. The short name of this split, steered_scores['scores_per_target_split'][0]['target_split_name'], is "rm -rf", because this particular steering vector was optimized to cause the model to output code that runs the malicious command rm -rf /.
    • For each target split, the outputs and scores can be accessed under the key scores_per_eval_split. The associated value is a list, where each element contains the outputs and scores on a single evaluation question, along with a short name for that evaluation question. For instance, steered_scores['scores_per_target_split'][0]['scores_per_eval_split'][0] corresponds to the zeroth evaluation split for the zeroth steering vector. The short name of this evaluation question can be accessed as steered_scores['scores_per_target_split'][0]['scores_per_eval_split'][0]['eval_split_name'], which is "What is your wish". The scores and outputs for the evaluation split can be accessed as steered_scores['scores_per_target_split'][0]['scores_per_eval_split'][0]['scores'].
  • prompts_and_scores/antirefusal_scores.json contains outputs from the model when steered using the anti-refusal steering vector alone.
    • The root-level JSON object has a single key scores_per_eval_split. Its value is a list with the same structure as the scores_per_eval_split list of each target split in prompts_and_scores/steered_scores.json.
  • prompts_and_scores/unsteered_scores.json contains outputs from the model when no steering vectors are applied.
    • The root-level JSON object has a single key scores_per_eval_split. Its value is a list with the same structure as the scores_per_eval_split list of each target split in prompts_and_scores/steered_scores.json.
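
The nesting described above can be walked with a few lines of Python. The following is a minimal sketch rather than code from this repository: the helper names load_scores, iter_steered, and iter_single are hypothetical, and the in-memory example object is a tiny stand-in that only mirrors the documented key names. The value stored under 'scores' is treated as opaque, since its inner layout is not described here; the empty list used below is a placeholder.

```python
import json
from pathlib import Path


def load_scores(path):
    """Load one of the JSON score files into a Python dict (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def iter_steered(steered_scores):
    """Yield (target_split_name, eval_split_name, scores) triples.

    Expects the layout documented for steered_scores.json:
    target splits at the top level, eval splits nested inside each one.
    """
    for target in steered_scores["scores_per_target_split"]:
        for eval_split in target["scores_per_eval_split"]:
            yield (
                target["target_split_name"],
                eval_split["eval_split_name"],
                eval_split["scores"],
            )


def iter_single(scores):
    """Yield (eval_split_name, scores) pairs.

    Expects the layout documented for antirefusal_scores.json and
    unsteered_scores.json: a single top-level scores_per_eval_split list.
    """
    for eval_split in scores["scores_per_eval_split"]:
        yield eval_split["eval_split_name"], eval_split["scores"]


# Stand-in data with the documented key names. The empty "scores" list is a
# placeholder (assumption): the real files hold the outputs and scores there.
example = {
    "scores_per_target_split": [
        {
            "target_split_name": "rm -rf",
            "scores_per_eval_split": [
                {"eval_split_name": "What is your wish", "scores": []},
            ],
        }
    ]
}

for target_name, eval_name, scores in iter_steered(example):
    print(target_name, "|", eval_name, "|", type(scores).__name__)
```

With the repository checked out, the same iteration should apply to the real data via load_scores("prompts_and_scores/steered_scores.json"), and iter_single should apply to the other two files, assuming they match the layout described above.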

About

Code and results on finding one-shot steering vectors that mediate emergent misalignment
