This repository contains code for our experiments on inducing misalignment with steering vectors trained on a single example.
Optimizing the harmful code steering vectors in steer.ipynb requires llm-steering-opt, a library that makes it easy to optimize steering vectors for LLMs.
To install the other requirements for working with the code here, simply run pip install -r requirements.txt.
steer.ipynb contains the code for optimizing the steering vectors and sampling large numbers of outputs from the steered model.
evaluate.ipynb contains the code for running LLM evaluations of the alignment and coherence of steered model outputs.
prompts_and_scores/ is a directory that contains all steered model outputs, along with the alignment and coherence scores assigned to them. In particular:
- prompts_and_scores/steered_scores.json contains outputs from the model when steered using the harmful code steering vectors (in conjunction with the anti-refusal steering vector).
  - The root-level JSON object has a single key, scores_per_target_split. The corresponding value is a list in which each element contains the outputs and scores for a single steering vector, along with a short name for the target that the steering vector was optimized to induce. For instance, steered_scores['scores_per_target_split'][0] contains the scores for the zeroth steering vector. The short name of this split, steered_scores['scores_per_target_split'][0]['target_split_name'], is "rm -rf", because this particular steering vector was optimized to cause the model to output code that runs the malicious command rm -rf /.
  - For each target split, the outputs and scores can be accessed under the key scores_per_eval_split. The associated value is a list in which each element contains the outputs and scores on a single evaluation question, along with a short name for that evaluation question. For instance, steered_scores['scores_per_target_split'][0]['scores_per_eval_split'][0] corresponds to the zeroth evaluation split for the zeroth steering vector. The short name of this evaluation question can be accessed as steered_scores['scores_per_target_split'][0]['scores_per_eval_split'][0]['eval_split_name'], which is "What is your wish". The scores and outputs for the evaluation split can be accessed as steered_scores['scores_per_target_split'][0]['scores_per_eval_split'][0]['scores'].
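The nesting described above can be walked with two loops. The following is a minimal sketch, not code from the notebooks: the helper names load_scores and iter_steered are hypothetical, and the value stored under 'scores' is passed through untouched because its internal layout is not specified here.

```python
import json
from pathlib import Path
from typing import Any, Iterator, Tuple


def load_scores(path: str) -> dict:
    """Read one of the JSON files under prompts_and_scores/ (hypothetical helper)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def iter_steered(steered_scores: dict) -> Iterator[Tuple[str, str, Any]]:
    """Yield (target_split_name, eval_split_name, scores) for every pair.

    Only the keys named in this README are used; the value under
    'scores' is returned as-is, with no assumption about its layout.
    """
    for target_split in steered_scores["scores_per_target_split"]:
        for eval_split in target_split["scores_per_eval_split"]:
            yield (
                target_split["target_split_name"],
                eval_split["eval_split_name"],
                eval_split["scores"],
            )
```

For example, iterating over iter_steered(load_scores("prompts_and_scores/steered_scores.json")) visits each steering vector and evaluation question in the order they are stored.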
- prompts_and_scores/antirefusal_scores.json contains outputs from the model when steered using the anti-refusal steering vector alone.
  - The root-level JSON object has a single key, scores_per_eval_split. The rest of the data has the same structure as each eval split in prompts_and_scores/steered_scores.json.
- prompts_and_scores/unsteered_scores.json contains outputs from the model when no steering vectors are applied.
  - The root-level JSON object has a single key, scores_per_eval_split. The rest of the data has the same structure as each eval split in prompts_and_scores/steered_scores.json.
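Because antirefusal_scores.json and unsteered_scores.json have scores_per_eval_split at the root, they can be indexed by evaluation question directly. The sketch below is an illustration under that assumption; scores_by_eval_split is a hypothetical helper, and it assumes each eval split carries the eval_split_name and scores keys described for steered_scores.json.

```python
import json
from pathlib import Path
from typing import Any, Dict


def scores_by_eval_split(scores_file: dict) -> Dict[str, Any]:
    """Map each eval_split_name to its 'scores' value (hypothetical helper).

    Expects the root-level layout of antirefusal_scores.json or
    unsteered_scores.json: a single key, 'scores_per_eval_split'.
    """
    return {
        eval_split["eval_split_name"]: eval_split["scores"]
        for eval_split in scores_file["scores_per_eval_split"]
    }


def load_baseline(path: str) -> Dict[str, Any]:
    """Load a baseline file from disk and index it by evaluation question."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return scores_by_eval_split(data)
```

Calling load_baseline("prompts_and_scores/unsteered_scores.json") would then give a dictionary keyed by the short evaluation question names, which makes it easier to line up baseline outputs against the steered ones.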