Causal Representation Learning Approach for Unlearning Toxic Content in LLMs

Dataset

Dataset 1: RealToxicityPrompts [Dataset Link] | [Paper]

Run ROME with GPT-2

Notes:

  • The entire codebase for ROME lives in the directory rome-main, cloned from the repository for Locating and Editing Factual Associations in GPT (NeurIPS 2022).
  • rome-main/trace_main.py is the main script for running a vanilla causal-tracing example on one of the datasets from the original paper.
  • The directory rome-main/dsets contains the datasets used by the original codebase. This is where we add the RealToxicityPrompts dataset and load it from for inference; it is in the file rome-main/dsets/realtoxicityprompts.py (a minimal loading sketch follows below).
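Since the existing loader isn't reproduced here, below is a minimal sketch of what a loader in rome-main/dsets/realtoxicityprompts.py could look like. It assumes the Hugging Face Hub copy of the dataset (allenai/real-toxicity-prompts) and a plain torch Dataset interface; the class name, constructor argument, and record layout are assumptions to check against the actual file and the repo's other dset classes.

```python
# Illustrative sketch only; the actual rome-main/dsets/realtoxicityprompts.py
# may expose a different interface (e.g., mirroring the repo's other dset classes).
from datasets import load_dataset
from torch.utils.data import Dataset


class RealToxicityPromptsDataset(Dataset):
    """Loads RealToxicityPrompts prompts, optionally only the 'challenging' ones."""

    def __init__(self, challenging_only: bool = True):
        # Assumes the Hugging Face Hub mirror of the dataset.
        raw = load_dataset("allenai/real-toxicity-prompts", split="train")
        if challenging_only:
            # 'challenging' is a boolean flag in the original release.
            raw = raw.filter(lambda record: record["challenging"])
        # The prompt text sits under record["prompt"]["text"].
        self.prompts = [record["prompt"]["text"] for record in raw]

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        return self.prompts[idx]
```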

For William:

  • Refer to the RealToxicityPrompts paper to determine which model they used and pick one of the GPT-2 variants to run ROME with the prompts from the dataset.
  • Use the "challenging" subset of prompts, i.e., records where dataset["challenging"] == True.
  • For our use case, only the causal-tracing part is needed right now; we don't need to worry about the editing part yet (see the sketch after this list).
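As a starting point, here is a hedged sketch that filters the challenging subset and runs ROME-style causal tracing over a GPT-2 variant. ModelAndTokenizer and calculate_hidden_flow are the utilities from the upstream ROME repository's experiments/causal_trace.py (the import assumes running from inside rome-main); the gpt2-xl choice and the "subject" span to corrupt are placeholder assumptions to check against the RealToxicityPrompts paper and our trace_main.py.

```python
# Hedged sketch: challenging-subset prompts + ROME-style causal tracing on GPT-2.
# Run from inside rome-main so the experiments/ package is importable.
from datasets import load_dataset
from experiments.causal_trace import ModelAndTokenizer, calculate_hidden_flow

# Placeholder variant; check the RealToxicityPrompts paper for the GPT-2 size they used.
mt = ModelAndTokenizer("gpt2-xl")

data = load_dataset("allenai/real-toxicity-prompts", split="train")
challenging = data.filter(lambda record: record["challenging"])

for record in challenging.select(range(10)):  # small slice for a first sanity check
    prompt = record["prompt"]["text"]
    # Causal tracing corrupts a "subject" span inside the prompt; which span to
    # corrupt for toxic prompts is an open design choice (naively, the last word here).
    subject = prompt.strip().split()[-1]
    result = calculate_hidden_flow(mt, prompt, subject)
    # `result` holds the per-token / per-layer restoration effects used for the trace plots.
    print(prompt)
```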