feat: implement personality steering plugin system #56
Vinay-Umrethe wants to merge 9 commits into p-e-w:master
Conversation
p-e-w
left a comment
Thanks for the PR. I have several comments:
First, I really like the idea of using a text classifier for steering, but it's a lot more complicated than it may appear at first glance. Heretic ablates along residual directions, and in order for that to be effective, the prompt datasets must elicit the response contrast we want to eliminate. That is, if you want to make a model more happy, you need a list of prompts that result in happy responses, and a list of prompts that result in unhappy/sad responses. Simply being able to tell the difference between happy and unhappy responses is not enough.
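To make the contrast requirement concrete: the standard difference-of-means abliteration recipe averages residual activations from each prompt set, takes the difference as the steering direction, and projects that direction out of weight matrices. A minimal numpy sketch with synthetic activations (shapes and data are illustrative, not Heretic's actual internals):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Mean residual activations from the two prompt sets (synthetic data).
# The direction is only meaningful if the two sets elicit contrasting
# responses -- here the contrast is injected along the first axis.
happy_residuals = rng.normal(size=(400, d_model)) + 2.0 * np.eye(d_model)[0]
sad_residuals = rng.normal(size=(400, d_model)) - 2.0 * np.eye(d_model)[0]

# Difference-of-means direction, normalized to unit length.
direction = happy_residuals.mean(axis=0) - sad_residuals.mean(axis=0)
direction /= np.linalg.norm(direction)

# Ablate: project the direction out of a weight matrix's output space,
# so no input can produce output along that direction.
W = rng.normal(size=(d_model, d_model))          # e.g. an o_proj matrix
W_ablated = W - np.outer(direction, direction) @ W

# Any output of the ablated matrix is now orthogonal to the direction.
x = rng.normal(size=d_model)
print(abs(direction @ (W_ablated @ x)))  # ~0 (up to float error)
```

The point of the comment is that this only works if `happy_residuals` and `sad_residuals` actually differ in the mean: with prompts that merely *classify* differently but elicit similar responses, the difference of means is near zero and the direction is noise.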
It is also unclear whether this approach would even work in general. Not every human semantic is necessarily associated with a single direction in residual space like refusals are. It's quite possible that happiness is a composite of several directions that cannot simply be reduced to their sum, and in that case, ablation would be ineffective. Basically, there could be several distinct clusters of residuals in latent space, each of which humans would associate with "happiness", yet the mean of those wouldn't correspond to any specific semantic.
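The multi-cluster concern can be illustrated numerically: if "happy" residuals split into two clusters along orthogonal directions, ablating their mean direction leaves most of each cluster's signal intact. A toy sketch (not from Heretic):

```python
import numpy as np

d_model = 64
e = np.eye(d_model)
d1, d2 = e[0], e[1]            # two distinct "happiness" directions

# The mean of the two clusters points between them.
mean_dir = (d1 + d2) / np.linalg.norm(d1 + d2)

def ablate(v, u):
    """Remove the component of v along unit direction u."""
    return v - (v @ u) * u

# Ablating along the mean leaves ~71% of cluster 1's signal intact.
leftover = np.linalg.norm(ablate(d1, mean_dir))
print(round(leftover, 3))  # 0.707
```

So even in the best case of two clean orthogonal clusters, projecting out their mean removes less than a third of either cluster's norm; with more clusters or correlated noise, the effect degrades further.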
Furthermore, what this PR implements isn't "plugins", despite using that term. A plugin, by definition, is something external that you can plug into the system, as implemented in #53. This PR simply implements two hardcoded evaluation modes. There is no way for the user to load an evaluation plugin, and the architecture would need to be very different for that.
src/heretic/config.py
Outdated
class DatasetSpecification(BaseModel):
    dataset: str = Field(
        description="Hugging Face dataset ID, or path to dataset on disk",
        default="mlabonne/harmless_alpaca",
    )
Defaults don't make sense here, because the class is used in different contexts.
Done, reverted.
Ok, I have now refactored the "Personality Steering" feature into a robust, dynamic plugin system, following your feedback from the previous review. Thank you for the clarification regarding "true" plugins. I have now implemented a full system where:
1. The default refusal logic is encapsulated in RefusalPlugin.
2. The new classifier logic resides in ClassifierPlugin.
3. Users can load custom plugins from any file path.
To verify the flexibility of this architecture, I successfully tested a custom regex.py plugin. This design significantly lowers the barrier for contribution: users can now easily implement and share their own creative evaluation objectives as single-file plugins without needing to touch the core code.
Example test run:
Loading model meta-llama/Llama-3.2-1B-Instruct...
Ok
* Transformer model with 16 layers
* Abliterable components:
* attn.o_proj: 1 matrices per layer
* mlp.down_proj: 1 matrices per layer
Loading good prompts from mlabonne/harmless_alpaca...
* 400 prompts loaded
Loading bad prompts from mlabonne/harmful_behaviors...
* 400 prompts loaded
Initializing plugin: regex.py
...
* Calculating initial scores...
* Initial score: 0.4600

I had used this command for the above:

heretic --model "meta-llama/Llama-3.2-1B-Instruct" \
  --plugin "regex.py" \
  --plugin-args '{"pattern": "the"}' \
  --n-trials 5 \
  --batch-size 128

(The --plugin-args value is just an example for this test; it can be customized based on the plugin file.)
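For context, a single-file plugin like the regex.py used above might look roughly as follows. The interface shown (a class with a `score_responses` method, constructed from the `--plugin-args` JSON) is a hypothetical sketch, since the PR's actual plugin API isn't reproduced in this thread:

```python
import re

class RegexPlugin:
    """Hypothetical single-file evaluation plugin: scores a batch of
    responses by the fraction that match a user-supplied pattern."""

    def __init__(self, pattern: str):
        # `pattern` would come from --plugin-args, e.g. {"pattern": "the"}.
        self.pattern = re.compile(pattern)

    def score_responses(self, responses: list[str]) -> float:
        if not responses:
            return 0.0
        hits = sum(1 for r in responses if self.pattern.search(r))
        return hits / len(responses)

plugin = RegexPlugin(pattern="the")
print(plugin.score_responses(["the cat", "a dog", "over the moon", "hi"]))  # 0.5
```

The optimizer would then minimize or maximize this score across trials, exactly as it does with the built-in refusal count.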
Sorry, but you're moving way too fast. This is a major architectural change and should be built bottom-up, starting with the overall plugin architecture (see #53). This needs discussion first. There's too much going on in this pull request to properly review it. It would be great if you could join the discussion in #53 so we can decide how plugins should actually work: what types of plugins there should be, where they should live, what they should be named, what their interfaces should look like, etc.
Yes, this was a prototype implementation for review, since I don't know what kind of plugin interface you expect as the standard for Heretic. We can now discuss that in order to develop it.
@p-e-w so have you decided on the interface design? Even a simple overview of what kind of interface and folder structure you expect would help.
No, I haven't decided that yet. I'm currently preparing for the 1.1.0 release, and afterwards I will look into that and other future features. |
Closing, since #53 has been making great progress on a suitable plugin interface. This personality steering feature can be implemented on top of that interface once that PR is merged.
So I've been working on this, as I saw you were kinda interested in plugins.
This PR generalizes the abliteration mechanism by introducing a modular plugin system, transforming Heretic from a dedicated refusal/censorship remover into a generic "personality tweaker" tool, while still keeping the main goal of uncensoring persistent and working.
The core logic for evaluating model responses has been decoupled from the evaluator.py class and moved into a new plugin.py. I have implemented two plugins:
* RefusalPlugin (the default, keeping Heretic's main goal of uncensoring), which preserves the original refusal detection behavior for backward compatibility.
* [NEW] ClassifierPlugin, which leverages Hugging Face's text-classification pipeline to score responses based on any arbitrary attribute (e.g., "joy", "toxicity", "formality"). This allows users to change the model's personality by optimizing for specific traits.
New configuration options have been added to config.py: --steering_mode refusal (default) or classifier (this feature), classifier_model (any text-classification model the user wants to use), and classifier_label (the trait the user wants to disable/ablate, like "informal").
Additionally, utils.py was updated to support loading local .txt files as datasets, facilitating easier creation of custom datasets the user might have, like informal.txt, formal.txt, etc.
This change opens the door for diverse model customization beyond simple uncensoring, while still maintaining uncensoring.
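As a rough illustration of the ClassifierPlugin idea: a real implementation would call Hugging Face's text-classification pipeline (transformers.pipeline("text-classification", model=...)). To keep this sketch self-contained and runnable, a trivial keyword-based stand-in takes the pipeline's place; the class name, method, and stand-in are hypothetical, not the PR's actual code:

```python
class ClassifierPlugin:
    """Hypothetical sketch: score responses by how strongly a classifier
    assigns them the target label (e.g. "joy", "toxicity", "formality")."""

    def __init__(self, classify, target_label: str):
        # `classify` plays the role of a Hugging Face text-classification
        # pipeline: text -> list of {"label": ..., "score": ...} dicts.
        self.classify = classify
        self.target_label = target_label

    def score_responses(self, responses):
        scores = []
        for text in responses:
            for result in self.classify(text):
                if result["label"] == self.target_label:
                    scores.append(result["score"])
        return sum(scores) / len(scores) if scores else 0.0

# Trivial stand-in for a real classifier, for demonstration only.
def toy_classifier(text):
    joyful = any(w in text.lower() for w in ("great", "yay", "love"))
    return [{"label": "joy", "score": 0.9 if joyful else 0.1}]

plugin = ClassifierPlugin(toy_classifier, target_label="joy")
print(plugin.score_responses(["I love this!", "This is terrible."]))  # 0.5
```

The optimizer would then drive this score up or down depending on whether the user wants to amplify or ablate the trait named by classifier_label.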
Benefits
The primary benefit of this system is surgical precision over the model's latent space, solving the "yes-man" problem often seen in standard abliteration.
Examples: Novel Writing
To illustrate the difference in precision:
1. Standard Aligned Model (ChatGPT):
2. Heretic-Abliterated Model (Targeting "Refusal"):
3. Heretic-Personality-Tweaked Model (Targeting "Moralizing/Preachiness"):
This seems to produce a more capable model compared to blunt-force refusal ablation, by hitting precisely what we want instead of the edges of residuals. The higher precision showed up as a lower KL divergence when this was used instead of normal ablation, meaning it hit more precisely what needed to be removed.
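The KL-divergence comparison works by measuring how far the modified model's next-token distribution drifts from the original's on the same prompts; lower divergence means less collateral damage. A minimal numpy sketch with synthetic logits (not real model output):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q):
    # KL(p || q): how much the modified distribution q diverges from p.
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
vocab = 32000
base_logits = rng.normal(size=vocab)   # original model's next-token logits
p = softmax(base_logits)

# Small perturbation (precise ablation) vs a large one (blunt ablation).
q_precise = softmax(base_logits + 0.01 * rng.normal(size=vocab))
q_blunt = softmax(base_logits + 0.5 * rng.normal(size=vocab))

print(kl_divergence(p, q_precise) < kl_divergence(p, q_blunt))  # True
```

An intervention that touches fewer directions perturbs the logits less, so its KL divergence from the original model stays smaller, which matches the lower-KL observation reported above.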
For better understanding, this image of the brain might help:
