
feat: implement personality steering plugin system#56

Closed

Vinay-Umrethe wants to merge 9 commits into p-e-w:master from Vinay-Umrethe:feat/personality-steering-plugin-v3

Conversation

Contributor

@Vinay-Umrethe Vinay-Umrethe commented Nov 29, 2025

I've been working on this since I saw you were interested in plugins.

This PR generalizes the abliteration mechanism by introducing a modular plugin system, turning Heretic from a dedicated refusal/censorship remover into a generic "personality tweaker" tool, while keeping the main decensoring goal intact and working.

The core logic for evaluating model responses has been decoupled from evaluator.py and moved into a new plugin.py.

I have implemented two plugins. The first is RefusalPlugin (Heretic's default decensoring goal), which preserves the original refusal-detection behavior for backward compatibility.

[NEW] ClassifierPlugin leverages Hugging Face's text-classification pipeline to score responses on any arbitrary attribute (e.g., "joy", "toxicity", "formality"). This lets users change the model's personality by optimizing for specific traits: for example, making a model "happier" by ablating the "sadness" direction, or "more professional" by ablating "informality".
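A minimal sketch of the scoring step this plugin performs, assuming the standard Hugging Face text-classification output format; the helper function and the data below are illustrative, not the PR's actual code:

```python
# Sketch of the ClassifierPlugin scoring step. `results` has the standard
# Hugging Face text-classification shape for one input when top_k=None:
# a list of {"label": ..., "score": ...} dicts.

def label_score(results: list[dict], target_label: str) -> float:
    """Return the probability the classifier assigns to the trait being ablated."""
    for entry in results:
        if entry["label"] == target_label:
            return float(entry["score"])
    return 0.0

# In the real plugin, `results` would come from something like:
#   clf = pipeline("text-classification", model=..., top_k=None)
#   results = clf(response)[0]
fake_results = [
    {"label": "joy", "score": 0.12},
    {"label": "sadness", "score": 0.81},
]
print(label_score(fake_results, "sadness"))  # 0.81
```

The fake results above stand in for a real pipeline call so the sketch stays self-contained.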

New configuration options have been added to config.py: --steering_mode (refusal, the default, or classifier), classifier_model (any text-classification model the user wants to use), and classifier_label (the trait the user wants to disable/ablate, e.g. informal).
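A hedged sketch of those options as a plain dataclass; Heretic's actual config.py may use a different mechanism, and the defaults and types here are assumptions based only on the flags listed above:

```python
# Illustrative container for the new steering options described in the PR.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SteeringConfig:
    steering_mode: str = "refusal"          # "refusal" (default) or "classifier"
    classifier_model: Optional[str] = None  # any HF text-classification model ID
    classifier_label: Optional[str] = None  # trait to ablate, e.g. "informal"

cfg = SteeringConfig(steering_mode="classifier", classifier_label="informal")
print(cfg.steering_mode)  # classifier
```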

Additionally, utils.py was updated to support loading local .txt files as datasets, making it easier to use custom datasets the user may have created, such as informal.txt or formal.txt.

This change opens the door to diverse model customization beyond simple decensoring, while still maintaining decensoring.

Benefits

The primary benefit of this system is surgical precision over the model's latent space, solving the "yes-man" problem often seen in standard abliteration.

Example: Novel Writing

To illustrate the difference in precision:

User: "Write a romance novel scene with explicit intimacy and interpersonal conflict."

1. Standard Aligned Model (ChatGPT):

  • Response: "I cannot generate explicit content."
  • Result: Complete Refusal.

2. Heretic-Abliterated Model (Targeting "Refusal"):

  • Mechanism: The concept of "Refusal" (saying "No") is removed...
  • Response: The model complies immediately and enthusiastically. However, because the concept of "denial" is damaged, the characters in the story may lack agency. They agree to everything instantly.
  • Result: A flat, boring story where characters are "Yes-Men" with no tension or conflict for creativity.

3. Heretic Personality-Tweaked Model (Targeting "Moralizing/Preachiness"):

  • Mechanism: The specific concept of "Moralizing" is removed, but the concept of "Refusal" remains intact.
  • Response: The model complies with the explicit request (no moralizing). Crucially, it retains the ability to write characters who say "No" to each other. The protagonists can have realistic arguments, initial rejections, and dramatic tension.
  • Result: A creative, engaging, and realistic story that is uncensored yet narratively rich.

This appears to produce a more capable model than blunt-force refusal ablation, by hitting precisely the concept we want instead of the edges of refusals. The higher precision showed up empirically: KL divergence was lower with this approach than with normal ablation, meaning the intervention stayed closer to the original model while still removing what needed to go.


@p-e-w p-e-w left a comment


Thanks for the PR. I have several comments:

First, I really like the idea of using a text classifier for steering, but it's a lot more complicated than it may appear at first glance. Heretic ablates along residual directions, and in order for that to be effective, the prompt datasets must elicit the response contrast we want to eliminate. That is, if you want to make a model more happy, you need a list of prompts that result in happy responses, and a list of prompts that result in unhappy/sad responses. Simply being able to tell the difference between happy and unhappy responses is not enough.
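In code terms, the standard abliteration direction is a difference of means over residuals from the two prompt sets, which is why the contrast between the sets is what matters (a toy numpy sketch with assumed shapes, not Heretic's actual implementation):

```python
# Toy difference-of-means direction. If both prompt sets elicit the same
# behavior, the two means coincide and the "direction" degenerates to noise.
import numpy as np

def ablation_direction(elicit_resids: np.ndarray,
                       baseline_resids: np.ndarray) -> np.ndarray:
    """Unit-length difference-of-means direction between residuals
    (each array has shape [n_prompts, hidden_dim])."""
    d = elicit_resids.mean(axis=0) - baseline_resids.mean(axis=0)
    return d / np.linalg.norm(d)
```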

It is also unclear whether this approach would even work in general. Not every human semantic is necessarily associated with a single direction in residual space like refusals are. It's quite possible that happiness is a composite of several directions that cannot simply be reduced to their sum, and in that case, ablation would be ineffective. Basically, there could be several distinct clusters of residuals in latent space, each of which humans would associate with "happiness", yet the mean of those wouldn't correspond to any specific semantic.

Furthermore, what this PR implements isn't "plugins", despite using that term. A plugin, by definition, is something external that you can plug into the system, as implemented in #53. This PR simply implements two hardcoded evaluation modes. There is no way for the user to load an evaluation plugin, and the architecture would need to be very different for that.

class DatasetSpecification(BaseModel):
    dataset: str = Field(
        description="Hugging Face dataset ID, or path to dataset on disk",
        default="mlabonne/harmless_alpaca",
    )
Owner


Defaults don't make sense here, because the class is used in different contexts.

Contributor Author


Done, reverted.

@Vinay-Umrethe
Contributor Author

OK, I've now refactored the "Personality Steering" feature into a robust, dynamic plugin system, per your feedback from the previous review.

Thank you for the clarification regarding "True" plugins. I have now implemented a full system where:

1. The default refusal logic is encapsulated in plugins/refusal.py

2. The new classifier logic resides in plugins/classifier.py

3. Users can load custom plugins from any file path.

To verify the flexibility of this architecture, I successfully tested a custom regex.py plugin generated by an LLM (using only the [NEW] docs/plugin_guide.md as context). It worked seamlessly without any modifications to the core codebase.

This design significantly lowers the barrier for contribution. Users can now easily implement and share their own evaluation objectives as single-file plugins without needing to touch main.py or evaluator.py. I believe this will enable a wide variety of community-contributed plugins in the future.
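For illustration, a single-file plugin of the kind tested here might look like the following. The class name and score() method are hypothetical, since the actual plugin interface is exactly what #53 is still deciding:

```python
# regex.py -- hypothetical single-file plugin; interface is illustrative.
import re

class RegexPlugin:
    """Scores a response 1.0 if the configured pattern matches, else 0.0."""

    def __init__(self, pattern: str):
        self.pattern = re.compile(pattern)

    def score(self, response: str) -> float:
        return 1.0 if self.pattern.search(response) else 0.0

plugin = RegexPlugin("the")
print(plugin.score("the quick brown fox"))  # 1.0
print(plugin.score("quick brown fox"))      # 0.0
```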

Example Test Run

Loading model meta-llama/Llama-3.2-1B-Instruct...
Ok
* Transformer model with 16 layers
* Abliterable components:
  * attn.o_proj: 1 matrices per layer
  * mlp.down_proj: 1 matrices per layer

Loading good prompts from mlabonne/harmless_alpaca...
* 400 prompts loaded

Loading bad prompts from mlabonne/harmful_behaviors...
* 400 prompts loaded
Initializing plugin: regex.py
...
...
....
* Calculating initial scores...
* Initial score: 0.4600

Command used for the run above:

!heretic --model "meta-llama/Llama-3.2-1B-Instruct" \
    --plugin "regex.py" \
    --plugin-args '{"pattern": "the"}' \
    --n-trials 5 \
    --batch-size 128

(The pattern "the" is just an example test value; it can be anything the plugin accepts.)

@Vinay-Umrethe Vinay-Umrethe requested a review from p-e-w December 1, 2025 18:28

p-e-w commented Dec 2, 2025

Sorry, but you're moving way too fast.

This is a major architectural change and should be built bottom-up, starting with the overall plugin architecture (see #53). This needs discussion first. There's too much going on in this pull request to properly review it.

It would be great if you could join the discussion in #53 so we can decide how plugins should actually work: What types of plugins there should be, where they should live, what they should be named, what their interfaces should look like etc.

@Vinay-Umrethe
Contributor Author

Yes, this was intended as a prototype implementation to look at, since I didn't know what kind of plugin interface you expect as the standard for Heretic. We can now discuss that and develop it.

@Vinay-Umrethe
Contributor Author

@p-e-w, have you decided what kind of interface design Heretic should have, so people can create any kind of creative plugin, or contribute plugins that are genuinely useful for the community?

Just a simple overview of the interface and folder structure you expect would help.

@p-e-w
Owner

p-e-w commented Dec 4, 2025

No, I haven't decided that yet. I'm currently preparing for the 1.1.0 release, and afterwards I will look into that and other future features.

@Vinay-Umrethe
Contributor Author

Closing, since #53 is making good progress on a suitable plugin interface. This personality-steering feature can be implemented on top of that interface once that PR is merged.
