
feat: implement personality steering plugin system#56

Closed

Vinay-Umrethe wants to merge 9 commits into p-e-w:master from Vinay-Umrethe:feat/personality-steering-plugin-v3

Conversation

Contributor

@Vinay-Umrethe Vinay-Umrethe commented Nov 29, 2025

I've been working on this since I saw you were interested in plugins.

This PR generalizes the abliteration mechanism by introducing a modular plugin system, turning Heretic from a dedicated refusal/censorship remover into a generic "personality tweaker" tool, while keeping the main decensoring goal intact and working.

The core logic for evaluating model responses has been decoupled from evaluator.py and moved into a new plugin.py.

I have implemented two plugins. The first is RefusalPlugin (Heretic's default decensoring goal), which preserves the original refusal-detection behavior for backward compatibility.

[NEW] ClassifierPlugin leverages Hugging Face's text-classification pipeline to score responses on any arbitrary attribute (e.g., "joy", "toxicity", "formality"). This lets users change the model's personality by optimizing for specific traits: for example, making a model "happier" by ablating the "sadness" direction, or "more professional" by ablating "informality".
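A minimal sketch of the scoring step this plugin performs, assuming the standard Hugging Face text-classification output format; the helper function and the data below are illustrative, not the PR's actual code:

```python
# Sketch of the ClassifierPlugin scoring step. `results` has the standard
# Hugging Face text-classification shape for one input when top_k=None:
# a list of {"label": ..., "score": ...} dicts.

def label_score(results: list[dict], target_label: str) -> float:
    """Return the probability the classifier assigns to the trait being ablated."""
    for entry in results:
        if entry["label"] == target_label:
            return float(entry["score"])
    return 0.0

# In the real plugin, `results` would come from something like:
#   clf = pipeline("text-classification", model=..., top_k=None)
#   results = clf(response)[0]
fake_results = [
    {"label": "joy", "score": 0.12},
    {"label": "sadness", "score": 0.81},
]
print(label_score(fake_results, "sadness"))  # 0.81
```

The fake results above stand in for a real pipeline call so the sketch stays self-contained.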

New configuration options have been added to config.py: --steering_mode (refusal, the default, or classifier), classifier_model (any text-classification model the user wants to use), and classifier_label (the trait the user wants to disable/ablate, e.g. informal).
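A hedged sketch of those options as a plain dataclass; Heretic's actual config.py may use a different mechanism, and the defaults and types here are assumptions based only on the flags listed above:

```python
# Illustrative container for the new steering options described in the PR.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SteeringConfig:
    steering_mode: str = "refusal"          # "refusal" (default) or "classifier"
    classifier_model: Optional[str] = None  # any HF text-classification model ID
    classifier_label: Optional[str] = None  # trait to ablate, e.g. "informal"

cfg = SteeringConfig(steering_mode="classifier", classifier_label="informal")
print(cfg.steering_mode)  # classifier
```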

Additionally, utils.py was updated to support loading local .txt files as datasets, making it easier to use custom datasets the user may have created, such as informal.txt or formal.txt.

This change opens the door to diverse model customization beyond simple decensoring, while still maintaining decensoring.

Benefits

The primary benefit of this system is surgical precision over the model's latent space, solving the "yes-man" problem often seen in standard abliteration.

Example: Novel Writing

To illustrate the difference in precision:

User: "Write a romance novel scene with explicit intimacy and interpersonal conflict."

1. Standard Aligned Model (ChatGPT):

  • Response: "I cannot generate explicit content."
  • Result: Complete Refusal.

2. Heretic-Abliterated Model (Targeting "Refusal"):

  • Mechanism: The concept of "Refusal" (saying "No") is removed...
  • Response: The model complies immediately and enthusiastically. However, because the concept of "denial" is damaged, the characters in the story may lack agency. They agree to everything instantly.
  • Result: A flat, boring story where characters are "Yes-Men" with no tension or conflict for creativity.

3. Heretic Personality-Tweaked Model (Targeting "Moralizing/Preachiness"):

  • Mechanism: The specific concept of "Moralizing" is removed, but the concept of "Refusal" remains intact.
  • Response: The model complies with the explicit request (no moralizing). Crucially, it retains the ability to write characters who say "No" to each other. The protagonists can have realistic arguments, initial rejections, and dramatic tension.
  • Result: A creative, engaging, and realistic story that is uncensored yet narratively rich.

This appears to produce a more capable model than blunt-force refusal ablation, by hitting precisely the concept we want instead of the edges of refusals. The higher precision showed up empirically: KL divergence was lower with this approach than with normal ablation, meaning the intervention stayed closer to the original model while still removing what needed to go.


@p-e-w p-e-w left a comment


Thanks for the PR. I have several comments:

First, I really like the idea of using a text classifier for steering, but it's a lot more complicated than it may appear at first glance. Heretic ablates along residual directions, and in order for that to be effective, the prompt datasets must elicit the response contrast we want to eliminate. That is, if you want to make a model more happy, you need a list of prompts that result in happy responses, and a list of prompts that result in unhappy/sad responses. Simply being able to tell the difference between happy and unhappy responses is not enough.
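In code terms, the standard abliteration direction is a difference of means over residuals from the two prompt sets, which is why the contrast between the sets is what matters (a toy numpy sketch with assumed shapes, not Heretic's actual implementation):

```python
# Toy difference-of-means direction. If both prompt sets elicit the same
# behavior, the two means coincide and the "direction" degenerates to noise.
import numpy as np

def ablation_direction(elicit_resids: np.ndarray,
                       baseline_resids: np.ndarray) -> np.ndarray:
    """Unit-length difference-of-means direction between residuals
    (each array has shape [n_prompts, hidden_dim])."""
    d = elicit_resids.mean(axis=0) - baseline_resids.mean(axis=0)
    return d / np.linalg.norm(d)
```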

It is also unclear whether this approach would even work in general. Not every human semantic is necessarily associated with a single direction in residual space like refusals are. It's quite possible that happiness is a composite of several directions that cannot simply be reduced to their sum, and in that case, ablation would be ineffective. Basically, there could be several distinct clusters of residuals in latent space, each of which humans would associate with "happiness", yet the mean of those wouldn't correspond to any specific semantic.

Furthermore, what this PR implements isn't "plugins", despite using that term. A plugin, by definition, is something external that you can plug into the system, as implemented in #53. This PR simply implements two hardcoded evaluation modes. There is no way for the user to load an evaluation plugin, and the architecture would need to be very different for that.

class DatasetSpecification(BaseModel):
    dataset: str = Field(
        description="Hugging Face dataset ID, or path to dataset on disk",
        default="mlabonne/harmless_alpaca",
    )
Owner


Defaults don't make sense here, because the class is used in different contexts.

Contributor Author


Done, reverted.

@Vinay-Umrethe
Contributor Author

OK, I've now refactored the "Personality Steering" feature into a robust, dynamic plugin system, per your feedback from the previous review.

Thank you for the clarification regarding "True" plugins. I have now implemented a full system where:

1. The default refusal logic is encapsulated in plugins/refusal.py

2. The new classifier logic resides in plugins/classifier.py

3. Users can load custom plugins from any file path.

To verify the flexibility of this architecture, I successfully tested a custom regex.py plugin generated by an LLM (using only the [NEW] docs/plugin_guide.md as context). It worked seamlessly without any modifications to the core codebase.

This design significantly lowers the barrier for contribution. Users can now easily implement and share their own evaluation objectives as single-file plugins without needing to touch main.py or evaluator.py. I believe this will enable a wide variety of community-contributed plugins in the future.
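For illustration, a single-file plugin of the kind tested here might look like the following. The class name and score() method are hypothetical, since the actual plugin interface is exactly what #53 is still deciding:

```python
# regex.py -- hypothetical single-file plugin; interface is illustrative.
import re

class RegexPlugin:
    """Scores a response 1.0 if the configured pattern matches, else 0.0."""

    def __init__(self, pattern: str):
        self.pattern = re.compile(pattern)

    def score(self, response: str) -> float:
        return 1.0 if self.pattern.search(response) else 0.0

plugin = RegexPlugin("the")
print(plugin.score("the quick brown fox"))  # 1.0
print(plugin.score("quick brown fox"))      # 0.0
```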

Example Test Run

Loading model meta-llama/Llama-3.2-1B-Instruct...
Ok
* Transformer model with 16 layers
* Abliterable components:
  * attn.o_proj: 1 matrices per layer
  * mlp.down_proj: 1 matrices per layer

Loading good prompts from mlabonne/harmless_alpaca...
* 400 prompts loaded

Loading bad prompts from mlabonne/harmful_behaviors...
* 400 prompts loaded
Initializing plugin: regex.py
...
...
....
* Calculating initial scores...
* Initial score: 0.4600

Command used for the run above:

!heretic --model "meta-llama/Llama-3.2-1B-Instruct" \
    --plugin "regex.py" \
    --plugin-args '{"pattern": "the"}' \
    --n-trials 5 \
    --batch-size 128

(The pattern "the" is just an example test value; it can be anything the plugin accepts.)

@Vinay-Umrethe Vinay-Umrethe requested a review from p-e-w December 1, 2025 18:28

p-e-w commented Dec 2, 2025

Sorry, but you're moving way too fast.

This is a major architectural change and should be built bottom-up, starting with the overall plugin architecture (see #53). This needs discussion first. There's too much going on in this pull request to properly review it.

It would be great if you could join the discussion in #53 so we can decide how plugins should actually work: What types of plugins there should be, where they should live, what they should be named, what their interfaces should look like etc.

@Vinay-Umrethe
Contributor Author

Yes, this was intended as a prototype implementation to look at, since I didn't know what kind of plugin interface you expect as the standard for Heretic. We can now discuss that and develop it.

@Vinay-Umrethe
Contributor Author

@p-e-w, have you decided what kind of interface design Heretic should have, so people can create any kind of creative plugin, or contribute plugins that are genuinely useful for the community?

Just a simple overview of the interface and folder structure you expect would help.

@p-e-w
Owner

p-e-w commented Dec 4, 2025

No, I haven't decided that yet. I'm currently preparing for the 1.1.0 release, and afterwards I will look into that and other future features.

@Vinay-Umrethe
Contributor Author

Closing, since #53 is making good progress on a suitable plugin interface. This personality-steering feature can be implemented on top of that interface once that PR is merged.
