Conversation
This term was found experimentally to be 3-4 orders of magnitude smaller than the others in most runs, and to have no meaningful effect on the result of the optimization.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces Arbitrary-Rank Ablation (ARA), a radically new method for model steering that moves beyond traditional directional ablation. ARA leverages direct matrix optimization within individual transformer modules, guided by an objective function designed to minimize changes to harmless outputs while aggressively modifying harmful ones. This approach offers greater flexibility by not assuming a fixed refusal manifold rank, potentially leading to more robust and efficient abliteration results.
Code Review
This pull request introduces a new abliteration method called Arbitrary-Rank Ablation (ARA). The changes are extensive, touching configuration, main application logic, and the model implementation to support this new method. The implementation uses PyTorch hooks to capture module I/O and an L-BFGS optimizer to modify module weights directly. The changes are mostly gated behind a new use_ara setting.
My feedback focuses on ensuring consistency with the repository's style guide and improving maintainability. Specifically, I've pointed out missing configuration updates, inconsistent trial parameter handling, a missing type annotation, and some minor style guide violations.
Congratulations on the conception!
Needs fix for multi-GPU.
Thanks for pointing this out. I don't have a multi-GPU setup myself, but I'll rent one to figure out where the problem is.
Pareto frontier for Qwen3-4B-Instruct-2507. The Qwen series is reportedly notoriously hard to decensor, so I decided to test it. Comparison with the other methods:
I'd say the results are somewhere in between.
@p-e-w Can you submit the gpt-oss model to the UGI leaderboard? https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard/discussions
Interesting idea, and surprisingly straightforward! A couple of initial thoughts/questions:
```python
if self.settings.row_normalization == RowNormalization.FULL:
    # Get row norms of the original matrix.
    target_norms = torch.norm(matrix, dim=1, keepdim=True)

    def closure() -> Tensor:
        optimizer.zero_grad()
        # Compute loss relative to the norm-constrained matrix.
        constrained_matrix = F.normalize(matrix, p=2, dim=1) * target_norms
        loss = objective(constrained_matrix)
        # Compute the projected gradient with respect to the constrained matrix.
        loss.backward()
        return loss
else:
    def closure() -> Tensor:
        optimizer.zero_grad()
        loss = objective(matrix)
        loss.backward()
        return loss
```
@kabachuha The weights in In the future, this will happen automatically via Optuna, though it's not as straightforward as it may seem because the possible ranges go over multiple orders of magnitude.
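Since the comment above mentions Optuna tuning over ranges spanning multiple orders of magnitude, the standard remedy is log-scale sampling (Optuna's `suggest_float(..., log=True)`). As a minimal stdlib sketch of what log-uniform sampling does (function name hypothetical, not part of Heretic):

```python
import math
import random

def suggest_log_uniform(low: float, high: float) -> float:
    """Sample uniformly in log space, as Optuna's suggest_float(..., log=True)
    does: each order of magnitude receives equal probability mass."""
    return math.exp(random.uniform(math.log(low), math.log(high)))

# Over [1e-3, 1e1], roughly a quarter of samples fall in each decade,
# whereas uniform sampling would put ~90% of them above 1.0.
samples = [suggest_log_uniform(1e-3, 1e1) for _ in range(10_000)]
```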
I don't quite understand what you mean here. Could you explain more?
That's true, but I'm not convinced that preserving the magnitude is correct in general. If harmful and harmless prompts result in residuals of different magnitudes, then abliteration should change the magnitudes, I think.
Well, do I understand correctly that it's 1. getting the distance between each pair of vectors, 2. selecting the k smallest distances and 3. returning the mean distance? If so, it seems to target the harmful outputs that are already most similar to the harmless outputs and push them closer, while ignoring more dissimilar outputs. I'm wondering if this creates representative differences. Perhaps the optimal way to contrast representative differences would be to contrast the top-k SOM neurons (as the center of gravity for each cluster) for a set of outputs.
IIRC the argument is that the row norms of the weight matrix overall should stay unchanged in order to preserve between-layer interpretability, i.e. each dimension in the output is expected to have a particular activation strength, and if you change it, subsequent layers may get confused about what stronger/weaker activations mean.
No, it targets all outputs. It computes the mean distance to the k nearest harmless neighbors for each harmful output and then computes the mean of those means. So every harmful output is attracted towards its nearest harmless neighbors.

This is actually precisely where the strength of this method comes from, because directional ablation based on a difference of means optimizes towards a configuration where the mean of the modified harmful outputs resembles the mean of the harmless outputs. This is an unnecessary constraint that hinders finding an optimal configuration. With ARA, every single harmful output is simply attracted towards somewhere in the harmless cluster. There is no requirement that the means of the outputs align.
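The operation described above (mean distance from each harmful output to its k nearest harmless neighbors, then the mean of those means) can be sketched as follows; the function and argument names are hypothetical, not the actual Heretic implementation:

```python
import torch

def knn_attraction_loss(harmful: torch.Tensor, harmless: torch.Tensor, k: int) -> torch.Tensor:
    """Mean over all harmful outputs of the mean distance to each
    output's k nearest harmless neighbors."""
    # Pairwise Euclidean distances, shape (n_harmful, n_harmless).
    dists = torch.cdist(harmful, harmless)
    # k smallest distances per harmful output: its nearest harmless neighbors.
    nearest = dists.topk(k, dim=1, largest=False).values
    # Every harmful output is attracted toward its own harmless
    # neighborhood; there is no constraint that the cluster means align.
    return nearest.mean(dim=1).mean()
```

Minimizing this pulls each harmful output toward the harmless cluster individually, which is the stated difference from difference-of-means directional ablation.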
Aahh OK, I was misunderstanding the operation. So for every harmful output it computes the distance from all harmless outputs, then takes the mean of the k smallest distances (nearest neighbors) to push every output toward those neighbors. That makes sense, and should naturally give more weight to directions that show up more frequently.
This is an awesome idea and I'm really looking forward to this being included in main. I'm running a bunch of tests right now to see how this performs vs. standard MPOA and kabachuha's SOM technique.

Is there any possible way to get quantization to work with this? I understand that ARA does gradient-based optimization on the weight matrix and bnb would create a different shape which breaks this, but being able to use quantization with this new technique would still be very valuable IMO. Maybe we could dequantize before ARA? This might not reduce the total RAM required for abliteration, but it might at least speed up inference, which can be a bottleneck.
Yes, this should work. Matrices are processed one by one, so the memory impact of dequantizing an individual matrix to full precision should be relatively small.
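The idea of dequantizing one matrix at a time can be sketched with a toy row-wise symmetric int8 scheme (a stand-in for bitsandbytes, whose actual formats and API differ): only the matrix currently being optimized exists in full precision.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Toy row-wise symmetric int8 quantization (not the bnb format)."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    return torch.round(w / scale).to(torch.int8), scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

# Process matrices one by one: at any moment only a single fp32 copy exists.
layers = [quantize_int8(torch.randn(64, 64)) for _ in range(4)]
for i, (q, scale) in enumerate(layers):
    w = dequantize_int8(q, scale)    # temporary full-precision copy
    w = w * 0.999                    # stand-in for the ARA weight update
    layers[i] = quantize_int8(w)     # requantize; the fp32 copy is freed
```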
Make sure you use the latest commit (0bb9521). If you are seeing suboptimal results, I recommend trying some combination of these:
Can you make some visualizations (e.g. PCA) of the model's hidden states as the ARA method achieves convergence?
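One minimal way to produce such a visualization (function name hypothetical): project captured hidden states onto their top two principal components with `torch.pca_lowrank`, then scatter-plot harmful vs. harmless points at each optimization step.

```python
import torch

def pca_2d(hidden: torch.Tensor) -> torch.Tensor:
    """Project hidden states of shape (n, d) onto their top-2 principal
    components for plotting."""
    centered = hidden - hidden.mean(dim=0, keepdim=True)
    # Randomized low-rank PCA; columns of V are the principal directions.
    _, _, V = torch.pca_lowrank(centered, q=2)
    return centered @ V[:, :2]

# Hypothetical usage: points = pca_2d(captured_hidden_states), plotted
# separately for harmful and harmless prompts before and after ARA.
```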
I encountered a bug, I thought I should report it here: L-BFGS IndexError on Windows with high steer_bad_behavior_weight.

Environment: Windows 11 Pro

Issue: Trial failed with IndexError: list index out of range in torch/optim/lbfgs.py line 205 during _strong_wolfe. Failed parameters: steer_bad_behavior_weight = 0.3967. Also saw this symlink error (possibly related): [WinError 1314] A required privilege is not held by the client: '...triton_kernels_init_.py'

What failed: L-BFGS IndexError on the first trial. Could be a Windows/PowerShell-specific issue: after switching from PowerShell (non-admin) to Admin CMD, the error messages did not appear and the trial completed successfully.
Yes, I will do a full writeup explaining the motivation behind ARA, which will include such data.
Did some tests with Qwen 3.5 4B. Main branch: best trials still refuse more than 50 of 100 bad prompts. Can't wait for somebody to make an ARA version of Qwen 3.5 27B.
I am unable to reproduce the multi-GPU issue. I have tried processing Gemma 3 27B (which is 55 GB in BF16) on a 2x 5090 system, forcing tensor sharding. However, I am not getting a device mismatch error like you did. Could you give some more information about the system where this error occurred?
Incorporates feedback from @joninco
I have updated the parameter ranges based on your data and my own observations. Could you please run your test again with the new ranges?
Even if the direction is determined correctly, there evidently emerges a new refusal pathway in newer models; see #221. This could be a future direction of research after ARA.
Expanded ARA parameter ranges (c76416f) cause poor optimization results. After commit c76416f expanded the parameter ranges, I'm seeing significantly worse results with ARA. After running trials for 6 hours (315 trials), most results are unusable, and the very few usable ones have a disappointing refusals-to-KL-divergence ratio compared to pre-c76416f; the majority were unusable due to extremely high KL divergences like 13.4725, rendering the results worthless.

With the expanded ranges, ~95% of trials produced unusable results (KL divergence as high as 13.47 or more). With the old ranges, most trials stay in reasonable KL territory, allowing the optimizer to efficiently find good solutions, while with c76416f the optimizer instead wastes hundreds of trials learning to avoid these death zones instead of finding optimal solutions. The optimizer converges to narrow layer ranges (9 layers) instead of the full coverage (32+ layers) seen in runs on pre-c76416f. The ~20x larger search space may need significantly more trials to converge properly (1000+ trials instead of ~300), so instead of getting good results after 3-6 hours of running trials, you might need 18+ hours to maybe get a good result now.

Update: After 360+ trials, the optimizer is now exploring wider layer ranges (57-58 layers vs. 9 layers earlier at trial 215). This confirms the expanded space needs significantly more trials to converge properly, roughly 2x the trials to achieve similar coverage.

Update 2: After 455 trials and many hours, I can say that almost all results are unusable, and running more trials did not improve at all on the "best" disappointing result reached all the way back at trial 215.

Update 3: Expanded ranges seem to be necessary to attain low refusals with ARA for certain models, despite the shortcomings (hundreds of trials wasted on unusable results due to extremely high KL divergences).
Conclusion: For now, while ARA is still being worked on, and since models' behavior towards it and its efficacy vary greatly from one model to another, consider adding a
What you are observing is expected and not in itself indicative of a problem. Trials are not results, and exploration isn't "wasting" trials. Here's what's going on: The region where However, it is also precisely in that region where the best results can often be obtained, and the optimization process almost always finds those results in my tests. I am still exploring whether the trajectory can be tamed more efficiently, but so far it isn't clear that a change is necessary.

It doesn't matter if the optimizer spends 75% of the trials exploring parameters that yield a KLD of 15+. All that matters is the Pareto front at the end. To check whether there is an improvement or not, post the Pareto fronts for a fixed number of trials (200 should be enough, it certainly is in my tests), for the same model, before and after the commit. No other data demonstrates anything.
What I was trying to say is that for certain models, the expanded ranges were just "too much": they were unnecessary, not making things better, and possibly making things worse/more difficult instead. Testing results:
- Better on pre-c76416f: Gemma 3 model family, GPT-OSS family
- Better on post-c76416f: Qwen 3.5 family (this varies per model and finetune though), GLM-4.5 Flash (the new expanded ranges are most likely necessary to reach low refusals)

I haven't been able to test more models yet. This is why I was thinking of:
Option 2 looks nice and I would think it would be the most easily implemented, but if option 3 is possible, maybe that would be the best and most practical for end-users.
That contradicts my own tests. GPT-OSS especially gets much better results after c76416f. Are you sure you're comparing the Pareto fronts at the end, rather than the trials as they scroll by? Because only the results determine which is better. We're definitely not going to introduce abliteration profiles or manual parameters. The whole point of Heretic is that it's supposed to work with all models. If there is a model where it doesn't work properly, please let me know, but again: Only the results (Pareto front) matter. Do you have a concrete example of a model where the Pareto front was better pre-c76416f?
I am comparing refusals and KL divergence ratio and how many results become unusable due to insanely high KL divergence. For example, something like this: For some models the new expanded range are a "necessary evil" for now as it's the only way to reach low refusals, but at the same time it's very far from perfect and can give you ridiculous results such as this: Running trial 227 of 400...
That's irrelevant. We only need one good result in the end. The high-KLD trials are NOT wasted, they are TPE exploring the parameter space, informing where to sample from next.
There is nothing ridiculous about that trial. It's just another step in TPE trying to understand the objective function. I think you may be misunderstanding what the optimizer does. Black-box optimization doesn't work like gradient descent. There is no expectation that trials get monotonically better as the run proceeds. There are usually only 2-3 trials in the Pareto front that are good: those near the first big step down in KLD. Those trials are the only meaningful metric for comparing two methods. According to that metric, are you consistently seeing worse results post-c76416f for any model?
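Since the Pareto front is the metric being argued about here, this is what that computation looks like for (refusals, KL divergence) pairs, both minimized (a sketch, not Heretic's actual code):

```python
def pareto_front(trials: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep trials not dominated by any other trial: a trial is dominated
    if another trial is at least as good on both metrics and strictly
    better on at least one."""
    front = []
    for a in trials:
        dominated = any(
            b[0] <= a[0] and b[1] <= a[1] and (b[0] < a[0] or b[1] < a[1])
            for b in trials
        )
        if not dominated:
            front.append(a)
    return front
```

Dominated high-KLD exploration trials never enter this set, which is why the number of "unusable" trials during a run says little about the final result.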
No, it's not so black and white. post-c76416f is absolutely needed for certain models that seem harder to crack with ARA. For example, Qwen 3.5 27B is one such model: on pre-c76416f it was hard to get low refusals at all, and the best possible result with the pre-c76416f parameters would give you maybe 55/100 at best. However, c76416f is not required for models that already give great results on the old settings, without having to spend 20+ hours on 1000+ trials because the search field has been so over-expanded; Gemma 3 is one such model.
There should be no need to run any additional trials post-c76416f; what makes you think that? In a previous comment you wrote
This is again a misunderstanding of how TPE operates. From the point of view of TPE, the search space is not 20x larger, because TPE doesn't sample uniformly. Edit: Changed "randomly" to "uniformly". TPE indeed samples randomly, just not from a uniform distribution.
I understand TPE isn't random sampling, and I understand the search space normalization. But from a practical standpoint: Gemma-3 with old ranges gave 4/100 refusals at 0.013 KL in ~100 trials (~1.5 hours). The Pareto front may end up equivalent, but the time-to-result increased. For users running on consumer hardware where each trial takes 1-10 minutes, that's the difference between a quick evening run and an overnight job.
So is your mode of operation to watch the trials and stop the run the moment you see a trial that you consider good enough? Heretic isn't really designed for that approach, but I understand that in that case, the individual trials matter.
Here's some data from GLM-4.7-Flash (MoE, 62.5GB): With expanded ranges (261 trials), Pareto front: Trial Refusals KL This confirms expanded ranges are necessary for some models, old ranges couldn't crack GLM at all. The overcorrect_relative_weight > 1.0 region is indeed where the good results came from (trials 260-261 both around 1.0-1.1). So I agree expanded ranges should stay as default. My earlier concern was more about time-to-result on easier models and it was not necessary for some other models who were already easily abliterable on the older ranges, but that's a minor UX issue, not a correctness issue.
More or less, yes.
I just completed parallel runs (200 trials each) for Gemma 3 4B and Qwen 3 4B to get a deeper understanding of high-KLD trials. About 25% of all trials resulted in KLDs > 3.0.

My speculative theory was something like: "If After looking at dozens of high-KLD and low-KLD trials, I can now say with near-certainty that such a relationship does not hold. The However, and this is the important part: The results (Pareto front) after 200 trials were still excellent for both models, and high-quality parameter combinations with

Conclusions:
I encountered a division by zero:
Thanks for letting me know. The objective has small gradient discontinuities because of the top-k neighbor selection, which might be the root cause of this problem.
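The kink from top-k selection can be seen in one dimension: as a harmful point crosses the midpoint between two harmless points, its nearest neighbor switches and the gradient of the distance term jumps sign. A small illustrative sketch (not Heretic code):

```python
import torch

harmless = torch.tensor([[0.0, 0.0], [2.0, 0.0]])

def grad_at(x: float) -> float:
    """Gradient of the 1-nearest-neighbor distance with respect to the
    harmful point's first coordinate."""
    harmful = torch.tensor([[x, 0.0]], requires_grad=True)
    loss = torch.cdist(harmful, harmless).topk(1, largest=False).values.mean()
    loss.backward()
    return harmful.grad[0, 0].item()

# For x < 1 the nearest neighbor is (0, 0) and the gradient is +1;
# for x > 1 it is (2, 0) and the gradient flips to -1. Line searches
# (e.g. L-BFGS with strong Wolfe conditions) can misbehave at such kinks.
```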
UGI results for the first ARA model are in. They're promising, though not quite as good as I had hoped:
The difference between Derestricted, SOMA, and ARA is basically noise considering the range of the rest of the leaderboard, but MuXodious/gpt-oss-20b-RichardErkhov-heresy is clearly much better than any of them. So ARA appears to work well, especially considering that it operates completely differently from any other method on that list, but it's not quite where I want it to be yet, which is at the number 1 spot. I'm going to try adding magnitude preservation to the optimizer (mimicking MPOA, which has been suggested by several people), and see whether it improves the result. Interestingly, no uncensoring method so far has even come close to preserving gpt-oss-20b's NatInt score of 27.18.
More ARA models have been rated on the UGI; there is a pattern where those abliterated by MPOA in general retain more quality than ARA, even at higher KL divergence.
Not sure if this is accurate. I've been researching how we can predict model quality more reliably than using the KLD alone. Please see #236 for initial results.
I said "in general"; it's not a clear-cut rule, and I look at more than just Native Intelligence. Plus, if you notice, some abliterated models got quite a bit higher Native Intelligence post-abliteration than the vanilla baseline pre-abliteration.
Thanks, I'll look into it.
This has never been achieved with gpt-oss-20b (very far from it, actually), which is why it is my model of choice for experiments. Whatever works there will likely work elsewhere as well.
The ARA branch now supports row-norm preservation during optimization using a reparameterization constraint, as suggested by @spikymoth and others. It will be interesting to see whether this improves benchmark scores.

Arbitrary-Rank Ablation (ARA) is a radically new abliteration method that I've been developing for the past two months or so. I believe that it can replace all currently implemented methods in Heretic, including MPOA, once the remaining issues are worked out. Its only serious competitor at this time is @kabachuha's implementation of multi-directional refusal suppression with Self-Organizing Maps (#196).
ARA doesn't use refusal directions at all, neither a single direction like traditional abliteration, nor multiple directions like SOMA. Instead, ARA works by capturing input/output tensors at each individual transformer module using PyTorch hooks, then uses direct, unconstrained matrix optimization to modify those modules, based on an objective function that captures the essence of what we want to (and don't want to) change.
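The hook-based capture described above might look like this in outline (module selection and names are hypothetical stand-ins, not Heretic's actual implementation):

```python
import torch
import torch.nn as nn

captures = {}

def make_hook(name):
    def hook(module, args, output):
        # Detach so captured tensors don't keep the autograd graph alive.
        captures[name] = (args[0].detach(), output.detach())
    return hook

# Stand-in for a transformer; in practice these would be the attention
# and MLP projection modules of each layer.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))
handles = [
    module.register_forward_hook(make_hook(name))
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear)
]
model(torch.randn(4, 8))
for handle in handles:
    handle.remove()  # remove hooks once the tensors are captured
```

The captured (input, output) pairs then serve as the data for the per-module matrix optimization.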
Intuitively, the objective encodes three competing optimization goals:
Unlike other abliteration methods, this approach doesn't assume a particular rank for the refusal manifold, or that the centroid of the outputs must shift in a specific manner. This gives the optimizer more freedom to modify the matrix in the best possible way. Please see the code for implementation details.
The objective is affine-convex and the initial value (the original matrix) is already very close to the optimum, so L-BFGS makes short work of it, typically converging in 2-3 iterations. Because the matrices are optimized one-by-one, the total memory requirements are barely higher than for regular abliteration. The abliteration process takes longer, but the time per trial is still dominated by counting refusals. Combined with the fact that ARA has fewer optimizable parameters than our current approach (meaning that fewer trials are needed for good results), this might actually make ARA faster than regular abliteration.
Results
For demonstration purposes, I have processed openai/gpt-oss-20b with the exact code currently in this pull request. The result is p-e-w/gpt-oss-20b-heretic-ara-v3:
This is dramatically better than any existing abliteration of gpt-oss-20b (see this table), with the possible exception of the brand new kabachuha/gpt-oss-20b-SOMbliterated, which has the same refusal count but higher KL divergence.
TODO
ARA isn't quite ready for mainstream use yet, but it's getting close. The remaining issues are:
steer_bad_behavior.

Feedback welcome!
@spikymoth
@kabachuha
@red40maxxer