Sorry for the lack of updates wrt support for hybrid model archs. In the process of getting to that, I realized the issue below is not just a one-off and should probably be dealt with first, as it undermines even cursory sanity/'what if' checks (especially relevant for new architectures, to tell whether something is actually wrong or just needs more trials).
Observed behavior
When running abliteration on rnj-1-instruct¹, an 800-trial optimization produced a strictly worse Pareto front than a 200-trial run on the same model. This had happened before and I wrote it off, but it turned out to be far too consistent, persists across different architectures, etc. So here we are.
```
? Which trial do you want to use? [Trial 77] Refusals: 25/100, KL divergence: 0.3209
Restoring model from trial 77...
* Reloading model...
Loading checkpoint shards: 100%|███████████████████████████████████| 4/4 [00:01<00:00, 2.33it/s]
* Abliterating...
? What do you want to do with the decensored model? Save the model to a local folder
? Path to the folder: /home/pszemraj/model-weights/llm/heretic/rnj-1-instruct-heretic-n200_25ref
Saving model...
Model saved to /home/pszemraj/model-weights/llm/heretic/rnj-1-instruct-heretic-n200_25ref.
? What do you want to do with the decensored model? Nothing (return to trial selection menu)
? Which trial do you want to use? [Trial 141] Refusals: 40/100, KL divergence: 0.2788
Restoring model from trial 141...
* Reloading model...
Loading checkpoint shards: 100%|███████████████████████████████████| 4/4 [00:01<00:00, 2.34it/s]
* Abliterating...
? What do you want to do with the decensored model? Save the model to a local folder
? Path to the folder: /home/pszemraj/model-weights/llm/heretic/rnj-1-instruct-heretic-n200_40ref
Saving model...
Model saved to /home/pszemraj/model-weights/llm/heretic/rnj-1-instruct-heretic-n200_40ref.
```
Detailed trial result lists for both runs (including a comparison of the low-refusal regime) are in the attached rnj1_abliteration_results_options.md. The 200-trial result dominates the 800-trial result on both objectives. This is counterintuitive: more trials should not degrade the Pareto front. The effect was even more drastic with medgemma-27b-it (albeit a larger model). As you can imagine, this makes it hard to validate support/configuration for new architectures. For example, I've tried running longer studies to check why <some new arch> bottoms out at 30/100 refusals, and when the n=1000 run comes out worse than the n=200 one, I don't have much to go on.
Both runs used v1.1.0. The `n_startup_trials` parameter was manually scaled proportionally (200-250 for the 800-trial run).
Potential causes
Note
Item 1 I checked and agree with (and it should be easy to fix). Items 2-3 were suggested by Claude after my probing at what else could be happening beyond simple random seeds; take them with a grain of salt.
1. No `seed` parameter in `TPESampler` - Each run explores a completely different random trajectory. This alone could explain divergent outcomes, though the magnitude of the difference seems large for pure seed variance. When `seed=None` (the default), `TPESampler` creates a `numpy.random.RandomState` seeded from `/dev/urandom` or the system clock, so each run gets a different random state. (See the sketch after this list.)
2. Multivariate TPE in high-dimensional space - With ~10 parameters and `multivariate=True`, covariance estimation from the startup samples is noisy. Different initializations can lock the optimizer into different basins that it won't escape from.
3. Disconnected Pareto-optimal regions - The data suggests two distinct regimes (note the KL cliff between ~40-48 refusals, where it drops from ~0.27 to ~0.03). MOTPE may struggle to traverse dominated space between disconnected basins.
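For what it's worth, item 1 looks like a one-line change. A minimal sketch, assuming heretic constructs its sampler roughly like this (I haven't traced the exact call site, and a user-facing `seed` option would presumably need to be plumbed through):

```python
import optuna

# Hypothetical sketch -- not heretic's actual construction site. The point is
# just that TPESampler accepts an explicit seed, which pins the RandomState
# that is otherwise drawn from /dev/urandom or the system clock.
sampler = optuna.samplers.TPESampler(
    seed=42,               # reproducible trajectory; expose as a config option
    multivariate=True,     # as currently used
    n_startup_trials=200,  # scaled with total trials, as in the runs above
)
study = optuna.create_study(
    directions=["minimize", "minimize"],  # refusals, KL divergence
    sampler=sampler,
)
```

With a fixed seed, repeating the 200- vs. 800-trial comparison would at least separate pure seed variance from a genuine optimizer pathology (items 2-3).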
Debugging limitations
On a related note, there's currently no mechanism to export/persist trial data, making it difficult to:

- Compare parameter distributions across runs
- Identify which regions of the search space each run explored
- Warm-start from prior results
As part of debugging this, can we add support for saving at least the Pareto-frontier trials, if not the whole study? I can implement it manually, of course, but I think it would be useful for future issues as well.
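A minimal sketch of what I have in mind, assuming heretic exposes its underlying `optuna.Study` object (the function name and output format here are placeholders):

```python
import json

import optuna


def dump_pareto_front(study: optuna.Study, path: str = "pareto_trials.json") -> None:
    """Persist the Pareto-optimal trials of a multi-objective study as JSON."""
    records = [
        {
            "number": t.number,
            "params": t.params,  # full sampled parameter dict
            "values": t.values,  # (refusals, KL divergence)
        }
        # best_trials returns the Pareto front for multi-objective studies
        for t in study.best_trials
    ]
    with open(path, "w") as f:
        json.dump(records, f, indent=2)
```

Alternatively, creating the study with a storage backend (e.g. `optuna.create_study(storage="sqlite:///heretic.db", ...)`) would persist every trial for free and allow warm-starting a new study via `study.add_trials(...)`.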
Footnotes

1. already supported, as it's Gemma3ForCausalLM arch