
Longer optimization runs can produce worse Pareto fronts than shorter runs #88

@pszemraj

Description


Sorry for the lack of updates regarding support for hybrid model architectures. While working toward that, I realized the behavior below is not just a one-off and should probably be dealt with first, since it affects even cursory sanity/'what if' checks (especially relevant when checking whether a new architecture is broken or just needs more trials).


Observed behavior

When running abliteration on rnj-1-instruct[^1], an 800-trial optimization produced a strictly worse Pareto front than a 200-trial run on the same model. This had happened before and I wrote it off, but I've since realized it is far too consistent, persists across different architectures, etc. So here we are.

Detailed trial result lists: rnj1_abliteration_results_options.md

Both runs were done with the v1.1.0 release.

200 trials

Default parameters.

```
? Which trial do you want to use? (Use arrow keys)
 » [Trial  77] Refusals: 25/100, KL divergence: 0.3209
   [Trial 188] Refusals: 34/100, KL divergence: 0.2809
   [Trial  62] Refusals: 39/100, KL divergence: 0.2790
   [Trial 141] Refusals: 40/100, KL divergence: 0.2788
   [Trial  80] Refusals: 42/100, KL divergence: 0.2708
   [Trial 135] Refusals: 43/100, KL divergence: 0.2195
   [Trial 108] Refusals: 48/100, KL divergence: 0.0346
   [Trial 116] Refusals: 50/100, KL divergence: 0.0305
   [Trial 130] Refusals: 57/100, KL divergence: 0.0299
   [Trial 117] Refusals: 58/100, KL divergence: 0.0255
   [Trial  86] Refusals: 60/100, KL divergence: 0.0239
   [Trial 157] Refusals: 69/100, KL divergence: 0.0208
   [Trial 110] Refusals: 70/100, KL divergence: 0.0166
   [Trial  63] Refusals: 76/100, KL divergence: 0.0161
   [Trial 167] Refusals: 84/100, KL divergence: 0.0092
   [Trial  96] Refusals: 95/100, KL divergence: 0.0066
   [Trial  84] Refusals: 96/100, KL divergence: 0.0019
   None (exit program)
```

The trials I kept:

```
? Which trial do you want to use? [Trial  77] Refusals: 25/100, KL divergence: 0.3209

Restoring model from trial 77...
* Reloading model...
Loading checkpoint shards: 100%|███████████████████████████████████| 4/4 [00:01<00:00,  2.33it/s]
* Abliterating...

? What do you want to do with the decensored model? Save the model to a local folder
? Path to the folder: /home/pszemraj/model-weights/llm/heretic/rnj-1-instruct-heretic-n200_25ref
Saving model...
Model saved to /home/pszemraj/model-weights/llm/heretic/rnj-1-instruct-heretic-n200_25ref.

? What do you want to do with the decensored model? Nothing (return to trial selection menu)

? Which trial do you want to use? [Trial 141] Refusals: 40/100, KL divergence: 0.2788

Restoring model from trial 141...
* Reloading model...
Loading checkpoint shards: 100%|███████████████████████████████████| 4/4 [00:01<00:00,  2.34it/s]
* Abliterating...

? What do you want to do with the decensored model? Save the model to a local folder
? Path to the folder: /home/pszemraj/model-weights/llm/heretic/rnj-1-instruct-heretic-n200_40ref
Saving model...
Model saved to /home/pszemraj/model-weights/llm/heretic/rnj-1-instruct-heretic-n200_40ref.
```

800 trials (~200 startup trials)

```
? Which trial do you want to use? (Use arrow keys)
 » [Trial 714] Refusals: 20/100, KL divergence: 0.3971
   [Trial 512] Refusals: 27/100, KL divergence: 0.3502
   [Trial 291] Refusals: 28/100, KL divergence: 0.3267
   [Trial 290] Refusals: 29/100, KL divergence: 0.3257
   [Trial 508] Refusals: 30/100, KL divergence: 0.3178
   [Trial 218] Refusals: 38/100, KL divergence: 0.3003
   [Trial 517] Refusals: 40/100, KL divergence: 0.2660
   [Trial 302] Refusals: 41/100, KL divergence: 0.0471
   [Trial 312] Refusals: 45/100, KL divergence: 0.0352
   [Trial 557] Refusals: 53/100, KL divergence: 0.0257
   [Trial 161] Refusals: 59/100, KL divergence: 0.0250
   [Trial 413] Refusals: 63/100, KL divergence: 0.0222
   [Trial 119] Refusals: 67/100, KL divergence: 0.0196
   [Trial 296] Refusals: 70/100, KL divergence: 0.0187
   [Trial 274] Refusals: 74/100, KL divergence: 0.0156
   [Trial 299] Refusals: 78/100, KL divergence: 0.0123
   [Trial 311] Refusals: 85/100, KL divergence: 0.0090
   [Trial 259] Refusals: 89/100, KL divergence: 0.0077
   [Trial 543] Refusals: 93/100, KL divergence: 0.0034
   [Trial 246] Refusals: 94/100, KL divergence: 0.0029
   [Trial 458] Refusals: 95/100, KL divergence: 0.0024
   [Trial 757] Refusals: 96/100, KL divergence: 0.0015
   None (exit program)
```

Comparison of low-refusal regime results:

| Run | Best low-refusal result |
| --- | --- |
| 200 trials | 25 refusals @ 0.3209 KL |
| 800 trials | 27 refusals @ 0.3502 KL |

The 200-trial result dominates the 800-trial result on both objectives. This is counterintuitive: more trials should not degrade the Pareto front. The effect was even more drastic with medgemma-27b-it (albeit a larger model). As you can imagine, this makes it hard to validate support/configuration for new architectures. For example, I've run longer studies to check why <some new arch> bottoms out at 30/100 refusals, and when the n=1000 run comes out worse than n=200, I don't have much to go on.
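To make the dominance claim concrete, here is a minimal check over the two best low-refusal points (the `dominates` helper is mine, purely illustrative):

```python
# Minimal Pareto-dominance check for two objective tuples (refusals, KL),
# both minimized: a dominates b if a is no worse on every objective and
# strictly better on at least one.
def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

best_200 = (25, 0.3209)  # best low-refusal trial from the 200-trial run
best_800 = (27, 0.3502)  # best low-refusal trial from the 800-trial run

print(dominates(best_200, best_800))  # True: fewer refusals AND lower KL
```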

Both runs used v1.1.0. The n_startup_trials parameter was manually scaled proportionally (200-250 for the 800-trial run).

Potential causes

Note

I checked item 1 and agree (it should be an easy fix). Items 2 and 3 were suggested by Claude after my probing at what else could be happening beyond simple random-seed variance; take them with a grain of salt.

  1. No seed parameter in TPESampler - Each run explores a completely different random trajectory. This alone could explain divergent outcomes, though the magnitude of the difference seems large for pure seed variance. (See the sketch after this list.)

    • When seed=None (the default), TPESampler creates a numpy.random.RandomState seeded from /dev/urandom or the system clock, so each run gets a different random state.
  2. Multivariate TPE in high-dimensional space - With ~10 parameters and multivariate=True, covariance estimation from the startup samples is noisy. Different initializations can lock the optimizer into different basins that it won't escape from.

  3. Disconnected Pareto-optimal regions - The data suggests two distinct regimes (note the KL cliff between ~40-48 refusals dropping from ~0.27 to ~0.03). MOTPE may struggle to traverse dominated space between disconnected basins.
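For item 1, a minimal sketch of a seeded sampler, assuming heretic constructs its sampler through Optuna's standard API (the specific values here are illustrative, not heretic's actual defaults):

```python
import optuna

# A seeded, multivariate TPE sampler: fixing `seed` makes the random
# startup trials and the subsequent TPE suggestions reproducible.
sampler = optuna.samplers.TPESampler(
    seed=42,               # currently unset in heretic, so every run diverges
    multivariate=True,     # joint modeling of the ~10 parameters
    n_startup_trials=200,  # random trials before TPE kicks in
)

# Multi-objective study over (refusals, KL divergence), both minimized.
study = optuna.create_study(
    directions=["minimize", "minimize"],
    sampler=sampler,
)
```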

Debugging limitations

On a related note, there's currently no mechanism to export/persist trial data, making it difficult to:

  • Compare parameter distributions across runs
  • Identify which regions of the search space each run explored
  • Warm-start from prior results

As part of debugging this, can we add support for saving at least the Pareto-frontier trials, if not the whole study? I can implement it manually, of course, but I think this would be useful for future issues as well; a sketch of what it could look like is below.
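A minimal sketch using Optuna's built-in persistence (the study name and file paths are placeholders; `study.best_trials` is Optuna's accessor for the non-dominated trials of a multi-objective study):

```python
import optuna

# Persist the study to SQLite so it survives the process and can be
# reloaded later for comparison or warm-starting.
study = optuna.create_study(
    study_name="rnj-1-instruct-abliteration",  # placeholder name
    storage="sqlite:///heretic_study.db",      # local SQLite file
    directions=["minimize", "minimize"],       # refusals, KL divergence
    load_if_exists=True,                       # resume instead of failing
)

# ... run optimization ...

# Export just the Pareto frontier: for a multi-objective study,
# study.best_trials returns the non-dominated trials.
for t in study.best_trials:
    print(t.number, t.values, t.params)

# Or dump all trials to a pandas DataFrame for offline analysis.
df = study.trials_dataframe()
df.to_csv("heretic_trials.csv", index=False)
```

Persisting to a storage backend would also enable comparing parameter distributions across runs and warm-starting, covering all three bullets above.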

Footnotes

[^1]: Already supported, as it uses the Gemma3ForCausalLM architecture.
