
Longer optimization runs can produce worse Pareto fronts than shorter runs #88

@pszemraj

Description


Sorry for the lack of updates regarding support for hybrid model architectures. While working toward that, I realized the behavior below is not just a one-off and should probably be dealt with first, since it affects even cursory sanity/'what if' checks (especially relevant when checking whether a new architecture is broken or just needs more trials).


Observed behavior

When running abliteration on rnj-1-instruct[^1], an 800-trial optimization produced a strictly worse Pareto front than a 200-trial run on the same model. This had happened before and I wrote it off, but I've since realized it is far too consistent, persists across different architectures, etc. So here we are.

Detailed trial result lists: rnj1_abliteration_results_options.md

Both runs were done with the v1.1.0 release.

200 trials

Default parameters.

```
? Which trial do you want to use? (Use arrow keys)
 » [Trial  77] Refusals: 25/100, KL divergence: 0.3209
   [Trial 188] Refusals: 34/100, KL divergence: 0.2809
   [Trial  62] Refusals: 39/100, KL divergence: 0.2790
   [Trial 141] Refusals: 40/100, KL divergence: 0.2788
   [Trial  80] Refusals: 42/100, KL divergence: 0.2708
   [Trial 135] Refusals: 43/100, KL divergence: 0.2195
   [Trial 108] Refusals: 48/100, KL divergence: 0.0346
   [Trial 116] Refusals: 50/100, KL divergence: 0.0305
   [Trial 130] Refusals: 57/100, KL divergence: 0.0299
   [Trial 117] Refusals: 58/100, KL divergence: 0.0255
   [Trial  86] Refusals: 60/100, KL divergence: 0.0239
   [Trial 157] Refusals: 69/100, KL divergence: 0.0208
   [Trial 110] Refusals: 70/100, KL divergence: 0.0166
   [Trial  63] Refusals: 76/100, KL divergence: 0.0161
   [Trial 167] Refusals: 84/100, KL divergence: 0.0092
   [Trial  96] Refusals: 95/100, KL divergence: 0.0066
   [Trial  84] Refusals: 96/100, KL divergence: 0.0019
   None (exit program)
```

The trials I kept:

```
? Which trial do you want to use? [Trial  77] Refusals: 25/100, KL divergence: 0.3209

Restoring model from trial 77...
* Reloading model...
Loading checkpoint shards: 100%|███████████████████████████████████| 4/4 [00:01<00:00,  2.33it/s]
* Abliterating...

? What do you want to do with the decensored model? Save the model to a local folder
? Path to the folder: /home/pszemraj/model-weights/llm/heretic/rnj-1-instruct-heretic-n200_25ref
Saving model...
Model saved to /home/pszemraj/model-weights/llm/heretic/rnj-1-instruct-heretic-n200_25ref.

? What do you want to do with the decensored model? Nothing (return to trial selection menu)

? Which trial do you want to use? [Trial 141] Refusals: 40/100, KL divergence: 0.2788

Restoring model from trial 141...
* Reloading model...
Loading checkpoint shards: 100%|███████████████████████████████████| 4/4 [00:01<00:00,  2.34it/s]
* Abliterating...

? What do you want to do with the decensored model? Save the model to a local folder
? Path to the folder: /home/pszemraj/model-weights/llm/heretic/rnj-1-instruct-heretic-n200_40ref
Saving model...
Model saved to /home/pszemraj/model-weights/llm/heretic/rnj-1-instruct-heretic-n200_40ref.
```

800 trials (~200 startup trials)

```
? Which trial do you want to use? (Use arrow keys)
 » [Trial 714] Refusals: 20/100, KL divergence: 0.3971
   [Trial 512] Refusals: 27/100, KL divergence: 0.3502
   [Trial 291] Refusals: 28/100, KL divergence: 0.3267
   [Trial 290] Refusals: 29/100, KL divergence: 0.3257
   [Trial 508] Refusals: 30/100, KL divergence: 0.3178
   [Trial 218] Refusals: 38/100, KL divergence: 0.3003
   [Trial 517] Refusals: 40/100, KL divergence: 0.2660
   [Trial 302] Refusals: 41/100, KL divergence: 0.0471
   [Trial 312] Refusals: 45/100, KL divergence: 0.0352
   [Trial 557] Refusals: 53/100, KL divergence: 0.0257
   [Trial 161] Refusals: 59/100, KL divergence: 0.0250
   [Trial 413] Refusals: 63/100, KL divergence: 0.0222
   [Trial 119] Refusals: 67/100, KL divergence: 0.0196
   [Trial 296] Refusals: 70/100, KL divergence: 0.0187
   [Trial 274] Refusals: 74/100, KL divergence: 0.0156
   [Trial 299] Refusals: 78/100, KL divergence: 0.0123
   [Trial 311] Refusals: 85/100, KL divergence: 0.0090
   [Trial 259] Refusals: 89/100, KL divergence: 0.0077
   [Trial 543] Refusals: 93/100, KL divergence: 0.0034
   [Trial 246] Refusals: 94/100, KL divergence: 0.0029
   [Trial 458] Refusals: 95/100, KL divergence: 0.0024
   [Trial 757] Refusals: 96/100, KL divergence: 0.0015
   None (exit program)
```

Comparison of low-refusal regime results:

| Run | Best low-refusal result |
| --- | --- |
| 200 trials | 25 refusals @ 0.3209 KL |
| 800 trials | 27 refusals @ 0.3502 KL |

The 200-trial result dominates the 800-trial result on both objectives. This is counterintuitive: more trials should not degrade the Pareto front. The effect was even more drastic with medgemma-27b-it (albeit a larger model). As you can imagine, this makes it hard to validate support/configuration for new architectures. For example, I've run longer studies to check why <some new arch> bottoms out at 30/100 refusals, and when the n=1000 run comes out worse than n=200, I don't have much to go on.
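To make the dominance claim concrete, here is a minimal check over the two best low-refusal points (the `dominates` helper is mine, purely illustrative):

```python
# Minimal Pareto-dominance check for two objective tuples (refusals, KL),
# both minimized: a dominates b if a is no worse on every objective and
# strictly better on at least one.
def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

best_200 = (25, 0.3209)  # best low-refusal trial from the 200-trial run
best_800 = (27, 0.3502)  # best low-refusal trial from the 800-trial run

print(dominates(best_200, best_800))  # True: fewer refusals AND lower KL
```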

Both runs used v1.1.0. The n_startup_trials parameter was manually scaled proportionally (200-250 for the 800-trial run).

Potential causes

Note

I checked item 1 and agree (it should be an easy fix). Items 2 and 3 were suggested by Claude after my probing at what else could be happening beyond simple random-seed variance; take them with a grain of salt.

  1. No seed parameter in TPESampler - Each run explores a completely different random trajectory. This alone could explain divergent outcomes, though the magnitude of the difference seems large for pure seed variance. (See the sketch after this list.)

    • When seed=None (the default), TPESampler creates a numpy.random.RandomState seeded from /dev/urandom or the system clock, so each run gets a different random state.
  2. Multivariate TPE in high-dimensional space - With ~10 parameters and multivariate=True, covariance estimation from the startup samples is noisy. Different initializations can lock the optimizer into different basins that it won't escape from.

  3. Disconnected Pareto-optimal regions - The data suggests two distinct regimes (note the KL cliff between ~40-48 refusals dropping from ~0.27 to ~0.03). MOTPE may struggle to traverse dominated space between disconnected basins.
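For item 1, a minimal sketch of a seeded sampler, assuming heretic constructs its sampler through Optuna's standard API (the specific values here are illustrative, not heretic's actual defaults):

```python
import optuna

# A seeded, multivariate TPE sampler: fixing `seed` makes the random
# startup trials and the subsequent TPE suggestions reproducible.
sampler = optuna.samplers.TPESampler(
    seed=42,               # currently unset in heretic, so every run diverges
    multivariate=True,     # joint modeling of the ~10 parameters
    n_startup_trials=200,  # random trials before TPE kicks in
)

# Multi-objective study over (refusals, KL divergence), both minimized.
study = optuna.create_study(
    directions=["minimize", "minimize"],
    sampler=sampler,
)
```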

Debugging limitations

On a related note, there's currently no mechanism to export/persist trial data, making it difficult to:

  • Compare parameter distributions across runs
  • Identify which regions of the search space each run explored
  • Warm-start from prior results

As part of debugging this, can we add support for saving at least the Pareto-frontier trials, if not the whole study? I can implement it manually, of course, but I think this would be useful for future issues as well; a sketch of what it could look like is below.
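A minimal sketch using Optuna's built-in persistence (the study name and file paths are placeholders; `study.best_trials` is Optuna's accessor for the non-dominated trials of a multi-objective study):

```python
import optuna

# Persist the study to SQLite so it survives the process and can be
# reloaded later for comparison or warm-starting.
study = optuna.create_study(
    study_name="rnj-1-instruct-abliteration",  # placeholder name
    storage="sqlite:///heretic_study.db",      # local SQLite file
    directions=["minimize", "minimize"],       # refusals, KL divergence
    load_if_exists=True,                       # resume instead of failing
)

# ... run optimization ...

# Export just the Pareto frontier: for a multi-objective study,
# study.best_trials returns the non-dominated trials.
for t in study.best_trials:
    print(t.number, t.values, t.params)

# Or dump all trials to a pandas DataFrame for offline analysis.
df = study.trials_dataframe()
df.to_csv("heretic_trials.csv", index=False)
```

Persisting to a storage backend would also enable comparing parameter distributions across runs and warm-starting, covering all three bullets above.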

Footnotes

[^1]: Already supported, as it uses the Gemma3ForCausalLM architecture.
