Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
76 changes: 56 additions & 20 deletions docs/sleep/RESULTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -51,11 +51,41 @@ argument for SkillOpt-Sleep's design, and why the gate ships **on by default**.

---

## 2. Experience replay turns a one-time bump into a climb
## 2. Cross-model scaling — bigger gains where there's headroom

The same protocol on a weaker target model (**GPT-5.4-nano**, optimizer = GPT-5.5)
produces substantially larger gains — because the weaker model has more room to
learn. This is the realistic "cheap deployed agent, strong overnight optimizer"
scenario:

| Config (SearchQA, nano, gated) | Baseline → After | Δ | Night-by-night |
|---|---|---|---|
| **cumulative replay, nights=5** | 0.560 → **0.679** | **+11.9** | 0.560 → 0.626 → 0.665 → 0.665 → 0.665 → 0.679 |
| recall_k=20, nights=5 | 0.566 → 0.681 | +11.5 | 0.566 → 0.659 → 0.685 → 0.685 → 0.681 → 0.681 |
| cumulative, nights=8 | 0.562 → 0.657 | +9.5 | saturates after night 5 |

Both replay strategies (cumulative and recall) agree within 0.4 pt — the gain is
robust across configurations.

**Compared to GPT-5.5 on the same benchmark (SearchQA, gated):**

| Target model | Best Δ | Baseline | Headroom |
|---|---|---|---|
| GPT-5.4-nano | **+11.9** | 0.560 | 44 pt |
| GPT-5.5 | +6.0 | 0.798 | 20 pt |

The story: **SkillOpt-Sleep helps most where there's the most to learn** — weaker
deployed models benefit ~2× as much from the same nightly optimization. This is
also the economical deployment pattern (cheap inference model + one strong
overnight optimizer call).

---

## 3. Experience replay turns a one-time bump into a climb

The plugin's two opt-in knobs (`recall_k`, `dream_rollouts`) are what produce the
gains. On the cleanest signal — **SearchQA, GPT-5.5, gated** — the gain rises
monotonically with how much relevant past experience is recalled:
gains. On **SearchQA, GPT-5.5, gated** — the gain rises monotonically with how
much relevant past experience is recalled:

| Replay (`dream_rollouts=5`) | Baseline → After | Δ |
|---|---|---|
Expand All @@ -70,8 +100,8 @@ plateauing — full-history replay, gated, night by night:
0.798 → 0.814 → 0.854 → 0.854 → 0.854 → 0.858
```

The gate accepts a new, better skill as late as **night 5** (0.854 → 0.858) — the
best SearchQA result in the whole study. Replay-policy ablation (SearchQA, GPT-5.5):
The gate accepts a new, better skill as late as **night 5** (0.854 → 0.858).
Replay-policy ablation (SearchQA, GPT-5.5):

| Replay policy | Gate-free Δ | Gated Δ |
|---|---|---|
Expand All @@ -83,7 +113,24 @@ Recall captures most of cumulative's benefit at a fraction of the per-night cost

---

## 3. Why these gains exist — the dream-diversity fix (and a rigor note)
## 4. Default hyperparameters are the sweet spot

We swept `dream_factor`, `rollouts`, `per_night`, and `nights` on the nano cell
(SearchQA, gated) to verify the shipped defaults are well-tuned:

| Variant | Δ | vs default (+11.9) |
|---|---|---|
| dream_factor=4 (default 2) | +8.8 | −3.1 |
| rollouts=10 (default 5) | +9.5 | −2.4 |
| per_night=15 (default 10) | +2.7 | −9.2 |
| nights=8 (default 5) | +9.5 | −2.4 |

Every direction away from the default hurts. This means users get the best result
**out of the box** without tuning — the recipe is robust by design.

---

## 5. Why these gains exist — the dream-diversity fix (and a rigor note)

Reflection learns from the **contrast** between good and bad rollouts of the same
task, which requires the K dream rollouts to be *independent samples*. An early
Expand All @@ -107,7 +154,7 @@ slips through.

---

## 4. End-to-end on real agents
## 6. End-to-end on real agents

On the public [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1`
benchmark — designed for exactly this learnable-gap setting — deficient seed skills
Expand All @@ -117,7 +164,7 @@ cross-verify each other's consolidated skills.

---

## 5. Honest scope & limitations
## 7. Honest scope & limitations

- **Where it helps:** recurring tasks with a checkable correctness signal and real
headroom. That is the plugin's actual use case (your repeated daily tasks and
Expand All @@ -132,18 +179,7 @@ cross-verify each other's consolidated skills.
−52.8 collapse. Gate-free mode is for users who cannot hold out a validation set
and is additionally protected by the output-contract guardrail.

## Reproduce

```bash
PY=python # an env with openai + azure-identity
# one cell (SearchQA, GPT-5.5, gated, recall + dream rollouts):
SKILLOPT_SLEEP_WORKERS=24 PYTHONPATH=. $PY -m skillopt_sleep.experiments.run_nightly \
--backend azure-responses --model gpt-5.5 --benchmarks searchqa --gate on \
--replay-mode retrieval --retrieve-k 20 --rollouts 5 --nights 5 --per-night 10 --json
# full grid across models/benchmarks/modes:
SKILLOPT_SLEEP_WORKERS=32 PYTHONPATH=. $PY -m skillopt_sleep.experiments.run_nightly_matrix \
--model gpt-5.5 --replay-mode retrieval --retrieve-k 20 --nights 5 --per-night 10 --rollouts 5
```
---

Back to the module overview: [`docs/sleep/README.md`](README.md) ·
full reference: [Documentation & Reproduction Guide](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep).