feat(sleep): cross-model scaling results (+11.9 on nano) + dual-model experiment runner by Yif-Yang · Pull Request #89 · microsoft/SkillOpt

Yif-Yang · 2026-06-25T17:27:42Z

Summary

- New headline result: running SkillOpt-Sleep on a weaker target model (GPT-5.4-nano, optimized by GPT-5.5) yields a +11.9 pt gain on SearchQA (0.560→0.679, full 1400-item test, gated), nearly 2× the GPT-5.5 result (+6.0). Two independent configs (cumulative replay + `recall_k=20`) agree within 0.4 pt.
- Hyperparameter ablation (§4): swept `dream_factor`, `rollouts`, `per_night`, and `nights`. Every tested move away from the shipped defaults hurts, so the defaults are the best of the configurations tried and users need no tuning out of the box.
- `run_nightly.py`: new `--target-model` / `--optimizer-model` flags enable split-model experiments via `DualBackend`.
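
The headline arithmetic can be checked directly from the figures quoted in the summary. This is only a sanity-check sketch; the helper name `gain_points` is illustrative and is not part of the repository.

```python
def gain_points(before: float, after: float) -> float:
    """Absolute accuracy gain in percentage points, rounded to one decimal."""
    return round((after - before) * 100, 1)


# Scores quoted in the summary for the GPT-5.4-nano target on SearchQA.
nano_gain = gain_points(0.560, 0.679)

# Gain quoted in the summary for the GPT-5.5 target.
gpt55_gain = 6.0

# Ratio of the two gains, which the summary describes as "nearly 2x".
ratio = round(nano_gain / gpt55_gain, 2)

print(nano_gain, ratio)
```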

Changes

| File | What |
| --- | --- |
| `docs/sleep/RESULTS.md` | Added §2 (cross-model scaling) and §4 (hyperparam ablation), renumbered subsequent sections, updated Reproduce commands |
| `skillopt_sleep/experiments/run_nightly.py` | New file: the multi-night experiment harness with replay-mode, dream, and dual-model support |
| `docs/sleep/blog_runs/sweep_nano/*.json` | Raw result JSONs for the two headline configs |
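
To illustrate what a split-model setup could look like, here is a minimal sketch of two CLI flags feeding a backend that routes by role. Only the flag names `--target-model` and `--optimizer-model` and the class name `DualBackend` come from this PR; the class body, the `model_for` method, the defaults, and the fallback behaviour are assumptions made for illustration and may differ from the real `run_nightly.py`.

```python
import argparse
from dataclasses import dataclass


@dataclass
class DualBackend:
    """Hypothetical stand-in: picks one of two models depending on the role."""

    target_model: str     # model whose skill is evaluated and improved
    optimizer_model: str  # model that proposes the improvements

    def model_for(self, role: str) -> str:
        if role == "target":
            return self.target_model
        if role == "optimizer":
            return self.optimizer_model
        raise ValueError(f"unknown role: {role!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="split-model sketch")
    # Flag names are taken from the PR summary; the defaults are assumptions.
    parser.add_argument("--target-model", default="gpt-5.5")
    parser.add_argument("--optimizer-model", default=None)
    return parser


def backend_from_args(argv: list[str]) -> DualBackend:
    args = build_parser().parse_args(argv)
    # Assumption: when no optimizer is given, one model plays both roles.
    optimizer = args.optimizer_model or args.target_model
    return DualBackend(args.target_model, optimizer)


backend = backend_from_args(
    ["--target-model", "gpt-5.4-nano", "--optimizer-model", "gpt-5.5"]
)
print(backend.model_for("target"), backend.model_for("optimizer"))
```

Keeping the role-to-model mapping in one small object means the rest of a harness can ask for "the target" or "the optimizer" without knowing whether the two are the same model.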

Test plan

- Verify `RESULTS.md` renders correctly on GitHub
- Confirm `run_nightly.py` imports cleanly: `PYTHONPATH=. python -c "from skillopt_sleep.experiments.run_nightly import main"`
- Reproduce one cell: `--target-model gpt-5.4-nano --optimizer-model gpt-5.5 --benchmarks searchqa --gate on --replay-mode cumulative --nights 1 --per-night 6 --test-limit 40`
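
The reproduce step lists only flags, so the sketch below assembles them into a full argument list. The flags and values are copied from the test plan; invoking the harness as `python -m skillopt_sleep.experiments.run_nightly` is an assumption based on the import path shown above, and the helper name `reproduce_command` is illustrative.

```python
import shlex
import sys


def reproduce_command(test_limit: int = 40) -> list[str]:
    """Build the argument list for the single-cell reproduction run."""
    return [
        # Assumed entry point: the module is run with "-m" from the repo root.
        sys.executable, "-m", "skillopt_sleep.experiments.run_nightly",
        # Flags and values below are copied from the test plan.
        "--target-model", "gpt-5.4-nano",
        "--optimizer-model", "gpt-5.5",
        "--benchmarks", "searchqa",
        "--gate", "on",
        "--replay-mode", "cumulative",
        "--nights", "1",
        "--per-night", "6",
        "--test-limit", str(test_limit),
    ]


cmd = reproduce_command()
# Print a shell-safe rendering of the command for copy and paste.
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` from the repository root, with `PYTHONPATH=.` set as in the import check.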

🤖 Generated with Claude Code

…ram ablation

Update RESULTS.md with:
- §2: GPT-5.4-nano target yields +11.9 pt (0.560→0.679) on SearchQA, 2× the GPT-5.5 gain, demonstrating bigger benefit where headroom exists
- §4: Hyperparameter sweep confirms shipped defaults are optimal

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Yif-Yang force-pushed the sleep/nano-results branch from 407c485 to ce354f6 Compare June 25, 2026 17:36

Yif-Yang merged commit 9de9220 into microsoft:main Jun 25, 2026
1 check was pending

Yif-Yang merged 1 commit into microsoft:main from Yif-Yang:sleep/nano-results
