feat(sleep): cross-model scaling results (+11.9 on nano) + dual-model experiment runner #89

Merged: Yif-Yang merged 1 commit into microsoft:main from Yif-Yang:sleep/nano-results on Jun 25, 2026

@Yif-Yang Yif-Yang commented Jun 25, 2026

Contributor

Summary

  • New headline result: Running SkillOpt-Sleep on a weaker target model (GPT-5.4-nano, optimized by GPT-5.5) yields a +11.9 pt gain on SearchQA (0.560→0.679, full 1400-item test, gated), nearly 2× the GPT-5.5 result (+6.0). Two independent configs (cumulative replay and recall_k=20) agree within 0.4 pt.
  • Hyperparameter ablation (§4): Swept dream_factor, rollouts, per_night, and nights; every direction away from the shipped defaults hurts, so within the swept ranges the defaults are the best setting out of the box.
  • run_nightly.py: New --target-model / --optimizer-model flags enabling split-model experiments via DualBackend.
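The headline numbers can be checked with a few lines of arithmetic. This is only a sanity check on the figures quoted above (0.560, 0.679, and the +6.0 pt GPT-5.5 gain); the variable names are illustrative and not taken from the repository.

```python
# Accuracy values quoted in the summary (fractions on the 1400-item test set).
baseline = 0.560
after_sleep = 0.679

# Gain expressed in percentage points, rounded to one decimal.
gain_pt = round((after_sleep - baseline) * 100, 1)

# GPT-5.5 gain quoted in the summary, in percentage points.
gpt55_gain_pt = 6.0

# Ratio of the nano gain to the GPT-5.5 gain.
ratio = round(gain_pt / gpt55_gain_pt, 2)

print(gain_pt, ratio)  # 11.9 1.98
```

The ratio of about 1.98 is what the summary rounds to "nearly 2×".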

Changes

| File | What |
| --- | --- |
| docs/sleep/RESULTS.md | Added §2 (cross-model scaling), §4 (hyperparam ablation), renumbered subsequent sections, updated Reproduce commands |
| skillopt_sleep/experiments/run_nightly.py | New file: the multi-night experiment harness with replay-mode, dream, and dual-model support |
| docs/sleep/blog_runs/sweep_nano/*.json | Raw result JSONs for the two headline configs |

Test plan

  • Verify RESULTS.md renders correctly on GitHub
  • Confirm run_nightly.py imports cleanly: PYTHONPATH=. python -c "from skillopt_sleep.experiments.run_nightly import main"
  • Reproduce one cell: --target-model gpt-5.4-nano --optimizer-model gpt-5.5 --benchmarks searchqa --gate on --replay-mode cumulative --nights 1 --per-night 6 --test-limit 40
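The reproduce step above lists only the flags, so a small helper can assemble the full command line for one cell. This is a minimal sketch: the flags and values are the ones from the test plan, but launching the harness with `python -m skillopt_sleep.experiments.run_nightly` is an assumption inferred from the import path, and `build_repro_argv` is a hypothetical helper, not part of the repository.

```python
import shlex


def build_repro_argv(test_limit: int = 40) -> list[str]:
    """Assemble the argv for the single-cell reproduction in the test plan."""
    return [
        # Assumption: the harness is runnable as a module; adjust if the
        # repository documents a different entry point.
        "python", "-m", "skillopt_sleep.experiments.run_nightly",
        # Flags and values below are copied from the test plan.
        "--target-model", "gpt-5.4-nano",
        "--optimizer-model", "gpt-5.5",
        "--benchmarks", "searchqa",
        "--gate", "on",
        "--replay-mode", "cumulative",
        "--nights", "1",
        "--per-night", "6",
        "--test-limit", str(test_limit),
    ]


argv = build_repro_argv()
# Print a shell-safe command string that can be pasted into a terminal.
print("PYTHONPATH=. " + shlex.join(argv))
```

Keeping the arguments as a list also means the same values can be passed to `subprocess.run(argv)` without shell-quoting concerns.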

🤖 Generated with Claude Code

…ram ablation

Update RESULTS.md with:
- §2: GPT-5.4-nano target yields +11.9 pt (0.560→0.679) on SearchQA,
  nearly 2× the GPT-5.5 gain (+6.0), suggesting a bigger benefit where headroom exists
- §4: Hyperparameter sweep finds no swept setting that beats the shipped defaults

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@Yif-Yang Yif-Yang force-pushed the sleep/nano-results branch from 407c485 to ce354f6 Compare June 25, 2026 17:36
@Yif-Yang Yif-Yang merged commit 9de9220 into microsoft:main Jun 25, 2026
1 check was pending