test(sleep): add verifier-discipline stress test (closes #67) by Tanmay9223 · Pull Request #87 · microsoft/SkillOpt

Tanmay9223 · 2026-06-24T18:25:15Z

🎯 What: Adds a verifier-discipline stress test to ensure the validation gate correctly rejects reward-hacking skill edits.
💡 Why: To prevent regressions where the optimizer proposes a rule that looks attractive on train/replay tasks (e.g., gaming the evaluator by forcing literal tokens) but degrades grounded behavior on held-out tasks. This turns the failure mode identified in related discussions into a repeatable regression test.
✅ Verification: Verified locally with python3 -m unittest tests/test_sleep_engine.py, which now executes the new test_gate_rejects_reward_hacking_edit and confirms the candidate is rejected.
✨ Result: SkillOpt-Sleep now has an explicit invariant test ensuring reward-hacking edits are caught and rejected by the validation gate.

Add missing configuration setup in scripts/eval_only.py to properly support the minimax_chat backend, which was entirely omitted. Fix the following coverage gaps in eval_only.py: - Add minimax CLI arguments - Include the minimax config mappings in _MAP - Update the backend parsing logic - Call configure_minimax_chat

Add a regression test to ensure the validation gate correctly rejects reward-hacking skill edits. It has been observed that optimizers sometimes propose shortcuts that improve train/replay metrics but fail to improve held-out behavior. This test codifies that the gate blocks such artifacts. Add TestVerifierDiscipline to the test_sleep_engine.py suite: - Create MockRewardHackingBackend that simulates a reward-hacking rule which passes the train set but degrades the held-out tasks. - Assert that the proposed edit is rejected by the gate.

Yif-Yang · 2026-06-25T18:56:24Z

Thanks @Tanmay9223 — the reward-hacking gate test is a genuinely useful invariant to lock in. 🙏

A few things to sort out before this can land, now that #85 has merged:

1. Rebase needed — scripts/eval_only.py now conflicts. #85 just landed backend wiring on main that touches the same import block and configure_* section this PR edits. A git rebase origin/main (or merge) will be needed. Important detail so you don't drop the wrong hunks: #85 only added the qwen_chat wiring — it did not add any MiniMax code to eval_only.py (no import, no configure_minimax_chat call, no routing), despite #85's title mentioning minimax. So on rebase:

The qwen hunks here will conflict and can be dropped (already on main via fix(eval-only): call configure_qwen_chat so local LLM endpoints can be used #85).
The MiniMax hunks — the import, the --backend choices entry, the elif backend in {"minimax", "minimax_chat"} routing branch, the model-sentinel rewrite, and the configure_minimax_chat(...) call — are still needed; please keep them. main's eval_only.py is currently missing all of those, so your PR is what closes that gap.

2. The new test class is defined after unittest.main(). TestVerifierDiscipline sits below the if __name__ == "__main__": unittest.main() block (around line 1009), so a direct python tests/test_sleep_engine.py run exits before the class is even defined and silently skips it. pytest and python -m unittest still discover it (which is why it passes in CI), but please move the class above that __main__ block so the file-run path covers it too.

3. Heads-up on the bundled skillopt/utils/json_utils.py change. This PR also adds a _looks_json_like guard to the shared JSON extractor, which isn't mentioned in the title/description. I checked it and it's not a regression — inputs like {op: "add", ...} already returned None on main — so I'm fine keeping it. But since json_utils is shared by all backends' optimizer path, a one-line note in the PR description explaining the guard would help future git blame.

4. Overlap with #86. This PR is effectively a superset of #86 — I've suggested we close #86 in favor of this one.

Once it's rebased and the test class is moved, I'm happy to merge with a merge-commit so your authorship is preserved. Thanks again!

Yif-Yang and others added 2 commits June 24, 2026 23:44

Yif-Yang mentioned this pull request Jun 25, 2026

feat(eval): add missing minimax backend configuration #86

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test(sleep): add verifier-discipline stress test (closes #67)#87

test(sleep): add verifier-discipline stress test (closes #67)#87
Tanmay9223 wants to merge 2 commits into
microsoft:mainfrom
Tanmay9223:test-verifier-discipline

Tanmay9223 commented Jun 24, 2026

Uh oh!

Yif-Yang commented Jun 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Tanmay9223 commented Jun 24, 2026

Uh oh!

Yif-Yang commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Yif-Yang commented Jun 25, 2026 •

edited

Loading