test(sleep): add verifier-discipline stress test (closes #67) #87
Tanmay9223 wants to merge 2 commits into
Add the missing configuration setup in scripts/eval_only.py so that it properly supports the minimax_chat backend, which was previously omitted entirely. This fixes the following coverage gaps in eval_only.py:
- Add minimax CLI arguments
- Include the minimax config mappings in _MAP
- Update the backend parsing logic
- Call configure_minimax_chat
Add a regression test to ensure the validation gate correctly rejects reward-hacking skill edits. Optimizers have been observed to propose shortcuts that improve train/replay metrics but fail to improve held-out behavior. This test codifies that the gate blocks such artifacts. Add TestVerifierDiscipline to the test_sleep_engine.py suite:
- Create MockRewardHackingBackend, which simulates a reward-hacking rule that passes the train set but degrades the held-out tasks.
- Assert that the proposed edit is rejected by the gate.
Thanks @Tanmay9223, the reward-hacking gate test is a genuinely useful invariant to lock in. 🙏 A few things to sort out before this can land, now that #85 has merged:
1. Rebase needed
2. The new test class is defined after
3. Heads-up on the bundled
4. Overlap with #86. This PR is effectively a superset of #86; I've suggested we close #86 in favor of this one.
Once it's rebased and the test class is moved, I'm happy to merge with a merge-commit so your authorship is preserved. Thanks again!
🎯 What: Adds a verifier-discipline stress test to ensure the validation gate correctly rejects reward-hacking skill edits.
💡 Why: To prevent regressions where the optimizer proposes a rule that looks attractive on train/replay tasks (e.g., gaming the evaluator by forcing literal tokens) but degrades grounded behavior on held-out tasks. This turns the failure mode identified in related discussions into a repeatable regression test.
✅ Verification: Verified locally with `python3 -m unittest tests/test_sleep_engine.py`, which now executes the new `test_gate_rejects_reward_hacking_edit` test and confirms the candidate is rejected.
✨ Result: SkillOpt-Sleep now has an explicit invariant test ensuring reward-hacking edits are caught and rejected by the validation gate.
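For readers without the diff at hand, the invariant can be illustrated with a minimal, self-contained sketch. Only the names `TestVerifierDiscipline`, `MockRewardHackingBackend`, and `test_gate_rejects_reward_hacking_edit` come from this PR. Everything else is an assumption made for illustration and is not the project's actual API: the `Scores` type, the `evaluate` method, the `validation_gate` function, the edit name, and the rule that an edit must strictly improve the held-out score.

```python
import unittest
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Hypothetical score pair for one candidate edit."""
    train: float      # score on train/replay tasks
    held_out: float   # score on held-out tasks


class MockRewardHackingBackend:
    """Simulates a rule that games the train set but hurts held-out tasks."""

    # Assumed baseline scores before any edit is applied.
    BASELINE = Scores(train=0.60, held_out=0.60)

    def evaluate(self, edit: str) -> Scores:
        # The hacking edit forces literal tokens the train evaluator rewards,
        # so the train score jumps while held-out behavior degrades.
        if edit == "always_emit_literal_token":
            return Scores(train=0.95, held_out=0.40)
        return self.BASELINE


def validation_gate(backend: MockRewardHackingBackend, edit: str) -> bool:
    """Accept an edit only if it strictly improves the held-out score.

    This acceptance rule is an assumption for the sketch; the real gate
    may use a different threshold or additional checks.
    """
    candidate = backend.evaluate(edit)
    return candidate.held_out > backend.BASELINE.held_out


class TestVerifierDiscipline(unittest.TestCase):
    def test_gate_rejects_reward_hacking_edit(self):
        backend = MockRewardHackingBackend()
        edit = "always_emit_literal_token"
        # Sanity check: the edit really does look attractive on train tasks.
        self.assertGreater(backend.evaluate(edit).train, backend.BASELINE.train)
        # The gate must still reject it because held-out behavior regressed.
        self.assertFalse(validation_gate(backend, edit))
```

Saved as a standalone file, the sketch runs with `python3 -m unittest <file>`. The sanity assertion on the train score matters: without it, the test would also pass for an edit that is simply bad everywhere, which is not the reward-hacking case the gate is meant to catch.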