feat(rl): implement TurnLevelReward infrastructure for Multi-Turn GRPO #3451

Open

RUFFY-369 wants to merge 25 commits into NousResearch:main from RUFFY-369:feat/turn-level-reward

Conversation

@RUFFY-369 RUFFY-369 commented Mar 27, 2026

Note

Research Context: This PR implements the TurnLevelReward infrastructure for Multi-Turn GRPO, building directly on the agent loop and environment integration in PR #3448.

What does this PR do?

Important

This pull request is based on the changes in PR #3448 (CodeDebug Environment). Please merge PR #3448 first before reviewing/merging this one.

This PR introduces the core infrastructure for Multi-Turn Group Relative Policy Optimization (MT-GRPO). It enables environments to provide granular, turn-by-turn reward signals, which are essential for effective credit assignment in multi-turn reasoning trajectories.
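To make the credit-assignment point concrete, here is a minimal sketch (not code from this PR) of how a list of per-turn rewards could be normalized group-relatively, GRPO-style, instead of assigning one scalar to the whole trajectory. The function name and the flat mean/std normalization scheme are illustrative assumptions:

```python
# Hypothetical sketch: group-relative advantages from turn-level rewards.
# `group_rewards[i][t]` is the reward for assistant turn t of rollout i.
from statistics import mean, pstdev


def turn_level_advantages(group_rewards: list[list[float]]) -> list[list[float]]:
    """Normalize each per-turn reward against the whole group's rewards.

    Trajectories may have different turn counts, so all rewards are pooled
    to compute the group mean and standard deviation.
    """
    flat = [r for traj in group_rewards for r in traj]
    mu = mean(flat)
    sigma = pstdev(flat) or 1.0  # avoid division by zero for uniform groups
    return [[(r - mu) / sigma for r in traj] for traj in group_rewards]
```

With a scalar reward, every turn in a rollout receives the same advantage; with a per-turn list, an early good tool call and a late bad answer can be credited separately.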

Related Issue

Fixes # (Initial infrastructure for MT-GRPO support)

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • environments/turn_level_reward.py: Added TurnLevelRewardMixin to support List[float] reward signals (one reward per assistant turn).
  • environments/hermes_base_env.py: Refactored ScoredDataItem construction to detect TurnLevelRewardMixin and handle turn-level signals from environments.
  • environments/agent_loop.py: Standardized logging and formatted the core agent loop for production readiness.
  • model_tools.py: Performed final formatting pass and cleaned up imports for better readability.
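The detection described above can be sketched as follows. This is an illustrative mock, not the PR's actual classes: the method names (`score_turns`, `score`) and the `build_scores` helper are assumptions; only the mixin name and the `isinstance`-style detection come from the description.

```python
# Hypothetical sketch of the mixin pattern described above.
from typing import Any, List, Union


class TurnLevelRewardMixin:
    """Marker/base for environments that score each assistant turn separately."""

    def score_turns(self, trajectory: Any) -> List[float]:
        raise NotImplementedError


def build_scores(env: Any, trajectory: Any) -> Union[float, List[float]]:
    # Turn-level environments return one float per assistant turn;
    # all other environments keep the legacy scalar reward.
    if isinstance(env, TurnLevelRewardMixin):
        return env.score_turns(trajectory)
    return env.score(trajectory)
```

Because the mixin acts as an opt-in marker, existing environments that never inherit it are untouched, which is what keeps the change non-breaking.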

How to Test

  1. Mixed-Reward Check: Initialize an environment that uses TurnLevelRewardMixin.
  2. Trajectory Collection: Run a rollout and verify the scores field in the resulting JSONL contains a list of rewards (matching the number of assistant turns) instead of a single scalar.
  3. Backward Compatibility: Run a standard environment (like CodeDebugEnv) and verify it still returns single scalar rewards as expected without breaking the pipeline.
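Steps 2 and 3 could be automated with a small checker along these lines. The record field names (`messages`, `scores`, `role`) are assumptions about the JSONL schema, chosen to match the description:

```python
# Hypothetical sketch: validate one rollout JSONL line, assuming each record
# carries "messages" (chat turns) and "scores" (scalar or per-turn list).
import json
from typing import List, Union


def check_scores(jsonl_line: str) -> Union[float, List[float]]:
    rec = json.loads(jsonl_line)
    scores = rec["scores"]
    if isinstance(scores, list):
        # Turn-level environment: one reward per assistant turn.
        n_turns = sum(1 for m in rec["messages"] if m["role"] == "assistant")
        assert len(scores) == n_turns, (len(scores), n_turns)
    else:
        # Legacy environment: a single scalar reward.
        assert isinstance(scores, (int, float)), type(scores)
    return scores
```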

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q (Note: System-level version conflict in env, but rollout verified)
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Ubuntu 22.04

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

For New Skills

(N/A - This is a core RL Infrastructure change)

Screenshots / Logs

The base environment correctly detects subclasses of TurnLevelRewardMixin and produces JSONL trajectories with array-based scores.

cc @teknium1

RUFFY-369 and others added 25 commits March 24, 2026 18:02
Extends HermesAgentBaseEnv with:
- HumanEvalPack dataset (164 buggy Python functions)
- Workspace scaffolding (buggy.py + tests.py uploaded to sandbox)
- Multi-signal reward: test_signal (0.5), diagnosis (0.3), efficiency (0.2)
- Terminal + file toolsets for iterative debugging