feat(rl): implement TurnLevelReward infrastructure for Multi-Turn GRPO #3451
Open
RUFFY-369 wants to merge 25 commits into NousResearch:main
Extends HermesAgentBaseEnv with:
- HumanEvalPack dataset (164 buggy Python functions)
- Workspace scaffolding (`buggy.py` + `tests.py` uploaded to sandbox)
- Multi-signal reward: test_signal (0.5), diagnosis (0.3), efficiency (0.2)
- Terminal + file toolsets for iterative debugging
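To make the weighting concrete, here is a minimal sketch of how the three signals could be combined. The signal names and weights come from the list above; the function and variable names are illustrative assumptions, not this PR's actual API.

```python
# Illustrative sketch only -- the weights below come from the PR description,
# but the function/variable names are assumptions, not the real API.
REWARD_WEIGHTS = {
    "test_signal": 0.5,  # fraction of tests passing after the fix
    "diagnosis": 0.3,    # quality of the bug diagnosis
    "efficiency": 0.2,   # how efficiently the agent reached the fix
}

def combine_reward_signals(signals: dict[str, float]) -> float:
    """Combine per-signal scores in [0, 1] into a single scalar reward."""
    return sum(REWARD_WEIGHTS[name] * signals.get(name, 0.0) for name in REWARD_WEIGHTS)

# Example: all tests pass, decent diagnosis, mediocre efficiency
print(combine_reward_signals({"test_signal": 1.0, "diagnosis": 0.7, "efficiency": 0.4}))  # ~0.79
```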
Note
Research Context: This PR implements the TurnLevelReward infrastructure for Multi-Turn GRPO, building directly on the agent loop and environment integration in PR #3448.
What does this PR do?
Important
This pull request is based on the changes in PR #3448 (CodeDebug Environment). Please merge PR #3448 first before reviewing/merging this one.
This PR introduces the core infrastructure for Multi-Turn Group Relative Policy Optimization (MT-GRPO). It enables environments to provide granular, turn-by-turn reward signals, which are essential for effective credit assignment in multi-turn reasoning trajectories.
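As a rough illustration of the contract this enables, the sketch below shows what a turn-level reward mixin could look like. Only the `TurnLevelRewardMixin` name and the one-reward-per-assistant-turn contract come from this PR; the method name, signature, and use of `abc` here are assumptions.

```python
from abc import ABC, abstractmethod

class TurnLevelRewardMixin(ABC):
    """Illustrative sketch: environments subclassing this marker mixin
    provide one reward per assistant turn instead of a single trajectory
    scalar. The real mixin in this PR may differ in shape and naming."""

    @abstractmethod
    def compute_turn_rewards(self, trajectory: list[dict]) -> list[float]:
        """Return one reward per assistant turn in `trajectory`."""
        ...
```

With per-turn rewards available, the trainer can attribute credit to the specific turn that earned (or lost) reward, rather than smearing a single scalar across the whole trajectory.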
Related Issue
Fixes # (Initial infrastructure for MT-GRPO support)
Type of Change
Changes Made
- Environments can now return `List[float]` reward signals (one reward per assistant turn).
- Updated `ScoredDataItem` construction to detect `TurnLevelRewardMixin` and handle turn-level signals from environments (see the sketch below).
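A minimal sketch of that detection step, building on the mixin sketch above. Only the `TurnLevelRewardMixin` detection is described by this PR; the function name and the returned dict's layout are assumptions.

```python
def build_scored_item(env: object, trajectory: list[dict], scalar_score: float) -> dict:
    """Sketch of the branch at ScoredDataItem construction time.

    Reuses the TurnLevelRewardMixin sketch from above; the "messages" and
    "scores" keys mirror the JSONL fields described in the PR but are
    otherwise assumptions.
    """
    if isinstance(env, TurnLevelRewardMixin):
        # Turn-level path: one reward per assistant turn, e.g. [0.1, 0.4, 0.8]
        scores = env.compute_turn_rewards(trajectory)
    else:
        # Legacy path: a single trajectory-level scalar
        scores = scalar_score
    return {"messages": trajectory, "scores": scores}
```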
How to Test
- Verify that the `scores` field in the resulting JSONL contains a list of rewards (matching the number of assistant turns) instead of a single scalar.
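One way to automate that check, assuming a rollout file named `rollouts.jsonl` with `messages`/`scores` fields (both names are assumptions based on the description above):

```python
import json

# Assumed file name and field names, for illustration only.
with open("rollouts.jsonl") as f:
    for line in f:
        item = json.loads(line)
        n_assistant = sum(1 for m in item["messages"] if m.get("role") == "assistant")
        assert isinstance(item["scores"], list), "expected a per-turn reward list"
        assert len(item["scores"]) == n_assistant, "expected one reward per assistant turn"
```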
Checklist
Code
- Commit messages follow the conventional format (`fix(scope):`, `feat(scope):`, etc.)
- `pytest tests/ -q` (Note: system-level version conflict in env, but rollout verified)

Documentation & Housekeeping
- Updated documentation (`docs/`, docstrings) — or N/A
- Updated `cli-config.yaml.example` if I added/changed config keys — or N/A
- Updated `CONTRIBUTING.md` or `AGENTS.md` if I changed architecture or workflows — or N/A

For New Skills
(N/A - This is a core RL Infrastructure change)
Screenshots / Logs
(Base environment correctly detects subclasses of `TurnLevelRewardMixin` and produces JSONL trajectories with array-based `scores`.)

cc @teknium1