
RLMEnv: Simplify constructor and internals#966

Merged
snimu merged 13 commits into main from sebastian/rlm-args-reduction-2026-02-27
Mar 2, 2026
Conversation

@snimu (Contributor) commented Feb 27, 2026

Description

  • Remove 14 unused/niche constructor args that were silently swallowed via **kwargs or had no
    remaining use case (interception_host, interception_port, interception_url, execution_backend,
    context_key, sandbox_start_command, sandbox_client_max_workers, root_tool_serialization,
    stagger/jitter params, etc.)
  • Remove _InterceptionPool singleton and all shared-pool branching — each RLMEnv instance now owns
    its own interception server and tunnel (this reverts a recent change of mine that was poorly
    motivated and insufficiently thought through)
  • Add explicit max_turns: int = 50 constructor param (previously inherited a default of 10 from
    StatefulToolEnv, easily lost via **kwargs)
  • Rename sub_tool_max_turns → sub_llm_max_turns for consistency with max_sub_llm_parallelism and
    the sub_llm_* metric names
  • Hardcode interception_port=0 (OS-assigned) and bind_host="127.0.0.1" — the old configurability only
    mattered for the now-removed pool
  • Update docs and docstring to remove outdated claims

Also adds the sub_llm_max_completion_tokens arg to control the total number of completion tokens across all sub-LLM calls. max_tokens, as set by prime-rl, already controlled the per-sub-LLM-call token count; this new arg controls the rollout-wide budget. The RLM is told this budget in the system prompt and the number of completion tokens used so far in each llm_batch call. Enforcement isn't perfect due to the parallelism of sub-LLM calls, but it works fairly well. Once the budget is reached, sub-LLMs may make no further calls and all calls to llm_batch fail, but the RLM can still finish its work in other ways.

Note: requires small changes to the -rlm environments.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Note

Medium Risk
Medium risk because it makes breaking constructor/API changes and alters interception/tunnel lifecycle and sub-LLM execution behavior (timeouts, batching, and new budget-based early exits). Main failure modes are misconfigured integrations and unexpected skipped llm_batch calls under parallelism.

Overview
Simplifies RLMEnv’s public API and internals by removing a large set of constructor knobs (e.g., interception host/port/url config, context key overrides, stagger/jitter, sandbox client sizing, deprecated backend params) and introducing an explicit max_turns parameter (replacing the prior max_iterations pass-through).

Changes interception behavior by deleting the _InterceptionPool singleton and shared-pool code paths; each RLMEnv instance now starts and owns its own interception server/tunnel, with interception binding/port selection effectively hardcoded (localhost + OS-assigned port).

Adds a new rollout-wide sub-LLM completion-token budget via sub_llm_max_completion_tokens, enforced both before starting llm_batch and during sub-LLM tool loops (with a forced final-answer call), and surfaces budget info in the root system prompt and llm_batch summary output.

Tests and docs are updated accordingly: rename sub_tool_max_turns → sub_llm_max_turns, remove pool tests, fold sandbox tests into test_rlm_env.py, and trim the experimental README's RLMEnv section.

Written by Cursor Bugbot for commit 0722cae.

snimu and others added 4 commits February 27, 2026 13:53
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  sandbox_id,
  cmd,
- timeout=self.env._compute_install_wait_seconds(),
+ timeout=self.env.max_startup_wait_seconds,
Pip install timeout no longer scales with packages

Low Severity

The removed _compute_install_wait_seconds() scaled the pip install timeout based on the number of packages (30s per package, minimum max_startup_wait_seconds). Now using the flat max_startup_wait_seconds (default 120s) means environments with many pip_install_packages (5+) may time out during installation where they previously succeeded.
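
The removed scaling, as described in the comment above, can be sketched as follows (the function name mirrors the removed helper; the exact implementation may have differed):

```python
def compute_install_wait_seconds(num_packages: int,
                                 max_startup_wait_seconds: int = 120) -> int:
    # Old behavior: 30s per pip package, floored at max_startup_wait_seconds.
    return max(max_startup_wait_seconds, 30 * num_packages)


# With few packages, old and new timeouts agree (120s).
# With 5+ packages, the old timeout exceeded the flat 120s default:
#   compute_install_wait_seconds(5) -> 150, but the new flat timeout stays 120.
```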


snimu (Contributor, Author):

If somebody installs that many packages, they know what they're doing and can simply increase max_startup_wait_seconds.

snimu and others added 2 commits March 1, 2026 11:33
The type checker flags 5 unresolved-attribute errors because
_interception_server is typed as InterceptionServer | None.
Use cast() at each access site to narrow the type, since these
code paths only run when interception is active (not gateway mode).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
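
The narrowing pattern described in this commit can be illustrated like so — the class and attribute names are stand-ins for the real ones:

```python
from typing import cast


class InterceptionServer:  # stand-in for the real server class
    def port(self) -> int:
        return 4321


class Env:
    def __init__(self, gateway_mode: bool):
        # Typed as Optional, so every attribute access on it is flagged
        # by the type checker unless the None case is narrowed away.
        self._interception_server: InterceptionServer | None = (
            None if gateway_mode else InterceptionServer()
        )

    def interception_port(self) -> int:
        # This path only runs when interception is active (not gateway mode),
        # so cast() tells the checker the value cannot be None here.
        return cast(InterceptionServer, self._interception_server).port()
```

Note that `cast()` is a runtime no-op: it documents the invariant for the type checker but performs no check, so it relies on the "only runs when interception is active" guarantee holding.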
The validation checks correctly use "heavy" but the error messages
still said "high", which would mislead users into using an invalid value.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
snimu and others added 5 commits March 1, 2026 11:48
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a rollout-level completion-token budget shared across all sub-LLM
calls. When set, the environment tracks cumulative sub-LLM completion
tokens and refuses new calls once the budget is reached. The root model
is informed of the budget in its system prompt and in the per-batch
summary printed after each llm_batch() call. None (default) means
unlimited, preserving backward compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move all sandbox backend tests into the main RLM test file and delete
the separate file. No test changes — just consolidation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The budget gate in _run_sub_llm_request only fired before starting a
sub-LLM call. A single call with multiple tool-calling turns could
blow past the budget unchecked. Now _run_sub_llm checks the combined
committed + in-flight completion tokens after each turn and breaks
out of the loop early when the budget is exceeded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
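
The per-turn gate this commit adds can be sketched as a loop that re-checks the combined token count after every turn — the function shape and names are hypothetical:

```python
def run_sub_llm(turn_token_counts: list[int],
                budget_max: int,
                committed: int = 0,
                in_flight: int = 0) -> list[int]:
    """Simulate a multi-turn sub-LLM call under a completion-token budget.

    Previously the budget was only checked before the call started; now the
    combined committed + in-flight total is re-checked after each turn.
    """
    used = committed + in_flight
    completed_turns = []
    for turn_tokens in turn_token_counts:  # tokens produced per tool-calling turn
        completed_turns.append(turn_tokens)
        used += turn_tokens
        if used >= budget_max:
            break  # budget exceeded mid-call: exit the loop early
    return completed_turns
```

With a 70-token budget and turns producing 40 tokens each, the call now stops after the second turn instead of running all turns unchecked.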
…duction-2026-02-27

# Conflicts:
#	verifiers/envs/experimental/cli_agent_env.py

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Move the assistant message append before the token budget check so
the forced final answer path sees a complete conversation, consistent
with the normal max-turns exit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@snimu snimu merged commit 522396c into main Mar 2, 2026
6 checks passed