RLMEnv: Simplify constructor and internals (#966)
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```diff
     sandbox_id,
     cmd,
-    timeout=self.env._compute_install_wait_seconds(),
+    timeout=self.env.max_startup_wait_seconds,
```
Pip install timeout no longer scales with packages
Low Severity
The removed `_compute_install_wait_seconds()` scaled the pip install timeout with the number of packages (30s per package, with a floor of `max_startup_wait_seconds`). Now using the flat `max_startup_wait_seconds` (default 120s) means environments with many `pip_install_packages` (5+) may time out during installation where they previously succeeded.
If somebody installs that many packages, they know what they're doing and can simply increase `max_startup_wait_seconds`.
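For reference, a package-count-scaled timeout along the lines of the removed helper could look like this. This is a minimal sketch based on the behavior described above (30s per package, floored at `max_startup_wait_seconds`); the function and constant names are illustrative, not the actual implementation.

```python
# Illustrative sketch of a pip-install timeout that scales with the
# number of packages, floored at the flat startup-wait default.
SECONDS_PER_PACKAGE = 30  # assumed per-package allowance from the review comment

def compute_install_wait_seconds(
    pip_install_packages: list[str],
    max_startup_wait_seconds: int = 120,
) -> int:
    """Return a timeout that grows with package count but never drops
    below the flat max_startup_wait_seconds floor."""
    scaled = SECONDS_PER_PACKAGE * len(pip_install_packages)
    return max(max_startup_wait_seconds, scaled)
```

With five packages this yields 150s rather than the flat 120s, which is the regression the comment flags.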
…duction-2026-02-27 merge in main
The type checker flags 5 unresolved-attribute errors because `_interception_server` is typed as `InterceptionServer | None`. Use `cast()` at each access site to narrow the type, since these code paths only run when interception is active (not gateway mode). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
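The `cast()` pattern described in this commit can be illustrated as follows. The class and method names are hypothetical stand-ins, not the real `RLMEnv` code; only the narrowing technique is the point.

```python
# Illustrative pattern: narrowing an Optional attribute with typing.cast
# at call sites that only run when the value is known to be set.
from typing import cast

class InterceptionServer:
    def stop(self) -> str:
        return "stopped"

class Env:
    def __init__(self, gateway_mode: bool) -> None:
        # Typed as InterceptionServer | None, so every attribute access
        # is flagged by the checker unless the type is narrowed first.
        self._interception_server: InterceptionServer | None = (
            None if gateway_mode else InterceptionServer()
        )

    def shutdown(self) -> str:
        # This path only runs when interception is active (not gateway
        # mode), so cast() narrows away the None for the type checker.
        server = cast(InterceptionServer, self._interception_server)
        return server.stop()
```

Note that `cast()` is a type-checker-only assertion with no runtime check; it is appropriate here precisely because the surrounding control flow guarantees the attribute is set.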
The validation checks correctly use "heavy" but the error messages still said "high", which would mislead users into using an invalid value. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a rollout-level completion-token budget shared across all sub-LLM calls. When set, the environment tracks cumulative sub-LLM completion tokens and refuses new calls once the budget is reached. The root model is informed of the budget in its system prompt and in the per-batch summary printed after each llm_batch() call. None (default) means unlimited, preserving backward compatibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move all sandbox backend tests into the main RLM test file and delete the separate file. No test changes — just consolidation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The budget gate in _run_sub_llm_request only fired before starting a sub-LLM call. A single call with multiple tool-calling turns could blow past the budget unchecked. Now _run_sub_llm checks the combined committed + in-flight completion tokens after each turn and breaks out of the loop early when the budget is exceeded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…duction-2026-02-27

# Conflicts:
#	verifiers/envs/experimental/cli_agent_env.py
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Move the assistant message append before the token budget check so the forced final answer path sees a complete conversation, consistent with the normal max-turns exit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
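The ordering fix can be illustrated with a minimal turn handler (hypothetical names, not the actual environment code): the assistant message is appended before the budget check, so the forced-final-answer branch operates on a complete conversation.

```python
# Illustrative ordering: append the assistant message BEFORE the token
# budget check, so the forced final-answer path sees a conversation that
# already includes the latest turn, matching the normal max-turns exit.
def agent_turn(messages, assistant_msg, tokens_used, budget):
    messages.append(assistant_msg)  # append first
    if budget is not None and tokens_used >= budget:
        # Budget exhausted: force a final answer against the full history.
        return messages, "forced_final_answer"
    return messages, "continue"
```

If the check ran first, the forced-final-answer path would be missing the most recent assistant message, which is the inconsistency the commit fixes.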


Description
- Removes constructor params that were either lost in `**kwargs` or had no remaining use case (`interception_host`, `interception_port`, `interception_url`, `execution_backend`, `context_key`, `sandbox_start_command`, `sandbox_client_max_workers`, `root_tool_serialization`, stagger/jitter params, etc.)
- Removes the `_InterceptionPool` singleton and all shared-pool branching — each `RLMEnv` instance now owns its own interception server and tunnel (this undoes a recent change by myself which was poorly motivated and thought through)
- Adds a `max_turns: int = 50` constructor param (previously inherited a default of 10 from `StatefulToolEnv`, easily lost via `**kwargs`)
- Renames `sub_tool_max_turns` → `sub_llm_max_turns` for consistency with `max_sub_llm_parallelism` and the `sub_llm_*` metric names
- Hardcodes `interception_port=0` (OS-assigned) and `bind_host="127.0.0.1"` — the old configurability only mattered for the now-removed pool

Also adds the `sub_llm_max_completion_tokens` arg to control the maximum number of completion tokens across all sub-LLM calls. `max_tokens` as set by prime-rl already controlled the per-sub-LLM-call number of tokens, but this now controls the total token budget. The RLM is told this budget in the system prompt, and the number of used-up completion tokens after each `llm_batch` call. The enforcement isn't perfect due to the parallelism of sub-LLM calls, but it works fairly well. When the budget is reached, sub-LLMs may make no further calls and all calls to `llm_batch` fail, but the RLM can still perform its work in other ways.

Note: requires small changes to the -rlm environments.
Type of Change
Testing
Ran `uv run pytest` locally.

Checklist
Note
Medium Risk
Medium risk because it makes breaking constructor/API changes and alters interception/tunnel lifecycle and sub-LLM execution behavior (timeouts, batching, and new budget-based early exits). Main failure modes are misconfigured integrations and unexpectedly skipped `llm_batch` calls under parallelism.

Overview
Simplifies `RLMEnv`'s public API and internals by removing a large set of constructor knobs (e.g., interception host/port/url config, context key overrides, stagger/jitter, sandbox client sizing, deprecated backend params) and introducing an explicit `max_turns` parameter (replacing the prior `max_iterations` pass-through).

Changes interception behavior by deleting the `_InterceptionPool` singleton and shared-pool code paths; each `RLMEnv` instance now starts and owns its own interception server/tunnel, with interception binding/port selection effectively hardcoded (localhost + OS-assigned port).

Adds a new rollout-wide sub-LLM completion-token budget via `sub_llm_max_completion_tokens`, enforced both before starting `llm_batch` and during sub-LLM tool loops (with a forced final-answer call), and surfaces budget info in the root system prompt and `llm_batch` summary output.

Tests and docs are updated accordingly: rename `sub_tool_max_turns` → `sub_llm_max_turns`, remove pool tests, fold sandbox tests into `test_rlm_env.py`, and trim the experimental README's `RLMEnv` section.

Written by Cursor Bugbot for commit 0722cae. This will update automatically on new commits.