Skip to content

Add SWERLSandboxEnv for per-sample Docker tasks with submit-based evaluation#1492

Closed
hamishivi wants to merge 193 commits into
mainfrom
swerl-sandbox-env
Closed

Add SWERLSandboxEnv for per-sample Docker tasks with submit-based evaluation#1492
hamishivi wants to merge 193 commits into
mainfrom
swerl-sandbox-env

Conversation

@hamishivi

@hamishivi hamishivi commented Feb 24, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Adds SWERLSandboxEnv, a new RL environment that extends GenericSandboxEnv with a submit tool for running per-task test suites inside Docker containers. Each task provides its own instruction, seed files, and test scripts via a task data directory.
  • Registers swerl_sandbox in the TOOL_REGISTRY so it can be used via --tools swerl_sandbox with optional --tool_configs JSON for task_data_dir, test_timeout, etc.
  • Includes 1-GPU and 8-GPU debug training scripts (swerl_sandbox_1gpu.sh, swerl_sandbox_8gpu.sh).

Based on AgentTaskEnv from #1453, renamed to SWERLSandboxEnv and adapted to the current environment interfaces.

Test plan

  • Verify SWERLSandboxEnv can be instantiated via --tools swerl_sandbox
  • Run 1-GPU debug script with test task data
  • Verify submit tool runs test.sh and returns correct reward
  • Verify task data loading (instruction, seeds, tests, setup.sh)

Made with Cursor

…luation

Adds a new RL environment (SWERLSandboxEnv) that extends GenericSandboxEnv
with a `submit` tool for running per-task test suites. Each task provides
its own instruction, seed files, and test scripts via a task data directory.
Includes 1-GPU and 8-GPU debug training scripts.

Based on AgentTaskEnv from #1453, renamed to SWERLSandboxEnv.

Co-authored-by: Cursor <cursoragent@cursor.com>
@gemini-code-assist

Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @hamishivi, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a specialized reinforcement learning environment that leverages Docker containers for evaluating agent performance on coding tasks. It provides a structured way to define tasks with their own instructions, initial files, and test mechanisms, allowing agents to interact with a sandbox, develop solutions, and then submit them for automated testing and reward calculation. This enhancement streamlines the process of training and evaluating agents on complex, code-centric problems by isolating each task's execution within a containerized environment.

Highlights

  • New RL Environment: Introduced SWERLSandboxEnv, an extension of GenericSandboxEnv, designed for per-sample Docker tasks with submit-based evaluation. This environment allows for running test suites inside Docker containers for each task.
  • Submit Tool Integration: Added a submit tool within SWERLSandboxEnv that executes a per-task test script and returns a reward, facilitating automated evaluation of agent solutions.
  • Dynamic Task Data Loading: Enabled loading of task-specific data, including instructions, seed files, test scripts, optional Docker images, and setup scripts, from a designated task_data_dir for each task.
  • Tool Registry Integration: Registered swerl_sandbox in the TOOL_REGISTRY, making it available for use via --tools swerl_sandbox with configurable options like task_data_dir and test_timeout.
  • Debug Training Scripts: Included new debug training scripts (swerl_sandbox_1gpu.sh and swerl_sandbox_8gpu.sh) to facilitate testing and training with the new SWERLSandboxEnv.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • open_instruct/environments/init.py
    • Imported SWERLSandboxEnv and SWERLSandboxEnvConfig.
    • Added SWERLSandboxEnv and SWERLSandboxEnvConfig to the __all__ export list.
  • open_instruct/environments/swerl_sandbox.py
    • Added a new file defining the SWERLSandboxEnv class, which extends GenericSandboxEnv.
    • Implemented the submit tool for running per-task test scripts and parsing rewards.
    • Included logic for loading task-specific data (instructions, seeds, tests, setup scripts) into the Docker container during environment reset.
    • Defined SWERLSandboxEnvConfig for configuring the new environment.
  • open_instruct/environments/tools/tools.py
    • Imported SWERLSandboxEnvConfig.
    • Registered SWERLSandboxEnvConfig in the TOOL_REGISTRY mapping.
  • scripts/train/debug/envs/swerl_sandbox_1gpu.sh
    • Added a new bash script for debugging GRPO training with SWERLSandboxEnv on a single GPU.
  • scripts/train/debug/envs/swerl_sandbox_8gpu.sh
    • Added a new bash script for training GRPO with SWERLSandboxEnv on 8 GPUs using Beaker.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces SWERLSandboxEnv, an extension of GenericSandboxEnv for per-sample Docker tasks with submit-based evaluation. This new environment integrates a submit tool to run test suites within Docker containers, loading task-specific data like instructions, seed files, and test scripts. The changes also include updates to __init__.py and tools.py to register the new environment and its configuration, along with debug training scripts for 1-GPU and 8-GPU setups. The implementation appears sound, providing a flexible and robust solution for Docker-based RL environments.

Comment thread open_instruct/environments/swerl_sandbox.py Outdated
Comment thread open_instruct/environments/swerl_sandbox.py Outdated
hamishivi and others added 4 commits February 23, 2026 20:59
Inline all bash/editor tool logic directly into SWERLSandboxEnv so it
extends RLEnvironment directly with no dependency on GenericSandboxEnv.
The bash wrapper defaults to /workspace instead of /testbed.

Co-authored-by: Cursor <cursoragent@cursor.com>
Use relative paths like the other debug scripts.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@hamishivi

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces SWERLSandboxEnv, a new environment for running per-sample Docker-based tasks, which is a great addition for more complex agent evaluations. The implementation is solid, including the new submit tool and task data loading mechanism. My review includes a critical security fix to prevent command injection when creating directories from task data, along with suggestions to improve robustness by specifying file encodings and enhance consistency in how command outputs are truncated for logging. Overall, these are valuable changes that expand the repository's RL capabilities.

Comment thread open_instruct/environments/swerl_sandbox.py Outdated
Comment thread open_instruct/environments/swerl_sandbox.py Outdated
Comment thread open_instruct/environments/swerl_sandbox.py Outdated
Comment thread open_instruct/environments/swerl_sandbox.py Outdated
root and others added 5 commits February 25, 2026 01:55
Add download_swerl_data.sh helper that uses snapshot_download to
fetch hamishivi/agent-task-combined and exports TASK_DATA_DIR.
Both 1gpu and 8gpu scripts source it before training, so the
task data is available in Beaker jobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mkdir -p is idempotent so the != check was unnecessary, and the
path should be shell-escaped to prevent injection from malicious
directory names in task data.

Made-with: Cursor
@hamishivi

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces SWERLSandboxEnv, a new RL environment for per-sample Docker-based tasks, which is a valuable addition. The implementation is solid, extending RLEnvironment with tools for bash execution, file editing, and a new submit tool for evaluation. The integration with the tool registry and the inclusion of debug scripts are well-executed.

My main feedback is on the _load_task_data method in swerl_sandbox.py, where there's an opportunity to improve both performance and maintainability by refactoring the file-copying logic to reduce code duplication and the number of Docker API calls. I've left a specific comment with suggestions.

Overall, this is a great contribution that enhances the framework's capabilities for agentic tasks.

Comment thread open_instruct/environments/swerl_sandbox.py
hamishivi and others added 14 commits February 26, 2026 16:14
Deduplicate the seeds/tests upload logic into a single
_upload_directory method. Collect all needed subdirectories
upfront and create them in one mkdir -p call instead of
one per file.

Made-with: Cursor
Instead of requiring task data to be pre-downloaded to a local
directory, set task_data_hf_repo to a HuggingFace dataset repo ID.
On setup(), snapshot_download fetches the repo once and caches it
via huggingface_hub — subsequent runs are instant.

Made-with: Cursor
The HF repo now stores a single task-data.tar.gz archive instead of
5659 individual task directories, making downloads much faster.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Docker put_archive API returns a 500 if the parent directory
doesn't exist (e.g. mkdir -p failed silently due to permissions).
Now we check the mkdir result and wrap write_file so errors surface
as user-visible editor messages instead of crashing the rollout.

Made-with: Cursor
Made-with: Cursor
Environment rollouts with 20 max_steps keep the vLLM engine busy
long enough that the weight sync (which must wait for in-flight
rollouts) exceeds the 120s timeout.

Made-with: Cursor
hamishivi and others added 27 commits April 14, 2026 17:57
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
process_from_queue must not break on _should_stop() — that flag is
temporarily True during every weight sync. If the completion queue
happens to be empty for 1s during a weight sync, the thread exits
permanently and results are never processed again.

Revert to unconditional loop — the thread will be killed when the
Ray actor is shut down.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same issue as process_from_queue: breaking on _should_stop() exits
the thread permanently during weight sync. Restore original behavior
where should_stop causes a sleep-and-continue (pause), not an exit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Qwen3.5's GatedDeltaNet (linear attention) layers ignore sequence
boundaries in packed inputs: causal conv1d leaks across sequences
(seq_idx=None) and the recurrent state carries over. This causes
incorrect logprobs during training and inflated KL divergence.

Monkey-patch based on huggingface/transformers#45034:
- Pass seq_idx to causal_conv1d_fn for packing-aware convolution
- Pass cu_seqlens to FLA chunk_gated_delta_rule kernel
- Forward **kwargs through DecoderLayer to linear_attn
- Add cu_seqlens to padding_free_collator output

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multiple PolicyTrainerRayProcess actors call from_pretrained
concurrently, and each internally triggers snapshot_download to the
shared HF cache on weka. This race causes cache symlink corruption
and intermittent OSError ("does not appear to have files named...").

Since the pre-download step already caches all files, use
local_files_only=True to skip the concurrent downloads entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rank 0 does NCCL broadcasts to vLLM engines inside GatheredParameters,
while ranks 1-7 return immediately. Without a barrier, ranks 1-7 start
exiting GatheredParameters (itself a collective to re-partition params)
while rank 0 is still broadcasting on model_update_group. This causes
the exit collective to deadlock intermittently.

Add torch.distributed.barrier() before exiting the context so all
ranks wait for rank 0 to finish the vLLM broadcasts first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gathering all params at once (gather_whole_model=True) with DeepSpeed
stage 3 deadlocks intermittently: rank 0 does NCCL broadcasts on
model_update_group inside GatheredParameters, and the exit collective
(re-partition) races with these broadcasts, causing a deadlock.

For DeepSpeed stage 3, use the per-parameter gathering path instead.
Each param is gathered individually, broadcast to vLLM, then released.
This avoids the interleaving of collectives on different NCCL groups.

FSDP continues to use gather_whole_model since it doesn't have the
same issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pass seq_idx and cu_seqlens through forward_for_logprobs so the
  Qwen3.5 GatedDeltaNet packing patch can isolate conv1d and recurrent
  state across packed sequences.  Fix cu_seqlens dtype to int32.
- Do NOT pass cu_seq_lens_q/k for flash attention — HF already detects
  packing from position_ids resets, and explicit cu_seq_lens would break
  Ulysses sequence parallelism (which slices position_ids per rank).
- Set mamba_ssm_cache_dtype=float32 to prevent bf16 rounding from
  compounding across decode steps in the GatedDeltaNet recurrent state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Build FLACPContext at trainer init from the SP rank/group so FLA's
chunk_gated_delta_rule can communicate recurrent state across
sequence-parallel ranks.  Without this, each SP rank's GatedDeltaNet
sees a truncated sequence with no state from prior chunks.

The context is threaded through forward_for_logprobs / compute_logprobs
and the Qwen3.5 packing patch into FLA's cp_context parameter.
When SP is off, cp_context is None and nothing changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each SWERLSandboxEnv actor was independently calling snapshot_download
in setup(), making an HF API call per actor.  With pool_size=128 and
sequential Ray actor startup, this added minutes of redundant downloads.

Extract resolve_task_data_dir() as a static method and call it once in
create_tool_pools() before spawning actors, passing the resolved path
as task_data_dir so each actor skips the download.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, the per-engine concurrency was always auto-computed from
KV cache capacity via get_kv_cache_info(). This adds a new
--vllm_inference_batch_size flag that lets users override it manually,
useful for working around over-aggressive auto-estimates that cause
OOMs or when you want to cap concurrency for other reasons.

Made-with: Cursor
The prior inference_batch_size commit accidentally reverted several
unrelated changes in grpo_fast.py (FLA CP context, qwen3_5 packing
patch import, local_files_only flag, etc). Restoring those here.

Made-with: Cursor
Skip spawning full tool-env pool when --cache_dataset_only is set; a
single actor is enough to query tool definitions before early return.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves conflicts in:
- CHANGELOG.md: keep both SWERL entry and new main entries
- open_instruct/grpo_fast.py: adopt new NCCLWeightTransferEngine API
  from main for setup_model_update_group; adopt main's
  try/except/finally weight sync structure with wake_up call
- open_instruct/grpo_utils.py: adopt main's forward_for_logprobs
  signature (no attention_mask) while preserving cp_context for FLA
- open_instruct/vllm_utils.py: adopt main's new weight transfer API
  but preserve the DeepSpeed stage 3 per-param gathering workaround
  for the weight sync deadlock (ds3 skips gather_whole_model branch)

Made-with: Cursor
The previous merge accidentally preserved the old manual-NCCL weight
sync code (init_process_group, _broadcast_params_to_vllm, etc.) alongside
main's new NCCLWeightTransferEngine API, resulting in a broken hybrid.

Resetting vllm_utils.py to origin/main and reapplying only the
branch-specific changes that don't conflict with the new API:
- tool_outputs string guard in process_tool_tokens
- mamba_ssm_cache_dtype = float32 for SSM precision
- tool_call_timeout parameter with asyncio.wait_for on tool steps
- inference_batch_size manual override flag

Made-with: Cursor
The previous fix commit introduced a duplicate closing brace causing
SyntaxError: unmatched '}', and also accidentally deleted the
vllm_qwen3_coder parser config used by the tmax training scripts.

Made-with: Cursor
@hamishivi

Copy link
Copy Markdown
Collaborator Author

Closing in favor of smaller, focused PRs split out of this branch (e.g. DockerBackend hardening #1732, SWERLVanilluxSandboxEnv #1739, and others). The bundled change here is too large to review as one unit.

@hamishivi hamishivi closed this Jun 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant