Add SWERLSandboxEnv for per-sample Docker tasks with submit-based evaluation by hamishivi · Pull Request #1492 · allenai/open-instruct

hamishivi · 2026-02-24T04:51:20Z

Summary

- Adds SWERLSandboxEnv, a new RL environment that extends GenericSandboxEnv with a submit tool for running per-task test suites inside Docker containers. Each task provides its own instruction, seed files, and test scripts via a task data directory.
- Registers swerl_sandbox in the TOOL_REGISTRY so it can be used via --tools swerl_sandbox with optional --tool_configs JSON for task_data_dir, test_timeout, etc.
- Includes 1-GPU and 8-GPU debug training scripts (swerl_sandbox_1gpu.sh, swerl_sandbox_8gpu.sh).

Based on AgentTaskEnv from #1453, renamed to SWERLSandboxEnv and adapted to the current environment interfaces.
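
For illustration, here is one way a `--tool_configs` JSON string could map onto a small config object. This is a minimal sketch and not the PR's `SWERLSandboxEnvConfig`: the class name `SandboxToolConfig`, the helper `parse_tool_config`, and the default timeout are hypothetical. Only the keys `task_data_dir` and `test_timeout` come from the summary above.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class SandboxToolConfig:
    """Hypothetical stand-in for the PR's config class; defaults are assumptions."""

    task_data_dir: str = ""
    test_timeout: int = 120  # seconds; assumed default, not taken from the PR


def parse_tool_config(raw: str) -> SandboxToolConfig:
    """Parse one JSON config string, rejecting unknown keys early."""
    data = json.loads(raw)
    known = {f.name for f in fields(SandboxToolConfig)}
    unknown = set(data) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return SandboxToolConfig(**data)


cfg = parse_tool_config('{"task_data_dir": "/data/tasks", "test_timeout": 300}')
```

Rejecting unknown keys up front is a design choice of this sketch; it surfaces typos in the JSON before any container is started.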

Test plan

- Verify SWERLSandboxEnv can be instantiated via --tools swerl_sandbox
- Run 1-GPU debug script with test task data
- Verify submit tool runs test.sh and returns correct reward
- Verify task data loading (instruction, seeds, tests, setup.sh)
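
The task data loading checked in the test plan could look roughly like the sketch below. The on-disk layout is an assumption: the file name `instruction.md` and the directory names `seeds/` and `tests/` are hypothetical, while `setup.sh` and `test.sh` are named in this PR. `TaskData` and `load_task` are illustrative names, not the PR's API.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Optional


@dataclass
class TaskData:
    instruction: str
    seeds: Dict[str, bytes] = field(default_factory=dict)  # relative path -> contents
    tests: Dict[str, bytes] = field(default_factory=dict)
    setup_script: Optional[str] = None


def _read_tree(root: Path) -> Dict[str, bytes]:
    """Read every file under root, keyed by POSIX-style relative path."""
    if not root.is_dir():
        return {}
    return {
        p.relative_to(root).as_posix(): p.read_bytes()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def load_task(task_dir: Path) -> TaskData:
    """Load one task directory (assumed layout; see the lead-in)."""
    setup = task_dir / "setup.sh"
    return TaskData(
        instruction=(task_dir / "instruction.md").read_text(encoding="utf-8"),
        seeds=_read_tree(task_dir / "seeds"),
        tests=_read_tree(task_dir / "tests"),
        setup_script=setup.read_text(encoding="utf-8") if setup.is_file() else None,
    )
```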

Made with Cursor

…luation Adds a new RL environment (SWERLSandboxEnv) that extends GenericSandboxEnv with a `submit` tool for running per-task test suites. Each task provides its own instruction, seed files, and test scripts via a task data directory. Includes 1-GPU and 8-GPU debug training scripts. Based on AgentTaskEnv from #1453, renamed to SWERLSandboxEnv. Co-authored-by: Cursor <cursoragent@cursor.com>

gemini-code-assist · 2026-02-24T04:51:39Z

Summary of Changes

Hello @hamishivi, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a specialized reinforcement learning environment that leverages Docker containers for evaluating agent performance on coding tasks. It provides a structured way to define tasks with their own instructions, initial files, and test mechanisms, allowing agents to interact with a sandbox, develop solutions, and then submit them for automated testing and reward calculation. This enhancement streamlines the process of training and evaluating agents on complex, code-centric problems by isolating each task's execution within a containerized environment.

Highlights

- New RL Environment: Introduced SWERLSandboxEnv, an extension of GenericSandboxEnv, designed for per-sample Docker tasks with submit-based evaluation. This environment allows for running test suites inside Docker containers for each task.
- Submit Tool Integration: Added a submit tool within SWERLSandboxEnv that executes a per-task test script and returns a reward, facilitating automated evaluation of agent solutions.
- Dynamic Task Data Loading: Enabled loading of task-specific data, including instructions, seed files, test scripts, optional Docker images, and setup scripts, from a designated task_data_dir for each task.
- Tool Registry Integration: Registered swerl_sandbox in the TOOL_REGISTRY, making it available for use via --tools swerl_sandbox with configurable options like task_data_dir and test_timeout.
- Debug Training Scripts: Included new debug training scripts (swerl_sandbox_1gpu.sh and swerl_sandbox_8gpu.sh) to facilitate testing and training with the new SWERLSandboxEnv.


Changelog

open_instruct/environments/__init__.py
- Imported SWERLSandboxEnv and SWERLSandboxEnvConfig.
- Added SWERLSandboxEnv and SWERLSandboxEnvConfig to the __all__ export list.
open_instruct/environments/swerl_sandbox.py
- Added a new file defining the SWERLSandboxEnv class, which extends GenericSandboxEnv.
- Implemented the submit tool for running per-task test scripts and parsing rewards.
- Included logic for loading task-specific data (instructions, seeds, tests, setup scripts) into the Docker container during environment reset.
- Defined SWERLSandboxEnvConfig for configuring the new environment.
open_instruct/environments/tools/tools.py
- Imported SWERLSandboxEnvConfig.
- Registered SWERLSandboxEnvConfig in the TOOL_REGISTRY mapping.
scripts/train/debug/envs/swerl_sandbox_1gpu.sh
- Added a new bash script for debugging GRPO training with SWERLSandboxEnv on a single GPU.
scripts/train/debug/envs/swerl_sandbox_8gpu.sh
- Added a new bash script for training GRPO with SWERLSandboxEnv on 8 GPUs using Beaker.


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

The pull request introduces SWERLSandboxEnv, an extension of GenericSandboxEnv for per-sample Docker tasks with submit-based evaluation. This new environment integrates a submit tool to run test suites within Docker containers, loading task-specific data like instructions, seed files, and test scripts. The changes also include updates to __init__.py and tools.py to register the new environment and its configuration, along with debug training scripts for 1-GPU and 8-GPU setups. The implementation appears sound, providing a flexible and robust solution for Docker-based RL environments.

Inline all bash/editor tool logic directly into SWERLSandboxEnv so it extends RLEnvironment directly with no dependency on GenericSandboxEnv. The bash wrapper defaults to /workspace instead of /testbed. Co-authored-by: Cursor <cursoragent@cursor.com>

Use relative paths like the other debug scripts. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

hamishivi · 2026-02-24T22:14:13Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces SWERLSandboxEnv, a new environment for running per-sample Docker-based tasks, which is a great addition for more complex agent evaluations. The implementation is solid, including the new submit tool and task data loading mechanism. My review includes a critical security fix to prevent command injection when creating directories from task data, along with suggestions to improve robustness by specifying file encodings and enhance consistency in how command outputs are truncated for logging. Overall, these are valuable changes that expand the repository's RL capabilities.

Add download_swerl_data.sh helper that uses snapshot_download to fetch hamishivi/agent-task-combined and exports TASK_DATA_DIR. Both 1gpu and 8gpu scripts source it before training, so the task data is available in Beaker jobs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mkdir -p is idempotent so the != check was unnecessary, and the path should be shell-escaped to prevent injection from malicious directory names in task data. Made-with: Cursor
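
As a small illustration of the escaping described in this commit message, `shlex.quote` turns an arbitrary path into a single shell word. The helper name `mkdir_command` is hypothetical; the PR's actual call site may look different.

```python
import shlex


def mkdir_command(path: str) -> str:
    """Build a `mkdir -p` command with the path quoted as one shell word."""
    return f"mkdir -p {shlex.quote(path)}"


# An ordinary path passes through unchanged.
safe = mkdir_command("src/utils")
# A hostile directory name is wrapped in single quotes, so the `;` is not
# interpreted by the shell as a command separator.
hostile = mkdir_command("x; rm -rf /")
```

Because `mkdir -p` succeeds when the directory already exists, no existence check is needed before running the command.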

Made-with: Cursor

…_bash Made-with: Cursor

hamishivi · 2026-02-26T23:20:43Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces SWERLSandboxEnv, a new RL environment for per-sample Docker-based tasks, which is a valuable addition. The implementation is solid, extending RLEnvironment with tools for bash execution, file editing, and a new submit tool for evaluation. The integration with the tool registry and the inclusion of debug scripts are well-executed.

My main feedback is on the _load_task_data method in swerl_sandbox.py, where there's an opportunity to improve both performance and maintainability by refactoring the file-copying logic to reduce code duplication and the number of Docker API calls. I've left a specific comment with suggestions.

Overall, this is a great contribution that enhances the framework's capabilities for agentic tasks.

Deduplicate the seeds/tests upload logic into a single _upload_directory method. Collect all needed subdirectories upfront and create them in one mkdir -p call instead of one per file. Made-with: Cursor
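
The batching idea can be sketched as follows: collect every parent directory first, then emit a single `mkdir -p` command. The function name `batched_mkdir_command` is hypothetical and this is not the PR's `_upload_directory`; `/workspace` is used only because the PR mentions it as the bash wrapper's default directory.

```python
import posixpath
import shlex
from typing import Iterable


def batched_mkdir_command(dest_root: str, relative_files: Iterable[str]) -> str:
    """Return one `mkdir -p` command covering every parent directory needed."""
    dirs = {dest_root}
    for rel in relative_files:
        parent = posixpath.dirname(rel)
        if parent:
            dirs.add(posixpath.join(dest_root, parent))
    # Sorting keeps the command deterministic; quoting keeps each path one word.
    quoted = " ".join(shlex.quote(d) for d in sorted(dirs))
    return f"mkdir -p {quoted}"


cmd = batched_mkdir_command(
    "/workspace", ["a.py", "pkg/b.py", "pkg/sub/c.py", "my dir/d.txt"]
)
```

With this shape, uploading N files costs one exec call for directories instead of N.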

Made-with: Cursor

Instead of requiring task data to be pre-downloaded to a local directory, set task_data_hf_repo to a HuggingFace dataset repo ID. On setup(), snapshot_download fetches the repo once and caches it via huggingface_hub — subsequent runs are instant. Made-with: Cursor

The HF repo now stores a single task-data.tar.gz archive instead of 5659 individual task directories, making downloads much faster. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
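
Unpacking such an archive could look like the sketch below. It assumes a plain `.tar.gz` of task directories and adds a basic path check so that no member is written outside the destination. The helper name `extract_task_archive` is hypothetical, and the check does not cover symlink members.

```python
import tarfile
from pathlib import Path


def extract_task_archive(archive: Path, dest: Path) -> Path:
    """Extract a .tar.gz into dest, refusing members that would escape dest."""
    dest = dest.resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    return dest
```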

Made-with: Cursor

The Docker put_archive API returns a 500 if the parent directory doesn't exist (e.g. mkdir -p failed silently due to permissions). Now we check the mkdir result and wrap write_file so errors surface as user-visible editor messages instead of crashing the rollout. Made-with: Cursor
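
A rough sketch of that error handling is shown below. It is written against any object exposing `exec_run` and `put_archive` in the style of the Docker SDK's container objects. The function body and the message wording are assumptions, not the PR's code.

```python
import io
import posixpath
import shlex
import tarfile


def write_file(container, path: str, content: bytes) -> str:
    """Write content to path in the container; return a message instead of raising."""
    parent = posixpath.dirname(path) or "/"
    exit_code, output = container.exec_run(
        ["sh", "-c", f"mkdir -p {shlex.quote(parent)}"]
    )
    if exit_code != 0:
        # Surface the mkdir failure rather than letting put_archive fail later.
        detail = output.decode(errors="replace").strip()
        return f"Error: could not create {parent}: {detail}"

    # put_archive expects a tar stream, so wrap the single file in one.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(posixpath.basename(path))
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))
    try:
        container.put_archive(parent, buf.getvalue())
    except Exception as exc:  # the Docker SDK raises docker.errors.APIError here
        return f"Error: could not write {path}: {exc}"
    return f"Wrote {len(content)} bytes to {path}"
```

Returning a string on failure lets the editor tool report the problem to the model as an ordinary tool result, so a single bad write does not abort the rollout.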

Made-with: Cursor

Environment rollouts with 20 max_steps keep the vLLM engine busy long enough that the weight sync (which must wait for in-flight rollouts) exceeds the 120s timeout. Made-with: Cursor

Made-with: Cursor

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

process_from_queue must not break on _should_stop() — that flag is temporarily True during every weight sync. If the completion queue happens to be empty for 1s during a weight sync, the thread exits permanently and results are never processed again. Revert to unconditional loop — the thread will be killed when the Ray actor is shut down. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Same issue as process_from_queue: breaking on _should_stop() exits the thread permanently during weight sync. Restore original behavior where should_stop causes a sleep-and-continue (pause), not an exit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
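
The difference between pausing and exiting can be sketched with two plain threads. Everything here is illustrative: the function signatures, the `prefetch_loop` name, and the `shutdown` event are assumptions added so the sketch can terminate. The commit messages above describe an unconditional loop that ends only when the Ray actor is shut down.

```python
import queue
import threading
import time


def process_from_queue(completions, results, shutdown, poll_s=0.01):
    """Drain completions; an empty queue is never treated as a reason to exit."""
    while not shutdown.is_set():
        try:
            item = completions.get(timeout=poll_s)
        except queue.Empty:
            continue  # idle (possibly during a weight sync): keep waiting
        results.append(item)


def prefetch_loop(pending, completions, should_stop, shutdown, poll_s=0.01):
    """Forward work unless paused; should_stop() means sleep and retry, not exit."""
    while not shutdown.is_set():
        if should_stop():
            time.sleep(poll_s)
            continue
        try:
            item = pending.get(timeout=poll_s)
        except queue.Empty:
            continue
        completions.put(item)
```

If either loop used `break` where it uses `continue`, a pause that coincides with an empty queue would end the thread for good, which is the failure mode described in the two commit messages above.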

Qwen3.5's GatedDeltaNet (linear attention) layers ignore sequence boundaries in packed inputs: causal conv1d leaks across sequences (seq_idx=None) and the recurrent state carries over. This causes incorrect logprobs during training and inflated KL divergence. Monkey-patch based on huggingface/transformers#45034: - Pass seq_idx to causal_conv1d_fn for packing-aware convolution - Pass cu_seqlens to FLA chunk_gated_delta_rule kernel - Forward **kwargs through DecoderLayer to linear_attn - Add cu_seqlens to padding_free_collator output Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
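
To make the two packing inputs concrete, here is a framework-free sketch of how `cu_seqlens` (cumulative sequence boundaries) and `seq_idx` (a per-token sequence index) relate to a packed batch. The function names are hypothetical, and plain Python lists stand in for the integer tensors a trainer would actually pass to the kernels.

```python
from itertools import accumulate


def packing_metadata(seq_lengths):
    """Return (cu_seqlens, seq_idx) for sequences packed back to back."""
    cu_seqlens = [0, *accumulate(seq_lengths)]
    seq_idx = [i for i, n in enumerate(seq_lengths) for _ in range(n)]
    return cu_seqlens, seq_idx


def lengths_from_position_ids(position_ids):
    """Recover per-sequence lengths, assuming position ids restart at 0 per sequence."""
    starts = [i for i, p in enumerate(position_ids) if p == 0]
    ends = starts[1:] + [len(position_ids)]
    return [end - start for start, end in zip(starts, ends)]


# Three sequences of lengths 3, 2 and 4 packed into one row of 9 tokens.
lengths = lengths_from_position_ids([0, 1, 2, 0, 1, 0, 1, 2, 3])
cu_seqlens, seq_idx = packing_metadata(lengths)
```

A convolution or recurrent kernel that receives these boundaries can reset its state at each one instead of carrying it from one packed sequence into the next.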

Multiple PolicyTrainerRayProcess actors call from_pretrained concurrently, and each internally triggers snapshot_download to the shared HF cache on weka. This race causes cache symlink corruption and intermittent OSError ("does not appear to have files named..."). Since the pre-download step already caches all files, use local_files_only=True to skip the concurrent downloads entirely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Made-with: Cursor

Rank 0 does NCCL broadcasts to vLLM engines inside GatheredParameters, while ranks 1-7 return immediately. Without a barrier, ranks 1-7 start exiting GatheredParameters (itself a collective to re-partition params) while rank 0 is still broadcasting on model_update_group. This causes the exit collective to deadlock intermittently. Add torch.distributed.barrier() before exiting the context so all ranks wait for rank 0 to finish the vLLM broadcasts first. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Made-with: Cursor

Gathering all params at once (gather_whole_model=True) with DeepSpeed stage 3 deadlocks intermittently: rank 0 does NCCL broadcasts on model_update_group inside GatheredParameters, and the exit collective (re-partition) races with these broadcasts, causing a deadlock. For DeepSpeed stage 3, use the per-parameter gathering path instead. Each param is gathered individually, broadcast to vLLM, then released. This avoids the interleaving of collectives on different NCCL groups. FSDP continues to use gather_whole_model since it doesn't have the same issue. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Pass seq_idx and cu_seqlens through forward_for_logprobs so the Qwen3.5 GatedDeltaNet packing patch can isolate conv1d and recurrent state across packed sequences. Fix cu_seqlens dtype to int32. - Do NOT pass cu_seq_lens_q/k for flash attention — HF already detects packing from position_ids resets, and explicit cu_seq_lens would break Ulysses sequence parallelism (which slices position_ids per rank). - Set mamba_ssm_cache_dtype=float32 to prevent bf16 rounding from compounding across decode steps in the GatedDeltaNet recurrent state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Build FLACPContext at trainer init from the SP rank/group so FLA's chunk_gated_delta_rule can communicate recurrent state across sequence-parallel ranks. Without this, each SP rank's GatedDeltaNet sees a truncated sequence with no state from prior chunks. The context is threaded through forward_for_logprobs / compute_logprobs and the Qwen3.5 packing patch into FLA's cp_context parameter. When SP is off, cp_context is None and nothing changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Each SWERLSandboxEnv actor was independently calling snapshot_download in setup(), making an HF API call per actor. With pool_size=128 and sequential Ray actor startup, this added minutes of redundant downloads. Extract resolve_task_data_dir() as a static method and call it once in create_tool_pools() before spawning actors, passing the resolved path as task_data_dir so each actor skips the download. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previously, the per-engine concurrency was always auto-computed from KV cache capacity via get_kv_cache_info(). This adds a new --vllm_inference_batch_size flag that lets users override it manually, useful for working around over-aggressive auto-estimates that cause OOMs or when you want to cap concurrency for other reasons. Made-with: Cursor
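
The override logic amounts to "use the manual value if given, otherwise the auto-estimate". A minimal sketch follows, assuming the flag defaults to `None`. The helper name `resolve_inference_batch_size` and the use of `argparse` are illustrative and may not match how the repository defines its arguments.

```python
import argparse
from typing import Optional


def resolve_inference_batch_size(auto_estimate: int, override: Optional[int]) -> int:
    """Prefer the manual override; fall back to the KV-cache-based estimate."""
    if override is None:
        return auto_estimate
    if override < 1:
        raise ValueError("inference batch size must be >= 1")
    return override


parser = argparse.ArgumentParser()
parser.add_argument("--vllm_inference_batch_size", type=int, default=None)
args = parser.parse_args(["--vllm_inference_batch_size", "64"])

# 256 stands in for whatever the auto-estimate would have produced.
chosen = resolve_inference_batch_size(256, args.vllm_inference_batch_size)
```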

The prior inference_batch_size commit accidentally reverted several unrelated changes in grpo_fast.py (FLA CP context, qwen3_5 packing patch import, local_files_only flag, etc). Restoring those here. Made-with: Cursor

Skip spawning full tool-env pool when --cache_dataset_only is set; a single actor is enough to query tool definitions before early return. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves conflicts in: - CHANGELOG.md: keep both SWERL entry and new main entries - open_instruct/grpo_fast.py: adopt new NCCLWeightTransferEngine API from main for setup_model_update_group; adopt main's try/except/finally weight sync structure with wake_up call - open_instruct/grpo_utils.py: adopt main's forward_for_logprobs signature (no attention_mask) while preserving cp_context for FLA - open_instruct/vllm_utils.py: adopt main's new weight transfer API but preserve the DeepSpeed stage 3 per-param gathering workaround for the weight sync deadlock (ds3 skips gather_whole_model branch) Made-with: Cursor

The previous merge accidentally preserved the old manual-NCCL weight sync code (init_process_group, _broadcast_params_to_vllm, etc.) alongside main's new NCCLWeightTransferEngine API, resulting in a broken hybrid. Resetting vllm_utils.py to origin/main and reapplying only the branch-specific changes that don't conflict with the new API: - tool_outputs string guard in process_tool_tokens - mamba_ssm_cache_dtype = float32 for SSM precision - tool_call_timeout parameter with asyncio.wait_for on tool steps - inference_batch_size manual override flag Made-with: Cursor
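
The `asyncio.wait_for` guard on tool steps mentioned in this commit message can be sketched as below. The wrapper name `run_tool_step` and the choice to turn a timeout into an error string are assumptions; the PR may handle a timed-out tool call differently.

```python
import asyncio


async def run_tool_step(tool_coro_fn, timeout_s: float) -> str:
    """Await one tool call, converting a timeout into an error string."""
    try:
        return await asyncio.wait_for(tool_coro_fn(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return f"Error: tool call timed out after {timeout_s}s"


async def fast_tool() -> str:
    return "ok"


async def slow_tool() -> str:
    await asyncio.sleep(10)  # simulates a hung command inside the sandbox
    return "never returned"


fast = asyncio.run(run_tool_step(fast_tool, timeout_s=1.0))
slow = asyncio.run(run_tool_step(slow_tool, timeout_s=0.05))
```

`asyncio.wait_for` cancels the inner coroutine when the timeout fires, so a stuck tool call does not keep occupying the event loop after the step has been abandoned.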

The previous fix commit introduced a duplicate closing brace causing SyntaxError: unmatched '}', and also accidentally deleted the vllm_qwen3_coder parser config used by the tmax training scripts. Made-with: Cursor

Made-with: Cursor

…max scripts Made-with: Cursor

Made-with: Cursor

hamishivi · 2026-06-25T16:52:10Z

Closing in favor of smaller, focused PRs split out of this branch (e.g. DockerBackend hardening #1732, SWERLVanilluxSandboxEnv #1739, and others). The bundled change here is too large to review as one unit.

gemini-code-assist Bot reviewed Feb 24, 2026


hamishivi and others added 4 commits February 23, 2026 20:59

Remove REPO_ROOT from swerl_sandbox_1gpu.sh

2f72377

Use relative paths like the other debug scripts. Co-authored-by: Cursor <cursoragent@cursor.com>

Add SWERLSandboxEnv to changelog

d8de291

Co-authored-by: Cursor <cursoragent@cursor.com>

Remove redundant truthiness check on parent dir path

b830649

Co-authored-by: Cursor <cursoragent@cursor.com>

gemini-code-assist Bot reviewed Feb 24, 2026


root and others added 5 commits February 25, 2026 01:55

Merge branch 'main' into swerl-sandbox-env

1d67042

Use shlex.quote for mkdir paths and drop redundant guard

c7a4f83

mkdir -p is idempotent so the != check was unnecessary, and the path should be shell-escaped to prevent injection from malicious directory names in task data. Made-with: Cursor

Add explicit encoding="utf-8" to all open() calls

2df0da1

Made-with: Cursor

Use _truncate_output for test result output, consistent with _execute…

17164d6

…_bash Made-with: Cursor

gemini-code-assist Bot reviewed Feb 26, 2026


hamishivi and others added 14 commits February 26, 2026 16:14

Extract _upload_directory helper and batch mkdir calls

ce5722e

Deduplicate the seeds/tests upload logic into a single _upload_directory method. Collect all needed subdirectories upfront and create them in one mkdir -p call instead of one per file. Made-with: Cursor

Remove --dataset_skip_cache from debug scripts for faster startup

9ae5f06

Made-with: Cursor

Download task data as single tar.gz instead of individual files

ccb8641

The HF repo now stores a single task-data.tar.gz archive instead of 5659 individual task directories, making downloads much faster. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix task_data_hf_repo to use correct repo name

3fe18f7

Made-with: Cursor

Merge origin/main into swerl-sandbox-env

46a5205

Made-with: Cursor

ruff format

e3fddf3

Made-with: Cursor

Catch IsADirectoryError in editor dispatch

206cb22

Made-with: Cursor

Reduce max_steps to 10 in 1gpu script to avoid weight sync timeout

e98d1fe

Environment rollouts with 20 max_steps keep the vLLM engine busy long enough that the weight sync (which must wait for in-flight rollouts) exceeds the 120s timeout. Made-with: Cursor

some fixes

67b4c17

fix

8c4b465

download

1828ad5

Move snapshot_download import to top level and fix import order

4c9ebca

Made-with: Cursor

hamishivi and others added 27 commits April 14, 2026 17:57

Increase default tool_call_timeout to 300s

86ef5fe

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

up tool call timeout

2c411d7

Add causal-conv1d dependency for Qwen3.5 fast linear attention kernels

3881125

Made-with: Cursor

Add Qwen3-4B SFT tmax-10k combo GRPO script

4295fce

Made-with: Cursor

Add Qwen3-4B SFT tmax-10k combo GRPO script

9995e4a

Made-with: Cursor

correctly wire in seq ids

a2eb5ba

Trigger rebuild to pick up restored causal-conv1d dependency

278c44c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Restore grpo_fast.py changes accidentally lost in previous commit

a92e548

The prior inference_batch_size commit accidentally reverted several unrelated changes in grpo_fast.py (FLA CP context, qwen3_5 packing patch import, local_files_only flag, etc). Restoring those here. Made-with: Cursor

Use pool_size=1 during dataset caching

7dc3b1c

Skip spawning full tool-env pool when --cache_dataset_only is set; a single actor is enough to query tool definitions before early return. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix qwen3 parser

6ab2af2

Fix parsers.py syntax error and restore vllm_qwen3_coder

a70a95d

The previous fix commit introduced a duplicate closing brace causing SyntaxError: unmatched '}', and also accidentally deleted the vllm_qwen3_coder parser config used by the tmax training scripts. Made-with: Cursor

Remove vllm_qwen3_coder parser; switch tmax scripts to vllm_qwen3_xml

3ea5117

Made-with: Cursor

Add enforce_eager, no_auto_dataset_cache; bump pool_size to 512 for t…

ecbfe9f

…max scripts Made-with: Cursor

Move --no_auto_dataset_cache to mason.py flags for tmax scripts

2c68294

Made-with: Cursor

Checkpoint tmax scripts every 10 steps instead of 50

243c1e3

Made-with: Cursor

hamishivi mentioned this pull request Jun 25, 2026

Add SWERLVanilluxSandboxEnv #1739

Open

hamishivi closed this Jun 25, 2026

Add SWERLSandboxEnv for per-sample Docker tasks with submit-based evaluation #1492
hamishivi wants to merge 193 commits into main from swerl-sandbox-env
