Add SWERLSandboxEnv for per-sample Docker tasks with submit-based evaluation#1492
Add SWERLSandboxEnv for per-sample Docker tasks with submit-based evaluation#1492hamishivi wants to merge 193 commits into
Conversation
…luation Adds a new RL environment (SWERLSandboxEnv) that extends GenericSandboxEnv with a `submit` tool for running per-task test suites. Each task provides its own instruction, seed files, and test scripts via a task data directory. Includes 1-GPU and 8-GPU debug training scripts. Based on AgentTaskEnv from #1453, renamed to SWERLSandboxEnv. Co-authored-by: Cursor <cursoragent@cursor.com>
Summary of ChangesHello @hamishivi, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a specialized reinforcement learning environment that leverages Docker containers for evaluating agent performance on coding tasks. It provides a structured way to define tasks with their own instructions, initial files, and test mechanisms, allowing agents to interact with a sandbox, develop solutions, and then submit them for automated testing and reward calculation. This enhancement streamlines the process of training and evaluating agents on complex, code-centric problems by isolating each task's execution within a containerized environment. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
The pull request introduces SWERLSandboxEnv, an extension of GenericSandboxEnv for per-sample Docker tasks with submit-based evaluation. This new environment integrates a submit tool to run test suites within Docker containers, loading task-specific data like instructions, seed files, and test scripts. The changes also include updates to __init__.py and tools.py to register the new environment and its configuration, along with debug training scripts for 1-GPU and 8-GPU setups. The implementation appears sound, providing a flexible and robust solution for Docker-based RL environments.
Inline all bash/editor tool logic directly into SWERLSandboxEnv so it extends RLEnvironment directly with no dependency on GenericSandboxEnv. The bash wrapper defaults to /workspace instead of /testbed. Co-authored-by: Cursor <cursoragent@cursor.com>
Use relative paths like the other debug scripts. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces SWERLSandboxEnv, a new environment for running per-sample Docker-based tasks, which is a great addition for more complex agent evaluations. The implementation is solid, including the new submit tool and task data loading mechanism. My review includes a critical security fix to prevent command injection when creating directories from task data, along with suggestions to improve robustness by specifying file encodings and enhance consistency in how command outputs are truncated for logging. Overall, these are valuable changes that expand the repository's RL capabilities.
Add download_swerl_data.sh helper that uses snapshot_download to fetch hamishivi/agent-task-combined and exports TASK_DATA_DIR. Both 1gpu and 8gpu scripts source it before training, so the task data is available in Beaker jobs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mkdir -p is idempotent so the != check was unnecessary, and the path should be shell-escaped to prevent injection from malicious directory names in task data. Made-with: Cursor
Made-with: Cursor
…_bash Made-with: Cursor
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces SWERLSandboxEnv, a new RL environment for per-sample Docker-based tasks, which is a valuable addition. The implementation is solid, extending RLEnvironment with tools for bash execution, file editing, and a new submit tool for evaluation. The integration with the tool registry and the inclusion of debug scripts are well-executed.
My main feedback is on the _load_task_data method in swerl_sandbox.py, where there's an opportunity to improve both performance and maintainability by refactoring the file-copying logic to reduce code duplication and the number of Docker API calls. I've left a specific comment with suggestions.
Overall, this is a great contribution that enhances the framework's capabilities for agentic tasks.
Deduplicate the seeds/tests upload logic into a single _upload_directory method. Collect all needed subdirectories upfront and create them in one mkdir -p call instead of one per file. Made-with: Cursor
Made-with: Cursor
Instead of requiring task data to be pre-downloaded to a local directory, set task_data_hf_repo to a HuggingFace dataset repo ID. On setup(), snapshot_download fetches the repo once and caches it via huggingface_hub — subsequent runs are instant. Made-with: Cursor
The HF repo now stores a single task-data.tar.gz archive instead of 5659 individual task directories, making downloads much faster. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Made-with: Cursor
The Docker put_archive API returns a 500 if the parent directory doesn't exist (e.g. mkdir -p failed silently due to permissions). Now we check the mkdir result and wrap write_file so errors surface as user-visible editor messages instead of crashing the rollout. Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Environment rollouts with 20 max_steps keep the vLLM engine busy long enough that the weight sync (which must wait for in-flight rollouts) exceeds the 120s timeout. Made-with: Cursor
Made-with: Cursor
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
process_from_queue must not break on _should_stop() — that flag is temporarily True during every weight sync. If the completion queue happens to be empty for 1s during a weight sync, the thread exits permanently and results are never processed again. Revert to unconditional loop — the thread will be killed when the Ray actor is shut down. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same issue as process_from_queue: breaking on _should_stop() exits the thread permanently during weight sync. Restore original behavior where should_stop causes a sleep-and-continue (pause), not an exit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Qwen3.5's GatedDeltaNet (linear attention) layers ignore sequence boundaries in packed inputs: causal conv1d leaks across sequences (seq_idx=None) and the recurrent state carries over. This causes incorrect logprobs during training and inflated KL divergence. Monkey-patch based on huggingface/transformers#45034: - Pass seq_idx to causal_conv1d_fn for packing-aware convolution - Pass cu_seqlens to FLA chunk_gated_delta_rule kernel - Forward **kwargs through DecoderLayer to linear_attn - Add cu_seqlens to padding_free_collator output Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multiple PolicyTrainerRayProcess actors call from_pretrained
concurrently, and each internally triggers snapshot_download to the
shared HF cache on weka. This race causes cache symlink corruption
and intermittent OSError ("does not appear to have files named...").
Since the pre-download step already caches all files, use
local_files_only=True to skip the concurrent downloads entirely.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rank 0 does NCCL broadcasts to vLLM engines inside GatheredParameters, while ranks 1-7 return immediately. Without a barrier, ranks 1-7 start exiting GatheredParameters (itself a collective to re-partition params) while rank 0 is still broadcasting on model_update_group. This causes the exit collective to deadlock intermittently. Add torch.distributed.barrier() before exiting the context so all ranks wait for rank 0 to finish the vLLM broadcasts first. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Made-with: Cursor
Made-with: Cursor
Gathering all params at once (gather_whole_model=True) with DeepSpeed stage 3 deadlocks intermittently: rank 0 does NCCL broadcasts on model_update_group inside GatheredParameters, and the exit collective (re-partition) races with these broadcasts, causing a deadlock. For DeepSpeed stage 3, use the per-parameter gathering path instead. Each param is gathered individually, broadcast to vLLM, then released. This avoids the interleaving of collectives on different NCCL groups. FSDP continues to use gather_whole_model since it doesn't have the same issue. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pass seq_idx and cu_seqlens through forward_for_logprobs so the Qwen3.5 GatedDeltaNet packing patch can isolate conv1d and recurrent state across packed sequences. Fix cu_seqlens dtype to int32. - Do NOT pass cu_seq_lens_q/k for flash attention — HF already detects packing from position_ids resets, and explicit cu_seq_lens would break Ulysses sequence parallelism (which slices position_ids per rank). - Set mamba_ssm_cache_dtype=float32 to prevent bf16 rounding from compounding across decode steps in the GatedDeltaNet recurrent state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Build FLACPContext at trainer init from the SP rank/group so FLA's chunk_gated_delta_rule can communicate recurrent state across sequence-parallel ranks. Without this, each SP rank's GatedDeltaNet sees a truncated sequence with no state from prior chunks. The context is threaded through forward_for_logprobs / compute_logprobs and the Qwen3.5 packing patch into FLA's cp_context parameter. When SP is off, cp_context is None and nothing changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each SWERLSandboxEnv actor was independently calling snapshot_download in setup(), making an HF API call per actor. With pool_size=128 and sequential Ray actor startup, this added minutes of redundant downloads. Extract resolve_task_data_dir() as a static method and call it once in create_tool_pools() before spawning actors, passing the resolved path as task_data_dir so each actor skips the download. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, the per-engine concurrency was always auto-computed from KV cache capacity via get_kv_cache_info(). This adds a new --vllm_inference_batch_size flag that lets users override it manually, useful for working around over-aggressive auto-estimates that cause OOMs or when you want to cap concurrency for other reasons. Made-with: Cursor
The prior inference_batch_size commit accidentally reverted several unrelated changes in grpo_fast.py (FLA CP context, qwen3_5 packing patch import, local_files_only flag, etc). Restoring those here. Made-with: Cursor
Skip spawning full tool-env pool when --cache_dataset_only is set; a single actor is enough to query tool definitions before early return. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves conflicts in: - CHANGELOG.md: keep both SWERL entry and new main entries - open_instruct/grpo_fast.py: adopt new NCCLWeightTransferEngine API from main for setup_model_update_group; adopt main's try/except/finally weight sync structure with wake_up call - open_instruct/grpo_utils.py: adopt main's forward_for_logprobs signature (no attention_mask) while preserving cp_context for FLA - open_instruct/vllm_utils.py: adopt main's new weight transfer API but preserve the DeepSpeed stage 3 per-param gathering workaround for the weight sync deadlock (ds3 skips gather_whole_model branch) Made-with: Cursor
The previous merge accidentally preserved the old manual-NCCL weight sync code (init_process_group, _broadcast_params_to_vllm, etc.) alongside main's new NCCLWeightTransferEngine API, resulting in a broken hybrid. Resetting vllm_utils.py to origin/main and reapplying only the branch-specific changes that don't conflict with the new API: - tool_outputs string guard in process_tool_tokens - mamba_ssm_cache_dtype = float32 for SSM precision - tool_call_timeout parameter with asyncio.wait_for on tool steps - inference_batch_size manual override flag Made-with: Cursor
The previous fix commit introduced a duplicate closing brace causing SyntaxError: unmatched '}', and also accidentally deleted the vllm_qwen3_coder parser config used by the tmax training scripts. Made-with: Cursor
Made-with: Cursor
…max scripts Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Summary
SWERLSandboxEnv, a new RL environment that extendsGenericSandboxEnvwith asubmittool for running per-task test suites inside Docker containers. Each task provides its own instruction, seed files, and test scripts via a task data directory.swerl_sandboxin theTOOL_REGISTRYso it can be used via--tools swerl_sandboxwith optional--tool_configsJSON fortask_data_dir,test_timeout, etc.swerl_sandbox_1gpu.sh,swerl_sandbox_8gpu.sh).Based on
AgentTaskEnvfrom #1453, renamed toSWERLSandboxEnvand adapted to the current environment interfaces.Test plan
SWERLSandboxEnvcan be instantiated via--tools swerl_sandboxMade with Cursor