Harden DockerBackend with retries, OOM detection, and concurrency limits#1732
Harden DockerBackend with retries, OOM detection, and concurrency limits#1732hamishivi wants to merge 4 commits into
Conversation
Add cross-process start/exec concurrency semaphores, transient-exec-error retries with exponential backoff, OOM detection, container-state diagnostics, a per-command timeout override, optional docker_host, and kill-first close to DockerBackend. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Code Review
This pull request enhances the DockerBackend sandbox environment with cross-process concurrency control using a file-slot semaphore, transient Docker API error retries, OOM detection, and timing logs, alongside corresponding unit tests. The review feedback highlights critical issues regarding container lifecycle management: when auto_remove is enabled, OOM detection fails because the container is deleted before its state can be inspected; conversely, when auto_remove is disabled, stopped or failed containers are leaked during retries and closure. Additionally, a potential crash was identified in the semaphore implementation if file opening raises a permission error outside the try-except block.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Move the lock-file open() inside the try block so a PermissionError (EACCES) from a slot owned by another user on a shared host is handled by trying the next slot instead of crashing. Co-authored-by: Cursor <cursoragent@cursor.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request hardens the DockerBackend sandbox environment by introducing concurrency limits via file-based semaphores, handling transient execution errors with retries, and implementing OOM detection. The review feedback highlights a potential permission conflict in multi-user environments due to a shared lock directory, and a critical bug where defaulting auto_remove to True prevents reliable OOM detection. To address these, it is recommended to make the lock directory user-specific, default auto_remove to False while explicitly managing container cleanup to avoid leaks, and add a unit test for the OOM error path.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Append the uid to the default lock dir so a shared /tmp doesn't cause cross-user permission conflicts where the EACCES-skip loop never finds a usable slot. Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
SWERL_DOCKER_START_CONCURRENCY/SWERL_DOCKER_EXEC_CONCURRENCY) to avoid overwhelming the Docker daemon under many concurrent rollouts.APIErrors (e.g. "database is locked") on the same container with exponential backoff, and restart-and-retry once onNotFound/ 409 Conflict.SandboxOOMErrorso episodes terminate cleanly instead of looping.timeoutoverride torun_command, optionaldocker_hostendpoint (e.g. Podman sockets), best-effort container-state diagnostics, and opt-in timing logs.Test plan
tests/test_docker_backend.pycover client selection, transient-error retry/backoff, retry exhaustion, and non-retry of unknown errors (mocked Docker SDK, no daemon needed).GPU_TESTS=bypass
Made with Cursor