Skip to content

Harden DockerBackend with retries, OOM detection, and concurrency limits#1732

Open
hamishivi wants to merge 4 commits into
mainfrom
hamishivi/docker-backend-hardening
Open

Harden DockerBackend with retries, OOM detection, and concurrency limits#1732
hamishivi wants to merge 4 commits into
mainfrom
hamishivi/docker-backend-hardening

Conversation

@hamishivi

Copy link
Copy Markdown
Collaborator

Summary

  • Add cross-process start/exec concurrency semaphores (SWERL_DOCKER_START_CONCURRENCY / SWERL_DOCKER_EXEC_CONCURRENCY) to avoid overwhelming the Docker daemon under many concurrent rollouts.
  • Retry transient Docker exec APIErrors (e.g. "database is locked") on the same container with exponential backoff, and restart-and-retry once on NotFound / 409 Conflict.
  • Detect OOM-killed containers and raise SandboxOOMError so episodes terminate cleanly instead of looping.
  • Add per-command timeout override to run_command, optional docker_host endpoint (e.g. Podman sockets), best-effort container-state diagnostics, and opt-in timing logs.
  • Close containers kill-first (then stop) for faster, more reliable teardown.

Test plan

  • New unit tests in tests/test_docker_backend.py cover client selection, transient-error retry/backoff, retry exhaustion, and non-retry of unknown errors (mocked Docker SDK, no daemon needed).

GPU_TESTS=bypass

Made with Cursor

hamishivi and others added 2 commits June 23, 2026 13:45
Add cross-process start/exec concurrency semaphores, transient-exec-error
retries with exponential backoff, OOM detection, container-state
diagnostics, a per-command timeout override, optional docker_host, and
kill-first close to DockerBackend.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the DockerBackend sandbox environment with cross-process concurrency control using a file-slot semaphore, transient Docker API error retries, OOM detection, and timing logs, alongside corresponding unit tests. The review feedback highlights critical issues regarding container lifecycle management: when auto_remove is enabled, OOM detection fails because the container is deleted before its state can be inspected; conversely, when auto_remove is disabled, stopped or failed containers are leaked during retries and closure. Additionally, a potential crash was identified in the semaphore implementation if file opening raises a permission error outside the try-except block.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread open_instruct/environments/backends.py
Comment thread open_instruct/environments/backends.py Outdated
Comment thread open_instruct/environments/backends.py
Comment thread open_instruct/environments/backends.py
Move the lock-file open() inside the try block so a PermissionError (EACCES)
from a slot owned by another user on a shared host is handled by trying the
next slot instead of crashing.

Co-authored-by: Cursor <cursoragent@cursor.com>
@hamishivi

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request hardens the DockerBackend sandbox environment by introducing concurrency limits via file-based semaphores, handling transient execution errors with retries, and implementing OOM detection. The review feedback highlights a potential permission conflict in multi-user environments due to a shared lock directory, and a critical bug where defaulting auto_remove to True prevents reliable OOM detection. To address these, it is recommended to make the lock directory user-specific, default auto_remove to False while explicitly managing container cleanup to avoid leaks, and add a unit test for the OOM error path.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread open_instruct/environments/backends.py Outdated
Comment thread open_instruct/environments/backends.py
Comment thread open_instruct/environments/backends.py
Comment thread open_instruct/environments/backends.py
Comment thread open_instruct/environments/backends.py
Comment thread tests/test_docker_backend.py
Append the uid to the default lock dir so a shared /tmp doesn't cause cross-user
permission conflicts where the EACCES-skip loop never finds a usable slot.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant