Skip to content

[WIP] set accelerator environment variables from runtime env agent #63327

Draft
andrewsykim wants to merge 30 commits into
ray-project:masterfrom
andrewsykim:63138
Draft

[WIP] set accelerator environment variables from runtime env agent #63327
andrewsykim wants to merge 30 commits into
ray-project:masterfrom
andrewsykim:63138

Conversation

@andrewsykim

Copy link
Copy Markdown
Member

Description

Closes #63138

Sets accelerator specific environment variables at runtime environment creation instead of from within worker processes. This is to avoid crashes caused by getenv/setenv thread safety issues (ex: #62028).

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request moves the configuration of resource-related environment variables, such as OMP_NUM_THREADS and accelerator visibility (e.g., CUDA_VISIBLE_DEVICES), from task execution time to worker startup. This is implemented by passing resource requirements and allocated instances to the RuntimeEnvAgent during worker creation. Feedback identifies several critical issues: an isolation bug in the Python agent where environment variables are popped instead of cleared, a missing check for specific allocated resource IDs when reusing workers, and concerns regarding strict resource equality checks and hardcoded prestart values which may significantly degrade worker reuse efficiency. Additionally, the reviewer questioned the language-specific logic that might leave non-Python workers without proper isolation.

Comment thread python/ray/_private/runtime_env/agent/runtime_env_agent.py Outdated
Comment on lines +1357 to +1359
if (worker.GetResourceRequirements() != pop_worker_request.resource_requirements_) {
return WorkerUnfitForLeaseReason::OTHERS;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This check is missing a comparison for allocated_instances. Since environment variables like CUDA_VISIBLE_DEVICES are now set at worker startup based on the specific resource IDs allocated, a worker can only be reused if the new task is assigned the exact same resource IDs. Without this check, a task assigned GPU 1 might be scheduled on a worker process that was started with CUDA_VISIBLE_DEVICES=0.

Additionally, the strict equality check on ResourceSet will significantly reduce worker reuse. For example, a worker prestarted with 1.0 CPU will not be reused for a task requiring 0.5 CPU, even though the resulting OMP_NUM_THREADS would be identical. While strict equality is safe, it may lead to excessive worker process churn.

Comment thread src/ray/raylet/worker.h
@@ -238,6 +242,8 @@ class Worker : public std::enable_shared_from_this<Worker>, public WorkerInterfa
/// Saved process group id captured at registration time. Used for process-group
/// cleanup validation at disconnect/stop.
std::optional<pid_t> saved_pgid_ = std::nullopt;
/// Resource requirements of this worker process.
ResourceSet resource_requirements_;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The Worker class should also store the allocated_instances it was started with. This is necessary for WorkerFitForLease to correctly determine if an idle worker is compatible with a new task's assigned resource IDs, as the environment variables are now fixed at process creation.

Comment thread src/ray/raylet/worker_pool.cc Outdated
Comment thread src/ray/raylet/node_manager.cc Outdated
Comment on lines +1884 to +1885
ResourceSet resource_requirements;
resource_requirements.Set(scheduling::ResourceID::CPU(), 1.0);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Hardcoding prestarted workers to 1.0 CPU, combined with the strict equality check in WorkerFitForLease, means these workers will only be used for tasks that request exactly 1.0 CPU. Tasks requesting any other amount (e.g., 0.1 or 2.0) will trigger the start of a new worker process, effectively negating the benefit of prestarting workers for a large class of tasks.

@@ -463,7 +493,9 @@ async def _create_runtime_env_with_retry(
serialized_context = self._env_cache[serialized_env]
result = self._env_cache[serialized_env]
if result.success:
context = result.result
context_obj = RuntimeEnvContext.deserialize(result.result)
self._inject_resource_env_vars(context_obj, request)

@andrewsykim andrewsykim May 13, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed cleaner to call _inject_resource_env_vars in _setup_runtime_env, but request.allocated_instances for GPUs was never set. I'm assuming this is called for worker prestart before the accelerator IDs are passed down

@andrewsykim andrewsykim changed the title [WIP] set accelerator allocated instances at runtime env creation [WIP] set accelerator environment variables from runtime env agent May 14, 2026
@andrewsykim andrewsykim force-pushed the 63138 branch 3 times, most recently from 2593f18 to 1b857bd Compare June 3, 2026 19:00
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
…ts or assigned accelerator IDs

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
…est.cc

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
…clock

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
…accelerator env vars

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
…ateRuntimeEnv

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[core] Avoid mutating os.environ within Ray workers Ray fails to serialize self-reference objects

1 participant