[WIP] set accelerator environment variables from runtime env agent by andrewsykim · Pull Request #63327 · ray-project/ray

andrewsykim · 2026-05-13T21:00:09Z

Description

Sets accelerator specific environment variables at runtime environment creation instead of from within worker processes. This is to avoid crashes caused by getenv/setenv thread safety issues (ex: #62028).

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

gemini-code-assist

Code Review

This pull request moves the configuration of resource-related environment variables, such as OMP_NUM_THREADS and accelerator visibility (e.g., CUDA_VISIBLE_DEVICES), from task execution time to worker startup. This is implemented by passing resource requirements and allocated instances to the RuntimeEnvAgent during worker creation. Feedback identifies several critical issues: an isolation bug in the Python agent where environment variables are popped instead of cleared, a missing check for specific allocated resource IDs when reusing workers, and concerns regarding strict resource equality checks and hardcoded prestart values which may significantly degrade worker reuse efficiency. Additionally, the reviewer questioned the language-specific logic that might leave non-Python workers without proper isolation.

gemini-code-assist · 2026-05-13T21:02:24Z

+  if (worker.GetResourceRequirements() != pop_worker_request.resource_requirements_) {
+    return WorkerUnfitForLeaseReason::OTHERS;
+  }


This check is missing a comparison for allocated_instances. Since environment variables like CUDA_VISIBLE_DEVICES are now set at worker startup based on the specific resource IDs allocated, a worker can only be reused if the new task is assigned the exact same resource IDs. Without this check, a task assigned GPU 1 might be scheduled on a worker process that was started with CUDA_VISIBLE_DEVICES=0.

Additionally, the strict equality check on ResourceSet will significantly reduce worker reuse. For example, a worker prestarted with 1.0 CPU will not be reused for a task requiring 0.5 CPU, even though the resulting OMP_NUM_THREADS would be identical. While strict equality is safe, it may lead to excessive worker process churn.

gemini-code-assist · 2026-05-13T21:02:24Z

@@ -238,6 +242,8 @@ class Worker : public std::enable_shared_from_this<Worker>, public WorkerInterfa
  /// Saved process group id captured at registration time. Used for process-group
  /// cleanup validation at disconnect/stop.
  std::optional<pid_t> saved_pgid_ = std::nullopt;
+  /// Resource requirements of this worker process.
+  ResourceSet resource_requirements_;


The Worker class should also store the allocated_instances it was started with. This is necessary for WorkerFitForLease to correctly determine if an idle worker is compatible with a new task's assigned resource IDs, as the environment variables are now fixed at process creation.

gemini-code-assist · 2026-05-13T21:02:24Z

+  ResourceSet resource_requirements;
+  resource_requirements.Set(scheduling::ResourceID::CPU(), 1.0);


Hardcoding prestarted workers to 1.0 CPU, combined with the strict equality check in WorkerFitForLease, means these workers will only be used for tasks that request exactly 1.0 CPU. Tasks requesting any other amount (e.g., 0.1 or 2.0) will trigger the start of a new worker process, effectively negating the benefit of prestarting workers for a large class of tasks.

andrewsykim · 2026-05-13T21:01:29Z

@@ -463,7 +493,9 @@ async def _create_runtime_env_with_retry(
                serialized_context = self._env_cache[serialized_env]
                result = self._env_cache[serialized_env]
                if result.success:
-                    context = result.result
+                    context_obj = RuntimeEnvContext.deserialize(result.result)
+                    self._inject_resource_env_vars(context_obj, request)


It seemed cleaner to call _inject_resource_env_vars in _setup_runtime_env, but request.allocated_instances for GPUs was never set. I'm assuming this is called for worker prestart before the accelerator IDs are passed down