[fix]: bugfix for RAY_EXPERIMENTAL_NOSET_ASCEND/CUDA_RT_VISIBLE_DEVICES in RL #151

Open
xiazhahe wants to merge 1 commit into ISEEKYAN:main from xiazhahe:main

@xiazhahe

When a reinforcement learning framework such as Verl is used with the environment variable RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES or RAY_EXPERIMENTAL_NOSET_CUDA_RT_VISIBLE_DEVICES set, device visibility changes: instead of each worker seeing only its own card (which therefore always appears as device 0), every worker sees all cards on the node. However, device = get_device_name() carries no device index, so tensors are loaded onto device 0 by default. As a result, every worker can end up loading its tensors onto device 0, causing an OOM (out of memory) error at weight = f.get_tensor(name).

Therefore, this PR loads each tensor onto the current device index reported by torch instead of defaulting to device 0.
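
A minimal sketch of the idea, not the actual diff in this PR. The helper name load_device and its current_index parameter are hypothetical, and it assumes get_device_name() returns a bare backend name such as "cuda", "npu", or "cpu", and that each worker has already selected its own card (for example via torch.cuda.set_device) before loading weights.

```python
from typing import Callable, Optional


def load_device(device_name: str,
                current_index: Optional[Callable[[], int]] = None) -> str:
    """Return an index-qualified device string such as 'cuda:3'.

    Hypothetical helper: device_name is assumed to be a bare backend name
    ('cuda', 'npu', 'cpu'). current_index can be injected for testing;
    by default it is looked up on the matching torch backend.
    """
    if device_name == "cpu":
        # CPU has no per-process device index to qualify.
        return "cpu"
    if current_index is None:
        # Imported lazily so the helper can be exercised without torch.
        import torch
        # torch.cuda is built in; torch.npu is assumed to exist only
        # after torch_npu has been imported on Ascend hardware.
        backend = getattr(torch, device_name)
        current_index = backend.current_device
    # A bare 'cuda' or 'npu' resolves to index 0 when all cards are
    # visible, so the index is made explicit here.
    return f"{device_name}:{current_index()}"
```

If the weights are read with safetensors, as weight = f.get_tensor(name) suggests, the resulting string could be passed as the device argument of safe_open, for example safe_open(path, framework="pt", device=load_device(get_device_name())). Whether a given safetensors version accepts an "npu:N" device is an assumption to verify.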

@xiazhahe changed the title from "bugfix for RAY_EXPERIMENTAL_NOSET_ASCEND/CUDA_RT_VISIBLE_DEVICES in RL" to "[fix]: bugfix for RAY_EXPERIMENTAL_NOSET_ASCEND/CUDA_RT_VISIBLE_DEVICES in RL" on Jun 11, 2026