Description
(TaskRunner pid=860823) 2026-03-17 06:50:41.671 | WARNING | agentevolver.module.agent_flow.agent_flow:execute:106 - Token overflow detected at step 23. Current token count exceeds the limit.
(TaskRunner pid=860823) Rollout progress (961.36 tokens/s): [finished]:64 threads
(TaskRunner pid=860823) ==========end fit rollout==========
Epoch train.0.8: Collecting rollouts: 100%|██████████| 64/64 [05:54<00:00, 5.54s/it]
(TaskRunner pid=860823) gen_batch_output.info batch.keys=_StringKeys(dict_keys(['prompts', 'responses', 'input_ids', 'attention_mask', 'position_ids', 'loss_mask', 'exp_mask', 'step_ids', 'group_ids']))
(WorkerDict pid=872637) [DP=0,TP=0] execute_method: sleep
(raylet) [2026-03-17 06:50:49,721 E 848672 848714] (raylet) file_system_monitor.cc:116: /var/tmp/ray/session_2026-03-17_05-19-44_128656_847514 is over 95% full, available space: 73.8651 GB; capacity: 4921.3 GB. Object creation will fail if spilling is required. [repeated 9x across cluster]
(raylet) [2026-03-17 06:50:59,727 E 848672 848714] (raylet) file_system_monitor.cc:116: /var/tmp/ray/session_2026-03-17_05-19-44_128656_847514 is over 95% full, available space: 73.8651 GB; capacity: 4921.3 GB. Object creation will fail if spilling is required. [repeated 9x across cluster]
(raylet) [2026-03-17 06:51:09,733 E 848672 848714] (raylet) file_system_monitor.cc:116: /var/tmp/ray/session_2026-03-17_05-19-44_128656_847514 is over 95% full, available space: 73.8649 GB; capacity: 4921.3 GB. Object creation will fail if spilling is required. [repeated 9x across cluster]
(TaskRunner pid=860823) list(reward_extra_infos_dict.keys())=[]
(WorkerDict pid=872950) [DP=1,TP=0] execute_method: sleep [repeated 7x across cluster]
(raylet) [2026-03-17 06:51:19,738 E 848672 848714] (raylet) file_system_monitor.cc:116: /var/tmp/ray/session_2026-03-17_05-19-44_128656_847514 is over 95% full, available space: 73.8649 GB; capacity: 4921.3 GB. Object creation will fail if spilling is required. [repeated 9x across cluster]
(raylet) [2026-03-17 06:51:29,744 E 848672 848714] (raylet) file_system_monitor.cc:116: /var/tmp/ray/session_2026-03-17_05-19-44_128656_847514 is over 95% full, available space: 73.8648 GB; capacity: 4921.3 GB. Object creation will fail if spilling is required. [repeated 9x across cluster]
(raylet) [2026-03-17 06:51:39,749 E 848672 848714] (raylet) file_system_monitor.cc:116: /var/tmp/ray/session_2026-03-17_05-19-44_128656_847514 is over 95% full, available space: 73.8648 GB; capacity: 4921.3 GB. Object creation will fail if spilling is required. [repeated 9x across cluster]
Training Progress: 2%|▏ | 8/440 [1:01:44<55:34:11, 463.08s/it]
(AsyncvLLMServer pid=875610) (raylet) [2026-03-17 06:51:39,749 E 848672 848714] (raylet) file_system_monitor.cc:116: /var/tmp/ray/session_2026-03-17_05-19-44_128656_847514 is over 95% full, available space: 73.8648 GB; capacity: 4921.3 GB. Object creation will fail if spilling is required. [repeated 8x across cluster]
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/yihao_hyh/AgentEvolver/agentevolver/main_ppo.py", line 144, in main
run_ppo(config) # ⭐ Executes the PPO training process with the given configuration
^^^^^^^^^^^^^^^
File "/home/yihao_hyh/AgentEvolver/agentevolver/main_ppo.py", line 170, in run_ppo
ray.get(runner.run.remote(config)) # ⭐ Start the PPO training process by calling the run method on the TaskRunner actor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/ray/_private/worker.py", line 2849, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/ray/_private/worker.py", line 937, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::TaskRunner.run() (pid=860823, ip=10.148.15.236, actor_id=518784300ca0d94c2bbe370301000000, repr=<main_ppo.TaskRunner object at 0x7f3dfe615b10>)
File "/home/yihao_hyh/AgentEvolver/agentevolver/main_ppo.py", line 336, in run
trainer.fit() # ⭐ Start the training process
^^^^^^^^^^^^^
File "/home/yihao_hyh/AgentEvolver/agentevolver/module/trainer/ae_ray_trainer.py", line 1334, in fit
actor_output = self.actor_rollout_wg.update_actor(batch) # ⭐ Update the actor with the new batch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/verl/single_controller/ray/base.py", line 51, in call
output = ray.get(output)
^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(OutOfMemoryError): ray::WorkerDict.actor_rollout_update_actor() (pid=872955, ip=10.148.15.236, actor_id=f1e49507ec75ac520a06d58201000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7ff43cf26a10>)
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/verl/single_controller/ray/base.py", line 645, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/verl/single_controller/base/decorator.py", line 540, in inner
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/AgentEvolver/agentevolver/module/exp_manager/het_fsdp_worker.py", line 669, in update_actor
metrics = self.actor.update_policy(data=data) # ⭐ Update the actor policy with the provided data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/AgentEvolver/agentevolver/module/exp_manager/het_actor.py", line 200, in update_policy
loss.backward() # ⭐ Backpropagate the loss
^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/torch/_tensor.py", line 626, in backward
torch.autograd.backward(
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/torch/autograd/init.py", line 347, in backward
_engine_run_backward(
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.26 GiB. GPU 0 has a total capacity of 79.10 GiB of which 5.50 GiB is free. Including non-PyTorch memory, this process has 73.46 GiB memory in use. Of the allocated memory 96.03 GiB is allocated by PyTorch, with 129.23 MiB allocated in private pools (e.g., CUDA Graphs), and 9.63 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error running subprocess: Command '['/home/yihao_hyh/miniconda3/envs/agentevolver/bin/python', '-m', 'agentevolver.main_ppo', '--config-path', '/home/yihao_hyh/AgentEvolver/launcher_record/basic', '--config-name', 'yaml_backup.yaml']' returned non-zero exit status 1.
(agentevolver) yihao_hyh@instance-algo-dev1:~/AgentEvolver$ # Make sure to relaunch the script with the appworld environment
Full error output above.
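
For anyone hitting the same crash, here is a minimal relaunch sketch (not the project's launcher) that applies the allocator setting the OOM message itself recommends. The interpreter path and Hydra config arguments are copied from the failing subprocess command in the log; whether expandable segments alone avoids the OOM depends on whether fragmentation, rather than true memory pressure, is the cause.

```python
import os
import subprocess

# Sketch only: relaunch the failing command with the allocator option that
# the torch.OutOfMemoryError message suggests. expandable_segments can
# reclaim memory lost to fragmentation; it does not help if the update
# batch genuinely exceeds GPU capacity.
env = dict(os.environ, PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True")

subprocess.run(
    [
        "/home/yihao_hyh/miniconda3/envs/agentevolver/bin/python",
        "-m", "agentevolver.main_ppo",
        "--config-path", "/home/yihao_hyh/AgentEvolver/launcher_record/basic",
        "--config-name", "yaml_backup.yaml",
    ],
    env=env,
    check=True,  # raise CalledProcessError on a non-zero exit, as seen above
)
```

If the OOM persists, reducing the micro-batch size used for the actor update in the Hydra config is the usual next step, since the traceback shows the failure happens inside loss.backward() during update_actor rather than during rollout generation.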