Description
(TaskRunner pid=860823) 2026-03-17 06:50:41.671 | WARNING | agentevolver.module.agent_flow.agent_flow:execute:106 - Token overflow detected at step 23. Current token count exceeds the limit.
(TaskRunner pid=860823) Rollout progress (961.36 tokens/s): [finished]:64 threads
(TaskRunner pid=860823) ==========end fit rollout==========
Epoch train.0.8: Collecting rollouts: 100%|██████████| 64/64 [05:54<00:00, 5.54s/it]
(TaskRunner pid=860823) gen_batch_output.info batch.keys=_StringKeys(dict_keys(['prompts', 'responses', 'input_ids', 'attention_mask', 'position_ids', 'loss_mask', 'exp_mask', 'step_ids', 'group_ids']))
(WorkerDict pid=872637) [DP=0,TP=0] execute_method: sleep
(raylet) [2026-03-17 06:50:49,721 E 848672 848714] (raylet) file_system_monitor.cc:116: /var/tmp/ray/session_2026-03-17_05-19-44_128656_847514 is over 95% full, available space: 73.8651 GB; capacity: 4921.3 GB. Object creation will fail if spilling is required. [repeated 9x across cluster]
(raylet) [2026-03-17 06:50:59,727 E 848672 848714] (raylet) file_system_monitor.cc:116: /var/tmp/ray/session_2026-03-17_05-19-44_128656_847514 is over 95% full, available space: 73.8651 GB; capacity: 4921.3 GB. Object creation will fail if spilling is required. [repeated 9x across cluster]
(raylet) [2026-03-17 06:51:09,733 E 848672 848714] (raylet) file_system_monitor.cc:116: /var/tmp/ray/session_2026-03-17_05-19-44_128656_847514 is over 95% full, available space: 73.8649 GB; capacity: 4921.3 GB. Object creation will fail if spilling is required. [repeated 9x across cluster]
(TaskRunner pid=860823) list(reward_extra_infos_dict.keys())=[]
(WorkerDict pid=872950) [DP=1,TP=0] execute_method: sleep [repeated 7x across cluster]
(raylet) [2026-03-17 06:51:19,738 E 848672 848714] (raylet) file_system_monitor.cc:116: /var/tmp/ray/session_2026-03-17_05-19-44_128656_847514 is over 95% full, available space: 73.8649 GB; capacity: 4921.3 GB. Object creation will fail if spilling is required. [repeated 9x across cluster]
(raylet) [2026-03-17 06:51:29,744 E 848672 848714] (raylet) file_system_monitor.cc:116: /var/tmp/ray/session_2026-03-17_05-19-44_128656_847514 is over 95% full, available space: 73.8648 GB; capacity: 4921.3 GB. Object creation will fail if spilling is required. [repeated 9x across cluster]
(raylet) [2026-03-17 06:51:39,749 E 848672 848714] (raylet) file_system_monitor.cc:116: /var/tmp/ray/session_2026-03-17_05-19-44_128656_847514 is over 95% full, available space: 73.8648 GB; capacity: 4921.3 GB. Object creation will fail if spilling is required. [repeated 9x across cluster]
Training Progress: 2%|▏ | 8/440 [1:01:44<55:34:11, 463.08s/it]
(AsyncvLLMServer pid=875610) (raylet) [2026-03-17 06:51:39,749 E 848672 848714] (raylet) file_system_monitor.cc:116: /var/tmp/ray/session_2026-03-17_05-19-44_128656_847514 is over 95% full, available space: 73.8648 GB; capacity: 4921.3 GB. Object creation will fail if spilling is required. [repeated 8x across cluster]
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/yihao_hyh/AgentEvolver/agentevolver/main_ppo.py", line 144, in main
run_ppo(config) # ⭐ Executes the PPO training process with the given configuration
^^^^^^^^^^^^^^^
File "/home/yihao_hyh/AgentEvolver/agentevolver/main_ppo.py", line 170, in run_ppo
ray.get(runner.run.remote(config)) # ⭐ Start the PPO training process by calling the run method on the TaskRunner actor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/ray/_private/worker.py", line 2849, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/ray/_private/worker.py", line 937, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::TaskRunner.run() (pid=860823, ip=10.148.15.236, actor_id=518784300ca0d94c2bbe370301000000, repr=<main_ppo.TaskRunner object at 0x7f3dfe615b10>)
File "/home/yihao_hyh/AgentEvolver/agentevolver/main_ppo.py", line 336, in run
trainer.fit() # ⭐ Start the training process
^^^^^^^^^^^^^
File "/home/yihao_hyh/AgentEvolver/agentevolver/module/trainer/ae_ray_trainer.py", line 1334, in fit
actor_output = self.actor_rollout_wg.update_actor(batch) # ⭐ Update the actor with the new batch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/verl/single_controller/ray/base.py", line 51, in call
output = ray.get(output)
^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(OutOfMemoryError): ray::WorkerDict.actor_rollout_update_actor() (pid=872955, ip=10.148.15.236, actor_id=f1e49507ec75ac520a06d58201000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7ff43cf26a10>)
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/verl/single_controller/ray/base.py", line 645, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/verl/single_controller/base/decorator.py", line 540, in inner
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/AgentEvolver/agentevolver/module/exp_manager/het_fsdp_worker.py", line 669, in update_actor
metrics = self.actor.update_policy(data=data) # ⭐ Update the actor policy with the provided data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yihao_hyh/AgentEvolver/agentevolver/module/exp_manager/het_actor.py", line 200, in update_policy
loss.backward() # ⭐ Backpropagate the loss
^^^^^^^^^^^^^^^
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/torch/_tensor.py", line 626, in backward
torch.autograd.backward(
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/torch/autograd/init.py", line 347, in backward
_engine_run_backward(
File "/home/yihao_hyh/miniconda3/envs/agentevolver/lib/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.26 GiB. GPU 0 has a total capacity of 79.10 GiB of which 5.50 GiB is free. Including non-PyTorch memory, this process has 73.46 GiB memory in use. Of the allocated memory 96.03 GiB is allocated by PyTorch, with 129.23 MiB allocated in private pools (e.g., CUDA Graphs), and 9.63 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error running subprocess: Command '['/home/yihao_hyh/miniconda3/envs/agentevolver/bin/python', '-m', 'agentevolver.main_ppo', '--config-path', '/home/yihao_hyh/AgentEvolver/launcher_record/basic', '--config-name', 'yaml_backup.yaml']' returned non-zero exit status 1.
(agentevolver) yihao_hyh@instance-algo-dev1:~/AgentEvolver$ # Make sure to relaunch the script with the appworld environment
Full error output above.
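
For anyone hitting the same crash, here is a minimal relaunch sketch (not the project's launcher) that applies the allocator setting the OOM message itself recommends. The interpreter path and Hydra config arguments are copied from the failing subprocess command in the log; whether expandable segments alone avoids the OOM depends on whether fragmentation, rather than true memory pressure, is the cause.

```python
import os
import subprocess

# Sketch only: relaunch the failing command with the allocator option that
# the torch.OutOfMemoryError message suggests. expandable_segments can
# reclaim memory lost to fragmentation; it does not help if the update
# batch genuinely exceeds GPU capacity.
env = dict(os.environ, PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True")

subprocess.run(
    [
        "/home/yihao_hyh/miniconda3/envs/agentevolver/bin/python",
        "-m", "agentevolver.main_ppo",
        "--config-path", "/home/yihao_hyh/AgentEvolver/launcher_record/basic",
        "--config-name", "yaml_backup.yaml",
    ],
    env=env,
    check=True,  # raise CalledProcessError on a non-zero exit, as seen above
)
```

If the OOM persists, reducing the micro-batch size used for the actor update in the Hydra config is the usual next step, since the traceback shows the failure happens inside loss.backward() during update_actor rather than during rollout generation.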