Allocator (GPU_0_bfc) ran out of memory trying to allocate 520.77MiB. #65

@shimin-happy

Description

I have a Dell XPS 15 with a GTX 1060 GPU, and I always get this error when I train. I have reduced the batch size to 16, but it does not help. Has anybody had a similar experience?
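As a sanity check on the log below: the 520.77 MiB figure is exactly the size of a float32 tensor of shape [3772, 32, 29, 39]. Note that the leading dimension is 3772, not the configured batch size of 16, which may mean this particular op is fed a much larger batch (e.g. a full replay sample) and the batch-size change never reached it; that interpretation is a guess from the shape alone.

```python
# Verify that the reported allocation size matches the tensor shape in the log.
import math

shape = (3772, 32, 29, 39)   # NCHW activation shape from the OOM message
bytes_per_float32 = 4

n_elements = math.prod(shape)
size_mib = n_elements * bytes_per_float32 / (1024 ** 2)

print(f"{size_mib:.2f} MiB")  # → 520.77 MiB, matching the allocator message
```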

2020-01-16 21:16:39.457655: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 520.77MiB. Current allocation summary follows.
2020-01-16 21:16:39.457938: W tensorflow/core/common_runtime/bfc_allocator.cc:275] _______________________**********_____******
2020-01-16 21:16:39.457976: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at conv_ops.cc:693 : Resource exhausted: OOM when allocating tensor with shape[3772,32,29,39] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
{"simapp_exception": {"version": "1.0", "date": "2020-01-16 21:16:39.881604", "function": "training_worker", "message": "An error occured while training: OOM when allocating tensor with shape[3772,32,29,39] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc\n\t [[{node main_level/agent/main/online/network_1/observation/Conv2d_0/Conv2D} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 4, 4], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](main_level/agent/main/online/network_0/observation/Conv2d_0/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, main_level/agent/main/online/network_1/observation/Conv2d_0/kernel/read)]]\nHint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.\n\n\nCaused by op 'main_level/agent/main/online/network_1/observation/Conv2d_0/Conv2D', defined at:\n File "training_worker.py", line 252, in \n main()\n File "training_worker.py", line 247, in main\n memory_backend_params=memory_backend_params\n File "training_worker.py", line 68, in training_worker\n graph_manager.create_graph(task_parameters)\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/graph_managers/graph_manager.py", line 144, in create_graph\n self.level_managers, self.environments = self._create_graph(task_parameters)\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/graph_managers/basic_rl_graph_manager.py", line 59, in _create_graph\n level_manager = LevelManager(agents=agent, environment=env, name="main_level")\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/level_manager.py", line 88, in init\n self.build()\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/level_manager.py", line 152, in build\n [agent.set_environment_parameters(spaces) for agent in self.agents.values()]\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/level_manager.py", line 
152, in \n [agent.set_environment_parameters(spaces) for agent in self.agents.values()]\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/agents/agent.py", line 301, in set_environment_parameters\n self.init_environment_dependent_modules()\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/agents/agent.py", line 346, in init_environment_dependent_modules\n self.networks = self.create_networks()\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/agents/agent.py", line 319, in create_networks\n worker_device=self.worker_device)\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/network_wrapper.py", line 93, in init\n network_is_trainable=True)\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/general_network.py", line 74, in construct\n return construct_on_device()\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/general_network.py", line 59, in construct_on_device\n return GeneralTensorFlowNetwork(*args, **kwargs)\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/general_network.py", line 126, in init\n network_is_local, network_is_trainable)\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/architecture.py", line 105, in init\n self.get_model()\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/general_network.py", line 261, in get_model\n input_placeholder, embedding = input_embedder(self.inputs[input_name])\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/embedders/embedder.py", line 88, in call\n self._build_module()\n File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/embedders/embedder.py", line 115, in _build_module\n is_training=self.is_training)\n File 
"/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/layers.py", line 78, in call\n strides=self.strides, data_format='channels_last', name=name)\n File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/layers/convolutional.py", line 417, in conv2d\n return layer.apply(inputs)\n File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 828, in apply\n return self.call(inputs, *args, **kwargs)\n File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/layers/base.py", line 364, in call\n outputs = super(Layer, self).call(inputs, *args, **kwargs)\n File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 769, in call\n outputs = self.call(inputs, *args, **kwargs)\n File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/convolutional.py", line 186, in call\n outputs = self._convolution_op(inputs, self.kernel)\n File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 869, in call\n return self.conv_op(inp, filter)\n File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 521, in call\n return self.call(inp, filter)\n File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 205, in call\n name=self.name)\n File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 957, in conv2d\n data_format=data_format, dilations=dilations, name=name)\n File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper\n op_def=op_def)\n File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func\n return func(*args, **kwargs)\n File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3272, in create_op\n op_def=op_def)\n File 
"/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1768, in init\n self._traceback = tf_stack.extract_stack()\n\nResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3772,32,29,39] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc\n\t [[{node main_level/agent/main/online/network_1/observation/Conv2d_0/Conv2D} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 4, 4], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](main_level/agent/main/online/network_0/observation/Conv2d_0/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, main_level/agent/main/online/network_1/observation/Conv2d_0/kernel/read)]]\nHint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.\n\n. Job failed!.", "exceptionType": "training_worker.exceptions", "eventType": "system_error", "errorCode": "503"}}
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1292, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3772,32,29,39] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node main_level/agent/main/online/network_1/observation/Conv2d_0/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 4, 4], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](main_level/agent/main/online/network_0/observation/Conv2d_0/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, main_level/agent/main/online/network_1/observation/Conv2d_0/kernel/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
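Not a confirmed fix, but two standard TensorFlow 1.x knobs that are often tried for this kind of OOM, sketched below. Whether you can reach the session setup depends on how rl_coach constructs its session (that part is an assumption here), so treat this as a configuration sketch rather than a drop-in patch:

```python
import tensorflow as tf

# 1. Let the BFC allocator grow GPU memory on demand instead of
#    pre-allocating nearly all of it at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

# 2. Follow the hint in the error message: ask TF to list live tensors
#    when an OOM occurs, to see what is actually holding the memory.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
# sess.run(fetches, options=run_options)  # pass at each training step
```

If the leading dimension 3772 really is an internal training batch, look for the corresponding batch-size parameter in the agent/preset configuration rather than the environment batch size.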
