
Memory leak when training #2

Description

@job0403

GPU memory slowly increases over time until I get an out-of-memory error.
I'm running this in a conda environment on WSL2.
GPU: 3060 Ti (8 GB VRAM), batch size = 4, full nuScenes dataset.

  • python = 3.8.20
  • pytorch = 2.0.0
  • torchvision = 0.15.0
  • pytorch-cuda = 11.8

I think the model is also spilling into "Shared GPU memory" (as reported in Task Manager): my GPU has only 8 GB, but the log shows over 13 GB in use before the run eventually goes out of memory.
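For what it's worth, the growth can be confirmed from inside the training loop with standard allocator queries. A minimal sketch (nothing repo-specific, just stock PyTorch calls):

```python
import torch

def log_cuda_mem(step):
    # "allocated" counts live tensors; "reserved" is what the caching
    # allocator holds from the driver. Steadily rising allocated memory
    # usually means tensors are being retained across iterations.
    alloc_mib = torch.cuda.memory_allocated() / 2**20
    reserved_mib = torch.cuda.memory_reserved() / 2**20
    print(f"[step {step}] allocated={alloc_mib:.0f} MiB "
          f"reserved={reserved_mib:.0f} MiB")
```

A common cause of exactly this pattern is accumulating the loss with the autograd graph still attached, e.g. `running_loss += loss` instead of `running_loss += loss.item()`, though I haven't confirmed that's what happens here.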

[2025-08-03 16:41:15,067][INFO] - workflow: [('train', 1)], max: 2 epochs
[2025-08-03 16:41:15,068][INFO] - Checkpoints will be saved to /home/user/GaussianLSS/detection3d/outputs/GaussianLSS/2025-08-03/16-40-34 by HardDiskBackend.
[2025-08-03 16:41:36,127][INFO] - Epoch [1/2][1/7033] loss: 99.94, eta: 3 days, 10:14:26, time: 21.05s, data: 4522ms, mem: 7823M
[2025-08-03 16:41:58,244][INFO] - Epoch [1/2][2/7033] loss: 70.47, eta: 3 days, 14:24:38, time: 22.12s, data: 16ms, mem: 9643M
[2025-08-03 16:42:17,864][INFO] - Epoch [1/2][3/7033] loss: 72.16, eta: 3 days, 9:31:35, time: 19.62s, data: 4ms, mem: 9643M
[2025-08-03 16:42:35,260][INFO] - Epoch [1/2][4/7033] loss: 67.56, eta: 3 days, 4:59:51, time: 17.40s, data: 3ms, mem: 9643M
[2025-08-03 16:42:58,759][INFO] - Epoch [1/2][5/7033] loss: 61.12, eta: 3 days, 8:41:21, time: 23.50s, data: 4ms, mem: 11590M
.
.
.
[2025-08-03 17:01:05,364][INFO] - Epoch [1/2][62/7033] loss: 42.88, eta: 3 days, 2:33:46, time: 29.79s, data: 7ms, mem: 12600M
.
.
.
[2025-08-03 17:26:26,938][INFO] - Epoch [1/2][147/7033] loss: 37.38, eta: 2 days, 23:15:30, time: 29.65s, data: 6ms, mem: 12679M
.
.
.
[2025-08-03 19:22:34,539][INFO] - Epoch [1/2][519/7033] loss: 34.68, eta: 2 days, 22:09:51, time: 30.03s, data: 7ms, mem: 12908M
.
.
.
[2025-08-03 19:32:21,678][INFO] - Epoch [1/2][553/7033] loss: 32.42, eta: 2 days, 21:40:11, time: 27.17s, data: 6ms, mem: 13069M
.
.
.
[2025-08-03 21:18:06,895][INFO] - Epoch [1/2][888/7033] loss: 30.67, eta: 2 days, 20:28:06, time: 29.78s, data: 5ms, mem: 13677M
.
.
.
[2025-08-04 22:46:23,526][INFO] - Epoch [1/2][5809/7033] loss: 24.06, eta: 1 day, 18:45:47, time: 13.59s, data: 6ms, mem: 13677M
Traceback (most recent call last):
  File "train.py", line 181, in <module>
    main()
  File "train.py", line 177, in main
    runner.run([train_loader], [('train', 1)])
  File "/home/user/miniconda3/envs/GaussianLSS_det/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/user/miniconda3/envs/GaussianLSS_det/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_iter')
  File "/home/user/miniconda3/envs/GaussianLSS_det/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/user/miniconda3/envs/GaussianLSS_det/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 279, in after_train_iter
    self.loss_scaler.scale(runner.outputs['loss']).backward()
  File "/home/user/miniconda3/envs/GaussianLSS_det/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/user/miniconda3/envs/GaussianLSS_det/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 530.00 MiB (GPU 0; 8.00 GiB total capacity; 13.23 GiB already allocated; 0 bytes free; 13.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
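Following the hint at the end of the traceback, one thing to try is capping the allocator's split size. The env var has to be set before the first CUDA allocation, and the 128 below is just a value to experiment with, not a known-good setting:

```python
import os

# Must run before anything touches the GPU, e.g. at the very top of
# train.py (equivalently: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# on the shell command line).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

That said, this only mitigates fragmentation; if allocated memory genuinely keeps growing each iteration, it will delay the OOM rather than fix it.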
