I think the model is also using my "Shared GPU memory" (as shown in Task Manager), because my GPU only has 8 GB but the model is using 13 GB and eventually runs out of memory.
[2025-08-03 16:41:15,067][INFO] - workflow: [('train', 1)], max: 2 epochs
[2025-08-03 16:41:15,068][INFO] - Checkpoints will be saved to /home/user/GaussianLSS/detection3d/outputs/GaussianLSS/2025-08-03/16-40-34 by HardDiskBackend.
[2025-08-03 16:41:36,127][INFO] - Epoch [1/2][1/7033] loss: 99.94, eta: 3 days, 10:14:26, time: 21.05s, data: 4522ms, mem: 7823M
[2025-08-03 16:41:58,244][INFO] - Epoch [1/2][2/7033] loss: 70.47, eta: 3 days, 14:24:38, time: 22.12s, data: 16ms, mem: 9643M
[2025-08-03 16:42:17,864][INFO] - Epoch [1/2][3/7033] loss: 72.16, eta: 3 days, 9:31:35, time: 19.62s, data: 4ms, mem: 9643M
[2025-08-03 16:42:35,260][INFO] - Epoch [1/2][4/7033] loss: 67.56, eta: 3 days, 4:59:51, time: 17.40s, data: 3ms, mem: 9643M
[2025-08-03 16:42:58,759][INFO] - Epoch [1/2][5/7033] loss: 61.12, eta: 3 days, 8:41:21, time: 23.50s, data: 4ms, mem: 11590M
.
.
.
[2025-08-03 17:01:05,364][INFO] - Epoch [1/2][62/7033] loss: 42.88, eta: 3 days, 2:33:46, time: 29.79s, data: 7ms, mem: 12600M
.
.
.
[2025-08-03 17:26:26,938][INFO] - Epoch [1/2][147/7033] loss: 37.38, eta: 2 days, 23:15:30, time: 29.65s, data: 6ms, mem: 12679M
.
.
.
[2025-08-03 19:22:34,539][INFO] - Epoch [1/2][519/7033] loss: 34.68, eta: 2 days, 22:09:51, time: 30.03s, data: 7ms, mem: 12908M
.
.
.
[2025-08-03 19:32:21,678][INFO] - Epoch [1/2][553/7033] loss: 32.42, eta: 2 days, 21:40:11, time: 27.17s, data: 6ms, mem: 13069M
.
.
.
[2025-08-03 21:18:06,895][INFO] - Epoch [1/2][888/7033] loss: 30.67, eta: 2 days, 20:28:06, time: 29.78s, data: 5ms, mem: 13677M
.
.
.
[2025-08-04 22:46:23,526][INFO] - Epoch [1/2][5809/7033] loss: 24.06, eta: 1 day, 18:45:47, time: 13.59s, data: 6ms, mem: 13677M
Traceback (most recent call last):
File "train.py", line 181, in <module>
main()
File "train.py", line 177, in main
runner.run([train_loader], [('train', 1)])
File "/home/user/miniconda3/envs/GaussianLSS_det/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/user/miniconda3/envs/GaussianLSS_det/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
self.call_hook('after_train_iter')
File "/home/user/miniconda3/envs/GaussianLSS_det/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook
getattr(hook, fn_name)(self)
File "/home/user/miniconda3/envs/GaussianLSS_det/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 279, in after_train_iter
self.loss_scaler.scale(runner.outputs['loss']).backward()
File "/home/user/miniconda3/envs/GaussianLSS_det/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/user/miniconda3/envs/GaussianLSS_det/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 530.00 MiB (GPU 0; 8.00 GiB total capacity; 13.23 GiB already allocated; 0 bytes free; 13.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The memory usage slowly increases over time until I hit the error above.
I'm running this in a conda environment under WSL2.
GPU: RTX 3060 Ti (8 GB VRAM), batch size = 4, full nuScenes dataset
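To check whether PyTorch is really allocating more than the physical 8 GB (which would mean the Windows/WSL2 driver is paging into "Shared GPU memory", i.e. system RAM), I could log torch's memory counters against the device's physical capacity during training. A minimal diagnostic sketch (the function name `log_gpu_mem` is just a placeholder, not part of the repo):

```python
import torch

def log_gpu_mem(tag: str = "") -> None:
    """Print PyTorch's view of GPU 0 memory vs. the card's physical capacity."""
    mib = 1024 ** 2
    total = torch.cuda.get_device_properties(0).total_memory / mib  # physical VRAM
    allocated = torch.cuda.memory_allocated(0) / mib                # tensors currently allocated
    reserved = torch.cuda.memory_reserved(0) / mib                  # cached by the allocator
    print(f"[{tag}] allocated={allocated:.0f} MiB, "
          f"reserved={reserved:.0f} MiB, physical={total:.0f} MiB")

# e.g. call once per logging interval inside the training loop:
# log_gpu_mem(f"iter {i}")
```

If allocated/reserved climb well past the physical total, that would match my suspicion that the allocations are spilling into shared memory before the hard OOM.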
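The error text itself suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF to avoid fragmentation. A minimal sketch of how I understand that option is set (the 128 MiB value is just an example; I haven't verified it resolves this):

```python
import os

# The allocator config must be set before CUDA is initialized,
# so set it before importing torch (or export it in the shell:
#   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # example value

import torch  # noqa: E402
```

Is the steady memory growth expected behaviour with this config, or does it point to something accumulating across iterations (e.g. a reduced batch size or gradient checkpointing being necessary on 8 GB)?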