Hi,
Thanks for your great work.
I hit an OOM error at the evaluation stage after the first epoch of pretraining. The log is:
Test: Total time: 0:01:42 (0.2476 s / it)
Averaged stats: loss: 113.5333 (96.1866) loss_bbox: 0.5625 (0.5173) loss_bbox_0: 0.6505 (0.5977) loss_bbox_1: 0.5762 (0.5255) loss_bbox_2: 0.5703 (0.5273) loss_bbox_3: 0.5712 (0.5109) loss_bbox_4: 0.5651 (0.5138) loss_ce: 11.5826 (9.1250) loss_ce_0: 11.4480 (9.4460) loss_ce_1: 11.7980 (9.5058) loss_ce_2: 11.8104 (9.4749) loss_ce_3: 11.6550 (9.2512) loss_ce_4: 11.5774 (9.0949) loss_contrastive_align: 6.1482 (5.6187) loss_contrastive_align_0: 6.1950 (5.8909) loss_contrastive_align_1: 6.1946 (5.7864) loss_contrastive_align_2: 6.1133 (5.7674) loss_contrastive_align_3: 6.1261 (5.6713) loss_contrastive_align_4: 6.0199 (5.5644) loss_giou: 0.4890 (0.4578) loss_giou_0: 0.5642 (0.5090) loss_giou_1: 0.5024 (0.4579) loss_giou_2: 0.4965 (0.4619) loss_giou_3: 0.5086 (0.4525) loss_giou_4: 0.4900 (0.4579) cardinality_error_unscaled: 8.3906 (4.8554) cardinality_error_0_unscaled: 6.5000 (4.3573) cardinality_error_1_unscaled: 9.4062 (5.9682) cardinality_error_2_unscaled: 10.3125 (6.3725) cardinality_error_3_unscaled: 9.2969 (5.2416) cardinality_error_4_unscaled: 8.8281 (5.0047) loss_bbox_unscaled: 0.1125 (0.1035) loss_bbox_0_unscaled: 0.1301 (0.1195) loss_bbox_1_unscaled: 0.1152 (0.1051) loss_bbox_2_unscaled: 0.1141 (0.1055) loss_bbox_3_unscaled: 0.1142 (0.1022) loss_bbox_4_unscaled: 0.1130 (0.1028) loss_ce_unscaled: 11.5826 (9.1250) loss_ce_0_unscaled: 11.4480 (9.4460) loss_ce_1_unscaled: 11.7980 (9.5058) loss_ce_2_unscaled: 11.8104 (9.4749) loss_ce_3_unscaled: 11.6550 (9.2512) loss_ce_4_unscaled: 11.5774 (9.0949) loss_contrastive_align_unscaled: 6.1482 (5.6187) loss_contrastive_align_0_unscaled: 6.1950 (5.8909) loss_contrastive_align_1_unscaled: 6.1946 (5.7864) loss_contrastive_align_2_unscaled: 6.1133 (5.7674) loss_contrastive_align_3_unscaled: 6.1261 (5.6713) loss_contrastive_align_4_unscaled: 6.0199 (5.5644) loss_giou_unscaled: 0.2445 (0.2289) loss_giou_0_unscaled: 0.2821 (0.2545) loss_giou_1_unscaled: 0.2512 (0.2289) loss_giou_2_unscaled: 0.2483 (0.2309) loss_giou_3_unscaled: 0.2543 (0.2263) loss_giou_4_unscaled: 0.2450 (0.2289)
gathering on cpu
gathering on cpu
gathering on cpu
Traceback (most recent call last):
  File "main.py", line 655, in <module>
    main(args)
  File "main.py", line 598, in main
    curr_test_stats = evaluate(
  File "/usr/local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/worksapce/mdetr/trainer/engine.py", line 230, in evaluate
    evaluator.synchronize_between_processes()
  File "/worksapce/mdetr/trainer/datasets/refexp.py", line 38, in synchronize_between_processes
    all_predictions = dist.all_gather(self.predictions)
  File "/worksapce/mdetr/trainer/util/dist.py", line 86, in all_gather
    obj = torch.load(buffer)
  File "/usr/local/lib/python3.8/site-packages/torch/serialization.py", line 594, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.8/site-packages/torch/serialization.py", line 853, in _load
    result = unpickler.load()
  File "/usr/local/lib/python3.8/site-packages/torch/serialization.py", line 845, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "/usr/local/lib/python3.8/site-packages/torch/serialization.py", line 834, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File "/usr/local/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/usr/local/lib/python3.8/site-packages/torch/serialization.py", line 157, in _cuda_deserialize
    return obj.cuda(device)
  File "/usr/local/lib/python3.8/site-packages/torch/_utils.py", line 79, in _cuda
    return new_type(self.size()).copy_(self, non_blocking)
  File "/usr/local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 462, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory
I use 32 GB V100 GPUs with 2 samples per GPU, following the default settings.
I also set CUBLAS_WORKSPACE_CONFIG=:4096:8 MDETR_CPU_REDUCE=1.
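From the traceback, the OOM seems to happen inside util/dist.py's all_gather, when torch.load restores the gathered predictions back onto the GPU even though the gather itself runs on CPU ("gathering on cpu"). Below is a minimal sketch of the deserialization step I mean; this is my paraphrase of the DETR-style all_gather pattern, not MDETR's actual code, and the helper name deserialize_gathered is made up. The point is that map_location="cpu" would keep the loaded tensors in host memory:

```python
import io
import torch


def deserialize_gathered(byte_tensors, sizes):
    # byte_tensors: one uint8 tensor per rank, as returned by dist.all_gather;
    # sizes: the true payload length of each rank's buffer (before padding).
    results = []
    for size, recv in zip(sizes, byte_tensors):
        buffer = io.BytesIO(recv[:size].cpu().numpy().tobytes())
        # Without map_location, torch.load restores tensors that were pickled
        # on CUDA back onto the GPU, which matches the _cuda_deserialize OOM
        # in my traceback. map_location="cpu" keeps them on the CPU instead.
        results.append(torch.load(buffer, map_location="cpu"))
    return results
```

I am not sure whether this matches the current dist.py, so please correct me if the OOM is actually coming from somewhere else.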