I trained the baseline on a single A100-40G using `./tools/dist_train.sh ./projects/configs/bevformer/bevformer_base_occ.py 1`.
After 24 epochs, I tried to evaluate with `./tools/dist_test.sh ./projects/configs/bevformer/bevformer_base_occ.py work_dirs/bevformer_base_occ/epoch_24.pth 1`.
The checkpoint loaded fine and evaluation started running over the 6019 test samples, but I watched memory climb from 18G to 42G, and then the run suddenly died with `torch.distributed.elastic.multiprocessing.api:failed`.
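My guess (an assumption on my part; I have not traced this repo's test loop) is that every per-sample prediction is kept in host memory until all 6019 samples are processed, and the worker is then killed by the OS once RAM runs out, which torchrun only surfaces as the generic `api:failed` error. A minimal sketch of that accumulate-then-evaluate pattern, with a hypothetical grid shape and dtype (not this repo's actual code):

```python
# Sketch of the accumulate-then-evaluate pattern I suspect.
# The arithmetic shows 6019 retained predictions are on the same
# order of magnitude as the 18G -> 42G climb I saw.
import numpy as np

GRID_SHAPE = (200, 200, 16)              # assumed occupancy grid shape
PER_SAMPLE_MB = np.prod(GRID_SHAPE) * np.dtype(np.int64).itemsize / 1e6
print(f"per-sample prediction: ~{PER_SAMPLE_MB:.1f} MB")       # ~5.1 MB
print(f"6019 retained results: ~{6019 * PER_SAMPLE_MB / 1e3:.1f} GB")

def run_test(model, data_loader):
    results = []                         # grows for the whole test epoch
    for batch in data_loader:
        results.append(model(batch))     # nothing is freed until the end
    return results                       # evaluation starts at peak memory
```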
How can I fix this?
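To confirm it is CPU RAM rather than the 40G of GPU memory that runs out, I was planning to attach a small resident-memory logger before evaluation starts; a sketch using `psutil` (assuming it is installed):

```python
# Sketch: print this process's resident host memory every few seconds,
# to check whether the growth is in CPU RAM (requires psutil).
import threading
import time

import psutil

def log_rss(interval_s: float = 5.0) -> None:
    proc = psutil.Process()              # current process by default
    while True:
        rss_gb = proc.memory_info().rss / 1e9
        print(f"[mem-watch] RSS: {rss_gb:.1f} GB")
        time.sleep(interval_s)

# Start as a daemon thread before the evaluation loop begins.
threading.Thread(target=log_rss, daemon=True).start()
```

If the accumulation guess is right, would streaming per-sample results to disk (or evaluating incrementally) be the recommended workaround here?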