This repository was archived by the owner on Jul 22, 2024. It is now read-only.

Question about using multiple gpus #11

@YunahJang


Hi! I'm having trouble using multiple GPUs with the run_finetune_rag_dialdoc.sh script.

I set the --gpus parameter to 4, but I keep getting the error below.

ValueError: ProcessGroupGloo::scatter: invalid tensor type at index 0 (expected TensorOptions(dtype=double, device=cpu, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)), got TensorOptions(dtype=float, device=cpu, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))

So I modified line 159 of dialdoc/models/rag/distributed_pytorch_retriever.py so that it no longer passes the target_type argument:
retrieved_doc_embeds = self._scattered(scatter_vectors, [n_queries, n_docs, combined_hidden_states.shape[1]])

After this modification, I get the error below, and I can't figure out what causes it.

  File "/home/yunah/multidoc2dial_ours/dialdoc/models/rag/distributed_pytorch_retriever.py", line 157, in retrieve
    doc_ids = self._scattered(scatter_ids, [n_queries, n_docs], target_type=torch.int64)
  File "/home/yunah/multidoc2dial_ours/dialdoc/models/rag/distributed_pytorch_retriever.py", line 82, in _scattered
    dist.scatter(target_tensor, src=0, scatter_list=scatter_list, group=self.process_group)
  File "/home/yunah/.conda/envs/multidoc2dial/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2191, in scatter
    work = group.scatter(output_tensors, input_tensors, opts)
ValueError: ProcessGroupGloo::scatter: Incorrect input list size 1. Input list size should be 2, same as size of the process group.
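
For reference, this second message says that the list handed to scatter on the source rank had 1 entry while the process group has 2 members, so the list needs one entry per rank. Below is a minimal sketch of that invariant in plain Python, with lists standing in for tensors. The name chunk_for_scatter is hypothetical and is not a function from this repository, and the sketch assumes the gathered rows divide evenly across ranks.

```python
from typing import List, Sequence


def chunk_for_scatter(rows: Sequence, world_size: int) -> List[Sequence]:
    """Split rows into exactly world_size equal chunks, one per rank.

    Hypothetical helper for illustration only; not part of the repository.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if len(rows) % world_size != 0:
        raise ValueError(
            f"{len(rows)} rows cannot be split evenly across {world_size} ranks"
        )
    per_rank = len(rows) // world_size
    # One slice per rank, so the result always has world_size entries.
    return [rows[i * per_rank:(i + 1) * per_rank] for i in range(world_size)]


# Example: results for 4 queries collected on rank 0, process group of size 2.
chunks = chunk_for_scatter(["q0", "q1", "q2", "q3"], world_size=2)
print(len(chunks))
print(chunks)
```

If the list built on rank 0 has fewer entries than the group size, as in the traceback above, it may be worth checking how many rows were actually gathered before they were chunked.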

Did I miss any other variables or settings that need to change before using multiple GPUs?
I would like to know whether there is a solution for this error.
Thanks a lot!

Best,
Yunah
