Hi,
While checking the huge video model LanguageBind_Video_Huge_V1.5_FT, I noticed that its video projection outputs 1024-dimensional embeddings, whereas the image and language projections output 768. I couldn’t find a language backbone with a matching 1024-dimensional embedding size. Could you clarify which language model should be used to match this embedding space? Thank you. The projection heads print as follows:
(modality_proj): ModuleDict(
  (video): Linear(in_features=1280, out_features=1024, bias=False)
  (image): Linear(in_features=1024, out_features=768, bias=False)
  (language): Linear(in_features=768, out_features=768, bias=False)
)
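To make the mismatch explicit, here is a minimal sketch that groups the projection heads by output size. It only parses the printout above as text and does not load the model or call any LanguageBind API; the function name `group_by_embedding_dim` and the `MODALITY_PROJ` constant are hypothetical names used for illustration.

```python
import re
from collections import defaultdict

# Projection heads copied from the printout in this post (not loaded from the model).
MODALITY_PROJ = """
(video): Linear(in_features=1280, out_features=1024, bias=False)
(image): Linear(in_features=1024, out_features=768, bias=False)
(language): Linear(in_features=768, out_features=768, bias=False)
"""

# Captures the modality name and the in/out feature sizes of each Linear head.
PATTERN = re.compile(r"\((\w+)\): Linear\(in_features=(\d+), out_features=(\d+)")


def group_by_embedding_dim(printout):
    """Map each output dimension to the modalities that project into it."""
    groups = defaultdict(list)
    for name, _in_features, out_features in PATTERN.findall(printout):
        groups[int(out_features)].append(name)
    return dict(groups)


groups = group_by_embedding_dim(MODALITY_PROJ)
for dim, names in sorted(groups.items()):
    print(dim, names)
```

Only heads that land in the same group can be compared directly (for example with cosine similarity), which is why the video head has no matching language head in this checkpoint.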