I really appreciate your great work. Could you also share the code for the partial-modality inference? On the POPE dataset, I wonder why the visual-only input achieves such high accuracy. Without the text information (the question), I would expect the model to guess randomly.