Question about partial modality inference.

I really appreciate your great work. Could you also share the code for the partial modality inference? On the POPE dataset, I wonder why the visual-only input achieves such high accuracy. Without the text information (the question), I would expect the model to guess randomly.