MLLM eval pitfalls

Very cool work! I am wondering how robust MLLM evals are, given that even the best models make lots of mistakes when videos are physically implausible?

For reference:

- [TRAVL](https://arxiv.org/abs/2510.07550)
- [Impossible Videos](https://arxiv.org/abs/2503.14378)
- [VideoHallu](https://arxiv.org/abs/2505.01481)