Very cool work! I am wondering how robust MLLM evals are, given that even the best models make lots of mistakes when videos are physically implausible. For reference:

- [TRAVL](https://arxiv.org/abs/2510.07550)
- [Impossible Videos](https://arxiv.org/abs/2503.14378)
- [VideoHallu](https://arxiv.org/abs/2505.01481)