Welcome! This repository explores and benchmarks frame selection strategies for Video Language Models (Video LLMs) on tasks such as video reasoning and video question answering (VQA).

| MLLM | Method | Frames | LLM param | LVB | V-MME | MLVU |
|---|---|---|---|---|---|---|
| Qwen2-VL | Uniform | 32 | 7B | TBD | TBD | TBD |
| | AKS | 32 | 7B | TBD | TBD | TBD |
| | FOCUS | 32 | 7B | TBD | TBD | TBD |
| | Q-Frame | 32 | 7B | TBD | TBD | TBD |
| | MDP3 | 32 | 7B | TBD | TBD | TBD |
| | FRAG | 32 | 7B | TBD | TBD | TBD |
| LLaVA-Video | Uniform | 32 | 7B | 57.59 | TBD | TBD |
| | AKS | 32 | 7B | 60.21 | TBD | TBD |
| | FOCUS | 32 | 7B | TBD | TBD | TBD |
| | Q-Frame | 32 | 7B | TBD | TBD | TBD |
| | MDP3 | 32 | 7B | TBD | TBD | TBD |
| | FRAG | 32 | 7B | TBD | TBD | TBD |
| LLaVA-OneVision | Uniform | 32 | 7B | 55.50 | TBD | TBD |
| | AKS | 32 | 7B | 59.09 | TBD | TBD |
| | FOCUS | 32 | 7B | TBD | TBD | TBD |
| | Q-Frame | 32 | 7B | TBD | TBD | TBD |
| | MDP3 | 32 | 7B | TBD | TBD | TBD |
| | FRAG | 32 | 7B | TBD | TBD | TBD |


This project builds on AKS (paper, code), FOCUS (paper, code), Q-Frame (paper, code), MDP3 (paper, code), FRAG (paper, code), LLaVA-NeXT (paper, code), and lmms_eval (paper, code).