Frame Selection Methods for Video LLMs

Welcome! This repository explores and benchmarks frame selection strategies for Video Language Models (Video LLMs), focusing on tasks such as video reasoning and video question answering (VQA).

Abstract

Results

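| MLLM | Method | Frames | LLM param | LVB | V-MME | MLVU |
|---|---|---|---|---|---|---|
| Qwen2-VL | Uniform | 32 | 7B | TBD | TBD | TBD |
| | AKS | 32 | 7B | TBD | TBD | TBD |
| | FOCUS | 32 | 7B | TBD | TBD | TBD |
| | Q-Frame | 32 | 7B | TBD | TBD | TBD |
| | MDP3 | 32 | 7B | TBD | TBD | TBD |
| | FRAG | 32 | 7B | TBD | TBD | TBD |
| LLaVA-Video | Uniform | 32 | 7B | 57.59 | TBD | TBD |
| | AKS | 32 | 7B | 60.21 | TBD | TBD |
| | FOCUS | 32 | 7B | TBD | TBD | TBD |
| | Q-Frame | 32 | 7B | TBD | TBD | TBD |
| | MDP3 | 32 | 7B | TBD | TBD | TBD |
| | FRAG | 32 | 7B | TBD | TBD | TBD |
| LLaVA-OneVision | Uniform | 32 | 7B | 55.50 | TBD | TBD |
| | AKS | 32 | 7B | 59.09 | TBD | TBD |
| | FOCUS | 32 | 7B | TBD | TBD | TBD |
| | Q-Frame | 32 | 7B | TBD | TBD | TBD |
| | MDP3 | 32 | 7B | TBD | TBD | TBD |
| | FRAG | 32 | 7B | TBD | TBD | TBD |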
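
For readers unfamiliar with the Uniform baseline listed in the results, the sketch below shows one common way to pick evenly spaced frame indices. This README does not specify how the baseline is implemented here, so the function name `uniform_frame_indices`, the segment-midpoint convention, and the fallback for short videos are all assumptions made for illustration.

```python
def uniform_frame_indices(total_frames: int, num_frames: int = 32) -> list[int]:
    """Pick evenly spaced frame indices from a video (hypothetical helper).

    Assumption: the video is split into `num_frames` equal-width segments
    and the midpoint frame of each segment is taken. Other conventions
    (e.g. including the first and last frame) are equally plausible.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")

    # Short video: there are not enough frames to subsample, so keep them all.
    if num_frames >= total_frames:
        return list(range(total_frames))

    # Width of each segment, in frames (may be fractional).
    step = total_frames / num_frames

    # Midpoint of segment i, truncated to an integer frame index.
    return [int(step * i + step / 2) for i in range(num_frames)]


if __name__ == "__main__":
    # Example: a 3000-frame video sampled down to the 32-frame budget.
    indices = uniform_frame_indices(3000, 32)
    print(len(indices), indices[:4])
```

Query-aware methods such as those in the table would replace this fixed spacing with a selection that depends on the question, while keeping the same frame budget.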

Acknowledgment

This project is based on AKS (paper, code), FOCUS (paper, code), Q-Frame (paper, code), MDP3 (paper, code), FRAG (paper, code), LLaVA-NeXT (paper, code), and lmms_eval (paper, code).