About Video-MME

Video-MME is a full-spectrum benchmark for assessing multimodal large language models on video understanding tasks. It comprises 900 videos (around 254 hours) drawn from six high-level domains and 30 subcategories, including knowledge clips, film and television, sports competitions, and everyday life recordings, paired with 2,700 expert-annotated multiple-choice question–answer pairs. The questions probe recognition, temporal reasoning, event understanding, and higher-level comprehension in realistic video scenarios.
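
To make the data format concrete, here is a minimal sketch of what a single Video-MME item might look like. The field names and values are illustrative assumptions for exposition, not the official release schema:

```python
# Hypothetical shape of one Video-MME item; field names are illustrative
# assumptions, not the official release schema.
sample_item = {
    "video_id": "001",                   # identifier of the source video
    "domain": "Sports Competition",      # one of the six high-level domains
    "duration_category": "short",        # short / medium / long
    "question": "What sport is being played in the video?",
    "options": ["A. Tennis", "B. Basketball", "C. Soccer", "D. Swimming"],
    "answer": "C",                       # single correct option letter
}
```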

The benchmark explicitly targets long-context and multimodal capabilities. Videos span short (<2 minutes), medium (4–15 minutes), and long (30–60 minutes) durations, and models can be evaluated either from visual frames alone or jointly with subtitles and audio. Leaderboard results report accuracy percentages by duration and by subtitle setting. This enables direct comparison between image-centric models adapted to multi-frame input and purpose-built video MLLMs, and highlights how performance degrades on longer, more complex sequences.
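
As a sketch of how such per-setting accuracies can be tabulated, the snippet below buckets per-item results by duration and subtitle setting. The record shape (`duration`, `with_subtitles`, `predicted`, `answer`) is an assumption for illustration, not the official evaluation harness:

```python
from collections import defaultdict

def accuracy_by_setting(results):
    """Compute accuracy (%) per (duration, subtitle_setting) bucket.

    `results` is an iterable of dicts with keys:
      duration          -- "short", "medium", or "long"
      with_subtitles    -- bool, whether subtitles were given to the model
      predicted, answer -- option letters, e.g. "A".."D"
    (Hypothetical record shape, not the official evaluation harness.)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        key = (r["duration"], r["with_subtitles"])
        total[key] += 1
        correct[key] += r["predicted"] == r["answer"]
    return {key: 100.0 * correct[key] / total[key] for key in total}
```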

Methodology

We obtain scores from the official Video-MME leaderboard, which reports accuracy for each model on short, medium, and long videos, with and without subtitles. For each model in our hub, we use the overall accuracy without subtitles as the primary metric.
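
As a minimal sketch of this metric selection, assuming leaderboard rows stored as nested dictionaries (key names and all numbers below are hypothetical, not actual leaderboard results):

```python
def primary_metric(leaderboard_row):
    """Return overall accuracy (%) without subtitles for one model.

    Assumes a row like the one below, with per-duration accuracies in both
    subtitle settings; key names are illustrative, not the leaderboard's API.
    """
    return leaderboard_row["without_subtitles"]["overall"]

# Hypothetical row with made-up numbers, purely for illustration.
row = {
    "model": "example-mllm",
    "without_subtitles": {"short": 70.1, "medium": 60.3, "long": 50.2, "overall": 60.2},
    "with_subtitles": {"short": 72.4, "medium": 63.0, "long": 54.1, "overall": 63.2},
}

print(primary_metric(row))  # -> 60.2
```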