<ul>
  <li>A new benchmark, Minimal Video Pairs (MVP), is introduced to assess the physical understanding abilities of video-language models.</li>
  <li>Existing benchmarks can inflate scores because models exploit shortcut solutions based on superficial cues; MVP is designed to counter this.</li>
  <li>MVP comprises 55K multiple-choice video QA examples on physical world understanding, drawn from diverse video data sources.</li>
  <li>The examples cover first-person egocentric and exocentric videos, robotic interaction data, and intuitive physics benchmarks.</li>
  <li>Each sample includes a minimal-change pair: two visually similar videos with opposing answers, which defeats shortcut solutions.</li>
  <li>To receive credit, a model must answer both examples in the minimal-change pair correctly (see the scoring sketch below).</li>
  <li>Human performance on MVP is 92.9%, while the best video-language model reaches 40.2%, with chance performance at 25%.</li>
</ul>
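As a rough illustration of this paired scoring rule, the sketch below computes accuracy where a pair counts as correct only when both of its videos are answered correctly. The field names (`pair_id`, `pred`, `answer`) are assumptions for illustration, not the benchmark's released evaluation code.

```python
# Minimal sketch of paired-accuracy scoring (assumed data layout, not MVP's
# official evaluation script). Each example carries the id of its
# minimal-change pair, the model's predicted answer, and the ground truth.
from collections import defaultdict

def paired_accuracy(examples):
    """examples: iterable of dicts with 'pair_id', 'pred', 'answer'."""
    pairs = defaultdict(list)
    for ex in examples:
        pairs[ex["pair_id"]].append(ex["pred"] == ex["answer"])
    # A pair is credited only if every member was answered correctly.
    return sum(all(correct) for correct in pairs.values()) / len(pairs)

# Example: one pair answered fully correctly, one pair with a single miss.
examples = [
    {"pair_id": 0, "pred": "slides left", "answer": "slides left"},
    {"pair_id": 0, "pred": "slides right", "answer": "slides right"},
    {"pair_id": 1, "pred": "falls over", "answer": "falls over"},
    {"pair_id": 1, "pred": "falls over", "answer": "stays upright"},
]
print(paired_accuracy(examples))  # 0.5
```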