<ul>
  <li>A new benchmark, Minimal Video Pairs (MVP), is introduced to assess the physical understanding abilities of video-language models.</li>
  <li>Existing benchmarks can inflate scores because models exploit shortcut solutions based on superficial cues; MVP is designed to counter this.</li>
  <li>MVP comprises 55K multiple-choice video QA examples on physical world understanding, drawn from diverse video data sources.</li>
  <li>The examples cover first-person egocentric and exocentric videos, robotic interaction data, and intuitive physics benchmarks.</li>
  <li>Each sample includes a minimal-change pair: two visually similar videos with opposing answers, which defeats shortcut solutions.</li>
  <li>To receive credit, a model must answer both examples in the minimal-change pair correctly (see the scoring sketch below).</li>
  <li>Human performance on MVP is 92.9%, while the best video-language model reaches 40.2%, with chance performance at 25%.</li>
</ul>
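As a rough illustration of this paired scoring rule, the sketch below computes accuracy where a pair counts as correct only when both of its videos are answered correctly. The field names (`pair_id`, `pred`, `answer`) are assumptions for illustration, not the benchmark's released evaluation code.

```python
# Minimal sketch of paired-accuracy scoring (assumed data layout, not MVP's
# official evaluation script). Each example carries the id of its
# minimal-change pair, the model's predicted answer, and the ground truth.
from collections import defaultdict

def paired_accuracy(examples):
    """examples: iterable of dicts with 'pair_id', 'pred', 'answer'."""
    pairs = defaultdict(list)
    for ex in examples:
        pairs[ex["pair_id"]].append(ex["pred"] == ex["answer"])
    # A pair is credited only if every member was answered correctly.
    return sum(all(correct) for correct in pairs.values()) / len(pairs)

# Example: one pair answered fully correctly, one pair with a single miss.
examples = [
    {"pair_id": 0, "pred": "slides left", "answer": "slides left"},
    {"pair_id": 0, "pred": "slides right", "answer": "slides right"},
    {"pair_id": 1, "pred": "falls over", "answer": "falls over"},
    {"pair_id": 1, "pred": "falls over", "answer": "stays upright"},
]
print(paired_accuracy(examples))  # 0.5
```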